Algorithms for interconnection networks are mostly analyzed using the interconnection-network cost model, in which it is assumed that in each step each PU may perform an unlimited amount of communication and then route one unit packet to one or more of its immediate neighbors.
Most algorithms for this cost model deal with the various communication problems, rather than to solve a general problem, say maxflow. The reason for this is that the communication problems are already hard enough, and that harder problems can often be solved quite well by taking a parallel algorithm designed for a more powerful parallel computer model, and then to simulate the operations of this computer on the interconnection network. Such a simulation is composed of two components: an allocation of the data and processes to the PUs and performing the resulting routing operations.
Next to the various communication operations also some of the most basic problems have been studied for interconnection networks. These are sorting, selection, list ranking, matrix multiplication and the like. In a following chapter we will even see how to solve the all-pairs shortest-paths problem optimally on a two-dimensional mesh.
In hardware a complete interconnection network with one-port capability can be realized with a crossbar switch. For interconnecting P PUs, a crossbar switch consist of a grid of P x P switches. These switches have four connections, W, N, E, S. The switches have two settings, either connecting W with E and N with S or connecting W with N and E with S, which can be chosen dynamically. Setting the switches appropriately, point-to-point connections can be created. Notice that using a crossbar switch each PU can send a packet to at most one PU and receive a packet from at most one PU, but that the PU to which it is sending may be different from the PU from which it is receiving. Switch-based complete network topologies are very common in practice. But, because of the increasing delays and cost such systems are typically of moderate size.
A similar network is the circular array which is also called ring. The only difference with the linear array is an additional connection between PU_0 and PU_{n - 1}. This connection is often designated as wrap-around connection. The degree is 2, the same as before. However, the other measures are better: the diameter is round_down{n / 2} and the bisection width is 2. This implies that most problems can be solved on a circular array twice as fast as on a linear array.
An n_1 x n_2 mesh consists of n_1 * n_2 PUs arranged in a grid. Each PU is connected to its up to four neighbors. As a graph, a mesh is the direct product of two linear arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter n_1 + n_2 - 2 and the bisection width is min{n_1, n_2}.
An n_1 x n_2 torus is a mesh with wrap-around connections: there are additional connections between PU_{i, 0} and PU_{i, n_2 - 1} for all i, 0 <= i < n_1, and between PU_{0, j} and PU_{n_1 - 1, j}. As a graph, a torus is the direct product of two circular arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter round_down(n_1 / 2) + round_down(n_2 / 2) and the bisection width is 2 * min{n_1 , n_2}. On tori most problems can be solved twice as fast as on meshes.
There are also higher-dimensional meshes and tori. A d-dimensional n_1 x ... x n_d mesh consists of n_1 * ... * n_d PUs arranged in a d-dimensional grid. Each PU is connected to its at most 2 * d neighbors. The degree is 2 * d, the diameter is n_1 + ... + n_d - d and the bisection width is n_1 * ... * n_d / min{n_1, ..., n_d}. The definition and properties of d-dimensional tori are analogous.
The major weakness of a tree network is its very small bisection width. This means that any problem requiring a single rearrangement of the data has at least linear running time. For such problems a tree is not better than a linear array. Therefore several improved trees have been designed. The simplest is the fat tree. In a fat tree the connections between level i and level i + 1, where the leafs are considered to lie at level 0, have capacity 2^i. This means that such a connection can transfer 2^i packets instead of 1.
Alternatively tree structures have been integrated in other networks. An important example of such a network is the mesh of trees. An n x n mesh of trees is constructed from an n x n grid of not-connected nodes. The PUs in each row and column of this grid are connected by a tree. There are no direct connections between the row trees or between the column trees. The row and column trees are connected at their shared leafs. An n x n mesh of trees has 2 * n * (2 * n - 1) - n^2 = 3 * n^2 - 2 * n nodes.
A k-dimensional cube-connected cycle, abbreviated CCC, network is obtained by taking a k-dimensional hypercube and replacing each node by a cycle with k nodes. Node d, 0 <= d < k, in each cycle takes over the connection of the hypercube along the d-direction. More precisely, the b * 2^k nodes of the CCC are indexed with pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k), (i, (d - 1) mod k) and (i ^ 2^d, d). Here i ^ 2^d denotes i + 2^d if bit d of i is 0 and i - 2^d if this bit is 1. The degree is 3, the diameter is round_down(5 * k / 2) - 2 for k >= 3. A CCC is the simplest network which combines a logarithmic diameter with constant degree. The bisection width is the same as in the corresponding hypercube, 2^{k - 1}.
A k-dimensional butterfly network is similar. The k * 2^k nodes are indexed as pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k) and (i, (d - 1) mod k). Furthermore there is an undirected edge from (i, d) to (i ^ 2^d, (d + 1) mod k). So, the degree is 4. The diameter is round_down(3 * k / 2). Because many router chips provide four connections, and because of the good routing properties of butterflies these networks have been used quite a lot in practice.
Cayley graphs are much nicer than just connected and undirected: they are even node symmetric. A graph is said to be node symmetric, if for any pair (u, v) of nodes there is an automorphism phi = phi(u, v) of the nodes so that if (x, y) is an edge that then also (phi(x), phi(y)) is an edge. For Cayley graphs, these automorphisms are easy to construct: determine elements e_1, ..., e_k so that e_1 * ... * e_k * u = v, then phi maps any x in V to x * e_1 * ... * e_k. If (x, y) in E, this means that y = x * e for some e in S. Thus, (phi(x), phi(y)) = (phi(x), phi(x * e)) = (phi(x), e_1 * ... * e_k * (x * e)) = (phi(x), (e_1 * ... * e_k * x) * e) = (phi(x), phi(x) * e) in E.
Several of the graphs we have encountered before are Cayley graphs: circular arrays, tori, hypercubes and CCCs, though these can be presented in a more natural way. We describe how hypercubes can be obtained as Cayley graphs. For H_k, the k-dimensional hypercube, is obtained from a subgroup of the permutation group of 2 * k elements. There are k generators each having two consecutive elements exchanged. To make the description more concrete we only consider the special case k = 4. The set S of generators consists of (10234567), (01324567), (01235467) and (01234576). The subgroup consists of all elements obtained by starting with (01234567) and letting these elements work. If in the conventional indexing of the hypercube we find a 1 at position i, 0 <= i < k, the elements 2 * i and 2 * i + 1 in the indexing with a permutation are exchanged. For example, (0110) corresponds to (01325467).
For interconnection networks, there is a trade-off between quality on the one hand and complexity and scalability on the other.
An embedding of an interconnection network IN_1 with P_1 PUs into a network IN_2 with P_2 PUs is a mapping phi() of the PUs of IN_1 to the PUs of IN_2. Generally it is part of the definition that phi() should be injective. Not only the PUs should be mapped, but for any two PUs PU_i and PU_j in IN_1 connected by a link there should also be specified a path in IN_2 running between phi(PU_i) and phi(PU_j). There are three cost measures that are used when specifying the quality of an embedding:
As a first example we consider the embedding of a torus into a mesh. This embedding is obtained by first blowing up the n x n torus to a 2 * n x n torus and folding it double along the middle row, so that one PU comes in each row, and then repeating this for a folding along the middle column. The following picture illustrates the process. Notice that each pair of PUs that previously was adjacent now is at distance two at most: the dilation is two. Also, are there at most two paths mapped to any link of the mesh: the congestion is two. For the special case of dilation and congestion both equal to two, any step of the embedded network can be simulated in two steps. So, any problem which on an n x n torus can be solved in T communication steps can be solved on a mesh in 2 * T steps.
Another important example is the embedding of a binomial tree with n = 2^k nodes in a hypercube. The construction is simple: a binomial tree with 2^k nodes, T_k consists of two trees T_{k - 1} whose roots are connected. A k-dimensional hypercube consists of two (k - 1)-dimensional hypercubes with connections between all corresponding nodes, in particular between the nodes (0, 0, ..., 0) and (1, 0, ..., 0). So, the embedding can be performed recursively. The dilation, congestion and expansion are all 1: the graph of the binomial tree is a subgraph of the hypercube, obtained by removing edges.
Two types of embeddings are particularly important: the embeddings of the linear and circular array. The embedding of a linear array, which is nothing but the construction of a Hamiltonian path, is important because it provides a way of indexing the PUs so that any two PUs with consecutive indices are adjacent in the network. Because many algorithms contain steps in which data must be exchanged to PUs with consecutive indices, this is a very desirable property. A systematic way of indexing the PUs of an interconnection network is called an indexing scheme. The embedding of a circular array, which is nothing but the construction of a Hamiltonian cycle, is important because algorithms may even require data to be shifted in a cyclical way.
In any grid it is easy to construct a Hamiltonian path. A two-dimensional mesh is Hamiltonian, which means that it has a Hamiltonian cycle, iff at least one of the sides has even length. Next to the indexing schemes obtained by embedding linear and circular arrays there are several other schemes for indexing the PUs of a grid. The most important is the so-called row-major indexing scheme. In this scheme, PU_{i, j} in an n x n mesh, which is located in row i and column j, is given index i * n + j. Mostly the rows and columns are numbered starting from the upper-left corner as in a matrix. In the column-major indexing scheme, this PU has index i + j * n. The indexing obtained by embedding a linear array, starting at the upper-left corner, then always going along the rows first, is called the snake-like row-major indexing. There are also so-called shuffled indexing schemes. The most important is the shuffled row-major indexing scheme. In this scheme, PU_{i, j}, for numbers i and j with binary expansions given by (i_{k - 1}, ..., i_i, i_0) and (j_{k - 1}, ..., j_1, j_0) respectively, has an index with binary expansion (i_{k - 1}, j_{k - 1}, ..., i_1, j_1, i_0, j_0).
On a hypercube a Hamiltonian path can be constructed by exploiting the recursive structure of the network. H_k denotes the k-dimensional hypercube. The traversal of all PUs of H_k is obtained by gluing together two traversals of H_{k - 1}. If the gluing is done correctly, this even gives a Hamiltonian cycle. The schedule is known as Gray code. It can be described by giving the order in which the PUs are visited. It is easy to verify that in the following sequences all consecutive PUs, including the first and last, are adjacent:
for H_1 we have {0, 1},
for H_2 we have {00, 01, 11, 10},
for H_3 we have {000, 001, 011, 010, 110, 111, 101, 100}.
How to generalize the above sequences? Look at the sequence of H_3. It starts with the sequence of H_2 in which all indices have an additional leading 0. Then this same sequence appears once more in reversed order in which all indices have an additional leading 1. More generally, denoting the Hamiltonian sequence for H_{k - 1} by S_{k - 1}, the Hamiltonian sequence S_k for H_k is the concatenation of 0-S_k and 1-reversed(S_k). Here, for a sequence S and b in {0, 1}, b-S denotes the sequence obtained by preceding all elements in S by a b.
Lemma: The sequences S_k defined by S_1 = {0, 1} and S_k = 0-S_{k - 1} + 1-reversed(S_{k - 1}), where + denotes concatenation, give Hamiltonian enumerations of the PUs of the k-dimensional hypercubes H_k.
Proof: The proof goes by induction. For H_0 the construction is clearly correct. So, assume it is correct for H_{k - 1}. If S_{k - 1} is a Hamiltonian cycle, then so is reversed(S_k). The sequences 0-S_{k - 1} and 1-reversed(S_{k - 1}) are paths through H_k because of the recursive way in which H_k is constructed out of two connected copies of H_{k - 1}. It remains to check that these two paths fit together. The first element of S_{k - 1} and the last element of reversed(S_{k - 1}) are the same. So, the first element of 0-S_{k - 1} and the last element of 1-reversed(S_{k - 1}) differ in exactly one position, the leading one, and are therefore the corresponding PUs are connected n the hypercube. End.
Embeddings provide a major way of deriving algorithms for one topology from algorithms designed for another topology. Embeddings of linear and circular arrays lead to indexings for which consecutively indexed PUs are adjacent.
Each of these operations has its own importance. All-to-all routing is the pattern encountered when there is no specific pattern in the data access: the PUs compute a while on the data they have and by doing so accumulate questions to be sent to other PUs. Unless the algorithm was designed in a specific way, these questions will be more or less randomly distributed over the PUs, giving rise to a more or less balanced all-to-all routing pattern. A permutation-routing pattern arises in algorithms that where designed in such a way that the questions are not scattered over the data set but concentrated. Such a structure is desirable because it reduces the number of start-ups. Broadcasting is a very common operation performed to distribute information from a source PU to a (subset) of PUs that must work on the same data. Typically this happens at the beginning of a computation. The symmetric operation is called gathering. Gathering means that information from all PUs is combined to a single information in some specified PU. For example, this may be done in order to sum up the computed results. Gossiping is used to make local samples of the information available to all PUs. It is also used to synchronize a database and in numerical communications. Because gossiping is a very expensive operation, it will normally be performed only for a small subset of the total data set. Another much studied problem is k-k routing, in which each PU is the source and destination of at most k packets. Both permutation routing and all-to-all routing are special cases of k-k routing. In practice general k-k routing patterns are rarely encountered.
How good is the greedy algorithm? It is easy to see that during phase 1 the packets run without any delay: the packets moving rightwards do not change there relative distances as long as they are moving and therefore they will never encounter each other. The same is true for the leftwards moving packets. So, phase 1 takes as long as the longest distance a packet has to go. The maximum distance that may have to be covered in a row is n - 1. As a result of the routing in phase 1, the packets may have piled up in some PUs. For example if the permutation is the transposition, in which PU_{i, j}, 0 <= i, j < n, sends a packet to PU_{j, i}, then n packets reside in all diagonal PUs at the end of phase 1. So, the queue size, that is, the maximum number of packets that may have to be stored in a single PU at the same time, may be as high as n. Thus, during the second routing phase there may be conflicts. However, due to the applied farthest-first strategy, can a packet which has to move upwards to a PU at distance d from the boundary of the mesh be delayed at most d times. Because the distance this packet has tot travel is bounded by n - d - 1, it will arrive no later than in step n - 1. So, even phase 2 will be completed in n - 1 steps, and therefore the whole algorithm is completed in 2 * n - 2 steps which is optimal. Much more sophisticated algorithms have shown that it is possible to achieve the same time with constant queue size.
In general we hope that permutation routing can be solved in a time equal to or only slightly larger than the diameter. In any case we would like to route permutations in O(diameter) time. The above algorithm for k-dimensional hypercubes achieves this for most permutations, the expected time is O(k), but there are a few permutations for which the performance is unacceptable. Further down we will see how to overcome this problem.
The given algorithm is uni-axial, that is, in any step either only horizontal or only vertical connections are used. If the mesh has full-port capacity, then this is wasteful. However, it is often possible to overlay two uni-axial algorithms. In this particular case the solution is very simple: all packets going from PU_{i_x, i_y} to PU_{j_x, j_y} with i_x + i_y + j_x + j_y even are routed as described, while the other half of the packets are routed first along the columns and then along the rows. Said otherwise: if the PUs are colored black and white as on a chess board, all packets routed between fields of the same color are routed xy, while the others are routed yx. Each of the problems involves half as many packets as before and therefore the whole routing takes only n^3 / 4 now, which is optimal.
This coloring idea is very useful, because it allows to concentrate on designing uni-axial algorithms. This strongly facilitates the design. Of course this idea is not limited to two-dimensional meshes, but can be applied on any network in which separate dimensions can be distinguished, most notably meshes and tori of any dimension and hypercubes.
In the context of all-to-all routing, a k-dimensional hypercube with n = 2^k PUs can be perceived as a k-dimensional mesh with side length equal to two. The uni-axial algorithm runs in k * 2^{k + 1} / 4 = k * n / 2. Overlaying k of these algorithms each taking care of a fraction 1 / k of the packets, the routing time can be reduced to n / 2.
T_one(P, l) = c_v * l * (P - 1) + c_a * (P - 1).This is P - 1 times the time for routing a permutation, which is not surprising because the all-to-all routing is realized by decomposing it into permutations. Of course this works only if all the packets have (approximately) the same length. Otherwise a more sophisticated algorithm should be applied which is presented further down.
Even if all packets have the same size, then it is not necessarily true that sending the packets directly to their destinations is the best one can do. This may sound surprising but is due to the relatively large number of start-ups of the given algorithm: if l is small, then the term c_a * (P - 1) will dominate and it may be profitable to reduce it even if the first term increases because of this.
As an alternative consider the simulation of the algorithm for a two-dimensional mesh. All PUs are indexed as in a mesh with a pair of indices (i, j), 0 <= i, j < sqrt(P). The algorithm consists of two phases. In phase 1, all packets are rearranged within the "rows", in the second phase within the "columns". More precisely, in step t, 1 <= t < sqrt(P) of phase 1 PU_{i, j} routes to PU_{i, (j + t) mod sqrt(P)} all packets which have destination in any PU_{i', (j + t) mod sqrt(P)}, with 0 <= i' < sqrt(P). In step t, 1 <= t < sqrt(P) of phase 2 PU_{i, j} routes to PU_{(i + t) mod sqrt(P), j} all packets which have destination in any PU_{(i + t) mod sqrt(P), j'}, with 0 <= j' < sqrt(P). So, the algorithm consists of 2 * (sqrt(P) - 1) steps in each of which l * sqrt(P) data are routed. This gives
T_two(P, l) = c_v * 2 * l * (P - sqrt(P)) + c_a * 2 * (sqrt(P) - 1).For large l this algorithm is almost twice as slow, but it may be up to sqrt(P) / 2 times faster for very small l.
Of course we do not have to stop here. If the packets are really small, then the algorithm for a three-dimensional mesh can be simulated, or in the extreme case, the whole routing can be performed with just log(P) routing steps by simulating the hypercube algorithm. Doing this, in each step half of the data a PU holds must be routed. This gives
T_log(P, l) = c_v * log(P) * l * P / 2 + c_a * log(P).
The algorithms with several phases also require that in between two phases the received packets are decomposed into their components and recomposed for the next routing. With appropriate bucketing all this can be performed in a time that is linear in the amount of involved data. Nevertheless, if c_i is not so much smaller than c_v, as it is the case on the best parallel computers, this gives a non-negligible delay. More generally, reducing the number of start-ups by multiplying the routing volume by a factor should be limited to subproblems involving only a small fraction of the total data set. Otherwise, considering that c_v > c_i, no reasonable speed-up can be expected, and it would be better to reduce the number of processors used in order to get packets of a more substantial size.
On a two-dimensional n x n mesh, the information is first routed within the row of the source PU and then within all columns. If the source is not PU_{0, 0}, then the routing in each of the phases should start in the direction in which the distance to cover is largest. The algorithm requires 2 * n - 2 steps at most, which is optimal.
On higher-dimensional meshes and hypercubes we can apply the dimension-ordered generalization of the above algorithm. On a hypercube this means that the routing is performed along an embedded binomial tree. On a k-dimensional hypercube the algorithm takes k steps, which is optimal.
T_hypercube(P, l) = c_v * l * log(P) + c_a * log P.
Surprisingly this is mostly not the best. If l is slightly larger, then v * l * log(P) is a large number. In that case it is better to use an algorithm which is similar to that for the linear array. The information is divided in a large number f of flits. Flit i, 0 <= i < f, is routed in step i from PU_0 to PU_1, and then on in a linear way. It takes f + P - 2 steps before flit f reaches PU_{P - 1}. This gives
T_linear(P, l) = c_v * l / f * (f + P - 2) + c_a * (f + P - 2).For sufficiently large f, the first term converges to c_v * l, which is optimal. Of course this produces many more start-ups.
For intermediate values of l, there are intermediate algorithms. A simple idea is to use an algorithm similar to the one for a two-dimensional mesh: first broadcasting in a pipelined way within the rows, then within the columns.
Lemma: Gossiping in the unit-cost telephone model on a network with P PUs and diameter D requires at least max{D, round_up(log(P)) + odd(P)}. Here odd(x) = 1 for odd x and 0 otherwise.
Proof: To cover a distance d, at least d rounds are required. After t >= 0 rounds any information is known in at most 2^t PUs, because in any round the number of informed PUs can at most double. If the number of PUs is odd, then there is at least one PU which remains unmatched in the first round. So, the information from this PU is known in at most 2^{t - 1} PUs after t > 0 rounds. This means that for t <= round_up(log(P)) + odd(P) - 1, the number of informed PUs is at most 2^{round_up(log(P)) - 1} < P. End.
Gossiping on d-dimensional meshes is easy as long as all side lengths are even. In that case the gossiping can be performed in a dimension-ordered way. On an n x ... x n mesh for n even, this takes d * (n - 1) rounds, which is optimal. For odd n, it is always possible to perform the gossiping in d * n rounds, but this is not necessarily optimal. For example, on a 3 x 3 mesh gossiping can be performed in 5 rounds. This is optimal, because for a 3 x 3 mesh, the number of PUs P = 9, and thus round_up(log(P)) + odd(P) = 5.
For small networks gossiping schedules can be described very concisely and convincingly by a series of pictures. The nodes are subdivided according to the shape of the network, marking in each subdivision whether the information from the PU at the corresponding position is already known or not. The used matchings can be highlighted.
The unit-cost model is not the only possible cost model. More flexible is the linear-cost model in which it is assumed that a data exchange between two adjacent PUs in which each sends at most l informations takes c_v * l + c_a time. In principle it is not even necessary that all PUs remain connected equally long. However, the more freedom there is, the harder it becomes to design good algorithms. Therefore it is still useful to perform the gossiping by a sequence of matchings. The cost of a gossiping schedule is given by the number of rounds, the number of consecutive matchings used, and the number of steps, the sum over all rounds of the number of transferred packets.
We consider again the above schedule for gossiping on a 3 x 3 mesh. It consists of 5 rounds which was shown to be optimal. The number of steps is 1 + 2 + 4 + 4 + 7 = 18. This is far from optimal. It has been shown (masters thesis of Rene Beier) that 11 steps is the minimum achievable for any schedule with 5 rounds, and that in general at least 10 steps are required. Both bounds can be matched. We see here, and that is a general phenomenon, that there is a trade-off between rounds and steps. So, in general the best gossiping algorithm depends on the ratio c_a / c_v. Only for some very regular networks, such as linear arrays and hypercubes, the number of rounds and steps can be minimized at the same time. In that case there is a unique optimal gossiping schedule.
T_hypercube(P, l) = c_v * l * (P - 1) + c_a * log P.This is optimal: the number of start-ups is minimized, and the routing volume equals the amount of data each PU must receive. For large P the start-up costs will become negligible.
Another communication operation is gather. In a gather each PU initially has l individual data. The task is to gather all this information in PU_0. This is not the inverse of a broadcast, because in a broadcast finally all PUs know the same information. Clearly this operation requires at least log P start-ups and because l * (P - 1) data must be transferred into PU_0. So, on a DMM gathering has the same lower bound as gossiping and can be solved optimally by applying the hypercube gossiping algorithm.
The packets are denoted p_i. s_i is the index of the source PU of p_i, d_i is its destination PU.
Notice that the mapping s_i -> x_i is not a permutation. Alternatively PU_0 might select the index of a permutation and broadcast it to all PUs in k steps, but this is a waste of time. The analysis of Valiants algorithm is not hard but beyond the scope of this chapter. Both routing phases have the same time distribution, because phase 2 can be viewed as a reversal of phase 1: in phase 2 the packets are moving from a random position to a fixed position. It can be shown that each of these phases takes k + o(k) time with high probability and that the queues remain small as well. So, the whole algorithm essentially 2 * k + o(k) = O(k) steps with high probability. Using randomization all permutations become equally hard, there are no longer good and bad permutations. The identity has the same expected time as a transposition.
A disadvantage of the above approach is that we loose a factor two for all those permutations for which even the greedy algorithm performed well: the expected time of the greedy algorithm is k + o(k), which is optimal up to the lower order term. A nice idea helps to make the randomized algorithm optimal while still making it highly unlikely that much more than the expected time is required. The algorithm is the same as before, with one small change. After selecting x_i, the length of the path s_i -> x_i -> d_i is computed. If this path is longer than k, we use comp(x_i) as intermediate destination instead of x_i. Here comp() is the operation which in a binary number complements all bits. The maximum distance any packet has to travel is now bounded by k: if dist(s_i, x_i) + dist(x_i, d_i) > k, then dist(s_i, comp(x_i)) + dist(comp(x_i), d_i) = k - dist(s_i, x_i) + k - dist(x_i, d_i) = 2 * k - dist(s_i, x_i) + dist(x_i, d_i) < k. Vöcking has shown that not withstanding the reduced randomness of the intermediate destination, the delay due to congestion is still limited to o(k) with high probability.
Consider an all-to-all routing. Assume that in total each PU is sending and receiving the same number of packets, but that not each PU is sending the same number of packets to each other PU. Then we can first send each packet to a random intermediate destination like in the hypercube algorithm. This replaces a possibly unbalanced pattern by two balanced all-to-all routing patterns. A refinement is to first determine the degree of unbalance and to randomize only as much as necessary.
For the analysis it is necessary to understand how much time the routing on the one-dimensional subarrays takes. Without a proof we give the central lemma in this domain:
Lemma: For a routing pattern on a linear array with n PUs, let r_{i, j}, 0 <= i < j < n, denote the number of packets that must travel from a PU with index i or smaller to a PU with index j or larger. When applying the farthest-first strategy, the rightwards moving packets need exactly max_{i < j} {r_{i, j} + j - i - 1} steps to all reach their destinations.
For most distributions the maximum is either assumed for i = 0, j = n - 1, or for i = n / 2 - 1, j = n / 2. The maximum time for a regular pattern like a k-k routing is always given by the maximum of these bounds. This simplifies the analysis a lot.In the above case we are routing random k-k patterns. The maximum distance is n - 1, and in a randomization about half of the packets want to cross the bisection. Thus, each of the phases takes max{n, k * n / 4} + o(k * n). This is not yet good, but we can exploit that the given routing is uni-axial. As for the all-to-all algorithm above, two of these algorithms can be overlaid. To make this effective, the packets must somehow be split in two. In a randomized context it is natural to use randomization for this as well. So, each packet is colored independently white or black with probability 1 / 2. The white packets are routed xy, the black packets yx. In this way we do not exactly obtain two (k / 2)-(k / 2) patterns, but that does not matter. The important point is that now in each routing phase, the expected number of packets crossing the bisection of a one-dimensional subarray is k * n / 8. Using Chernoff bounds, it may be shown that all deviations from the expected value are small with high probability.
Theorem: On a two-dimensional n x n mesh, k-k routing can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.
The above result is optimal for all k >= 8, because there the k * n / 2 bisection lower bound is matched up to lower-order terms. For k = 1, 2 k-k routing can be performed in 2 * n + o(n) time. This is optimal because it matches the distance bound, the lower bound which is obtained by considering how far the packets have to travel. For 2 < k < 8 it is not known how many steps are exactly needed for k-k routing. For example, 4-4 routing can be performed in 3 * n + o(n) steps, but it is not clear whether this is optimal or not.
Assume we are performing an all-to-all routing on a DMM with P PUs, each PU sending and receiving m numbers in total, l = m / P packets traveling between any pair of PUs on average. Each PU sorts the numbers according to the index of their destination PUs. Using bucket sort this takes O(m) time. A number residing in PU_i with rang r is sent to PU_{(i + r) mod P} as intermediate destination in a first routing phase and in a second routing phase to its destination. This is a very simple algorithm. Creating the packets to be sent to each PU may be even faster than in the randomized algorithm because the calls to the random number generator are saved.
How good is the resulting distribution? Clearly in the first routing phase the packets are almost perfectly balanced, the number of packets going from one PU to another is either round_down(m / P) or round_up(m / P). In phase 2 the imbalance is larger. Denote by a_{i, j} the number of packets originating in PU_i with destination in PU_j and by b_{i, k, j} the number of packets which are send from PU_i via PU_k to PU_j. For arbitrary k and j we want to put an upper bound on sum_i b_{i, k, j}. The regular distribution gives b_{i, k, j} <= round_up(a_{i, j} / P). This gives
sum_i b_{i, k, j} <= sum_i round_up(a_{i, j} / P) <= P + sum_i a_{i, j} / P = P + m / P.So, if m = omega(P^2), the imbalance is negligible. For the small values of P that currently are common, this is no problem, but in general this is quite a strong condition.
On n x n meshes any operation in submeshes of size o(n) takes only o(k * n) time and is therefore cheap in comparison to the Theta(k * n) time for solving most problems. Here k denotes the number of packets per PU. So, on meshes we can apply the same idea, but the allocation of intermediate destinations may be coordinated in submeshes of quite considerable size. Thereby the rounding errors can be made of minor importance even for small k. As a result, the same result that was achieved above can also be obtained deterministically.
The idea of coordinating the rounding decision also works on DMMs. Due to rounding errors, each of the rounds of phase 2 might be delayed by c_v * P, in total c_v * P^2. If first all PUs are subdivided in groups of size sqrt(P), these PUs can exchange their values b_{i, j}. These are small gossip operations, costing c_v * P^{3 / 2} + c_a * log(P) / 2 time. Then they can optimize their rounding, reducing the loss in phase 2 to c_v * P^{3 / 2}. So, now we must have m = omega(P^{3 / 2}). This is much better, but still not as good as for the randomized algorithm, for which the rounding errors are negligible for any m = omega(P * log P). Of course, in practice it is quite unlikely that the deterministic algorithm has maximal bad luck. Even though we might not assume that all a_{i, j} are equal, it is quite reasonable to assume that the rounding errors are more or less randomly distributed.
First we construct a bipartite graph with 2 * n nodes. The sets V_1 and V_2 each have n nodes, corresponding to the source and destination rows of the packets, respectively. There is an edge from node i in V_1 to node j in V_2 if there is a packet moving from row i to row j. Because there are exactly n packets starting and finishing in each row, the graph is n-regular (all nodes have degree n). Thus, according to Hall's theorem, the graph has an edge coloring with n colors. Because this is an online algorithm, the existence of such a coloring is the only thing that matters, but actually such colorings can be computed very efficiently. That is, there is an assignment of values, which are called colors, ranging from 0 to n - 1, so that the edges incident on a node all have different colors.
These colors are used in the routing algorithm. Denote the packets by p_i and let x_i be the color attributed to the edge corresponding to this packet.
It remains to check that the algorithm is as good as claimed above. For the analysis of phase 1 we consider the routing in an arbitrary row. Originally each PU holds one packet. Because the edges corresponding to these packets are all incident on the same node of the bipartite graph, they all have different colors, and therefore no two packets will be routed to the same PU. For the routing in phase 2, we consider the routing in an arbitrary column. Because of the previous result, initially each PU holds one packet. If two packets in the column would move to the same row, then two packets with destination in the same row would have gotten the same color, which is in contradiction with the coloring which colors all edges incident on the node in V_2 corresponding to this row differently. Therefore, at the beginning of phase 3, all PUs hold one packet. In each row each PU is the destination of one packet because the complete pattern is a pattern, and in phase 3 the packets are routed to their destinations. So, the algorithm consists of three permutations within one-dimensional subarrays. Therefore, each phase can be executed in n - 1 steps, 3 * n - 3 steps in total, and never a PU is storing more than one packet.
The queue size is optimal, the time is not. It is still open whether it is possible to route all permutations in 2 * n - 2 steps with queue size 1. A few steps more and queue size 2 is possible.
Let A = B x C be a product network. B has n_B nodes and C has n_C nodes. A schedule for routing a permutation on A can be computed offline by first constructing a bipartite graph with node sets V_1 and V_2, each with n_C nodes. Node c, 0 <= c < n_C, in V_1 and V_2 corresponds to the subnetwork B x c in A. The bipartite graph has n_A = n_B * n_C edges. For a packet traveling in A from a node (b_1, c_1) to a node (b_2, c_2) there is an edge from node c_1 in V_1 to node c_2 in V_2. Thus, the bipartite graph is n_B-regular and can be colored with n_B colors. Packet i is denoted p_i, it is traveling from (b_i, c_i) to (b'_i, c'_i), and its color is x_i, 0 <= x_i < n_B. The routing is performed as follows:
Theorem: Permutation routing on a product network A = B * C can be performed in T_A = 2 * T_B + T_C steps, where T_B and T_C denote the number of steps for routing permutations on B and C, respectively. At any time a PU is storing at most one packet.
Proof: Phase 1 is a permutation routing in subnetworks with the structure of B, because for any subnetwork B x c, all destinations occur exactly once due to the coloring of the edges incident on the node in V_1 corresponding to this subnetwork. The coloring of the edges incident on the nodes of V_2 assures that phase 2 is a permutation routing in a network with the structure of C. Phase 3 is again a permutation routing within subnetworks with the structure of B because at the beginning each PU holds one packet, and because the whole pattern is a permutation, each PU is also the destination of one packet. End.
If T_B != T_C, then it is profitable to arrange things so that T_B < T_C. The above construction immediately gives results for d-dimensional meshes. The best is to perceive a d-dimensional n x ... x n mesh as the direct product n x (n x ... x n). Using induction this leads to
Corollary: Permutation routing on a d-dimensional n x ... x n mesh can be performed in (2 * d - 1) * n steps. At any time a PU is storing at most one packet.
The algorithm is simple: a bipartite graph with two sets of P nodes is created. For a packet moving from PU_i to PU_j there is an edge from node i in V_1 to node j in V_2. This gives an m regular graph which can be colored with m colors. Each PU first routes all packets with color x to PU_{x mod P}. In the second routing phase the packets are routed to their destinations. Because of the coloring property for the edges incident on the nodes of V_1, in the first round each PU sends exactly l packets to each PU. Because of the coloring property for the edges incident on the nodes of V_2 each PU receives one packet from each color. These stand perfectly distributed after the first routing.
Clearly for a problem like permutation routing, which is not particularly hard, it appears outrageous to first solve a bipartite graph-coloring problem of a graph with one edge per packet. This problem is far more complicated then the routing itself. However, we can solve a strongly reduced problem in o(n) time, while still obtaining almost as good performance as with the offline algorithm. This idea of reducing the information content of a problem is called sparsification and is useful beyond the scope of turning offline algorithms into online algorithms.
The approach is sketched for a two-dimensional n x n mesh. The mesh is divided in m x m submeshes. m = o(n) is quite large, one should think of m = n^{1 - eps}, for some suitable eps > 0. The packets going to each of these n^{2 * eps} submeshes are counted in each submesh. The bipartite graph does not have one node for each row, but one node for each of the n^eps row bundles. A further idea, which helps to reduce the size of the graph is to not create an edge for each edge but only for each superpacket. A superpacket consists of a large number s = n^alpha packets with common destination. Because we cannot assume that s exactly divides the number of packets traveling between a pair of row bundles, in general there will be one partially filled superpacket for each destination pair. So, in each submesh the graph to color has 2 * n^eps nodes and n^{2 - 2 * eps - alpha} + n^{2 * eps}. Choosing the parameters right, we can achieve the following:
After solving the localized and sparsified coloring problems, the routing cannot be performed so simply as it was done before. Each routing phases must be preceded by a rearrangement in the submeshes, so that the packets stand well-distributed. For example, preceding a horizontal routing phase, the packets must be arranged so that in each row of a submesh the number of packets going to any other submesh is about equal. When reaching a submesh, the packets are not routed to a specific PU, but rather to the first PU in the submesh which is still storing fewer than the maximum allowed number of packets. Together these ideas assure small queues and good routing times. Because the submeshes have size o(n), the cost of each operation is a lower-order term of the total routing time, and because there are only constantly many routing phases, their cost is negligible altogether.
The entire system consists of a number of PUs with conventional powerful arithmetic and storage and a number of switches. The switches are slightly more powerful than in a crossbar: they listen to the signals passing on the buses they are connected to and take out whatever is destined for them. A small number of values can be stored and the switches can be set conditionally as a result of a comparison between stored values. The whole system is assumed to be synchronized and the difference between receiving a signal or not receiving a signal can be detected.
Typically the switches are arranged in a rectangular grid, but in principle they might also be arranged differently. The PUs are arranged on the outside. The switches normally have four ports, denoted W, N, E and S. Sometimes it is assumed that these ports can be connected dynamically in any desirable configuration, sometimes the configurations are limited to a subset thereof. In the presented examples we will not need anything fancy. A switch setting is denoted by indicating which ports are connected. For example, WN-SE indicates the setting in which W is connected with N and S with E.
It is important to notice that the switch setting decomposes the grid into a number of subnets. Within these nets there is broadcasting possibility: as long as it is assured that only one of the connected PUs is sending, all connected PUs and switches will receive the information transmitted in the net. A system as described is called a reconfigurable processor array
One may argue that reconfigurable networks are physically not realistical. There are two major assumptions which give reason for skepticism:
The first point is a true limitation: speed of light is finite and the size of the reconfigurable network certainly increases with the number of switches. This is a limitation, but exactly the same assumption is made in all hypercube, butterfly and PRAM algorithms. Two-dimensional meshes (and to a lesser extend three-dimensional meshes) are the only interconnection networks that can be realized in physical space without scaling problems. In a two-dimensional lay-out of a hypercube with n PUs there are wires of length sqrt(n).
It is harder to estimate the importance of the second point. Radio exists, and the signal does not become noticeably weaker if the number of listeners increases. Even wire-bound broadcasting exists in the form of cable television. So, technologically, this is no problem either. Possibly the signal must be somewhat stronger than for a point-to-point transmission. Also one can imagine that especially branchings weaken a signal. In most algorithms for reconfigurable arrays, these can easily be avoided by performing the broadcasting in two steps.
The algorithm requires n + 3 PUs and 6 * n switches arranged in an 3 x 2 * n grid. In a first round PU_i informs the six switches in the columns 2 * i and 2 * i + 1 about its value. After receiving this value, the switches set their connections in preparation for the next round. If a 0 is received, a switch is set to WE-NS. If a 1 is received the switch must be set WN-SE if it is found in column 2 * i. If the switch is in column 2 * i + 1, then it is set to WS-NE, with exception of the switch in row 1, which is set to WE-NS. Then in the second round, a signal is entered in the net from the lower-left side. If it comes out at the right side in the same row, then the parity is even, if it comes out one row higher, the parity is odd. The PU on the right side receiving the signal can either output the conclusion, or broadcast it to all other PUs.
The input is a set of n numbers a_i, number i originally residing in PU_i. By taking also the index into account, we may assume that all values are different. The algorithm consists of two somewhat independent parts. In the first part, for each a_i its rank r_i within the set of numbers is computed. In the second part, a_i is routed to PU_{r_i}. Each of these operations can be performed in a constant number of steps. For each i there is a dedicated n x n grid of switches. The switches in column j of this network are set depending on a comparison of a_i with a_j similar to the construction in the parity algorithm: if a_j < a_i, a signal which is inserted at the lower-left corner is guided one row upwards. The signal finally leaves the network on the right side in row r_i of the subnetwork. Even when using only the three different settings which were used before, the whole algorithm can be implemented with just 6 steps.
In the section on "Using Randomization" a lemma was given specifying exactly the time for routing on a linear array with n PUs when using the farthest-first strategy.