Interconnection Networks and Routing

Introduction

The PRAM and the DMM are simple and can be defined and described in a single paragraph. For interconnection networks the situation is different: there is a whole body of literature just describing and analyzing network topologies. Also there is extensive literature on the various communication problems on interconnection networks. In this chapter we first give a brief general introduction focusing on a small set of widely used interconnection network topologies and the most important communication problems. Then we go slightly deeper into some slightly more advanced topics which are based on some very interesting ideas with application beyond the precise scope for which they are presented. Because the communication problems which must be solved on DMMs are the same and because most algorithms are directly inspired by communication algorithms for interconnection networks, we also consider routing on DMMs.

Algorithms for interconnection networks are mostly analyzed using the interconnection-network cost model, in which it is assumed that in each step each PU may perform an unlimited amount of communication and then route one unit packet to one or more of its immediate neighbors.

Most algorithms for this cost model deal with the various communication problems, rather than to solve a general problem, say maxflow. The reason for this is that the communication problems are already hard enough, and that harder problems can often be solved quite well by taking a parallel algorithm designed for a more powerful parallel computer model, and then to simulate the operations of this computer on the interconnection network. Such a simulation is composed of two components: an allocation of the data and processes to the PUs and performing the resulting routing operations.

Next to the various communication operations also some of the most basic problems have been studied for interconnection networks. These are sorting, selection, list ranking, matrix multiplication and the like. In a following chapter we will even see how to solve the all-pairs shortest-paths problem optimally on a two-dimensional mesh.

Topologies

Here we introduce some of the most important interconnection-network topologies, fixed classes of interconnection patterns. The degree of a PU in a network is the number of other PUs it is connected to, corresponding to the degree of the node if the network is identified with the underlying graph. Interconnection networks are normally considered to be undirected: if PU_i is connected to PU_j, then PU_j is also connected to PU_i. The most important quality measures of an interconnection network are its

Complete Networks

A natural network topology is the one whose underlying graph is the complete graph. In such a complete network there is a connection between any pair of PUs. This topology is normally studied for the one-port model, while full-port capability on a complete network being unrealistically powerful.

In hardware a complete interconnection network with one-port capability can be realized with a crossbar switch. For interconnecting P PUs, a crossbar switch consist of a grid of P x P switches. These switches have four connections, W, N, E, S. The switches have two settings, either connecting W with E and N with S or connecting W with N and E with S, which can be chosen dynamically. Setting the switches appropriately, point-to-point connections can be created. Notice that using a crossbar switch each PU can send a packet to at most one PU and receive a packet from at most one PU, but that the PU to which it is sending may be different from the PU from which it is receiving. Switch-based complete network topologies are very common in practice. But, because of the increasing delays and cost such systems are typically of moderate size.

8 x 8 Crossbar Switch

Grid-Like Networks

Another very simple network topology is the linear array. In this topology each of the n PUs has one or two neighbors. Between any pair of PUs, (i, i + 1), 0 <= i <= n - 1, there is a connection. So, the degree of the linear array is 2. Its diameter is n - 1 and the bisection width is 1.

A similar network is the circular array which is also called ring. The only difference with the linear array is an additional connection between PU_0 and PU_{n - 1}. This connection is often designated as wrap-around connection. The degree is 2, the same as before. However, the other measures are better: the diameter is round_down{n / 2} and the bisection width is 2. This implies that most problems can be solved on a circular array twice as fast as on a linear array.

An n_1 x n_2 mesh consists of n_1 * n_2 PUs arranged in a grid. Each PU is connected to its up to four neighbors. As a graph, a mesh is the direct product of two linear arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter n_1 + n_2 - 2 and the bisection width is min{n_1, n_2}.

An n_1 x n_2 torus is a mesh with wrap-around connections: there are additional connections between PU_{i, 0} and PU_{i, n_2 - 1} for all i, 0 <= i < n_1, and between PU_{0, j} and PU_{n_1 - 1, j}. As a graph, a torus is the direct product of two circular arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter round_down(n_1 / 2) + round_down(n_2 / 2) and the bisection width is 2 * min{n_1 , n_2}. On tori most problems can be solved twice as fast as on meshes.

There are also higher-dimensional meshes and tori. A d-dimensional n_1 x ... x n_d mesh consists of n_1 * ... * n_d PUs arranged in a d-dimensional grid. Each PU is connected to its at most 2 * d neighbors. The degree is 2 * d, the diameter is n_1 + ... + n_d - d and the bisection width is n_1 * ... * n_d / min{n_1, ..., n_d}. The definition and properties of d-dimensional tori are analogous.

Linear Array, Circular Array, Mesh, Torus

Tree-Like Networks

A tree is a network consisting of 2^{k + 1} - 1 PUs arranged and interconnected as a perfect binary tree of depth k. So, there are k + 1 levels of PUs, with 2^j PUs in level j, 0 <= j <= k. The root, the PU in level 0 has degree two, the leafs, the PUs in level k have degree one, all internal nodes, the PUs at all other levels have degree three. The degree is 3, the diameter 2 * k and the bisection width is 1.

The major weakness of a tree network is its very small bisection width. This means that any problem requiring a single rearrangement of the data has at least linear running time. For such problems a tree is not better than a linear array. Therefore several improved trees have been designed. The simplest is the fat tree. In a fat tree the connections between level i and level i + 1, where the leafs are considered to lie at level 0, have capacity 2^i. This means that such a connection can transfer 2^i packets instead of 1.

Alternatively tree structures have been integrated in other networks. An important example of such a network is the mesh of trees. An n x n mesh of trees is constructed from an n x n grid of not-connected nodes. The PUs in each row and column of this grid are connected by a tree. There are no direct connections between the row trees or between the column trees. The row and column trees are connected at their shared leafs. An n x n mesh of trees has 2 * n * (2 * n - 1) - n^2 = 3 * n^2 - 2 * n nodes.

8 x 8 Mesh of Trees

Cube-Like Networks

A k-dimensional hypercube consists of n = 2^k PUs arranged in a k-dimensional grid with side length 2. Each PU is connected to its k neighbors in the grid. The degree is k = log n, the diameter is k and the bisection width is n / 2.

Hypercube

A k-dimensional cube-connected cycle, abbreviated CCC, network is obtained by taking a k-dimensional hypercube and replacing each node by a cycle with k nodes. Node d, 0 <= d < k, in each cycle takes over the connection of the hypercube along the d-direction. More precisely, the b * 2^k nodes of the CCC are indexed with pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k), (i, (d - 1) mod k) and (i ^ 2^d, d). Here i ^ 2^d denotes i + 2^d if bit d of i is 0 and i - 2^d if this bit is 1. The degree is 3, the diameter is round_down(5 * k / 2) - 2 for k >= 3. A CCC is the simplest network which combines a logarithmic diameter with constant degree. The bisection width is the same as in the corresponding hypercube, 2^{k - 1}.

A k-dimensional butterfly network is similar. The k * 2^k nodes are indexed as pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k) and (i, (d - 1) mod k). Furthermore there is an undirected edge from (i, d) to (i ^ 2^d, (d + 1) mod k). So, the degree is 4. The diameter is round_down(3 * k / 2). Because many router chips provide four connections, and because of the good routing properties of butterflies these networks have been used quite a lot in practice.

Four Dimensional CCC and Butterfly

Cayley Graphs

The concept of Cayley graphs gives a generic way of constructing many different graphs. For a finite group G and a generator set S which is closed under inversion, a graph (V, E) is defined as follows: Because the set S is a generator set, the graph is connected. Because it is closed under inversion, for any edge (g, g * e), there is also an edge (g * e, g * e * e^{-1}) = (g * e, g), and thus the graph is undirected.

Cayley graphs are much nicer than just connected and undirected: they are even node symmetric. A graph is said to be node symmetric, if for any pair (u, v) of nodes there is an automorphism phi = phi(u, v) of the nodes so that if (x, y) is an edge that then also (phi(x), phi(y)) is an edge. For Cayley graphs, these automorphisms are easy to construct: determine elements e_1, ..., e_k so that e_1 * ... * e_k * u = v, then phi maps any x in V to x * e_1 * ... * e_k. If (x, y) in E, this means that y = x * e for some e in S. Thus, (phi(x), phi(y)) = (phi(x), phi(x * e)) = (phi(x), e_1 * ... * e_k * (x * e)) = (phi(x), (e_1 * ... * e_k * x) * e) = (phi(x), phi(x) * e) in E.

Several of the graphs we have encountered before are Cayley graphs: circular arrays, tori, hypercubes and CCCs, though these can be presented in a more natural way. We describe how hypercubes can be obtained as Cayley graphs. For H_k, the k-dimensional hypercube, is obtained from a subgroup of the permutation group of 2 * k elements. There are k generators each having two consecutive elements exchanged. To make the description more concrete we only consider the special case k = 4. The set S of generators consists of (10234567), (01324567), (01235467) and (01234576). The subgroup consists of all elements obtained by starting with (01234567) and letting these elements work. If in the conventional indexing of the hypercube we find a 1 at position i, 0 <= i < k, the elements 2 * i and 2 * i + 1 in the indexing with a permutation are exchanged. For example, (0110) corresponds to (01325467).

Conclusion

Linear and circular arrays are very weak networks. Only for very simple problems it is possible to achieve speed-up on them. The main reason to consider them is that many algorithms for meshes and tori can be viewed as generalizations of 1-dimensional algorithms. Meshes and tori have a relatively high diameter and relatively small bisection width. O(sqrt(P)) is mostly the maximum achievable speed-up with P PUs. On the other hand are these networks easy to construct and arbitrarily scalable and extendible. A tree has small diameter, but too small bisection width. Only problems which require very little data to be exchanged can be solved efficiently on them. On hypercubes many important problems can be solved optimally. The problem with these is that the degree is not constant. Furthermore, in a two-dimensional layout, due to their large bisection width, they require an area which is quadratic in the number of PUs n, which implies that some adjacent PUs are standing sqrt(n) or more apart. So, if the physical distance signals must travel is the determining factor, hypercubes are not better than meshes. The conclusion is that none of the networks is entirely satisfying, and this is precisely the reason that all of these and many more are studied.
For interconnection networks, there is a trade-off between quality on the one hand and complexity and scalability on the other.

Embeddings

There is a multitude of considered interconnection networks and there is a multitude of problems. It would be inefficient to study each problem for each network. This is often not necessary. In the first place many algorithms can be generalized or modified. This is a common practice between networks of a similar nature such. Algorithms for two-dimensional meshes can often be generalized for higher-dimensional meshes. Algorithms for hypercubes after modification often work also for other interconnection networks with logarithmic diameter and high bisection width. A more formal way of carrying over algorithms from one network to another is by using embeddings.

An embedding of an interconnection network IN_1 with P_1 PUs into a network IN_2 with P_2 PUs is a mapping phi() of the PUs of IN_1 to the PUs of IN_2. Generally it is part of the definition that phi() should be injective. Not only the PUs should be mapped, but for any two PUs PU_i and PU_j in IN_1 connected by a link there should also be specified a path in IN_2 running between phi(PU_i) and phi(PU_j). There are three cost measures that are used when specifying the quality of an embedding:

Any step of IN_1 can be simulated on IN_2 with at most dilation * congestion steps. Thus, the work on IN_2 exceeds the work on IN_1 by at most a factor dilation * congestion * expansion. As long as all these numbers are constants, an optimal algorithm for IN_1 automatically gives an optimal algorithm for IN_2.

As a first example we consider the embedding of a torus into a mesh. This embedding is obtained by first blowing up the n x n torus to a 2 * n x n torus and folding it double along the middle row, so that one PU comes in each row, and then repeating this for a folding along the middle column. The following picture illustrates the process. Notice that each pair of PUs that previously was adjacent now is at distance two at most: the dilation is two. Also, are there at most two paths mapped to any link of the mesh: the congestion is two. For the special case of dilation and congestion both equal to two, any step of the embedded network can be simulated in two steps. So, any problem which on an n x n torus can be solved in T communication steps can be solved on a mesh in 2 * T steps.

Embedding a Torus in a Mesh with Dilation Two

Another important example is the embedding of a binomial tree with n = 2^k nodes in a hypercube. The construction is simple: a binomial tree with 2^k nodes, T_k consists of two trees T_{k - 1} whose roots are connected. A k-dimensional hypercube consists of two (k - 1)-dimensional hypercubes with connections between all corresponding nodes, in particular between the nodes (0, 0, ..., 0) and (1, 0, ..., 0). So, the embedding can be performed recursively. The dilation, congestion and expansion are all 1: the graph of the binomial tree is a subgraph of the hypercube, obtained by removing edges.

Embedding Binomial-Tree in Hypercube

Two types of embeddings are particularly important: the embeddings of the linear and circular array. The embedding of a linear array, which is nothing but the construction of a Hamiltonian path, is important because it provides a way of indexing the PUs so that any two PUs with consecutive indices are adjacent in the network. Because many algorithms contain steps in which data must be exchanged to PUs with consecutive indices, this is a very desirable property. A systematic way of indexing the PUs of an interconnection network is called an indexing scheme. The embedding of a circular array, which is nothing but the construction of a Hamiltonian cycle, is important because algorithms may even require data to be shifted in a cyclical way.

In any grid it is easy to construct a Hamiltonian path. A two-dimensional mesh is Hamiltonian, which means that it has a Hamiltonian cycle, iff at least one of the sides has even length. Next to the indexing schemes obtained by embedding linear and circular arrays there are several other schemes for indexing the PUs of a grid. The most important is the so-called row-major indexing scheme. In this scheme, PU_{i, j} in an n x n mesh, which is located in row i and column j, is given index i * n + j. Mostly the rows and columns are numbered starting from the upper-left corner as in a matrix. In the column-major indexing scheme, this PU has index i + j * n. The indexing obtained by embedding a linear array, starting at the upper-left corner, then always going along the rows first, is called the snake-like row-major indexing. There are also so-called shuffled indexing schemes. The most important is the shuffled row-major indexing scheme. In this scheme, PU_{i, j}, for numbers i and j with binary expansions given by (i_{k - 1}, ..., i_i, i_0) and (j_{k - 1}, ..., j_1, j_0) respectively, has an index with binary expansion (i_{k - 1}, j_{k - 1}, ..., i_1, j_1, i_0, j_0).

Indexing Schemes

On a hypercube a Hamiltonian path can be constructed by exploiting the recursive structure of the network. H_k denotes the k-dimensional hypercube. The traversal of all PUs of H_k is obtained by gluing together two traversals of H_{k - 1}. If the gluing is done correctly, this even gives a Hamiltonian cycle. The schedule is known as Gray code. It can be described by giving the order in which the PUs are visited. It is easy to verify that in the following sequences all consecutive PUs, including the first and last, are adjacent:

for H_1 we have {0, 1},
for H_2 we have {00, 01, 11, 10},
for H_3 we have {000, 001, 011, 010, 110, 111, 101, 100}.

How to generalize the above sequences? Look at the sequence of H_3. It starts with the sequence of H_2 in which all indices have an additional leading 0. Then this same sequence appears once more in reversed order in which all indices have an additional leading 1. More generally, denoting the Hamiltonian sequence for H_{k - 1} by S_{k - 1}, the Hamiltonian sequence S_k for H_k is the concatenation of 0-S_k and 1-reversed(S_k). Here, for a sequence S and b in {0, 1}, b-S denotes the sequence obtained by preceding all elements in S by a b.

Lemma: The sequences S_k defined by S_1 = {0, 1} and S_k = 0-S_{k - 1} + 1-reversed(S_{k - 1}), where + denotes concatenation, give Hamiltonian enumerations of the PUs of the k-dimensional hypercubes H_k.

Proof: The proof goes by induction. For H_0 the construction is clearly correct. So, assume it is correct for H_{k - 1}. If S_{k - 1} is a Hamiltonian cycle, then so is reversed(S_k). The sequences 0-S_{k - 1} and 1-reversed(S_{k - 1}) are paths through H_k because of the recursive way in which H_k is constructed out of two connected copies of H_{k - 1}. It remains to check that these two paths fit together. The first element of S_{k - 1} and the last element of reversed(S_{k - 1}) are the same. So, the first element of 0-S_{k - 1} and the last element of 1-reversed(S_{k - 1}) differ in exactly one position, the leading one, and are therefore the corresponding PUs are connected n the hypercube. End.

Hamiltonian Cycle in Hypercube

Embeddings provide a major way of deriving algorithms for one topology from algorithms designed for another topology. Embeddings of linear and circular arrays lead to indexings for which consecutively indexed PUs are adjacent.

Routing

Considered Problems

Communication is a precondition for any computation in which several PUs jointly solve a common task rather than each solving their own task. Even in the latter case some communication is normally needed to hand-out the tasks in the beginning and to collect the answers in the end. In principle, in the course of a computation, we may encounter any pattern of source-destination pairs. However, because most algorithms have a fairly regular structure, there are some patterns which occur particularly frequently. The most frequently occurring patterns are the following:

Each of these operations has its own importance. All-to-all routing is the pattern encountered when there is no specific pattern in the data access: the PUs compute a while on the data they have and by doing so accumulate questions to be sent to other PUs. Unless the algorithm was designed in a specific way, these questions will be more or less randomly distributed over the PUs, giving rise to a more or less balanced all-to-all routing pattern. A permutation-routing pattern arises in algorithms that where designed in such a way that the questions are not scattered over the data set but concentrated. Such a structure is desirable because it reduces the number of start-ups. Broadcasting is a very common operation performed to distribute information from a source PU to a (subset) of PUs that must work on the same data. Typically this happens at the beginning of a computation. The symmetric operation is called gathering. Gathering means that information from all PUs is combined to a single information in some specified PU. For example, this may be done in order to sum up the computed results. Gossiping is used to make local samples of the information available to all PUs. It is also used to synchronize a database and in numerical communications. Because gossiping is a very expensive operation, it will normally be performed only for a small subset of the total data set. Another much studied problem is k-k routing, in which each PU is the source and destination of at most k packets. Both permutation routing and all-to-all routing are special cases of k-k routing. In practice general k-k routing patterns are rarely encountered.

Permutation Routing

Two-Dimensional Meshes

Consider routing a permutation on a two-dimensional n x n mesh. There is a very simple algorithm which works quite well:
  1. Route all packets within their rows to their destination columns.
  2. Route all packets within their columns to their destinations.
This algorithm is called the greedy routing algorithm, also called xy routing. The description is not complete though: in addition to saying how the packets move, a routing algorithm should also specify what is done if more than one packet wants to use the same connection at the same time. Due to our model assumption, only one can be transferred, the others must wait. So, we must specify a conflict-resolution strategy. In many cases the most natural strategy works best: the packet which has to move farthest gets priority. This strategy is called the farthest-first strategy. Also it should be specified whether the first routing phase is completed before the second phase is started, or that the phases are coallesced. Normally it is understood that the phases are separated, though in practice it will be faster and simpler to coallesce them.

How good is the greedy algorithm? It is easy to see that during phase 1 the packets run without any delay: the packets moving rightwards do not change there relative distances as long as they are moving and therefore they will never encounter each other. The same is true for the leftwards moving packets. So, phase 1 takes as long as the longest distance a packet has to go. The maximum distance that may have to be covered in a row is n - 1. As a result of the routing in phase 1, the packets may have piled up in some PUs. For example if the permutation is the transposition, in which PU_{i, j}, 0 <= i, j < n, sends a packet to PU_{j, i}, then n packets reside in all diagonal PUs at the end of phase 1. So, the queue size, that is, the maximum number of packets that may have to be stored in a single PU at the same time, may be as high as n. Thus, during the second routing phase there may be conflicts. However, due to the applied farthest-first strategy, can a packet which has to move upwards to a PU at distance d from the boundary of the mesh be delayed at most d times. Because the distance this packet has tot travel is bounded by n - d - 1, it will arrive no later than in step n - 1. So, even phase 2 will be completed in n - 1 steps, and therefore the whole algorithm is completed in 2 * n - 2 steps which is optimal. Much more sophisticated algorithms have shown that it is possible to achieve the same time with constant queue size.

Higher-Dimensional Meshes and Hypercubes

On higher-dimensional meshes and hypercubes the same algorithmic idea can be applied: the packets are routed dimension-by-dimension. This routing is called dimension-ordered routing and is widely applied in practice. On two-dimensional meshes this may give long queues but the time is optimal as we have seen above. However, on higher-dimensional meshes and hypercubes it can be very bad. Consider for a k-dimensional hypercube with n = 2^k PUs a generalization of the transposition, in which the packet residing in a PU with index (i_0, ..., i_{k - 1}) has to be routed to the PU with index (i_{k - 1}, ..., i_0). If dimension-ordered routing is applied, then all packets starting in the PUs with indices (i_0, ..., i_{k / 2 - 2}, 1, 0, ..., 0) will be routed through the PU with index (0, ..., 0, 0, 0, ..., 0). That is, 2^{k / 2 - 1} = sqrt(n) / 2 packets are routed through a single PU and have to leave it over a single connection. Thus, the routing takes Omega(sqrt(n)) time, which is not better than routing on a two-dimensional grid.

In general we hope that permutation routing can be solved in a time equal to or only slightly larger than the diameter. In any case we would like to route permutations in O(diameter) time. The above algorithm for k-dimensional hypercubes achieves this for most permutations, the expected time is O(k), but there are a few permutations for which the performance is unacceptable. Further down we will see how to overcome this problem.

DMMs

On a DMM permutations can be routed in a trivial way: each PU sends its packet directly to the destination PU. Because the whole pattern is a permutation it is guaranteed that this can be performed without conflicts. If the packets have length l, this takes c_v * l + c_a time. This is clearly optimal because at least one start-up is needed and all data must be transmitted.

All-to-All Routing

All-to-all routing is the working horse of general-purpose parallel computation. Its cost depends on the bisection bandwidth of the interconnection network. If the provided bandwidth is insufficient, then whole classes of problems can not be solved efficiently. Because the routing pattern is fixed and regular, it is not hard to design optimal algorithms for most topologies.

Two-Dimensional Meshes

On an n x n mesh each PU is the source and destination of n^2 packets, one for each of the PUs. Applying the greedy algorithm works fine now. Each PU holds n packets with destinations in each of the n columns. These packets with common destination might be bundled. In each row n / 2 * n^2 / 2 = n^3 / 4 packets must be routed over the bisection. At the end of phase 1, each PU holds again n^2 packets, n of them with destination in each of the PUs of its column. These packets with common destination might again be bundled. The algorithm takes n^3 / 2 routing steps. This is twice the time given by the bisection bound, the lower bound which is obtained by dividing the number of packets that must cross the bisection in one direction (in this case n^4 / 4) by the bisection width (in this case n).

The given algorithm is uni-axial, that is, in any step either only horizontal or only vertical connections are used. If the mesh has full-port capacity, then this is wasteful. However, it is often possible to overlay two uni-axial algorithms. In this particular case the solution is very simple: all packets going from PU_{i_x, i_y} to PU_{j_x, j_y} with i_x + i_y + j_x + j_y even are routed as described, while the other half of the packets are routed first along the columns and then along the rows. Said otherwise: if the PUs are colored black and white as on a chess board, all packets routed between fields of the same color are routed xy, while the others are routed yx. Each of the problems involves half as many packets as before and therefore the whole routing takes only n^3 / 4 now, which is optimal.

This coloring idea is very useful, because it allows to concentrate on designing uni-axial algorithms. This strongly facilitates the design. Of course this idea is not limited to two-dimensional meshes, but can be applied on any network in which separate dimensions can be distinguished, most notably meshes and tori of any dimension and hypercubes.

Higher-Dimensional Meshes and Hypercubes

On a d-dimensional n x ... x n mesh, the algorithm consists of d phases. In phase i, 0 <= i < d, the routing is performed along axis i. In each routing half of the packets must cross the bisection. So, each phase takes n^{d + 1} / 4 time. In total the algorithm takes d * n^{d + 1} / 4 time. This algorithm is uni-axial. Applying a coloring with d colors it can be made optimal: a packet routed from PU_{i_0, ..., i_{d - 1}} to PU_{j_0, ..., j_{d - 1}} is given color (i_0 + ... + i_{d - 1} + j_0 + ... + j_{d - 1}) mod d. A packet with color c is routed along the (c + i) mod d axis in routing phase i, 0 <= i < d. So, the original problem is decomposed in d independent problems each involving a fraction 1 / d of the packets. So, the total routing time of this full-port algorithm is n^{d + 1} / 4.

In the context of all-to-all routing, a k-dimensional hypercube with n = 2^k PUs can be perceived as a k-dimensional mesh with side length equal to two. The uni-axial algorithm runs in k * 2^{k + 1} / 4 = k * n / 2. Overlaying k of these algorithms each taking care of a fraction 1 / k of the packets, the routing time can be reduced to n / 2.

DMMs

For all-to-all routing on a DMM with P PUs there is one standard algorithm which is the best if the packets are sufficiently large. The algorithm consists of P - 1 steps. In step t, 1 <= t < P, PU_i routes its packet to PU_{(i + t) mod P}. Possibly there is even a step 0 in which the packets with destination in the same PU are handled. If all packets have size l, then the routing takes
T_one(P, l) = c_v * l * (P - 1) + c_a * (P - 1).
This is P - 1 times the time for routing a permutation, which is not surprising because the all-to-all routing is realized by decomposing it into permutations. Of course this works only if all the packets have (approximately) the same length. Otherwise a more sophisticated algorithm should be applied which is presented further down.

Even if all packets have the same size, then it is not necessarily true that sending the packets directly to their destinations is the best one can do. This may sound surprising but is due to the relatively large number of start-ups of the given algorithm: if l is small, then the term c_a * (P - 1) will dominate and it may be profitable to reduce it even if the first term increases because of this.

As an alternative consider the simulation of the algorithm for a two-dimensional mesh. All PUs are indexed as in a mesh with a pair of indices (i, j), 0 <= i, j < sqrt(P). The algorithm consists of two phases. In phase 1, all packets are rearranged within the "rows", in the second phase within the "columns". More precisely, in step t, 1 <= t < sqrt(P) of phase 1 PU_{i, j} routes to PU_{i, (j + t) mod sqrt(P)} all packets which have destination in any PU_{i', (j + t) mod sqrt(P)}, with 0 <= i' < sqrt(P). In step t, 1 <= t < sqrt(P) of phase 2 PU_{i, j} routes to PU_{(i + t) mod sqrt(P), j} all packets which have destination in any PU_{(i + t) mod sqrt(P), j'}, with 0 <= j' < sqrt(P). So, the algorithm consists of 2 * (sqrt(P) - 1) steps in each of which l * sqrt(P) data are routed. This gives

T_two(P, l) = c_v * 2 * l * (P - sqrt(P)) + c_a * 2 * (sqrt(P) - 1).
For large l this algorithm is almost twice as slow, but it may be up to sqrt(P) / 2 times faster for very small l.

Of course we do not have to stop here. If the packets are really small, then the algorithm for a three-dimensional mesh can be simulated, or in the extreme case, the whole routing can be performed with just log(P) routing steps by simulating the hypercube algorithm. Doing this, in each step half of the data a PU holds must be routed. This gives

T_log(P, l) = c_v * log(P) * l * P / 2 + c_a * log(P).

The algorithms with several phases also require that in between two phases the received packets are decomposed into their components and recomposed for the next routing. With appropriate bucketing all this can be performed in a time that is linear in the amount of involved data. Nevertheless, if c_i is not so much smaller than c_v, as it is the case on the best parallel computers, this gives a non-negligible delay. More generally, reducing the number of start-ups by multiplying the routing volume by a factor should be limited to subproblems involving only a small fraction of the total data set. Otherwise, considering that c_v > c_i, no reasonable speed-up can be expected, and it would be better to reduce the number of processors used in order to get packets of a more substantial size.

Broadcasting

Broadcasting on a full-port network is trivial: in any step the information is sent to all neighbors. It will reach all PUs in the minimum time, which is bounded by the diameter of the network. On single-port machines the problem is non-trivial.

Linear Arrays, Meshes and Hypercubes

If on a linear array with n PUs an information from PU_0 has to be broadcast, then the trivial algorithm, sending the algorithm in step i, 0 <= i < n - 1, from PU_i to PU_{i + 1}, is optimal, taking n - 1 steps. If the source is PU_j, j < n / 2, then the packet is first routed rightwards, then leftwards.

On a two-dimensional n x n mesh, the information is first routed within the row of the source PU and then within all columns. If the source is not PU_{0, 0}, then the routing in each of the phases should start in the direction in which the distance to cover is largest. The algorithm requires 2 * n - 2 steps at most, which is optimal.

On higher-dimensional meshes and hypercubes we can apply the dimension-ordered generalization of the above algorithm. On a hypercube this means that the routing is performed along an embedded binomial tree. On a k-dimensional hypercube the algorithm takes k steps, which is optimal.

DMMs

On a DMM with P PUs we assume that PU_0 is the source. There are several options. The most natural is to simulate the hypercube algorithm: the routing conists of log(P) rounds, in each round l data have to be transferred. This gives
T_hypercube(P, l) = c_v * l * log(P) + c_a * log P.

Surprisingly this is mostly not the best. If l is slightly larger, then v * l * log(P) is a large number. In that case it is better to use an algorithm which is similar to that for the linear array. The information is divided in a large number f of flits. Flit i, 0 <= i < f, is routed in step i from PU_0 to PU_1, and then on in a linear way. It takes f + P - 2 steps before flit f reaches PU_{P - 1}. This gives

T_linear(P, l) = c_v * l / f * (f + P - 2) + c_a * (f + P - 2).
For sufficiently large f, the first term converges to c_v * l, which is optimal. Of course this produces many more start-ups.

For intermediate values of l, there are intermediate algorithms. A simple idea is to use an algorithm similar to the one for a two-dimensional mesh: first broadcasting in a pipelined way within the rows, then within the columns.

Gossiping

Broadcasting is relatively cheap because there is only one packet. Gossiping is much more expensive. Therefore, in practical applications it should be performed only for very small sets of data.

Interconnection Networks

In the context of gossiping it is customary to use other cost models. The most considered model is to assume that in any round pairs of PUs can exchange data and that in a single round they can exchange all data they want to exchange. This is called the unit-cost telephone model. So, in any round the communicating PUs are given by a matching. Therefore, a gossiping algorithm is a specification of a sequence of matchings that eventually lead to all PUs knowing everything. Such a sequence is called a gossiping schedule.

Lemma: Gossiping in the unit-cost telephone model on a network with P PUs and diameter D requires at least max{D, round_up(log(P)) + odd(P)}. Here odd(x) = 1 for odd x and 0 otherwise.

Proof: To cover a distance d, at least d rounds are required. After t >= 0 rounds any information is known in at most 2^t PUs, because in any round the number of informed PUs can at most double. If the number of PUs is odd, then there is at least one PU which remains unmatched in the first round. So, the information from this PU is known in at most 2^{t - 1} PUs after t > 0 rounds. This means that for t <= round_up(log(P)) + odd(P) - 1, the number of informed PUs is at most 2^{round_up(log(P)) - 1} < P. End.

Gossiping on d-dimensional meshes is easy as long as all side lengths are even. In that case the gossiping can be performed in a dimension-ordered way. On an n x ... x n mesh for n even, this takes d * (n - 1) rounds, which is optimal. For odd n, it is always possible to perform the gossiping in d * n rounds, but this is not necessarily optimal. For example, on a 3 x 3 mesh gossiping can be performed in 5 rounds. This is optimal, because for a 3 x 3 mesh, the number of PUs P = 9, and thus round_up(log(P)) + odd(P) = 5.

Gossiping on 3 x 3 Mesh

For small networks gossiping schedules can be described very concisely and convincingly by a series of pictures. The nodes are subdivided according to the shape of the network, marking in each subdivision whether the information from the PU at the corresponding position is already known or not. The used matchings can be highlighted.

Gossiping on 3-Dimensional Butterfly

The unit-cost model is not the only possible cost model. More flexible is the linear-cost model in which it is assumed that a data exchange between two adjacent PUs in which each sends at most l informations takes c_v * l + c_a time. In principle it is not even necessary that all PUs remain connected equally long. However, the more freedom there is, the harder it becomes to design good algorithms. Therefore it is still useful to perform the gossiping by a sequence of matchings. The cost of a gossiping schedule is given by the number of rounds, the number of consecutive matchings used, and the number of steps, the sum over all rounds of the number of transferred packets.

We consider again the above schedule for gossiping on a 3 x 3 mesh. It consists of 5 rounds which was shown to be optimal. The number of steps is 1 + 2 + 4 + 4 + 7 = 18. This is far from optimal. It has been shown (masters thesis of Rene Beier) that 11 steps is the minimum achievable for any schedule with 5 rounds, and that in general at least 10 steps are required. Both bounds can be matched. We see here, and that is a general phenomenon, that there is a trade-off between rounds and steps. So, in general the best gossiping algorithm depends on the ratio c_a / c_v. Only for some very regular networks, such as linear arrays and hypercubes, the number of rounds and steps can be minimized at the same time. In that case there is a unique optimal gossiping schedule.

Gossiping on 3 x 3 Mesh, 5 rounds 11 steps

DMMs

On a DMM with P PUs, the most natural thing is again to apply a simulated hypercube algorithm, performing the gossiping with log P communication steps. If originally each PU holds l data, in step t, 0 <= t < log P, the size of the packets is l * 2^t. Summing over all steps gives
T_hypercube(P, l) = c_v * l * (P - 1) + c_a * log P.
This is optimal: the number of start-ups is minimized, and the routing volume equals the amount of data each PU must receive. For large P the start-up costs will become negligible.

Another communication operation is gather. In a gather each PU initially has l individual data. The task is to gather all this information in PU_0. This is not the inverse of a broadcast, because in a broadcast finally all PUs know the same information. Clearly this operation requires at least log P start-ups and because l * (P - 1) data must be transferred into PU_0. So, on a DMM gathering has the same lower bound as gossiping and can be solved optimally by applying the hypercube gossiping algorithm.

Using Randomization

Using randomization is a mighty instrument which in many contexts allows to simplify algorithms or even to achieve results which otherwise cannot be achieved. This also applies to the context of routing. In the following we will present quite a surprising application of randomization and show that on meshes the randomization can be substituted by a deterministic equivalent.

Hypercubes

We have seen that routing a permutation on a hypercube with n PUs can take Omega(sqrt(n)) time. This problem is not only of theoretical interest because the permutations in which the indices are permuted are concerned. These permutations are very common. There are several ways to overcome this problem. The simplest is the following idea of Valiant.

The packets are denoted p_i. s_i is the index of the source PU of p_i, d_i is its destination PU.

  1. For each p_i choose randomly and independently an intermediate destination x_i.
  2. Route all packets p_i from PU_{s_i} to PU_{x_i}.
  3. Route all packets p_i from PU_{x_i} to PU_{d_i}.

Notice that the mapping s_i -> x_i is not a permutation. Alternatively PU_0 might select the index of a permutation and broadcast it to all PUs in k steps, but this is a waste of time. The analysis of Valiants algorithm is not hard but beyond the scope of this chapter. Both routing phases have the same time distribution, because phase 2 can be viewed as a reversal of phase 1: in phase 2 the packets are moving from a random position to a fixed position. It can be shown that each of these phases takes k + o(k) time with high probability and that the queues remain small as well. So, the whole algorithm essentially 2 * k + o(k) = O(k) steps with high probability. Using randomization all permutations become equally hard, there are no longer good and bad permutations. The identity has the same expected time as a transposition.

A disadvantage of the above approach is that we loose a factor two for all those permutations for which even the greedy algorithm performed well: the expected time of the greedy algorithm is k + o(k), which is optimal up to the lower order term. A nice idea helps to make the randomized algorithm optimal while still making it highly unlikely that much more than the expected time is required. The algorithm is the same as before, with one small change. After selecting x_i, the length of the path s_i -> x_i -> d_i is computed. If this path is longer than k, we use comp(x_i) as intermediate destination instead of x_i. Here comp() is the operation which in a binary number complements all bits. The maximum distance any packet has to travel is now bounded by k: if dist(s_i, x_i) + dist(x_i, d_i) > k, then dist(s_i, comp(x_i)) + dist(comp(x_i), d_i) = k - dist(s_i, x_i) + k - dist(x_i, d_i) = 2 * k - dist(s_i, x_i) + dist(x_i, d_i) < k. Vöcking has shown that not withstanding the reduced randomness of the intermediate destination, the delay due to congestion is still limited to o(k) with high probability.

DMMs

On DMMs the standard routing algorithms above were all given under the assumption that the packets were of comparable size. It is no problem if there are small variations but if the sizes differ by a factor two or more, then we may expect that the routing takes considerable longer than estimated. Because the routing steps are not synchronized it is not easy to tell how long the routing will take exactly.

Consider an all-to-all routing. Assume that in total each PU is sending and receiving the same number of packets, but that not each PU is sending the same number of packets to each other PU. Then we can first send each packet to a random intermediate destination like in the hypercube algorithm. This replaces a possibly unbalanced pattern by two balanced all-to-all routing patterns. A refinement is to first determine the degree of unbalance and to randomize only as much as necessary.

Meshes

Consider a k-k routing on meshes. A good strategy is to choose for each packet a random intermediate destination and to route the packets to their intermediate destinations before routing them to their actual destinations. The whole algorithm consists of four routing phase: in phase 1 and 3, the packets are routed along the rows, in phase 2 and 4 the packets are routed along the columns.

For the analysis it is necessary to understand how much time the routing on the one-dimensional subarrays takes. Without a proof we give the central lemma in this domain:

Lemma: For a routing pattern on a linear array with n PUs, let r_{i, j}, 0 <= i < j < n, denote the number of packets that must travel from a PU with index i or smaller to a PU with index j or larger. When applying the farthest-first strategy, the rightwards moving packets need exactly max_{i < j} {r_{i, j} + j - i - 1} steps to all reach their destinations.

For most distributions the maximum is either assumed for i = 0, j = n - 1, or for i = n / 2 - 1, j = n / 2. The maximum time for a regular pattern like a k-k routing is always given by the maximum of these bounds. This simplifies the analysis a lot.

In the above case we are routing random k-k patterns. The maximum distance is n - 1, and in a randomization about half of the packets want to cross the bisection. Thus, each of the phases takes max{n, k * n / 4} + o(k * n). This is not yet good, but we can exploit that the given routing is uni-axial. As for the all-to-all algorithm above, two of these algorithms can be overlaid. To make this effective, the packets must somehow be split in two. In a randomized context it is natural to use randomization for this as well. So, each packet is colored independently white or black with probability 1 / 2. The white packets are routed xy, the black packets yx. In this way we do not exactly obtain two (k / 2)-(k / 2) patterns, but that does not matter. The important point is that now in each routing phase, the expected number of packets crossing the bisection of a one-dimensional subarray is k * n / 8. Using Chernoff bounds, it may be shown that all deviations from the expected value are small with high probability.

Theorem: On a two-dimensional n x n mesh, k-k routing can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.

The above result is optimal for all k >= 8, because there the k * n / 2 bisection lower bound is matched up to lower-order terms. For k = 1, 2 k-k routing can be performed in 2 * n + o(n) time. This is optimal because it matches the distance bound, the lower bound which is obtained by considering how far the packets have to travel. For 2 < k < 8 it is not known how many steps are exactly needed for k-k routing. For example, 4-4 routing can be performed in 3 * n + o(n) steps, but it is not clear whether this is optimal or not.

Derandomization

Randomization is a handy means of obtaining good results, but there are practical and theoretical reasons to prefer deterministic algorithms if they can achieve the same bounds. In this case the only purpose of using randomization is to obtain a regular pattern. However, this can also be achieved by handing out the packets in a regular way. We first consider how to turn this idea into an algorithm for all-to-all routing on a DMM.

Assume we are performing an all-to-all routing on a DMM with P PUs, each PU sending and receiving m numbers in total, l = m / P packets traveling between any pair of PUs on average. Each PU sorts the numbers according to the index of their destination PUs. Using bucket sort this takes O(m) time. A number residing in PU_i with rang r is sent to PU_{(i + r) mod P} as intermediate destination in a first routing phase and in a second routing phase to its destination. This is a very simple algorithm. Creating the packets to be sent to each PU may be even faster than in the randomized algorithm because the calls to the random number generator are saved.

How good is the resulting distribution? Clearly in the first routing phase the packets are almost perfectly balanced, the number of packets going from one PU to another is either round_down(m / P) or round_up(m / P). In phase 2 the imbalance is larger. Denote by a_{i, j} the number of packets originating in PU_i with destination in PU_j and by b_{i, k, j} the number of packets which are send from PU_i via PU_k to PU_j. For arbitrary k and j we want to put an upper bound on sum_i b_{i, k, j}. The regular distribution gives b_{i, k, j} <= round_up(a_{i, j} / P). This gives

sum_i b_{i, k, j} <= sum_i round_up(a_{i, j} / P) <= P + sum_i a_{i, j} / P = P + m / P.
So, if m = omega(P^2), the imbalance is negligible. For the small values of P that currently are common, this is no problem, but in general this is quite a strong condition.

On n x n meshes any operation in submeshes of size o(n) takes only o(k * n) time and is therefore cheap in comparison to the Theta(k * n) time for solving most problems. Here k denotes the number of packets per PU. So, on meshes we can apply the same idea, but the allocation of intermediate destinations may be coordinated in submeshes of quite considerable size. Thereby the rounding errors can be made of minor importance even for small k. As a result, the same result that was achieved above can also be obtained deterministically.

The idea of coordinating the rounding decision also works on DMMs. Due to rounding errors, each of the rounds of phase 2 might be delayed by c_v * P, in total c_v * P^2. If first all PUs are subdivided in groups of size sqrt(P), these PUs can exchange their values b_{i, j}. These are small gossip operations, costing c_v * P^{3 / 2} + c_a * log(P) / 2 time. Then they can optimize their rounding, reducing the loss in phase 2 to c_v * P^{3 / 2}. So, now we must have m = omega(P^{3 / 2}). This is much better, but still not as good as for the randomized algorithm, for which the rounding errors are negligible for any m = omega(P * log P). Of course, in practice it is quite unlikely that the deterministic algorithm has maximal bad luck. Even though we might not assume that all a_{i, j} are equal, it is quite reasonable to assume that the rounding errors are more or less randomly distributed.

Offline Routing

In algorithmics there is a fundamental distinction between online and offline problems. In the context of routing an algorithm is said to be online if the destinations of the packets only become known at the beginning of the routing. The cost of precomputation, if any, has to be booked on the routing problem to be solved. In an offline routing algorithm, it is assumed that we are dealing with a fixed routing pattern which has to be performed many times. In this case the cost of precomputation is not considered, and the only concern is the quality of the produced routing schedule. Particularly, the precomputation is not necessarily performed on the parallel computer on which the routing must be performed.

Two-Dimensional Meshes

For two-dimensional n x n meshes there is a ingenious algorithm achieving something none of the online algorithms can achieve: the queue size is only one. That is, at all times, any PU is storing in addition to the packets moving by only a single packet. The algorithm consists of three one-dimensional routing phases, each being a permutation of the n packets standing in the rows or columns. It is a simplification of the more general algorithm for product networks in the next section which is due to Annexstein and Baumschlag.

First we construct a bipartite graph with 2 * n nodes. The sets V_1 and V_2 each have n nodes, corresponding to the source and destination rows of the packets, respectively. There is an edge from node i in V_1 to node j in V_2 if there is a packet moving from row i to row j. Because there are exactly n packets starting and finishing in each row, the graph is n-regular (all nodes have degree n). Thus, according to Hall's theorem, the graph has an edge coloring with n colors. Because this is an online algorithm, the existence of such a coloring is the only thing that matters, but actually such colorings can be computed very efficiently. That is, there is an assignment of values, which are called colors, ranging from 0 to n - 1, so that the edges incident on a node all have different colors.

These colors are used in the routing algorithm. Denote the packets by p_i and let x_i be the color attributed to the edge corresponding to this packet.

  1. Route each packet p_i along the rows to column x_i.
  2. Route each packet p_i along the column to its destination row.
  3. Route each packet p_i along the row to its destination PU.

It remains to check that the algorithm is as good as claimed above. For the analysis of phase 1 we consider the routing in an arbitrary row. Originally each PU holds one packet. Because the edges corresponding to these packets are all incident on the same node of the bipartite graph, they all have different colors, and therefore no two packets will be routed to the same PU. For the routing in phase 2, we consider the routing in an arbitrary column. Because of the previous result, initially each PU holds one packet. If two packets in the column would move to the same row, then two packets with destination in the same row would have gotten the same color, which is in contradiction with the coloring which colors all edges incident on the node in V_2 corresponding to this row differently. Therefore, at the beginning of phase 3, all PUs hold one packet. In each row each PU is the destination of one packet because the complete pattern is a pattern, and in phase 3 the packets are routed to their destinations. So, the algorithm consists of three permutations within one-dimensional subarrays. Therefore, each phase can be executed in n - 1 steps, 3 * n - 3 steps in total, and never a PU is storing more than one packet.

The queue size is optimal, the time is not. It is still open whether it is possible to route all permutations in 2 * n - 2 steps with queue size 1. A few steps more and queue size 2 is possible.

Offline Permutation Routing with Queue Size 1

Product Networks

The above idea can be generalized for any product network A = B x C, where "x" denotes the direct product of graphs: the nodes of A are pairs (b, c), where b is a node in B and c a node in C. There is an edge from (b_1, c_1) to (b_2, c_2) iff b_1 = b_2 and (c_1, c_2) is an edge in C, or c_1 = c_2 and (b_1, b_2) is an edge in B. A two-dimensional n_1 x n_2 mesh is the direct product of a linear array with n_1 and a linear array with n_2 nodes. Tori and hypercubes are also direct products.

Let A = B x C be a product network. B has n_B nodes and C has n_C nodes. A schedule for routing a permutation on A can be computed offline by first constructing a bipartite graph with node sets V_1 and V_2, each with n_C nodes. Node c, 0 <= c < n_C, in V_1 and V_2 corresponds to the subnetwork B x c in A. The bipartite graph has n_A = n_B * n_C edges. For a packet traveling in A from a node (b_1, c_1) to a node (b_2, c_2) there is an edge from node c_1 in V_1 to node c_2 in V_2. Thus, the bipartite graph is n_B-regular and can be colored with n_B colors. Packet i is denoted p_i, it is traveling from (b_i, c_i) to (b'_i, c'_i), and its color is x_i, 0 <= x_i < n_B. The routing is performed as follows:

  1. Route p_i from PU_{b_i, c_i} to PU_{x_i, c_i}.
  2. Route p_i from PU_{x_i, c_i} to PU_{x_i, c'_i}.
  3. Route p_i from PU_{x_i, c'_i} to PU_{b'_i, c'_i}.

Theorem: Permutation routing on a product network A = B * C can be performed in T_A = 2 * T_B + T_C steps, where T_B and T_C denote the number of steps for routing permutations on B and C, respectively. At any time a PU is storing at most one packet.

Proof: Phase 1 is a permutation routing in subnetworks with the structure of B, because for any subnetwork B x c, all destinations occur exactly once due to the coloring of the edges incident on the node in V_1 corresponding to this subnetwork. The coloring of the edges incident on the nodes of V_2 assures that phase 2 is a permutation routing in a network with the structure of C. Phase 3 is again a permutation routing within subnetworks with the structure of B because at the beginning each PU holds one packet, and because the whole pattern is a permutation, each PU is also the destination of one packet. End.

If T_B != T_C, then it is profitable to arrange things so that T_B < T_C. The above construction immediately gives results for d-dimensional meshes. The best is to perceive a d-dimensional n x ... x n mesh as the direct product n x (n x ... x n). Using induction this leads to

Corollary: Permutation routing on a d-dimensional n x ... x n mesh can be performed in (2 * d - 1) * n steps. At any time a PU is storing at most one packet.

DMMs

Above we have studied all-to-all routing on DMMs with P PUs. Let m = P * l be the total number of packets each PU has to send and receive. The greedy algorithm can perform very poorly and is acceptable only if we reasonably can assume that the distribution is balanced. The random and deterministic algorithms which send the packets through an intermediate destination overcome the problem of unbalances to a large extend, but are not perfect either. First solving a coloring problem for a bipartite graph the problem can be overcome entirely: the routing consists of two phases in which each PU sends exactly l packets to each of the other PUs.

The algorithm is simple: a bipartite graph with two sets of P nodes is created. For a packet moving from PU_i to PU_j there is an edge from node i in V_1 to node j in V_2. This gives an m regular graph which can be colored with m colors. Each PU first routes all packets with color x to PU_{x mod P}. In the second routing phase the packets are routed to their destinations. Because of the coloring property for the edges incident on the nodes of V_1, in the first round each PU sends exactly l packets to each PU. Because of the coloring property for the edges incident on the nodes of V_2 each PU receives one packet from each color. These stand perfectly distributed after the first routing.

Towards Online Routing

Offline routing is fine for fixed patterns, but patterns which become known only at runtime are more common. Just as it is often possible to turn randomized algorithms into deterministic algorithms with essentially the same performance, so is it often also possible to turn offline algorithms into online algorithms. Both conversions are of great importance, because they allow for a separation of concerns. Knowing that such conversions exist, one may when designing an algorithm assume some smooth distribution as it is achieved by a randomization, or one may first consider what might be achieved offline.

Clearly for a problem like permutation routing, which is not particularly hard, it appears outrageous to first solve a bipartite graph-coloring problem of a graph with one edge per packet. This problem is far more complicated then the routing itself. However, we can solve a strongly reduced problem in o(n) time, while still obtaining almost as good performance as with the offline algorithm. This idea of reducing the information content of a problem is called sparsification and is useful beyond the scope of turning offline algorithms into online algorithms.

The approach is sketched for a two-dimensional n x n mesh. The mesh is divided in m x m submeshes. m = o(n) is quite large, one should think of m = n^{1 - eps}, for some suitable eps > 0. The packets going to each of these n^{2 * eps} submeshes are counted in each submesh. The bipartite graph does not have one node for each row, but one node for each of the n^eps row bundles. A further idea, which helps to reduce the size of the graph is to not create an edge for each edge but only for each superpacket. A superpacket consists of a large number s = n^alpha packets with common destination. Because we cannot assume that s exactly divides the number of packets traveling between a pair of row bundles, in general there will be one partially filled superpacket for each destination pair. So, in each submesh the graph to color has 2 * n^eps nodes and n^{2 - 2 * eps - alpha} + n^{2 * eps}. Choosing the parameters right, we can achieve the following:

After solving the localized and sparsified coloring problems, the routing cannot be performed so simply as it was done before. Each routing phases must be preceded by a rearrangement in the submeshes, so that the packets stand well-distributed. For example, preceding a horizontal routing phase, the packets must be arranged so that in each row of a submesh the number of packets going to any other submesh is about equal. When reaching a submesh, the packets are not routed to a specific PU, but rather to the first PU in the submesh which is still storing fewer than the maximum allowed number of packets. Together these ideas assure small queues and good routing times. Because the submeshes have size o(n), the cost of each operation is a lower-order term of the total routing time, and because there are only constantly many routing phases, their cost is negligible altogether.

Reconfigurable Arrays

Model

So far we have only considered parallel computers which are connected by a fixed interconnection network. This is the most natural model and such parallel computers are almost the only ones that exist. But, consider again the complete interconnection network realized by a crossbar switch. This can also be viewed as a set of P PUs which are dynamically interconnected. Said otherwise, the technology for making dynamic interconnections is available and apparently the price for this technology is reasonable and the speed at which the connections can be made is acceptable.

The entire system consists of a number of PUs with conventional powerful arithmetic and storage and a number of switches. The switches are slightly more powerful than in a crossbar: they listen to the signals passing on the buses they are connected to and take out whatever is destined for them. A small number of values can be stored and the switches can be set conditionally as a result of a comparison between stored values. The whole system is assumed to be synchronized and the difference between receiving a signal or not receiving a signal can be detected.

Typically the switches are arranged in a rectangular grid, but in principle they might also be arranged differently. The PUs are arranged on the outside. The switches normally have four ports, denoted W, N, E and S. Sometimes it is assumed that these ports can be connected dynamically in any desirable configuration, sometimes the configurations are limited to a subset thereof. In the presented examples we will not need anything fancy. A switch setting is denoted by indicating which ports are connected. For example, WN-SE indicates the setting in which W is connected with N and S with E.

It is important to notice that the switch setting decomposes the grid into a number of subnets. Within these nets there is broadcasting possibility: as long as it is assured that only one of the connected PUs is sending, all connected PUs and switches will receive the information transmitted in the net. A system as described is called a reconfigurable processor array

One may argue that reconfigurable networks are physically not realistical. There are two major assumptions which give reason for skepticism:

The first point is a true limitation: speed of light is finite and the size of the reconfigurable network certainly increases with the number of switches. This is a limitation, but exactly the same assumption is made in all hypercube, butterfly and PRAM algorithms. Two-dimensional meshes (and to a lesser extend three-dimensional meshes) are the only interconnection networks that can be realized in physical space without scaling problems. In a two-dimensional lay-out of a hypercube with n PUs there are wires of length sqrt(n).

It is harder to estimate the importance of the second point. Radio exists, and the signal does not become noticeably weaker if the number of listeners increases. Even wire-bound broadcasting exists in the form of cable television. So, technologically, this is no problem either. Possibly the signal must be somewhat stronger than for a point-to-point transmission. Also one can imagine that especially branchings weaken a signal. In most algorithms for reconfigurable arrays, these can easily be avoided by performing the broadcasting in two steps.

Power of Reconfiguration

Parity is the problem of determining the parity of the sum of n bits. On a CRCW PRAM this problem requires Omega(log n / loglog n) time. Even taking exponentially many PUs does not help. Parity is important because it is a simplification of many other more useful problems. Surprisingly, parity can be solved on a reconfigurable array in constant time. The amount of used hardware is O(n). Hence, the algorithm is not only fast, but even efficient.

The algorithm requires n + 3 PUs and 6 * n switches arranged in an 3 x 2 * n grid. In a first round PU_i informs the six switches in the columns 2 * i and 2 * i + 1 about its value. After receiving this value, the switches set their connections in preparation for the next round. If a 0 is received, a switch is set to WE-NS. If a 1 is received the switch must be set WN-SE if it is found in column 2 * i. If the switch is in column 2 * i + 1, then it is set to WS-NE, with exception of the switch in row 1, which is set to WE-NS. Then in the second round, a signal is entered in the net from the lower-left side. If it comes out at the right side in the same row, then the parity is even, if it comes out one row higher, the parity is odd. The PU on the right side receiving the signal can either output the conclusion, or broadcast it to all other PUs.

Computing Parity of 1 + 1 + 0 + 1 + 0 + 1

Constant Time Sorting

Parity is not the only problem that can be solved in constant time on reconfigurable arrays. Actually, there are many interesting problems which can be solved in constant time. However, mostly these algorithms are far from efficient, requiring considerably more switches than the sequential time. As an example we show here how to sort n numbers using O(n^2) PUs and O(n^3) switches in O(1) time. In the chapter on sorting we will see how the number of switches can be reduced to O(n^2). This is optimal within this context, because routing a permutation of n elements in constant time requires an n x n array.

The input is a set of n numbers a_i, number i originally residing in PU_i. By taking also the index into account, we may assume that all values are different. The algorithm consists of two somewhat independent parts. In the first part, for each a_i its rank r_i within the set of numbers is computed. In the second part, a_i is routed to PU_{r_i}. Each of these operations can be performed in a constant number of steps. For each i there is a dedicated n x n grid of switches. The switches in column j of this network are set depending on a comparison of a_i with a_j similar to the construction in the parity algorithm: if a_j < a_i, a signal which is inserted at the lower-left corner is guided one row upwards. The signal finally leaves the network on the right side in row r_i of the subnetwork. Even when using only the three different settings which were used before, the whole algorithm can be implemented with just 6 steps.

Sorting in Constant Time

Exercises

  1. Show with a picture how to embed a linear array in a two-dimensional mesh. Generalize this embedding for d-dimensional meshes. Using the embedding as the basis for an indexing scheme, specify exactly the index of PU_{i_1, ..., i_d} in an n_1 x ... x n_d mesh.

  2. Show with a picture how to embed a circular array in a two-dimensional n_1 x n_2 mesh with n_1 even. Generalize this embedding for d-dimensional meshes. Considering the side lengths n_1, ..., n_d, which condition must they satisfy in order that this embedding is possible?

  3. The main importance of embeddings of linear arrays stems from the fact that they lead to indexings schemes in which PUs with consecutive indices are adjacent. More generally, we would like that any pair of PUs whose indices do not differ much are close to each other in the network. Construct an indexing scheme for a mesh so that the distance in the network between PU_i and PU_j is bounded by O(sqrt(|i - j|)).

  4. Assume each PU of a circular array with n PUs holds k packets that must be routed so that each PU receives at most k packets in total. This routing pattern is called k-k routing. Specify lower bounds on the routing time in terms of n and k. Distinguish several cases. Consider the routing in a greedy way: each packet is routed to its destination along the shortest path, applying farthest-first priority strategy. How long may the routing take? For which values of k is this algorithm optimal? Present another algorithm which is optimal for all k >= 4.

  5. Draw two- and three-dimensional CCC networks and determine their diameters.

  6. Give Cayley graphs corresponding to circular arrays, two-dimensional tori and CCC networks.

  7. The k-dimensional pancake graph is a Cayley graph defined over the permutation group S_k which has as elements the k! permutations of the elements 0, 1, ..., k - 1. The product in this group is given by the composition. The set of generators S = {swap_i| 2 <= i <= n}, where swap_i is the permutation ((i - 1) (i - 2) ... 0 i (i + 1) ... (n - 1)). So, swap_i reverses the order of the first i elements.

  8. The greedy routing algorithm on two-dimensional meshes has maximum queue size if the two routing phases are separated. If the phases are coallesced, then this queue size becomes smaller. Show that coallescing does not really help by providing a permutation for which the greedy algorithm with coallesced phases still has queue size Omega(n). The worst achievable queue size is 2 * n / 3.

  9. Give an algorithm for broadcasting on a k-dimensional butterfly network running in 2 * k + O(1) communication steps. Specify the exact number of steps used by your algorithm.

  10. Consider gossiping on a circular array with n PUs in the unit-cost telephone model. Give algorithms and specify the number of steps. Distinguish the cases that n is even and n is odd.

  11. We consider gossiping in the unit-cost telephone model on n_1 x n_2 meshes for which at least one of the sides has odd length.

  12. Show that gossiping in the linear-cost model on a linear array with P PUs requires at least 2 * P - 3 steps.

  13. Show how to gossip on a 3 x 3 mesh in the linear-cost model using only 10 steps.

  14. Give an algorithm for gossiping on a k-dimensional CCC network running in 5 / 2 * k + O(1) communication steps of the unit-cost telephone model. Specify the exact number of steps used by your algorithm.

  15. Give an algorithm for gossiping on a k-dimensional butterfly network running in 5 / 2 * k + O(1) communication steps of the uni-cost telephone model. Specify the exact number of steps used by your algorithm.

  16. In the text it was demonstrated that in the DMM model gathering has the same complexity as gossiping. Give an example of an interconnection network on which gathering is cheaper than gossiping.

  17. A torus can be embedded in a mesh and a circular array in a linear array. These embeddings imply that any torus or circular-array algorithm requiring t steps can be run on a mesh or linear array, respectively, in 2 * t steps. An interesting question is whether the reverse holds: can any problem be solved twice as fast on a torus / circular array as it can be solved on a mesh / linear array? This exercise provides a simple and convincing counter example: online 2-2 routing can be solved trivially performed in n steps on a linear array, while 2-2 routing on a circular array may require at least 2 / 3 * n steps, even when all source-destination pairs are known beforehand and the routing may be planned offline.

    In the section on "Using Randomization" a lemma was given specifying exactly the time for routing on a linear array with n PUs when using the farthest-first strategy.

  18. Consider random permutations. That is, for a processor network with P PUs a permutation pi is picked uniformly out off the P! permutations of the P indices. The packet from PU_i has to be routed to PU_{pi(i)}.

  19. Consider again the k-k routing algorithm for two-dimensional n x n meshes. It consisted of four phases. The first two sending the packets to random intermediate destinations, the last two for sending the packets to their destinations. Because each phase may take at least n steps, such a four-phase algorithm inherently requires 4 * n steps, and therefore it cannot be good for k < 8. The four phases where xy followed by xy. Of course it is just as good to follow xy by yx. In that case there are two consecutive phases of routing along the columns. These can be replaced by a single phase. This gives a three-phase algorithm which is better for k < 8. Specify how many steps this algorithm takes for all k. Take care, in this case not all phases are equally expensive.

  20. The idea of Annexstein and Baumschlag for offline routing works for any product network.

  21. Consider permutation routing on k-dimensional CCCs. A CCC is not a product network, but it is useful to view it as a hypercube with nodes replaced by cycles of length k. Above we have considered how to route on hypercubes, routing on cycles is trivial. The PUs of the CCC are indexed (i, d), with 0 <= i < 2^k and 0 <= d < k, the first index gives the position within the cube, the second index gives the position within the cycle.

  22. Consider routing permutations on a reconfigurable array. There are n PUs connected to an n x n array. PU_i holds a number a_i which must be communicated to PU_{r_i}. Show how to solve this problem in three steps. The switches can only be put in the three states that were used before: WE-NS, WN-SE and WS-NE.


This page was created by Jop Sibeyn.
Last update Monday, 06 September 04 - 09:19.
For any comments: send an email.