Basic Problems

Starting with this chapter and throughout the rest of the text, we follow a problem-oriented approach. That is, each chapter deals with a single problem or a small set of related problems, which are studied for each of the considered cost models: WT framework and PRAM, interconnection networks and DMM. The goal is to provide insight how the cost model influences the choice of the algorithm. In all cases a minor modification of the PRAM algorithm works reasonable even for a DMM but more often than not, this does not lead to the best possible DMM algorithm. We will see that algorithms designed for interconnection networks provide just as good a source of inspiration.

Sum and Prefix Sums

We consider the problem of computing the sum of n values. More precisely for an array a[], we want to compute S = sum_{i = 0}^{n - 1} a[i]. A slightly harder problem is to compute the prefix sums, that is, to compute all values s[j] = sum_{i = 0}^j a[i] for all j, 0 <= j < n. Sequentially both problems can be computed in O(n) with a single for loop.

Linear and Circular Array

On a linear array with n PUs we assume that initially PU_i holds a[i]. In step i, 1 <= i <= n - 1, PU_{i - 1} routes its number to PU_i. Upon receiving a number, PU_i adds the received number to its value a[i]. The result is s[i]. At the end of step n - 1, PU_{n - 1} holds s[n - 1], which is the sum of all numbers.

On a circular array each number is routed from the PU where it starts in each direction for n / 2 steps. If in step t, 1 <= t <= n / 2, PU_i receives a number from PU_{(i - 1) mod n}, then it adds its value to the value it has if i >= t. If PU_i receives in step t a number from PU_{(i + 1) mod n}, then its value is added if i + t >= n.

The algorithm on the circular array runs twice as fast and minimizes the number of communication steps. However, this algorithm is not work-optimal: all PUs are busy all steps, so the total work is Theta(n^2).

WT Framework and PRAM

In the WT framework, the sum can be computed in O(log n) steps. The idea is the following modified sequential algorithm:
  int sum(int[] a, int n)
  {
    int d = n / 2;
    while (d > 0)
    {
      for (i = 0; i < d; i++)
        a[i] += a[i + d];
      d /= 2;
    }
    return a[0];
  }
For n = 2^k for some k >= 0 this computes the sum in k = log n passes of the while loop. The operations of the for loop can be executed in parallel. The work is given by the sum of the used d-values which is n / 2 + n / 4 + ... + 1 = n - 1. Because the scheduling on an EREW PRAM is no problem, this gives a PRAM algorithm running in O(log n) time with n / log n PUs, an optimal algorithm.

For the prefix sums problem there is an elegant recursive algorithm:

  void prefix(int low, int n)
  {
    if (n > 1) 
    {
      prefix(low, n / 2); 
      prefix(low + n / 2, n - n / 2); 
      for (int i = n / 2; i < n; i++) 
        a[low + i] += a[low + n / 2 - 1]; 
    }
  }
In a sequential context it makes no sense to formulate the prefix sums computation this way, but this formulation immediately leads to a parallel algorithm running in logarithmic time: T(n) = T(n / 2) + O(1). The work can be estimated as W(n) = 2 * W(n / 2) + O(n), which gives W(n) = O(n * log n). So, this algorithm is elegant but not optimal.

With a twist the work can be reduced to O(n). First consider an algorithm for solving the problem on a processor network with a tree topology consisting of 2 * n - 1 PUs arranged in log n + 1 levels with n / 2^t PUs at level t, 0 <= t <= log n. The algorithm consists of two main phases, the first going upwards, the second downwards. Initially a[i] is stored in PU_i at level 0. In step 0 of the upwards phase, each PU at level 0 routes its number to the PU at level 1 above it. In step t, 1 <= t < log n, of the upwards phase, a PU at level t routes to the PU at level t + 1 above it the sum of the values it received from the two PUs below it. In step t, 0 <= t < log n, of the downwards phase, a PU at level log n - t routes to the two PUs at level log n - t - 1: to the left it routes the value it received from the PU at level log n - t + 1 (which is supposed to be zero for t = 0); to the right it routes the sum of the value it received from above and the value it received from the left during the upwards phase. At level 0 PU_i adds the received value to a[i]. The whole algorithm consists of 2 * log n + 1 steps.

In the WT framework, it suffices to notice that there are 2 * log n + 1 time steps and that the total work is O(n), because each of the 2 * n - 1 PUs of the tree network is performing only O(1) work. Thinking about what is actually going on, we get the following sharper formulation:

  void prefix()
  {
    for (int step = 1; step <= n / 2; step *= 2)
      for (int i = step - 1; i < n; i += 2 * step) 
        a[i + step] += a[i];
    for (int step = n / 4; step >= 1; step /= 2)
      for (int i = 3 * step - 1; i < n; i += 2 * step) 
        a[i] += a[i - step];
  }
The number of additions is n / 2 + n / 4 + ... + 1 in the first loop and 1 + 3 + ... + n / 2 - 1 in the second. Because the scheduling on an EREW PRAM is no problem, this gives a PRAM algorithm running in O(log n) time with n / log n PUs, an optimal algorithm.

Click here to see this procedure integrated in a working program. The program also contains the sequential and the recursive algorithm. Running the program for large n shows that not withstanding the optimality, the parallel algorithm is considerably slower than the sequential one. One of the reasons is the much larger number of cache faults it generates. This can be overcome by a better organization. However, a second reason is that the algorithm is inherently more complex and performs twice as many additions.

DMM

Solving a sum or prefix-sum problem for an array a[] with n elements on a DMM with P << n PUs is easy and efficient. The elements of a[] are allocated to the PUs so that PU_i holds the number a_j, with i * (n / P) <= j <(1 + 1) * (n / P). First each PU computes the sum of the values its holds. Then a sum or prefix-sum problem is solved for these P values. This takes log P or 2 * log P - 3 rounds, respectively. When solving the prefix problem, it remains to add an offset to the internal values. Thus, for the sum problem
T_par(P, n) = c_i * O(n / P) + (c_v + c_a) * log P.
Similarly, for the prefix-sum problem
T_par(P, n) = c_i * O(n / P) + (c_v + c_a) * (2 * log P - 3).
Because the routing volume is so small, this algorithm works even on networks with large c_v. The condition for getting good performance is c_a * log P << c_i * n / P. This is satisfied for P << c_i / c_a * n / log n, which is an exceptionally large number. Said otherwise, even relatively small problem can be solved efficiently.

Matrix Product

WT Framework and PRAM

An easy problem with a high sequential complexity is matrix multiplication. For computing C = A * B one conventionally performs
  void matrix_product(int[][] A, int[][] B, int[][] C, int n)
  {
    for (i = 0; i < n, i++)
      for (k = 0; k < n, k++)
        C[i, k] = 0;
    for (i = 0; i < n, i++)
      for (k = 0; k < n, k++)
        for (j = 0; j < n, j++)
          C[i][k] += A[i][j] * B[j][k];
  }

It is handy to rewrite the algorithm as follows:

  int sum(int[] c, int n)
  {
    int s = 0;
    for (j = 0; j < n, j++)
      s += c[j];
  }

  void matrix_product(int[][] A, int[][] B, int[][] C, int n)
  {
    int[][][] c = new int[n][n][n];
    for (i = 0; i < n, i++)
      for (k = 0; k < n, k++)
        for (j = 0; j < n, j++)
          c[i][k][j] = A[i][j] * B[j][k];
    for (i = 0; i < n, i++)
      for (k = 0; k n, k++)
        C[i][k] = sum(c[i][k], n);
  }

This immediately gives the following WT algorithm:

  void matrix_product(int[][] A, int[][] B, int[][] C, int n)
  {
    int[][][] c = new int[n][n][n];
    for all (i, k, j, 0 <= i, k, j < n) pardo
      c[i][k][j] = A[i][j] * B[j][k];
    for all (i, k, 0 <= i, k < n) pardo
      C[i][k] = sum(c[i][k], n);
  }
The algorithm consists of two parallel instructions, of which the second requires further O(log n) instructions. The work is n^3 + n^2 * O(n) = O(n^3). We have given here the transformation in several steps and explicitly indicate which steps are to be performed in parallel. In the following such straight-forward parallelizations will not be worked out.

It is obvious how to schedule the algorithm on a PRAM with n^3 PUs. Taking P = W(n) / T(n) = n^3 / log n as suggested above, reduces the cost without substantially increasing the time. In that case C(P, n) = n^3 / log n * O(log n) = O(n^3). One may debate whether this should be called optimal or not: there are sequential matrix-multiplication algorithms running in o(n^3). On the other hand, are these not so terribly much better in practice, and the conventional version presented above is still used.

Two-Dimensional Mesh

Surprisingly, matrix multiplication can be performed very efficiently on tori and meshes. Matrix multiplication and the all-pairs shortest paths problem treated further down are among the very few non-trivial problems that can be solved optimally on meshes: on an n x n mesh, two n x n matrices can be multiplied in O(n) communication steps. Even under the assumptions of systolic computation, we get this time bound. Systolic computation means that the data are assumed to enter the mesh from the top and left side of the mesh (or something similar) and then stream at one step at a time through the network. The PUs can take out of the passing streams whatever is of their liking. The PUs can also change the values of the streaming data. The hardware which motivated the design of systolic algorithms is no longer popular, but systolic algorithms are particularly simple and therefore it continues to be worth considering them.

In our case the PU at position (i, k) computes C[i][k]. So, the data from A[][] and B[][] must be arranged so, that A[i][j] and B[j][k] reach PU_{i, k} at the same time. This is achieved when the arrival is delayed so, that A[i][j] reaches PU_{i, 0} in step i + j and then runs rightwards without delay, while B[j][k] reaches PU_{0, k} in step j + k and then runs downwards without delay. Thus, A[n - 1][n - 1] and B[n - 1][n - 1] enter the mesh in step 2 * n - 2 and reach PU_{n - 1, n - 1} in step 3 * n - 3. By then all other numbers have been processed as well. So, the whole computation takes 3 * n - 3 communication steps.

Systolic Matrix Product

On a torus the systolic algorithm can be implemented easily: the data from A[][] which in the systolic algorithm reach column 0 in step t, 0 <= t < n, are positioned in column (n - t) mod n. The data from B[][] which reach row 0 in step t are positioned in row (n - t) mod n. More precisely, A[i][j] starts in PU_{i, (n - i - j) mod n} and B[j][k] starts in PU_{(n - j - k) mod n, k}. The A[][] data move rightwards using the wrap-around connections, the B[][] data move downwards. In all PUs there is something to do every step. So, after n - 1 communication steps, the matrix product has been computed. This is optimal because there are n^3 multiplications to make with n^2 PUs.

How to implement this algorithm on a mesh? On a practical system with worm-hole routing, it is never a problem when one packet must be routed a long distance. So, cyclical shifts can always be performed at almost full speed. In the store-and-forward model, however, it is harder. Just taking the above algorithm with the indicated mapping requires (n - 1) * (n - 1) routing steps. But, using the standard embedding of a torus on a mesh, matrix multiplication can be performed on the mesh in 2 * n - 2 steps. It is not clear that this is the optimal result for computing matrix products on meshes, but there are no obvious improvements as long as we insist that each PU holds only one value from A[][] and B[][] at a time.

Three-Dimensional Mesh

There is a matrix-multiplication algorithm for three-dimensional meshes which is directly inspired by the PRAM algorithm. It uses n^3 PUs and runs in O(n) time steps. This performance is not good. The main reason to present the algorithm is that it makes it easier to understand the hypercube algorithm in the next section.

The n^3 PUs of the three-dimensional n x n x n mesh are indexed with three subscripts as in the obvious way: PU_{i, j, k} denotes the PU in row i, column j and plane k. We assume that initially A[i][j] is stored in PU_{i, j, 0} and B[j][k] in PU_{0, j, k}. Then these data are routed so that each PU_{i, j, k} receives A[i][j] and B[j][k]. This routing is trivial: starting from PU_{i, j, 0} the number A[i][j] is routed along the third axis, starting from PU_{0, j, k} the number B[j][k] is routed along the column. Hereafter PU_{i, j, k} computes c[i][j][k] = A[i][j] * B[j][k]. All numbers that must be added together, that is all numbers c[i][j][k] with the same j, stand in the same row. In another n - 1 steps these can be added up to C[i][k] and routed to the PUs PU_{i, 0, k}. So, the whole algorithm takes 2 * n - 2 steps on an all-port machine and 3 * n - 3 on a single-port machine.

The last operation, computing C[i][k], can be performed in two ways:

Both algorithms finish after n - 1 steps. However, there is a great difference between them. In the first algorithm, PU_{i, 0, k} is busy all the time, and in total Theta(n^3) packets are sent. In the second algorithm, each PU is active only one step, and the total numbers of packets is Theta(n^2).

If a single matrix product must be computed there is no difference. If, however, there are L pairs (A_l, B_l), 0 <= l < L, of matrixes to multiply, then using the first algorithm we still need at least n - 1 steps per product because the PU_{i, 0, k} are busy n - 1 steps adding the numbers they are receiving. The second algorithm can be used for a pipelined approach which requires only O(n + L) steps to compute all product matrices C_l, 0 <= l < L. Pipelining means that an excess of computing power is made useful by scheduling computations tightly after each other. This is another kind of parallelism, successfully used in any modern processor.

In our case, the numbers of A_l and B_l start being routed into the network at step l and come out of the network 2 * n - 2 steps later (assuming that the PUs can store O(n) numbers and all-port capacity). So, for computing L matrix products we need 2 * n + L - 3 steps. For L = omega(n), this is 1 + o(1) steps per product with n^3 PUs. On a two-dimensional mesh we needed 2 * n - 2 steps with n^2 PUs. The conclusion is that for computing many products the simple algorithm for three-dimensional meshes is asymptotically twice as efficient.

Hypercube

Now we consider how the above algorithm leads to a hypercube algorithm running in O(log n) time (one of the questions deals with the problem of making the algorithm optimal). For convenience we assume that n = 2^{3 * l} for some integral l. In that case, the hypercube can be perceived as a three-dimensional mesh with handy shortcuts, which allow to perform the routing within one-dimensional submeshes faster.

We are going to use the algorithm for a three-dimensional m x m x m mesh on the hypercube, for m = 2^l. PU_{i, j, k} is identified with the hypercube PU in which the numbers i, j and k are written out as binary k-bit numbers. The algorithm for the three-dimensional mesh consists of three routing operations. In the first information is sent from the PU_{i, j, 0} to all PU_{i, j, k}. Thus, information has to be spread from a single PU to all PUs in a one-dimensional submeshes. This submesh corresponds to an l-dimensional subhypercube. Using the connections of each dimension one after the other, it is trivial to perform this spread of information in l routing steps. In the second operation information is sent from the PU_{0, j, k} to all PU_{i, j, k}. This goes analogously, on a full-port machine it can even be performed in parallel with the first operation. The third operation consists of gathering and summing the information standing in all PU_{i, j, k} into the PU_{i, 0, k}. Again this can be done by performing l routing steps within l-dimensional subhypercubes. In step t, 0 <= t < l, the routing is performed along axis l + t, each PU which has a 1 at position l + t of its index sends a packet to the PU whose index is smaller by 2^{l + t}. The transferred value is the sum of the own value and all received values.

So, in the single-port model the whole algorithm can be implemented in 3 * l = log n steps. The connections along each axis are used exactly once. This last property implies that it is again trivial to pipeline matrix multiplications on a full-port machine: L products can be computed in log n + L - 1 steps.

DMM

The above algorithms inspire two DMM algorithms. The first can be considered to be a direct adaptation of the PRAM algorithm, the second is a coarse-grain version of the torus algorithm. Both perform the same number of arithmetic operations, so on a first glance they may appear to be equally good. However, the second has smaller communication overhead and is therefore superior in practice. The algorithms are compared here to illustrate that adapting PRAM algorithms often does not lead to the best DMM algorithms.

Bundle-Wise Approach

For the first algorithm we assume that P <= n and that P divides n. The matrices A and B which have to be multiplied are divided in row and column bundles, respectively. A row bundle consists of n / P rows, a column bundle of n / P columns. The bundles are numbered from 0 to P - 1. The algorithm consists of P rounds. At all times PU_i holds row bundle i of A[][]. During round t, 0 <= t < P, PU_i holds column bundle (i + t) mod P of B[][]. During round t, PU_i computes the n^2 / P^2 values of C[][] at the intersection of row bundle i and column bundle (i + t) mod P. Finally PU_i has computed all the values in row bundle i of C. Between the computation rounds, the column bundles of B[][] must be routed from one PU to the next. This is a very simple and regular routing pattern. If the network has a Hamiltonian cycle, then with an appropriate indexing only adjacent PUs are communicating. The time consumption of this algorithm is given by
T_par(P, n) = c_i * n^3 / P + c_v * n^2 + c_a * P.

Of the two communication terms, the contribution c_v * n^2 will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * n^2 << c_i * n^3 / P, which gives the condition

P << c_i / c_v * n.

Bundle-Wise Matrix Product on DMM

Block-Wise Approach

There are better ways of organizing matrix multiplication on DMMs which allow to use more PUs. The three matrices are divided into P blocks of size n / sqrt(P) x n / sqrt(P). These blocks are indexed as the positions of a sqrt(P) x sqrt(P) matrix, starting with block (0, 0) in the upper-right corner. The PUs are indexed analogously. In the following description we use n' = sqrt(P). The data in block (i, j) of A[][] start in in PU_{i, (n' - i - j) mod n'} and the data in block (j, k) of B[][] start in PU_{(n' - j - k) mod n', k}. The algorithm consists of n' steps. In step t, 0 <= t < n', each PU_{i, k} multiplies the two blocks it has and forwards the block of A[][] to PU_{i, (k + 1) mod n'} and the block of B[][] to PU_{(i + 1) mod n', k}. Summing up the computed results, PU_{i, k} computes block (i, k) of C[][]. This is the same algorithm that was given above for the torus. The only difference is that the entries which are multiplied and shifted are not integers but submatrices. The time consumption of this algorithm is given by
T_par(P, n) = c_i * n^3 / P + c_v * 2 * n^2 / sqrt(P) + c_a * 2 * sqrt(P).

Of the two communication terms, the contribution c_v * 2 * n^2 / sqrt(P) will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * 2 * n^2 / sqrt(P) << c_i * n^3 / P, which gives the condition

P << (c_i / c_v * n)^2.

Block-Wise Matrix Product on DMM

The second algorithm is much better, but still we cannot use arbitrarily many PUs. The above conditions show that the largest number of PUs that can be used in a sensible way is a very small constant fraction of n^2. Given that there is O(n^3) work to do, this means that, using this algorithm, T_par(infinity, n) = Omega(n). There is a clear discrepancy here between the bounds for the PRAM and the DMM. If one believes that the DMM model is the more realistic of the two, this implies the limited importance of the quest for polylog PRAM algorithms.

All-Pairs Shortest Paths

Sequential Algorithms

We consider the all-pairs shortest path problem for graphs with n nodes. That is, for a graph G = (V, E) with n nodes and m edges we want to compute a complete n x n distance table. Sequentially, for sparse graphs the best idea is to apply n times the best applicable algorithm for single-source shortest paths: if the graph is unweighted this is breadth-first search, if the graph is positively weighted this is Dijkstra's algorithm. These methods require O(n * (n + m)) and O(n * (n * log n + m)) time respectively. For dense graphs or if the graph also has negative-weight edges, it is better to apply a cubic algorithm solving the whole problem at once. The best choice in this case is the Floyd-Warshall algorithm which in addition to having relatively low asymptotic complexity also has the advantage of being extremely simple. This algorithm works without assumptions on the weights of the edges. Of course, if there are negative-cost cycles, the computed values do not make sense. Such problem instances can be recognized by checking the values on the diagonal: if any of them is negative, this flags a negative-cost cycle.

The Floyd-Warshall algorithm proceeds as follows. Here the matrix D[][] contains all direct distances. That is, D[i][k] gives the length of the edge from i to j. If there is no such edge, then D[i][k] = infinity. Finally D[i][k] contains the length of the shortest path from i to k.

  void floyd_warshall(int[][] D, int n)
  {
    for (j = 0; j < n, j++)
      for (i = 0; i < n, i++)
        for (k = 0; k < n, k++)
          if (D[i, j] + D[j, k] < D[i, k])
            D[i, k] = D[i, j] + D[j, k];
  }

Lemma: The Floyd-Warshall algorithm is correct.

Proof: The correctness is proven by inductively showing that after performing the second loop for all values in S_j = {l| 0 <= l <= j}, D[i, k] gives the length of the shortest path from node i to node k running through nodes in S_j. Initially this is true, and finally this implies the correctness of the algorithm. So, assume the claim holds for some j - 1. Then, the shortest path from node i to node k running through nodes in S_j can either run though node j or not. If it does not run through node j, then the shortest path is given by the shortest path through S_{j - 1}, which by the induction assumption has length D[i, k]. In the other case, the shortest path through S_j runs along a shortest path in S_{j - 1} from i to j and then along a shortest path in S_{j - 1} from j to k. The length of these two subpaths are according to the induction assumption given by D[i, j] and D[j, k] and because we assume that this path gives an improvement, we have D[i, j] + D[j, k] < D[i, k] and the assignment D[i, k] = D[i, j] + D[j, k] will be performed. End.

WT Framework and PRAM

Notice that the Floyd-Warshall algorithm is almost identical to the basic algorithm for matrix computation, the main difference being that in the matrix-multiplication algorithm the loop with running index j is ordered last. In the matrix multiplication there are n^2 independent loops, each of which could be parallelized. The correctness proof of the Floyd-Warshall requires this specific execution order, and thereby introduces a kind of sequentialization. Thus, for the Floyd-Warshall algorithm, this give a time of O(n) with O(n^3) work. On a PRAM, using n^2 PUs it is trivial to solve the problem in O(n) time. This is cost-optimal, but not very fast.

An alternative is to apply brute-force: the distances can also be computed by applying an algorithm originally designed for computing the transitive closure of a graph. The transitive closure of a graph G is the graph G^* having an edge from node i to node j iff node j can be reached from node i over a path in G. Of course this problem can be solved efficiently determining the connected components, but for a graph with n nodes one can also compute D^{n - 1}, where D is the adjacency matrix of G. Of course the matrix product here is not defined over the underlying ring of integers, but over the ring of booleans where "+" stands for "or" and "*" stands for "and". Modifying so that the entries of D[][] are integers giving the distances and not 0-1 values, and replacing "+" by "minimum" and "*" by "+", D^{n - 1} does not only give the existence of a path, but even the length of the shortest path. This can be proven by showing that if D^i and D^j give the lengths of the shortest paths using up to i and j edges, respectively, that then D^i * D^j = D^{i + j} gives the length of the shortest paths using up to i + j edges. Because we assume that all edge weights are non-negative, all shortest paths consist of at most n - 1 edges. Therefore, it is sufficient to compute D^{n - 1}.

Clearly, computing D^{n - 1} by starting with D^1 = D and then repeatedly multiplying with D leads to a correct but very inefficient way of doing. A much better idea is to compute D^n by suing D^{2^t} = D^{2^{t - 1}} * D^{2^{t - 1}}. In this way D^{n - 1} can be computed with k = round_up(log_2 (n - 1)) matrix products: 2^k >= n - 1 and because of the above argument, D^m = D^{n - 1}, for any m >= n - 1. This reduces the APSP problem to computing log n matrix products. The time for this is O(log^2 n) with O(n^3 * log n) work. On a PRAM, using n^3 / log n PUs, the APSP problem can be solved in O(log^2 n) time. This algorithm shows that APSP is in NC, but it is not cost-optimal.

Mesh Algorithm

We consider how to execute the Floyd-Warshall algorithm on a mesh so that it has running time O(n). This is not as easy as for the matrix multiplication. Denote by D^(j), -1 <= j < n, the distance matrix obtained after performing the outer loop in the Floyd-Warshall algorithm for all values up to j. D^(-1) = D, D^(n - 1) is the final matrix to compute. We assume that D[i, k] is originally stored in the PU at position (i, k). This PU is responsible for computing D^(j)[i, k] for all j. D^(-1)[i, k] = D[i, k] it knows, for the computation of the other values it needs data coming from the other PUs: for computing D^(j)[i, k] it must receive D^(j - 1)[i, j] and D^(j - 1)[j, k]. The secret to obtaining good performance, is to accept that not all values of D^(j) are computed in the same step.

The communication is performed according to the following rules:

Clearly the given algorithm is correct because of the condition that data are forwarded only as soon as D^(j)[i, k] has been computed and because eventually all D^(j)[i, j] become available to all PUs in row i and all D^(j)[j, k] to all PUs in column k. So, every PU can indeed compute all the values it must compute. The more interesting question is how long this takes.

Lemma: PU_{i, k} computes D^(j)[i, k] in step 3 * j + |i - j| + |k - j|.

Proof The proof goes by induction. The main induction goes over j. Secondary inductions go over i and k. The basis of everything is that PU_{0, 0} computes D^(0)[0, 0] in step 0. So, assume the lemma is true for i = 0 and some k, 0 <= k <= n - 2, then PU_{0, k} sends D^(-1)[0, 0] on to PU_{0, k + 1} which computes D^(0)[0, k + 1] in step k + 1. The same follows for D^(0)[k + 1, 0]. Now assume the lemma holds for j = 0 and all i and k with i < l or k < l, then it can first be proven for D^(0)[l, l] and then for all D^(0)[i, k] with i = l or k = l. Assuming the lemma has been proven for some value j and all i and k, it follows that D^(j - 1)[j, j] is computed in step 3 * j - 1. The values D^(j - 1)[j - 1, j] and D^(j - 1)[j, j - 1] have been computed in step 3 * j - 2 and reach PU_{j, j} at the end of step 3 * j - 1. So, PU_{j, j} computes D^(j)[j, j] in step 3 * j. This is the basis of the proof for the next j. End.

Theorem: The given implementation of the Floyd-Warshall algorithm on an n x n mesh solves the APSP problem in 5 * n - 4 steps.

Click here to see the above ideas integrated in a running program. In this simulation each PU has three arrays. For PU_{i, k} the array own[] indicates which values D^(j)[i, k] have been computed. The array hor[] indicates which horizontally moving information has been received, processed and forwarded. The array ver[] is the analogue for the vertically moving information.

Floyd-Warshall on Mesh

DMM

The Floyd-Warshall algorithm can be implemented on a DMM in various ways, reflecting the various algorithms presented above. The simplest idea is to use an algorithm inspired by the PRAM algorithm. The second, more elaborate, idea is to use a refined implementation of the mesh algorithm.

Bundle-Wise Approach

It is assumed that P <= n and that P divides n. The matrix D[][] giving the distances is divided in row bundles of height n / P. PU_p, 0 <= p < P is handling all rows i with p * n / P <= i < (p + 1) * n / P. It is assumed that the data are distributed according to this subdivision right from the start.

The algorithm consists of n rounds. In round j, 0 <= j < n, the PU which is responsible for row j, this is PU_{round_down(j / (n / P))}, sends the information in row j of D[][] to all other PUs. This information is used to compute all values D^(j)[i][k]. The only question is how to route the information in a row from one PU to all other PUs. This is a broadcasting problem. Applying the algorithm using log P routing operations, the time consumption is given by

T_par(P, n) = c_i * n^3 / P + c_v * n^2 * log P + c_a * n * log P.

Of the two communication terms, the contribution c_v * n^2 * log P will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * n^2 * log P << c_i * n^3 / P, which gives the condition

P * log P << c_i / c_v * n.

Block-Wise Approach

In order to implement the mesh algorithm onto a DMM, we use a block-wise subdivision of the matrix D[][] as was done above for computing ther matrix product. So, each block has size n / sqrt(P) x n / sqrt(P). The algorithm is more or less the same. However, in each step, all computations are made that can be made with the available information. This means that as soon as the necessary information D^(j - 1)[j][k] and D^(j - 1)[i][j] comes in, all n^2 / P values D^(j)[i][k] are computed. So, information progresses one block at a time. Thus, the number of steps is not 5 * n - 4, but n + 4 * (sqrt(P) - 1). In each step n / P numbers may have to be transferred in each direction. For the time consumption this gives
T_par(P, n) = (1 + 4 * (sqrt(P) - 1) / n) * (c_i * n^3 / P + c_v * 4 * n^2 / sqrt(P) + c_a * 4 * n).

The algorithm is almost optimal. We may assume that n >> sqrt(P) and therefore the factor 1 + 4 * (sqrt(P) - 1) / n will lie very close to 1. The routing volume is considerably smaller than with the bundle-wise approach. Of the two communication terms, the contribution c_v * 4 * n^2 / sqrt(P) will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * 4 * n^2 / sqrt(P) << c_i * n^3 / P, which gives the condition

P << (c_i / c_v * n)^2.

If we are working on a parallel computer with exceptionally high start-up cost c_a, then we should try to reduce the number of start-ups. This can be done by not making n steps in which the value of j is increased by 1, but n / b steps in which j is increased by b. Here b is an arbitrary fraction of n / P. In each round the computation now takes c_i * b * n^2 / P and b * n / sqrt(P) data have to be sent to the neighbors. In total the algorithm now needs n / b + 4 * (sqrt(P) - 1) steps. This gives almost the same expression as before:

T_par(P, n) = (1 + 4 * (sqrt(P) - 1) * b / n) * (c_i * n^3 / P + c_v * 4 * n^2 / sqrt(P) + c_a * 4 * n / b).

The only disadvantage of this bundled-block-wise approach is that the leading factor becomes larger. In the extreme case that we take b = n / sqrt(P), there are only 5 * sqrt(P) - 4 steps, but the internal work is almost five times larger. A close to optimal choice of b can be found by equating the additional computation time with the start-up costs. Concentrating on the main contributions and leaving out the constant factors, this gives c_i * b * n^2 / sqrt(P) = c_a * n / b. This suggests to take b ~= sqrt(c_a / c_i * sqrt(P) / n).

The ideas in this whole section on the all-pairs shortest paths problem are for the largest part standard and known for a long time. New contributions on a better PRAM algorithm are given in a paper by Han, Pan and Reif (Algorithmica, 1997). Sibeyn has designed an external-memory algorithm which essentially requires the same number of internal operations as conventional matrix product, and not twice as many as alternative algorithms, including the presented algorithm above.

Exercises

  1. Consider the above given algorithms for computing sums and prefix sums consisting of O(log n) independent steps performing O(n) additions in total. Specify how to allocate the work to the n / log n PUs of a PRAM so that these algorithms can be executed in O(log n) time.

    How long does the execution of the sum algorithm take if we let PU_i, 0 <= i < n / log n, perform all computations with an assignment to some a[j], for i * log n <= j < (i + 1) * log n?

  2. Consider the given parallel prefix algorithm running in O(log n) time on a PRAM with n / log n PUs. Describe how to efficiently implement this algorithm on a DMM with P PUs. Express the running time in terms of n, P and the constants c_i, c_v and c_a. What efficiency can be expected if n is sufficiently large in relation to P?

  3. Give algorithms for computing prefix sums on meshes and hypercubes. On a sqrt(n) x sqrt(n) mesh the algorithm for an input of size n should run in O(sqrt(n)) time. On a k-dimensional hypercube with n = 2^k , for some k > 0, the algorithm should run in O(k) time.

  4. On a PRAM the matrix-multiplication algorithm could be made optimal by taking n^3 / log n PUs instead of n^3. Can we do the same on a hypercube? So, the question is whether two n x n matrices A[][] and B[][] can be multiplied in O(log n) time on a hypercube with n^3 / log n PUs. Initially each number A[i][j] and B[i][j], for all 0 <= i, j < n, is present in only one PU. This PU can be specified freely, a single PU can initially and later hold O(log n) data. The key problem is how to get the data sufficiently fast to the PUs where they are needed.

  5. Describe how to compute the product C = A * B of two n x n matrices A and B on a circular array with n PUs in n steps. For each step specify which products are to be computed and which information has to be routed. Assuming full-port capacity, A * B can even be computed in n / 2 steps. Describe how.

  6. Work out the details of the inductive proof that the Floyd-Warshall algorithm on an n x n mesh runs in 5 * n - 4 steps.

  7. Describe how to solve the APSP problem for an n x n distance matrix D[][] on a single-port linear array with n PUs. For each step specify which values D^(j)[i][k] are to be computed where and which information has to be routed.

    Adapt this algorithm to an alternative bundle-wise DMM algorithm. Specify its time consumption and compare it with the above given DMM algorithms.

  8. Clearly each step of the full-port torus can be performed in four steps of the single-port torus. However, for many problems it is possible to do better. This has practical importance, because running a torus algorithm on a DMM an all-port step means sending four packets while a single-port step means sending a single packet. In the case of the matrix product, in each step only two of the connections out off each node are used. So, the matrix-product algorithm can be performed in 2 * n steps on a single-port torus. We used this in our analysis of the DMM algorithm.

    More interesting is the Floyd-Warshall algorithm. On a full-port mesh it runs in 5 * n - 4 steps. We now want to run it on a mesh on which in each step all PUs can communicate either only in the two positive directions or only in the two negative directions. In how many steps can the Floyd-Warshall algorithm be run in this model? It may be helpful to test ideas by playing around with the program which can be downloaded here.


This page was created by Jop Sibeyn.
Last update Monday, 06 September 04 - 09:19.
For any comments: send an email.