On a circular array each number is routed from the PU where it starts in each direction for n / 2 steps. If in step t, 1 <= t <= n / 2, PU_i receives a number from PU_{(i - 1) mod n}, then it adds its value to the value it has if i >= t. If PU_i receives in step t a number from PU_{(i + 1) mod n}, then its value is added if i + t >= n.
The algorithm on the circular array runs twice as fast and minimizes the number of communication steps. However, this algorithm is not work-optimal: all PUs are busy all steps, so the total work is Theta(n^2).
int sum(int[] a, int n)
{
int d = n / 2;
while (d > 0)
{
for (i = 0; i < d; i++)
a[i] += a[i + d];
d /= 2;
}
return a[0];
}
For n = 2^k for some k >= 0 this computes the sum in k = log n passes
of the while loop. The operations of the for loop can be executed in
parallel. The work is given by the sum of the used d-values which is n
/ 2 + n / 4 + ... + 1 = n - 1. Because the scheduling on an EREW PRAM
is no problem, this gives a PRAM algorithm running in O(log n) time
with n / log n PUs, an optimal algorithm.
For the prefix sums problem there is an elegant recursive algorithm:
void prefix(int low, int n)
{
if (n > 1)
{
prefix(low, n / 2);
prefix(low + n / 2, n - n / 2);
for (int i = n / 2; i < n; i++)
a[low + i] += a[low + n / 2 - 1];
}
}
In a sequential context it makes no sense to formulate the prefix sums
computation this way, but this formulation immediately leads to a
parallel algorithm running in logarithmic time: T(n) = T(n / 2) + O(1).
The work can be estimated as W(n) = 2 * W(n / 2) + O(n), which gives
W(n) = O(n * log n). So, this algorithm is elegant but not optimal.
With a twist the work can be reduced to O(n). First consider an algorithm for solving the problem on a processor network with a tree topology consisting of 2 * n - 1 PUs arranged in log n + 1 levels with n / 2^t PUs at level t, 0 <= t <= log n. The algorithm consists of two main phases, the first going upwards, the second downwards. Initially a[i] is stored in PU_i at level 0. In step 0 of the upwards phase, each PU at level 0 routes its number to the PU at level 1 above it. In step t, 1 <= t < log n, of the upwards phase, a PU at level t routes to the PU at level t + 1 above it the sum of the values it received from the two PUs below it. In step t, 0 <= t < log n, of the downwards phase, a PU at level log n - t routes to the two PUs at level log n - t - 1: to the left it routes the value it received from the PU at level log n - t + 1 (which is supposed to be zero for t = 0); to the right it routes the sum of the value it received from above and the value it received from the left during the upwards phase. At level 0 PU_i adds the received value to a[i]. The whole algorithm consists of 2 * log n + 1 steps.
In the WT framework, it suffices to notice that there are 2 * log n + 1 time steps and that the total work is O(n), because each of the 2 * n - 1 PUs of the tree network is performing only O(1) work. Thinking about what is actually going on, we get the following sharper formulation:
void prefix()
{
for (int step = 1; step <= n / 2; step *= 2)
for (int i = step - 1; i < n; i += 2 * step)
a[i + step] += a[i];
for (int step = n / 4; step >= 1; step /= 2)
for (int i = 3 * step - 1; i < n; i += 2 * step)
a[i] += a[i - step];
}
The number of additions is n / 2 + n / 4 + ... + 1 in the first loop
and 1 + 3 + ... + n / 2 - 1 in the second. Because the scheduling on
an EREW PRAM is no problem, this gives a PRAM algorithm running in
O(log n) time with n / log n PUs, an optimal algorithm.
Click here to see this procedure integrated in a working program. The program also contains the sequential and the recursive algorithm. Running the program for large n shows that not withstanding the optimality, the parallel algorithm is considerably slower than the sequential one. One of the reasons is the much larger number of cache faults it generates. This can be overcome by a better organization. However, a second reason is that the algorithm is inherently more complex and performs twice as many additions.
T_par(P, n) = c_i * O(n / P) + (c_v + c_a) * log P.Similarly, for the prefix-sum problem
T_par(P, n) = c_i * O(n / P) + (c_v + c_a) * (2 * log P - 3).Because the routing volume is so small, this algorithm works even on networks with large c_v. The condition for getting good performance is c_a * log P << c_i * n / P. This is satisfied for P << c_i / c_a * n / log n, which is an exceptionally large number. Said otherwise, even relatively small problem can be solved efficiently.
void matrix_product(int[][] A, int[][] B, int[][] C, int n)
{
for (i = 0; i < n, i++)
for (k = 0; k < n, k++)
C[i, k] = 0;
for (i = 0; i < n, i++)
for (k = 0; k < n, k++)
for (j = 0; j < n, j++)
C[i][k] += A[i][j] * B[j][k];
}
It is handy to rewrite the algorithm as follows:
int sum(int[] c, int n)
{
int s = 0;
for (j = 0; j < n, j++)
s += c[j];
}
void matrix_product(int[][] A, int[][] B, int[][] C, int n)
{
int[][][] c = new int[n][n][n];
for (i = 0; i < n, i++)
for (k = 0; k < n, k++)
for (j = 0; j < n, j++)
c[i][k][j] = A[i][j] * B[j][k];
for (i = 0; i < n, i++)
for (k = 0; k n, k++)
C[i][k] = sum(c[i][k], n);
}
This immediately gives the following WT algorithm:
void matrix_product(int[][] A, int[][] B, int[][] C, int n)
{
int[][][] c = new int[n][n][n];
for all (i, k, j, 0 <= i, k, j < n) pardo
c[i][k][j] = A[i][j] * B[j][k];
for all (i, k, 0 <= i, k < n) pardo
C[i][k] = sum(c[i][k], n);
}
The algorithm consists of two parallel instructions, of which the
second requires further O(log n) instructions. The work is n^3 +
n^2 * O(n) = O(n^3). We have given here the transformation in several
steps and explicitly indicate which steps are to be performed in
parallel. In the following such straight-forward parallelizations will
not be worked out.
It is obvious how to schedule the algorithm on a PRAM with n^3 PUs. Taking P = W(n) / T(n) = n^3 / log n as suggested above, reduces the cost without substantially increasing the time. In that case C(P, n) = n^3 / log n * O(log n) = O(n^3). One may debate whether this should be called optimal or not: there are sequential matrix-multiplication algorithms running in o(n^3). On the other hand, are these not so terribly much better in practice, and the conventional version presented above is still used.
In our case the PU at position (i, k) computes C[i][k]. So, the data from A[][] and B[][] must be arranged so, that A[i][j] and B[j][k] reach PU_{i, k} at the same time. This is achieved when the arrival is delayed so, that A[i][j] reaches PU_{i, 0} in step i + j and then runs rightwards without delay, while B[j][k] reaches PU_{0, k} in step j + k and then runs downwards without delay. Thus, A[n - 1][n - 1] and B[n - 1][n - 1] enter the mesh in step 2 * n - 2 and reach PU_{n - 1, n - 1} in step 3 * n - 3. By then all other numbers have been processed as well. So, the whole computation takes 3 * n - 3 communication steps.
On a torus the systolic algorithm can be implemented easily: the data from A[][] which in the systolic algorithm reach column 0 in step t, 0 <= t < n, are positioned in column (n - t) mod n. The data from B[][] which reach row 0 in step t are positioned in row (n - t) mod n. More precisely, A[i][j] starts in PU_{i, (n - i - j) mod n} and B[j][k] starts in PU_{(n - j - k) mod n, k}. The A[][] data move rightwards using the wrap-around connections, the B[][] data move downwards. In all PUs there is something to do every step. So, after n - 1 communication steps, the matrix product has been computed. This is optimal because there are n^3 multiplications to make with n^2 PUs.
How to implement this algorithm on a mesh? On a practical system with worm-hole routing, it is never a problem when one packet must be routed a long distance. So, cyclical shifts can always be performed at almost full speed. In the store-and-forward model, however, it is harder. Just taking the above algorithm with the indicated mapping requires (n - 1) * (n - 1) routing steps. But, using the standard embedding of a torus on a mesh, matrix multiplication can be performed on the mesh in 2 * n - 2 steps. It is not clear that this is the optimal result for computing matrix products on meshes, but there are no obvious improvements as long as we insist that each PU holds only one value from A[][] and B[][] at a time.
The n^3 PUs of the three-dimensional n x n x n mesh are indexed with three subscripts as in the obvious way: PU_{i, j, k} denotes the PU in row i, column j and plane k. We assume that initially A[i][j] is stored in PU_{i, j, 0} and B[j][k] in PU_{0, j, k}. Then these data are routed so that each PU_{i, j, k} receives A[i][j] and B[j][k]. This routing is trivial: starting from PU_{i, j, 0} the number A[i][j] is routed along the third axis, starting from PU_{0, j, k} the number B[j][k] is routed along the column. Hereafter PU_{i, j, k} computes c[i][j][k] = A[i][j] * B[j][k]. All numbers that must be added together, that is all numbers c[i][j][k] with the same j, stand in the same row. In another n - 1 steps these can be added up to C[i][k] and routed to the PUs PU_{i, 0, k}. So, the whole algorithm takes 2 * n - 2 steps on an all-port machine and 3 * n - 3 on a single-port machine.
The last operation, computing C[i][k], can be performed in two ways:
If a single matrix product must be computed there is no difference. If, however, there are L pairs (A_l, B_l), 0 <= l < L, of matrixes to multiply, then using the first algorithm we still need at least n - 1 steps per product because the PU_{i, 0, k} are busy n - 1 steps adding the numbers they are receiving. The second algorithm can be used for a pipelined approach which requires only O(n + L) steps to compute all product matrices C_l, 0 <= l < L. Pipelining means that an excess of computing power is made useful by scheduling computations tightly after each other. This is another kind of parallelism, successfully used in any modern processor.
In our case, the numbers of A_l and B_l start being routed into the network at step l and come out of the network 2 * n - 2 steps later (assuming that the PUs can store O(n) numbers and all-port capacity). So, for computing L matrix products we need 2 * n + L - 3 steps. For L = omega(n), this is 1 + o(1) steps per product with n^3 PUs. On a two-dimensional mesh we needed 2 * n - 2 steps with n^2 PUs. The conclusion is that for computing many products the simple algorithm for three-dimensional meshes is asymptotically twice as efficient.
We are going to use the algorithm for a three-dimensional m x m x m mesh on the hypercube, for m = 2^l. PU_{i, j, k} is identified with the hypercube PU in which the numbers i, j and k are written out as binary k-bit numbers. The algorithm for the three-dimensional mesh consists of three routing operations. In the first information is sent from the PU_{i, j, 0} to all PU_{i, j, k}. Thus, information has to be spread from a single PU to all PUs in a one-dimensional submeshes. This submesh corresponds to an l-dimensional subhypercube. Using the connections of each dimension one after the other, it is trivial to perform this spread of information in l routing steps. In the second operation information is sent from the PU_{0, j, k} to all PU_{i, j, k}. This goes analogously, on a full-port machine it can even be performed in parallel with the first operation. The third operation consists of gathering and summing the information standing in all PU_{i, j, k} into the PU_{i, 0, k}. Again this can be done by performing l routing steps within l-dimensional subhypercubes. In step t, 0 <= t < l, the routing is performed along axis l + t, each PU which has a 1 at position l + t of its index sends a packet to the PU whose index is smaller by 2^{l + t}. The transferred value is the sum of the own value and all received values.
So, in the single-port model the whole algorithm can be implemented in 3 * l = log n steps. The connections along each axis are used exactly once. This last property implies that it is again trivial to pipeline matrix multiplications on a full-port machine: L products can be computed in log n + L - 1 steps.
T_par(P, n) = c_i * n^3 / P + c_v * n^2 + c_a * P.
Of the two communication terms, the contribution c_v * n^2 will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * n^2 << c_i * n^3 / P, which gives the condition
P << c_i / c_v * n.
T_par(P, n) = c_i * n^3 / P + c_v * 2 * n^2 / sqrt(P) + c_a * 2 * sqrt(P).
Of the two communication terms, the contribution c_v * 2 * n^2 / sqrt(P) will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * 2 * n^2 / sqrt(P) << c_i * n^3 / P, which gives the condition
P << (c_i / c_v * n)^2.
The second algorithm is much better, but still we cannot use arbitrarily many PUs. The above conditions show that the largest number of PUs that can be used in a sensible way is a very small constant fraction of n^2. Given that there is O(n^3) work to do, this means that, using this algorithm, T_par(infinity, n) = Omega(n). There is a clear discrepancy here between the bounds for the PRAM and the DMM. If one believes that the DMM model is the more realistic of the two, this implies the limited importance of the quest for polylog PRAM algorithms.
The Floyd-Warshall algorithm proceeds as follows. Here the matrix D[][] contains all direct distances. That is, D[i][k] gives the length of the edge from i to j. If there is no such edge, then D[i][k] = infinity. Finally D[i][k] contains the length of the shortest path from i to k.
void floyd_warshall(int[][] D, int n)
{
for (j = 0; j < n, j++)
for (i = 0; i < n, i++)
for (k = 0; k < n, k++)
if (D[i, j] + D[j, k] < D[i, k])
D[i, k] = D[i, j] + D[j, k];
}
Lemma: The Floyd-Warshall algorithm is correct.
Proof: The correctness is proven by inductively showing that after performing the second loop for all values in S_j = {l| 0 <= l <= j}, D[i, k] gives the length of the shortest path from node i to node k running through nodes in S_j. Initially this is true, and finally this implies the correctness of the algorithm. So, assume the claim holds for some j - 1. Then, the shortest path from node i to node k running through nodes in S_j can either run though node j or not. If it does not run through node j, then the shortest path is given by the shortest path through S_{j - 1}, which by the induction assumption has length D[i, k]. In the other case, the shortest path through S_j runs along a shortest path in S_{j - 1} from i to j and then along a shortest path in S_{j - 1} from j to k. The length of these two subpaths are according to the induction assumption given by D[i, j] and D[j, k] and because we assume that this path gives an improvement, we have D[i, j] + D[j, k] < D[i, k] and the assignment D[i, k] = D[i, j] + D[j, k] will be performed. End.
An alternative is to apply brute-force: the distances can also be computed by applying an algorithm originally designed for computing the transitive closure of a graph. The transitive closure of a graph G is the graph G^* having an edge from node i to node j iff node j can be reached from node i over a path in G. Of course this problem can be solved efficiently determining the connected components, but for a graph with n nodes one can also compute D^{n - 1}, where D is the adjacency matrix of G. Of course the matrix product here is not defined over the underlying ring of integers, but over the ring of booleans where "+" stands for "or" and "*" stands for "and". Modifying so that the entries of D[][] are integers giving the distances and not 0-1 values, and replacing "+" by "minimum" and "*" by "+", D^{n - 1} does not only give the existence of a path, but even the length of the shortest path. This can be proven by showing that if D^i and D^j give the lengths of the shortest paths using up to i and j edges, respectively, that then D^i * D^j = D^{i + j} gives the length of the shortest paths using up to i + j edges. Because we assume that all edge weights are non-negative, all shortest paths consist of at most n - 1 edges. Therefore, it is sufficient to compute D^{n - 1}.
Clearly, computing D^{n - 1} by starting with D^1 = D and then repeatedly multiplying with D leads to a correct but very inefficient way of doing. A much better idea is to compute D^n by suing D^{2^t} = D^{2^{t - 1}} * D^{2^{t - 1}}. In this way D^{n - 1} can be computed with k = round_up(log_2 (n - 1)) matrix products: 2^k >= n - 1 and because of the above argument, D^m = D^{n - 1}, for any m >= n - 1. This reduces the APSP problem to computing log n matrix products. The time for this is O(log^2 n) with O(n^3 * log n) work. On a PRAM, using n^3 / log n PUs, the APSP problem can be solved in O(log^2 n) time. This algorithm shows that APSP is in NC, but it is not cost-optimal.
The communication is performed according to the following rules:
Clearly the given algorithm is correct because of the condition that data are forwarded only as soon as D^(j)[i, k] has been computed and because eventually all D^(j)[i, j] become available to all PUs in row i and all D^(j)[j, k] to all PUs in column k. So, every PU can indeed compute all the values it must compute. The more interesting question is how long this takes.
Lemma: PU_{i, k} computes D^(j)[i, k] in step 3 * j + |i - j| + |k - j|.
Proof The proof goes by induction. The main induction goes over j. Secondary inductions go over i and k. The basis of everything is that PU_{0, 0} computes D^(0)[0, 0] in step 0. So, assume the lemma is true for i = 0 and some k, 0 <= k <= n - 2, then PU_{0, k} sends D^(-1)[0, 0] on to PU_{0, k + 1} which computes D^(0)[0, k + 1] in step k + 1. The same follows for D^(0)[k + 1, 0]. Now assume the lemma holds for j = 0 and all i and k with i < l or k < l, then it can first be proven for D^(0)[l, l] and then for all D^(0)[i, k] with i = l or k = l. Assuming the lemma has been proven for some value j and all i and k, it follows that D^(j - 1)[j, j] is computed in step 3 * j - 1. The values D^(j - 1)[j - 1, j] and D^(j - 1)[j, j - 1] have been computed in step 3 * j - 2 and reach PU_{j, j} at the end of step 3 * j - 1. So, PU_{j, j} computes D^(j)[j, j] in step 3 * j. This is the basis of the proof for the next j. End.
Theorem: The given implementation of the Floyd-Warshall algorithm on an n x n mesh solves the APSP problem in 5 * n - 4 steps.
Click here to see the above ideas integrated in a running program. In this simulation each PU has three arrays. For PU_{i, k} the array own[] indicates which values D^(j)[i, k] have been computed. The array hor[] indicates which horizontally moving information has been received, processed and forwarded. The array ver[] is the analogue for the vertically moving information.
The algorithm consists of n rounds. In round j, 0 <= j < n, the PU which is responsible for row j, this is PU_{round_down(j / (n / P))}, sends the information in row j of D[][] to all other PUs. This information is used to compute all values D^(j)[i][k]. The only question is how to route the information in a row from one PU to all other PUs. This is a broadcasting problem. Applying the algorithm using log P routing operations, the time consumption is given by
T_par(P, n) = c_i * n^3 / P + c_v * n^2 * log P + c_a * n * log P.
Of the two communication terms, the contribution c_v * n^2 * log P will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * n^2 * log P << c_i * n^3 / P, which gives the condition
P * log P << c_i / c_v * n.
T_par(P, n) = (1 + 4 * (sqrt(P) - 1) / n) * (c_i * n^3 / P + c_v * 4 * n^2 / sqrt(P) + c_a * 4 * n).
The algorithm is almost optimal. We may assume that n >> sqrt(P) and therefore the factor 1 + 4 * (sqrt(P) - 1) / n will lie very close to 1. The routing volume is considerably smaller than with the bundle-wise approach. Of the two communication terms, the contribution c_v * 4 * n^2 / sqrt(P) will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * 4 * n^2 / sqrt(P) << c_i * n^3 / P, which gives the condition
P << (c_i / c_v * n)^2.
If we are working on a parallel computer with exceptionally high start-up cost c_a, then we should try to reduce the number of start-ups. This can be done by not making n steps in which the value of j is increased by 1, but n / b steps in which j is increased by b. Here b is an arbitrary fraction of n / P. In each round the computation now takes c_i * b * n^2 / P and b * n / sqrt(P) data have to be sent to the neighbors. In total the algorithm now needs n / b + 4 * (sqrt(P) - 1) steps. This gives almost the same expression as before:
T_par(P, n) = (1 + 4 * (sqrt(P) - 1) * b / n) * (c_i * n^3 / P + c_v * 4 * n^2 / sqrt(P) + c_a * 4 * n / b).
The only disadvantage of this bundled-block-wise approach is that the leading factor becomes larger. In the extreme case that we take b = n / sqrt(P), there are only 5 * sqrt(P) - 4 steps, but the internal work is almost five times larger. A close to optimal choice of b can be found by equating the additional computation time with the start-up costs. Concentrating on the main contributions and leaving out the constant factors, this gives c_i * b * n^2 / sqrt(P) = c_a * n / b. This suggests to take b ~= sqrt(c_a / c_i * sqrt(P) / n).
The ideas in this whole section on the all-pairs shortest paths problem are for the largest part standard and known for a long time. New contributions on a better PRAM algorithm are given in a paper by Han, Pan and Reif (Algorithmica, 1997). Sibeyn has designed an external-memory algorithm which essentially requires the same number of internal operations as conventional matrix product, and not twice as many as alternative algorithms, including the presented algorithm above.
How long does the execution of the sum algorithm take if we let PU_i, 0 <= i < n / log n, perform all computations with an assignment to some a[j], for i * log n <= j < (i + 1) * log n?
Adapt this algorithm to an alternative bundle-wise DMM algorithm. Specify its time consumption and compare it with the above given DMM algorithms.
More interesting is the Floyd-Warshall algorithm. On a full-port mesh it runs in 5 * n - 4 steps. We now want to run it on a mesh on which in each step all PUs can communicate either only in the two positive directions or only in the two negative directions. In how many steps can the Floyd-Warshall algorithm be run in this model? It may be helpful to test ideas by playing around with the program which can be downloaded here.