T_par(P, n) = c_i * O(n * log n / P) + c_v * O(n * log P / P) + c_a * O(P).
The strong point of divide-and-conquer algorithms in general is that by their localizing nature the number of start-ups is low. For the above algorithm, the start-up costs are negligible if c_a * P << c_i * n * log n / P, that is when n * log n >> c_a / c_i * P^2. If we consider how large the typical problem is for which we will apply a parallel sorting algorithm, this condition will generally be satisfied.
The weakness of the parallel quicksort algorithm is that the routing volume is almost of the same order as the internal work: only when c_i * log n >> c_v * log P, the communication cost becomes negligible. This condition will rarely be satisfied, implying that the practical speed-up with parallel quicksort, though increasing almost linearly with P, will be small.
The complexity of the planar convex-hull problem is closely related to that of sorting. For a set of integer values A = {a_i| 0 <= i < n}, we can determine in linear time the minimum and maximum. Then in linear time the set A can be turned into a set of points S: a_i is mapped to s_i = (a_i, f(a_i)), where f(a_i) is the unique positive value so that s_i lies on the circle through the minimum and maximum. The convex hull of S consists of all elements. Because they must be presented ordered, solving the convex-hull problem can be used to sort. So, T_sort(n) <= T_convex_hull(n) + O(n).
The planar convex-hull problem can also be solved by using sorting as a subroutine and is actually not much harder. The algorithm consists of log n rounds which each take O(log n) time. The total work is bounded by O(n * log n). The algorithm proceeds as follows:
Later we will consider how to perform parallel sorting. It is rather simple however to perform it in O(log^2 n) time and O(n * log n) work on an EREW PRAM with n / log n PUs. All other steps of the algorithm except for determining the convex hulls of S_upper and S_lower can be performed in O(log n) time and O(n) work or less. Because the set of numbers has been sorted before, constructing S_left and S_right is trivial. The same is true for constructing the whole convex hull in the last step. For the time being we are happy with using the sequential algorithm for finding the upper common tangent. So, the time and work for the recursive part are given by:
T(n) = O(log n) + T(n / 2),The solution of these recurrencies is T(n) = O(log^2 n) and W(n) = O(n * log n). Because of this T, we have not put made an effort to push the time for the non-recursive part below O(log^2 n). The work of this parallel algorithm is optimal, but the time consumption can be reduced to O(log n).
W(n) = O(n) + 2 * W(n / 2).
Merging is the process of turning two sorted arrays a[] and b[] of length n each into one array c[] of length 2 * n containing all values of a[] and b[] in sorted order. The rank of an element x in a set X, denoted rank(x, X), is defined to be the number of elements y in X with y < x. Here it is convenient to assume that all values are different, otherwise the index of the position in which a value is stored may be taken as a secondary criterion. The rank of an element x in A union B is given by rank(x, A) + rank(x, B). rank(a[i], A) = i and rank(b[i], B) = i, for all 0 <= i < n. Thus, it suffices to determine rank(a[i], B) and rank(b[i], A), for all 0 <= i < n, because once all ranks in A union B are known, it takes constant time and linear work to write the numbers of a[] and b[] in c[].
A trivial and fast way to determine all ranks is to perform binary search: for each i, rank(a[i], B) and rank(b[i], A) can be determined in O(log n) time and O(log n) work. Thus, merging can be performed in O(log n) time and O(n * log n) work. This time is satisfactory, but the superlinear work should be reduced.
The obvious improvement is to not determine the rank for all numbers but only a fraction. The largest fraction that can be used so that the work is still bounded by O(n) is n / log n. So, take all numbers a[j * log n] and b[j * log n], for 0 < j < n / log n, and rank them in the other set. This takes O(log n) time and O(n) work. This initial ranking partitions the problem of merging a[] and b[] into small subproblems: there are 2 * n / log n - 2 values for which we know the positions in both a[] and b[]. These subdivide each of the arrays in subsets of size at most log n which have to be merged pairwise. For each of these subproblems a sequential algorithm requires O(log n) time and work linear in the sum of the sizes of the subsets. So, the partitioning allows to solve the whole problem in O(log n) time and O(n) work.
As a further example we consider the problem of inserting k new numbers b_i, 0 <= i < k, into a 2-3 tree with n nodes with keys, a_i, 0 <= i < n, in sorted order. It is assume that k < n, otherwise it is more efficient to build a new 2-3 tree from scratch. Sequentially k numbers can be inserted in O(k * log n) time, and in general this cannot be done much faster, because on the one hand Omega(k * log k) is needed because as a result of the insertions the numbers get sorted, and if log k = o(log n), then it may happen that after the highest log k levels of the tree all paths are independent and must be followed to find the positions where to insert the elements. These remaining paths have length Omega(log n - log k) = Omega(log n) each. So, in our parallel algorithm we are happy with O(k * log n) work, but we would like to reduce the time as far as possible, that is, to O(log n).
First the smallest and largest number, which can easily be determined with linear work in O(log k) time, are inserted in the normal sequential fashion. This is done to eliminate some special cases: after this operation, we know that any element to insert fits between two consecutive leafs of the tree. The insertion takes O(log n) time and work. Then for all k - 2 other b_i the leaf a_j of the tree is determined after which b_i has to be inserted. A chain is the ordered subset B_j of the b_i which fits between a_j and a_{j + 1}. Let k_j = |B_j|. sum_j k_j = k - 2. First consider the special case k_j <= 1, for all j. In this case, the nodes of the tree may get degree larger than 3, but never more than 6. So, processing the tree bottom-up processing all insertions at the same level in parallel, the whole tree can be processed in O(log n) time with O(k * log n) work. Here it is essential that any node gets split only once and that therefore the property that between any two existing nodes at most one new node has to be inserted also holds at the higher levels of the tree. We also use that any node can be processed in constant time.
The above we call the basic algorithm which will be used as a subroutine for the general case. If k_j > 1, we do not insert all elements in one stroke, but rather first insert only the middle element of the chain. At this point the lists should have been sorted. If we are lucky all lists are short, but there might be one chain with k - 2 elements. In that case we need an optimal sorting algorithm running in logarithmic time. Inserting the middle element of a chain splits the chain in two. So, after at most log k of these operations all chains have length 1 or 0 and we are done. Thus, the general problem can be solved easily in O(log k * log n) time.
The final improvement comes by observing that there is no neeed to wait with the next application of the basic algorithm until the previous one has been completed. These operations can also be overlayed in such a way that the next insertion starts at the lowest level as soon as the previous insertion has finished there. All these waves run up to the root in O(log n) time each. So, the whole algorithm has running time O(log k + log n) = O(log n). The work is unchanged by this overlaying.
Consider the problem of computing the maximum of n numbers stored in an array a[]. On a Common CRCW PRAM with n * (n - 1) PUs this problem can be solved in O(1) time. The PUs are indexed PU_{i, j}, 0 <= i < n, 0 <= j < n - 1. In addition to the integer array a[] there is a boolean array b[]. The algorithm is simple:
1. for (all i, 0 <= i < n, using PU_{i, 0}) do
b[i] = true;
2. for (all i, j, 0 <= i < n, 0 <= i < n - 1, using PU_{i, j}) do
if (a[i] < a[j] || (a[i] == a[j] && i < j))
b[i] = false;
3. for (all i, 0 <= i < n, using PU_{i, 0}) do
if (b[i])
max_index = i;
Clearly the algorithm can be performed in constant time on the
specified hardware. In step 1 each of the b[i] is written by a single
PU. In step 2 b[i] may be written by up to n - 1 PUs but the written
value is always false. By adding the clause a[i] == a[j] && i < j it
is assured that if the maximum value occurs several times only one of
them survives. In step 3 this single surviving index is written to
max_index.
The algorithm shows that the maximum can be computed very fast, but it is not optimal. However, we know optimal algorithms which can be combined with it. We summarize the algorithms:
Suppose we have 2 * n PUs. Then for each subset of 2 elements there are 4 PUs, so the above algorithm can be applied. Hereafter there are still n / 2 numbers. For each subset of 4 PUs there are 16 PUs, so again the above algorithm can be applied. More generally, we have
Lemma: After t >= 0 rounds there are still n_t = n / 2^{2^t - 1} numbers from among which the maximum has to be determined.
Proof: The proof goes by induction. For t = 0, we have n_0 = n / 2^0 = n. So, assume the lemma holds for some t. The remaining n_t numbers are divided in subsets of size 2^{2^t}. For each of these subsets there are 2 * n * 2^{2^t} / (n / 2^{2^t - 1}) = (2^{2^t})^2 PUs available. So, the algorithm can be run again, and the number of candidates is reduced to 2^{2^t - 1} / 2^{2^t} = 2^{2^{t + 1} - 1}. End.
So, with n PUs the maximum can be computed in O(loglog n) time. This is much more efficient than before and still very fast, but not yet optimal. If we take n / loglog n PUs, then first each PU can compute the maximum of 2 * loglog n numbers in O(loglog n) time. Then the fast-but-not-optimal algorithm is used to complete the task.
Theorem: On a Common CRCW PRAM with n / loglog n PUs the maximum of n numbers can be computed in O(loglog n) time.
The given algorithm is not such a spectacular application of accelerated cascading. It can be made more interesting by performing several rounds of the hypercube algorithm before changing to the CRCW algorithm. This saves some PRAM steps, because the reduction with the CRCW algorithm is not particularly efficient when the subsets have size 2 or 4, but it has no impact on the asymptotic running time.
We want to point out that the given algorithm is not obtained by a simple application of the WT-scheduling principle: the work of the algorithm with n PUs is n * loglog n because all PUs are active in all rounds. So, it is not because of the scheduling principle that fewer PUs were taken. The decision rather was inspired as follows: considering the loglog algorithm, we wondered for what n' the work of this algorithm is still bounded by O(n). n' = n / loglog n is the largest such value. Then we wondered whether there is another algorithm which allows to reduce the problem size from n to n' with O(n) work in O(loglog n) time. In this case this other algorithm was the sequential algorithm applied to small subsets.
As an example we consider the problem of coloring the nodes of a directed cycle of length n. A node coloring of a graph with c colors, is an assignment of numbers c_i 0 <= c_i < c, to the nodes of the graph so that for any two adjacent nodes i and j, c_i != c_j. Coloring a list sequentially is simple: we can start at any node, coloring the nodes with colors 0 and 1 until we reach the last node, which has to be colored with color 2 if the number of nodes is odd.
When considering how to solve the problem in parallel, there is the problem that it does not appear to make sense to start coloring everywhere. In this section we show how to efficiently construct a three-coloring in a deterministic way. Of course randomization allows to solve the problem more easily, but it is good to also know how such problems can be solved deterministically.
The main idea of the parallel algorithm is to start with a correct coloring with many colors and to gradually reduce the number of used colors. A trivial initial coloring is given by c_i = i for all i, 0 <= i < n. The algorithm proceeds in rounds. In any round, all nodes are recolored in parallel as follows: for any node i with color c_i and successor j, the new color is given by c'_i = 2 * dif(c_i, c_j) + bit(dif(c_i, c_j), c_i). Here dif(x, y) gives the index of the least-significant bit which is not the same in x and y and bit(b, x) gives bit b of x, where bit 0 is the least significant one. The computation is simple. dif(,) is the only operation which is not a standard hardware operation, but it might have been so. Much harder operations, such as multiplication, can be performed in one clock cycle, so it is reasonable to assume that even dif(,) can be performed in constant time.
We must check that the new coloring is correct. The proof of this goes by contradiction. So, assume that c'_i = c'_j. Let node k be the successor of j, then we must have dif(c_i, c_j) = dif(c_j, c_k) and bit(dif(c_i, c_j), c_i) = bit(dif(c_j, c_k), c_j). Substituting the first in the second gives bit(dif(c_i, c_j), c_i) = bit(dif(c_i, c_j), c_j), a contradiction with the definition of dif(,), which in particular should give a bit position for which the two numbers are different. Because the previous coloring was correct, such a bit must exist.
If before the recoloring the largest color is c, requiring round_down(log c) + 1 bits, numbered 0, ..., round_down(log c), then afterwards the largest color is at most 2 * round_down(log c) + 1. From this it follows that in O(log* n) rounds the number of colors is reduced from n to 5. To further reduce the number of colors, we apply
for (c = 5; c > 2; c--)
for each node i with color c_i == c, recolor i with the
smallest color c'_i which does not conflict with the color
of its predecessor or its successor.
Theorem: A three-coloring of a cycle can be constructed in O(log*^ n) time and O(log^* n * n) work.
A (purely theoretical) weakness of the above algorithm is its non-linear work. The algorithm can be made work-optimal at the expense of increasing the time consumption. The above algorithm is performed only once, reducing the number of colors to O(log n). For the special case that we have a set of n numbers with maximum value bounded by O(log n), the sorting problem can be solved in O(log n) time with O(n) work. Sorting means in this case creating buckets containing the indices of all nodes with a given color. After this sorting, the colors can be handled one-by-one as was done before to reduce the number of colors from 5 to 3. Because there are only O(log n) colors and each node has to be processed only once due to the sorting, this recoloring takes O(log n) time and O(n) work. If the super-fast algorithm is applied twice, the number of colors is reduced to O(loglog n). This has a positive impact on the cost of the final recoloring, but the sorting still takes O(log n) time and therefore this does not lead to a smaller asymptotic running time of the entire algorithm.