Basic Techniques

Divide and Conquer

In general the divide-and-conquer technique consists of three steps:
  1. Divide the problem in two or more subproblems (not necessarily of same size).
  2. Find solutions for the subproblems.
  3. Combine the solutions of the subproblems to a solution for the whole problem.
Frequently either the first or the third step is almost trivial. In a sequential context, there is also a similar class of problems in which only one of the subproblems has to be considered further, but this does not offer a source of parallelism. Divide and conquer leads to parallel algorithms in a very natural way: first all PUs are allocated to the whole problem and are involved in splitting it. Then not only the
  • WT Framework and PRAM
  • DMM

    WT Framework and PRAM

    Applying this idea leads to a simple and good parallel sorting algorithm, even though practically there are better algorithms. On a PRAM with n PUs, one PU is allocated to each number. The numbers are a_i, 0 <= i < n. The lowest numbered PU in each subset chooses at random a number s, l <= s <= h, where l and h indicate the lowest and highest index of the elements in the subset, respectively. All PUs in the subset read this number and subsequently they read a_s. PU_i compares a_i with a_s and decides in which subset to put a_i. Sequentially the division of a set goes without extra effort. Parallelly this is the hardest. It requires that a prefix problem is solved for each of the subsets, computing the new position where a number has to be stored. For a set with n elements this takes O(log n) time with O(n) work. Because the subdivision takes logarithmic time, there is no need to perform the broadcasting of the chosen number in constant time using concurrent read. So, any level of the algorithm takes O(log n) time and O(n) work. The whole sorting algorithm takes O(log^2 n) time and O(n * log n) work. Using n / log n PUs, we obtain an optimal EREW PRAM algorithm.

    DMM

    On a DMM, the algorithm can be implemented as it is. The prefix-sum problem on a network with P PUs takes O(c_i * n / P + c_a * log P) time. At the end of each round the packets must be redistributed within the subset of PUs collectively solving a subproblem. The packets from each PU have destinations in at most four different PUs: at most two PUs for the numbers smaller than a_s and at most tow for those larger than a_s. On the other hand, a PU may receive packets from many PUs. The packets may also have very different sizes. In this case it may be a good idea to first scatter the packets so as to obtain two balanced routing operations. Doing this, the routing takes O(c_v * n / P + c_a * P) time, overshadowing the cost of the prefix problem. The number of PUs dealing with a subset decreases geometrically, so the number of start-ups decreases as well. After O(log P) rounds, the subproblems are located in single PUs and no further communication is needed. In total we get
    T_par(P, n) = c_i * O(n * log n / P) + c_v * O(n * log P / P) + c_a * O(P).

    The strong point of divide-and-conquer algorithms in general is that by their localizing nature the number of start-ups is low. For the above algorithm, the start-up costs are negligible if c_a * P << c_i * n * log n / P, that is when n * log n >> c_a / c_i * P^2. If we consider how large the typical problem is for which we will apply a parallel sorting algorithm, this condition will generally be satisfied.

    The weakness of the parallel quicksort algorithm is that the routing volume is almost of the same order as the internal work: only when c_i * log n >> c_v * log P, the communication cost becomes negligible. This condition will rarely be satisfied, implying that the practical speed-up with parallel quicksort, though increasing almost linearly with P, will be small.

    Convex Hull

    The convex-hull problem is one of the most important problems from computational geometry. For a set S of points in space the task is to find the smallest convex polygon containing all points. Here we focus on the variant of the problem which deals with points lying in two-dimensional space, called the planar convex-hull problem. More precisely, the task is to compute the oredered list of all elements of S which lie on the corners of this polygon, for example by giving them in clockwise order. There are also higher-dimensional versions of the problem which are substantially harder.

    The complexity of the planar convex-hull problem is closely related to that of sorting. For a set of integer values A = {a_i| 0 <= i < n}, we can determine in linear time the minimum and maximum. Then in linear time the set A can be turned into a set of points S: a_i is mapped to s_i = (a_i, f(a_i)), where f(a_i) is the unique positive value so that s_i lies on the circle through the minimum and maximum. The convex hull of S consists of all elements. Because they must be presented ordered, solving the convex-hull problem can be used to sort. So, T_sort(n) <= T_convex_hull(n) + O(n).

    The planar convex-hull problem can also be solved by using sorting as a subroutine and is actually not much harder. The algorithm consists of log n rounds which each take O(log n) time. The total work is bounded by O(n * log n). The algorithm proceeds as follows:

    1. Sort the points according to their x-coordinates.
    2. Determine the points a and b with the smallest and largest x-coordinate, respectively.
    3. Determine all points which lie above the line through a and b and those that lie under this line. Call these subsets S_upper and S_lower.
    4. Determine the convex hull of S_upper and the hull of S_lower.
    5. Construct the convex hull of S by enumerating the hull of S_upper followed by enumerating the hull of S_lower, enumerating a and b only once.
    This construction allows to concentrate on the special case that all elements lie above a line. If this is convenient, the points may be rotated about the origin and shifted so that the line through a and b coincides with the x-axis. The special case that all elements lie above the line through a and b can be solved as follows:
    1. Select the point s so that half of the numbers have smaller and half of the numbers have larger x-coordinate.
    2. Construct the sets S_left and S_right of points which lie to the left and right of s, respectively. Add s to both sets.
    3. Recursively solve the problem for S_left and S_right.
    4. Determine the upper common tangent of the convex hulls of S_left and S_right. For two finite sets of points, an upper common tangent is a line which intersects the sets in exactly one point and which has none of the points of the sets above it. Let s_left and s_right be the points in S_left and S_right, respectively, that lie on the upper common tangent.
    5. Construct the convex hull of the whole set by enumerating the points in the hull of S_left until reaching s_left, then continue with the elements in S_right starting from s_right.
    The recursion should be terminated as soon as reaching a subset with only constantly many nodes, for which the convex hull can be determined by enumerating all possibilities. It is easy to verify that the sequential algorithm runs in O(n * log n) time. The only non-trivial point is how to determine the upper common tangent. This can be done by a variant of binary search in O(log (n_1 + n_2)) time for sets with n_1 and n_2 elements, respectively. The initial sorting is not really needed, but makes it easier to repeatedly select the median in the subsets of points. In a randomized variant which is very similar to quicksort the points do not need to be sorted first.

    Later we will consider how to perform parallel sorting. It is rather simple however to perform it in O(log^2 n) time and O(n * log n) work on an EREW PRAM with n / log n PUs. All other steps of the algorithm except for determining the convex hulls of S_upper and S_lower can be performed in O(log n) time and O(n) work or less. Because the set of numbers has been sorted before, constructing S_left and S_right is trivial. The same is true for constructing the whole convex hull in the last step. For the time being we are happy with using the sequential algorithm for finding the upper common tangent. So, the time and work for the recursive part are given by:

    T(n) = O(log n) + T(n / 2),
    W(n) = O(n) + 2 * W(n / 2).
    The solution of these recurrencies is T(n) = O(log^2 n) and W(n) = O(n * log n). Because of this T, we have not put made an effort to push the time for the non-recursive part below O(log^2 n). The work of this parallel algorithm is optimal, but the time consumption can be reduced to O(log n).

    Partitioning

    The partitioning technique consists of breaking up a problem in subproblems of almost equal size and solving these subproblems concurrently and independently. Here we will show how this idea can be used in a parallel merging algorithm.

    Merging is the process of turning two sorted arrays a[] and b[] of length n each into one array c[] of length 2 * n containing all values of a[] and b[] in sorted order. The rank of an element x in a set X, denoted rank(x, X), is defined to be the number of elements y in X with y < x. Here it is convenient to assume that all values are different, otherwise the index of the position in which a value is stored may be taken as a secondary criterion. The rank of an element x in A union B is given by rank(x, A) + rank(x, B). rank(a[i], A) = i and rank(b[i], B) = i, for all 0 <= i < n. Thus, it suffices to determine rank(a[i], B) and rank(b[i], A), for all 0 <= i < n, because once all ranks in A union B are known, it takes constant time and linear work to write the numbers of a[] and b[] in c[].

    A trivial and fast way to determine all ranks is to perform binary search: for each i, rank(a[i], B) and rank(b[i], A) can be determined in O(log n) time and O(log n) work. Thus, merging can be performed in O(log n) time and O(n * log n) work. This time is satisfactory, but the superlinear work should be reduced.

    The obvious improvement is to not determine the rank for all numbers but only a fraction. The largest fraction that can be used so that the work is still bounded by O(n) is n / log n. So, take all numbers a[j * log n] and b[j * log n], for 0 < j < n / log n, and rank them in the other set. This takes O(log n) time and O(n) work. This initial ranking partitions the problem of merging a[] and b[] into small subproblems: there are 2 * n / log n - 2 values for which we know the positions in both a[] and b[]. These subdivide each of the arrays in subsets of size at most log n which have to be merged pairwise. For each of these subproblems a sequential algorithm requires O(log n) time and work linear in the sum of the sizes of the subsets. So, the partitioning allows to solve the whole problem in O(log n) time and O(n) work.

    Merging Using Partitioning

    Pipelining

    Pipelining is the process of breaking up a task in subtasks which can be performed after each other so that several of these problems can be overlapped in a time-shifted way. We have encountered this idea already for some hypercube algorithms and also the systolic matrix product algorithm is a pipelining algorithm. The essence of pipelining is the time-shifted processing.

    As a further example we consider the problem of inserting k new numbers b_i, 0 <= i < k, into a 2-3 tree with n nodes with keys, a_i, 0 <= i < n, in sorted order. It is assume that k < n, otherwise it is more efficient to build a new 2-3 tree from scratch. Sequentially k numbers can be inserted in O(k * log n) time, and in general this cannot be done much faster, because on the one hand Omega(k * log k) is needed because as a result of the insertions the numbers get sorted, and if log k = o(log n), then it may happen that after the highest log k levels of the tree all paths are independent and must be followed to find the positions where to insert the elements. These remaining paths have length Omega(log n - log k) = Omega(log n) each. So, in our parallel algorithm we are happy with O(k * log n) work, but we would like to reduce the time as far as possible, that is, to O(log n).

    First the smallest and largest number, which can easily be determined with linear work in O(log k) time, are inserted in the normal sequential fashion. This is done to eliminate some special cases: after this operation, we know that any element to insert fits between two consecutive leafs of the tree. The insertion takes O(log n) time and work. Then for all k - 2 other b_i the leaf a_j of the tree is determined after which b_i has to be inserted. A chain is the ordered subset B_j of the b_i which fits between a_j and a_{j + 1}. Let k_j = |B_j|. sum_j k_j = k - 2. First consider the special case k_j <= 1, for all j. In this case, the nodes of the tree may get degree larger than 3, but never more than 6. So, processing the tree bottom-up processing all insertions at the same level in parallel, the whole tree can be processed in O(log n) time with O(k * log n) work. Here it is essential that any node gets split only once and that therefore the property that between any two existing nodes at most one new node has to be inserted also holds at the higher levels of the tree. We also use that any node can be processed in constant time.

    The above we call the basic algorithm which will be used as a subroutine for the general case. If k_j > 1, we do not insert all elements in one stroke, but rather first insert only the middle element of the chain. At this point the lists should have been sorted. If we are lucky all lists are short, but there might be one chain with k - 2 elements. In that case we need an optimal sorting algorithm running in logarithmic time. Inserting the middle element of a chain splits the chain in two. So, after at most log k of these operations all chains have length 1 or 0 and we are done. Thus, the general problem can be solved easily in O(log k * log n) time.

    The final improvement comes by observing that there is no neeed to wait with the next application of the basic algorithm until the previous one has been completed. These operations can also be overlayed in such a way that the next insertion starts at the lowest level as soon as the previous insertion has finished there. All these waves run up to the root in O(log n) time each. So, the whole algorithm has running time O(log k + log n) = O(log n). The work is unchanged by this overlaying.

    Accelerated Cascading

    A crucial technique, both theoretically and practically is the idea to use several algorithms to solve a single problem. The typical application of this idea is found for problems whose size is iteratively reduced. For such problems, as long as the problem size is large a work-optimal algorithm is used, but once the problem size has been reduced sufficiently another algorithm is used which is sub-optimal but runs faster. In exceptional cases even three algorithms are combined to minimize the time while still achieving optimal work. In the book "Parallel Algorithms" by JaJa, such a construction is called accelerated cascading. In the course of the text we will encounter several examples. A trivial subcase of accelerated cascading is when the last applied algorithm is a sequential one.

    Consider the problem of computing the maximum of n numbers stored in an array a[]. On a Common CRCW PRAM with n * (n - 1) PUs this problem can be solved in O(1) time. The PUs are indexed PU_{i, j}, 0 <= i < n, 0 <= j < n - 1. In addition to the integer array a[] there is a boolean array b[]. The algorithm is simple:

      1. for (all i, 0 <= i < n, using  PU_{i, 0}) do
           b[i] = true;
      2. for (all i, j, 0 <= i < n, 0 <= i < n - 1, using PU_{i, j}) do
           if (a[i] < a[j] || (a[i] == a[j] && i < j))
             b[i] = false;
      3. for (all i, 0 <= i < n, using  PU_{i, 0}) do
           if (b[i])
             max_index = i;
    
    Clearly the algorithm can be performed in constant time on the specified hardware. In step 1 each of the b[i] is written by a single PU. In step 2 b[i] may be written by up to n - 1 PUs but the written value is always false. By adding the clause a[i] == a[j] && i < j it is assured that if the maximum value occurs several times only one of them survives. In step 3 this single surviving index is written to max_index.

    The algorithm shows that the maximum can be computed very fast, but it is not optimal. However, we know optimal algorithms which can be combined with it. We summarize the algorithms:

    Suppose we have 2 * n PUs. Then for each subset of 2 elements there are 4 PUs, so the above algorithm can be applied. Hereafter there are still n / 2 numbers. For each subset of 4 PUs there are 16 PUs, so again the above algorithm can be applied. More generally, we have

    Lemma: After t >= 0 rounds there are still n_t = n / 2^{2^t - 1} numbers from among which the maximum has to be determined.

    Proof: The proof goes by induction. For t = 0, we have n_0 = n / 2^0 = n. So, assume the lemma holds for some t. The remaining n_t numbers are divided in subsets of size 2^{2^t}. For each of these subsets there are 2 * n * 2^{2^t} / (n / 2^{2^t - 1}) = (2^{2^t})^2 PUs available. So, the algorithm can be run again, and the number of candidates is reduced to 2^{2^t - 1} / 2^{2^t} = 2^{2^{t + 1} - 1}. End.

    So, with n PUs the maximum can be computed in O(loglog n) time. This is much more efficient than before and still very fast, but not yet optimal. If we take n / loglog n PUs, then first each PU can compute the maximum of 2 * loglog n numbers in O(loglog n) time. Then the fast-but-not-optimal algorithm is used to complete the task.

    Theorem: On a Common CRCW PRAM with n / loglog n PUs the maximum of n numbers can be computed in O(loglog n) time.

    The given algorithm is not such a spectacular application of accelerated cascading. It can be made more interesting by performing several rounds of the hypercube algorithm before changing to the CRCW algorithm. This saves some PRAM steps, because the reduction with the CRCW algorithm is not particularly efficient when the subsets have size 2 or 4, but it has no impact on the asymptotic running time.

    We want to point out that the given algorithm is not obtained by a simple application of the WT-scheduling principle: the work of the algorithm with n PUs is n * loglog n because all PUs are active in all rounds. So, it is not because of the scheduling principle that fewer PUs were taken. The decision rather was inspired as follows: considering the loglog algorithm, we wondered for what n' the work of this algorithm is still bounded by O(n). n' = n / loglog n is the largest such value. Then we wondered whether there is another algorithm which allows to reduce the problem size from n to n' with O(n) work in O(loglog n) time. In this case this other algorithm was the sequential algorithm applied to small subsets.

    Symmetry Breaking

    Often symmetry is a desirable property which can be exploited. At the same time symmetry can be a problem when we want to reduce the problem size by choosing a subset: how can we single out a subset if all elements look the same? This is the problem of symmetry breaking and the key to its solution is to either use randomization, or to use the indices if these are available.

    As an example we consider the problem of coloring the nodes of a directed cycle of length n. A node coloring of a graph with c colors, is an assignment of numbers c_i 0 <= c_i < c, to the nodes of the graph so that for any two adjacent nodes i and j, c_i != c_j. Coloring a list sequentially is simple: we can start at any node, coloring the nodes with colors 0 and 1 until we reach the last node, which has to be colored with color 2 if the number of nodes is odd.

    When considering how to solve the problem in parallel, there is the problem that it does not appear to make sense to start coloring everywhere. In this section we show how to efficiently construct a three-coloring in a deterministic way. Of course randomization allows to solve the problem more easily, but it is good to also know how such problems can be solved deterministically.

    The main idea of the parallel algorithm is to start with a correct coloring with many colors and to gradually reduce the number of used colors. A trivial initial coloring is given by c_i = i for all i, 0 <= i < n. The algorithm proceeds in rounds. In any round, all nodes are recolored in parallel as follows: for any node i with color c_i and successor j, the new color is given by c'_i = 2 * dif(c_i, c_j) + bit(dif(c_i, c_j), c_i). Here dif(x, y) gives the index of the least-significant bit which is not the same in x and y and bit(b, x) gives bit b of x, where bit 0 is the least significant one. The computation is simple. dif(,) is the only operation which is not a standard hardware operation, but it might have been so. Much harder operations, such as multiplication, can be performed in one clock cycle, so it is reasonable to assume that even dif(,) can be performed in constant time.

    We must check that the new coloring is correct. The proof of this goes by contradiction. So, assume that c'_i = c'_j. Let node k be the successor of j, then we must have dif(c_i, c_j) = dif(c_j, c_k) and bit(dif(c_i, c_j), c_i) = bit(dif(c_j, c_k), c_j). Substituting the first in the second gives bit(dif(c_i, c_j), c_i) = bit(dif(c_i, c_j), c_j), a contradiction with the definition of dif(,), which in particular should give a bit position for which the two numbers are different. Because the previous coloring was correct, such a bit must exist.

    Node Coloring

    If before the recoloring the largest color is c, requiring round_down(log c) + 1 bits, numbered 0, ..., round_down(log c), then afterwards the largest color is at most 2 * round_down(log c) + 1. From this it follows that in O(log* n) rounds the number of colors is reduced from n to 5. To further reduce the number of colors, we apply

      for (c = 5; c > 2; c--)
        for each node i with color c_i == c, recolor i with the 
        smallest color c'_i which does not conflict with the color
        of its predecessor or its successor.
    

    Theorem: A three-coloring of a cycle can be constructed in O(log*^ n) time and O(log^* n * n) work.

    A (purely theoretical) weakness of the above algorithm is its non-linear work. The algorithm can be made work-optimal at the expense of increasing the time consumption. The above algorithm is performed only once, reducing the number of colors to O(log n). For the special case that we have a set of n numbers with maximum value bounded by O(log n), the sorting problem can be solved in O(log n) time with O(n) work. Sorting means in this case creating buckets containing the indices of all nodes with a given color. After this sorting, the colors can be handled one-by-one as was done before to reduce the number of colors from 5 to 3. Because there are only O(log n) colors and each node has to be processed only once due to the sorting, this recoloring takes O(log n) time and O(n) work. If the super-fast algorithm is applied twice, the number of colors is reduced to O(loglog n). This has a positive impact on the cost of the final recoloring, but the sorting still takes O(log n) time and therefore this does not lead to a smaller asymptotic running time of the entire algorithm.

    Exercises

    1. Due to its simple structure parallel quicksort is very suitable for implementation on interconnection networks.
      • Specify missing details and analyze the algorithm for sorting n^2 numbers on an n x n mesh.
      • Specify missing details and analyze the algorithm for sorting n = 2^k numbers on a k-dimensional hypercube.
      • Consider making the hypercube algorithm work-optimal. In other words, consider sorting n * log n numbers on a k-dimensional hypercube. Which communication model do you need: one-port or full-port?

    2. For numbers x and n, we want to compute x^i for all 0 <= i < n. Describe how to do this in O(log n) time and O(n) work. This leads to an efficient and fast way to evaluate a polynomial of degree n - 1 given by an array of coefficients a[i]. Write down the complete algorithm.

    3. There are arrays a[] and b[] of length. a[i] contains integers and b[i] booleans. If b[i] = true, the interpretation is that a new segment starts for index i. The task is to compute the segmented prefix sums, that is, for any i, we want to compute the value s[i] = sum_{j = i_0}^i a[j], where i_0 <= i is the largest value with b[i_0] = true. Give an algorithm that solves this problem in O(log n) time and O(n) work on an EREW PRAM.

    4. Given an array a[] of length n we want to compute the prefix minima, the values m[i], 0 <= i < n, defined by m[i] = min_{j <= i} a[j]. Give an algorithm that solves this problem in O(log n) time and O(n) work on an EREW PRAM.

    5. Describe in some detail how to compute the upper common tangent of two upper hulls S_1 and S_2, with all elements of S_1 lying in the plane to the left of those of S_2. The algorithm should run in O(log n) time, where n = |S_1| + |S_2|.

    6. Give the smallest numbers n_t, 1 <= t <= 5, for which the recoloring algorithm needs t rounds before reaching a situation with 5 colors.

    7. Describe how to efficiently build a 2-3 tree from an array of n elements which stand in the array in sorted order in parallel. Specify the required PRAM submodel and the time and work of the algorithm.

    8. Show that the maximum of n numbers can be determined in O(1) time on a PRAM with n^{1 + eps} PUs for any eps > 0. Which PRAM model do you need?

    9. We consider the problem of sorting n numbers a[i], with 0 <= i < n, with values in a small range: 0 <= a[i] < m, for all i.
      • Give an algorithm for the case m = log n which runs in O(log n) time and O(n) work on an EREW PRAM. Hint: let S_x, 0 <= x < log n, be the subset of indices i, 0 <= i < n, with a[i] = x. First determine for all x and for all i in S_x, the rank of i within S_i.
      • Generalize the algorithm for arbitrary m and specify the corresponding time consumption and work in terms of n and m. Hint: use radix sort.

    This page was created by Jop Sibeyn.
    Last update Monday, 06 September 04 - 09:19.
    For any comments: send an email.