Sorting

Transposition Sort

The first sorting algorithm of this chapter is particularly simple but not very efficient. It is a useful subroutine though and may applied in small subnetworks. It is known under the name odd-even transposition sort, mostly abbreviated to just "transposition sort". It is very similar to bubblesort, only the order in which the comparisons are performed is different.

Linear Array

On a linear array with n PUs, n even, of which each PU holds one number, the algorithm consists of n rounds. In each round the numbers in pairs of consecutive PUs are compared, and if the number in the PU with smaller index is larger, then they are exchanged. Comparing and possibly exchanging the numbers in PU_i and PU_j will be denoted by compare_swap(i, j). In terms of this operation, the transposition-sort algorithm can be written down concisely:
  for (t = 0; t < n / 2; t++)
  {
    for (i = 0; i < n / 2; i++) // Even
      compare_swap(2 * i, 2 * i + 1);
    for (i = 1; i < n / 2; i++) // Odd
      compare_swap(2 * i - 1, 2 * i);
  }

The correctness of the algorithm can be proven by induction. The idea underlying the argument is that the largest number, once it starts moving never gets delayed. It reaches its destination after at most n - 1 steps. The second largest number may be swapped with the largest number if the first step, and then it has to wait one step. Hereafter it will never be delayed anymore, so it arrives no later then after n steps. The operations in each of the inner loops can be executed in parallel. On an interconnection network, the simplest is to implement a compare-swap operation by exchanging the numbers to be compared and to discard one of them in each of the two involved PUs. So, on a linear array, the algorithm can be implemented with n steps in which each PU sends at most one number to a neighbor. Odd-even transposition sort is simple but inefficient: as a sequential algorithm it has complexity Theta(n^2) for sorting n numbers.

Odd-Even Transposition Sort

DMM

On a DMM with P PUs which each hold k = n / P numbers, there are several possible generalizations of the above algorithm. The simplest is to let each PU first sort all k numbers it holds. Then it repeatedly copies all its k numbers to one of its neighbors. The 2 * k numbers are merged and the PU with the smaller index retains the k smallest numbers, while the other PU retains the k largest. The time consumption of this algorithm is given by
T(P, n) = c_i * O(n / P * log n + n) + c_v * n + c_a * P.
For small P and very large n, the term c_i * O(n / P * log n) may dominate. In that case the algorithm gives good performance, but this applies only for n which are exponential in P. In the following we will see that good performance can be achieved for much smaller values of n as a function of P.

Mesh

Sorting n numbers on a DMM or interconnection network with P PUs consists of two subproblems:
  1. Determine for all numbers their rank;
  2. Route the numbers (and associated information) so that the number with rank r, 0 <= r < n, comes to stand in PU_{r / k}, for k = n / P. (Sometimes the number with rank r should be routed to PU_{r mod P}).
Sometimes these problems are solved separately, but often better performance is obtained by integrating them. For example, in the odd-even transposition sort algorithm, the rank of the numbers is never determined explicitly.

In the DMM model, the structure of the interconnection network is ignored. On interconnection networks, there is a structure, and the indexing scheme has impact on the efficiency with which sorting can be performed. Generally speaking, it is convenient if PUs for which the indices do not differ much also lie close to each other in the network. Nevertheless, sorting with respect to any indexing scheme which is fixed in advance is at least as hard as performing a k-k routing: if the hardest permutation pi routes a packet from PU_i to PU_j, then when sorting the value of the corresponding number in PU_i is set to j and this number will have to be routed from PU_i to PU_j in order to perform the sorting. The indexing scheme may have importance, but it should be noticed that the difference between the best and the worst scheme is limited: first sorting according to the best indexing scheme and then routing the numbers so that they come to stand according to the desired scheme costs at most twice as much as just sorting.

The simplest algorithm for sorting on a mesh is obtained by embedding a linear array into it. This algorithm sorts n^2 numbers on an n x n mesh in n^2 steps. This poor performance is due to the fact that the two-dimensional structure of the mesh is ignored. Ideally, if a problem can be solved in O(n) time on a linear array with n PUs it can also be solved on an n x n mesh in O(n) time. A natural way to obtain mesh algorithms is by combining operations on rows and columns. For sorting, it appears natural to consider algorithms which consist of an alternation of one-dimensional sorting phases, using odd-even transposition sort.

The shearsort algorithm, consists of an alternation of operations in which all columns are sorted in increasing order from top to bottom, and operations in which the even rows, that is the rows with index 0, 2, ..., n - 2, are sorted from left to right while the odd rows, with index 1, 3, ..., n - 1, are sorted from left to right. Only in the final sorting of the rows, the sorting direction can be chosen freely. Performing this operation as all others, the numbers are sorted in snake-like row-major order. When sorting all rows increasingly from left to right we get row-major order. The first alternative gives the following extremely simple algorithm:

  for (t = 0; t <= t_max(n); t++)
  {
    sort all columns from top to bottom;
    sort even rows from left to right and odd rows from right to left;
  }
If all PUs hold one number, each pass of the loop takes 2 * n steps, so the whole algorithm runs in 2 * n * (t_max(n) + 1) steps. The question is how large t_max(n) should be chosen to guarantee that any distribution gets sorted. The analysis is based on two lemmas.

0-1 Sorting Lemma: If a comparison-based sorting algorithm correctly sorts any distribution consisting of 0-s and 1-s, it correctly sorts any distribution with arbitrary numbers.

Proof: The proof goes by contradiction. Assume there is a set of numbers X = {x_0, ..., x_{n - 1}} for which the algorithm is not correct. Let pi'() be the inverse of the permutation realized by the algorithm. That is, afterwards the numbers have been rearranged to x_{pi'(0)}, ..., x_{pi'(n - 1)}. Let pi() be the inverse of the permutation giving a correct sorting of the numbers. That is, pi() is so that x_{pi(0)} <= x_{pi(1)} <= ... <= x_{pi(n - 1)}. Let k be the smallest index so that x_{pi'(k)} != x_{pi(k)}. All numbers with values smaller than x_{pi(k)} belong to the set {x_{pi(0)}, ..., x_{pi(k - 1)}} = {x_{pi'(0)}, ..., x_{pi'(k - 1)}}, so we must have x_{pi'(k)} > x_{pi(k)}. Because pi() is a permutation, this implies that there also must be an index l > k, so that x_{pi'(l)} = x_{pi(k)}. Now define the set of 0-1 values X' = {x'_0, ..., x_{n - 1}} by x'_i = 0 if x_i <= x_{pi(k)}, x'_i = 1 if x_i > x_{pi(k)}. Because for all 0 <= i, j < n, x_i < x_j => x'_i < x_j, the algorithm works on X' as it did on X (except that possibly some elements which now have the same value do not get exchanged). So, on X', the algorithm will produce the following sequence of values as output: ..., x'_{pi'(k)}, ..., x'_{pi'(l)}, ... = ..., 1, ..., 0, ... . That is, at position k there occurs a 1 while at position l > k there occurs a 0: the sorting is not correct, a contradiction with the assumption that it correctly sorts all 0-1 distributions. End.

A row is called clean if it only contains 0-s or only 1-s. Otherwise a row is called dirty. For the analysis it is convenient to rewrite the algorithm as follows:
  sort all columns from top to bottom;
  for (t = 0; t < t_max(n); t++)
  {
    sort even rows from left to right and odd rows from right to left;
    sort all columns from top to bottom;
  }
  sort even rows from left to right and odd rows from right to left;
Hereafter we will refer to this variant.

Lemma: For any 0-1 number distribution on an n x n mesh, after t, 0 <= t <= log n, passes through the loop in shearsort there are still at most n / 2^t dirty rows.

Proof: The purpose of the first sorting of the columns is that it establishes the property that for all i, 0 <= i < n - 1, the number of 0-s in row is at least as large as in row i + 1. This is true because if row i + 1 would hold more 0-s than row i, there would be a position in row i with a 1 where row i + 1 has a 0, contradicting the sorting of the columns.

For t >= 0 we use induction. For t = 0 the claim holds. So, assume that the claim holds for some t >= 0. Sorting the clean rows does not change anything. So, we may focus on the n_t dirty rows. Let z_i, 0 <= i < n_t, denote the number of 0-s in the i-th of these rows. z_i is monotonously decreasing. Let m_0 = |{0 <= i < n_t| z_i >= n / 2}| and let m_1 = n_t - m_0. We claim that after pass t + 1, at least round_down(m_0 / 2) of these rows only contain 0-s, and at least round_down(m_1 / 2) only 1-s. This is a direct consequence of the the monotonicity of z_i and the fact that even rows are sorted differently from odd rows. Thus, any two consecutive rows with z_i >= n / 2 lead to one row with only 0-s. There are several cases to distinguish. If n_t is odd, then either m_0 is even or m_1 is even and thus round_down(m_0 / 2) + round_down(m_1 / 2) = n_t / 2. If n_t is even, then if m_0 and m_1 are even, round_down(m_0 / 2) + round_down(m_1 / 2) = (m_0 + m_1) / 2 = n_t / 2. Only the case that both m_0 and m_1 are odd deserves special consideration. So consider row m_0, which is sorted from right to left and row m_0 + 1 which is sorted from left to right. If z_{m_0} + z_{m_0 + 1} >= n, then this results in one more row with only 0-s, otherwise in one more row with only 1-s. So, in this case at least round_down(m_0 / 2) + round_down(m_1 / 2) + 1 = n_t / 2 new clean rows are generated. End.

Theorem: Shearsort sorts any distribution of n^2 numbers on an n x n mesh in (2 * log n + 2) * n steps.

The time consumption can be improved by almost a factor two, by realizing that the sorting in the columns does not need to be performed for n steps in the later passes of the loop: if there are only n_t rows which might contain 0-s and 1-s, it suffices to perform n_t steps. Thus, all sorting in the columns can be reduced to n + n + n / 2 + ... + 2 = 3 * n - 2. So, the total sorting time can be reduced to (log n + 4) * n - 2 steps. Even this is not such a good result, but it might serve as basis case of a recursive algorithm. The question is whether it is better than the result obtained by trivially embedding a linear array. The times that should be compared are given in the following table, the results are not very impressive:

n 2 4 8 16
Transposition Sort 4 16 64 256
Shearsort 8 22 54 126

Shearsort

Merge Sort

Searching

The most important ingredient in the subsequent merging algorithm is a subroutine for localizing an element in a sorted array of length n. So, for a sorted array a[] and a value x the task is to determine an index i so that a[i] <= x < a[i + 1]. Sequentially this variant of searching can be solved in O(log n) time using binary search. We could also call this the problem of ranking x in the set of elements in a[]. In binary search in each round the size of the remaining problem is reduced by a factor two. If we have P PUs, the problem size can be reduced by a factor P in each round.

The algorithm is very simple. Initially x may lie in the whole interval ranging from 0 to n. This interval is reduced in several rounds. In each round the remaining interval is divided in P intervals I_i, 0 <= i < P, of approximately equal size. PU_i, 0 <= i < P, is in charge of I_i. Each of the PUs tests whether x lies in its interval. Assuming that the values are different (otherwise the smallest should be taken) there is a unique interval in which x lies. The PU in charge of this interval writes its index to a global variable c. All PUs read this variable and can start the next round. On an CREW PRAM, each round can be performed in O(1) time. So, the whole searching algorithm runs in O(log_P n).

Theorem: On a CREW PRAM with P PUs, an element x can be ranked in a sorted set of size n in O(log_P n) time.

An important special case is that P = n^eps for some eps > 0. In that case log_P n = 1 / eps = O(1). So, with sufficiently many PUs an element can be ranked in constant time.

Merging through Ranking

In the chapter on "Basic Techniques", we have seen how two sorted arrays of length n can be merged in O(log n) time with O(n) work. This immediately leads to a parallel sorting algorithm running in O(log^2 n) time and O(n * log n) work. In the following we will show that two sorted arrays of length n can actually be merged in O(loglog n) time. This leads to a work-optimal O(log n * loglog n) sorting algorithm.

For two sorted arrays, a[] of length m and b[] of length n, let rank(a, b) denote the array (r_0, ..., r_{m - 1}) of length m with r_i = rank(a[i], b), the rank of a[i] in b[]. Assume that we have a CREW PRAM with n PUs, then if m = O(n^{1 - eps}), for some eps > 0, rank(a, b) can be determined in constant time, because for each of the m elements of a[] we can use n^eps PUs to perform the ranking.

Again we apply the partitioning strategy. For merging two sorted arrays a[] and b[] each of length n with n PUs, we select the subarrays a'[] and b'[] of length sqrt(n) with a'[i] = a[sqrt(n) * i] and b'[i] = b[sqrt(n) * i] (it is convenient to add one more element to each subarray with value infinity). With the available PUs rank(a', b) and rank(b', a) can be computed in constant time. The resulting partitioning reduces the merging problem to merging subarrays of a[] and b[] of length at most sqrt(n). Before we reduced the problem in one stroke to subproblems which could be solved sequentially, now we continue recursively until the size has become constant. Each phase of the recursion takes O(1), so the whole merging takes O(loglog n) time.

Lemma: On a CREW PRAM two sorted arrays of length n each can be merged in O(loglog n) time with linear work.

Proof: Above we have shown how to merge two sorted arrays a[] and b[] of length n on a CREW PRAM with n PUs in O(loglog n) time. The work of this algorithm is O(n * loglog n). This algorithm can be made work-optimal by some kind of partitioning. So, we first determine subarrays a'[] and b'[] each of length n / loglog n by taking a'[i] = a[i * loglog n] and b'[i] = b[i * loglog n]. a'[] and b'[] are merged with n / loglog n PUs in O(loglog n) time. Just as in all previous partitionings, as a result it remains to merge sections of a[] and b[] each of length at most loglog n. With one PU for each subproblem, each of these can be solved with the sequential algorithm in O(loglog n) time. End.

Using the above fast and efficient merging as a subroutine in a merge sort algorithm gives

Theorem: On a CREW PRAM with n / loglog n PUs n numbers can be sorted in O(log n * loglog n) time.

O(log n * loglog n) is not the best possible result: sorting can be solved in O(log n) time and O(n * log n) work on an EREW PRAM. This optimal algorithm, which is known in the literature as Cole's merge-sort algorithm, is considerably more complicated and practically a gap of O(loglog n) is hardly significant. A more serious drawback of the presented algorithm is that it is based on the parallel-search algorithm which really needs the CREW model. The parallel quick-sort algorithm can also be refined to run in O(log n) time with high probability.

Bitonic Merge Sort

Bitonic merge sort is based on the so-called bitonic merge operation. Except for the description and the correctness proof of this operation all is trivial: sorting n = 2^k numbers consists of k rounds, in round i, 1 < i <= k, n / 2^{i - 1} sorted subsequences of length 2^{i - 1} are pairwise merged to obtain sorted subsequences of length 2^i. This algorithm (just as the odd-even merge presented in the next section), in a version for sorting networks, was presented by Batcher already in 1968. The strong point of this sorting procedure is that it is very suitable for a hard-wired implementation using a sorting network. Even the simplest formulation of the algorithm is non-recursive and insitu.

Algorithm

Assume we have an array a[] of length n. Assume that the numbers in a[] from a[0] to a[n / 2 - 1] are increasing, while the numbers from a[n / 2] to a[n - 1] are decreasing. Then the numbers can be made to stand in increasing order in the whole array by running the following procedure:
  void bitonicMergeUp(int[] a, int n) {
    for (int s = n >> 1; s > 0; s = s >> 1)
      for (int i = 0; i < n; i += 2 * s)
        for (int j = i; j < i + s; j++)
          compareSwapUp(a, j, j + s); }
Here compareSwapUp(a, i, j), compares the values of a[i] and a[j] and exchanges them if a[i] > a[j]. The procedure bitonicMergeDn() is analogous. These procedures integrated in a running program can be downloaded here.

Bitonic Merge Sort

The correctness is not proven here. The invariant that must be proven is that after round r, k >= r >= 0, all numbers in a subset of size 2^r with bit r equal to 0 are smaller than all numbers in the corresponding subset with bit r equal to 1. Because later comparisons are performed only within the subsets, the earlier established properties are preserved. Because all properties together imply a sorted sequence, proving the invariant implies the correctness.

Work-Time Framework

Here we give a precise expression for the number of performed comparisons. Let T_merge(2^k) be the time to merge two sorted sequences of length 2^{k - 1}. From the algorithm it immediately follows that
T_merge(2^k) = k / 2 * 2^k, for all k >= 1.

For T_sort(2^k), the time to sort 2^k numbers based on this merge operation, we get

T_sort(2^1) = 1,
T_sort(2^k) = 2 * T_sort(2^{k - 1}) + T_merge(2^k), for all k > 1.
Using induction it is easy to verify, that the solution is given by
T_sort(2^k) = k * (k + 1) / 4 * 2^k, for all k >= 1.
Another important point is the depth of the schedule. For bitonic merge sort, this is simply sum_{i = 1}^k i = k * (k + 1) / 2 = T_sort(2^k) / (2^k / 2).

Theorem: Using bitonic merge sort for sorting n numbers, the work is Theta(n * log^2 n). The time is Theta(log^2 n).

Interconnection Networks

On interconnection networks, the compare-swap operation for two numbers a and a' residing in PU p and p', respectively, can be performed in two substeps:
  1. PU p sends a copy of a to p', PU p' sends a copy of a' to p.
  2. PU p and p' compare a and a' and decide which number to keep and which number to discard.
In a DMM implementation the following might be better:
  1. PU p sends a copy of a to p'.
  2. PU p' compares a and a'. If it should keep a it sends a copy of a' to p, otherwise it simply discards a and keeps a'.
At the expense of one more routing operation, this reduces the average routing volume from 2 to 3/2 per compare-swap.

Bitonic merge sort is very suitable for implementation on a hypercube: in all steps number a[i] is compared with numbers a[j], where j = i +- 2^r, for some r, 0 <= r < k. This means that, when mapping the numbers of a[] in a straightforward way to the 2^k PUs of a k-dimensional hypercube, all communication takes place between neighbors, and that thus all steps can be performed in constant time.

Theorem: Using bitonic merge sort, 2^k numbers can be sorted in Theta(k^2) time on a k-dimensional hypercube.

Asymptotically this is not the best achievable, but the algorithm is extremely simple and, as seen above, the constants hidden in the Theta-notation are very small.

Surprisingly bitonic merge sort is asymptotically optimal on linear arays and meshes. The analysis is most easy for linear arrays. Merging two subsequences standing each in m consecutive PUs takes m + m / 2 + + ... + 1 = 2 * m - 1 steps. So, for a linear array with n = 2^k PUs all merging rounds together take sum_{0 <= s < k} (2 * 2^s - 1) = 2 * sum_{0 <= s < k} 2^s - k = 2 * n - log n - 2 < 2 * n steps. This is not bad at all! Of course, for linear arrays odd-even transposition sort performs fewer routing steps, but at the expense of far more comparisons.

On a two-dimensional n x n mesh n^2 numbers can be sorted in 2 * n + o(n) steps, which is optimal, but the algorithm is complicated and the lower-order term makes this algorithm unsuitable for moderate values of n. How about bitonic merge sorting? It is essential that the PUs are indexed with the shuffled row-major indexing. In that case the distance between PU i and j with j = i +- 2^r, for some r, 0 <= r < k, is exactly 2^{round_down{r / 2}}. Because such PU i and j lie in the same row or column, the routing can be performed without delay. Thus, merging two subsequences of length 2^s each takes 1 + 1 + 2 + 2 + ... + 2^round_down{(s - 1) / 2} + 2^round_down{s / 2} < 4 * 2^{s / 2}. For the number of steps T(n) for the whole sorting this gives T(n) < 4 * sum_{0 <= s < k} 2^{s / 2} < 8 * sum_{0 <= s < k / 2} 2^s < 8 * n. It is worthwhile to compute T(n) exactly. T(2) = 1 + (1 + 1) = 3. T(4) = T(2) + (2 + 1 + 1) + (2 + 2 + 1 + 1) = 13. T(8) = T(4) + (4 + 2 + 2 + 1 + 1) + (4 + 4 + 2 + 2 + 1 + 1) = 37. More generally, T(n) = T(n / 2) + 3 1/2 * n - 4. The solution is T(n) = 7 * n - 4 * log n - 7. As noticed above, asymptotically this is not competitive, but for small n it might be the best achievable. It is better than transposition sort and shearsort for all n.

DMM

Bitonic merge sort can even be applied on a DMM. The n = 2^k numbers are distributed evenly over the P PUs. Assume P = 2^p, for some p > 0. Each PU uses an efficient algorithm to sort its n / P numbers. Then it remains to merge these together. That is, it remains to perform the last p merging rounds. During these rounds the numbers have to be communicated 1, 2, ..., p times. Alternatively, all numbers are routed twice for each merging round: once bringing all number with common leading bits together, once bringing the numbers with common trailing bits together. Doing this, the total number of routing rounds is only 2 * p - 1 and the routing volume is bounded by (2 * p - 1) * P. If the network is sufficiently powerful, that is, if t_v is not too large in comparison with t_i, some speed-up may be achieved, but for DMMs we will encounter much better sorting algorithms.

Odd-Even Merge Sort

Odd-even merge sort is based on the so-called odd-even merge operation. The sorting proceeds just as with the bitonic merge: 2^k numbers are sorted in k merging rounds. Asymptotically the complexity is the same as for odd-even merge, but the number of performed comparisons is slightly smaller. For small problems there is a difference of more than 20%. The easiest formulation of odd-even merge sort uses a secondary array and recursion. These are not essential, but the pattern of positions to compare is slightly less beautiful then for bitonic merge sort, which reduces its suitability for implementation on interconnection networks.

Algorithm

Assume we have an array a[] of length 2^k. Assume that the numbers in a[] from a[0] to a[2^{k - 1} - 1] are increasing, and that the numbers from a[2^{k - 1}] to a[2^k - 1] are increasing as well. Then the numbers can be made to stand in increasing order in the whole array by performing the following steps:
  oddEvenMerge(int a[], int m) {
    if (m > 2) {
      int[] ae = new int[m / 2];
      int[] ao = new int[m / 2];
      for (int i = 0; i < m / 2; i++) {
        ae[i] = a[2 * i];
        ao[i] = a[2 * i + 1]; }
      oddEvenMerge(ae, m / 2);
      oddEvenMerge(ao, m / 2);
      a[0] = ae[0];
      for (int i = 1; i < m / 2; i++)
        if (ae[i] <= ao[i - 1]) {
          a[2 * i - 1] = ae[i];
          a[2 * i] = ao[i - 1]; }
        else {
          a[2 * i - 1] = ao[i - 1];
          a[2 * i] = ae[i]; } 
      a[m - 1] = ao[m / 2 - 1]; }
    else // m == 2
      if (a[0] > a[1]) {
        int x = a[0]; a[0] = a[1]; a[1] = x; } }
This procedure integrated in a running program can be downloaded here. This program can be made considerably more efficient, by eliminating the copying operations.

We consider the same example as before:

  a[]  initially: 10 12 14 16 20 25 32 34  8 13 24 26 28 36 38 40
  ae[] initially: 10 14 20 32  8 24 28 38
  ao[] initially: 12 16 25 34 13 26 36 40
  ae[] merged:     8 10 14 20 24 28 32 38
  ao[] merged:    12 13 16 25 26 34 36 40
  a[]  finally:    8 10 12 13 14 16 20 24 25 26 28 32 34 36 38 40

Odd-Even Merge Sort

The correctness is not proven here, but using the 0-1 lemma this is not hard. The key observation is that the estimate of the position of a number in the global array can be predicted with an accuracy of 1 from the position in ae[] and ao[].

Work-Time Framework

Here we give a precise expression for the number of performed comparisons. Defining T_merge and T_sort as above, we get
T_merge(2^1) = 1,
T_merge(2^k) = 2 * T_merge(2^{k - 1}) + 2^{k - 1} - 1, for all k > 1.
The solution is
T_merge(2^k) = (k - 1) / 2 * 2^k + 1, for all k >= 1.
The expression for T_sort(2^k) is the same as before. The solution is
T_sort(2^k) = ((k - 1) * k / 4 + 1) * 2^k - 1, for all k >= 1.

The following table gives the number of comparisons performed by the bitonic and the odd-even merge sort algorithms. First the difference between the algorithms increases, then it gradually decreases again. Due to their O(log^2 n * n) complexity, neither of the algorithms should be used for large values of n. On the other hand, for small n, odd-even merge sort is close to optimal.

n 1 4 8 16 32 64 128 256
bitonic 1 6 24 80 240 672 1792 4608
odd-even 1 5 19 63 191 543 1471 3839
ratio 1 1.20 1.26 1.27 1.26 1.24 1.22 1.20

Interconnection Networks

The merge operations consist of two types of compara-swaps. In the first the data in some PU i with bit r equal to zero are compared with the data in a PU j, with j = i + 2^r. These operations can easily be performed on butterflies, hypercubes and meshes with shuffled row-major indexing. In the latter operations, the data in some PU i which has a bit pattern *...*01...1*...* are compared with the data in PU j with a pattern *...*10...0*...*. These PUs are not directly connected in butterflies and hypercubes. Also in meshes these PUs are further apart and the routing pattern is harder.

DMM

On DMMs the situation is different, as distance or the availability of connections does not play a role for them. However, even here bitonic merge sort is better, because it is no longer true that most compare-swaps can be performed internally in the PUs. In all of them the data of some PU in the lower half are compared with those of a PU in the upper half. This has a modest impact on the routing volume, but implies a considerable increase of the number of routing operations.

Sample Sort

Deterministic Sampling

The partitioning strategy was illustrated with a fast parallel merging algorithm. This strategy can also be applied to obtain a very efficient parallel sorting algorithm. Actually, this might very well be the best algorithm for sorting on DMMs. For a parallel computer with P PUs, the following is one of several good implementations of the idea, it uses a parameter x the value of which will be settled later:
  1. Each PU sorts its k = n / P numbers.
  2. Each PU selects out off its k numbers the x - 1 numbers with ranks i * k / x, for all i, 0 < i < x, as presplitters.
  3. Gossip the presplitters, so that all PUs receive all P * (x - 1). presplitters.
  4. Each PU sorts the received presplitters using P-way merge sort.
  5. Each PU picks from the set of presplitters the elements with ranks i * (x - 1), 0 < i < P, as splitters.
  6. Each PU partitions its numbers in P buckets using the splitters. Denote the set of numbers in bucket i of PU_j by B_{i, j}.
  7. Perform an all-to-all routing so that PU_j, for all j, 0 <= j < P, receives all numbers in the buckets B_{i, j}, for all i, 0 <= i < P.
  8. Each PU sorts all received numbers using P-way merge sort.

We analyze the algorithm for a DMM. The internal time consumption is dominated by the time for the several sorting and bucketing operations. It is given by c_i * O(x * log x + P * x * log P + k * log n). If we take care to select x so that P * x < k, then the time is dominated by the time for sorting k numbers. There are only two communication operations. Good performance can only be achieved if the all-to-all routing is performed in a direct way. Thus, the number of start-ups is log P + P - 1. The routing volume is (1 - 1 / P) * (P * x + k). If P * x < k, then this is dominated by the second term. So, under condition P * x = o(n / P), we get

T(P, n) = (1 + o(1)) * (T_sort(1, n / P) + c_v * n / P + c_a * P).
This is a great result, because it means that parallel sorting can be parallelized perfectly except for routing all numbers once through the network. This is the best one reasonably can hope for. The achievable speed-up mainly depends on the ratio of c_i and c_v, but of course, for very large n, the log-factor in the sorting time also plays a certain role. The speed-up is approximately given by c'_i * n * log n / (c'_i * n / P * log n + c_v * n / P) = P / (1 + c_v / (c'_i * log n)). Here c'_i is the constant so that T_sort(1, n) ~= c'_i * n * log n, which depends both on c_i and the applied sequential sorting algorithm.

In the above analysis we silently assumed that the all-to-all routing is balanced and that finally all PUs receive approximately the same number of packets. This is not automatically true. There are two limitations:

The first problem is not limited to the sorting problem. It can be overcome at some extra cost by either performing a two-phase randomized routing or by the online version of the off-line routing algorithm based on coloring a bipartite graph of reduced size.

The second problem is more specific and requires an analysis. We compute an upper bound on the number of numbers that finally may end up in a single PU. So, the question is how many numbers may lie between any two consecutive splitters. Consider two splitters s = s_{i - 1} s' = s_i. Let R_j and R'_j denote the rank of s and s', respectively, in the set of numbers initially stored in PU_j. Analogously, let r_j and r'_j denote the rank of s and s', respectively, in the set of presplitters selected by PU_j. By definition of the splitters, sum_{j = 0}^{P - 1} (r'_j - r_j) = x. The number S_j of numbers routed to PU_j is given by S_j = sum_{j = 0}^{P - 1} (R'_j - R_j). Using that in each PU exactly k / x numbers lie between any two presplitters, we get the following estimates:

S_j <= k / x * sum_{j = 0}^{P - 1} (r'_j - r_j + 1) = n / P + n / x.
S_j >= k / x * sum_{j = 0}^{P - 1} (r'_j - r_j - 1) = n / P - n / x.

So, there is an "unbalance" of O(n / x). Thus, in order to obtain the above claimed results we must take x = omega(P). Because at the same time we had the condition P * x = o(n / P), this gives the condition P^3 = o(n). This is quite a tough condition which is nevertheless satisfied for most practical values of P and n. By refining the procedure by which the splitters are selected from the presplitters, performing iterated subsampling the condition can be relaxed almost to P^2 = o(n), the typical condition which must be satisfied to obtain good performance on a DMM.

For optimal performance, x should be chosen so that the sum of all additional costs together is minimized. On the one hand there are the costs for handling the presplitters and splitters, on the other hand the costs due to an unbalanced routing and sorting. The first contribution increases with x in the same way the second decreases with x. In such cases an approximation of the optimal choice can be found by equating both cost contributions and solving x. This gives

(c_v + c_i * log P) * P * x = (c_v + c_i * log P) * n / x.
This suggests to choose x = sqrt(n / P). For this particular choice, the total "loss" term is bounded by O((c_v + c_i * log P) * sqrt(n * P)), to be compared with the "useful" term Theta((c_v + c_i * log P) * n / P).

The given algorithm does not guarantee that finally each PU holds the same number of numbers. Furthermore, even though the numbers stand sorted, their ranks are not known. This can easily be overcome: counting the number of numbers in each PU and computing the prefix sum of these numbers, the ranks of the numbers can be determined cheaply. From this it follows where they have to be stored. In the normal case that n / x < n / P, this can be achieved by a routing in which a small fraction of the numbers in PU_i has to be routed to PU_{i - 1} and/or PU_{i + 1}. Of course this correction can be performed along with the final sorting operations.

Sampling through Randomization

The above algorithm is good and can hardly be improved for a DMM context. However a simpler algorithm is efficient as well. Furthermore, this algorithm leads to the simplest optimal sorting algorithm for circular arrays, two-dimensional meshes and tori.

DMM

The algorithm is first formulated for a DMM with P PUs:
  1. Each PU creates P buckets and uniformly and independently allocates each of its k = n / P numbers to one of the buckets at random. Denote bucket i of PU_j by B_{i, j}.
  2. Perform an all-to-all routing so that PU_i, for all i, 0 <= i < P, receives all numbers in the buckets B_{i, j}, for all j, 0 <= j < P. Let k_i denote the number of numbers received by PU_i.
  3. Each PU sorts all its numbers.
  4. Each PU creates P buckets. PU_i allocates the number with rank r among its k_i numbers to bucket P * r / k_i. Denote bucket i of PU_j by B_{i, j}.
  5. Perform an all-to-all routing so that PU_i, for all i, 0 <= i < P, receives all numbers in the buckets B_{i, j}, for all j, 0 <= j < P. Let k_i denote the number of numbers received by PU_i.
  6. Each PU sorts all received numbers using P-way merge sort.
  7. Each PU exchanges some packets with the PUs with index one smaller and one larger to correct minor sorting faults and to establish the desired final situation in which each PU holds the same number of numbers.
In this algorithm there is no global sample. Rather has each PU its own sample with expected size k = n / P. On a DMM this selection method is unnecessarily expensive, but on many interconnection networks all-to-all routing requires some kind of "randomization" (possibly this is done deterministically) anyway to assure a good routing time. So, here sampling and routing are performed in an integrated way. Or said otherwise, after the randomization, each PU holds approximately a fraction 1 / P of the numbers with destination in any of the P PUs. Different from list ranking, the problem treated in the next chapter, by sorting the numbers it can be locally guessed quite accurately in which PU a number belongs. Of course this estimate is not perfect However, if k is sufficiently large, it can be almost excluded that a number does not end up in its destination PU or one of its neighbors. Furthermore, for sufficiently large k there will only be o(n) packets which do not stand in their destination PU after the second all-to-all routing. Assuming that that this is granted, the time consumption on a DMM is given by
T(P, n) = (1 + o(1)) * (T_sort(1, n / P) + c_v * 2 * n / P + c_a * 2 * P).
The routing time is doubled due to the randomization, but this assures that we do not have to consider the possibility of an unbalanced routing pattern, these are resolved automatically. On the other hand, the internal time consumption is essentially the same. Possibly it is even slightly better because no time is waisted on handling the splitters.

We now analyze the minimal value of n as a function of P to assure that the above result holds with high probability. The following points must be satisfied:

The routing condition is the lightest. The expected number of packets in B_{i, j} is n / P^2. Chernoff bounds tells us that the deviation from the expected value is negligible as soon as the expected value exceeds log n. So, we must have n / P^2 = omega(log n). The second point is satisfied as a consequence. The last point is the most interesting, because it specifically has to do with the quality of the applied sampling technique. Under the above condition the values k_i deviate insignificantly, so in the further analysis we may do as if each PU has obtained a random sample of k = n / P numbers. Consider the k numbers in one of the PUs. Let r_i, 0 <= i < k, denote the local rank of number i, that is, the rank among the k numbers in this PU. Let R_i denote the global rank of this number, that is, the rank among the n numbers distributed over all PUs. R_i is estimated as R'_i = P * r_i. This estimate is not made explicit in the algorithm, but this is the underlying justification for routing the number with rank r_i to PU_{P * r_i / k}. A number is routed to the wrong processor if round_down(R'_i / k) != round_down(R_i / k). For how many numbers this happens depends on the differences between R'_i and R_i. This deviation turns out to be maximal for the number with R_i = n / 2. For this number the expected value of r_i equals k / 2. Using the Chernoff bounds, it follows that the deviation is bounded by O(sqrt(k * log n)) with high probability. So, we may assume that any of the values |R'_i - R_i| = O(P * sqrt(k * log n)). Thus a PU receives at most this many packets that belong to a neighbor. As condition this gives k = omega(P * sqrt(k * log n)) which can be rewritten as n = omega(P^3 * log n).

There are several ways to perform the final correction. Consider PU_i and PU_{i + 1}. Both have their numbers sorted, but possibly the values of the largest numbers in PU_i are larger than those of the smallest numbers in PU_{i + 1}. The above estimate has shown that we may assume that the number of these is at most x = c * P * sqrt(k * log n). Here c is a constant which can be determined by working out the details of the estimate. The simplest solution is to let both PUs exchange the x concerned numbers. Then each of them can merge the two subarrays of length x. PU_i keeps the smallest x and PU_{i + 1} the largest x. First exchanging samples, there is no need to make copy and the data exchange can be minimized.

Mesh

On a mesh we do not assume that the PUs hold many numbers. Instead we process the numbers in a sufficiently large submesh collectively. Such a submesh can be viewed as a virtual PU. If on an n x n mesh each PU holds n numbers, then collectively processing numbers in m x m submeshes has the effect of reducing the number of PUs from n^2 to (n / m)^2 and increasing the number of numbers per PU from k to k * m^2. Because on n x n meshes the cost of routing operations increases linearly with n, all operations within the submeshes have negligible cost if n' = (n / m) = o(n). Exploiting these ideas immediately leads to a good algorithm for the case k = 1 and to an optimal algorithm for the case k >= 8.

For k = 1, a good choice is m = n^{2/3} * log^{1/6} n. Doing that, n' = n / m = n^{1/3} / log^{1/6} n. So, there are P' = n'^2 = n^{2/3} / log^{1/3} n submeshes. k' = 1 * m^2 = n^{4/3} * log^{1/3} n. Any computation within the submeshes, corresponding to steps 1, 3, 4 and 6 of the algorithm above, takes O(m) = O(n^{2/3} * log^{1/6} n) time. The choice of m is made so that P' * (k' * log n)^{1/2} = n^{2/3} / log^{1/3} n * (n^{4/3} * log^{4/3} n)^{1/2} = n^{4/3} * log^{1/3} n = k'. So, in step of the algorithm the rearrangement takes place within subsets of constantly many submeshes. This takes O(m) time as well. Step 2 and 5 amount to routing a randomization and the inverse thereof. It can easily be shown that congestion is very limited. Each routing phase can be performed in n + O(sqrt(n * log n)) time. So, in total the sorting takes 4 * n + O(n^{2/3} * log^{1/6} n) time. This is twice as much as given by the distance lower bound.

For larger k optimality can be established. Now it is better to choose m smaller in dependency of k: the best choice is m = n^{2/3} * log^{1/6} n / k^{1/6}. This means that for very large k we will take m = 1, which sounds natural. The operations in the submeshes take O(k * m) = O(k^{5/6} * n^{2/3} * log^{1/6} n) time. Again the choice is made so that P' * (k' * log n)^{1/2} = n^{2/3} * k^{1/3} / log^{1/3} n * (n^{4/3} * k^{2/3} * log^{4/3} n)^{1/2} = n^{4/3} * k^{2/3} * log^{1/3} n = k'. In this case in step 2 and 5 half of the packets are routed XY and the other half YX. This makes that in each of the routing phases only k * n / 8 packets have to cross the bisection. Thus, together these steps take k * n / 2 + o(k * n) time.

Theorem: On a two-dimensional n x n mesh, k-k sorting can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.

Sampling through Shuffling

Above we have seen how randomizing the numbers makes the selection of splitters superfluous: the distribution of the numbers in each PU may be assumed to provide a sufficiently good impression of the whole distribution to base the bucketing on. We use randomization here for establishing something that is actually much weaker: we only need that each PU holds approximately the same number of numbers with destination in the same PU. For this we do not need randomization.

DMM

The crucial idea is that sorting the numbers in a PU and then handing them out in this sorted order over the PUs gives an even smoother distribution of the numbers over the PUs than doing this randomly. More precisely, if each PU holds k = n / P numbers, these are sorted, and the number with rank r in PU_i is sent to PU_{i + r mod P}. This redistribution is called shuffling. In the context of eliminating the need for randomization from all-to-all routing algorithms idea was already discussed in the section on "Derandomization" in the chapter on "Interconnection Networks and Routing". The operation which is used for the second all-to-all routing in the above randomized algorithm, in which a number with rank r is sent to PU_{P * r / k} is called unshuffling.

Based on shuffles and unshuffles the following is a simple, versatile and efficient sorting algorithm which is first formulated for a DMM with P PUs and k = n / P numbers per PU:

  1. Sort all numbers in each PU.
  2. Shuffle the numbers.
  3. Use P-way merge sort to sort the numbers in each PU.
  4. Unshuffle the numbers.
  5. Use P-way merge sort to sort the numbers in each PU.
  6. Perform data exchange between pairs of PUs with consecutive indices to establish the desired global sorting with k numbers per PU.
The strong point of shuffles and unshuffles is that these are fixed routing patterns, for which optimal routing schedules can be computed offline (which due to their very simple nature is mostly easy). The disadvantage is one more merging.

The analysis of the algorithm goes analogously to those above. For a number x which after the sorting in step 3 has rank r in PU_i, the PU in which it currently resides, the global rank is estimated as R' = P * r. Let R denote the actual global rank of x. It is easy to put upper bounds on |R' - R|. For this we need some additional notation. Let r_j be the rank of x in PU_i among the numbers originating from PU_j. Among the numbers which originally resided in PU_j, x falls between the number with rank j + (r_j - 1) * P and the one with rank j + r_j * P. Summing over all PUs gives

P * (P - 1) / 2 + P * sum_{j = 0}^{P - 1} (r_j - 1) < R < P * (P - 1) / 2 + P * sum_{j = 0}^{P - 1} r_j
So, after step 5 there may reside at most about P^2 / 2 numbers which belong in PU_i in both PU_{i - 1} and PU_{i + 1}. For the correctness this gives the condition that k > P^2 / 2. In order that the cost of the final correction is negligible, the condition becomes k = omega(P^2). Substituting k = n / P gives the condition n = omega(P^3), the same we have encountered before.

Mesh

Above we have formulated the algorithm as a deterministic variant of a sample-sort algorithm with implicit sampling. That is one view of the algorithm, but it is not the only view. The algorithm can also be viewed as an improved variant of shearsort which works with a constant number of one-dimensional sorting phases in total. This algorithm, which will be described next is called column sort and was presented by Leighton in 1985.

The mesh is not square but has height n^2 and width n. Each PU holds one number. The numbers will be sorted in column-major order. The algorithm consists of a few simple one-dimensional operations:

  1. Sort all columns from top to bottom.
  2. Shuffle all numbers within the rows. That is, the number in PU_{i, j}, 0 <= i < n^2, 0 <= j < n, is routed to PU_{i, (i + j) mod n}.
  3. Sort all columns from top to bottom.
  4. Rotate the numbers in the columns. That is, the number in PU_{i, j}, 0 <= i < n^2, 0 <= j < n, is routed to PU_{(i + n * j) mod n^2, j}.
  5. Unshuffle all numbers within the rows, that is, the number in PU_{i, j}, 0 <= i < n^2, 0 <= j < n, is routed to PU_{i, (i / n - 2 * j) mod n}.
  6. Sort all even columns from top to bottom and all odd columns from bottom to top.
  7. Sort pairs of adjacent row positions. That is, compare and swap the values in PU_{i, 2 * j} and PU_{i, 2 * j + 1}, 0 <= i < n^2, 0 <= j < n / 2.
  8. Sort pairs of adjacent row positions. That is, compare and swap the values in PU_{i, 2 * j - 1} and PU_{i, 2 * j}, 0 < i < n^2, 0 <= j < n / 2.
  9. Sort all columns from top to bottom.

Column Sort

Without the zero-one lemma the analysis of this algorithm would be hard. So let us consider an arbitrary distribution of zeroes and ones. In several steps we show that finally this distribution has been sorted.

Lemma: Starting with a zero-one distribution, there are at most n dirty rows after step 3.

Proof: Let z_j, 0 <= j < n, be the number of zeroes in column j. The shuffling is so that any column either receives round_down{z_j / n} or round_up{z_j / n} of these zeroes. So, a column receives at least l = sum_j round_down{z_i / n} and at most h = sum_j round_up{z_i / n} zeroes. So, after step 3, in row 0, ..., l - 1 there are only zeroes, while in row h, ..., n^2 - 1 there are only ones. Only in row l, ..., h - 1 there may be both zeroes and ones. Because h - l <= n, the claimed result follows. End.

Lemma: Starting with a zero-one distribution, there are at most 2 dirty columns after step 5.

Proof: Step 4 and 5 together are so that the contents of the columns is permuted among the rows so that the smallest n numbers in each column go to column 0, and more generally that all numbers standing in row j * n, ..., (j + 1) * n - 1 are ending up in column j. Let l and h be as in the proof of the previous lemma. There are two cases to distinguish. There is a j so that j * n <= l and h < (j + 1) * n. In that case only column j receives zeroes and ones. So, column j will be the only dirty column. Otherwise, because h - l <= n, there is a j so that j * n <= l and h < (j + 2) * n. In that case column 0, ..., j - 1 receives only zeroes, column j + 2, ..., n - 1 receives only ones, and column j and j + 1 may receive both zeroes and ones. End.

Theorem: The given column-sort algorithm correctly sorts any distribution of n^3 numbers on an n^2 x n mesh.

Proof: Given the situation as described in the proof of the above lemma, with zeroes and ones only in column j and j + 1 for some j, 0 <= j < n - 1, sorting the columns up and down alternatingly followed by sorting pairs of adjacent row positions reduces the number of dirty columns to one. This is a special case of the argument for shearsort that the number of dirty columns is reduced by a factor two in each round. The final sorting rearranges the numbers in this dirty column into sorted order. End.

Column-sort can also be formulated processor independent. In this generalized form it becomes immediately clear how to carry it over to interconnection networks. The numbers are divided over n^{1/3} buckets of size n^{2/3}. The role of the PUs is taken over by the buckets.

  1. Sort the numbers in each bucket.
  2. Shuffle the numbers.
  3. Sort the numbers in each bucket.
  4. Unshuffle the numbers.
  5. Sort the numbers in each bucket.
  6. Perform data exchange between pairs of buckets with consecutive indices to establish the desired global sorting with n^{2/3} numbers per bucket.
Sorting n numbers can be performed by a combination of a constant number of operations in which n^{1/3} buckets with n^{2/3} numbers each are sorted and some regular rearrangements of the elements between these buckets.

Theorem: On a 2-dimensional n x n mesh with k >= 8 numbers per PU, sorting can be performed in k * n / 2 + o(n) steps.

Proof: The buckets can be chosen to consist of the k^{2/3} * n^{4/3} numbers residing in each n' x n' submesh for n' = n^{2/3} / k^{1/6}. Sorting these submeshes can be performed with any sorting algorithm which achieves the optimal time order in O(k * n') = O(k^{5/6} * n^{2/3}). The final correction also has this time order. The shuffle and unshuffle send only half of the packets over the bisection and when exploiting the all-port facility they can be performed in k * n / 4 steps each. Thus, the whole sorting can be performed in k * n / 2 + O(k^{5/6} * n^{2/3}) steps. End.

The number of steps is the same as for routing a k-k distribution except for the lower-order term. Thus, we have obtained an optimal sorting algorithm because sorting is at least as hard as routing and k * n / 2 is a lower bound for k-k routing. Comparing with the presented randomized algorithm we see that the performance is almost the same. The lower-order term of the randomized algorithm is asymptotically marginally larger: distributing information regularly is better than randomizing.

Constant-Time Sort

We have seen several good sorting algorithms but the best achieved time remains O(log n). It is an interesting question to consider whether sorting, and other problems, can be performed in constant time and how much hardware is needed for that.

PRAM

On a Common CRCW PRAM an array a[] of length n can be sorted in constant time using n * n! PUs. The algorithm is similar to the constant-time algorithm for computing the maximum of n numbers. Unfortunately, for sorting there is no way to make the algorithm efficient.

The n! permutations can be enumerated systematically, and permutation p, 0 <= p < n!, is denoted pi_p. The PUs are indexed PU_{p, i}, 0 <= p < n!, 0 <= i < n. The algorithm is simple:

  1. for (all p, 0 <= p < n!, using  PU_{p, 0}) do
       b[p] = true;
  2. for (all p, i, 0 <= p < n!, 0 <= i < n - 1, using PU_{p, i}) do
       if ( a[p[i]]  > a[p[i + 1]] || 
           (a[p[i]] == a[p[i + 1]] && p[i] > p[i + 1]))
         b[i] = false;
  3. for (all p, i, 0 <= p < n!, 0 <= i < n, using PU_{p, i}) do
       if (b[p]) {
         c[p[i]] = a[i];
         a[i] = c[i]; }
In the last step it is essentially used that a PRAM is synchronous. On a non-synchronous machine the two instructions inside the if-clause should be separated by a synchronization to assure that the instruction c[p[i]] = a[i] has been executed before the instruction a[i] = c[i] is executed.

Reconfigurable Array

The above algorithm is fast and simple but extremely inefficient. The main importance is to show that in principle sorting can be performed in constant time. Can sorting be performed in constant time with a polynomial number of PUs? The answer is yes: in an earlier chapter we have seen how to sort n numbers on a reconfigurable processor array with O(n^3) PUs in 6 steps.

Using O(n^3) hardware to sort n numbers is not particularly efficient. By how much can this be reduced while still bounding the sorting time to O(1)? We will see that O(n^2) hardware is enough. This is optimal because performing a permutation in O(1) requires a bisection bandwidth of n, while the grid also has a basis of length n.

The main conclusion from column-sort is that sorting n numbers can be performed by several times sorting O(n^{2/3}) numbers and several rearrangements. Of course each of these sortings themselves can also be performed using column-sort. So, the idea for sorting n numbers on a reconfigurable n x n mesh is to recursively apply column sort. The recursion is terminated as soon as the size n' of the subproblem to solve satisfies n'^2 < n, because then it can be solved in constant time using the presented algorithm for sorting on reconfigurable arrays. Let n_i, i >= 0, denote the size of the subproblems at level i of the recursion. n_0 = n, n_1 = n^{2/3} and n_2 = n^{4/9}. So, the recursion tree has constant depth. Because the number of subproblems increases only by a constant factor between two levels of the recursion, there are in total only constantly many subproblems of size n^{4/9} to solve. We conclude

Theorem: On a reconfigurable processor array with n^2 PUs, n numbers can be sorted in O(1) time.

Exercises

  1. Give a transposition-sort algorithm for a linear array with an odd number n of PUs. How many steps do you need? Prove the correctness formally. Hint: use the 0-1 sorting lemma and induction.

  2. Consider sorting n^2 numbers on an n x n mesh. Give an algorithm running in O(n) time. The final arrangement can be freely chosen but must be fixed.

  3. Consider sorting n numbers on a k-dimensional full-port hypercube with P = 2^k PUs. Present an algorithm running in O(n / P) time. What is the minimum n for which your algorithm achieves this time order?

  4. Generalize the shearsort algorithm for three-dimensional meshes. How long does your algorithm take to sort n^3 numbers on an n x n x n mesh?

  5. Consider performing shearsort on the k-dimensional hypercube H_k. Distinguish two interpretations of such a hypercube: it can either be viewed as two interconnected copies of H_{k - 1} or as the direct product H_{k / 2} x H_{k / 2}. Give for both interpretations a rough estimate of the time consumption.

  6. Consider sorting n^2 numbers on a DMM with P PUs using a variant of shearsort. Assume that n >= P. The mesh can be mapped on to the DMM in several ways: For each of these views formulate the costs in the DMM model.

  7. In the shearsort algorithm, the only purpose of the sorting in two-directions in the rows is to spread the values out. This can much more effectively be achieved by randomization. Analyze a variant of shearsort in which the sorting in the rows is replaced by performing a random permutation. A random permutation can efficiently be generated by choosing random numbers and sorting on them. The analysis should be based on the 0-1 sorting lemma. You may use that when n times picking white and black balls which are white uniformly with probability 1 / 2, that then the probability that the number of selected white balls exceeds n / 2 + h is bounded by e^{-h^2 / (c * n)}, for some small constant c.

  8. The sorting algorithm based on first randomizing all numbers among the PUs, which was analyzed above for a mesh with k >= 8 numbers per PU, can also be used for the case that each PU holds exactly one number. Specify the time consumption of the algorithm for an n x n mesh.

  9. In the text above the performance of bitonic merge sort was analyzed for linear arrays and two-dimensional meshes. Perform an analogous analysis for 3-dimensional n x n x n meshes for some n = 2^k. Most interesting is the constant of the term which is linear in n. How does this constant develop when the algorithm is applied for d-dimensional n x ... x n meshes?

  10. Rewrite the method oddEvenMerge() in the given implementation of odd-even merge sort so that it becomes insitu and iterative. Hint: first study the above picture showing how 16 numbers are sorted. Compare the new version with the original one and with the implementation of bitonic merge sort, for a suitable range of problem sizes.

  11. The presented three "sample-sort" algorithms (one with explicit sample, one randomizing the numbers and one shuffling them) can also be used for sorting on a hypercube with n PUs and one number per PU.

  12. Consider sorting n numbers on a reconfigurable array with n^2 PUs. How many steps does it take exactly when performing column-sort recursively? Assume that n is so large that c * n^{8/9} <= n for any constant c you may need and you may ignore problems of divisibility.

  13. Assume that reconfigurable switches are relatively expensive, much more expensive than providing the same bisection bandwidth by conventional means. Then it may make sense to build a hybrid system: consisting of a reconfigurable grid which is used for computational purposes and some other network which is used for massive communication. We want to use this network for sorting n numbers on a system with n PUs. The time for the rearrangements is ignored. How big, in terms of n, must the reconfigurable array be in order to assure O(1) sorting time?


This page was created by Jop Sibeyn.
Last update Monday, 06 September 04 - 09:19.
For any comments: send an email.