for (t = 0; t < n / 2; t++)
{
for (i = 0; i < n / 2; i++) // Even
compare_swap(2 * i, 2 * i + 1);
for (i = 1; i < n / 2; i++) // Odd
compare_swap(2 * i - 1, 2 * i);
}
The correctness of the algorithm can be proven by induction. The idea underlying the argument is that the largest number, once it starts moving never gets delayed. It reaches its destination after at most n - 1 steps. The second largest number may be swapped with the largest number if the first step, and then it has to wait one step. Hereafter it will never be delayed anymore, so it arrives no later then after n steps. The operations in each of the inner loops can be executed in parallel. On an interconnection network, the simplest is to implement a compare-swap operation by exchanging the numbers to be compared and to discard one of them in each of the two involved PUs. So, on a linear array, the algorithm can be implemented with n steps in which each PU sends at most one number to a neighbor. Odd-even transposition sort is simple but inefficient: as a sequential algorithm it has complexity Theta(n^2) for sorting n numbers.
T(P, n) = c_i * O(n / P * log n + n) + c_v * n + c_a * P.For small P and very large n, the term c_i * O(n / P * log n) may dominate. In that case the algorithm gives good performance, but this applies only for n which are exponential in P. In the following we will see that good performance can be achieved for much smaller values of n as a function of P.
In the DMM model, the structure of the interconnection network is ignored. On interconnection networks, there is a structure, and the indexing scheme has impact on the efficiency with which sorting can be performed. Generally speaking, it is convenient if PUs for which the indices do not differ much also lie close to each other in the network. Nevertheless, sorting with respect to any indexing scheme which is fixed in advance is at least as hard as performing a k-k routing: if the hardest permutation pi routes a packet from PU_i to PU_j, then when sorting the value of the corresponding number in PU_i is set to j and this number will have to be routed from PU_i to PU_j in order to perform the sorting. The indexing scheme may have importance, but it should be noticed that the difference between the best and the worst scheme is limited: first sorting according to the best indexing scheme and then routing the numbers so that they come to stand according to the desired scheme costs at most twice as much as just sorting.
The simplest algorithm for sorting on a mesh is obtained by embedding a linear array into it. This algorithm sorts n^2 numbers on an n x n mesh in n^2 steps. This poor performance is due to the fact that the two-dimensional structure of the mesh is ignored. Ideally, if a problem can be solved in O(n) time on a linear array with n PUs it can also be solved on an n x n mesh in O(n) time. A natural way to obtain mesh algorithms is by combining operations on rows and columns. For sorting, it appears natural to consider algorithms which consist of an alternation of one-dimensional sorting phases, using odd-even transposition sort.
The shearsort algorithm, consists of an alternation of operations in which all columns are sorted in increasing order from top to bottom, and operations in which the even rows, that is the rows with index 0, 2, ..., n - 2, are sorted from left to right while the odd rows, with index 1, 3, ..., n - 1, are sorted from left to right. Only in the final sorting of the rows, the sorting direction can be chosen freely. Performing this operation as all others, the numbers are sorted in snake-like row-major order. When sorting all rows increasingly from left to right we get row-major order. The first alternative gives the following extremely simple algorithm:
for (t = 0; t <= t_max(n); t++)
{
sort all columns from top to bottom;
sort even rows from left to right and odd rows from right to left;
}
If all PUs hold one number, each pass of the loop takes 2 * n steps, so
the whole algorithm runs in 2 * n * (t_max(n) + 1) steps. The question
is how large t_max(n) should be chosen to guarantee that any
distribution gets sorted. The analysis is based on two lemmas.
0-1 Sorting Lemma: If a comparison-based sorting algorithm correctly sorts any distribution consisting of 0-s and 1-s, it correctly sorts any distribution with arbitrary numbers.
Proof: The proof goes by contradiction. Assume there is a set of numbers X = {x_0, ..., x_{n - 1}} for which the algorithm is not correct. Let pi'() be the inverse of the permutation realized by the algorithm. That is, afterwards the numbers have been rearranged to x_{pi'(0)}, ..., x_{pi'(n - 1)}. Let pi() be the inverse of the permutation giving a correct sorting of the numbers. That is, pi() is so that x_{pi(0)} <= x_{pi(1)} <= ... <= x_{pi(n - 1)}. Let k be the smallest index so that x_{pi'(k)} != x_{pi(k)}. All numbers with values smaller than x_{pi(k)} belong to the set {x_{pi(0)}, ..., x_{pi(k - 1)}} = {x_{pi'(0)}, ..., x_{pi'(k - 1)}}, so we must have x_{pi'(k)} > x_{pi(k)}. Because pi() is a permutation, this implies that there also must be an index l > k, so that x_{pi'(l)} = x_{pi(k)}. Now define the set of 0-1 values X' = {x'_0, ..., x_{n - 1}} by x'_i = 0 if x_i <= x_{pi(k)}, x'_i = 1 if x_i > x_{pi(k)}. Because for all 0 <= i, j < n, x_i < x_j => x'_i < x_j, the algorithm works on X' as it did on X (except that possibly some elements which now have the same value do not get exchanged). So, on X', the algorithm will produce the following sequence of values as output: ..., x'_{pi'(k)}, ..., x'_{pi'(l)}, ... = ..., 1, ..., 0, ... . That is, at position k there occurs a 1 while at position l > k there occurs a 0: the sorting is not correct, a contradiction with the assumption that it correctly sorts all 0-1 distributions. End.
A row is called clean if it only contains 0-s or only 1-s. Otherwise a row is called dirty. For the analysis it is convenient to rewrite the algorithm as follows:
sort all columns from top to bottom;
for (t = 0; t < t_max(n); t++)
{
sort even rows from left to right and odd rows from right to left;
sort all columns from top to bottom;
}
sort even rows from left to right and odd rows from right to left;
Hereafter we will refer to this variant.
Lemma: For any 0-1 number distribution on an n x n mesh, after t, 0 <= t <= log n, passes through the loop in shearsort there are still at most n / 2^t dirty rows.
Proof: The purpose of the first sorting of the columns is that it establishes the property that for all i, 0 <= i < n - 1, the number of 0-s in row is at least as large as in row i + 1. This is true because if row i + 1 would hold more 0-s than row i, there would be a position in row i with a 1 where row i + 1 has a 0, contradicting the sorting of the columns.
For t >= 0 we use induction. For t = 0 the claim holds. So, assume that the claim holds for some t >= 0. Sorting the clean rows does not change anything. So, we may focus on the n_t dirty rows. Let z_i, 0 <= i < n_t, denote the number of 0-s in the i-th of these rows. z_i is monotonously decreasing. Let m_0 = |{0 <= i < n_t| z_i >= n / 2}| and let m_1 = n_t - m_0. We claim that after pass t + 1, at least round_down(m_0 / 2) of these rows only contain 0-s, and at least round_down(m_1 / 2) only 1-s. This is a direct consequence of the the monotonicity of z_i and the fact that even rows are sorted differently from odd rows. Thus, any two consecutive rows with z_i >= n / 2 lead to one row with only 0-s. There are several cases to distinguish. If n_t is odd, then either m_0 is even or m_1 is even and thus round_down(m_0 / 2) + round_down(m_1 / 2) = n_t / 2. If n_t is even, then if m_0 and m_1 are even, round_down(m_0 / 2) + round_down(m_1 / 2) = (m_0 + m_1) / 2 = n_t / 2. Only the case that both m_0 and m_1 are odd deserves special consideration. So consider row m_0, which is sorted from right to left and row m_0 + 1 which is sorted from left to right. If z_{m_0} + z_{m_0 + 1} >= n, then this results in one more row with only 0-s, otherwise in one more row with only 1-s. So, in this case at least round_down(m_0 / 2) + round_down(m_1 / 2) + 1 = n_t / 2 new clean rows are generated. End.
Theorem: Shearsort sorts any distribution of n^2 numbers on an n x n mesh in (2 * log n + 2) * n steps.
The time consumption can be improved by almost a factor two, by realizing that the sorting in the columns does not need to be performed for n steps in the later passes of the loop: if there are only n_t rows which might contain 0-s and 1-s, it suffices to perform n_t steps. Thus, all sorting in the columns can be reduced to n + n + n / 2 + ... + 2 = 3 * n - 2. So, the total sorting time can be reduced to (log n + 4) * n - 2 steps. Even this is not such a good result, but it might serve as basis case of a recursive algorithm. The question is whether it is better than the result obtained by trivially embedding a linear array. The times that should be compared are given in the following table, the results are not very impressive:
| n | 2 | 4 | 8 | 16 |
| Transposition Sort | 4 | 16 | 64 | 256 |
| Shearsort | 8 | 22 | 54 | 126 |
The algorithm is very simple. Initially x may lie in the whole interval ranging from 0 to n. This interval is reduced in several rounds. In each round the remaining interval is divided in P intervals I_i, 0 <= i < P, of approximately equal size. PU_i, 0 <= i < P, is in charge of I_i. Each of the PUs tests whether x lies in its interval. Assuming that the values are different (otherwise the smallest should be taken) there is a unique interval in which x lies. The PU in charge of this interval writes its index to a global variable c. All PUs read this variable and can start the next round. On an CREW PRAM, each round can be performed in O(1) time. So, the whole searching algorithm runs in O(log_P n).
Theorem: On a CREW PRAM with P PUs, an element x can be ranked in a sorted set of size n in O(log_P n) time.
An important special case is that P = n^eps for some eps > 0. In that case log_P n = 1 / eps = O(1). So, with sufficiently many PUs an element can be ranked in constant time.For two sorted arrays, a[] of length m and b[] of length n, let rank(a, b) denote the array (r_0, ..., r_{m - 1}) of length m with r_i = rank(a[i], b), the rank of a[i] in b[]. Assume that we have a CREW PRAM with n PUs, then if m = O(n^{1 - eps}), for some eps > 0, rank(a, b) can be determined in constant time, because for each of the m elements of a[] we can use n^eps PUs to perform the ranking.
Again we apply the partitioning strategy. For merging two sorted arrays a[] and b[] each of length n with n PUs, we select the subarrays a'[] and b'[] of length sqrt(n) with a'[i] = a[sqrt(n) * i] and b'[i] = b[sqrt(n) * i] (it is convenient to add one more element to each subarray with value infinity). With the available PUs rank(a', b) and rank(b', a) can be computed in constant time. The resulting partitioning reduces the merging problem to merging subarrays of a[] and b[] of length at most sqrt(n). Before we reduced the problem in one stroke to subproblems which could be solved sequentially, now we continue recursively until the size has become constant. Each phase of the recursion takes O(1), so the whole merging takes O(loglog n) time.
Lemma: On a CREW PRAM two sorted arrays of length n each can be merged in O(loglog n) time with linear work.
Proof: Above we have shown how to merge two sorted arrays a[] and b[] of length n on a CREW PRAM with n PUs in O(loglog n) time. The work of this algorithm is O(n * loglog n). This algorithm can be made work-optimal by some kind of partitioning. So, we first determine subarrays a'[] and b'[] each of length n / loglog n by taking a'[i] = a[i * loglog n] and b'[i] = b[i * loglog n]. a'[] and b'[] are merged with n / loglog n PUs in O(loglog n) time. Just as in all previous partitionings, as a result it remains to merge sections of a[] and b[] each of length at most loglog n. With one PU for each subproblem, each of these can be solved with the sequential algorithm in O(loglog n) time. End.
Using the above fast and efficient merging as a subroutine in a merge sort algorithm gives
Theorem: On a CREW PRAM with n / loglog n PUs n numbers can be sorted in O(log n * loglog n) time.
O(log n * loglog n) is not the best possible result: sorting can be solved in O(log n) time and O(n * log n) work on an EREW PRAM. This optimal algorithm, which is known in the literature as Cole's merge-sort algorithm, is considerably more complicated and practically a gap of O(loglog n) is hardly significant. A more serious drawback of the presented algorithm is that it is based on the parallel-search algorithm which really needs the CREW model. The parallel quick-sort algorithm can also be refined to run in O(log n) time with high probability.
void bitonicMergeUp(int[] a, int n) {
for (int s = n >> 1; s > 0; s = s >> 1)
for (int i = 0; i < n; i += 2 * s)
for (int j = i; j < i + s; j++)
compareSwapUp(a, j, j + s); }
Here compareSwapUp(a, i, j), compares the values of a[i] and a[j]
and exchanges them if a[i] > a[j]. The procedure bitonicMergeDn() is
analogous. These procedures integrated in a running program can be
downloaded here.
The correctness is not proven here. The invariant that must be proven is that after round r, k >= r >= 0, all numbers in a subset of size 2^r with bit r equal to 0 are smaller than all numbers in the corresponding subset with bit r equal to 1. Because later comparisons are performed only within the subsets, the earlier established properties are preserved. Because all properties together imply a sorted sequence, proving the invariant implies the correctness.
T_merge(2^k) = k / 2 * 2^k, for all k >= 1.
For T_sort(2^k), the time to sort 2^k numbers based on this merge operation, we get
T_sort(2^1) = 1,Using induction it is easy to verify, that the solution is given by
T_sort(2^k) = 2 * T_sort(2^{k - 1}) + T_merge(2^k), for all k > 1.
T_sort(2^k) = k * (k + 1) / 4 * 2^k, for all k >= 1.Another important point is the depth of the schedule. For bitonic merge sort, this is simply sum_{i = 1}^k i = k * (k + 1) / 2 = T_sort(2^k) / (2^k / 2).
Theorem: Using bitonic merge sort for sorting n numbers, the work is Theta(n * log^2 n). The time is Theta(log^2 n).
Bitonic merge sort is very suitable for implementation on a hypercube: in all steps number a[i] is compared with numbers a[j], where j = i +- 2^r, for some r, 0 <= r < k. This means that, when mapping the numbers of a[] in a straightforward way to the 2^k PUs of a k-dimensional hypercube, all communication takes place between neighbors, and that thus all steps can be performed in constant time.
Theorem: Using bitonic merge sort, 2^k numbers can be sorted in Theta(k^2) time on a k-dimensional hypercube.
Asymptotically this is not the best achievable, but the algorithm is extremely simple and, as seen above, the constants hidden in the Theta-notation are very small.Surprisingly bitonic merge sort is asymptotically optimal on linear arays and meshes. The analysis is most easy for linear arrays. Merging two subsequences standing each in m consecutive PUs takes m + m / 2 + + ... + 1 = 2 * m - 1 steps. So, for a linear array with n = 2^k PUs all merging rounds together take sum_{0 <= s < k} (2 * 2^s - 1) = 2 * sum_{0 <= s < k} 2^s - k = 2 * n - log n - 2 < 2 * n steps. This is not bad at all! Of course, for linear arrays odd-even transposition sort performs fewer routing steps, but at the expense of far more comparisons.
On a two-dimensional n x n mesh n^2 numbers can be sorted in 2 * n + o(n) steps, which is optimal, but the algorithm is complicated and the lower-order term makes this algorithm unsuitable for moderate values of n. How about bitonic merge sorting? It is essential that the PUs are indexed with the shuffled row-major indexing. In that case the distance between PU i and j with j = i +- 2^r, for some r, 0 <= r < k, is exactly 2^{round_down{r / 2}}. Because such PU i and j lie in the same row or column, the routing can be performed without delay. Thus, merging two subsequences of length 2^s each takes 1 + 1 + 2 + 2 + ... + 2^round_down{(s - 1) / 2} + 2^round_down{s / 2} < 4 * 2^{s / 2}. For the number of steps T(n) for the whole sorting this gives T(n) < 4 * sum_{0 <= s < k} 2^{s / 2} < 8 * sum_{0 <= s < k / 2} 2^s < 8 * n. It is worthwhile to compute T(n) exactly. T(2) = 1 + (1 + 1) = 3. T(4) = T(2) + (2 + 1 + 1) + (2 + 2 + 1 + 1) = 13. T(8) = T(4) + (4 + 2 + 2 + 1 + 1) + (4 + 4 + 2 + 2 + 1 + 1) = 37. More generally, T(n) = T(n / 2) + 3 1/2 * n - 4. The solution is T(n) = 7 * n - 4 * log n - 7. As noticed above, asymptotically this is not competitive, but for small n it might be the best achievable. It is better than transposition sort and shearsort for all n.
oddEvenMerge(int a[], int m) {
if (m > 2) {
int[] ae = new int[m / 2];
int[] ao = new int[m / 2];
for (int i = 0; i < m / 2; i++) {
ae[i] = a[2 * i];
ao[i] = a[2 * i + 1]; }
oddEvenMerge(ae, m / 2);
oddEvenMerge(ao, m / 2);
a[0] = ae[0];
for (int i = 1; i < m / 2; i++)
if (ae[i] <= ao[i - 1]) {
a[2 * i - 1] = ae[i];
a[2 * i] = ao[i - 1]; }
else {
a[2 * i - 1] = ao[i - 1];
a[2 * i] = ae[i]; }
a[m - 1] = ao[m / 2 - 1]; }
else // m == 2
if (a[0] > a[1]) {
int x = a[0]; a[0] = a[1]; a[1] = x; } }
This procedure integrated in a running program can be downloaded here. This program can be made
considerably more efficient, by eliminating the copying operations.
We consider the same example as before:
a[] initially: 10 12 14 16 20 25 32 34 8 13 24 26 28 36 38 40 ae[] initially: 10 14 20 32 8 24 28 38 ao[] initially: 12 16 25 34 13 26 36 40 ae[] merged: 8 10 14 20 24 28 32 38 ao[] merged: 12 13 16 25 26 34 36 40 a[] finally: 8 10 12 13 14 16 20 24 25 26 28 32 34 36 38 40
The correctness is not proven here, but using the 0-1 lemma this is not hard. The key observation is that the estimate of the position of a number in the global array can be predicted with an accuracy of 1 from the position in ae[] and ao[].
T_merge(2^1) = 1,The solution is
T_merge(2^k) = 2 * T_merge(2^{k - 1}) + 2^{k - 1} - 1, for all k > 1.
T_merge(2^k) = (k - 1) / 2 * 2^k + 1, for all k >= 1.The expression for T_sort(2^k) is the same as before. The solution is
T_sort(2^k) = ((k - 1) * k / 4 + 1) * 2^k - 1, for all k >= 1.
The following table gives the number of comparisons performed by the bitonic and the odd-even merge sort algorithms. First the difference between the algorithms increases, then it gradually decreases again. Due to their O(log^2 n * n) complexity, neither of the algorithms should be used for large values of n. On the other hand, for small n, odd-even merge sort is close to optimal.
| n | 1 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
| bitonic | 1 | 6 | 24 | 80 | 240 | 672 | 1792 | 4608 |
| odd-even | 1 | 5 | 19 | 63 | 191 | 543 | 1471 | 3839 |
| ratio | 1 | 1.20 | 1.26 | 1.27 | 1.26 | 1.24 | 1.22 | 1.20 |
We analyze the algorithm for a DMM. The internal time consumption is dominated by the time for the several sorting and bucketing operations. It is given by c_i * O(x * log x + P * x * log P + k * log n). If we take care to select x so that P * x < k, then the time is dominated by the time for sorting k numbers. There are only two communication operations. Good performance can only be achieved if the all-to-all routing is performed in a direct way. Thus, the number of start-ups is log P + P - 1. The routing volume is (1 - 1 / P) * (P * x + k). If P * x < k, then this is dominated by the second term. So, under condition P * x = o(n / P), we get
T(P, n) = (1 + o(1)) * (T_sort(1, n / P) + c_v * n / P + c_a * P).This is a great result, because it means that parallel sorting can be parallelized perfectly except for routing all numbers once through the network. This is the best one reasonably can hope for. The achievable speed-up mainly depends on the ratio of c_i and c_v, but of course, for very large n, the log-factor in the sorting time also plays a certain role. The speed-up is approximately given by c'_i * n * log n / (c'_i * n / P * log n + c_v * n / P) = P / (1 + c_v / (c'_i * log n)). Here c'_i is the constant so that T_sort(1, n) ~= c'_i * n * log n, which depends both on c_i and the applied sequential sorting algorithm.
In the above analysis we silently assumed that the all-to-all routing is balanced and that finally all PUs receive approximately the same number of packets. This is not automatically true. There are two limitations:
The second problem is more specific and requires an analysis. We compute an upper bound on the number of numbers that finally may end up in a single PU. So, the question is how many numbers may lie between any two consecutive splitters. Consider two splitters s = s_{i - 1} s' = s_i. Let R_j and R'_j denote the rank of s and s', respectively, in the set of numbers initially stored in PU_j. Analogously, let r_j and r'_j denote the rank of s and s', respectively, in the set of presplitters selected by PU_j. By definition of the splitters, sum_{j = 0}^{P - 1} (r'_j - r_j) = x. The number S_j of numbers routed to PU_j is given by S_j = sum_{j = 0}^{P - 1} (R'_j - R_j). Using that in each PU exactly k / x numbers lie between any two presplitters, we get the following estimates:
S_j <= k / x * sum_{j = 0}^{P - 1} (r'_j - r_j + 1) = n / P + n / x.
S_j >= k / x * sum_{j = 0}^{P - 1} (r'_j - r_j - 1) = n / P - n / x.
So, there is an "unbalance" of O(n / x). Thus, in order to obtain the above claimed results we must take x = omega(P). Because at the same time we had the condition P * x = o(n / P), this gives the condition P^3 = o(n). This is quite a tough condition which is nevertheless satisfied for most practical values of P and n. By refining the procedure by which the splitters are selected from the presplitters, performing iterated subsampling the condition can be relaxed almost to P^2 = o(n), the typical condition which must be satisfied to obtain good performance on a DMM.
For optimal performance, x should be chosen so that the sum of all additional costs together is minimized. On the one hand there are the costs for handling the presplitters and splitters, on the other hand the costs due to an unbalanced routing and sorting. The first contribution increases with x in the same way the second decreases with x. In such cases an approximation of the optimal choice can be found by equating both cost contributions and solving x. This gives
(c_v + c_i * log P) * P * x = (c_v + c_i * log P) * n / x.This suggests to choose x = sqrt(n / P). For this particular choice, the total "loss" term is bounded by O((c_v + c_i * log P) * sqrt(n * P)), to be compared with the "useful" term Theta((c_v + c_i * log P) * n / P).
The given algorithm does not guarantee that finally each PU holds the same number of numbers. Furthermore, even though the numbers stand sorted, their ranks are not known. This can easily be overcome: counting the number of numbers in each PU and computing the prefix sum of these numbers, the ranks of the numbers can be determined cheaply. From this it follows where they have to be stored. In the normal case that n / x < n / P, this can be achieved by a routing in which a small fraction of the numbers in PU_i has to be routed to PU_{i - 1} and/or PU_{i + 1}. Of course this correction can be performed along with the final sorting operations.
T(P, n) = (1 + o(1)) * (T_sort(1, n / P) + c_v * 2 * n / P + c_a * 2 * P).The routing time is doubled due to the randomization, but this assures that we do not have to consider the possibility of an unbalanced routing pattern, these are resolved automatically. On the other hand, the internal time consumption is essentially the same. Possibly it is even slightly better because no time is waisted on handling the splitters.
We now analyze the minimal value of n as a function of P to assure that the above result holds with high probability. The following points must be satisfied:
There are several ways to perform the final correction. Consider PU_i and PU_{i + 1}. Both have their numbers sorted, but possibly the values of the largest numbers in PU_i are larger than those of the smallest numbers in PU_{i + 1}. The above estimate has shown that we may assume that the number of these is at most x = c * P * sqrt(k * log n). Here c is a constant which can be determined by working out the details of the estimate. The simplest solution is to let both PUs exchange the x concerned numbers. Then each of them can merge the two subarrays of length x. PU_i keeps the smallest x and PU_{i + 1} the largest x. First exchanging samples, there is no need to make copy and the data exchange can be minimized.
For k = 1, a good choice is m = n^{2/3} * log^{1/6} n. Doing that, n' = n / m = n^{1/3} / log^{1/6} n. So, there are P' = n'^2 = n^{2/3} / log^{1/3} n submeshes. k' = 1 * m^2 = n^{4/3} * log^{1/3} n. Any computation within the submeshes, corresponding to steps 1, 3, 4 and 6 of the algorithm above, takes O(m) = O(n^{2/3} * log^{1/6} n) time. The choice of m is made so that P' * (k' * log n)^{1/2} = n^{2/3} / log^{1/3} n * (n^{4/3} * log^{4/3} n)^{1/2} = n^{4/3} * log^{1/3} n = k'. So, in step of the algorithm the rearrangement takes place within subsets of constantly many submeshes. This takes O(m) time as well. Step 2 and 5 amount to routing a randomization and the inverse thereof. It can easily be shown that congestion is very limited. Each routing phase can be performed in n + O(sqrt(n * log n)) time. So, in total the sorting takes 4 * n + O(n^{2/3} * log^{1/6} n) time. This is twice as much as given by the distance lower bound.
For larger k optimality can be established. Now it is better to choose m smaller in dependency of k: the best choice is m = n^{2/3} * log^{1/6} n / k^{1/6}. This means that for very large k we will take m = 1, which sounds natural. The operations in the submeshes take O(k * m) = O(k^{5/6} * n^{2/3} * log^{1/6} n) time. Again the choice is made so that P' * (k' * log n)^{1/2} = n^{2/3} * k^{1/3} / log^{1/3} n * (n^{4/3} * k^{2/3} * log^{4/3} n)^{1/2} = n^{4/3} * k^{2/3} * log^{1/3} n = k'. In this case in step 2 and 5 half of the packets are routed XY and the other half YX. This makes that in each of the routing phases only k * n / 8 packets have to cross the bisection. Thus, together these steps take k * n / 2 + o(k * n) time.
Theorem: On a two-dimensional n x n mesh, k-k sorting can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.
Based on shuffles and unshuffles the following is a simple, versatile and efficient sorting algorithm which is first formulated for a DMM with P PUs and k = n / P numbers per PU:
The analysis of the algorithm goes analogously to those above. For a number x which after the sorting in step 3 has rank r in PU_i, the PU in which it currently resides, the global rank is estimated as R' = P * r. Let R denote the actual global rank of x. It is easy to put upper bounds on |R' - R|. For this we need some additional notation. Let r_j be the rank of x in PU_i among the numbers originating from PU_j. Among the numbers which originally resided in PU_j, x falls between the number with rank j + (r_j - 1) * P and the one with rank j + r_j * P. Summing over all PUs gives
P * (P - 1) / 2 + P * sum_{j = 0}^{P - 1} (r_j - 1) < R < P * (P - 1) / 2 + P * sum_{j = 0}^{P - 1} r_jSo, after step 5 there may reside at most about P^2 / 2 numbers which belong in PU_i in both PU_{i - 1} and PU_{i + 1}. For the correctness this gives the condition that k > P^2 / 2. In order that the cost of the final correction is negligible, the condition becomes k = omega(P^2). Substituting k = n / P gives the condition n = omega(P^3), the same we have encountered before.
The mesh is not square but has height n^2 and width n. Each PU holds one number. The numbers will be sorted in column-major order. The algorithm consists of a few simple one-dimensional operations:
Lemma: Starting with a zero-one distribution, there are at most n dirty rows after step 3.
Proof: Let z_j, 0 <= j < n, be the number of zeroes in column j. The shuffling is so that any column either receives round_down{z_j / n} or round_up{z_j / n} of these zeroes. So, a column receives at least l = sum_j round_down{z_i / n} and at most h = sum_j round_up{z_i / n} zeroes. So, after step 3, in row 0, ..., l - 1 there are only zeroes, while in row h, ..., n^2 - 1 there are only ones. Only in row l, ..., h - 1 there may be both zeroes and ones. Because h - l <= n, the claimed result follows. End.
Lemma: Starting with a zero-one distribution, there are at most 2 dirty columns after step 5.
Proof: Step 4 and 5 together are so that the contents of the columns is permuted among the rows so that the smallest n numbers in each column go to column 0, and more generally that all numbers standing in row j * n, ..., (j + 1) * n - 1 are ending up in column j. Let l and h be as in the proof of the previous lemma. There are two cases to distinguish. There is a j so that j * n <= l and h < (j + 1) * n. In that case only column j receives zeroes and ones. So, column j will be the only dirty column. Otherwise, because h - l <= n, there is a j so that j * n <= l and h < (j + 2) * n. In that case column 0, ..., j - 1 receives only zeroes, column j + 2, ..., n - 1 receives only ones, and column j and j + 1 may receive both zeroes and ones. End.
Theorem: The given column-sort algorithm correctly sorts any distribution of n^3 numbers on an n^2 x n mesh.
Proof: Given the situation as described in the proof of the above lemma, with zeroes and ones only in column j and j + 1 for some j, 0 <= j < n - 1, sorting the columns up and down alternatingly followed by sorting pairs of adjacent row positions reduces the number of dirty columns to one. This is a special case of the argument for shearsort that the number of dirty columns is reduced by a factor two in each round. The final sorting rearranges the numbers in this dirty column into sorted order. End.
Column-sort can also be formulated processor independent. In this generalized form it becomes immediately clear how to carry it over to interconnection networks. The numbers are divided over n^{1/3} buckets of size n^{2/3}. The role of the PUs is taken over by the buckets.
Sorting n numbers can be performed by a combination of a constant number of operations in which n^{1/3} buckets with n^{2/3} numbers each are sorted and some regular rearrangements of the elements between these buckets.
Theorem: On a 2-dimensional n x n mesh with k >= 8 numbers per PU, sorting can be performed in k * n / 2 + o(n) steps.
Proof: The buckets can be chosen to consist of the k^{2/3} * n^{4/3} numbers residing in each n' x n' submesh for n' = n^{2/3} / k^{1/6}. Sorting these submeshes can be performed with any sorting algorithm which achieves the optimal time order in O(k * n') = O(k^{5/6} * n^{2/3}). The final correction also has this time order. The shuffle and unshuffle send only half of the packets over the bisection and when exploiting the all-port facility they can be performed in k * n / 4 steps each. Thus, the whole sorting can be performed in k * n / 2 + O(k^{5/6} * n^{2/3}) steps. End.
The number of steps is the same as for routing a k-k distribution except for the lower-order term. Thus, we have obtained an optimal sorting algorithm because sorting is at least as hard as routing and k * n / 2 is a lower bound for k-k routing. Comparing with the presented randomized algorithm we see that the performance is almost the same. The lower-order term of the randomized algorithm is asymptotically marginally larger: distributing information regularly is better than randomizing.The n! permutations can be enumerated systematically, and permutation p, 0 <= p < n!, is denoted pi_p. The PUs are indexed PU_{p, i}, 0 <= p < n!, 0 <= i < n. The algorithm is simple:
1. for (all p, 0 <= p < n!, using PU_{p, 0}) do
b[p] = true;
2. for (all p, i, 0 <= p < n!, 0 <= i < n - 1, using PU_{p, i}) do
if ( a[p[i]] > a[p[i + 1]] ||
(a[p[i]] == a[p[i + 1]] && p[i] > p[i + 1]))
b[i] = false;
3. for (all p, i, 0 <= p < n!, 0 <= i < n, using PU_{p, i}) do
if (b[p]) {
c[p[i]] = a[i];
a[i] = c[i]; }
In the last step it is essentially used that a PRAM is synchronous. On
a non-synchronous machine the two instructions inside the if-clause
should be separated by a synchronization to assure that the instruction
c[p[i]] = a[i] has been executed before the instruction a[i] = c[i] is
executed.
Using O(n^3) hardware to sort n numbers is not particularly efficient. By how much can this be reduced while still bounding the sorting time to O(1)? We will see that O(n^2) hardware is enough. This is optimal because performing a permutation in O(1) requires a bisection bandwidth of n, while the grid also has a basis of length n.
The main conclusion from column-sort is that sorting n numbers can be performed by several times sorting O(n^{2/3}) numbers and several rearrangements. Of course each of these sortings themselves can also be performed using column-sort. So, the idea for sorting n numbers on a reconfigurable n x n mesh is to recursively apply column sort. The recursion is terminated as soon as the size n' of the subproblem to solve satisfies n'^2 < n, because then it can be solved in constant time using the presented algorithm for sorting on reconfigurable arrays. Let n_i, i >= 0, denote the size of the subproblems at level i of the recursion. n_0 = n, n_1 = n^{2/3} and n_2 = n^{4/9}. So, the recursion tree has constant depth. Because the number of subproblems increases only by a constant factor between two levels of the recursion, there are in total only constantly many subproblems of size n^{4/9} to solve. We conclude
Theorem: On a reconfigurable processor array with n^2 PUs, n numbers can be sorted in O(1) time.