We consider a set of lists with n nodes in total. This is internally represented by an array succ[] of length n. For a node u, succ[u] gives the successor of u. So, the lists are directed towards their final nodes. If u is the last node of its list, we have succ[u] = u. The task is to compute the values of the arrays last[] and dist[]. For a node u, last[u] indicates the last node of the list to which u belongs, and dist[u] gives the number of links that must be traversed in order to reach last[u] starting in u.
Sequentially list ranking can be solved by first figuring out the initial elements of the lists and then simply traversing the list while numbering the nodes.
void sequential(int succ[], int[] frst, int dist[], int n)
{
boolean[] first = new boolean[n];
for(int i = 0; i < n; i++)
first[i] = true;
for(int i = 0; i < n; i++)
first[succ[i]] = false;
for(int i = 0; i < n; i++)
if (first[i]) // i is first node
{
int j = last[i] = i;
int d = dist[i] = 0;
while (succ[j] != j)
{
j = succ[j];
d++;
last[j] = i;
dist[j] = d;
}
}
}
The above algorithm performs 4 * n unstructured accesses to arrays of
length n. For large n, this causes close to 4 * n cache faults. As a
result this algorithm is not particularly efficient. Even a fast
computers requires around 1 second for ranking a list with 10^6
elements. Some simple improvements reduce the number of unstructured
array accesses to 2 * n, almost halving the execution time.
There is no obvious parallel algorithm. Parallel list ranking is much harder than parallel sorting: when sorting any two elements can be compared and this adds some information, but while list ranking, the successor of a node is typically stored in another processor and just going there is expensive. Theoretically the parallel list-ranking problem has been solved: for ranking a list of length n, there are PRAM algorithms using n / log n PUs running in O(log n) time. But, practically it is hard to achieve reasonable speed-ups. Even for very large problems and a very good interconnection network it is not easy to achieve a speed-up of more than 30 on a parallel computer with 100 PUs. For sorting it would have been easy to achieve a speed-up of 80 or 90. This is mainly due to the hard, very non-local nature of list ranking, but also to the fact that its sequential complexity is only linear, while sequential sorting has running time O(n * log n).
The ideas in this section go back on many authors. Just to name a few: Wylie, Anderson, Cole, Miller, Vishkin, Reid-Miller, Ranade and Sibeyn.
void pointerJumping(int[] succ, int[] last, int[] dist, int n)
{
// Initialization
for (int i = 0; i < n; i++)
if (succ[i] == i)
{
last[i] = succ[i] - n;
dist[i] = 0;
}
else
{
last[i] = succ[i];
dist[i] = 1;
}
// Performing rounds until all reached final node
boolean flag = true;
while (flag)
{
flag = false;
for (int i = 0; i < n; i++)
if (last[i] >= 0) // i has not yet reached last node
{
flag = true;
dist[i] += dist[last[i]];
last[i] = last[last[i]];
}
}
// Correction of last[] values
for (int i = 0; i < n; i++)
last[i] += n;
}
Operations of the kind last[i] = last[last[i]], for all i, are called pointer-jumping steps. In a parallel algorithm, these are performed synchronizedly. So, as long as last[u] does not yet point to the last node of a list, the value dist[u] doubles in every pass. Therefore the while loop will be executed at most log n times. The work in each pass is O(n), so the total work is O(n * log n) and the time consumption is O(log n). Pointer jumping is simple and fast, but inefficient due to its super-linear work.
On an interconnection network, each pass requires two all-to-all routing operations, in which each question is resulting in two answers. So, in the DMM model this algorithm costs
T(P, n) = c_i * O(n / P * log n) + c_v * 3 * log n * n / P + c_a * 2 * log n * P.Because of the relatively small number of all-to-all communications, this algorithm is quite reasonable for relatively small problems. For n >> c_a / c_v * P^2, the start-up costs are negligible. But, of course, the factor log n in the work and routing volume excludes really good performance.
The independent-set removal algorithm consists of a reduction and an expansion phase. In the reduction phase, repeatedly the nodes in an independent set are excluded. This continues until the problem has been reduced sufficiently to be solved in a more basic way. Then all excluded elements are reinserted in reversed order. For efficiently performing these operations it is practical to have a doubly linked list, that is, for every node i there is also data item pred[i] giving its predecessor next to succ[i] giving its successor.
During the reduction phase, the following steps are iterated: somehow an independent set S is constructed, then each node i in S, is excluded from the list by performing
succ[pred[i]] = succ[i]; dist[pred[i]] += dist[i]; pred[succ[i]] = pred[i];During the expansion phase these operations are reversed: if i was excluded in round r, then in the corresponding expansion round we may inductively assume that node succ[i] already knows the index of the last node of the list and the distance thereto. Thus, by performing
last[i] = last[succ[i]]; dist[i] += dist[succ[i]];also the values of node i become final.
In the work-time framework, each of the exclusion and inclusion operations can be performed in O(1) time and work that is proportional to the number of participating nodes. On an interconnection network each of these operations require a constant number of all-to-all routings, with a routing volume that is proportional to the number of participating nodes.
The main point that remains to be specified is the selection of the independent set. This can be done both randomly and deterministically. Investing some more effort in order to get a larger independent set might be a good idea, because this will result in a faster reduction of the problem size. However, the simplest idea is to perform random coin tossing: for each node 0 or 1 is chosen independently with probability 1/2. S is the set of all nodes with value 1 whose successor has value 0. For a list of length n, the expected number of nodes selected equals (n - 1) / 4. Using the Azuma inequalities it can even be shown that at least n / 4 - o(n) nodes are selected with high probability. This means that applying this selection method, the problem size is approximately multiplied by a factor 3 / 4 in each reduction round. In the work-time framework, S can be selected in constant time, on an interconnection network, it requires one all-to-all routing to inform all predecessors.
So, over all selection, reduction and expansion rounds, the sum of the the performed work is given by c * (1 + 3/4 + 9/16 + ...) * n <= 4 * c * n for some suitable constant c. So, it is linear in n. The time is proportional to the number of performed reduction rounds because for each reduction there is one corresponding selection and expansion. There is no need to continue reducing until the problem size is constant, though this would not be wrong. If the selection of the independent set is performed randomly, it is better to switch to another algorithm earlier. A reasonable choice is to reduce the problem size by a factor log n using independent-set removal in O(loglog n) time, and then to rank the remaining nodes using pointer jumping in O(log n) time. Both algorithms perform O(n) work.
Theorem: On an EREW PRAM with n / log n PUs the list-ranking problem for a set of lists with n nodes can be solved in O(log n) time.
Proof: The main problem is the allocation of work to the PUs. If initially each PU takes care of log n nodes, it is not guaranteed that this number decreases in a smooth way. However, in groups of log n PU, which initially take care of log^2 n nodes, such a smooth decrease is assured with high probability. These subsets of the PUs can be perceived as constituting (loglog n)-dimensional hypercubes. After reduction round r, each PU performs a load balancing with the PU at the other end of the (i mod loglog n)-axis of this conceptual hypercube. Doing this, it becomes highly unlikely that in round r a PU has to deal with substantially more than (3/4)^r * log n nodes. Neither independent-set removal nor pointer jumping require concurrent-memory access. End.
On an interconnection network with P PUs, we can proceed in the same way although in practice it is better to first reduce the problem size slightly more in order to reduce the cost of the pointer jumping. Alternatively, the problem size can first be reduced by a factor P and then all remaining nodes can be concentrated in PU 0, where the subproblem can be solved internally in O(n / P) time. This is not so elegant, but minimizes the number of start-ups. In terms of the DMM the second approach gives
T(P, n) = c_i * O(n / P) + c_v * O(n / P) + c_a * O(log P * P).
As observed in the section on "Routing" in the chapter on "Interconnection Networks", the number of start-ups for an all-to-all routing can be strongly reduced by simulating the routing of a d-dimensional meshes or hypercubes. In general this observation is useful only for problems that are too small for the used number of PUs, because such an indirect routing multiplies the routing volume and causes overhead for rearranging packets as well. So, this may only give an improvement if the time consumption is dominated by the time for the start-ups. However, in our case, the problem size is gradually reduced, and even when the overall time-consumption is dominated by the term (c_i + c_v) * O(n / P), it may happen that in some of the all-to-all routings the start-up time dominates. So, the best performance is obtained, when at any stage the best all-to-all routing subroutine is chosen as a function of P and the size of the packets. Working out this idea more carefully, shows that the factor log P form the term c_a * O(log P * P) can be omited without increasing the leading constants hidden in the other two terms.
The independent set can also be found deterministically. One possibility to obtain a large independent set is to deterministically construct a three-coloring as discussed in the section on "Symmetry Breaking" of the chapter on "Basic Techniques". The nodes with any of the three colors constitute an independent set. The largest of these sets has size at least n / 3 for a set with n nodes, The coloring and the computation of the sizes of the subsets can be performed in O(log n) time and O(n) work. Using this independent-set selection procedure until there is only one node left, the whole algorithm takes O(log^2 n) time and O(n) work. An immediate improvement is obtained by applying accelerated cascading: as soon as the problem size has been reduced to n / log n, that is, after at most loglog n / log (3/2) = O(loglog n) reduction rounds, we switch to pointer jumping. This idea reduces the total time consumption to O(loglog n * log n). A more refined algorithm runs in O(log n) time, but practically none of the deterministic algorithms can compete with the randomized one.
A ruling set in a graph is a subset S of the nodes so that each node is either in S or adjacent to a node in S. A k-ruling set is a subset S, so that each node has distance at most k from a node in S. The algorithm of this section is called the sparse-ruling-set algorithm, because it constructs a small subset of the nodes, which are called rulers, that are more or less regularly interspaced in the list. Once we have walked through the list from one ruler to the next, a new reduced list is constructed which has a node for each ruler and links with weight equal to the distance between the rulers. Once the ranks of the rulers are known, each non-ruler can ask the ruler from which it was reached for its rank and add an offset.
The following formulation of the algorithm is given for the DMM model. With minor modifications it can be turned into a PRAM or sequential algorithm.
while (there are still active waves)
{
Send all prepared packets.
for (each node p' that receives a packet (p, d))
if (p' is a ruler) // wave of p ends in p'
{
Add a link (p, p', d + 1) to L.
if (there are still unvisited non-ruler nodes)
{
Randomly select such a node p''. p'' initiates a new wave:
p'' prepares a packet (p'', 0) to be sent to succ[p''].
}
}
else // wave of p continues from p'
{
mark p' as visited with p as ruler and d + 1 as distance
thereof. Prepare a packet (p, d + 1) to be sent to succ[p'].
}
}
In each round of the main while-loop exactly s previously unvisited nodes are reached, so all n nodes are visited in n / s rounds. Considering the number of unvisited nodes, it follows that the expected number s_t of waves that hits a ruler in round t, 0 <= t < n / s, is given by s_t = s * s / (n - t * s). Summing over t, gives the total number of newly created rulers:
sum_t s_t = sum_{t = 0}^{n / s - 1} s * s / (n - t * s)
= s * sum_{i = 1}^{n / s} 1 / i
= s * H_{n / s}.
Here H_k ~= log_e k is the k-th harmonic number. Thus, the problem size
is reduced from n to a value n' ~= s * log_e (n / s).
In a DMM implementation, the algorithm performs n / s + O(1) all-to-all routings. During the main while-loop each node sends a packet containing the index of the ruler, the distance thereof and the destination, so the routing volume is 3 * n / p. In the last step, each non-ruler asks a question generating two answers, so the routing volume of the complete algorithm is 6 * n / p. So, the cost for a reduction of the problem size by a factor f = n / n' = n / s * log_e (n / s) is approximately
T(P, n) = c_i * O(n / P) + c_v * 6 * n / P + c_a * f * log_e f.
Taking s = o(n / (log n * loglog n)) gives n' = o(n / log n), which allows to solve the subproblem with pointer jumping at negligible costs. If s is chosen much smaller, s = o(n / (P * log P)), n' = o(n / P). In that case the data of the reduced list can be gossiped at negligible costs and the remaining problem can be solved internally. In that case the total routing volume is only 3 * n / P, but this goes at the expense of a much larger number of all-to-all routings. For small P this may be a very good idea though. If no new rulers are chosen in the course of the algorithm, the distance does not need to be sent along because it can be determined by counting the number of performed rounds. This has the disadvantage that we have to somehow deal with the initial nodes but allows to reduce the routing volume to just 2 * n / P, which appears to be the best achievable.
In practical applications, if P is somewhat larger, the focus should lie on reducing the number of routing operations. If s is chosen larger, for example, s = n / 50, the problem size is reduced by about a factor 10 with just over 50 all-to-all routings. Applying the same once more gives a reduction by about a factor 100, with 100 all-to-all routings and a routing volume of about 6.6 * n / P. In practice the choice of s should be made as a function of P, n, c_v and c_a in combination with estimates of the cost to solve the subproblem.
We remind that there are two main reasons why running a parallel program on a parallel computer with P PUs mostly gives much less than P speed-up:
The weakness of sending too many packets can be alleviated by applying the following idea. Instead of performing all-to-all routings, at the end of each phase (which is the basic idea of the BSP approach!), it is better to intermix the computation and the communication: Round t is decomposed in P subrounds, denoted t_p, for 0 <= p < P. In subround t_p the algorithm running on PU i sends all it has to send to PU (i + p) mod P. Then it works on all the data it received and prepares the packet for PU (i + p + 1) mod P, and so on. Doing this, the routed packets become larger: a wave now has to wait only P / 2 subrounds before it is progressing again, going twice as fast. This idea does not only give improvements for list ranking, but for any problem in which a non-local structure has to be traversed.
The sparse-ruling-set algorithm may be considered to be a variant of independent-set removal. The difference is the speed at which the reduction is performed: independent-set removal gives a fixed small reduction factor. With the sparse-ruling-set algorithm the reduction factor can be chosen freely. A more interesting observation is that the structure of the sparse-ruling-set algorithm is very similar to that of sample sort: for a small subset of the elements the ranks are determined exactly, then the other elements are ranked between them.
Click here to see a sequential version of the sequential algorithm, the pointer-jumping algorithm and the sparse-ruling-set algorithm. It is important to observe that in the implemented variant of pointer jumping the distance may double by more than a factor two in one round because it may happen that node succ[i] gets processed before node i.
As before every node has fields succ[], last[] and dist[], where ultimately last[i] will point to the last node of the list of i and dist[i] at all times gives the distance from node i to last[i]. A final node i has succ[i] = i. There are two basic operations. The first is called autoclean, the other altroclean. If last[i] = j, we say that node i points to node j.
Autoclean(p) means that for all nodes i stored in some PU_p for which last[i] points to a node also stored in PU_p, we perform last[i] = last[last[i]] and dist[i] += dist[last[i]] until i is pointing to a node stored in another PU. This operation does not require any communication.
Altroclean(p, p') means that for all nodes i stored in some PU_p for which last[i] points to a node stored in PU_{p'}, we perform last[i] = last[last[i]] and dist[i] += dist[last[i]], by asking PU_{p'} for this information. An autoclean operation means that both involved PUs must send one packet. The first packet contains the index of the node the question is destined for, the second contains its last- and dist-value. In total three integers are exchanged.
Building on these two basic operations, we get the following algorithm for PU_p, 0 <= p < 2:
for (each node i in PU_p)
if (succ[i] == i) // i is a final node
{
last[i] = i - n;
dist[i] = 0;
}
else
{
last[i] = succ[i];
dist[i] = 1;
}
autoclean(p);
altroclean(p, (p + 1) mod 2);
autoclean(p);
for (each node i in PU_p)
last[i] += n;
The algorithm is correct, because after the first autoclean no node points to a node in its own PU. Thus, after the altroclean, each node which does not point to a final node points to a node in the own PU. The second autoclean therefore finishes the problem, because if no node in PU_p points to a node in PU_{(p + 1) mod 2}, at the beginning of an autoclean, then it does not do so at the end of this operation. But by then no node points to a node in the own PU either, so all nodes must point to a final node.
The autocleans perform a linear number of internal operations. Essentially an autoclean is nothing but an internal list-ranking problem in which all nodes pointing to a node in another PU are considered to be final nodes. So, typically there will many very short lists.
T(2, n) = c_i * O(n) + c_v * 3 * n / 2 + c_a * 2.
The kernel of the algorithm consists of a loop in which all other PUs are addressed in a circular way:
for (each node i in PU_p)
if (succ[i] == i) // i is a final node
{
last_left[i] = last_rght[i] = i - n;
dist[i] = 0;
}
else
{
last_left[i] = last_rght[i] = succ[i];
dist[i] = 1;
}
autoclean (p, last_left);
autoclean (p, last_rght);
for (t = 1; t < P; t++)
{
altroclean(p, last_left, (p - t) mod P, last_rght);
altroclean(p, last_rght, (p + t) mod P, last_left);
autoclean (p, last_left);
autoclean (p, last_rght);
}
for (each node i in PU_p)
last_rght[i] += n;
We claim that at the end of the algorithm last_rght[i] gives the last node of the list of i for all i, 0 <= i < n. This can most easily be proven with invariants. We claim that after round t, 0 <= t < P, last_left[i] for all nodes i stored in PU_p is clean for p, p - 1, ..., p - t. By this we mean that last_left[i] does not point to any node stored in a PU with indices from {p, p - 1, ..., p - t}. In addition we claim that after round t, 0 <= t < P, last_rght[i] for all nodes i stored in PU_p is clean for p, p + 1, ..., p + t. For t = 0, we have performed autocleans, so in PU p, last_left and last_rght are clean for 0. More generally, assume the claims hold for some given t. So, at the beginning of round t + 1 last_left[i] is clean for p, p - 1, ..., p - t and last_rght[i] for p, p + 1, ..., p + t. Thus, if last_left[i] points to a node in PU_{p - t - 1}, putting last_left[i] = last_rght[last_left[i]] will return the index of a node which does not lie in a PU with indices in {p - t - 1, p - t, ..., p - 1}, but possibly it may lie in PU_p. Performing the autoclean establishes the claim for t + 1. In the above discussion, all PU indices are meant to be computed modulo P.
How about the time complexity? There are 2 * P - 2 altrocleans. Each operation consists of a question followed by an answer. Sending the second question in each pass of the loop together with the first answer, the number of start-ups is reduced to 3 * P - 3, this is only as much as in three all-to-all routings, without any indirect routing. On the other hand, if we have bad luck, there may be n / P questions in each altroclean. However, if the nodes indices are randomized, then we get much better performance: we may assume that in round t, the last values are the indices of nodes which are uniformly distributed over the remaining P - t PUs. If this assumption holds, the expected number of questions for an altroclean in round t equals (n / P) / (P - t). In total this gives sum_{i = 1}^{P - 1} (n / P) / (P - t) = n / P * H_P ~= n / P * log_e P. For every left-question we get back one answer, only a new last-value, for every right-question we get back two answers. So, for the presented version of the algorithm, the total routing volume is about 5 * (n / P) * log_e P. Altogether this gives
T(P, n) = c_i * O(n * log n) + c_v * 5 * n * log_e P + c_a * 3 * (P - 1).
Comparing with pointer jumping, we have slightly less work and routing volume because 5 * log_e P < 3 * log n, and much fewer start-ups. So, in practice pointer jumping will be worse. Comparing with the sparse-ruling-set algorithm we have lost almost a factor log_e P in work and routing volume, but the number or start-ups is far smaller. So, one-by-one cleaning will be better for small values of n / P^2, because then the start-up costs tend to dominate. There are other algorithms which are better for intermediate values, but it is correct to say that sparse ruling sets is best for large values of n / P^2 and one-by-one cleaning for small values of n / P^2. A combination of these two algorithms always gives reasonable results.
The optimal PRAM algorithms can quite easily be turned into algorithms performing O(P) routing steps. This can be achieved on a linear array by geometrically reducing the problem size and the number of employed PUs until the reduced problem fits in one PU. Then it can be solved internally. Reversing the reduction steps solves the problem. The constant hidden in O(P) depends on how the details are worked out. It lies between 10 and 20. It is not clear how such algorithms that contract the problem in an ever smaller subset of PUs should profit from the wrap-around connection after the first reduction rounds, so on circular arrays, the number of routing steps is only slightly smaller.
The quite amazing thing is that on circular arrays list ranking can be solved much faster than this (and because circular arrays can be embedded on linear arrays with dilation 2, by extension even on linear arrays): P - 1 routing steps are sufficient! This can be achieved by an adaptation of the one-by-one cleaning algorithm. The idea is simple: the values of last_rght[] stay in the PU where they started, while the values of last_left[] travel around the network moving leftwards, that is, from PU_p to PU_{(p - 1) mod P}. In this way, the communication that is performed in the DMM algorithm can be resolved locally.
The packets sent in every step have size O(n / P) and the amount of computation is of the same order. Now one may complain that using P - 1 steps with O(n / P) work in each of them gives O(n) in total. This will not give much speed-up! This is true in general, but the important thing is the idea: this is an algorithm in which all communication is local. It is a kind of systolic list-ranking algorithm: two streams of information which are only interacting where meeting. The algorithm might even have practical importance. Maybe the "PUs" actually give highly connected clusters of PUs in a structure in which the clusters are more loosely connected (NUMA platforms). Or, we may have that n, the problem size, is greater than M, the size of the memory of a PU, but that n < P * M.