As html is not suited for mathematical formulas, some additional notation is used (as used in the typographical package Latex). a_i denotes a with subscript i. a^i denotes a to the power i. <= and >= are used as in most computer languages. Curly brackets, "{" and "}", are use to group things. sum_{i = 0}^j denotes the sum for i running from 0 to j. The same may also be written sum_{0 <= i <= j}. ~= stands for "approximately equal" and ~ for "proportional to". Greek letters are written out. For logical expressions we either use the notation from C or the operators are written out in text. So, "a && !b" is the same as "a and not b". sqrt stands for the square root function and log without specifying the base number for the logarithm to basis 2. There are a few more notations, but they should be understood easily. New notions are printed bold there where they are defined. Inside the chapters particularly important ideas and short key notes are highlighted without introductory text. There are many pictures. These are intended to be self-explanatory. Typically they are placed just after the text fragment to which they belong, generally there is no direct reference to them in the text.
There are very few references to the literature. Clearly most of the presented material is not new. A large part is even common knowledge. More directly this text is based on material found in the following two books:
Notice: the following text may be overcomplete. At the examination it is expected that the students only know all that has been presented during the lectures.
To meet both kinds of demands, future and current, it is natural to consider using more than one processor for solving a single task. Because processors not only become faster but also cheaper this becomes more and more realistic. As soon as there is a demand for it, the production of multi-processor chips will start. Right now the newest processors are still expensive, but processors of the previous generation are cheap: If one strives for the maximum number of clock cycles per money unit, then clearly one should not buy the top model, but rather four with half the speed. Using P processors instead of 1, we may hope to become P times faster, but this is rarely achieved. Therefore, it is not sure that four 2 GHz processors will solve a problem faster than a single 4 GHz processor.
If we know how to exploit the computational power of a system with several processors, such a system is financially attractive and technologically possible.
Definition: A computer with several processors is called a parallel computer. Parallel computing is the (study of) the usage of parallel computers in order to solve tasks faster.
Even though technologically different, this kind of parallelism can be characterized as being task parallelism. The most important theoretical question in this context is how to schedule the tasks, that is, how to allocate the tasks to the processors so that an even load balancing is achieved. The load balancing problem has several variants. For a fixed set of tasks, it is a common goal to minimize the make span, the time to complete all tasks, respecting interdependencies if any. In an online context, the goal is more reasonable to try to minimize the completion time, the time before any job has been completed, or to minimize the time before any job starts being done.
In practical applications until now task parallelism is by far the most important branch of parallel computing. However, from an algorithmic point of view it is not very interesting because there is not so much one can do. Therefore, this is not the topic of this lecture. Only here we will briefly discuss the load-balancing problem.
If the lengths of the tasks are known in advance, then the problem of how to schedule a set of tasks so that the make-span is minimized is NP-hard. A reasonable heuristic is to sort the tasks according to their lengths and to allocate in this order to the processor which until then has the least work to do.
If the lengths of the tasks are not known, then the best one can do is to always allocate a task to the processor which currently has the shortest queue. This is the normal scheduling principle applied in supermarkets. In this context there is the issue of task migration: at certain costs it may be possible for tasks to be reallocated in order to achieve a more even work load.
A problem with this approach is that it requires that the lengths of all queues are centrally known: in a huge supermarket with 30 cash-sites, one will typically not check all of them, but simply take the locally shortest queue. If the central allocating agent does not know the loads, then the best he can do is to perform the allocation in a random way. Queuing theory tells that this is nevertheless quite good: if a processor on average gets allocated k tasks, the deviation from this expected value is bounded by O(sqrt(k * P * log(k * P))). Much better balance is achieved if the allocator always first asks the queue lengths of two randomly selected processors and then allocates the next task to the one with the shorter queue.
Perfect balance is achieved if the scheduling is postponed: a processor fetches a next available task as soon as it is ready with the previous one. This is the scheduling principle applied in modern post offices. The disadvantage is clear: the waiting tasks have to be stored in a central storage and cannot be handed out immediately. Furthermore, the processor will be idle while sending the request for a next task and waiting for this task to arrive.
For an analogy consider a ship wharf. In earlier days work was done in a very labour intensive way. In the rich countries, using better equipment, the number of working hours for building a ship has been reduced considerably. Productivity is not doubling every 18 months, but even in this domain there is an exponential increase, justifying the noticeable salary increases of the past 100 years. Nevertheless, it still takes many thousands of working hours to build a supertanker. Unfortunately the company which ordered the ship normally does not want to wait 20 years. The solution is well-known: a wharf building a supertanker typically employs thousands of workers. Of course having many workers around, there must also be people not actually building the ship but leading the work, managing the company and cleaning the buildings. It also takes time to get instructions from the management down to the people on the work floor, and similarly there may be a need for workers to communicate with the management (paint is running out!) or among each other (we are ready with our section of the ship!). Even worse, the process cannot be speed-up arbitrarily, because there is a fixed order of steps, which cannot be rearranged arbitrarily. Some of these steps require many more workers than others, and it thus becomes inevitable that during part of the process some workers are idle. It may even happen that due to bad organization people are standing around though there is work they could do. Or people are working, but because they did not receive precise instructions, they are drawing cables which later have to be removed again, or which were already installed by someone else. Finally, given that one wants to minimize the construction time, the chief engineer may decide to reschedule the work. For example, it is clearly practical to not start to paint the ship before it is completely welded together, but maybe the painters can already start painting where ever something gets ready. They waste time while climbing the scaffoldings many times and cleaning there equipment.
Let us summarize the aspects which lead to the fact that P workers will not build a ship P times faster than a single worker (provided he has all the skills and is able to physically do the job):
All these aspects arise also if a single problem has to be solved by a set of P processors. Communication and how to minimize the cost of it is a central issue on computers with a distributed memory, but will be abstracted away in most of this lecture. The scheduling is an important problem in general, but we will design the algorithms so, that in most cases it is trivial how to schedule the arising subtasks. Useless work can normally be prevented, double work may be reasonable in some contexts, but will not play a big role in this lecture. The central topic of this lecture is the design of algorithms which allow to use as many processors as possible, while loosing at most a constant factor in efficiency. So, we accept that a parallel algorithm is more work-intensive by a constant factor, but normally not more than that. This does not mean that we will only look at such problems, we will see how to determine the maximum of n numbers in O(1) time using O(n^2) processors, but this is the focus. Quite often we will first only consider the time: given arbitrarily many processors, how fast can the problem be solved. Then we will try to reduce the number of processors employed so that the algorithm becomes optimal in the above sense.
The cost C(P, n) of a parallel algorithm solving a problem of size n on a computer with P PUs is defined as C(P, n) = P * T(P, n). The cost is to be distinguished from the work W(n, P) which equals the sum over all PUs of the number of performed operations. The difference is that the work does not take idle time into account.
The speed-up S(P, n) gives the number of times the parallel program is faster than the sequential program. So, S(P, n) = T(1, n) / T(P, n), for some definition of the sequential time T(1, n). For T(1, n) there are several possibilities, which are all in use:
The efficiency E(P, n) of an algorithm is related to the speed-up: it is given by E(P, n) = S(P, n) / P = T(1, n) / (P * T(P, n)) = C(1, n) / C(P, n). So, it is the ratio of the costs, whereas the speed-up is the ratio of the times. A parallel algorithm is called optimal if E(P, n) = Omega(1), that is, the cost of the parallel algorithm is at most a constant factor larger than that of the best sequential algorithm. A study of the efficiency is especially important in practical contexts: Often the efficiency starts to decrease quite strongly from a certain P value on. The typical reason for this is that when P becomes too large for the value of n, the communication costs (which often increase with P^2) become dominating, and any further increase of P will hardly lead to a decrease of the overall time. Particularly on a multi-user system with P = 32 one should not run an application with an efficiency of under 10%.
As an example we consider sorting 10^10 integers, requiring 40 GB of storage. On a single PU, having only 1 GB of main memory, this will cost about 12 hours. This time is estimated as follows: 40 GB must be read and written twice, a typical transfer rate between main memory and hard disc is 8 MB. So just for the data transfer we need 160 * 10^9 / (8 * 10^9) = 20,000 seconds. If we have 64 of these PUs interconnected by a switch giving 2 Gbit point-to-point connection the time becomes much better. We must assume that originally the data stand distributed over the PUs. After sending around a sample of negligible size, all data must be routed over the network once before each PU has to sort 156 * 10^6 integers. The routing takes a few seconds: each PU holds data with a total size of 5 * 10^9 bits. 2 * 10^9 bits can be routed per second. The sorting takes less than 100 seconds on a 3 GHz processor. So, we achieve a speed-up of 200 with 64 PUs.
The described phenomenon of solving a problem more than P times faster using P PUs is called super-linear speed-up. It is a true phenomenon, and it is practically important, but it does not really tell something about the quality of the algorithm. Clearly, very bad parallel algorithms will not achieve any speed-up, not even in an extreme context. But, comparing with an unreasonably slow sequential algorithm does not provide a fair basis for evaluation: if in the above example the joint 64 GB of main memory would be available to a single PU, then this PU could have performed the sorting in about 50 times the parallel time. In that case the efficiency of the parallel sorting algorithm would have lain around 0.90 and not around 5.0.
Too-good-to-be-true speed-ups are obtained by either comparing with sub-optimal sequential algorithms, or by comparing with sequential algorithms which are slowed down by slow access to a large data set.
In parallel computing the situation is different: there are several cost models none of them dominating. The reason for this multitude is that the simplest of the models is felt to model reality too poorly to be acceptable. The more precise models are considered to be too hard to design algorithms for and in part also considered not to be of long lasting validity. Another reason for the multitude of models is also that the underlying hardware shows much more variability than the platform on which sequential computation is founded.
The most fundamental distinction in parallel computation is on how the memory is accessed: whether there is a shared memory to which all PUs have access in constant time, or whether there is a distributed memory so that each PU has direct access only to its own local memory. In the first case data can be exchanged by "parking" some information at some fixed position making it accessible to the other PUs. In the second case the data are exchanged by some form of communication between the PUs.
Another important distinction is whether the PUs are working synchronously or asynchronously. In a synchronous model, all PUs have access to a central clock. In practice systems are mostly asynchronous: each PU computes at its own speed. The speed differences may be small, but in a parallel computer not necessarily all PUs are of the same type. In principle the speed differences can be arbitrarily large. In practice, considering that there is mostly a non-negligible increase of the overhead with P, it normally does not make sense to combine PUs which are too different. Only for trivial problems, such as searching for extra-terrestrial life in radio signals or finding very large prime numbers, it is profitable to add 100 MHz PUs to a system that also contains 3 GHz PUs.
Historically parallel computers have been distinguished in SIMD, Single Instruction Multiple Data, and MIMD, Multiple Instruction Multiple Data, machines. On an SIMD machine, which is synchronous, all PUs execute the same instruction at all times. Because they execute these instructions on different data, it is still possible to obtain parallelism. The idea behind SIMD machines, is that if all PUs execute the same instruction, there is no need for the PUs to generate these instructions themselves: there is a central host computer connected to all PUs which tells them what instruction to execute next. This implies that the PUs can be much simpler.
In the 1980s there were many parallel computers of this type. An impressive example was the Masspar. This parallel computer consisted of a grid of very simple PUs. The PUs had primitive 4-bit arithmetic, a few kilobytes of storage and were slow. But, they were cheap. This implied that research institutes could afford to buy Masspar machines with 65536 PUs in a 256 x 256 grid. More recently no parallel computer with a comparable number of PUs has been build.
In an MIMD machine all PUs in principle may have their own program. In practice all will execute the same program, but due to conditional instructions, they will not always execute the same instructions. All modern parallel computers, in which the CPUs are mostly conventional PC CPUs, are of this type. Because most parallel algorithms become less efficient if the number P of PUs increases, it is more cost effective to buy 32 expensive but fast PUs than 1024 cheap but slow ones. Only for specific numerical applications architectures like the Masspar might still make sense.
In the following we briefly discuss the most important models. In the last section of this chapter we consider several problems from various perspectives. In the following chapter we present some basic algorithms for interconnection networks. Thereafter, we will only consider the simplest and most theoretical model, which allows to concentrate most strongly on the underlying parallelization ideas.
It is assumed that the cost of communication is linear in the amount of data sent or received plus some constant cost. So, under this assumption the time consumption of a computation is estimated as follows:
T_tot = c_i * t_i + c_v * t_v + c_a * t_a.Here c_i, c_v and c_a are constants whose values depend on the state of technological development. c_a is known under the name start-up latency. t_i gives the number of internal operations; t_v gives the total amount of data sent by a PU, this we will call the routing volume; and t_a gives the total number of packets sent by a PU. Typically a computation is divided in steps, and it would be more correct to define the above values as the sums over all steps of the maxima over all PUs of the number of each type of operations. This model will be referred to as DMM model, where DMM is an abbreviation for Distributed-Memory Machine.
The technological justification for the DMM model is that in modern parallel computers the communication is performed by wormhole routing. This means that if PU_i wants to send something to PU_j, a path from PU_i to PU_j is cleared, and then the data are transferred along this path flit by flit. a flit is a small data unit. The idea is that these flits run through the network like the segments of a worm: first all of them are standing in PU_i, then the first flit enters the network. By the time the first flit reaches PU_j several other flits, depending on the number of hops from PU_i to PU_j, have entered it. In PU_j the flits pile up and are recomposed. In this way the transfer time is something like d + l, where d gives the number of hops from PU_i to PU_j and l gives the number of flits. The time for the d hops is mostly negligible, much more important being the almost fixed start-up time.
Currently c_i ~= 0.3 * 10^-9. The values of c_v and c_a vary much more depending on how much money was invested on the interconnection infrastructure. In some cases the network is virtually unbounded, in that case the bounding factor is the speed of the memory bus of the individual PUs: clearly data cannot be pumped into the network faster than the rate at which they can be supplied by the memory. This means that we may have c_v = 10^-9 (when t_v indicates the number of integers). On the other hand, the PUs may also be coupled very loosely. For example, they may be connected by fast ethernet or worse, giving up to c_v = 10^-6. c_a can vary accordingly. In ideal cases we have c_a = 10^-7, but it may also be orders of magnitude larger. Notwithstanding all variations, we will work under the general assumption that c_i is considerably smaller than c_v, which in turn is orders of magnitude smaller than c_a.
t_i, the number of internal operations, cannot be estimated accurately because of the obscure internal operation of a modern processor. However, in many cases, it is possible to compare it with the number of operations performed by a sequential algorithm. For example, we will encounter sorting algorithms in which it is clear that t_i(P, n) = (1 + o(1)) * t_i(1, n) / P. Here t_i(P, n) denotes the number of internal operations performed by a PU when solving a problem of size n on P PUs. For such an algorithm, as long as we can guarantee that the communication terms are negligible, we may expect to achieve very high efficiencies.
Here we touch on the main point in the context of DMM computation: the contribution c_v * t_v + c_a * t_a to T_tot is perceived as the loss. For most applications t_i(P, n) decreases linearly with P, whereas particularly t_a(P, n) tends to increase at least linearly with P. So, in order to obtain acceptable efficiency, P must not be chosen too large in comparison to n. It depends both on the time complexity of the problem and on the development of t_a(P, n) how large P can be chosen, or said otherwise, for how small n it still makes sense to use P PUs.
The DMM model is a compromise between accurately modeling the time consumption of a parallel program and simplicity. It takes into account that sending small packets should be avoided, but neglects locality aspects.
The time consumption of algorithms for interconnection networks is mostly expressed in steps. A step consists of a certain amount of computation followed by communication. Often there is no limit on the amount of computation that can be performed in a step. The communication is ruled by the precise model assumptions, but typically it is assumed that one packet of unit size can be transferred over each connection in each step. These packets can neither be divided nor composed. This cost model will be referred to as interconnection-network model.
The main assumption is that PUs can only communicate with neighbors and that longer distance communications are achieved by forwarding packets along some path connecting the source PU with the destination PU. This is called store-and-forward routing. Due to the dominance of wormhole routing, this assumption does not reflect the current technological conditions. There are nevertheless good reasons to consider a cost model based on it:
The model is further subdivided according to whether PUs can communicate with all neighbors at the same time, the all-port model, or only with one of them, the single-port model. Sometimes it is specified that the connections are uni-directional (also called half-duplex), in that case a PU cannot send to a PU from which it is also receiving. Otherwise the connections are said to be bi-directional (also called full-duplex).
The interconnection-network model neglects the aspects of computation. Thereby it allows to concentrate on the fundamental aspects of data exchange.
When considering computers with a shared memory, it is important to specify which types of access to the memory are allowed:
Depending on the underlying hardware, any of these models may be realistic. They are not far apart anyway: the most powerful of these models, a step of a Maximum CRCW machine with P PUs can be simulated on an EREW machine in O(log P) steps. The main parallel computer model considered in this text is the PRAM model. A PRAM, Parallel Random Access Machine, is a synchronous parallel computer with a shared memory. When specifying the performance of a PRAM algorithm we will mostly specify the memory access model, so we may say "On a CREW PRAM ... ".
It would be most correct to formulate a PRAM algorithm including load and store operations specifying which data have to be copied from the global memory to the local memory and vice versa. However, the existence of a local memory is mostly ignored, specifying algorithms as if all the PUs operate directly on the shared memory itself. This is similar to the practice of writing sequential algorithms without mentioning the memory management. The main reason why the PRAM model is so popular is that it allows to concentrate on the work to perform.
The shared memory model entirely neglects the aspects of data exchange. Thereby it allows to concentrate on the fundamental aspects of parallelizability.
An algorithm is evaluated with two cost parameters: the time, that is, the number of parallel steps; and the work, the sum over all steps of the number of performed operations. Specifying the performance of an algorithm in this way leaves the allocation of the work to the PUs unsolved. In principle it should be possible to perform a step in which w operations are executed on a PRAM with P PUs in round_up(w / P) time, but only if each PU knows which operations to execute.
If the work-allocation problem can be solved, then a WT algorithm gives a PRAM algorithm: if W(n) and T(n) are the work and time for a problem of size n, respectively, then
T(P, n) = sum_{step i} round_up(w_i / P)
<= T(n) + sum_{step i} w_i / P
<= T(n) + W(n) / P,
C(P, n) = P * T(P, n)
<= P * T(n) + W(n).
These relations immediately suggest a close to optimal choice of P for a given value of n: take P = P(n) = W(n) / T(n): for that value T(P, n) <= 2 * T(n) and C(P, n) <= 2 * W(n). So, for this choice of P, the completion time is at most twice the minimum value, and the cost is at most twice as large as the work. This is a good compromise: taking P larger, C(P, n) increases rapidly, without giving a substantial reduction of T(P, n). Taking P smaller gives substantially larger T(P, n), without giving a substantial reduction of C(P, n).
If there is a parallel algorithm in the work-time framework solving a problem of size n with work W(n) and time T(n), then, provided the work allocation can be solved, there is a PRAM algorithm with time O(T(n)) and cost O(W(n)).
Suppose we have a PRAM algorithm solving a problem with P_PRAM PUs in T_PRAM time. We want to execute it on a DMM with P_DMM PUs. It is understood that P_DMM << P_PRAM. In many cases P_DMM must be fairly small in relation to the problem size in order to obtain speed-up. The easiest is to let PU_i, 0 <= i < P_DMM, of the DMM perform the work executed by the PRAM PUs with indices k, i * P_PRAM / P_DMM <= k < (i + 1) * P_PRAM / P_DMM.
Of course, this is not all. In the first place there is no guarantee that in this way the DMM PUs are equally loaded, even though they take over the work of equally many PRAM PUs. This is called the load balancing problem. For example, when simulating the PRAM algorithm for computing the sum, more and more PRAM PUs become idle. If things are arranged so that at all times the first PUs remain active, then the above processor allocation concentrates all the work unnecessarily on a few DMM PUs. In the case of the sum algorithm this can be prevented by letting the work of PU_j of the PRAM be executed by PU_{j mod P_DMM} of the DMM. However, we strive for a blind PRAM simulation, not one in which every algorithm has to be studied before we can decide how to allocate the PRAM PUs to DMM PUs. A possible solution is to randomly allocate the PRAM PUs to the DMM PUs, for example, by using a universal hash function. This does not exclude bad allocations, but makes them unlikely.
Another point is the memory allocation. In some cases one might in principle distribute the complete data set to all DMM PUs, and then compute on these. This is normally not a good practice: this requires much communication and much memory at the PUs. One of the strong points of a parallel computer is that the joint main memory is much larger than that of most single-processor machines, thereby allowing to solve large problems without exceeding the memory size. Furthermore, there are not so many problems which can be split in non-interacting subproblems. So, normally the data set is distributed over the PUs in such a way that each data item is stored in a single PU.
Often it is good enough to use a regular allocation like for the PRAM PUs. In such a regular allocation, data item i is either stored in DMM PU i mod P_DMM or in i * P_DMM / number_of_items_to_allocate. If now a PRAM PU which is allocated to DMM PU p needs a data item stored in DMM PU p', it sends a request to PU p' and waits for the reply. A typical way of progressing is that each PRAM step is decomposed in three substeps:
A possible communication schedule for sending all requests is to have P_DMM - 1 communication steps: in step t, 1 <= t < P_DMM, PU p sends to PU (p + t) mod P_DMM. In this way each PU is sending and receiving only one packet in each step. If the pattern is balanced this works fine. A pattern is said to be balanced if each DMM PU is approximately sending the same number of requests to any other PU.
If the requests are balanced, then also the answers are balanced. If there are O(P_PRAM / P_DMM) requests, then with the above schedule each transferred packet has size O(P_PRAM / P_DMM^2). So, one such communication round costs O(P_PRAM / P_DMM) * t_v + 2 * (P_DMM - 1) * t_a. From this it becomes clear that we must have P_PRAM / P_DMM^2 >> 1 if we do not want that the start-up costs dominate the overall time consumption.
The condition P_PRAM / P_DMM^2 >> 1 coincides with the condition that in most cases will guarantee that the pattern is balanced. The average packet size is P_PRAM / P_DMM^2 and if we may assume that the requests are randomly distributed over the PUs, the size of a packet is given by a random distribution with expected value S = P_PRAM / P_DMM^2. Such a distribution is tightly concentrated around its expected value. Chernoff bounds give that the deviations from the expected value are bounded by O(sqrt(S * log(P_PRAM))) with high probability.
However, a fixed data allocation can normally not guarantee that the communication pattern is balanced. An approach which makes it very likely that for P_PRAM / P_DMM^2 >> 1 all arising communication patterns are balanced is to allocate the data in a random way just as we did with the PUs. Again a hash function should be used, otherwise it would be hard to figure out where a data item is stored. Nevertheless, evaluating a hash function implies considerable overhead in comparison to a sequential algorithm which can access its data directly. If such a construction is applied, the efficiency drops considerably.
In some cases the problem is more fundamental: if the same data item is used by many PRAM PUs, then there is no way of allocating the item so that the communication pattern becomes balanced. The PU holding this item will be flooded by requests, and applying the above approach will give very poor performance, because all requests are handled sequentially. A clever algorithms designer will know whether such patterns occur in his/her application or not and mostly no action is needed.
If the problem of having many requests for a small subset of data items really occur or may occur, then one must do something about it. A practical solution is to first sort all requests on the index of the data item they are destined for. Sorting is not for free, but normally we can apply bucket sort because the range of values is limited and the amount of involved communication is also quite limited as we will see in a later chapter. Then, for all requests to the same data item one is chosen as representative. This representative is forwarded to the item. The answer is returned and replicated to all requests. Then the routing pattern of the sorting is reversed.
PRAM simulation is the link between a library of PRAM algorithms and programs running on actual machines. Only some elementary communication algorithms must be implemented in a machine-specific way.
Above we noted that often there is a considerable discrepancy between the bounds in the PRAM and the DMM model. This does not mean that it is useless to try to design fast PRAM algorithms: simulating a PRAM algorithm with a running time T normally leads to a DMM algorithm with T communication rounds. A small number of communication rounds is a good thing to have, because this implies a smaller value of the number t_a of start-ups.
On the other hand, the number of start-ups is not the only cost factor. Consider again the formula giving the time consumption of an algorithm running on a DMM:
T_total ~= c_i * t_i + c_v * t_v + c_a * t_a.This means that the best parallel algorithm is mostly the one that minimizes t_a without substantially increasing t_i or t_v. From this we obtain a strong and interesting reduction of the set of practice-relevant PRAM algorithms:
Only cost-optimal PRAM algorithms have potential practical relevance. Most promising are cost-optimal algorithms with minimal running time.
Algorithms for interconnection networks are mostly analyzed using the interconnection-network cost model, in which it is assumed that in each step each PU may perform an unlimited amount of communication and then route one unit packet to one or more of its immediate neighbors.
Most algorithms for this cost model deal with the various communication problems, rather than to solve a general problem, say maxflow. The reason for this is that the communication problems are already hard enough, and that harder problems can often be solved quite well by taking a parallel algorithm designed for a more powerful parallel computer model, and then to simulate the operations of this computer on the interconnection network. Such a simulation is composed of two components: an allocation of the data and processes to the PUs and performing the resulting routing operations.
Next to the various communication operations also some of the most basic problems have been studied for interconnection networks. These are sorting, selection, list ranking, matrix multiplication and the like. In a following chapter we will even see how to solve the all-pairs shortest-paths problem optimally on a two-dimensional mesh.
In hardware a complete interconnection network with one-port capability can be realized with a crossbar switch. For interconnecting P PUs, a crossbar switch consist of a grid of P x P switches. These switches have four connections, W, N, E, S. The switches have two settings, either connecting W with E and N with S or connecting W with N and E with S, which can be chosen dynamically. Setting the switches appropriately, point-to-point connections can be created. Notice that using a crossbar switch each PU can send a packet to at most one PU and receive a packet from at most one PU, but that the PU to which it is sending may be different from the PU from which it is receiving. Switch-based complete network topologies are very common in practice. But, because of the increasing delays and cost such systems are typically of moderate size.
A similar network is the circular array which is also called ring. The only difference with the linear array is an additional connection between PU_0 and PU_{n - 1}. This connection is often designated as wrap-around connection. The degree is 2, the same as before. However, the other measures are better: the diameter is round_down{n / 2} and the bisection width is 2. This implies that most problems can be solved on a circular array twice as fast as on a linear array.
An n_1 x n_2 mesh consists of n_1 * n_2 PUs arranged in a grid. Each PU is connected to its up to four neighbors. As a graph, a mesh is the direct product of two linear arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter n_1 + n_2 - 2 and the bisection width is min{n_1, n_2}.
An n_1 x n_2 torus is a mesh with wrap-around connections: there are additional connections between PU_{i, 0} and PU_{i, n_2 - 1} for all i, 0 <= i < n_1, and between PU_{0, j} and PU_{n_1 - 1, j}. As a graph, a torus is the direct product of two circular arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter round_down(n_1 / 2) + round_down(n_2 / 2) and the bisection width is 2 * min{n_1 , n_2}. On tori most problems can be solved twice as fast as on meshes.
There are also higher-dimensional meshes and tori. A d-dimensional n_1 x ... x n_d mesh consists of n_1 * ... * n_d PUs arranged in a d-dimensional grid. Each PU is connected to its at most 2 * d neighbors. The degree is 2 * d, the diameter is n_1 + ... + n_d - d and the bisection width is n_1 * ... * n_d / min{n_1, ..., n_d}. The definition and properties of d-dimensional tori are analogous.
The major weakness of a tree network is its very small bisection width. This means that any problem requiring a single rearrangement of the data has at least linear running time. For such problems a tree is not better than a linear array. Therefore several improved trees have been designed. The simplest is the fat tree. In a fat tree the connections between level i and level i + 1, where the leafs are considered to lie at level 0, have capacity 2^i. This means that such a connection can transfer 2^i packets instead of 1.
Alternatively tree structures have been integrated in other networks. An important example of such a network is the mesh of trees. An n x n mesh of trees is constructed from an n x n grid of not-connected nodes. The PUs in each row and column of this grid are connected by a tree. There are no direct connections between the row trees or between the column trees. The row and column trees are connected at their shared leafs. An n x n mesh of trees has 2 * n * (2 * n - 1) - n^2 = 3 * n^2 - 2 * n nodes.
A k-dimensional cube-connected cycle, abbreviated CCC, network is obtained by taking a k-dimensional hypercube and replacing each node by a cycle with k nodes. Node d, 0 <= d < k, in each cycle takes over the connection of the hypercube along the d-direction. More precisely, the b * 2^k nodes of the CCC are indexed with pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k), (i, (d - 1) mod k) and (i ^ 2^d, d). Here i ^ 2^d denotes i + 2^d if bit d of i is 0 and i - 2^d if this bit is 1. The degree is 3, the diameter is round_down(5 * k / 2) - 2 for k >= 3. A CCC is the simplest network which combines a logarithmic diameter with constant degree. The bisection width is the same as in the corresponding hypercube, 2^{k - 1}.
A k-dimensional butterfly network is similar. The k * 2^k nodes are indexed as pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k) and (i, (d - 1) mod k). Furthermore there is an undirected edge from (i, d) to (i ^ 2^d, (d + 1) mod k). So, the degree is 4. The diameter is round_down(3 * k / 2). Because many router chips provide four connections, and because of the good routing properties of butterflies these networks have been used quite a lot in practice.
Cayley graphs are much nicer than just connected and undirected: they are even node symmetric. A graph is said to be node symmetric, if for any pair (u, v) of nodes there is an automorphism phi = phi(u, v) of the nodes so that if (x, y) is an edge that then also (phi(x), phi(y)) is an edge. For Cayley graphs, these automorphisms are easy to construct: determine elements e_1, ..., e_k so that e_1 * ... * e_k * u = v, then phi maps any x in V to x * e_1 * ... * e_k. If (x, y) in E, this means that y = x * e for some e in S. Thus, (phi(x), phi(y)) = (phi(x), phi(x * e)) = (phi(x), e_1 * ... * e_k * (x * e)) = (phi(x), (e_1 * ... * e_k * x) * e) = (phi(x), phi(x) * e) in E.
Several of the graphs we have encountered before are Cayley graphs: circular arrays, tori, hypercubes and CCCs, though these can be presented in a more natural way. We describe how hypercubes can be obtained as Cayley graphs. For H_k, the k-dimensional hypercube, is obtained from a subgroup of the permutation group of 2 * k elements. There are k generators each having two consecutive elements exchanged. To make the description more concrete we only consider the special case k = 4. The set S of generators consists of (10234567), (01324567), (01235467) and (01234576). The subgroup consists of all elements obtained by starting with (01234567) and letting these elements work. If in the conventional indexing of the hypercube we find a 1 at position i, 0 <= i < k, the elements 2 * i and 2 * i + 1 in the indexing with a permutation are exchanged. For example, (0110) corresponds to (01325467).
For interconnection networks, there is a trade-off between quality on the one hand and complexity and scalability on the other.
An embedding of an interconnection network IN_1 with P_1 PUs into a network IN_2 with P_2 PUs is a mapping phi() of the PUs of IN_1 to the PUs of IN_2. Generally it is part of the definition that phi() should be injective. Not only the PUs should be mapped, but for any two PUs PU_i and PU_j in IN_1 connected by a link there should also be specified a path in IN_2 running between phi(PU_i) and phi(PU_j). There are three cost measures that are used when specifying the quality of an embedding:
As a first example we consider the embedding of a torus into a mesh. This embedding is obtained by first blowing up the n x n torus to a 2 * n x n torus and folding it double along the middle row, so that one PU comes in each row, and then repeating this for a folding along the middle column. The following picture illustrates the process. Notice that each pair of PUs that previously was adjacent now is at distance two at most: the dilation is two. Also, are there at most two paths mapped to any link of the mesh: the congestion is two. For the special case of dilation and congestion both equal to two, any step of the embedded network can be simulated in two steps. So, any problem which on an n x n torus can be solved in T communication steps can be solved on a mesh in 2 * T steps.
Another important example is the embedding of a binomial tree with n = 2^k nodes in a hypercube. The construction is simple: a binomial tree with 2^k nodes, T_k consists of two trees T_{k - 1} whose roots are connected. A k-dimensional hypercube consists of two (k - 1)-dimensional hypercubes with connections between all corresponding nodes, in particular between the nodes (0, 0, ..., 0) and (1, 0, ..., 0). So, the embedding can be performed recursively. The dilation, congestion and expansion are all 1: the graph of the binomial tree is a subgraph of the hypercube, obtained by removing edges.
Two types of embeddings are particularly important: the embeddings of the linear and circular array. The embedding of a linear array, which is nothing but the construction of a Hamiltonian path, is important because it provides a way of indexing the PUs so that any two PUs with consecutive indices are adjacent in the network. Because many algorithms contain steps in which data must be exchanged to PUs with consecutive indices, this is a very desirable property. A systematic way of indexing the PUs of an interconnection network is called an indexing scheme. The embedding of a circular array, which is nothing but the construction of a Hamiltonian cycle, is important because algorithms may even require data to be shifted in a cyclical way.
In any grid it is easy to construct a Hamiltonian path. A two-dimensional mesh is Hamiltonian, which means that it has a Hamiltonian cycle, iff at least one of the sides has even length. Next to the indexing schemes obtained by embedding linear and circular arrays there are several other schemes for indexing the PUs of a grid. The most important is the so-called row-major indexing scheme. In this scheme, PU_{i, j} in an n x n mesh, which is located in row i and column j, is given index i * n + j. Mostly the rows and columns are numbered starting from the upper-left corner as in a matrix. In the column-major indexing scheme, this PU has index i + j * n. The indexing obtained by embedding a linear array, starting at the upper-left corner, then always going along the rows first, is called the snake-like row-major indexing. There are also so-called shuffled indexing schemes. The most important is the shuffled row-major indexing scheme. In this scheme, PU_{i, j}, for numbers i and j with binary expansions given by (i_{k - 1}, ..., i_i, i_0) and (j_{k - 1}, ..., j_1, j_0) respectively, has an index with binary expansion (i_{k - 1}, j_{k - 1}, ..., i_1, j_1, i_0, j_0).
On a hypercube a Hamiltonian path can be constructed by exploiting the recursive structure of the network. H_k denotes the k-dimensional hypercube. The traversal of all PUs of H_k is obtained by gluing together two traversals of H_{k - 1}. If the gluing is done correctly, this even gives a Hamiltonian cycle. The schedule is known as Gray code. It can be described by giving the order in which the PUs are visited. It is easy to verify that in the following sequences all consecutive PUs, including the first and last, are adjacent:
for H_1 we have {0, 1},
for H_2 we have {00, 01, 11, 10},
for H_3 we have {000, 001, 011, 010, 110, 111, 101, 100}.
How to generalize the above sequences? Look at the sequence of H_3. It starts with the sequence of H_2 in which all indices have an additional leading 0. Then this same sequence appears once more in reversed order in which all indices have an additional leading 1. More generally, denoting the Hamiltonian sequence for H_{k - 1} by S_{k - 1}, the Hamiltonian sequence S_k for H_k is the concatenation of 0-S_k and 1-reversed(S_k). Here, for a sequence S and b in {0, 1}, b-S denotes the sequence obtained by preceding all elements in S by a b.
Lemma: The sequences S_k defined by S_1 = {0, 1} and S_k = 0-S_{k - 1} + 1-reversed(S_{k - 1}), where + denotes concatenation, give Hamiltonian enumerations of the PUs of the k-dimensional hypercubes H_k.
Proof: The proof goes by induction. For H_0 the construction is clearly correct. So, assume it is correct for H_{k - 1}. If S_{k - 1} is a Hamiltonian cycle, then so is reversed(S_k). The sequences 0-S_{k - 1} and 1-reversed(S_{k - 1}) are paths through H_k because of the recursive way in which H_k is constructed out of two connected copies of H_{k - 1}. It remains to check that these two paths fit together. The first element of S_{k - 1} and the last element of reversed(S_{k - 1}) are the same. So, the first element of 0-S_{k - 1} and the last element of 1-reversed(S_{k - 1}) differ in exactly one position, the leading one, and are therefore the corresponding PUs are connected n the hypercube. End.
Embeddings provide a major way of deriving algorithms for one topology from algorithms designed for another topology. Embeddings of linear and circular arrays lead to indexings for which consecutively indexed PUs are adjacent.
Each of these operations has its own importance. All-to-all routing is the pattern encountered when there is no specific pattern in the data access: the PUs compute a while on the data they have and by doing so accumulate questions to be sent to other PUs. Unless the algorithm was designed in a specific way, these questions will be more or less randomly distributed over the PUs, giving rise to a more or less balanced all-to-all routing pattern. A permutation-routing pattern arises in algorithms that where designed in such a way that the questions are not scattered over the data set but concentrated. Such a structure is desirable because it reduces the number of start-ups. Broadcasting is a very common operation performed to distribute information from a source PU to a (subset) of PUs that must work on the same data. Typically this happens at the beginning of a computation. The symmetric operation is called gathering. Gathering means that information from all PUs is combined to a single information in some specified PU. For example, this may be done in order to sum up the computed results. Gossiping is used to make local samples of the information available to all PUs. It is also used to synchronize a database and in numerical communications. Because gossiping is a very expensive operation, it will normally be performed only for a small subset of the total data set. Another much studied problem is k-k routing, in which each PU is the source and destination of at most k packets. Both permutation routing and all-to-all routing are special cases of k-k routing. In practice general k-k routing patterns are rarely encountered.
How good is the greedy algorithm? It is easy to see that during phase 1 the packets run without any delay: the packets moving rightwards do not change there relative distances as long as they are moving and therefore they will never encounter each other. The same is true for the leftwards moving packets. So, phase 1 takes as long as the longest distance a packet has to go. The maximum distance that may have to be covered in a row is n - 1. As a result of the routing in phase 1, the packets may have piled up in some PUs. For example if the permutation is the transposition, in which PU_{i, j}, 0 <= i, j < n, sends a packet to PU_{j, i}, then n packets reside in all diagonal PUs at the end of phase 1. So, the queue size, that is, the maximum number of packets that may have to be stored in a single PU at the same time, may be as high as n. Thus, during the second routing phase there may be conflicts. However, due to the applied farthest-first strategy, can a packet which has to move upwards to a PU at distance d from the boundary of the mesh be delayed at most d times. Because the distance this packet has tot travel is bounded by n - d - 1, it will arrive no later than in step n - 1. So, even phase 2 will be completed in n - 1 steps, and therefore the whole algorithm is completed in 2 * n - 2 steps which is optimal. Much more sophisticated algorithms have shown that it is possible to achieve the same time with constant queue size.
In general we hope that permutation routing can be solved in a time equal to or only slightly larger than the diameter. In any case we would like to route permutations in O(diameter) time. The above algorithm for k-dimensional hypercubes achieves this for most permutations, the expected time is O(k), but there are a few permutations for which the performance is unacceptable. Further down we will see how to overcome this problem.
The given algorithm is uni-axial, that is, in any step either only horizontal or only vertical connections are used. If the mesh has full-port capacity, then this is wasteful. However, it is often possible to overlay two uni-axial algorithms. In this particular case the solution is very simple: all packets going from PU_{i_x, i_y} to PU_{j_x, j_y} with i_x + i_y + j_x + j_y even are routed as described, while the other half of the packets are routed first along the columns and then along the rows. Said otherwise: if the PUs are colored black and white as on a chess board, all packets routed between fields of the same color are routed xy, while the others are routed yx. Each of the problems involves half as many packets as before and therefore the whole routing takes only n^3 / 4 now, which is optimal.
This coloring idea is very useful, because it allows to concentrate on designing uni-axial algorithms. This strongly facilitates the design. Of course this idea is not limited to two-dimensional meshes, but can be applied on any network in which separate dimensions can be distinguished, most notably meshes and tori of any dimension and hypercubes.
In the context of all-to-all routing, a k-dimensional hypercube with n = 2^k PUs can be perceived as a k-dimensional mesh with side length equal to two. The uni-axial algorithm runs in k * 2^{k + 1} / 4 = k * n / 2. Overlaying k of these algorithms each taking care of a fraction 1 / k of the packets, the routing time can be reduced to n / 2.
T_one(P, l) = c_v * l * (P - 1) + c_a * (P - 1).This is P - 1 times the time for routing a permutation, which is not surprising because the all-to-all routing is realized by decomposing it into permutations. Of course this works only if all the packets have (approximately) the same length. Otherwise a more sophisticated algorithm should be applied which is presented further down.
Even if all packets have the same size, then it is not necessarily true that sending the packets directly to their destinations is the best one can do. This may sound surprising but is due to the relatively large number of start-ups of the given algorithm: if l is small, then the term c_a * (P - 1) will dominate and it may be profitable to reduce it even if the first term increases because of this.
As an alternative consider the simulation of the algorithm for a two-dimensional mesh. All PUs are indexed as in a mesh with a pair of indices (i, j), 0 <= i, j < sqrt(P). The algorithm consists of two phases. In phase 1, all packets are rearranged within the "rows", in the second phase within the "columns". More precisely, in step t, 1 <= t < sqrt(P) of phase 1 PU_{i, j} routes to PU_{i, (j + t) mod sqrt(P)} all packets which have destination in any PU_{i', (j + t) mod sqrt(P)}, with 0 <= i' < sqrt(P). In step t, 1 <= t < sqrt(P) of phase 2 PU_{i, j} routes to PU_{(i + t) mod sqrt(P), j} all packets which have destination in any PU_{(i + t) mod sqrt(P), j'}, with 0 <= j' < sqrt(P). So, the algorithm consists of 2 * (sqrt(P) - 1) steps in each of which l * sqrt(P) data are routed. This gives
T_two(P, l) = c_v * 2 * l * (P - sqrt(P)) + c_a * 2 * (sqrt(P) - 1).For large l this algorithm is almost twice as slow, but it may be up to sqrt(P) / 2 times faster for very small l.
Of course we do not have to stop here. If the packets are really small, then the algorithm for a three-dimensional mesh can be simulated, or in the extreme case, the whole routing can be performed with just log(P) routing steps by simulating the hypercube algorithm. Doing this, in each step half of the data a PU holds must be routed. This gives
T_log(P, l) = c_v * log(P) * l * P / 2 + c_a * log(P).
The algorithms with several phases also require that in between two phases the received packets are decomposed into their components and recomposed for the next routing. With appropriate bucketing all this can be performed in a time that is linear in the amount of involved data. Nevertheless, if c_i is not so much smaller than c_v, as it is the case on the best parallel computers, this gives a non-negligible delay. More generally, reducing the number of start-ups by multiplying the routing volume by a factor should be limited to subproblems involving only a small fraction of the total data set. Otherwise, considering that c_v > c_i, no reasonable speed-up can be expected, and it would be better to reduce the number of processors used in order to get packets of a more substantial size.
On a two-dimensional n x n mesh, the information is first routed within the row of the source PU and then within all columns. If the source is not PU_{0, 0}, then the routing in each of the phases should start in the direction in which the distance to cover is largest. The algorithm requires 2 * n - 2 steps at most, which is optimal.
On higher-dimensional meshes and hypercubes we can apply the dimension-ordered generalization of the above algorithm. On a hypercube this means that the routing is performed along an embedded binomial tree. On a k-dimensional hypercube the algorithm takes k steps, which is optimal.
T_hypercube(P, l) = c_v * l * log(P) + c_a * log P.
Surprisingly this is mostly not the best. If l is slightly larger, then v * l * log(P) is a large number. In that case it is better to use an algorithm which is similar to that for the linear array. The information is divided in a large number f of flits. Flit i, 0 <= i < f, is routed in step i from PU_0 to PU_1, and then on in a linear way. It takes f + P - 2 steps before flit f reaches PU_{P - 1}. This gives
T_linear(P, l) = c_v * l / f * (f + P - 2) + c_a * (f + P - 2).For sufficiently large f, the first term converges to c_v * l, which is optimal. Of course this produces many more start-ups.
For intermediate values of l, there are intermediate algorithms. A simple idea is to use an algorithm similar to the one for a two-dimensional mesh: first broadcasting in a pipelined way within the rows, then within the columns.
Lemma: Gossiping in the unit-cost telephone model on a network with P PUs and diameter D requires at least max{D, round_up(log(P)) + odd(P)}. Here odd(x) = 1 for odd x and 0 otherwise.
Proof: To cover a distance d, at least d rounds are required. After t >= 0 rounds any information is known in at most 2^t PUs, because in any round the number of informed PUs can at most double. If the number of PUs is odd, then there is at least one PU which remains unmatched in the first round. So, the information from this PU is known in at most 2^{t - 1} PUs after t > 0 rounds. This means that for t <= round_up(log(P)) + odd(P) - 1, the number of informed PUs is at most 2^{round_up(log(P)) - 1} < P. End.
Gossiping on d-dimensional meshes is easy as long as all side lengths are even. In that case the gossiping can be performed in a dimension-ordered way. On an n x ... x n mesh for n even, this takes d * (n - 1) rounds, which is optimal. For odd n, it is always possible to perform the gossiping in d * n rounds, but this is not necessarily optimal. For example, on a 3 x 3 mesh gossiping can be performed in 5 rounds. This is optimal, because for a 3 x 3 mesh, the number of PUs P = 9, and thus round_up(log(P)) + odd(P) = 5.
For small networks gossiping schedules can be described very concisely and convincingly by a series of pictures. The nodes are subdivided according to the shape of the network, marking in each subdivision whether the information from the PU at the corresponding position is already known or not. The used matchings can be highlighted.
The unit-cost model is not the only possible cost model. More flexible is the linear-cost model in which it is assumed that a data exchange between two adjacent PUs in which each sends at most l informations takes c_v * l + c_a time. In principle it is not even necessary that all PUs remain connected equally long. However, the more freedom there is, the harder it becomes to design good algorithms. Therefore it is still useful to perform the gossiping by a sequence of matchings. The cost of a gossiping schedule is given by the number of rounds, the number of consecutive matchings used, and the number of steps, the sum over all rounds of the number of transferred packets.
We consider again the above schedule for gossiping on a 3 x 3 mesh. It consists of 5 rounds which was shown to be optimal. The number of steps is 1 + 2 + 4 + 4 + 7 = 18. This is far from optimal. It has been shown (masters thesis of Rene Beier) that 11 steps is the minimum achievable for any schedule with 5 rounds, and that in general at least 10 steps are required. Both bounds can be matched. We see here, and that is a general phenomenon, that there is a trade-off between rounds and steps. So, in general the best gossiping algorithm depends on the ratio c_a / c_v. Only for some very regular networks, such as linear arrays and hypercubes, the number of rounds and steps can be minimized at the same time. In that case there is a unique optimal gossiping schedule.
T_hypercube(P, l) = c_v * l * (P - 1) + c_a * log P.This is optimal: the number of start-ups is minimized, and the routing volume equals the amount of data each PU must receive. For large P the start-up costs will become negligible.
Another communication operation is gather. In a gather each PU initially has l individual data. The task is to gather all this information in PU_0. This is not the inverse of a broadcast, because in a broadcast finally all PUs know the same information. Clearly this operation requires at least log P start-ups and because l * (P - 1) data must be transferred into PU_0. So, on a DMM gathering has the same lower bound as gossiping and can be solved optimally by applying the hypercube gossiping algorithm.
The packets are denoted p_i. s_i is the index of the source PU of p_i, d_i is its destination PU.
Notice that the mapping s_i -> x_i is not a permutation. Alternatively PU_0 might select the index of a permutation and broadcast it to all PUs in k steps, but this is a waste of time. The analysis of Valiants algorithm is not hard but beyond the scope of this chapter. Both routing phases have the same time distribution, because phase 2 can be viewed as a reversal of phase 1: in phase 2 the packets are moving from a random position to a fixed position. It can be shown that each of these phases takes k + o(k) time with high probability and that the queues remain small as well. So, the whole algorithm essentially 2 * k + o(k) = O(k) steps with high probability. Using randomization all permutations become equally hard, there are no longer good and bad permutations. The identity has the same expected time as a transposition.
A disadvantage of the above approach is that we loose a factor two for all those permutations for which even the greedy algorithm performed well: the expected time of the greedy algorithm is k + o(k), which is optimal up to the lower order term. A nice idea helps to make the randomized algorithm optimal while still making it highly unlikely that much more than the expected time is required. The algorithm is the same as before, with one small change. After selecting x_i, the length of the path s_i -> x_i -> d_i is computed. If this path is longer than k, we use comp(x_i) as intermediate destination instead of x_i. Here comp() is the operation which in a binary number complements all bits. The maximum distance any packet has to travel is now bounded by k: if dist(s_i, x_i) + dist(x_i, d_i) > k, then dist(s_i, comp(x_i)) + dist(comp(x_i), d_i) = k - dist(s_i, x_i) + k - dist(x_i, d_i) = 2 * k - dist(s_i, x_i) + dist(x_i, d_i) < k. Vöcking has shown that not withstanding the reduced randomness of the intermediate destination, the delay due to congestion is still limited to o(k) with high probability.
Consider an all-to-all routing. Assume that in total each PU is sending and receiving the same number of packets, but that not each PU is sending the same number of packets to each other PU. Then we can first send each packet to a random intermediate destination like in the hypercube algorithm. This replaces a possibly unbalanced pattern by two balanced all-to-all routing patterns. A refinement is to first determine the degree of unbalance and to randomize only as much as necessary.
For the analysis it is necessary to understand how much time the routing on the one-dimensional subarrays takes. Without a proof we give the central lemma in this domain:
Lemma: For a routing pattern on a linear array with n PUs, let r_{i, j}, 0 <= i < j < n, denote the number of packets that must travel from a PU with index i or smaller to a PU with index j or larger. When applying the farthest-first strategy, the rightwards moving packets need exactly max_{i < j} {r_{i, j} + j - i - 1} steps to all reach their destinations.
For most distributions the maximum is either assumed for i = 0, j = n - 1, or for i = n / 2 - 1, j = n / 2. The maximum time for a regular pattern like a k-k routing is always given by the maximum of these bounds. This simplifies the analysis a lot.In the above case we are routing random k-k patterns. The maximum distance is n - 1, and in a randomization about half of the packets want to cross the bisection. Thus, each of the phases takes max{n, k * n / 4} + o(k * n). This is not yet good, but we can exploit that the given routing is uni-axial. As for the all-to-all algorithm above, two of these algorithms can be overlaid. To make this effective, the packets must somehow be split in two. In a randomized context it is natural to use randomization for this as well. So, each packet is colored independently white or black with probability 1 / 2. The white packets are routed xy, the black packets yx. In this way we do not exactly obtain two (k / 2)-(k / 2) patterns, but that does not matter. The important point is that now in each routing phase, the expected number of packets crossing the bisection of a one-dimensional subarray is k * n / 8. Using Chernoff bounds, it may be shown that all deviations from the expected value are small with high probability.
Theorem: On a two-dimensional n x n mesh, k-k routing can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.
The above result is optimal for all k >= 8, because there the k * n / 2 bisection lower bound is matched up to lower-order terms. For k = 1, 2 k-k routing can be performed in 2 * n + o(n) time. This is optimal because it matches the distance bound, the lower bound which is obtained by considering how far the packets have to travel. For 2 < k < 8 it is not known how many steps are exactly needed for k-k routing. For example, 4-4 routing can be performed in 3 * n + o(n) steps, but it is not clear whether this is optimal or not.
Assume we are performing an all-to-all routing on a DMM with P PUs, each PU sending and receiving m numbers in total, l = m / P packets traveling between any pair of PUs on average. Each PU sorts the numbers according to the index of their destination PUs. Using bucket sort this takes O(m) time. A number residing in PU_i with rang r is sent to PU_{(i + r) mod P} as intermediate destination in a first routing phase and in a second routing phase to its destination. This is a very simple algorithm. Creating the packets to be sent to each PU may be even faster than in the randomized algorithm because the calls to the random number generator are saved.
How good is the resulting distribution? Clearly in the first routing phase the packets are almost perfectly balanced, the number of packets going from one PU to another is either round_down(m / P) or round_up(m / P). In phase 2 the imbalance is larger. Denote by a_{i, j} the number of packets originating in PU_i with destination in PU_j and by b_{i, k, j} the number of packets which are send from PU_i via PU_k to PU_j. For arbitrary k and j we want to put an upper bound on sum_i b_{i, k, j}. The regular distribution gives b_{i, k, j} <= round_up(a_{i, j} / P). This gives
sum_i b_{i, k, j} <= sum_i round_up(a_{i, j} / P) <= P + sum_i a_{i, j} / P = P + m / P.So, if m = omega(P^2), the imbalance is negligible. For the small values of P that currently are common, this is no problem, but in general this is quite a strong condition.
On n x n meshes any operation in submeshes of size o(n) takes only o(k * n) time and is therefore cheap in comparison to the Theta(k * n) time for solving most problems. Here k denotes the number of packets per PU. So, on meshes we can apply the same idea, but the allocation of intermediate destinations may be coordinated in submeshes of quite considerable size. Thereby the rounding errors can be made of minor importance even for small k. As a result, the same result that was achieved above can also be obtained deterministically.
The idea of coordinating the rounding decision also works on DMMs. Due to rounding errors, each of the rounds of phase 2 might be delayed by c_v * P, in total c_v * P^2. If first all PUs are subdivided in groups of size sqrt(P), these PUs can exchange their values b_{i, j}. These are small gossip operations, costing c_v * P^{3 / 2} + c_a * log(P) / 2 time. Then they can optimize their rounding, reducing the loss in phase 2 to c_v * P^{3 / 2}. So, now we must have m = omega(P^{3 / 2}). This is much better, but still not as good as for the randomized algorithm, for which the rounding errors are negligible for any m = omega(P * log P). Of course, in practice it is quite unlikely that the deterministic algorithm has maximal bad luck. Even though we might not assume that all a_{i, j} are equal, it is quite reasonable to assume that the rounding errors are more or less randomly distributed.
First we construct a bipartite graph with 2 * n nodes. The sets V_1 and V_2 each have n nodes, corresponding to the source and destination rows of the packets, respectively. There is an edge from node i in V_1 to node j in V_2 if there is a packet moving from row i to row j. Because there are exactly n packets starting and finishing in each row, the graph is n-regular (all nodes have degree n). Thus, according to Hall's theorem, the graph has an edge coloring with n colors. Because this is an online algorithm, the existence of such a coloring is the only thing that matters, but actually such colorings can be computed very efficiently. That is, there is an assignment of values, which are called colors, ranging from 0 to n - 1, so that the edges incident on a node all have different colors.
These colors are used in the routing algorithm. Denote the packets by p_i and let x_i be the color attributed to the edge corresponding to this packet.
It remains to check that the algorithm is as good as claimed above. For the analysis of phase 1 we consider the routing in an arbitrary row. Originally each PU holds one packet. Because the edges corresponding to these packets are all incident on the same node of the bipartite graph, they all have different colors, and therefore no two packets will be routed to the same PU. For the routing in phase 2, we consider the routing in an arbitrary column. Because of the previous result, initially each PU holds one packet. If two packets in the column would move to the same row, then two packets with destination in the same row would have gotten the same color, which is in contradiction with the coloring which colors all edges incident on the node in V_2 corresponding to this row differently. Therefore, at the beginning of phase 3, all PUs hold one packet. In each row each PU is the destination of one packet because the complete pattern is a pattern, and in phase 3 the packets are routed to their destinations. So, the algorithm consists of three permutations within one-dimensional subarrays. Therefore, each phase can be executed in n - 1 steps, 3 * n - 3 steps in total, and never a PU is storing more than one packet.
The queue size is optimal, the time is not. It is still open whether it is possible to route all permutations in 2 * n - 2 steps with queue size 1. A few steps more and queue size 2 is possible.
Let A = B x C be a product network. B has n_B nodes and C has n_C nodes. A schedule for routing a permutation on A can be computed offline by first constructing a bipartite graph with node sets V_1 and V_2, each with n_C nodes. Node c, 0 <= c < n_C, in V_1 and V_2 corresponds to the subnetwork B x c in A. The bipartite graph has n_A = n_B * n_C edges. For a packet traveling in A from a node (b_1, c_1) to a node (b_2, c_2) there is an edge from node c_1 in V_1 to node c_2 in V_2. Thus, the bipartite graph is n_B-regular and can be colored with n_B colors. Packet i is denoted p_i, it is traveling from (b_i, c_i) to (b'_i, c'_i), and its color is x_i, 0 <= x_i < n_B. The routing is performed as follows:
Theorem: Permutation routing on a product network A = B * C can be performed in T_A = 2 * T_B + T_C steps, where T_B and T_C denote the number of steps for routing permutations on B and C, respectively. At any time a PU is storing at most one packet.
Proof: Phase 1 is a permutation routing in subnetworks with the structure of B, because for any subnetwork B x c, all destinations occur exactly once due to the coloring of the edges incident on the node in V_1 corresponding to this subnetwork. The coloring of the edges incident on the nodes of V_2 assures that phase 2 is a permutation routing in a network with the structure of C. Phase 3 is again a permutation routing within subnetworks with the structure of B because at the beginning each PU holds one packet, and because the whole pattern is a permutation, each PU is also the destination of one packet. End.
If T_B != T_C, then it is profitable to arrange things so that T_B < T_C. The above construction immediately gives results for d-dimensional meshes. The best is to perceive a d-dimensional n x ... x n mesh as the direct product n x (n x ... x n). Using induction this leads to
Corollary: Permutation routing on a d-dimensional n x ... x n mesh can be performed in (2 * d - 1) * n steps. At any time a PU is storing at most one packet.
The algorithm is simple: a bipartite graph with two sets of P nodes is created. For a packet moving from PU_i to PU_j there is an edge from node i in V_1 to node j in V_2. This gives an m regular graph which can be colored with m colors. Each PU first routes all packets with color x to PU_{x mod P}. In the second routing phase the packets are routed to their destinations. Because of the coloring property for the edges incident on the nodes of V_1, in the first round each PU sends exactly l packets to each PU. Because of the coloring property for the edges incident on the nodes of V_2 each PU receives one packet from each color. These stand perfectly distributed after the first routing.
Clearly for a problem like permutation routing, which is not particularly hard, it appears outrageous to first solve a bipartite graph-coloring problem of a graph with one edge per packet. This problem is far more complicated then the routing itself. However, we can solve a strongly reduced problem in o(n) time, while still obtaining almost as good performance as with the offline algorithm. This idea of reducing the information content of a problem is called sparsification and is useful beyond the scope of turning offline algorithms into online algorithms.
The approach is sketched for a two-dimensional n x n mesh. The mesh is divided in m x m submeshes. m = o(n) is quite large, one should think of m = n^{1 - eps}, for some suitable eps > 0. The packets going to each of these n^{2 * eps} submeshes are counted in each submesh. The bipartite graph does not have one node for each row, but one node for each of the n^eps row bundles. A further idea, which helps to reduce the size of the graph is to not create an edge for each edge but only for each superpacket. A superpacket consists of a large number s = n^alpha packets with common destination. Because we cannot assume that s exactly divides the number of packets traveling between a pair of row bundles, in general there will be one partially filled superpacket for each destination pair. So, in each submesh the graph to color has 2 * n^eps nodes and n^{2 - 2 * eps - alpha} + n^{2 * eps}. Choosing the parameters right, we can achieve the following:
After solving the localized and sparsified coloring problems, the routing cannot be performed so simply as it was done before. Each routing phases must be preceded by a rearrangement in the submeshes, so that the packets stand well-distributed. For example, preceding a horizontal routing phase, the packets must be arranged so that in each row of a submesh the number of packets going to any other submesh is about equal. When reaching a submesh, the packets are not routed to a specific PU, but rather to the first PU in the submesh which is still storing fewer than the maximum allowed number of packets. Together these ideas assure small queues and good routing times. Because the submeshes have size o(n), the cost of each operation is a lower-order term of the total routing time, and because there are only constantly many routing phases, their cost is negligible altogether.
The entire system consists of a number of PUs with conventional powerful arithmetic and storage and a number of switches. The switches are slightly more powerful than in a crossbar: they listen to the signals passing on the buses they are connected to and take out whatever is destined for them. A small number of values can be stored and the switches can be set conditionally as a result of a comparison between stored values. The whole system is assumed to be synchronized and the difference between receiving a signal or not receiving a signal can be detected.
Typically the switches are arranged in a rectangular grid, but in principle they might also be arranged differently. The PUs are arranged on the outside. The switches normally have four ports, denoted W, N, E and S. Sometimes it is assumed that these ports can be connected dynamically in any desirable configuration, sometimes the configurations are limited to a subset thereof. In the presented examples we will not need anything fancy. A switch setting is denoted by indicating which ports are connected. For example, WN-SE indicates the setting in which W is connected with N and S with E.
It is important to notice that the switch setting decomposes the grid into a number of subnets. Within these nets there is broadcasting possibility: as long as it is assured that only one of the connected PUs is sending, all connected PUs and switches will receive the information transmitted in the net. A system as described is called a reconfigurable processor array
One may argue that reconfigurable networks are physically not realistical. There are two major assumptions which give reason for skepticism:
The first point is a true limitation: speed of light is finite and the size of the reconfigurable network certainly increases with the number of switches. This is a limitation, but exactly the same assumption is made in all hypercube, butterfly and PRAM algorithms. Two-dimensional meshes (and to a lesser extend three-dimensional meshes) are the only interconnection networks that can be realized in physical space without scaling problems. In a two-dimensional lay-out of a hypercube with n PUs there are wires of length sqrt(n).
It is harder to estimate the importance of the second point. Radio exists, and the signal does not become noticeably weaker if the number of listeners increases. Even wire-bound broadcasting exists in the form of cable television. So, technologically, this is no problem either. Possibly the signal must be somewhat stronger than for a point-to-point transmission. Also one can imagine that especially branchings weaken a signal. In most algorithms for reconfigurable arrays, these can easily be avoided by performing the broadcasting in two steps.
The algorithm requires n + 3 PUs and 6 * n switches arranged in an 3 x 2 * n grid. In a first round PU_i informs the six switches in the columns 2 * i and 2 * i + 1 about its value. After receiving this value, the switches set their connections in preparation for the next round. If a 0 is received, a switch is set to WE-NS. If a 1 is received the switch must be set WN-SE if it is found in column 2 * i. If the switch is in column 2 * i + 1, then it is set to WS-NE, with exception of the switch in row 1, which is set to WE-NS. Then in the second round, a signal is entered in the net from the lower-left side. If it comes out at the right side in the same row, then the parity is even, if it comes out one row higher, the parity is odd. The PU on the right side receiving the signal can either output the conclusion, or broadcast it to all other PUs.
The input is a set of n numbers a_i, number i originally residing in PU_i. By taking also the index into account, we may assume that all values are different. The algorithm consists of two somewhat independent parts. In the first part, for each a_i its rank r_i within the set of numbers is computed. In the second part, a_i is routed to PU_{r_i}. Each of these operations can be performed in a constant number of steps. For each i there is a dedicated n x n grid of switches. The switches in column j of this network are set depending on a comparison of a_i with a_j similar to the construction in the parity algorithm: if a_j < a_i, a signal which is inserted at the lower-left corner is guided one row upwards. The signal finally leaves the network on the right side in row r_i of the subnetwork. Even when using only the three different settings which were used before, the whole algorithm can be implemented with just 6 steps.
In the section on "Using Randomization" a lemma was given specifying exactly the time for routing on a linear array with n PUs when using the farthest-first strategy.
On a circular array each number is routed from the PU where it starts in each direction for n / 2 steps. If in step t, 1 <= t <= n / 2, PU_i receives a number from PU_{(i - 1) mod n}, then it adds its value to the value it has if i >= t. If PU_i receives in step t a number from PU_{(i + 1) mod n}, then its value is added if i + t >= n.
The algorithm on the circular array runs twice as fast and minimizes the number of communication steps. However, this algorithm is not work-optimal: all PUs are busy all steps, so the total work is Theta(n^2).
int sum(int[] a, int n)
{
int d = n / 2;
while (d > 0)
{
for (i = 0; i < d; i++)
a[i] += a[i + d];
d /= 2;
}
return a[0];
}
For n = 2^k for some k >= 0 this computes the sum in k = log n passes
of the while loop. The operations of the for loop can be executed in
parallel. The work is given by the sum of the used d-values which is n
/ 2 + n / 4 + ... + 1 = n - 1. Because the scheduling on an EREW PRAM
is no problem, this gives a PRAM algorithm running in O(log n) time
with n / log n PUs, an optimal algorithm.
For the prefix sums problem there is an elegant recursive algorithm:
void prefix(int low, int n)
{
if (n > 1)
{
prefix(low, n / 2);
prefix(low + n / 2, n - n / 2);
for (int i = n / 2; i < n; i++)
a[low + i] += a[low + n / 2 - 1];
}
}
In a sequential context it makes no sense to formulate the prefix sums
computation this way, but this formulation immediately leads to a
parallel algorithm running in logarithmic time: T(n) = T(n / 2) + O(1).
The work can be estimated as W(n) = 2 * W(n / 2) + O(n), which gives
W(n) = O(n * log n). So, this algorithm is elegant but not optimal.
With a twist the work can be reduced to O(n). First consider an algorithm for solving the problem on a processor network with a tree topology consisting of 2 * n - 1 PUs arranged in log n + 1 levels with n / 2^t PUs at level t, 0 <= t <= log n. The algorithm consists of two main phases, the first going upwards, the second downwards. Initially a[i] is stored in PU_i at level 0. In step 0 of the upwards phase, each PU at level 0 routes its number to the PU at level 1 above it. In step t, 1 <= t < log n, of the upwards phase, a PU at level t routes to the PU at level t + 1 above it the sum of the values it received from the two PUs below it. In step t, 0 <= t < log n, of the downwards phase, a PU at level log n - t routes to the two PUs at level log n - t - 1: to the left it routes the value it received from the PU at level log n - t + 1 (which is supposed to be zero for t = 0); to the right it routes the sum of the value it received from above and the value it received from the left during the upwards phase. At level 0 PU_i adds the received value to a[i]. The whole algorithm consists of 2 * log n + 1 steps.
In the WT framework, it suffices to notice that there are 2 * log n + 1 time steps and that the total work is O(n), because each of the 2 * n - 1 PUs of the tree network is performing only O(1) work. Thinking about what is actually going on, we get the following sharper formulation:
void prefix()
{
for (int step = 1; step <= n / 2; step *= 2)
for (int i = step - 1; i < n; i += 2 * step)
a[i + step] += a[i];
for (int step = n / 4; step >= 1; step /= 2)
for (int i = 3 * step - 1; i < n; i += 2 * step)
a[i] += a[i - step];
}
The number of additions is n / 2 + n / 4 + ... + 1 in the first loop
and 1 + 3 + ... + n / 2 - 1 in the second. Because the scheduling on
an EREW PRAM is no problem, this gives a PRAM algorithm running in
O(log n) time with n / log n PUs, an optimal algorithm.
Click here to see this procedure integrated in a working program. The program also contains the sequential and the recursive algorithm. Running the program for large n shows that not withstanding the optimality, the parallel algorithm is considerably slower than the sequential one. One of the reasons is the much larger number of cache faults it generates. This can be overcome by a better organization. However, a second reason is that the algorithm is inherently more complex and performs twice as many additions.
T_par(P, n) = c_i * O(n / P) + (c_v + c_a) * log P.Similarly, for the prefix-sum problem
T_par(P, n) = c_i * O(n / P) + (c_v + c_a) * (2 * log P - 3).Because the routing volume is so small, this algorithm works even on networks with large c_v. The condition for getting good performance is c_a * log P << c_i * n / P. This is satisfied for P << c_i / c_a * n / log n, which is an exceptionally large number. Said otherwise, even relatively small problem can be solved efficiently.
void matrix_product(int[][] A, int[][] B, int[][] C, int n)
{
for (i = 0; i < n, i++)
for (k = 0; k < n, k++)
C[i, k] = 0;
for (i = 0; i < n, i++)
for (k = 0; k < n, k++)
for (j = 0; j < n, j++)
C[i][k] += A[i][j] * B[j][k];
}
It is handy to rewrite the algorithm as follows:
int sum(int[] c, int n)
{
int s = 0;
for (j = 0; j < n, j++)
s += c[j];
}
void matrix_product(int[][] A, int[][] B, int[][] C, int n)
{
int[][][] c = new int[n][n][n];
for (i = 0; i < n, i++)
for (k = 0; k < n, k++)
for (j = 0; j < n, j++)
c[i][k][j] = A[i][j] * B[j][k];
for (i = 0; i < n, i++)
for (k = 0; k n, k++)
C[i][k] = sum(c[i][k], n);
}
This immediately gives the following WT algorithm:
void matrix_product(int[][] A, int[][] B, int[][] C, int n)
{
int[][][] c = new int[n][n][n];
for all (i, k, j, 0 <= i, k, j < n) pardo
c[i][k][j] = A[i][j] * B[j][k];
for all (i, k, 0 <= i, k < n) pardo
C[i][k] = sum(c[i][k], n);
}
The algorithm consists of two parallel instructions, of which the
second requires further O(log n) instructions. The work is n^3 +
n^2 * O(n) = O(n^3). We have given here the transformation in several
steps and explicitly indicate which steps are to be performed in
parallel. In the following such straight-forward parallelizations will
not be worked out.
It is obvious how to schedule the algorithm on a PRAM with n^3 PUs. Taking P = W(n) / T(n) = n^3 / log n as suggested above, reduces the cost without substantially increasing the time. In that case C(P, n) = n^3 / log n * O(log n) = O(n^3). One may debate whether this should be called optimal or not: there are sequential matrix-multiplication algorithms running in o(n^3). On the other hand, are these not so terribly much better in practice, and the conventional version presented above is still used.
In our case the PU at position (i, k) computes C[i][k]. So, the data from A[][] and B[][] must be arranged so, that A[i][j] and B[j][k] reach PU_{i, k} at the same time. This is achieved when the arrival is delayed so, that A[i][j] reaches PU_{i, 0} in step i + j and then runs rightwards without delay, while B[j][k] reaches PU_{0, k} in step j + k and then runs downwards without delay. Thus, A[n - 1][n - 1] and B[n - 1][n - 1] enter the mesh in step 2 * n - 2 and reach PU_{n - 1, n - 1} in step 3 * n - 3. By then all other numbers have been processed as well. So, the whole computation takes 3 * n - 3 communication steps.
On a torus the systolic algorithm can be implemented easily: the data from A[][] which in the systolic algorithm reach column 0 in step t, 0 <= t < n, are positioned in column (n - t) mod n. The data from B[][] which reach row 0 in step t are positioned in row (n - t) mod n. More precisely, A[i][j] starts in PU_{i, (n - i - j) mod n} and B[j][k] starts in PU_{(n - j - k) mod n, k}. The A[][] data move rightwards using the wrap-around connections, the B[][] data move downwards. In all PUs there is something to do every step. So, after n - 1 communication steps, the matrix product has been computed. This is optimal because there are n^3 multiplications to make with n^2 PUs.
How to implement this algorithm on a mesh? On a practical system with worm-hole routing, it is never a problem when one packet must be routed a long distance. So, cyclical shifts can always be performed at almost full speed. In the store-and-forward model, however, it is harder. Just taking the above algorithm with the indicated mapping requires (n - 1) * (n - 1) routing steps. But, using the standard embedding of a torus on a mesh, matrix multiplication can be performed on the mesh in 2 * n - 2 steps. It is not clear that this is the optimal result for computing matrix products on meshes, but there are no obvious improvements as long as we insist that each PU holds only one value from A[][] and B[][] at a time.
The n^3 PUs of the three-dimensional n x n x n mesh are indexed with three subscripts as in the obvious way: PU_{i, j, k} denotes the PU in row i, column j and plane k. We assume that initially A[i][j] is stored in PU_{i, j, 0} and B[j][k] in PU_{0, j, k}. Then these data are routed so that each PU_{i, j, k} receives A[i][j] and B[j][k]. This routing is trivial: starting from PU_{i, j, 0} the number A[i][j] is routed along the third axis, starting from PU_{0, j, k} the number B[j][k] is routed along the column. Hereafter PU_{i, j, k} computes c[i][j][k] = A[i][j] * B[j][k]. All numbers that must be added together, that is all numbers c[i][j][k] with the same j, stand in the same row. In another n - 1 steps these can be added up to C[i][k] and routed to the PUs PU_{i, 0, k}. So, the whole algorithm takes 2 * n - 2 steps on an all-port machine and 3 * n - 3 on a single-port machine.
The last operation, computing C[i][k], can be performed in two ways:
If a single matrix product must be computed there is no difference. If, however, there are L pairs (A_l, B_l), 0 <= l < L, of matrixes to multiply, then using the first algorithm we still need at least n - 1 steps per product because the PU_{i, 0, k} are busy n - 1 steps adding the numbers they are receiving. The second algorithm can be used for a pipelined approach which requires only O(n + L) steps to compute all product matrices C_l, 0 <= l < L. Pipelining means that an excess of computing power is made useful by scheduling computations tightly after each other. This is another kind of parallelism, successfully used in any modern processor.
In our case, the numbers of A_l and B_l start being routed into the network at step l and come out of the network 2 * n - 2 steps later (assuming that the PUs can store O(n) numbers and all-port capacity). So, for computing L matrix products we need 2 * n + L - 3 steps. For L = omega(n), this is 1 + o(1) steps per product with n^3 PUs. On a two-dimensional mesh we needed 2 * n - 2 steps with n^2 PUs. The conclusion is that for computing many products the simple algorithm for three-dimensional meshes is asymptotically twice as efficient.
We are going to use the algorithm for a three-dimensional m x m x m mesh on the hypercube, for m = 2^l. PU_{i, j, k} is identified with the hypercube PU in which the numbers i, j and k are written out as binary k-bit numbers. The algorithm for the three-dimensional mesh consists of three routing operations. In the first information is sent from the PU_{i, j, 0} to all PU_{i, j, k}. Thus, information has to be spread from a single PU to all PUs in a one-dimensional submeshes. This submesh corresponds to an l-dimensional subhypercube. Using the connections of each dimension one after the other, it is trivial to perform this spread of information in l routing steps. In the second operation information is sent from the PU_{0, j, k} to all PU_{i, j, k}. This goes analogously, on a full-port machine it can even be performed in parallel with the first operation. The third operation consists of gathering and summing the information standing in all PU_{i, j, k} into the PU_{i, 0, k}. Again this can be done by performing l routing steps within l-dimensional subhypercubes. In step t, 0 <= t < l, the routing is performed along axis l + t, each PU which has a 1 at position l + t of its index sends a packet to the PU whose index is smaller by 2^{l + t}. The transferred value is the sum of the own value and all received values.
So, in the single-port model the whole algorithm can be implemented in 3 * l = log n steps. The connections along each axis are used exactly once. This last property implies that it is again trivial to pipeline matrix multiplications on a full-port machine: L products can be computed in log n + L - 1 steps.
T_par(P, n) = c_i * n^3 / P + c_v * n^2 + c_a * P.
Of the two communication terms, the contribution c_v * n^2 will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * n^2 << c_i * n^3 / P, which gives the condition
P << c_i / c_v * n.
T_par(P, n) = c_i * n^3 / P + c_v * 2 * n^2 / sqrt(P) + c_a * 2 * sqrt(P).
Of the two communication terms, the contribution c_v * 2 * n^2 / sqrt(P) will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * 2 * n^2 / sqrt(P) << c_i * n^3 / P, which gives the condition
P << (c_i / c_v * n)^2.
The second algorithm is much better, but still we cannot use arbitrarily many PUs. The above conditions show that the largest number of PUs that can be used in a sensible way is a very small constant fraction of n^2. Given that there is O(n^3) work to do, this means that, using this algorithm, T_par(infinity, n) = Omega(n). There is a clear discrepancy here between the bounds for the PRAM and the DMM. If one believes that the DMM model is the more realistic of the two, this implies the limited importance of the quest for polylog PRAM algorithms.
The Floyd-Warshall algorithm proceeds as follows. Here the matrix D[][] contains all direct distances. That is, D[i][k] gives the length of the edge from i to j. If there is no such edge, then D[i][k] = infinity. Finally D[i][k] contains the length of the shortest path from i to k.
void floyd_warshall(int[][] D, int n)
{
for (j = 0; j < n, j++)
for (i = 0; i < n, i++)
for (k = 0; k < n, k++)
if (D[i, j] + D[j, k] < D[i, k])
D[i, k] = D[i, j] + D[j, k];
}
Lemma: The Floyd-Warshall algorithm is correct.
Proof: The correctness is proven by inductively showing that after performing the second loop for all values in S_j = {l| 0 <= l <= j}, D[i, k] gives the length of the shortest path from node i to node k running through nodes in S_j. Initially this is true, and finally this implies the correctness of the algorithm. So, assume the claim holds for some j - 1. Then, the shortest path from node i to node k running through nodes in S_j can either run though node j or not. If it does not run through node j, then the shortest path is given by the shortest path through S_{j - 1}, which by the induction assumption has length D[i, k]. In the other case, the shortest path through S_j runs along a shortest path in S_{j - 1} from i to j and then along a shortest path in S_{j - 1} from j to k. The length of these two subpaths are according to the induction assumption given by D[i, j] and D[j, k] and because we assume that this path gives an improvement, we have D[i, j] + D[j, k] < D[i, k] and the assignment D[i, k] = D[i, j] + D[j, k] will be performed. End.
An alternative is to apply brute-force: the distances can also be computed by applying an algorithm originally designed for computing the transitive closure of a graph. The transitive closure of a graph G is the graph G^* having an edge from node i to node j iff node j can be reached from node i over a path in G. Of course this problem can be solved efficiently determining the connected components, but for a graph with n nodes one can also compute D^{n - 1}, where D is the adjacency matrix of G. Of course the matrix product here is not defined over the underlying ring of integers, but over the ring of booleans where "+" stands for "or" and "*" stands for "and". Modifying so that the entries of D[][] are integers giving the distances and not 0-1 values, and replacing "+" by "minimum" and "*" by "+", D^{n - 1} does not only give the existence of a path, but even the length of the shortest path. This can be proven by showing that if D^i and D^j give the lengths of the shortest paths using up to i and j edges, respectively, that then D^i * D^j = D^{i + j} gives the length of the shortest paths using up to i + j edges. Because we assume that all edge weights are non-negative, all shortest paths consist of at most n - 1 edges. Therefore, it is sufficient to compute D^{n - 1}.
Clearly, computing D^{n - 1} by starting with D^1 = D and then repeatedly multiplying with D leads to a correct but very inefficient way of doing. A much better idea is to compute D^n by suing D^{2^t} = D^{2^{t - 1}} * D^{2^{t - 1}}. In this way D^{n - 1} can be computed with k = round_up(log_2 (n - 1)) matrix products: 2^k >= n - 1 and because of the above argument, D^m = D^{n - 1}, for any m >= n - 1. This reduces the APSP problem to computing log n matrix products. The time for this is O(log^2 n) with O(n^3 * log n) work. On a PRAM, using n^3 / log n PUs, the APSP problem can be solved in O(log^2 n) time. This algorithm shows that APSP is in NC, but it is not cost-optimal.
The communication is performed according to the following rules:
Clearly the given algorithm is correct because of the condition that data are forwarded only as soon as D^(j)[i, k] has been computed and because eventually all D^(j)[i, j] become available to all PUs in row i and all D^(j)[j, k] to all PUs in column k. So, every PU can indeed compute all the values it must compute. The more interesting question is how long this takes.
Lemma: PU_{i, k} computes D^(j)[i, k] in step 3 * j + |i - j| + |k - j|.
Proof The proof goes by induction. The main induction goes over j. Secondary inductions go over i and k. The basis of everything is that PU_{0, 0} computes D^(0)[0, 0] in step 0. So, assume the lemma is true for i = 0 and some k, 0 <= k <= n - 2, then PU_{0, k} sends D^(-1)[0, 0] on to PU_{0, k + 1} which computes D^(0)[0, k + 1] in step k + 1. The same follows for D^(0)[k + 1, 0]. Now assume the lemma holds for j = 0 and all i and k with i < l or k < l, then it can first be proven for D^(0)[l, l] and then for all D^(0)[i, k] with i = l or k = l. Assuming the lemma has been proven for some value j and all i and k, it follows that D^(j - 1)[j, j] is computed in step 3 * j - 1. The values D^(j - 1)[j - 1, j] and D^(j - 1)[j, j - 1] have been computed in step 3 * j - 2 and reach PU_{j, j} at the end of step 3 * j - 1. So, PU_{j, j} computes D^(j)[j, j] in step 3 * j. This is the basis of the proof for the next j. End.
Theorem: The given implementation of the Floyd-Warshall algorithm on an n x n mesh solves the APSP problem in 5 * n - 4 steps.
Click here to see the above ideas integrated in a running program. In this simulation each PU has three arrays. For PU_{i, k} the array own[] indicates which values D^(j)[i, k] have been computed. The array hor[] indicates which horizontally moving information has been received, processed and forwarded. The array ver[] is the analogue for the vertically moving information.
The algorithm consists of n rounds. In round j, 0 <= j < n, the PU which is responsible for row j, this is PU_{round_down(j / (n / P))}, sends the information in row j of D[][] to all other PUs. This information is used to compute all values D^(j)[i][k]. The only question is how to route the information in a row from one PU to all other PUs. This is a broadcasting problem. Applying the algorithm using log P routing operations, the time consumption is given by
T_par(P, n) = c_i * n^3 / P + c_v * n^2 * log P + c_a * n * log P.
Of the two communication terms, the contribution c_v * n^2 * log P will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * n^2 * log P << c_i * n^3 / P, which gives the condition
P * log P << c_i / c_v * n.
T_par(P, n) = (1 + 4 * (sqrt(P) - 1) / n) * (c_i * n^3 / P + c_v * 4 * n^2 / sqrt(P) + c_a * 4 * n).
The algorithm is almost optimal. We may assume that n >> sqrt(P) and therefore the factor 1 + 4 * (sqrt(P) - 1) / n will lie very close to 1. The routing volume is considerably smaller than with the bundle-wise approach. Of the two communication terms, the contribution c_v * 4 * n^2 / sqrt(P) will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * 4 * n^2 / sqrt(P) << c_i * n^3 / P, which gives the condition
P << (c_i / c_v * n)^2.
If we are working on a parallel computer with exceptionally high start-up cost c_a, then we should try to reduce the number of start-ups. This can be done by not making n steps in which the value of j is increased by 1, but n / b steps in which j is increased by b. Here b is an arbitrary fraction of n / P. In each round the computation now takes c_i * b * n^2 / P and b * n / sqrt(P) data have to be sent to the neighbors. In total the algorithm now needs n / b + 4 * (sqrt(P) - 1) steps. This gives almost the same expression as before:
T_par(P, n) = (1 + 4 * (sqrt(P) - 1) * b / n) * (c_i * n^3 / P + c_v * 4 * n^2 / sqrt(P) + c_a * 4 * n / b).
The only disadvantage of this bundled-block-wise approach is that the leading factor becomes larger. In the extreme case that we take b = n / sqrt(P), there are only 5 * sqrt(P) - 4 steps, but the internal work is almost five times larger. A close to optimal choice of b can be found by equating the additional computation time with the start-up costs. Concentrating on the main contributions and leaving out the constant factors, this gives c_i * b * n^2 / sqrt(P) = c_a * n / b. This suggests to take b ~= sqrt(c_a / c_i * sqrt(P) / n).
The ideas in this whole section on the all-pairs shortest paths problem are for the largest part standard and known for a long time. New contributions on a better PRAM algorithm are given in a paper by Han, Pan and Reif (Algorithmica, 1997). Sibeyn has designed an external-memory algorithm which essentially requires the same number of internal operations as conventional matrix product, and not twice as many as alternative algorithms, including the presented algorithm above.
How long does the execution of the sum algorithm take if we let PU_i, 0 <= i < n / log n, perform all computations with an assignment to some a[j], for i * log n <= j < (i + 1) * log n?
Adapt this algorithm to an alternative bundle-wise DMM algorithm. Specify its time consumption and compare it with the above given DMM algorithms.
More interesting is the Floyd-Warshall algorithm. On a full-port mesh it runs in 5 * n - 4 steps. We now want to run it on a mesh on which in each step all PUs can communicate either only in the two positive directions or only in the two negative directions. In how many steps can the Floyd-Warshall algorithm be run in this model? It may be helpful to test ideas by playing around with the program which can be downloaded here.
T_par(P, n) = c_i * O(n * log n / P) + c_v * O(n * log P / P) + c_a * O(P).
The strong point of divide-and-conquer algorithms in general is that by their localizing nature the number of start-ups is low. For the above algorithm, the start-up costs are negligible if c_a * P << c_i * n * log n / P, that is when n * log n >> c_a / c_i * P^2. If we consider how large the typical problem is for which we will apply a parallel sorting algorithm, this condition will generally be satisfied.
The weakness of the parallel quicksort algorithm is that the routing volume is almost of the same order as the internal work: only when c_i * log n >> c_v * log P, the communication cost becomes negligible. This condition will rarely be satisfied, implying that the practical speed-up with parallel quicksort, though increasing almost linearly with P, will be small.
The complexity of the planar convex-hull problem is closely related to that of sorting. For a set of integer values A = {a_i| 0 <= i < n}, we can determine in linear time the minimum and maximum. Then in linear time the set A can be turned into a set of points S: a_i is mapped to s_i = (a_i, f(a_i)), where f(a_i) is the unique positive value so that s_i lies on the circle through the minimum and maximum. The convex hull of S consists of all elements. Because they must be presented ordered, solving the convex-hull problem can be used to sort. So, T_sort(n) <= T_convex_hull(n) + O(n).
The planar convex-hull problem can also be solved by using sorting as a subroutine and is actually not much harder. The algorithm consists of log n rounds which each take O(log n) time. The total work is bounded by O(n * log n). The algorithm proceeds as follows:
Later we will consider how to perform parallel sorting. It is rather simple however to perform it in O(log^2 n) time and O(n * log n) work on an EREW PRAM with n / log n PUs. All other steps of the algorithm except for determining the convex hulls of S_upper and S_lower can be performed in O(log n) time and O(n) work or less. Because the set of numbers has been sorted before, constructing S_left and S_right is trivial. The same is true for constructing the whole convex hull in the last step. For the time being we are happy with using the sequential algorithm for finding the upper common tangent. So, the time and work for the recursive part are given by:
T(n) = O(log n) + T(n / 2),The solution of these recurrencies is T(n) = O(log^2 n) and W(n) = O(n * log n). Because of this T, we have not put made an effort to push the time for the non-recursive part below O(log^2 n). The work of this parallel algorithm is optimal, but the time consumption can be reduced to O(log n).
W(n) = O(n) + 2 * W(n / 2).
Merging is the process of turning two sorted arrays a[] and b[] of length n each into one array c[] of length 2 * n containing all values of a[] and b[] in sorted order. The rank of an element x in a set X, denoted rank(x, X), is defined to be the number of elements y in X with y < x. Here it is convenient to assume that all values are different, otherwise the index of the position in which a value is stored may be taken as a secondary criterion. The rank of an element x in A union B is given by rank(x, A) + rank(x, B). rank(a[i], A) = i and rank(b[i], B) = i, for all 0 <= i < n. Thus, it suffices to determine rank(a[i], B) and rank(b[i], A), for all 0 <= i < n, because once all ranks in A union B are known, it takes constant time and linear work to write the numbers of a[] and b[] in c[].
A trivial and fast way to determine all ranks is to perform binary search: for each i, rank(a[i], B) and rank(b[i], A) can be determined in O(log n) time and O(log n) work. Thus, merging can be performed in O(log n) time and O(n * log n) work. This time is satisfactory, but the superlinear work should be reduced.
The obvious improvement is to not determine the rank for all numbers but only a fraction. The largest fraction that can be used so that the work is still bounded by O(n) is n / log n. So, take all numbers a[j * log n] and b[j * log n], for 0 < j < n / log n, and rank them in the other set. This takes O(log n) time and O(n) work. This initial ranking partitions the problem of merging a[] and b[] into small subproblems: there are 2 * n / log n - 2 values for which we know the positions in both a[] and b[]. These subdivide each of the arrays in subsets of size at most log n which have to be merged pairwise. For each of these subproblems a sequential algorithm requires O(log n) time and work linear in the sum of the sizes of the subsets. So, the partitioning allows to solve the whole problem in O(log n) time and O(n) work.
As a further example we consider the problem of inserting k new numbers b_i, 0 <= i < k, into a 2-3 tree with n nodes with keys, a_i, 0 <= i < n, in sorted order. It is assume that k < n, otherwise it is more efficient to build a new 2-3 tree from scratch. Sequentially k numbers can be inserted in O(k * log n) time, and in general this cannot be done much faster, because on the one hand Omega(k * log k) is needed because as a result of the insertions the numbers get sorted, and if log k = o(log n), then it may happen that after the highest log k levels of the tree all paths are independent and must be followed to find the positions where to insert the elements. These remaining paths have length Omega(log n - log k) = Omega(log n) each. So, in our parallel algorithm we are happy with O(k * log n) work, but we would like to reduce the time as far as possible, that is, to O(log n).
First the smallest and largest number, which can easily be determined with linear work in O(log k) time, are inserted in the normal sequential fashion. This is done to eliminate some special cases: after this operation, we know that any element to insert fits between two consecutive leafs of the tree. The insertion takes O(log n) time and work. Then for all k - 2 other b_i the leaf a_j of the tree is determined after which b_i has to be inserted. A chain is the ordered subset B_j of the b_i which fits between a_j and a_{j + 1}. Let k_j = |B_j|. sum_j k_j = k - 2. First consider the special case k_j <= 1, for all j. In this case, the nodes of the tree may get degree larger than 3, but never more than 6. So, processing the tree bottom-up processing all insertions at the same level in parallel, the whole tree can be processed in O(log n) time with O(k * log n) work. Here it is essential that any node gets split only once and that therefore the property that between any two existing nodes at most one new node has to be inserted also holds at the higher levels of the tree. We also use that any node can be processed in constant time.
The above we call the basic algorithm which will be used as a subroutine for the general case. If k_j > 1, we do not insert all elements in one stroke, but rather first insert only the middle element of the chain. At this point the lists should have been sorted. If we are lucky all lists are short, but there might be one chain with k - 2 elements. In that case we need an optimal sorting algorithm running in logarithmic time. Inserting the middle element of a chain splits the chain in two. So, after at most log k of these operations all chains have length 1 or 0 and we are done. Thus, the general problem can be solved easily in O(log k * log n) time.
The final improvement comes by observing that there is no neeed to wait with the next application of the basic algorithm until the previous one has been completed. These operations can also be overlayed in such a way that the next insertion starts at the lowest level as soon as the previous insertion has finished there. All these waves run up to the root in O(log n) time each. So, the whole algorithm has running time O(log k + log n) = O(log n). The work is unchanged by this overlaying.
Consider the problem of computing the maximum of n numbers stored in an array a[]. On a Common CRCW PRAM with n * (n - 1) PUs this problem can be solved in O(1) time. The PUs are indexed PU_{i, j}, 0 <= i < n, 0 <= j < n - 1. In addition to the integer array a[] there is a boolean array b[]. The algorithm is simple:
1. for (all i, 0 <= i < n, using PU_{i, 0}) do
b[i] = true;
2. for (all i, j, 0 <= i < n, 0 <= i < n - 1, using PU_{i, j}) do
if (a[i] < a[j] || (a[i] == a[j] && i < j))
b[i] = false;
3. for (all i, 0 <= i < n, using PU_{i, 0}) do
if (b[i])
max_index = i;
Clearly the algorithm can be performed in constant time on the
specified hardware. In step 1 each of the b[i] is written by a single
PU. In step 2 b[i] may be written by up to n - 1 PUs but the written
value is always false. By adding the clause a[i] == a[j] && i < j it
is assured that if the maximum value occurs several times only one of
them survives. In step 3 this single surviving index is written to
max_index.
The algorithm shows that the maximum can be computed very fast, but it is not optimal. However, we know optimal algorithms which can be combined with it. We summarize the algorithms:
Suppose we have 2 * n PUs. Then for each subset of 2 elements there are 4 PUs, so the above algorithm can be applied. Hereafter there are still n / 2 numbers. For each subset of 4 PUs there are 16 PUs, so again the above algorithm can be applied. More generally, we have
Lemma: After t >= 0 rounds there are still n_t = n / 2^{2^t - 1} numbers from among which the maximum has to be determined.
Proof: The proof goes by induction. For t = 0, we have n_0 = n / 2^0 = n. So, assume the lemma holds for some t. The remaining n_t numbers are divided in subsets of size 2^{2^t}. For each of these subsets there are 2 * n * 2^{2^t} / (n / 2^{2^t - 1}) = (2^{2^t})^2 PUs available. So, the algorithm can be run again, and the number of candidates is reduced to 2^{2^t - 1} / 2^{2^t} = 2^{2^{t + 1} - 1}. End.
So, with n PUs the maximum can be computed in O(loglog n) time. This is much more efficient than before and still very fast, but not yet optimal. If we take n / loglog n PUs, then first each PU can compute the maximum of 2 * loglog n numbers in O(loglog n) time. Then the fast-but-not-optimal algorithm is used to complete the task.
Theorem: On a Common CRCW PRAM with n / loglog n PUs the maximum of n numbers can be computed in O(loglog n) time.
The given algorithm is not such a spectacular application of accelerated cascading. It can be made more interesting by performing several rounds of the hypercube algorithm before changing to the CRCW algorithm. This saves some PRAM steps, because the reduction with the CRCW algorithm is not particularly efficient when the subsets have size 2 or 4, but it has no impact on the asymptotic running time.
We want to point out that the given algorithm is not obtained by a simple application of the WT-scheduling principle: the work of the algorithm with n PUs is n * loglog n because all PUs are active in all rounds. So, it is not because of the scheduling principle that fewer PUs were taken. The decision rather was inspired as follows: considering the loglog algorithm, we wondered for what n' the work of this algorithm is still bounded by O(n). n' = n / loglog n is the largest such value. Then we wondered whether there is another algorithm which allows to reduce the problem size from n to n' with O(n) work in O(loglog n) time. In this case this other algorithm was the sequential algorithm applied to small subsets.
As an example we consider the problem of coloring the nodes of a directed cycle of length n. A node coloring of a graph with c colors, is an assignment of numbers c_i 0 <= c_i < c, to the nodes of the graph so that for any two adjacent nodes i and j, c_i != c_j. Coloring a list sequentially is simple: we can start at any node, coloring the nodes with colors 0 and 1 until we reach the last node, which has to be colored with color 2 if the number of nodes is odd.
When considering how to solve the problem in parallel, there is the problem that it does not appear to make sense to start coloring everywhere. In this section we show how to efficiently construct a three-coloring in a deterministic way. Of course randomization allows to solve the problem more easily, but it is good to also know how such problems can be solved deterministically.
The main idea of the parallel algorithm is to start with a correct coloring with many colors and to gradually reduce the number of used colors. A trivial initial coloring is given by c_i = i for all i, 0 <= i < n. The algorithm proceeds in rounds. In any round, all nodes are recolored in parallel as follows: for any node i with color c_i and successor j, the new color is given by c'_i = 2 * dif(c_i, c_j) + bit(dif(c_i, c_j), c_i). Here dif(x, y) gives the index of the least-significant bit which is not the same in x and y and bit(b, x) gives bit b of x, where bit 0 is the least significant one. The computation is simple. dif(,) is the only operation which is not a standard hardware operation, but it might have been so. Much harder operations, such as multiplication, can be performed in one clock cycle, so it is reasonable to assume that even dif(,) can be performed in constant time.
We must check that the new coloring is correct. The proof of this goes by contradiction. So, assume that c'_i = c'_j. Let node k be the successor of j, then we must have dif(c_i, c_j) = dif(c_j, c_k) and bit(dif(c_i, c_j), c_i) = bit(dif(c_j, c_k), c_j). Substituting the first in the second gives bit(dif(c_i, c_j), c_i) = bit(dif(c_i, c_j), c_j), a contradiction with the definition of dif(,), which in particular should give a bit position for which the two numbers are different. Because the previous coloring was correct, such a bit must exist.
If before the recoloring the largest color is c, requiring round_down(log c) + 1 bits, numbered 0, ..., round_down(log c), then afterwards the largest color is at most 2 * round_down(log c) + 1. From this it follows that in O(log* n) rounds the number of colors is reduced from n to 5. To further reduce the number of colors, we apply
for (c = 5; c > 2; c--)
for each node i with color c_i == c, recolor i with the
smallest color c'_i which does not conflict with the color
of its predecessor or its successor.
Theorem: A three-coloring of a cycle can be constructed in O(log*^ n) time and O(log^* n * n) work.
A (purely theoretical) weakness of the above algorithm is its non-linear work. The algorithm can be made work-optimal at the expense of increasing the time consumption. The above algorithm is performed only once, reducing the number of colors to O(log n). For the special case that we have a set of n numbers with maximum value bounded by O(log n), the sorting problem can be solved in O(log n) time with O(n) work. Sorting means in this case creating buckets containing the indices of all nodes with a given color. After this sorting, the colors can be handled one-by-one as was done before to reduce the number of colors from 5 to 3. Because there are only O(log n) colors and each node has to be processed only once due to the sorting, this recoloring takes O(log n) time and O(n) work. If the super-fast algorithm is applied twice, the number of colors is reduced to O(loglog n). This has a positive impact on the cost of the final recoloring, but the sorting still takes O(log n) time and therefore this does not lead to a smaller asymptotic running time of the entire algorithm.
for (t = 0; t < n / 2; t++)
{
for (i = 0; i < n / 2; i++) // Even
compare_swap(2 * i, 2 * i + 1);
for (i = 1; i < n / 2; i++) // Odd
compare_swap(2 * i - 1, 2 * i);
}
The correctness of the algorithm can be proven by induction. The idea underlying the argument is that the largest number, once it starts moving never gets delayed. It reaches its destination after at most n - 1 steps. The second largest number may be swapped with the largest number if the first step, and then it has to wait one step. Hereafter it will never be delayed anymore, so it arrives no later then after n steps. The operations in each of the inner loops can be executed in parallel. On an interconnection network, the simplest is to implement a compare-swap operation by exchanging the numbers to be compared and to discard one of them in each of the two involved PUs. So, on a linear array, the algorithm can be implemented with n steps in which each PU sends at most one number to a neighbor. Odd-even transposition sort is simple but inefficient: as a sequential algorithm it has complexity Theta(n^2) for sorting n numbers.
T(P, n) = c_i * O(n / P * log n + n) + c_v * n + c_a * P.For small P and very large n, the term c_i * O(n / P * log n) may dominate. In that case the algorithm gives good performance, but this applies only for n which are exponential in P. In the following we will see that good performance can be achieved for much smaller values of n as a function of P.
In the DMM model, the structure of the interconnection network is ignored. On interconnection networks, there is a structure, and the indexing scheme has impact on the efficiency with which sorting can be performed. Generally speaking, it is convenient if PUs for which the indices do not differ much also lie close to each other in the network. Nevertheless, sorting with respect to any indexing scheme which is fixed in advance is at least as hard as performing a k-k routing: if the hardest permutation pi routes a packet from PU_i to PU_j, then when sorting the value of the corresponding number in PU_i is set to j and this number will have to be routed from PU_i to PU_j in order to perform the sorting. The indexing scheme may have importance, but it should be noticed that the difference between the best and the worst scheme is limited: first sorting according to the best indexing scheme and then routing the numbers so that they come to stand according to the desired scheme costs at most twice as much as just sorting.
The simplest algorithm for sorting on a mesh is obtained by embedding a linear array into it. This algorithm sorts n^2 numbers on an n x n mesh in n^2 steps. This poor performance is due to the fact that the two-dimensional structure of the mesh is ignored. Ideally, if a problem can be solved in O(n) time on a linear array with n PUs it can also be solved on an n x n mesh in O(n) time. A natural way to obtain mesh algorithms is by combining operations on rows and columns. For sorting, it appears natural to consider algorithms which consist of an alternation of one-dimensional sorting phases, using odd-even transposition sort.
The shearsort algorithm, consists of an alternation of operations in which all columns are sorted in increasing order from top to bottom, and operations in which the even rows, that is the rows with index 0, 2, ..., n - 2, are sorted from left to right while the odd rows, with index 1, 3, ..., n - 1, are sorted from left to right. Only in the final sorting of the rows, the sorting direction can be chosen freely. Performing this operation as all others, the numbers are sorted in snake-like row-major order. When sorting all rows increasingly from left to right we get row-major order. The first alternative gives the following extremely simple algorithm:
for (t = 0; t <= t_max(n); t++)
{
sort all columns from top to bottom;
sort even rows from left to right and odd rows from right to left;
}
If all PUs hold one number, each pass of the loop takes 2 * n steps, so
the whole algorithm runs in 2 * n * (t_max(n) + 1) steps. The question
is how large t_max(n) should be chosen to guarantee that any
distribution gets sorted. The analysis is based on two lemmas.
0-1 Sorting Lemma: If a comparison-based sorting algorithm correctly sorts any distribution consisting of 0-s and 1-s, it correctly sorts any distribution with arbitrary numbers.
Proof: The proof goes by contradiction. Assume there is a set of numbers X = {x_0, ..., x_{n - 1}} for which the algorithm is not correct. Let pi'() be the inverse of the permutation realized by the algorithm. That is, afterwards the numbers have been rearranged to x_{pi'(0)}, ..., x_{pi'(n - 1)}. Let pi() be the inverse of the permutation giving a correct sorting of the numbers. That is, pi() is so that x_{pi(0)} <= x_{pi(1)} <= ... <= x_{pi(n - 1)}. Let k be the smallest index so that x_{pi'(k)} != x_{pi(k)}. All numbers with values smaller than x_{pi(k)} belong to the set {x_{pi(0)}, ..., x_{pi(k - 1)}} = {x_{pi'(0)}, ..., x_{pi'(k - 1)}}, so we must have x_{pi'(k)} > x_{pi(k)}. Because pi() is a permutation, this implies that there also must be an index l > k, so that x_{pi'(l)} = x_{pi(k)}. Now define the set of 0-1 values X' = {x'_0, ..., x_{n - 1}} by x'_i = 0 if x_i <= x_{pi(k)}, x'_i = 1 if x_i > x_{pi(k)}. Because for all 0 <= i, j < n, x_i < x_j => x'_i < x_j, the algorithm works on X' as it did on X (except that possibly some elements which now have the same value do not get exchanged). So, on X', the algorithm will produce the following sequence of values as output: ..., x'_{pi'(k)}, ..., x'_{pi'(l)}, ... = ..., 1, ..., 0, ... . That is, at position k there occurs a 1 while at position l > k there occurs a 0: the sorting is not correct, a contradiction with the assumption that it correctly sorts all 0-1 distributions. End.
A row is called clean if it only contains 0-s or only 1-s. Otherwise a row is called dirty. For the analysis it is convenient to rewrite the algorithm as follows:
sort all columns from top to bottom;
for (t = 0; t < t_max(n); t++)
{
sort even rows from left to right and odd rows from right to left;
sort all columns from top to bottom;
}
sort even rows from left to right and odd rows from right to left;
Hereafter we will refer to this variant.
Lemma: For any 0-1 number distribution on an n x n mesh, after t, 0 <= t <= log n, passes through the loop in shearsort there are still at most n / 2^t dirty rows.
Proof: The purpose of the first sorting of the columns is that it establishes the property that for all i, 0 <= i < n - 1, the number of 0-s in row is at least as large as in row i + 1. This is true because if row i + 1 would hold more 0-s than row i, there would be a position in row i with a 1 where row i + 1 has a 0, contradicting the sorting of the columns.
For t >= 0 we use induction. For t = 0 the claim holds. So, assume that the claim holds for some t >= 0. Sorting the clean rows does not change anything. So, we may focus on the n_t dirty rows. Let z_i, 0 <= i < n_t, denote the number of 0-s in the i-th of these rows. z_i is monotonously decreasing. Let m_0 = |{0 <= i < n_t| z_i >= n / 2}| and let m_1 = n_t - m_0. We claim that after pass t + 1, at least round_down(m_0 / 2) of these rows only contain 0-s, and at least round_down(m_1 / 2) only 1-s. This is a direct consequence of the the monotonicity of z_i and the fact that even rows are sorted differently from odd rows. Thus, any two consecutive rows with z_i >= n / 2 lead to one row with only 0-s. There are several cases to distinguish. If n_t is odd, then either m_0 is even or m_1 is even and thus round_down(m_0 / 2) + round_down(m_1 / 2) = n_t / 2. If n_t is even, then if m_0 and m_1 are even, round_down(m_0 / 2) + round_down(m_1 / 2) = (m_0 + m_1) / 2 = n_t / 2. Only the case that both m_0 and m_1 are odd deserves special consideration. So consider row m_0, which is sorted from right to left and row m_0 + 1 which is sorted from left to right. If z_{m_0} + z_{m_0 + 1} >= n, then this results in one more row with only 0-s, otherwise in one more row with only 1-s. So, in this case at least round_down(m_0 / 2) + round_down(m_1 / 2) + 1 = n_t / 2 new clean rows are generated. End.
Theorem: Shearsort sorts any distribution of n^2 numbers on an n x n mesh in (2 * log n + 2) * n steps.
The time consumption can be improved by almost a factor two, by realizing that the sorting in the columns does not need to be performed for n steps in the later passes of the loop: if there are only n_t rows which might contain 0-s and 1-s, it suffices to perform n_t steps. Thus, all sorting in the columns can be reduced to n + n + n / 2 + ... + 2 = 3 * n - 2. So, the total sorting time can be reduced to (log n + 4) * n - 2 steps. Even this is not such a good result, but it might serve as basis case of a recursive algorithm. The question is whether it is better than the result obtained by trivially embedding a linear array. The times that should be compared are given in the following table, the results are not very impressive:
| n | 2 | 4 | 8 | 16 |
| Transposition Sort | 4 | 16 | 64 | 256 |
| Shearsort | 8 | 22 | 54 | 126 |
The algorithm is very simple. Initially x may lie in the whole interval ranging from 0 to n. This interval is reduced in several rounds. In each round the remaining interval is divided in P intervals I_i, 0 <= i < P, of approximately equal size. PU_i, 0 <= i < P, is in charge of I_i. Each of the PUs tests whether x lies in its interval. Assuming that the values are different (otherwise the smallest should be taken) there is a unique interval in which x lies. The PU in charge of this interval writes its index to a global variable c. All PUs read this variable and can start the next round. On an CREW PRAM, each round can be performed in O(1) time. So, the whole searching algorithm runs in O(log_P n).
Theorem: On a CREW PRAM with P PUs, an element x can be ranked in a sorted set of size n in O(log_P n) time.
An important special case is that P = n^eps for some eps > 0. In that case log_P n = 1 / eps = O(1). So, with sufficiently many PUs an element can be ranked in constant time.For two sorted arrays, a[] of length m and b[] of length n, let rank(a, b) denote the array (r_0, ..., r_{m - 1}) of length m with r_i = rank(a[i], b), the rank of a[i] in b[]. Assume that we have a CREW PRAM with n PUs, then if m = O(n^{1 - eps}), for some eps > 0, rank(a, b) can be determined in constant time, because for each of the m elements of a[] we can use n^eps PUs to perform the ranking.
Again we apply the partitioning strategy. For merging two sorted arrays a[] and b[] each of length n with n PUs, we select the subarrays a'[] and b'[] of length sqrt(n) with a'[i] = a[sqrt(n) * i] and b'[i] = b[sqrt(n) * i] (it is convenient to add one more element to each subarray with value infinity). With the available PUs rank(a', b) and rank(b', a) can be computed in constant time. The resulting partitioning reduces the merging problem to merging subarrays of a[] and b[] of length at most sqrt(n). Before we reduced the problem in one stroke to subproblems which could be solved sequentially, now we continue recursively until the size has become constant. Each phase of the recursion takes O(1), so the whole merging takes O(loglog n) time.
Lemma: On a CREW PRAM two sorted arrays of length n each can be merged in O(loglog n) time with linear work.
Proof: Above we have shown how to merge two sorted arrays a[] and b[] of length n on a CREW PRAM with n PUs in O(loglog n) time. The work of this algorithm is O(n * loglog n). This algorithm can be made work-optimal by some kind of partitioning. So, we first determine subarrays a'[] and b'[] each of length n / loglog n by taking a'[i] = a[i * loglog n] and b'[i] = b[i * loglog n]. a'[] and b'[] are merged with n / loglog n PUs in O(loglog n) time. Just as in all previous partitionings, as a result it remains to merge sections of a[] and b[] each of length at most loglog n. With one PU for each subproblem, each of these can be solved with the sequential algorithm in O(loglog n) time. End.
Using the above fast and efficient merging as a subroutine in a merge sort algorithm gives
Theorem: On a CREW PRAM with n / loglog n PUs n numbers can be sorted in O(log n * loglog n) time.
O(log n * loglog n) is not the best possible result: sorting can be solved in O(log n) time and O(n * log n) work on an EREW PRAM. This optimal algorithm, which is known in the literature as Cole's merge-sort algorithm, is considerably more complicated and practically a gap of O(loglog n) is hardly significant. A more serious drawback of the presented algorithm is that it is based on the parallel-search algorithm which really needs the CREW model. The parallel quick-sort algorithm can also be refined to run in O(log n) time with high probability.
void bitonicMergeUp(int[] a, int n) {
for (int s = n >> 1; s > 0; s = s >> 1)
for (int i = 0; i < n; i += 2 * s)
for (int j = i; j < i + s; j++)
compareSwapUp(a, j, j + s); }
Here compareSwapUp(a, i, j), compares the values of a[i] and a[j]
and exchanges them if a[i] > a[j]. The procedure bitonicMergeDn() is
analogous. These procedures integrated in a running program can be
downloaded here.
The correctness is not proven here. The invariant that must be proven is that after round r, k >= r >= 0, all numbers in a subset of size 2^r with bit r equal to 0 are smaller than all numbers in the corresponding subset with bit r equal to 1. Because later comparisons are performed only within the subsets, the earlier established properties are preserved. Because all properties together imply a sorted sequence, proving the invariant implies the correctness.
T_merge(2^k) = k / 2 * 2^k, for all k >= 1.
For T_sort(2^k), the time to sort 2^k numbers based on this merge operation, we get
T_sort(2^1) = 1,Using induction it is easy to verify, that the solution is given by
T_sort(2^k) = 2 * T_sort(2^{k - 1}) + T_merge(2^k), for all k > 1.
T_sort(2^k) = k * (k + 1) / 4 * 2^k, for all k >= 1.Another important point is the depth of the schedule. For bitonic merge sort, this is simply sum_{i = 1}^k i = k * (k + 1) / 2 = T_sort(2^k) / (2^k / 2).
Theorem: Using bitonic merge sort for sorting n numbers, the work is Theta(n * log^2 n). The time is Theta(log^2 n).
Bitonic merge sort is very suitable for implementation on a hypercube: in all steps number a[i] is compared with numbers a[j], where j = i +- 2^r, for some r, 0 <= r < k. This means that, when mapping the numbers of a[] in a straightforward way to the 2^k PUs of a k-dimensional hypercube, all communication takes place between neighbors, and that thus all steps can be performed in constant time.
Theorem: Using bitonic merge sort, 2^k numbers can be sorted in Theta(k^2) time on a k-dimensional hypercube.
Asymptotically this is not the best achievable, but the algorithm is extremely simple and, as seen above, the constants hidden in the Theta-notation are very small.Surprisingly bitonic merge sort is asymptotically optimal on linear arays and meshes. The analysis is most easy for linear arrays. Merging two subsequences standing each in m consecutive PUs takes m + m / 2 + + ... + 1 = 2 * m - 1 steps. So, for a linear array with n = 2^k PUs all merging rounds together take sum_{0 <= s < k} (2 * 2^s - 1) = 2 * sum_{0 <= s < k} 2^s - k = 2 * n - log n - 2 < 2 * n steps. This is not bad at all! Of course, for linear arrays odd-even transposition sort performs fewer routing steps, but at the expense of far more comparisons.
On a two-dimensional n x n mesh n^2 numbers can be sorted in 2 * n + o(n) steps, which is optimal, but the algorithm is complicated and the lower-order term makes this algorithm unsuitable for moderate values of n. How about bitonic merge sorting? It is essential that the PUs are indexed with the shuffled row-major indexing. In that case the distance between PU i and j with j = i +- 2^r, for some r, 0 <= r < k, is exactly 2^{round_down{r / 2}}. Because such PU i and j lie in the same row or column, the routing can be performed without delay. Thus, merging two subsequences of length 2^s each takes 1 + 1 + 2 + 2 + ... + 2^round_down{(s - 1) / 2} + 2^round_down{s / 2} < 4 * 2^{s / 2}. For the number of steps T(n) for the whole sorting this gives T(n) < 4 * sum_{0 <= s < k} 2^{s / 2} < 8 * sum_{0 <= s < k / 2} 2^s < 8 * n. It is worthwhile to compute T(n) exactly. T(2) = 1 + (1 + 1) = 3. T(4) = T(2) + (2 + 1 + 1) + (2 + 2 + 1 + 1) = 13. T(8) = T(4) + (4 + 2 + 2 + 1 + 1) + (4 + 4 + 2 + 2 + 1 + 1) = 37. More generally, T(n) = T(n / 2) + 3 1/2 * n - 4. The solution is T(n) = 7 * n - 4 * log n - 7. As noticed above, asymptotically this is not competitive, but for small n it might be the best achievable. It is better than transposition sort and shearsort for all n.
oddEvenMerge(int a[], int m) {
if (m > 2) {
int[] ae = new int[m / 2];
int[] ao = new int[m / 2];
for (int i = 0; i < m / 2; i++) {
ae[i] = a[2 * i];
ao[i] = a[2 * i + 1]; }
oddEvenMerge(ae, m / 2);
oddEvenMerge(ao, m / 2);
a[0] = ae[0];
for (int i = 1; i < m / 2; i++)
if (ae[i] <= ao[i - 1]) {
a[2 * i - 1] = ae[i];
a[2 * i] = ao[i - 1]; }
else {
a[2 * i - 1] = ao[i - 1];
a[2 * i] = ae[i]; }
a[m - 1] = ao[m / 2 - 1]; }
else // m == 2
if (a[0] > a[1]) {
int x = a[0]; a[0] = a[1]; a[1] = x; } }
This procedure integrated in a running program can be downloaded here. This program can be made
considerably more efficient, by eliminating the copying operations.
We consider the same example as before:
a[] initially: 10 12 14 16 20 25 32 34 8 13 24 26 28 36 38 40 ae[] initially: 10 14 20 32 8 24 28 38 ao[] initially: 12 16 25 34 13 26 36 40 ae[] merged: 8 10 14 20 24 28 32 38 ao[] merged: 12 13 16 25 26 34 36 40 a[] finally: 8 10 12 13 14 16 20 24 25 26 28 32 34 36 38 40
The correctness is not proven here, but using the 0-1 lemma this is not hard. The key observation is that the estimate of the position of a number in the global array can be predicted with an accuracy of 1 from the position in ae[] and ao[].
T_merge(2^1) = 1,The solution is
T_merge(2^k) = 2 * T_merge(2^{k - 1}) + 2^{k - 1} - 1, for all k > 1.
T_merge(2^k) = (k - 1) / 2 * 2^k + 1, for all k >= 1.The expression for T_sort(2^k) is the same as before. The solution is
T_sort(2^k) = ((k - 1) * k / 4 + 1) * 2^k - 1, for all k >= 1.
The following table gives the number of comparisons performed by the bitonic and the odd-even merge sort algorithms. First the difference between the algorithms increases, then it gradually decreases again. Due to their O(log^2 n * n) complexity, neither of the algorithms should be used for large values of n. On the other hand, for small n, odd-even merge sort is close to optimal.
| n | 1 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
| bitonic | 1 | 6 | 24 | 80 | 240 | 672 | 1792 | 4608 |
| odd-even | 1 | 5 | 19 | 63 | 191 | 543 | 1471 | 3839 |
| ratio | 1 | 1.20 | 1.26 | 1.27 | 1.26 | 1.24 | 1.22 | 1.20 |
We analyze the algorithm for a DMM. The internal time consumption is dominated by the time for the several sorting and bucketing operations. It is given by c_i * O(x * log x + P * x * log P + k * log n). If we take care to select x so that P * x < k, then the time is dominated by the time for sorting k numbers. There are only two communication operations. Good performance can only be achieved if the all-to-all routing is performed in a direct way. Thus, the number of start-ups is log P + P - 1. The routing volume is (1 - 1 / P) * (P * x + k). If P * x < k, then this is dominated by the second term. So, under condition P * x = o(n / P), we get
T(P, n) = (1 + o(1)) * (T_sort(1, n / P) + c_v * n / P + c_a * P).This is a great result, because it means that parallel sorting can be parallelized perfectly except for routing all numbers once through the network. This is the best one reasonably can hope for. The achievable speed-up mainly depends on the ratio of c_i and c_v, but of course, for very large n, the log-factor in the sorting time also plays a certain role. The speed-up is approximately given by c'_i * n * log n / (c'_i * n / P * log n + c_v * n / P) = P / (1 + c_v / (c'_i * log n)). Here c'_i is the constant so that T_sort(1, n) ~= c'_i * n * log n, which depends both on c_i and the applied sequential sorting algorithm.
In the above analysis we silently assumed that the all-to-all routing is balanced and that finally all PUs receive approximately the same number of packets. This is not automatically true. There are two limitations:
The second problem is more specific and requires an analysis. We compute an upper bound on the number of numbers that finally may end up in a single PU. So, the question is how many numbers may lie between any two consecutive splitters. Consider two splitters s = s_{i - 1} s' = s_i. Let R_j and R'_j denote the rank of s and s', respectively, in the set of numbers initially stored in PU_j. Analogously, let r_j and r'_j denote the rank of s and s', respectively, in the set of presplitters selected by PU_j. By definition of the splitters, sum_{j = 0}^{P - 1} (r'_j - r_j) = x. The number S_j of numbers routed to PU_j is given by S_j = sum_{j = 0}^{P - 1} (R'_j - R_j). Using that in each PU exactly k / x numbers lie between any two presplitters, we get the following estimates:
S_j <= k / x * sum_{j = 0}^{P - 1} (r'_j - r_j + 1) = n / P + n / x.
S_j >= k / x * sum_{j = 0}^{P - 1} (r'_j - r_j - 1) = n / P - n / x.
So, there is an "unbalance" of O(n / x). Thus, in order to obtain the above claimed results we must take x = omega(P). Because at the same time we had the condition P * x = o(n / P), this gives the condition P^3 = o(n). This is quite a tough condition which is nevertheless satisfied for most practical values of P and n. By refining the procedure by which the splitters are selected from the presplitters, performing iterated subsampling the condition can be relaxed almost to P^2 = o(n), the typical condition which must be satisfied to obtain good performance on a DMM.
For optimal performance, x should be chosen so that the sum of all additional costs together is minimized. On the one hand there are the costs for handling the presplitters and splitters, on the other hand the costs due to an unbalanced routing and sorting. The first contribution increases with x in the same way the second decreases with x. In such cases an approximation of the optimal choice can be found by equating both cost contributions and solving x. This gives
(c_v + c_i * log P) * P * x = (c_v + c_i * log P) * n / x.This suggests to choose x = sqrt(n / P). For this particular choice, the total "loss" term is bounded by O((c_v + c_i * log P) * sqrt(n * P)), to be compared with the "useful" term Theta((c_v + c_i * log P) * n / P).
The given algorithm does not guarantee that finally each PU holds the same number of numbers. Furthermore, even though the numbers stand sorted, their ranks are not known. This can easily be overcome: counting the number of numbers in each PU and computing the prefix sum of these numbers, the ranks of the numbers can be determined cheaply. From this it follows where they have to be stored. In the normal case that n / x < n / P, this can be achieved by a routing in which a small fraction of the numbers in PU_i has to be routed to PU_{i - 1} and/or PU_{i + 1}. Of course this correction can be performed along with the final sorting operations.
T(P, n) = (1 + o(1)) * (T_sort(1, n / P) + c_v * 2 * n / P + c_a * 2 * P).The routing time is doubled due to the randomization, but this assures that we do not have to consider the possibility of an unbalanced routing pattern, these are resolved automatically. On the other hand, the internal time consumption is essentially the same. Possibly it is even slightly better because no time is waisted on handling the splitters.
We now analyze the minimal value of n as a function of P to assure that the above result holds with high probability. The following points must be satisfied:
There are several ways to perform the final correction. Consider PU_i and PU_{i + 1}. Both have their numbers sorted, but possibly the values of the largest numbers in PU_i are larger than those of the smallest numbers in PU_{i + 1}. The above estimate has shown that we may assume that the number of these is at most x = c * P * sqrt(k * log n). Here c is a constant which can be determined by working out the details of the estimate. The simplest solution is to let both PUs exchange the x concerned numbers. Then each of them can merge the two subarrays of length x. PU_i keeps the smallest x and PU_{i + 1} the largest x. First exchanging samples, there is no need to make copy and the data exchange can be minimized.
For k = 1, a good choice is m = n^{2/3} * log^{1/6} n. Doing that, n' = n / m = n^{1/3} / log^{1/6} n. So, there are P' = n'^2 = n^{2/3} / log^{1/3} n submeshes. k' = 1 * m^2 = n^{4/3} * log^{1/3} n. Any computation within the submeshes, corresponding to steps 1, 3, 4 and 6 of the algorithm above, takes O(m) = O(n^{2/3} * log^{1/6} n) time. The choice of m is made so that P' * (k' * log n)^{1/2} = n^{2/3} / log^{1/3} n * (n^{4/3} * log^{4/3} n)^{1/2} = n^{4/3} * log^{1/3} n = k'. So, in step of the algorithm the rearrangement takes place within subsets of constantly many submeshes. This takes O(m) time as well. Step 2 and 5 amount to routing a randomization and the inverse thereof. It can easily be shown that congestion is very limited. Each routing phase can be performed in n + O(sqrt(n * log n)) time. So, in total the sorting takes 4 * n + O(n^{2/3} * log^{1/6} n) time. This is twice as much as given by the distance lower bound.
For larger k optimality can be established. Now it is better to choose m smaller in dependency of k: the best choice is m = n^{2/3} * log^{1/6} n / k^{1/6}. This means that for very large k we will take m = 1, which sounds natural. The operations in the submeshes take O(k * m) = O(k^{5/6} * n^{2/3} * log^{1/6} n) time. Again the choice is made so that P' * (k' * log n)^{1/2} = n^{2/3} * k^{1/3} / log^{1/3} n * (n^{4/3} * k^{2/3} * log^{4/3} n)^{1/2} = n^{4/3} * k^{2/3} * log^{1/3} n = k'. In this case in step 2 and 5 half of the packets are routed XY and the other half YX. This makes that in each of the routing phases only k * n / 8 packets have to cross the bisection. Thus, together these steps take k * n / 2 + o(k * n) time.
Theorem: On a two-dimensional n x n mesh, k-k sorting can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.
Based on shuffles and unshuffles the following is a simple, versatile and efficient sorting algorithm which is first formulated for a DMM with P PUs and k = n / P numbers per PU:
The analysis of the algorithm goes analogously to those above. For a number x which after the sorting in step 3 has rank r in PU_i, the PU in which it currently resides, the global rank is estimated as R' = P * r. Let R denote the actual global rank of x. It is easy to put upper bounds on |R' - R|. For this we need some additional notation. Let r_j be the rank of x in PU_i among the numbers originating from PU_j. Among the numbers which originally resided in PU_j, x falls between the number with rank j + (r_j - 1) * P and the one with rank j + r_j * P. Summing over all PUs gives
P * (P - 1) / 2 + P * sum_{j = 0}^{P - 1} (r_j - 1) < R < P * (P - 1) / 2 + P * sum_{j = 0}^{P - 1} r_jSo, after step 5 there may reside at most about P^2 / 2 numbers which belong in PU_i in both PU_{i - 1} and PU_{i + 1}. For the correctness this gives the condition that k > P^2 / 2. In order that the cost of the final correction is negligible, the condition becomes k = omega(P^2). Substituting k = n / P gives the condition n = omega(P^3), the same we have encountered before.
The mesh is not square but has height n^2 and width n. Each PU holds one number. The numbers will be sorted in column-major order. The algorithm consists of a few simple one-dimensional operations:
Lemma: Starting with a zero-one distribution, there are at most n dirty rows after step 3.
Proof: Let z_j, 0 <= j < n, be the number of zeroes in column j. The shuffling is so that any column either receives round_down{z_j / n} or round_up{z_j / n} of these zeroes. So, a column receives at least l = sum_j round_down{z_i / n} and at most h = sum_j round_up{z_i / n} zeroes. So, after step 3, in row 0, ..., l - 1 there are only zeroes, while in row h, ..., n^2 - 1 there are only ones. Only in row l, ..., h - 1 there may be both zeroes and ones. Because h - l <= n, the claimed result follows. End.
Lemma: Starting with a zero-one distribution, there are at most 2 dirty columns after step 5.
Proof: Step 4 and 5 together are so that the contents of the columns is permuted among the rows so that the smallest n numbers in each column go to column 0, and more generally that all numbers standing in row j * n, ..., (j + 1) * n - 1 are ending up in column j. Let l and h be as in the proof of the previous lemma. There are two cases to distinguish. There is a j so that j * n <= l and h < (j + 1) * n. In that case only column j receives zeroes and ones. So, column j will be the only dirty column. Otherwise, because h - l <= n, there is a j so that j * n <= l and h < (j + 2) * n. In that case column 0, ..., j - 1 receives only zeroes, column j + 2, ..., n - 1 receives only ones, and column j and j + 1 may receive both zeroes and ones. End.
Theorem: The given column-sort algorithm correctly sorts any distribution of n^3 numbers on an n^2 x n mesh.
Proof: Given the situation as described in the proof of the above lemma, with zeroes and ones only in column j and j + 1 for some j, 0 <= j < n - 1, sorting the columns up and down alternatingly followed by sorting pairs of adjacent row positions reduces the number of dirty columns to one. This is a special case of the argument for shearsort that the number of dirty columns is reduced by a factor two in each round. The final sorting rearranges the numbers in this dirty column into sorted order. End.
Column-sort can also be formulated processor independent. In this generalized form it becomes immediately clear how to carry it over to interconnection networks. The numbers are divided over n^{1/3} buckets of size n^{2/3}. The role of the PUs is taken over by the buckets.
Sorting n numbers can be performed by a combination of a constant number of operations in which n^{1/3} buckets with n^{2/3} numbers each are sorted and some regular rearrangements of the elements between these buckets.
Theorem: On a 2-dimensional n x n mesh with k >= 8 numbers per PU, sorting can be performed in k * n / 2 + o(n) steps.
Proof: The buckets can be chosen to consist of the k^{2/3} * n^{4/3} numbers residing in each n' x n' submesh for n' = n^{2/3} / k^{1/6}. Sorting these submeshes can be performed with any sorting algorithm which achieves the optimal time order in O(k * n') = O(k^{5/6} * n^{2/3}). The final correction also has this time order. The shuffle and unshuffle send only half of the packets over the bisection and when exploiting the all-port facility they can be performed in k * n / 4 steps each. Thus, the whole sorting can be performed in k * n / 2 + O(k^{5/6} * n^{2/3}) steps. End.
The number of steps is the same as for routing a k-k distribution except for the lower-order term. Thus, we have obtained an optimal sorting algorithm because sorting is at least as hard as routing and k * n / 2 is a lower bound for k-k routing. Comparing with the presented randomized algorithm we see that the performance is almost the same. The lower-order term of the randomized algorithm is asymptotically marginally larger: distributing information regularly is better than randomizing.The n! permutations can be enumerated systematically, and permutation p, 0 <= p < n!, is denoted pi_p. The PUs are indexed PU_{p, i}, 0 <= p < n!, 0 <= i < n. The algorithm is simple:
1. for (all p, 0 <= p < n!, using PU_{p, 0}) do
b[p] = true;
2. for (all p, i, 0 <= p < n!, 0 <= i < n - 1, using PU_{p, i}) do
if ( a[p[i]] > a[p[i + 1]] ||
(a[p[i]] == a[p[i + 1]] && p[i] > p[i + 1]))
b[i] = false;
3. for (all p, i, 0 <= p < n!, 0 <= i < n, using PU_{p, i}) do
if (b[p]) {
c[p[i]] = a[i];
a[i] = c[i]; }
In the last step it is essentially used that a PRAM is synchronous. On
a non-synchronous machine the two instructions inside the if-clause
should be separated by a synchronization to assure that the instruction
c[p[i]] = a[i] has been executed before the instruction a[i] = c[i] is
executed.
Using O(n^3) hardware to sort n numbers is not particularly efficient. By how much can this be reduced while still bounding the sorting time to O(1)? We will see that O(n^2) hardware is enough. This is optimal because performing a permutation in O(1) requires a bisection bandwidth of n, while the grid also has a basis of length n.
The main conclusion from column-sort is that sorting n numbers can be performed by several times sorting O(n^{2/3}) numbers and several rearrangements. Of course each of these sortings themselves can also be performed using column-sort. So, the idea for sorting n numbers on a reconfigurable n x n mesh is to recursively apply column sort. The recursion is terminated as soon as the size n' of the subproblem to solve satisfies n'^2 < n, because then it can be solved in constant time using the presented algorithm for sorting on reconfigurable arrays. Let n_i, i >= 0, denote the size of the subproblems at level i of the recursion. n_0 = n, n_1 = n^{2/3} and n_2 = n^{4/9}. So, the recursion tree has constant depth. Because the number of subproblems increases only by a constant factor between two levels of the recursion, there are in total only constantly many subproblems of size n^{4/9} to solve. We conclude
Theorem: On a reconfigurable processor array with n^2 PUs, n numbers can be sorted in O(1) time.
We consider a set of lists with n nodes in total. This is internally represented by an array succ[] of length n. For a node u, succ[u] gives the successor of u. So, the lists are directed towards their final nodes. If u is the last node of its list, we have succ[u] = u. The task is to compute the values of the arrays last[] and dist[]. For a node u, last[u] indicates the last node of the list to which u belongs, and dist[u] gives the number of links that must be traversed in order to reach last[u] starting in u.
Sequentially list ranking can be solved by first figuring out the initial elements of the lists and then simply traversing the list while numbering the nodes.
void sequential(int succ[], int[] frst, int dist[], int n)
{
boolean[] first = new boolean[n];
for(int i = 0; i < n; i++)
first[i] = true;
for(int i = 0; i < n; i++)
first[succ[i]] = false;
for(int i = 0; i < n; i++)
if (first[i]) // i is first node
{
int j = last[i] = i;
int d = dist[i] = 0;
while (succ[j] != j)
{
j = succ[j];
d++;
last[j] = i;
dist[j] = d;
}
}
}
The above algorithm performs 4 * n unstructured accesses to arrays of
length n. For large n, this causes close to 4 * n cache faults. As a
result this algorithm is not particularly efficient. Even a fast
computers requires around 1 second for ranking a list with 10^6
elements. Some simple improvements reduce the number of unstructured
array accesses to 2 * n, almost halving the execution time.
There is no obvious parallel algorithm. Parallel list ranking is much harder than parallel sorting: when sorting any two elements can be compared and this adds some information, but while list ranking, the successor of a node is typically stored in another processor and just going there is expensive. Theoretically the parallel list-ranking problem has been solved: for ranking a list of length n, there are PRAM algorithms using n / log n PUs running in O(log n) time. But, practically it is hard to achieve reasonable speed-ups. Even for very large problems and a very good interconnection network it is not easy to achieve a speed-up of more than 30 on a parallel computer with 100 PUs. For sorting it would have been easy to achieve a speed-up of 80 or 90. This is mainly due to the hard, very non-local nature of list ranking, but also to the fact that its sequential complexity is only linear, while sequential sorting has running time O(n * log n).
The ideas in this section go back on many authors. Just to name a few: Wylie, Anderson, Cole, Miller, Vishkin, Reid-Miller, Ranade and Sibeyn.
void pointerJumping(int[] succ, int[] last, int[] dist, int n)
{
// Initialization
for (int i = 0; i < n; i++)
if (succ[i] == i)
{
last[i] = succ[i] - n;
dist[i] = 0;
}
else
{
last[i] = succ[i];
dist[i] = 1;
}
// Performing rounds until all reached final node
boolean flag = true;
while (flag)
{
flag = false;
for (int i = 0; i < n; i++)
if (last[i] >= 0) // i has not yet reached last node
{
flag = true;
dist[i] += dist[last[i]];
last[i] = last[last[i]];
}
}
// Correction of last[] values
for (int i = 0; i < n; i++)
last[i] += n;
}
Operations of the kind last[i] = last[last[i]], for all i, are called pointer-jumping steps. In a parallel algorithm, these are performed synchronizedly. So, as long as last[u] does not yet point to the last node of a list, the value dist[u] doubles in every pass. Therefore the while loop will be executed at most log n times. The work in each pass is O(n), so the total work is O(n * log n) and the time consumption is O(log n). Pointer jumping is simple and fast, but inefficient due to its super-linear work.
On an interconnection network, each pass requires two all-to-all routing operations, in which each question is resulting in two answers. So, in the DMM model this algorithm costs
T(P, n) = c_i * O(n / P * log n) + c_v * 3 * log n * n / P + c_a * 2 * log n * P.Because of the relatively small number of all-to-all communications, this algorithm is quite reasonable for relatively small problems. For n >> c_a / c_v * P^2, the start-up costs are negligible. But, of course, the factor log n in the work and routing volume excludes really good performance.
The independent-set removal algorithm consists of a reduction and an expansion phase. In the reduction phase, repeatedly the nodes in an independent set are excluded. This continues until the problem has been reduced sufficiently to be solved in a more basic way. Then all excluded elements are reinserted in reversed order. For efficiently performing these operations it is practical to have a doubly linked list, that is, for every node i there is also data item pred[i] giving its predecessor next to succ[i] giving its successor.
During the reduction phase, the following steps are iterated: somehow an independent set S is constructed, then each node i in S, is excluded from the list by performing
succ[pred[i]] = succ[i]; dist[pred[i]] += dist[i]; pred[succ[i]] = pred[i];During the expansion phase these operations are reversed: if i was excluded in round r, then in the corresponding expansion round we may inductively assume that node succ[i] already knows the index of the last node of the list and the distance thereto. Thus, by performing
last[i] = last[succ[i]]; dist[i] += dist[succ[i]];also the values of node i become final.
In the work-time framework, each of the exclusion and inclusion operations can be performed in O(1) time and work that is proportional to the number of participating nodes. On an interconnection network each of these operations require a constant number of all-to-all routings, with a routing volume that is proportional to the number of participating nodes.
The main point that remains to be specified is the selection of the independent set. This can be done both randomly and deterministically. Investing some more effort in order to get a larger independent set might be a good idea, because this will result in a faster reduction of the problem size. However, the simplest idea is to perform random coin tossing: for each node 0 or 1 is chosen independently with probability 1/2. S is the set of all nodes with value 1 whose successor has value 0. For a list of length n, the expected number of nodes selected equals (n - 1) / 4. Using the Azuma inequalities it can even be shown that at least n / 4 - o(n) nodes are selected with high probability. This means that applying this selection method, the problem size is approximately multiplied by a factor 3 / 4 in each reduction round. In the work-time framework, S can be selected in constant time, on an interconnection network, it requires one all-to-all routing to inform all predecessors.
So, over all selection, reduction and expansion rounds, the sum of the the performed work is given by c * (1 + 3/4 + 9/16 + ...) * n <= 4 * c * n for some suitable constant c. So, it is linear in n. The time is proportional to the number of performed reduction rounds because for each reduction there is one corresponding selection and expansion. There is no need to continue reducing until the problem size is constant, though this would not be wrong. If the selection of the independent set is performed randomly, it is better to switch to another algorithm earlier. A reasonable choice is to reduce the problem size by a factor log n using independent-set removal in O(loglog n) time, and then to rank the remaining nodes using pointer jumping in O(log n) time. Both algorithms perform O(n) work.
Theorem: On an EREW PRAM with n / log n PUs the list-ranking problem for a set of lists with n nodes can be solved in O(log n) time.
Proof: The main problem is the allocation of work to the PUs. If initially each PU takes care of log n nodes, it is not guaranteed that this number decreases in a smooth way. However, in groups of log n PU, which initially take care of log^2 n nodes, such a smooth decrease is assured with high probability. These subsets of the PUs can be perceived as constituting (loglog n)-dimensional hypercubes. After reduction round r, each PU performs a load balancing with the PU at the other end of the (i mod loglog n)-axis of this conceptual hypercube. Doing this, it becomes highly unlikely that in round r a PU has to deal with substantially more than (3/4)^r * log n nodes. Neither independent-set removal nor pointer jumping require concurrent-memory access. End.
On an interconnection network with P PUs, we can proceed in the same way although in practice it is better to first reduce the problem size slightly more in order to reduce the cost of the pointer jumping. Alternatively, the problem size can first be reduced by a factor P and then all remaining nodes can be concentrated in PU 0, where the subproblem can be solved internally in O(n / P) time. This is not so elegant, but minimizes the number of start-ups. In terms of the DMM the second approach gives
T(P, n) = c_i * O(n / P) + c_v * O(n / P) + c_a * O(log P * P).
As observed in the section on "Routing" in the chapter on "Interconnection Networks", the number of start-ups for an all-to-all routing can be strongly reduced by simulating the routing of a d-dimensional meshes or hypercubes. In general this observation is useful only for problems that are too small for the used number of PUs, because such an indirect routing multiplies the routing volume and causes overhead for rearranging packets as well. So, this may only give an improvement if the time consumption is dominated by the time for the start-ups. However, in our case, the problem size is gradually reduced, and even when the overall time-consumption is dominated by the term (c_i + c_v) * O(n / P), it may happen that in some of the all-to-all routings the start-up time dominates. So, the best performance is obtained, when at any stage the best all-to-all routing subroutine is chosen as a function of P and the size of the packets. Working out this idea more carefully, shows that the factor log P form the term c_a * O(log P * P) can be omited without increasing the leading constants hidden in the other two terms.
The independent set can also be found deterministically. One possibility to obtain a large independent set is to deterministically construct a three-coloring as discussed in the section on "Symmetry Breaking" of the chapter on "Basic Techniques". The nodes with any of the three colors constitute an independent set. The largest of these sets has size at least n / 3 for a set with n nodes, The coloring and the computation of the sizes of the subsets can be performed in O(log n) time and O(n) work. Using this independent-set selection procedure until there is only one node left, the whole algorithm takes O(log^2 n) time and O(n) work. An immediate improvement is obtained by applying accelerated cascading: as soon as the problem size has been reduced to n / log n, that is, after at most loglog n / log (3/2) = O(loglog n) reduction rounds, we switch to pointer jumping. This idea reduces the total time consumption to O(loglog n * log n). A more refined algorithm runs in O(log n) time, but practically none of the deterministic algorithms can compete with the randomized one.
A ruling set in a graph is a subset S of the nodes so that each node is either in S or adjacent to a node in S. A k-ruling set is a subset S, so that each node has distance at most k from a node in S. The algorithm of this section is called the sparse-ruling-set algorithm, because it constructs a small subset of the nodes, which are called rulers, that are more or less regularly interspaced in the list. Once we have walked through the list from one ruler to the next, a new reduced list is constructed which has a node for each ruler and links with weight equal to the distance between the rulers. Once the ranks of the rulers are known, each non-ruler can ask the ruler from which it was reached for its rank and add an offset.
The following formulation of the algorithm is given for the DMM model. With minor modifications it can be turned into a PRAM or sequential algorithm.
while (there are still active waves)
{
Send all prepared packets.
for (each node p' that receives a packet (p, d))
if (p' is a ruler) // wave of p ends in p'
{
Add a link (p, p', d + 1) to L.
if (there are still unvisited non-ruler nodes)
{
Randomly select such a node p''. p'' initiates a new wave:
p'' prepares a packet (p'', 0) to be sent to succ[p''].
}
}
else // wave of p continues from p'
{
mark p' as visited with p as ruler and d + 1 as distance
thereof. Prepare a packet (p, d + 1) to be sent to succ[p'].
}
}
In each round of the main while-loop exactly s previously unvisited nodes are reached, so all n nodes are visited in n / s rounds. Considering the number of unvisited nodes, it follows that the expected number s_t of waves that hits a ruler in round t, 0 <= t < n / s, is given by s_t = s * s / (n - t * s). Summing over t, gives the total number of newly created rulers:
sum_t s_t = sum_{t = 0}^{n / s - 1} s * s / (n - t * s)
= s * sum_{i = 1}^{n / s} 1 / i
= s * H_{n / s}.
Here H_k ~= log_e k is the k-th harmonic number. Thus, the problem size
is reduced from n to a value n' ~= s * log_e (n / s).
In a DMM implementation, the algorithm performs n / s + O(1) all-to-all routings. During the main while-loop each node sends a packet containing the index of the ruler, the distance thereof and the destination, so the routing volume is 3 * n / p. In the last step, each non-ruler asks a question generating two answers, so the routing volume of the complete algorithm is 6 * n / p. So, the cost for a reduction of the problem size by a factor f = n / n' = n / s * log_e (n / s) is approximately
T(P, n) = c_i * O(n / P) + c_v * 6 * n / P + c_a * f * log_e f.
Taking s = o(n / (log n * loglog n)) gives n' = o(n / log n), which allows to solve the subproblem with pointer jumping at negligible costs. If s is chosen much smaller, s = o(n / (P * log P)), n' = o(n / P). In that case the data of the reduced list can be gossiped at negligible costs and the remaining problem can be solved internally. In that case the total routing volume is only 3 * n / P, but this goes at the expense of a much larger number of all-to-all routings. For small P this may be a very good idea though. If no new rulers are chosen in the course of the algorithm, the distance does not need to be sent along because it can be determined by counting the number of performed rounds. This has the disadvantage that we have to somehow deal with the initial nodes but allows to reduce the routing volume to just 2 * n / P, which appears to be the best achievable.
In practical applications, if P is somewhat larger, the focus should lie on reducing the number of routing operations. If s is chosen larger, for example, s = n / 50, the problem size is reduced by about a factor 10 with just over 50 all-to-all routings. Applying the same once more gives a reduction by about a factor 100, with 100 all-to-all routings and a routing volume of about 6.6 * n / P. In practice the choice of s should be made as a function of P, n, c_v and c_a in combination with estimates of the cost to solve the subproblem.
We remind that there are two main reasons why running a parallel program on a parallel computer with P PUs mostly gives much less than P speed-up:
The weakness of sending too many packets can be alleviated by applying the following idea. Instead of performing all-to-all routings, at the end of each phase (which is the basic idea of the BSP approach!), it is better to intermix the computation and the communication: Round t is decomposed in P subrounds, denoted t_p, for 0 <= p < P. In subround t_p the algorithm running on PU i sends all it has to send to PU (i + p) mod P. Then it works on all the data it received and prepares the packet for PU (i + p + 1) mod P, and so on. Doing this, the routed packets become larger: a wave now has to wait only P / 2 subrounds before it is progressing again, going twice as fast. This idea does not only give improvements for list ranking, but for any problem in which a non-local structure has to be traversed.
The sparse-ruling-set algorithm may be considered to be a variant of independent-set removal. The difference is the speed at which the reduction is performed: independent-set removal gives a fixed small reduction factor. With the sparse-ruling-set algorithm the reduction factor can be chosen freely. A more interesting observation is that the structure of the sparse-ruling-set algorithm is very similar to that of sample sort: for a small subset of the elements the ranks are determined exactly, then the other elements are ranked between them.
Click here to see a sequential version of the sequential algorithm, the pointer-jumping algorithm and the sparse-ruling-set algorithm. It is important to observe that in the implemented variant of pointer jumping the distance may double by more than a factor two in one round because it may happen that node succ[i] gets processed before node i.
As before every node has fields succ[], last[] and dist[], where ultimately last[i] will point to the last node of the list of i and dist[i] at all times gives the distance from node i to last[i]. A final node i has succ[i] = i. There are two basic operations. The first is called autoclean, the other altroclean. If last[i] = j, we say that node i points to node j.
Autoclean(p) means that for all nodes i stored in some PU_p for which last[i] points to a node also stored in PU_p, we perform last[i] = last[last[i]] and dist[i] += dist[last[i]] until i is pointing to a node stored in another PU. This operation does not require any communication.
Altroclean(p, p') means that for all nodes i stored in some PU_p for which last[i] points to a node stored in PU_{p'}, we perform last[i] = last[last[i]] and dist[i] += dist[last[i]], by asking PU_{p'} for this information. An autoclean operation means that both involved PUs must send one packet. The first packet contains the index of the node the question is destined for, the second contains its last- and dist-value. In total three integers are exchanged.
Building on these two basic operations, we get the following algorithm for PU_p, 0 <= p < 2:
for (each node i in PU_p)
if (succ[i] == i) // i is a final node
{
last[i] = i - n;
dist[i] = 0;
}
else
{
last[i] = succ[i];
dist[i] = 1;
}
autoclean(p);
altroclean(p, (p + 1) mod 2);
autoclean(p);
for (each node i in PU_p)
last[i] += n;
The algorithm is correct, because after the first autoclean no node points to a node in its own PU. Thus, after the altroclean, each node which does not point to a final node points to a node in the own PU. The second autoclean therefore finishes the problem, because if no node in PU_p points to a node in PU_{(p + 1) mod 2}, at the beginning of an autoclean, then it does not do so at the end of this operation. But by then no node points to a node in the own PU either, so all nodes must point to a final node.
The autocleans perform a linear number of internal operations. Essentially an autoclean is nothing but an internal list-ranking problem in which all nodes pointing to a node in another PU are considered to be final nodes. So, typically there will many very short lists.
T(2, n) = c_i * O(n) + c_v * 3 * n / 2 + c_a * 2.
The kernel of the algorithm consists of a loop in which all other PUs are addressed in a circular way:
for (each node i in PU_p)
if (succ[i] == i) // i is a final node
{
last_left[i] = last_rght[i] = i - n;
dist[i] = 0;
}
else
{
last_left[i] = last_rght[i] = succ[i];
dist[i] = 1;
}
autoclean (p, last_left);
autoclean (p, last_rght);
for (t = 1; t < P; t++)
{
altroclean(p, last_left, (p - t) mod P, last_rght);
altroclean(p, last_rght, (p + t) mod P, last_left);
autoclean (p, last_left);
autoclean (p, last_rght);
}
for (each node i in PU_p)
last_rght[i] += n;
We claim that at the end of the algorithm last_rght[i] gives the last node of the list of i for all i, 0 <= i < n. This can most easily be proven with invariants. We claim that after round t, 0 <= t < P, last_left[i] for all nodes i stored in PU_p is clean for p, p - 1, ..., p - t. By this we mean that last_left[i] does not point to any node stored in a PU with indices from {p, p - 1, ..., p - t}. In addition we claim that after round t, 0 <= t < P, last_rght[i] for all nodes i stored in PU_p is clean for p, p + 1, ..., p + t. For t = 0, we have performed autocleans, so in PU p, last_left and last_rght are clean for 0. More generally, assume the claims hold for some given t. So, at the beginning of round t + 1 last_left[i] is clean for p, p - 1, ..., p - t and last_rght[i] for p, p + 1, ..., p + t. Thus, if last_left[i] points to a node in PU_{p - t - 1}, putting last_left[i] = last_rght[last_left[i]] will return the index of a node which does not lie in a PU with indices in {p - t - 1, p - t, ..., p - 1}, but possibly it may lie in PU_p. Performing the autoclean establishes the claim for t + 1. In the above discussion, all PU indices are meant to be computed modulo P.
How about the time complexity? There are 2 * P - 2 altrocleans. Each operation consists of a question followed by an answer. Sending the second question in each pass of the loop together with the first answer, the number of start-ups is reduced to 3 * P - 3, this is only as much as in three all-to-all routings, without any indirect routing. On the other hand, if we have bad luck, there may be n / P questions in each altroclean. However, if the nodes indices are randomized, then we get much better performance: we may assume that in round t, the last values are the indices of nodes which are uniformly distributed over the remaining P - t PUs. If this assumption holds, the expected number of questions for an altroclean in round t equals (n / P) / (P - t). In total this gives sum_{i = 1}^{P - 1} (n / P) / (P - t) = n / P * H_P ~= n / P * log_e P. For every left-question we get back one answer, only a new last-value, for every right-question we get back two answers. So, for the presented version of the algorithm, the total routing volume is about 5 * (n / P) * log_e P. Altogether this gives
T(P, n) = c_i * O(n * log n) + c_v * 5 * n * log_e P + c_a * 3 * (P - 1).
Comparing with pointer jumping, we have slightly less work and routing volume because 5 * log_e P < 3 * log n, and much fewer start-ups. So, in practice pointer jumping will be worse. Comparing with the sparse-ruling-set algorithm we have lost almost a factor log_e P in work and routing volume, but the number or start-ups is far smaller. So, one-by-one cleaning will be better for small values of n / P^2, because then the start-up costs tend to dominate. There are other algorithms which are better for intermediate values, but it is correct to say that sparse ruling sets is best for large values of n / P^2 and one-by-one cleaning for small values of n / P^2. A combination of these two algorithms always gives reasonable results.
The optimal PRAM algorithms can quite easily be turned into algorithms performing O(P) routing steps. This can be achieved on a linear array by geometrically reducing the problem size and the number of employed PUs until the reduced problem fits in one PU. Then it can be solved internally. Reversing the reduction steps solves the problem. The constant hidden in O(P) depends on how the details are worked out. It lies between 10 and 20. It is not clear how such algorithms that contract the problem in an ever smaller subset of PUs should profit from the wrap-around connection after the first reduction rounds, so on circular arrays, the number of routing steps is only slightly smaller.
The quite amazing thing is that on circular arrays list ranking can be solved much faster than this (and because circular arrays can be embedded on linear arrays with dilation 2, by extension even on linear arrays): P - 1 routing steps are sufficient! This can be achieved by an adaptation of the one-by-one cleaning algorithm. The idea is simple: the values of last_rght[] stay in the PU where they started, while the values of last_left[] travel around the network moving leftwards, that is, from PU_p to PU_{(p - 1) mod P}. In this way, the communication that is performed in the DMM algorithm can be resolved locally.
The packets sent in every step have size O(n / P) and the amount of computation is of the same order. Now one may complain that using P - 1 steps with O(n / P) work in each of them gives O(n) in total. This will not give much speed-up! This is true in general, but the important thing is the idea: this is an algorithm in which all communication is local. It is a kind of systolic list-ranking algorithm: two streams of information which are only interacting where meeting. The algorithm might even have practical importance. Maybe the "PUs" actually give highly connected clusters of PUs in a structure in which the clusters are more loosely connected (NUMA platforms). Or, we may have that n, the problem size, is greater than M, the size of the memory of a PU, but that n < P * M.
suc(v_i, u) = (v, u_{(i + 1) mod deg_u}).
If the edges are numbered with different numbers, which can be performed by computing the prefix sums of the numbers of edges each PU is managing, then the defined successor relation simply gives a linked circular list. This can be turned into a non-circular list, by cutting it at a single place. For example, node 0 can cut the list between its first and last neighbor by setting suc(v_{deg_0 - 1}, 0) = null.
For a tree with n nodes and n - 1 edges, on a PRAM this construction takes O(log n) time with O(n) work for the prefix computation and O(1) time with O(n) work for the linking. On a DMM with P PUs, the prefix takes O(n / P) * c_i + log P * (c_v + c_a). If we assume that the information concerning edge (v_i, u) is stored in the PU responsible for node v_i, then the linking requires that the PU which is responsible for node u sends a packet to this other PU. Thus on a DMM the cost is dominated by that of an all-to-all routing in which each PU sends and receives one information for each edge of a node it is managing. If there are no extreme node degrees and the nodes are distributed evenly, this amounts to a k-k routing with k = 2 * n / P. More generally, the edges, after numbering them, should be distributed evenly over the PUs.
Assume we have a rooted tree as input. Then to compute the preorder numbers, all edges pointing away from the root are given weight +1 the other edges are given weight 0. After running a weighted list-ranking algorithm, a node u can obtain its number from the rank of the edge coming from its parent. Switching the weights, that is giving weight +1 to the edges leading towards the root and 0 to the others computes the postorder numbers.
Performing a weighted list ranking on the thus weighted list, a node u can find its inorder number from the rank of the first edge of the pair of edges that changed direction in u.
Consider a bipartite graph with node sets V_1 and V_2, each with n nodes. Assume that the graph is regular of degree g. A first idea for constructing a coloring is to start allocating the colors to the first node of V_1, then the second and so on. When assigning the colors of node i, we should respect the conditions imposed by earlier assigned colors. If one is lucky this may work, but in general this approach will get stuck when we must assign a color c to an edge (i, j) while node j has an edge (i', j) for some i' < i, which was already assigned color c while coloring node i'. Another "greedy" strategy may also work, but not always. The idea is that one tries to assign the colors one by one. The algorithm gets stuck when, while assigning color c, one comes to a node i for which all the uncolored edges lead to nodes which already have an edge that was given color c before.
A working and efficient strategy is based on Euler splittings. For a g-regular graph G with g even, the graph can easily be split in two edge-disjoint subgraphs G_0 and G_1 each of degree g / 2. This is done by constructing an Euler tour of the graph, numbering the edges on the tour alternatingly 0 and 1. All edges which have been numbered 0 belong to G_0, all other edges to G_1. Because sequentially an Euler tour of a graph with n nodes and m edges can be constructed in O(n + m) time, this splitting costs time O(g * n).
In general, for a graph of degree g = 2^k, the algorithm consists of k rounds. In round i, 0 <= i < k, the algorithm is working on 2^i subgraphs in which all nodes have degree 2^{k - i}. Each of the 2^i operations in round i takes O(2^{k - i} * n) time, so the total amount of work in any of the rounds is O(2^k * n) = O(g * n) = O(m). Thus, in total over all rounds the algorithm takes O(k * m) = O(log g * m) time. For g which are not powers of 2, the construction of such colorings is considerably harder. After much research it has been established (by Cole, Ost and Schirra in Combinatorica, Vol. 21, 2001) that even the general case can be solved in O(log g * m) time.
The basic idea is to turn the problem of finding an Euler tour into a list-ranking problem. For every edge there is a node. These nodes have no predecessor and successor links, but for every edge e, there is a data element left[e] and rght[e]. left[e] gives the edge e' to which e is connected in its left endpoint i in V_1. This decision is made locally, without global coordination in node i. If left[e] = e', then left[e'] = e. rght[e] gives the edge e'' to which e is linked in its right endpoint j in V_2. A node of the list (corresponding to an edge of the graph) which is entered by a left[] link is left through a rght[] link and vice versa.
The list-ranking algorithms can, with minor modifications also be applied to figure out the minimum on a circular list and the distance thereof. This is now done for both lists (not knowing the global structure, it must be done for both). Of course we do not have a conventional list structure: each node has two "successors": one labeled "left" and one labeled "right". So, what is the appropriate successor? The answer is simple: the path starting over a "left" link must at even distances from the original node traverse "left" links and at odd distances "right" links. Further, notice that, because any cycle on a bipartite graph has even length, for any node e, if the distance to the node with the smallest index is even along one direction then it is also even along the other direction. This observation makes that we now can split the set of nodes (corresponding to the edges of the graph) in two: all those at even distances go into one subset, all those at odd distances into another.
Summarizing: first each node of the graph arbitrarily links up the edges connected to it in connected pairs; then a set of bidirectionally linked circular lists of a slightly modified nature with g * n nodes is constructed; then two list ranking problems are constructed. Clearly this whole problem can be solved in O(T_listrank(g * n)). Hereafter we have to solve two instances with each half as many edges. In a straightforward implementation this gives
T_color(P, n, g) = O(T_listrank(P, g * n) + 2 * T_color(P, n, g/2).The solution depends on the list-ranking time. For a DMM, T_listrank(P, g * n) = O(g * n / P * (c_i + c_v) + c_a * P). Substituting and solving gives
T_color(P, n, g) = O(log g * g * n / P * (c_i + c_v) + c_a * g * P).
This is a good result except for the number of startups. Organizing things slightly differently the number of startups can be minimized: after each splitting of the problem, each of the two subproblems should be concentrated in half of the available PUs. This gives a different recurrence relation:
T_color(P, n, g) = O(T_listrank(P, g * n) + T_color(P/2, n, g/2).Solving this recurrence gives
T_color(P, n, g) = O(log g * g * n / P * (c_i + c_v) + c_a * P).So, we saved a factor g in the number of startups. This improved algorithm will work well if log g * g * n >> P^2, the first version required log g * n >> P^2.
If there are two independent problems which must be solved on a DMM or an interconnection network with P Pus, then it is mostly better to solve these problems in parallel each on P / 2 PUs than to solve them after each other each on P PUs.
On a two-dimensional m x m mesh the situation is even much better. Assuming that g * n > m^2, T_listrank(m^2, g * n) = O(T_route(m^2, g * n)) = O(g * n / m). So, we get
T_color(m^2, n, g) = O(g * n / m) + T_color(m^2 / 2, n, g / 2).Because (g / 2) * n / sqrt(m^2 / 2) = (g * n / m) / sqrt(2), after each splitting the subproblems can be solved faster by a constant factor (more accurately: after splitting twice there are four subproblems which can be solved twice as fast on m / 2 x m / 2 meshes). Thus, the sum of the times of all splitting phases is proportional to the time for the first one:
T_color(m^2, n, g) = O(g * n / m).
Of course the speed-up of this mesh algorithm is only m = sqrt(P), but this is the normal situation on meshes. There are very few problems like matrix multiplication and all-pairs shortest paths for which full speed-up can be achieved. All but very trivial problems require that at least all data are communicated once over the bisection of the network and therefore full speed-up can be achieved only if the sequential complexity of the problem is sufficiently high. A result like the obtained one is optimal up to constant factors in a two-dimensional world. The basic assumption of the DMM model that locality plays no role is definitely not valid without limitation on P.
We have seen how to solve the unbalanced routing problem using randomization: sending all packets first to a random intermediate destination (or by shuffling the packets in a deterministic way), any routing pattern may be replaced by two almost balanced routing patterns. The main disadvantage of this approach is that it doubles the routing volume. A coloring approach offers an alternative solution which are mostly satisfied in practical contexts.
We consider a DMM with P PUs. Each PU has to send and receive in total k packets. Let a_{i, j} denote the number of packets PU_i has to send to PU_j. The conditions are
0 <= a_{i, j} <= k,The entire routing pattern constitutes a bipartite k-regular graph with P nodes on either side. Constructing a k coloring decomposes the routing into k permutation routing problems, which can be routed in (c_v + c_a) * k time. Even when ignoring the excessive time for the coloring, this is not good, because in general this will be much more than c_v * 2 * k + c_a * 2 * P, the time for the randomized coloring.
sum_{i = 0}^{P - 1} a_{i, j} = sum_{j = 0}^{P - 1} a_{i, j} = k.
Some form of sparsification is the key to the solution. Possibly we might group PUs together, but this inevitably leads to routing packets several times, and then we could just as well apply the simpler randomized routing. So, bundling the packets into superpackets is the only cost reduction we can try to apply. Let s be the size of the superpackets. That is, in PU_i the a_{i, j} packets which have to be routed to PU_j are packet into b_{i, j} = round_up(a_{i, j} / s) superpackets of size s, creating one partially filled superpacket. In total PU_i creates sum_{j = 0}^{P - 1} round_up(a_{i, j} / s) <= sum_{j = 0}^{P - 1} (a_{i, j} / s + 1) = k / s + P. So, this gives a bipartite graph with maximum degree k / s + P. The nodes do not need to have the same degree, but after computing prefix sums dummy packets can be added to make the graph regular. Because within certain limits s can be chosen freely, we may assume that k / s + P is a power of two.
On this reduced (k / s + P)-regular bipartite graph with P nodes on either side the presented Euler-tour-based coloring algorithm can be applied. This takes O(log(k / s + P) * (k / s + P) * (c_i + c_v) + c_a * P) time. This is a loss term. The second loss term comes from the fact that we have more than k / s superpackets: the subsequent routing takes (k / s + P) * (c_v * s + c_a). We want to determine a good choice for s and a condition on k and P so that all loss terms can be made negligible in comparison with the useful time c_v * k. Clearly P should be negligible in comparison with k / s. This allows to simplify the time for the coloring. Equating the loss terms gives the following equation:
log k * k / s = P * s.Solving gives a close-to-optimal choice for s, namely s = sqrt(k * log k / P).
Theorem: On a DMM with P PUs any k-k routing can be performed in (1 + o(1)) * k * c_v time if k = omega(P * log P).
Proof: Taking s = sqrt(k * log k / P), the loss terms are bounded by O(sqrt(k * log k * P) * c_v). If k = x * P * log P, sqrt(k * log k * P) = k * sqrt((1 + log x / log P + loglog P / log P) / x). Clearly for x = omega(1), this is o(k). End.
So, for unbalanced k-k routing optimality can be achieved for values of k which are only marginally larger than the k = omega(P) required for balanced k-k routing. Practically the only disadvantage is the rather complicated construction.Evaluating arithmetic expressions in parallel does not appear to make sense because it is hard to imagine in which context sufficiently large expressions occur, but the expression-evaluation problem models other tree problems which are important and for which even have to be solved for large trees. An example of such a problem is that of computing the height of each node u in a tree, that is, the distance to the deepest lying leaf in the subtree of u. For this problem the Euler-tour technique does not appear to be suitable. As an expression-evaluation problem it can be solved by putting 0 in all leafs and using as operator max + 1, taking the maximum value of the children and adding 1. Another example is that of computing for each node u of a tree the minimum key value in the subtree of u. This problem can be solved as an expression-evaluation problem in which each node computes the minimum of its own value and that of its children.
For expression trees of moderate depth the evaluation problem can be solved efficiently in parallel by allocating a processor to each node connected to two leafs. This initial processor allocation can be performed using the Euler-tour technique. For each such a node its value can be computed in constant time. When as a result a new node connected to two leafs is generated, the allocated PU is taking over this job, otherwise it becomes passive. In each step the depth of the tree is reduced by one. Thus, for a tree of depth d with n nodes, the time is O(d + log n) and the work is O(n). If the tree is deep and skinny, for example a path, this way of evaluating gives very limited parallelization.
Many rakes can be performed in parallel, but it should be avoided to rake two leafs with the same grandparent. This is not hard to realize. The algorithm starts by consecutively numbering the leafs using the Euler-tour technique. Then the rake operation is first performed on all leafs with numbers 4 * k + 0, for some k, and then on all leafs with numbers 4 * k + 2, for some k. Hereafter each leaf with an odd numbers i is renumbered as (i - 1) / 2, and the process is repeated until there are only two leafs left. Because the raked leafs are always four numbers apart, they can never have the same grandparent. At this point we essentially use that the tree is binary. So, provided we know how to perform a rake operation, this correctly contracts the tree.
The above tree contraction can be performed very efficiently in parallel: the Euler-tour technique was analyzed before. On an EREW PRAM it can be performed in O(log n) time with O(n) work. Assuming that raking a leaf takes constant time, each contraction phase, consisting of two parallel rakings and a trivial renumbering, takes O(1) time. In each contraction phase the number of leafs is halved, so the whole contraction can be performed in O(log n) time. For some constant c, the work during the contraction is given by sum_t c * n / 2^t = O(n). In the time estimate it is assumed that there are no internal nodes of degree one, otherwise for a tree which has very few leafs only a few nodes disappear in every contraction phase until there is one leaf left. The easiest is to add a dummy leaf to any internal node with degree one. This leaf should get a neutral value so that the evaluated value remains the same for all real tree nodes.
Invariant: At all times for any leaf u, a_u = 0 and b_u = val(u). For any internal node u, with children v and w, val(u) = (a_v * val(v) + b_v) op_u (a_w * val(w) + b_w).
Initially the invariant condition is satisfied, while raking the values of the parameters must be modified appropriately. Assume we are raking a node v with parent u and sibling w. The nodes v and u are eliminated, so w should take over the values of these three. v is a leaf, thus, according to the invariant, val(u) = b_v op_u (a_w * val(w) + b_w). We distinguish two cases:The way the algorithm is specified, the correct value will be computed in the root of the tree, but not for all other nodes. In the following examples it will be discussed how to correct this. The idea is analogous to that for computing a prefix sum: the whole algorithm consists of two phases, one running upwards, one running downwards. The difference is now that operations may take place at different levels of the tree in the same parallel operation.
This is an easy case. All nodes are given initial weight 1. In addition there are dummy leafs added to all internal nodes of degree one, these get weight 0. The only operation is the addition. It adds the value of both children to the weight. More precisely, when raking u with parent v and grandparent w which have weights a_u, a_v and a_w, respectively, then u and v are removed and a_w is replaced by a_w = a_w + a_v + a_u.
Notice that the above construction does not compute the size of the subtree for the node v which is skipped over. This problem can be solved easily. After the tree contraction there is a tree expansion, reversing the operations. So, all nodes which are excluded in phase i, i = 1, 2, ..., t, are reinserted in phase i = t, t - 1, ..., 1. After reinsertion phase i we may assume that all nodes which have been reinserted already know their correct values. So, by the time v is reinserted, u and its sibling u' at the time of the exclusion of v have been reinserted and obtained their final values a_u and a_{u'}. So, v can compute its value by setting a_v = a_v + a_u + a_{u'}.
If the tree represents some kind of evolutionistic development, then the LCA of u and v gives the historical event where two "species" went their own independent way. If the tree T is a spanning tree of some undirected graph and we want to make it more like a DFS tree, then when discovering a cross edge (u, v) it may be clever to replace one of the edges from LCA(u, v) leading to u or v by the edge (u, v).
The distance from a node u to a node v in a tree is given by the sum of the distance from u to LCA(u, v) and that from v to LCA(u, v). The distance of u to LCA(u, v) is given by the difference of their depths in the tree. So, after computing the depths of all nodes, which can be performed in linear time, the distances between any pair of nodes can be computed in constant time plus the time for for computing the LCA. Replacing the depths by the sum of all weights on the path from the root to a node, this idea can even be used for computing the distances in weighted trees.
An analogous problem is when for a fixed weighted tree we frequently have to determine the heaviest edge on the path from a node u to another node v in the tree. So, instead of using the sum operator we use the maximum operator here. The difference is that the maximum operator has no inverse, so we cannot simply subtract the contribution from the path between the root and LCA(u, v). In this case we first have to construct an associated tree T'. We describe how this can be done sequentially. We need a union-find data structure in which all nodes are inserted. If T has n nodes, then T' has n leafs in one-one correspondence with the nodes. The leafs all get key value 0. The edges of T are sorted according to their weight and processed in order of increasing weight. For an edge (u, v) of weight weight(u, v), a new node w of T' with two children is created. w has key value weight(u, v). w is connected to the node corresponding to the component of u and the component of v. Hereafter these components are unified and w becomes the node representing this component. Hereafter, for any pair of node u and v the weight of the heaviest edge on the path from u to v in T is given by the key of the LCA of u and v in T'.
The above examples show that for several applications it is desirable to be able to answer the question "what is the LCA of u and v" for two nodes u and v of a tree. For any single such query the problem can be solved easily by walking towards the root from u and v marking the nodes on the path as visited. The first node that is reached for the second time is the LCA. If this walking is done by alternatingly making a step on the path from u and the path from v, the total number of visited nodes is proportional to the length of the path from u to v. The marks can be set back in the same time. This is clearly optimal and if there are several such queries, then these can be performed in parallel without problem (though concurrent reads may occur).
However, if there are very many such queries, it is better to preprocess the tree so that the queries can be faster than by walking along the tree. The problem of preprocessing the tree so that subsequent queries can be performed in constant time is called the LCA problem. So, the topic of this section is not how to parallelize the queries, which is trivial, but to show how this preprocessing, which sequentially can be performed in O(n) time for a tree with n nodes, can be performed in parallel.
More important is the case that the tree T is a perfect binary tree with n = 2^k - 1 nodes for some k > 0. The key idea is to perform an infix numbering of this tree, giving number 1 to the leftmost leaf. So, the node numbers are 1, ..., 2^k - 1. Let w = LCA(u, v) be the LCA of two nodes u and v. Let (u_{k - 1}, ..., u_0) and (v_{k - 1}, ..., v_0), be the binary expansions of the numbers of u and v respectively. Let x(u, v) be the index of the most-significant bit of u and v which are different. Using this notation, the binary expansion of the number of w can be expressed easily: w_i = u_i for all i > x(u, v); w_{x(u, v)} = 1; w_i = 0 for all i < x(u, v). Of course it is not necessary that the number of nodes is exactly a power of two minus one.
Theorem: For a perfect binary tree T with n = 2^k - 1 nodes a numbering of the nodes can be computed in O(log n) parallel time and O(n) work so that based on this numbering for any pair of nodes u and v of T a query LCA(u, v) can be solved in O(1) sequential time.
The idea for the perfect binary trees can be extended to all trees which are sufficiently like perfect binary trees. The sequential linear time algorithm is based on such an extension. We do not further consider this construction but assume the following as a fact:
Theorem: For any tree T with n nodes a tree T' of size O(n) can be constructed in O(n) time so that based on the information contained in T' for any pair of nodes u and v of T a query LCA(u, v) can be solved in O(1) time.
The construction is again based on the Euler-tour technique. First an Euler tour is constructed running along all edges in the tree, starting and ending in the root node r. This tour is used to construct arrays index[] and depth[]. For a tree with n nodes each has length 2 * n - 1. index[i] gives the index of the node that is visited after traversing i edges of the tour. index[0] = index[2 * n - 2] = r. depth[i] gives the depth of node index[i]. It can be computed by weighting all list edges leading away from the root with +1 and all other edges with -1. Running a list-ranking algorithm, each of these arrays can be computed in O(log n) time and O(n) work.
From index[] and depth[] we compute the arrays first[] and last[]. These arrays have length n. For all u, 0 <= u < n, first[u] = min_{0 <= i < 2 * n - 1} {index[i] = u}, that is, it gives the first occurrence of u on the tour. Analogously, last[u] gives the last occurrence of j on the tour. Because the tree is connected all these values are well defined. For a leaf u, first[u] = last[u]. For an internal node u first[u] < last[u]. On an EREW PRAM these arrays can be constructed in O(1) time and O(n) work: For all i, 0 <= i < 2 * n - 1, it is checked whether depth[i - 1] < depth[i] or not. If this is the case, we know that the tour came to node u = index[i] from above. So, this must be the first visit to node u. In that case the testing PU sets first[u] = i. If depth[i + 1] > depth[i], this is the last visit to node u = index[i] and the testing PU sets last[u] = i.
For any pair of nodes u and v, there are five mutually exclusive possibilities:
Lemma: If last[u] < first[v], LCA(u, v) = index[j], where j = min_{last[u] < i < first[v]}{depth[i] = d}, where d = min_{last[u] < i < first[v]}{depth[i]}.
Proof:
Let d and j be as defined and let w = index[j]. All nodes on the tour between last[u] and first[v] have depth larger than w. So, the tour has not left the subtree rooted at w. Thus, all these nodes, particularly u and v are descendants of w, that is, w is a common ancestor of u and v. Now consider another node w' which also lies on the subtour lying between last[u] and first[v]. Suppose w' is an ancestor of u. The tour is constructed so that once it leaves a node to a higher lying node it does not return. This holds in particular for w'. So, only after having visited all nodes in the subtree of w' the tour reaches w. These nodes in the subtree of w' will not be visited again. We know that hereafter the tour reaches v at least one more time because first[v] > j. Thus, v does not lie in the subtree of w' and thus w' is not a common ancestor. We conclude that on this subtour w is the only common ancestor of u and v. End.
p[i] = p_1[i], for all 0 <= i < n / 2;In an analogous way the array of suffix minima s[] can be constructed from arrays s_1[] and s_2[] of suffix minima for each half of a[].
p[i] = min{p_1[n / 2 - 1], p_2[i - n / 2]}, for all n / 2 <= i < n.
Assume that n = 2^k for some k >= 0. For other n, we can if necessary round up the length of the array to the next power of two. Let p_{l, m}[], 0 <= l <= k, 0 <= m < n / 2^l, be the array of prefix minima for the subarray of a[] of length 2^l starting at position m * 2^l. Define s_{l, m}[] to be the analogous array of suffix minima. By the above recursive construction all arrays p_{l, m} and s_{l, m} can be constructed in O(log n) time and O(n * log n) work:
T(n) = T(n / 2) + O(1);Once all these arrays have been computed, a range-minimum query can be performed in O(1) sequential time.
W(n) = 2 * W(n / 2) + O(n).
We start with an example. Suppose we want to determine min_{393 <= i <= 438}{a[i]}. Then we first write these numbers u and v in binary: u = 393 = 00110001001, v = 434 = 00110110010. We have added a few leading zeroes to show the generality of the idea. Then we determine the most-significant different bit. In our example it occurs at position 5. The number w which is obtained by taking v and setting the 5 least-significant bits to zero is w = 0011010000 = 416 = 13 * 32. The number we are looking for is min{s_{5, 12}[9], p_{5, 13}[18]}. Here 23 = 393 - 12 * 32 and 18 = 434 - 13 * 32.
More generally, for numbers u and v, u <= v, the most-significant different bit x = x(u, v) is determined. Then the number w is constructed by taking v and setting the least-significant x bits to zero. Let d = 2^x and l = w / d. The definition of w and d is so that (l - 1) * d = w - d <= u and (l + 1) * d = w + d > v. The number we are looking for is min{s_{x, l - 1}[u - (l - 1) * d], p_{x, l}[v - l * d]}. This is so, because s_{x, l - 1}[u - (l - 1) * d] = min_{u <= j < w}{a[j]} and p_{x, l}[v - l * d] = min_{w <= j <= v}{a[i]}.
It is interesting to notice that the number w is just the index of the LCA of nodes u and v in a perfect binary tree with inorder indexing with smallest index 1, the special case of the LCA problem treated above. This shows the close relation between the LCA problem and the range-minima problem. In an implementation it is not necessary to make this tree structure explicit, the set of prefix and suffix arrays can be managed by a double indexed array of arrays of integers. The given computation shows which two values in this construct must be accessed to compute the range minimum from. On a RAM this takes constant time.
Lemma: For an array a[] of length n, in O(log n) time and O(n * log n) work a data structure requiring O(n * log n) storage can be created which allows it to solve range-minima queries on a[] in constant time.
To reduce the work and memory consumption, the array a[] of length n is divided in n / log n subarrays of length log n each. For each of these subarrays an optimal sequential algorithm is applied. For each of the subarrays this takes O(log n) time and work. So, the parallel time is O(log n) and the work is O(n). The created structure allows to perform range-minima queries inside any of the subarrays in constant time. Along with this construction we can also construct an array b[] of length n / log n consisting of the minima in each of the subarrays. The above given suboptimal algorithm is now applied to b[] rather than a[]. This takes O(log n) time and O((n / log n) * log(n / log n)) = O(n) work. So, the whole construction takes O(log n) time and O(n) work.
It remains to specify how this two-stage structure can be used to perform range-minima queries in constant time. For numbers u and v, u <= v, belonging to subarray i_u and i_v, respectively, we distinguish two cases (with a subcase in the second):
Now the construction is perfect except that we have not told how to sequentially preprocess an array in linear time so that range-minima queries can be performed in constant time.
Theorem: For an array a[] of length n, in O(log n) time and O(n) work a data structure requiring O(n) storage can be created which allows it to solve range-minima queries on a[] in constant time.
Corollary: For a tree with n nodes, in O(log n) time and O(n) work a data structure requiring O(n) storage can be created which allows it to solve LCA queries for any pair of nodes in constant time.
For k-bit numbers, any single computation of intlog() can trivially be performed in O(k) time. Using a binary-search strategy, based on comparisons and shifts, this can easily be reduced to O(log k) time. Using bit parallelism further improvements are possible, but we do not obtain a constant-time operation this way. If we have to perform many computations of intlog(), it may be better to first compute an array intlog[] of length n = 2^k containing all values. A later call intlog(i) can then be answered in constant time by returning intlog[i].
Sequentially all n values of intlog[] can easily be computed in O(n) time:
void computeIntlog(int[] intlog, int n)
{
intlog[0] = -1; // Artificial value
intlog[1] = 0;
for (int i = 2; i < n; i++)
if ((i & (1 << (intlog[i - 1] + 1) == 0)
intlog[i] = intlog[i - 1];
else
intlog[i] = intlog[i - 1] + 1;
}
The given subroutine cannot easily be parallelized. There are several alternative constructions. If optimal work would not be a major concern, then we could apply any of the sequential algorithms. For a sequential time T(n), this gives work T(n) * n. We know that T(n) = O(loglog n) or less. It is very simple to make the algorithm optimal: first compute intlog[i] for all values 0 <= i < sqrt(n). This takes T(n) time and O(sqrt(n) * T(n)) = O(n) work. Then for any number i, sqrt(n) <= i < n, set intlog[i] = intlog[i >> (k / 2)] + k / 2. This takes constant time and O(n) work.
Theorem: If T(n) = O(loglog n) denotes the time for sequentially computing the value of intlog(i) for any number 0 <= i < n, then the array intlog[] with intlog[i] = intlog(i) for all i, 0 <= i < n, can be computed in parallel in O(T(n)) time and O(n) work.