Preface and Notation

This document contains the text of a course in parallel structures.

As html is not suited for mathematical formulas, some additional notation is used (as used in the typographical package Latex). a_i denotes a with subscript i. a^i denotes a to the power i. <= and >= are used as in most computer languages. Curly brackets, "{" and "}", are use to group things. sum_{i = 0}^j denotes the sum for i running from 0 to j. The same may also be written sum_{0 <= i <= j}. ~= stands for "approximately equal" and ~ for "proportional to". Greek letters are written out. For logical expressions we either use the notation from C or the operators are written out in text. So, "a && !b" is the same as "a and not b". sqrt stands for the square root function and log without specifying the base number for the logarithm to basis 2. There are a few more notations, but they should be understood easily. New notions are printed bold there where they are defined. Inside the chapters particularly important ideas and short key notes are highlighted without introductory text. There are many pictures. These are intended to be self-explanatory. Typically they are placed just after the text fragment to which they belong, generally there is no direct reference to them in the text.

There are very few references to the literature. Clearly most of the presented material is not new. A large part is even common knowledge. More directly this text is based on material found in the following two books:

The first was used for most of the PRAM algorithms, the second for some results on sorting. Several students have contributed by pointing out errors and spots which required better explanation.

Notice: the following text may be overcomplete. At the examination it is expected that the students only know all that has been presented during the lectures.

Table of Contents





Introduction

Parallel Computing

Introduction

Processors become faster at a tremendous rate. For the last 30 years the clock rate has been doubling about every 18 months. However, there are convincing arguments that this trend cannot continue another 30 years: processors must have a certain minimal size and information cannot travel faster than the speed of light. Nevertheless, the demand for computing power will continue to increase. Even right now there are problems which are at the same time very urgent and very time consuming, weather forecasting is a prominent example.

To meet both kinds of demands, future and current, it is natural to consider using more than one processor for solving a single task. Because processors not only become faster but also cheaper this becomes more and more realistic. As soon as there is a demand for it, the production of multi-processor chips will start. Right now the newest processors are still expensive, but processors of the previous generation are cheap: If one strives for the maximum number of clock cycles per money unit, then clearly one should not buy the top model, but rather four with half the speed. Using P processors instead of 1, we may hope to become P times faster, but this is rarely achieved. Therefore, it is not sure that four 2 GHz processors will solve a problem faster than a single 4 GHz processor.

If we know how to exploit the computational power of a system with several processors, such a system is financially attractive and technologically possible.

Definition: A computer with several processors is called a parallel computer. Parallel computing is the (study of) the usage of parallel computers in order to solve tasks faster.

Task Parallelism

The simplest way to use parallel computers is to let several processes be solved by several processors. This can both be done for a multi-user system, as well as for a single-user system. Large data servers often have many processors: the incoming data requests are scheduled over the processors and each request is handled by a single processor. Such systems are commercially employed and may have hundreds of processors. On a single-user system, the various processes of the user may be allocated to several processors. At a lower level, the various kinds of instructions may be executed by several processors. The degree of parallelism is rather bounded this way, because a single user typically runs only one time-demanding process, and instructions cannot be subdivided arbitrarily.

Even though technologically different, this kind of parallelism can be characterized as being task parallelism. The most important theoretical question in this context is how to schedule the tasks, that is, how to allocate the tasks to the processors so that an even load balancing is achieved. The load balancing problem has several variants. For a fixed set of tasks, it is a common goal to minimize the make span, the time to complete all tasks, respecting interdependencies if any. In an online context, the goal is more reasonable to try to minimize the completion time, the time before any job has been completed, or to minimize the time before any job starts being done.

In practical applications until now task parallelism is by far the most important branch of parallel computing. However, from an algorithmic point of view it is not very interesting because there is not so much one can do. Therefore, this is not the topic of this lecture. Only here we will briefly discuss the load-balancing problem.

If the lengths of the tasks are known in advance, then the problem of how to schedule a set of tasks so that the make-span is minimized is NP-hard. A reasonable heuristic is to sort the tasks according to their lengths and to allocate in this order to the processor which until then has the least work to do.

If the lengths of the tasks are not known, then the best one can do is to always allocate a task to the processor which currently has the shortest queue. This is the normal scheduling principle applied in supermarkets. In this context there is the issue of task migration: at certain costs it may be possible for tasks to be reallocated in order to achieve a more even work load.

A problem with this approach is that it requires that the lengths of all queues are centrally known: in a huge supermarket with 30 cash-sites, one will typically not check all of them, but simply take the locally shortest queue. If the central allocating agent does not know the loads, then the best he can do is to perform the allocation in a random way. Queuing theory tells that this is nevertheless quite good: if a processor on average gets allocated k tasks, the deviation from this expected value is bounded by O(sqrt(k * P * log(k * P))). Much better balance is achieved if the allocator always first asks the queue lengths of two randomly selected processors and then allocates the next task to the one with the shorter queue.

Perfect balance is achieved if the scheduling is postponed: a processor fetches a next available task as soon as it is ready with the previous one. This is the scheduling principle applied in modern post offices. The disadvantage is clear: the waiting tasks have to be stored in a central storage and cannot be handed out immediately. Furthermore, the processor will be idle while sending the request for a next task and waiting for this task to arrive.

"True" Parallelism

The main topic in this lecture is the question how a single computational task can be efficiently solved using more than one processor. As said above, ideally, with P processors the problem is solved P times faster, but this is rarely achieved.

For an analogy consider a ship wharf. In earlier days work was done in a very labour intensive way. In the rich countries, using better equipment, the number of working hours for building a ship has been reduced considerably. Productivity is not doubling every 18 months, but even in this domain there is an exponential increase, justifying the noticeable salary increases of the past 100 years. Nevertheless, it still takes many thousands of working hours to build a supertanker. Unfortunately the company which ordered the ship normally does not want to wait 20 years. The solution is well-known: a wharf building a supertanker typically employs thousands of workers. Of course having many workers around, there must also be people not actually building the ship but leading the work, managing the company and cleaning the buildings. It also takes time to get instructions from the management down to the people on the work floor, and similarly there may be a need for workers to communicate with the management (paint is running out!) or among each other (we are ready with our section of the ship!). Even worse, the process cannot be speed-up arbitrarily, because there is a fixed order of steps, which cannot be rearranged arbitrarily. Some of these steps require many more workers than others, and it thus becomes inevitable that during part of the process some workers are idle. It may even happen that due to bad organization people are standing around though there is work they could do. Or people are working, but because they did not receive precise instructions, they are drawing cables which later have to be removed again, or which were already installed by someone else. Finally, given that one wants to minimize the construction time, the chief engineer may decide to reschedule the work. For example, it is clearly practical to not start to paint the ship before it is completely welded together, but maybe the painters can already start painting where ever something gets ready. They waste time while climbing the scaffoldings many times and cleaning there equipment.

Let us summarize the aspects which lead to the fact that P workers will not build a ship P times faster than a single worker (provided he has all the skills and is able to physically do the job):

All these aspects arise also if a single problem has to be solved by a set of P processors. Communication and how to minimize the cost of it is a central issue on computers with a distributed memory, but will be abstracted away in most of this lecture. The scheduling is an important problem in general, but we will design the algorithms so, that in most cases it is trivial how to schedule the arising subtasks. Useless work can normally be prevented, double work may be reasonable in some contexts, but will not play a big role in this lecture. The central topic of this lecture is the design of algorithms which allow to use as many processors as possible, while loosing at most a constant factor in efficiency. So, we accept that a parallel algorithm is more work-intensive by a constant factor, but normally not more than that. This does not mean that we will only look at such problems, we will see how to determine the maximum of n numbers in O(1) time using O(n^2) processors, but this is the focus. Quite often we will first only consider the time: given arbitrarily many processors, how fast can the problem be solved. Then we will try to reduce the number of processors employed so that the algorithm becomes optimal in the above sense.

Definitions

We introduce some important notions describing the performance and quality of parallel algorithms and programs. In the remainder of the text we will mostly write PU, an abbreviation for "processing unit'', instead of processor. The time of a parallel program solving a problem of size n on a computer with P PUs is typically denoted T(P, n). For an algorithm the same notation may be used to denote a O()-estimate with respect to some computational model.

The cost C(P, n) of a parallel algorithm solving a problem of size n on a computer with P PUs is defined as C(P, n) = P * T(P, n). The cost is to be distinguished from the work W(n, P) which equals the sum over all PUs of the number of performed operations. The difference is that the work does not take idle time into account.

The speed-up S(P, n) gives the number of times the parallel program is faster than the sequential program. So, S(P, n) = T(1, n) / T(P, n), for some definition of the sequential time T(1, n). For T(1, n) there are several possibilities, which are all in use:

Especially with the last notion of T(1, n), the speed-up expresses how the algorithm scales: if the speed-up is close to P (or Omega(P) in theoretical contexts), one says that the algorithm scales well. If the speed-up is much lower than P (or o(P) in theoretical contexts), one says that the algorithm scales poorly.

The efficiency E(P, n) of an algorithm is related to the speed-up: it is given by E(P, n) = S(P, n) / P = T(1, n) / (P * T(P, n)) = C(1, n) / C(P, n). So, it is the ratio of the costs, whereas the speed-up is the ratio of the times. A parallel algorithm is called optimal if E(P, n) = Omega(1), that is, the cost of the parallel algorithm is at most a constant factor larger than that of the best sequential algorithm. A study of the efficiency is especially important in practical contexts: Often the efficiency starts to decrease quite strongly from a certain P value on. The typical reason for this is that when P becomes too large for the value of n, the communication costs (which often increase with P^2) become dominating, and any further increase of P will hardly lead to a decrease of the overall time. Particularly on a multi-user system with P = 32 one should not run an application with an efficiency of under 10%.

Super-Linear Speed-Up

Above it was stated several times: the goal of parallel computation is to achieve speed-up P using P PUs and that this goal cannot be achieved due to several kinds of overhead. There are, however, contexts in which it is possible to solve a problem more than P times faster using P PUs than when using a single PU of the same type. The reason is that P PUs also have P times as much cache and P times as much main memory. On a modern computer cache misses are heavily penalized, costing several hundreds of clock cycles. Page misses even cost on the order of 10^7 clock cycles.

As an example we consider sorting 10^10 integers, requiring 40 GB of storage. On a single PU, having only 1 GB of main memory, this will cost about 12 hours. This time is estimated as follows: 40 GB must be read and written twice, a typical transfer rate between main memory and hard disc is 8 MB. So just for the data transfer we need 160 * 10^9 / (8 * 10^9) = 20,000 seconds. If we have 64 of these PUs interconnected by a switch giving 2 Gbit point-to-point connection the time becomes much better. We must assume that originally the data stand distributed over the PUs. After sending around a sample of negligible size, all data must be routed over the network once before each PU has to sort 156 * 10^6 integers. The routing takes a few seconds: each PU holds data with a total size of 5 * 10^9 bits. 2 * 10^9 bits can be routed per second. The sorting takes less than 100 seconds on a 3 GHz processor. So, we achieve a speed-up of 200 with 64 PUs.

The described phenomenon of solving a problem more than P times faster using P PUs is called super-linear speed-up. It is a true phenomenon, and it is practically important, but it does not really tell something about the quality of the algorithm. Clearly, very bad parallel algorithms will not achieve any speed-up, not even in an extreme context. But, comparing with an unreasonably slow sequential algorithm does not provide a fair basis for evaluation: if in the above example the joint 64 GB of main memory would be available to a single PU, then this PU could have performed the sorting in about 50 times the parallel time. In that case the efficiency of the parallel sorting algorithm would have lain around 0.90 and not around 5.0.

Too-good-to-be-true speed-ups are obtained by either comparing with sub-optimal sequential algorithms, or by comparing with sequential algorithms which are slowed down by slow access to a large data set.

Cost Models

Multitude of Models

In a sequential context there is a single widely accepted cost model: the von Neumann unit-cost model. The basic assumption of this model, that all operations are equally expensive, is far from exact, but in this case convenience tends to prevail over accuracy: accessing the main memory is far more expensive than operating on data residing in the registers or cache. This may be important in practice, but in theoretical studies this distinction is mostly ignored. As long as the complexity of algorithms is given in O()-notation, this is even formally correct as all operations take at most a constant number of clock cycles.

In parallel computing the situation is different: there are several cost models none of them dominating. The reason for this multitude is that the simplest of the models is felt to model reality too poorly to be acceptable. The more precise models are considered to be too hard to design algorithms for and in part also considered not to be of long lasting validity. Another reason for the multitude of models is also that the underlying hardware shows much more variability than the platform on which sequential computation is founded.

The most fundamental distinction in parallel computation is on how the memory is accessed: whether there is a shared memory to which all PUs have access in constant time, or whether there is a distributed memory so that each PU has direct access only to its own local memory. In the first case data can be exchanged by "parking" some information at some fixed position making it accessible to the other PUs. In the second case the data are exchanged by some form of communication between the PUs.

Another important distinction is whether the PUs are working synchronously or asynchronously. In a synchronous model, all PUs have access to a central clock. In practice systems are mostly asynchronous: each PU computes at its own speed. The speed differences may be small, but in a parallel computer not necessarily all PUs are of the same type. In principle the speed differences can be arbitrarily large. In practice, considering that there is mostly a non-negligible increase of the overhead with P, it normally does not make sense to combine PUs which are too different. Only for trivial problems, such as searching for extra-terrestrial life in radio signals or finding very large prime numbers, it is profitable to add 100 MHz PUs to a system that also contains 3 GHz PUs.

Historically parallel computers have been distinguished in SIMD, Single Instruction Multiple Data, and MIMD, Multiple Instruction Multiple Data, machines. On an SIMD machine, which is synchronous, all PUs execute the same instruction at all times. Because they execute these instructions on different data, it is still possible to obtain parallelism. The idea behind SIMD machines, is that if all PUs execute the same instruction, there is no need for the PUs to generate these instructions themselves: there is a central host computer connected to all PUs which tells them what instruction to execute next. This implies that the PUs can be much simpler.

In the 1980s there were many parallel computers of this type. An impressive example was the Masspar. This parallel computer consisted of a grid of very simple PUs. The PUs had primitive 4-bit arithmetic, a few kilobytes of storage and were slow. But, they were cheap. This implied that research institutes could afford to buy Masspar machines with 65536 PUs in a 256 x 256 grid. More recently no parallel computer with a comparable number of PUs has been build.

In an MIMD machine all PUs in principle may have their own program. In practice all will execute the same program, but due to conditional instructions, they will not always execute the same instructions. All modern parallel computers, in which the CPUs are mostly conventional PC CPUs, are of this type. Because most parallel algorithms become less efficient if the number P of PUs increases, it is more cost effective to buy 32 expensive but fast PUs than 1024 cheap but slow ones. Only for specific numerical applications architectures like the Masspar might still make sense.

In the following we briefly discuss the most important models. In the last section of this chapter we consider several problems from various perspectives. In the following chapter we present some basic algorithms for interconnection networks. Thereafter, we will only consider the simplest and most theoretical model, which allows to concentrate most strongly on the underlying parallelization ideas.

Distributed-Memory Model

In the distributed memory model it is assumed that the PUs are interconnected by some network without specifying the topology. More precisely, it is assumed that the topology plays a minor role and can thereby be neglected. Nevertheless any exchange of data must be performed by communication through this network.

Distributed Memory Computer with 4 PUs

It is assumed that the cost of communication is linear in the amount of data sent or received plus some constant cost. So, under this assumption the time consumption of a computation is estimated as follows:

T_tot = c_i * t_i + c_v * t_v + c_a * t_a.
Here c_i, c_v and c_a are constants whose values depend on the state of technological development. c_a is known under the name start-up latency. t_i gives the number of internal operations; t_v gives the total amount of data sent by a PU, this we will call the routing volume; and t_a gives the total number of packets sent by a PU. Typically a computation is divided in steps, and it would be more correct to define the above values as the sums over all steps of the maxima over all PUs of the number of each type of operations. This model will be referred to as DMM model, where DMM is an abbreviation for Distributed-Memory Machine.

The technological justification for the DMM model is that in modern parallel computers the communication is performed by wormhole routing. This means that if PU_i wants to send something to PU_j, a path from PU_i to PU_j is cleared, and then the data are transferred along this path flit by flit. a flit is a small data unit. The idea is that these flits run through the network like the segments of a worm: first all of them are standing in PU_i, then the first flit enters the network. By the time the first flit reaches PU_j several other flits, depending on the number of hops from PU_i to PU_j, have entered it. In PU_j the flits pile up and are recomposed. In this way the transfer time is something like d + l, where d gives the number of hops from PU_i to PU_j and l gives the number of flits. The time for the d hops is mostly negligible, much more important being the almost fixed start-up time.

Currently c_i ~= 0.3 * 10^-9. The values of c_v and c_a vary much more depending on how much money was invested on the interconnection infrastructure. In some cases the network is virtually unbounded, in that case the bounding factor is the speed of the memory bus of the individual PUs: clearly data cannot be pumped into the network faster than the rate at which they can be supplied by the memory. This means that we may have c_v = 10^-9 (when t_v indicates the number of integers). On the other hand, the PUs may also be coupled very loosely. For example, they may be connected by fast ethernet or worse, giving up to c_v = 10^-6. c_a can vary accordingly. In ideal cases we have c_a = 10^-7, but it may also be orders of magnitude larger. Notwithstanding all variations, we will work under the general assumption that c_i is considerably smaller than c_v, which in turn is orders of magnitude smaller than c_a.

t_i, the number of internal operations, cannot be estimated accurately because of the obscure internal operation of a modern processor. However, in many cases, it is possible to compare it with the number of operations performed by a sequential algorithm. For example, we will encounter sorting algorithms in which it is clear that t_i(P, n) = (1 + o(1)) * t_i(1, n) / P. Here t_i(P, n) denotes the number of internal operations performed by a PU when solving a problem of size n on P PUs. For such an algorithm, as long as we can guarantee that the communication terms are negligible, we may expect to achieve very high efficiencies.

Here we touch on the main point in the context of DMM computation: the contribution c_v * t_v + c_a * t_a to T_tot is perceived as the loss. For most applications t_i(P, n) decreases linearly with P, whereas particularly t_a(P, n) tends to increase at least linearly with P. So, in order to obtain acceptable efficiency, P must not be chosen too large in comparison to n. It depends both on the time complexity of the problem and on the development of t_a(P, n) how large P can be chosen, or said otherwise, for how small n it still makes sense to use P PUs.

The DMM model is a compromise between accurately modeling the time consumption of a parallel program and simplicity. It takes into account that sending small packets should be avoided, but neglects locality aspects.

Interconnection-Network Model

A much considered cost model is based on the assumption that the PUs are interconnected by a network in some given topology. While in the DMM model it is assumed that this topology can be ignored, it is here assumed that the structure of the network plays a significant role and algorithms will be designed with the network topology in mind.

4 PUs Connected by Ring Network

The time consumption of algorithms for interconnection networks is mostly expressed in steps. A step consists of a certain amount of computation followed by communication. Often there is no limit on the amount of computation that can be performed in a step. The communication is ruled by the precise model assumptions, but typically it is assumed that one packet of unit size can be transferred over each connection in each step. These packets can neither be divided nor composed. This cost model will be referred to as interconnection-network model.

The main assumption is that PUs can only communicate with neighbors and that longer distance communications are achieved by forwarding packets along some path connecting the source PU with the destination PU. This is called store-and-forward routing. Due to the dominance of wormhole routing, this assumption does not reflect the current technological conditions. There are nevertheless good reasons to consider a cost model based on it:

The model is further subdivided according to whether PUs can communicate with all neighbors at the same time, the all-port model, or only with one of them, the single-port model. Sometimes it is specified that the connections are uni-directional (also called half-duplex), in that case a PU cannot send to a PU from which it is also receiving. Otherwise the connections are said to be bi-directional (also called full-duplex).

The interconnection-network model neglects the aspects of computation. Thereby it allows to concentrate on the fundamental aspects of data exchange.

Shared-Memory Model

The main model of parallel computation is the shared-memory model. In this model all PUs have direct access to a common memory. Such an access is assumed to cost constant time. In addition the PUs have a local memory in which they can store local data.

Shared Memory Computer with 4 PUs

When considering computers with a shared memory, it is important to specify which types of access to the memory are allowed:

In the last case there are several possible ways of handling write conflicts, leading to a further subdivision of the model:

Depending on the underlying hardware, any of these models may be realistic. They are not far apart anyway: the most powerful of these models, a step of a Maximum CRCW machine with P PUs can be simulated on an EREW machine in O(log P) steps. The main parallel computer model considered in this text is the PRAM model. A PRAM, Parallel Random Access Machine, is a synchronous parallel computer with a shared memory. When specifying the performance of a PRAM algorithm we will mostly specify the memory access model, so we may say "On a CREW PRAM ... ".

It would be most correct to formulate a PRAM algorithm including load and store operations specifying which data have to be copied from the global memory to the local memory and vice versa. However, the existence of a local memory is mostly ignored, specifying algorithms as if all the PUs operate directly on the shared memory itself. This is similar to the practice of writing sequential algorithms without mentioning the memory management. The main reason why the PRAM model is so popular is that it allows to concentrate on the work to perform.

The shared memory model entirely neglects the aspects of data exchange. Thereby it allows to concentrate on the fundamental aspects of parallelizability.

Work-Time Framework

The PRAM model is simple, but the work-time framework allows to concentrate even more on the underlying concepts. The idea is to not specify the number of involved PUs: an algorithm is composed of steps, and in each step there are operations to be executed in parallel. Such an instruction may be of the type "for all i, 0 <= i < n, a[i] = b[i] + c[i]". In the literature one may find this expressed also by a pardo instruction.

An algorithm is evaluated with two cost parameters: the time, that is, the number of parallel steps; and the work, the sum over all steps of the number of performed operations. Specifying the performance of an algorithm in this way leaves the allocation of the work to the PUs unsolved. In principle it should be possible to perform a step in which w operations are executed on a PRAM with P PUs in round_up(w / P) time, but only if each PU knows which operations to execute.

If the work-allocation problem can be solved, then a WT algorithm gives a PRAM algorithm: if W(n) and T(n) are the work and time for a problem of size n, respectively, then

  T(P, n)  = sum_{step i} round_up(w_i / P) 
          <= T(n) + sum_{step i} w_i / P
          <= T(n) + W(n) / P,

  C(P, n)  = P * T(P, n)
          <= P * T(n) + W(n).

These relations immediately suggest a close to optimal choice of P for a given value of n: take P = P(n) = W(n) / T(n): for that value T(P, n) <= 2 * T(n) and C(P, n) <= 2 * W(n). So, for this choice of P, the completion time is at most twice the minimum value, and the cost is at most twice as large as the work. This is a good compromise: taking P larger, C(P, n) increases rapidly, without giving a substantial reduction of T(P, n). Taking P smaller gives substantially larger T(P, n), without giving a substantial reduction of C(P, n).

If there is a parallel algorithm in the work-time framework solving a problem of size n with work W(n) and time T(n), then, provided the work allocation can be solved, there is a PRAM algorithm with time O(T(n)) and cost O(W(n)).

Importance of PRAM Algorithms

PRAM Simulation

A common way of obtaining DMM algorithms, is to simulate PRAM algorithms. In this section we describe how this is done. This is a rich topic. Here we only mention how the problem can be solved in most cases occurring in practice, only touching on the topic of non-trivial memory allocations which must be used if one wants to guarantee good performance.

Suppose we have a PRAM algorithm solving a problem with P_PRAM PUs in T_PRAM time. We want to execute it on a DMM with P_DMM PUs. It is understood that P_DMM << P_PRAM. In many cases P_DMM must be fairly small in relation to the problem size in order to obtain speed-up. The easiest is to let PU_i, 0 <= i < P_DMM, of the DMM perform the work executed by the PRAM PUs with indices k, i * P_PRAM / P_DMM <= k < (i + 1) * P_PRAM / P_DMM.

Of course, this is not all. In the first place there is no guarantee that in this way the DMM PUs are equally loaded, even though they take over the work of equally many PRAM PUs. This is called the load balancing problem. For example, when simulating the PRAM algorithm for computing the sum, more and more PRAM PUs become idle. If things are arranged so that at all times the first PUs remain active, then the above processor allocation concentrates all the work unnecessarily on a few DMM PUs. In the case of the sum algorithm this can be prevented by letting the work of PU_j of the PRAM be executed by PU_{j mod P_DMM} of the DMM. However, we strive for a blind PRAM simulation, not one in which every algorithm has to be studied before we can decide how to allocate the PRAM PUs to DMM PUs. A possible solution is to randomly allocate the PRAM PUs to the DMM PUs, for example, by using a universal hash function. This does not exclude bad allocations, but makes them unlikely.

Another point is the memory allocation. In some cases one might in principle distribute the complete data set to all DMM PUs, and then compute on these. This is normally not a good practice: this requires much communication and much memory at the PUs. One of the strong points of a parallel computer is that the joint main memory is much larger than that of most single-processor machines, thereby allowing to solve large problems without exceeding the memory size. Furthermore, there are not so many problems which can be split in non-interacting subproblems. So, normally the data set is distributed over the PUs in such a way that each data item is stored in a single PU.

Often it is good enough to use a regular allocation like for the PRAM PUs. In such a regular allocation, data item i is either stored in DMM PU i mod P_DMM or in i * P_DMM / number_of_items_to_allocate. If now a PRAM PU which is allocated to DMM PU p needs a data item stored in DMM PU p', it sends a request to PU p' and waits for the reply. A typical way of progressing is that each PRAM step is decomposed in three substeps:

  1. Send requests for data
  2. Wait for the data to arrive
  3. Perform a computation step
Because each DMM PU is simulating P_PRAM / P_DMM PUs, and for most computational steps a constant number of data are needed, a common situation is that there are in total O(P_PRAM / P_DMM) requests sent by each DMM PU. In view of the high start-up costs for a routing operation, all requests going to the same PU must be bundled before sending. This requires a bucket sort with P_DMM buckets.

A possible communication schedule for sending all requests is to have P_DMM - 1 communication steps: in step t, 1 <= t < P_DMM, PU p sends to PU (p + t) mod P_DMM. In this way each PU is sending and receiving only one packet in each step. If the pattern is balanced this works fine. A pattern is said to be balanced if each DMM PU is approximately sending the same number of requests to any other PU.

If the requests are balanced, then also the answers are balanced. If there are O(P_PRAM / P_DMM) requests, then with the above schedule each transferred packet has size O(P_PRAM / P_DMM^2). So, one such communication round costs O(P_PRAM / P_DMM) * t_v + 2 * (P_DMM - 1) * t_a. From this it becomes clear that we must have P_PRAM / P_DMM^2 >> 1 if we do not want that the start-up costs dominate the overall time consumption.

The condition P_PRAM / P_DMM^2 >> 1 coincides with the condition that in most cases will guarantee that the pattern is balanced. The average packet size is P_PRAM / P_DMM^2 and if we may assume that the requests are randomly distributed over the PUs, the size of a packet is given by a random distribution with expected value S = P_PRAM / P_DMM^2. Such a distribution is tightly concentrated around its expected value. Chernoff bounds give that the deviations from the expected value are bounded by O(sqrt(S * log(P_PRAM))) with high probability.

However, a fixed data allocation can normally not guarantee that the communication pattern is balanced. An approach which makes it very likely that for P_PRAM / P_DMM^2 >> 1 all arising communication patterns are balanced is to allocate the data in a random way just as we did with the PUs. Again a hash function should be used, otherwise it would be hard to figure out where a data item is stored. Nevertheless, evaluating a hash function implies considerable overhead in comparison to a sequential algorithm which can access its data directly. If such a construction is applied, the efficiency drops considerably.

In some cases the problem is more fundamental: if the same data item is used by many PRAM PUs, then there is no way of allocating the item so that the communication pattern becomes balanced. The PU holding this item will be flooded by requests, and applying the above approach will give very poor performance, because all requests are handled sequentially. A clever algorithms designer will know whether such patterns occur in his/her application or not and mostly no action is needed.

If the problem of having many requests for a small subset of data items really occur or may occur, then one must do something about it. A practical solution is to first sort all requests on the index of the data item they are destined for. Sorting is not for free, but normally we can apply bucket sort because the range of values is limited and the amount of involved communication is also quite limited as we will see in a later chapter. Then, for all requests to the same data item one is chosen as representative. This representative is forwarded to the item. The answer is returned and replicated to all requests. Then the routing pattern of the sorting is reversed.

PRAM simulation is the link between a library of PRAM algorithms and programs running on actual machines. Only some elementary communication algorithms must be implemented in a machine-specific way.

Importance of Minimizing PRAM Time

The central question in the complexity theory of sequential algorithms is the P == NP question. The corresponding question in the complexity theory of parallel algorithms is the P == NC question. NC is the class of all problems which with a polynomial number of PUs can be solved in polylogarithmic time, that is, for a problem of size n in O(log^k n) time, for some constant k. Most of the problems considered in this text belong to NC. For example,

Above we noted that often there is a considerable discrepancy between the bounds in the PRAM and the DMM model. This does not mean that it is useless to try to design fast PRAM algorithms: simulating a PRAM algorithm with a running time T normally leads to a DMM algorithm with T communication rounds. A small number of communication rounds is a good thing to have, because this implies a smaller value of the number t_a of start-ups.

On the other hand, the number of start-ups is not the only cost factor. Consider again the formula giving the time consumption of an algorithm running on a DMM:

T_total ~= c_i * t_i + c_v * t_v + c_a * t_a.
This means that the best parallel algorithm is mostly the one that minimizes t_a without substantially increasing t_i or t_v. From this we obtain a strong and interesting reduction of the set of practice-relevant PRAM algorithms:
Only cost-optimal PRAM algorithms have potential practical relevance. Most promising are cost-optimal algorithms with minimal running time.




Interconnection Networks and Routing

Introduction

The PRAM and the DMM are simple and can be defined and described in a single paragraph. For interconnection networks the situation is different: there is a whole body of literature just describing and analyzing network topologies. Also there is extensive literature on the various communication problems on interconnection networks. In this chapter we first give a brief general introduction focusing on a small set of widely used interconnection network topologies and the most important communication problems. Then we go slightly deeper into some slightly more advanced topics which are based on some very interesting ideas with application beyond the precise scope for which they are presented. Because the communication problems which must be solved on DMMs are the same and because most algorithms are directly inspired by communication algorithms for interconnection networks, we also consider routing on DMMs.

Algorithms for interconnection networks are mostly analyzed using the interconnection-network cost model, in which it is assumed that in each step each PU may perform an unlimited amount of communication and then route one unit packet to one or more of its immediate neighbors.

Most algorithms for this cost model deal with the various communication problems, rather than to solve a general problem, say maxflow. The reason for this is that the communication problems are already hard enough, and that harder problems can often be solved quite well by taking a parallel algorithm designed for a more powerful parallel computer model, and then to simulate the operations of this computer on the interconnection network. Such a simulation is composed of two components: an allocation of the data and processes to the PUs and performing the resulting routing operations.

Next to the various communication operations also some of the most basic problems have been studied for interconnection networks. These are sorting, selection, list ranking, matrix multiplication and the like. In a following chapter we will even see how to solve the all-pairs shortest-paths problem optimally on a two-dimensional mesh.

Topologies

Here we introduce some of the most important interconnection-network topologies, fixed classes of interconnection patterns. The degree of a PU in a network is the number of other PUs it is connected to, corresponding to the degree of the node if the network is identified with the underlying graph. Interconnection networks are normally considered to be undirected: if PU_i is connected to PU_j, then PU_j is also connected to PU_i. The most important quality measures of an interconnection network are its

Complete Networks

A natural network topology is the one whose underlying graph is the complete graph. In such a complete network there is a connection between any pair of PUs. This topology is normally studied for the one-port model, while full-port capability on a complete network being unrealistically powerful.

In hardware a complete interconnection network with one-port capability can be realized with a crossbar switch. For interconnecting P PUs, a crossbar switch consist of a grid of P x P switches. These switches have four connections, W, N, E, S. The switches have two settings, either connecting W with E and N with S or connecting W with N and E with S, which can be chosen dynamically. Setting the switches appropriately, point-to-point connections can be created. Notice that using a crossbar switch each PU can send a packet to at most one PU and receive a packet from at most one PU, but that the PU to which it is sending may be different from the PU from which it is receiving. Switch-based complete network topologies are very common in practice. But, because of the increasing delays and cost such systems are typically of moderate size.

8 x 8 Crossbar Switch

Grid-Like Networks

Another very simple network topology is the linear array. In this topology each of the n PUs has one or two neighbors. Between any pair of PUs, (i, i + 1), 0 <= i <= n - 1, there is a connection. So, the degree of the linear array is 2. Its diameter is n - 1 and the bisection width is 1.

A similar network is the circular array which is also called ring. The only difference with the linear array is an additional connection between PU_0 and PU_{n - 1}. This connection is often designated as wrap-around connection. The degree is 2, the same as before. However, the other measures are better: the diameter is round_down{n / 2} and the bisection width is 2. This implies that most problems can be solved on a circular array twice as fast as on a linear array.

An n_1 x n_2 mesh consists of n_1 * n_2 PUs arranged in a grid. Each PU is connected to its up to four neighbors. As a graph, a mesh is the direct product of two linear arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter n_1 + n_2 - 2 and the bisection width is min{n_1, n_2}.

An n_1 x n_2 torus is a mesh with wrap-around connections: there are additional connections between PU_{i, 0} and PU_{i, n_2 - 1} for all i, 0 <= i < n_1, and between PU_{0, j} and PU_{n_1 - 1, j}. As a graph, a torus is the direct product of two circular arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter round_down(n_1 / 2) + round_down(n_2 / 2) and the bisection width is 2 * min{n_1 , n_2}. On tori most problems can be solved twice as fast as on meshes.

There are also higher-dimensional meshes and tori. A d-dimensional n_1 x ... x n_d mesh consists of n_1 * ... * n_d PUs arranged in a d-dimensional grid. Each PU is connected to its at most 2 * d neighbors. The degree is 2 * d, the diameter is n_1 + ... + n_d - d and the bisection width is n_1 * ... * n_d / min{n_1, ..., n_d}. The definition and properties of d-dimensional tori are analogous.

Linear Array, Circular Array, Mesh, Torus

Tree-Like Networks

A tree is a network consisting of 2^{k + 1} - 1 PUs arranged and interconnected as a perfect binary tree of depth k. So, there are k + 1 levels of PUs, with 2^j PUs in level j, 0 <= j <= k. The root, the PU in level 0 has degree two, the leafs, the PUs in level k have degree one, all internal nodes, the PUs at all other levels have degree three. The degree is 3, the diameter 2 * k and the bisection width is 1.

The major weakness of a tree network is its very small bisection width. This means that any problem requiring a single rearrangement of the data has at least linear running time. For such problems a tree is not better than a linear array. Therefore several improved trees have been designed. The simplest is the fat tree. In a fat tree the connections between level i and level i + 1, where the leafs are considered to lie at level 0, have capacity 2^i. This means that such a connection can transfer 2^i packets instead of 1.

Alternatively tree structures have been integrated in other networks. An important example of such a network is the mesh of trees. An n x n mesh of trees is constructed from an n x n grid of not-connected nodes. The PUs in each row and column of this grid are connected by a tree. There are no direct connections between the row trees or between the column trees. The row and column trees are connected at their shared leafs. An n x n mesh of trees has 2 * n * (2 * n - 1) - n^2 = 3 * n^2 - 2 * n nodes.

8 x 8 Mesh of Trees

Cube-Like Networks

A k-dimensional hypercube consists of n = 2^k PUs arranged in a k-dimensional grid with side length 2. Each PU is connected to its k neighbors in the grid. The degree is k = log n, the diameter is k and the bisection width is n / 2.

Hypercube

A k-dimensional cube-connected cycle, abbreviated CCC, network is obtained by taking a k-dimensional hypercube and replacing each node by a cycle with k nodes. Node d, 0 <= d < k, in each cycle takes over the connection of the hypercube along the d-direction. More precisely, the b * 2^k nodes of the CCC are indexed with pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k), (i, (d - 1) mod k) and (i ^ 2^d, d). Here i ^ 2^d denotes i + 2^d if bit d of i is 0 and i - 2^d if this bit is 1. The degree is 3, the diameter is round_down(5 * k / 2) - 2 for k >= 3. A CCC is the simplest network which combines a logarithmic diameter with constant degree. The bisection width is the same as in the corresponding hypercube, 2^{k - 1}.

A k-dimensional butterfly network is similar. The k * 2^k nodes are indexed as pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k) and (i, (d - 1) mod k). Furthermore there is an undirected edge from (i, d) to (i ^ 2^d, (d + 1) mod k). So, the degree is 4. The diameter is round_down(3 * k / 2). Because many router chips provide four connections, and because of the good routing properties of butterflies these networks have been used quite a lot in practice.

Four Dimensional CCC and Butterfly

Cayley Graphs

The concept of Cayley graphs gives a generic way of constructing many different graphs. For a finite group G and a generator set S which is closed under inversion, a graph (V, E) is defined as follows: Because the set S is a generator set, the graph is connected. Because it is closed under inversion, for any edge (g, g * e), there is also an edge (g * e, g * e * e^{-1}) = (g * e, g), and thus the graph is undirected.

Cayley graphs are much nicer than just connected and undirected: they are even node symmetric. A graph is said to be node symmetric, if for any pair (u, v) of nodes there is an automorphism phi = phi(u, v) of the nodes so that if (x, y) is an edge that then also (phi(x), phi(y)) is an edge. For Cayley graphs, these automorphisms are easy to construct: determine elements e_1, ..., e_k so that e_1 * ... * e_k * u = v, then phi maps any x in V to x * e_1 * ... * e_k. If (x, y) in E, this means that y = x * e for some e in S. Thus, (phi(x), phi(y)) = (phi(x), phi(x * e)) = (phi(x), e_1 * ... * e_k * (x * e)) = (phi(x), (e_1 * ... * e_k * x) * e) = (phi(x), phi(x) * e) in E.

Several of the graphs we have encountered before are Cayley graphs: circular arrays, tori, hypercubes and CCCs, though these can be presented in a more natural way. We describe how hypercubes can be obtained as Cayley graphs. For H_k, the k-dimensional hypercube, is obtained from a subgroup of the permutation group of 2 * k elements. There are k generators each having two consecutive elements exchanged. To make the description more concrete we only consider the special case k = 4. The set S of generators consists of (10234567), (01324567), (01235467) and (01234576). The subgroup consists of all elements obtained by starting with (01234567) and letting these elements work. If in the conventional indexing of the hypercube we find a 1 at position i, 0 <= i < k, the elements 2 * i and 2 * i + 1 in the indexing with a permutation are exchanged. For example, (0110) corresponds to (01325467).

Conclusion

Linear and circular arrays are very weak networks. Only for very simple problems it is possible to achieve speed-up on them. The main reason to consider them is that many algorithms for meshes and tori can be viewed as generalizations of 1-dimensional algorithms. Meshes and tori have a relatively high diameter and relatively small bisection width. O(sqrt(P)) is mostly the maximum achievable speed-up with P PUs. On the other hand are these networks easy to construct and arbitrarily scalable and extendible. A tree has small diameter, but too small bisection width. Only problems which require very little data to be exchanged can be solved efficiently on them. On hypercubes many important problems can be solved optimally. The problem with these is that the degree is not constant. Furthermore, in a two-dimensional layout, due to their large bisection width, they require an area which is quadratic in the number of PUs n, which implies that some adjacent PUs are standing sqrt(n) or more apart. So, if the physical distance signals must travel is the determining factor, hypercubes are not better than meshes. The conclusion is that none of the networks is entirely satisfying, and this is precisely the reason that all of these and many more are studied.
For interconnection networks, there is a trade-off between quality on the one hand and complexity and scalability on the other.

Embeddings

There is a multitude of considered interconnection networks and there is a multitude of problems. It would be inefficient to study each problem for each network. This is often not necessary. In the first place many algorithms can be generalized or modified. This is a common practice between networks of a similar nature such. Algorithms for two-dimensional meshes can often be generalized for higher-dimensional meshes. Algorithms for hypercubes after modification often work also for other interconnection networks with logarithmic diameter and high bisection width. A more formal way of carrying over algorithms from one network to another is by using embeddings.

An embedding of an interconnection network IN_1 with P_1 PUs into a network IN_2 with P_2 PUs is a mapping phi() of the PUs of IN_1 to the PUs of IN_2. Generally it is part of the definition that phi() should be injective. Not only the PUs should be mapped, but for any two PUs PU_i and PU_j in IN_1 connected by a link there should also be specified a path in IN_2 running between phi(PU_i) and phi(PU_j). There are three cost measures that are used when specifying the quality of an embedding:

Any step of IN_1 can be simulated on IN_2 with at most dilation * congestion steps. Thus, the work on IN_2 exceeds the work on IN_1 by at most a factor dilation * congestion * expansion. As long as all these numbers are constants, an optimal algorithm for IN_1 automatically gives an optimal algorithm for IN_2.

As a first example we consider the embedding of a torus into a mesh. This embedding is obtained by first blowing up the n x n torus to a 2 * n x n torus and folding it double along the middle row, so that one PU comes in each row, and then repeating this for a folding along the middle column. The following picture illustrates the process. Notice that each pair of PUs that previously was adjacent now is at distance two at most: the dilation is two. Also, are there at most two paths mapped to any link of the mesh: the congestion is two. For the special case of dilation and congestion both equal to two, any step of the embedded network can be simulated in two steps. So, any problem which on an n x n torus can be solved in T communication steps can be solved on a mesh in 2 * T steps.

Embedding a Torus in a Mesh with Dilation Two

Another important example is the embedding of a binomial tree with n = 2^k nodes in a hypercube. The construction is simple: a binomial tree with 2^k nodes, T_k consists of two trees T_{k - 1} whose roots are connected. A k-dimensional hypercube consists of two (k - 1)-dimensional hypercubes with connections between all corresponding nodes, in particular between the nodes (0, 0, ..., 0) and (1, 0, ..., 0). So, the embedding can be performed recursively. The dilation, congestion and expansion are all 1: the graph of the binomial tree is a subgraph of the hypercube, obtained by removing edges.

Embedding Binomial-Tree in Hypercube

Two types of embeddings are particularly important: the embeddings of the linear and circular array. The embedding of a linear array, which is nothing but the construction of a Hamiltonian path, is important because it provides a way of indexing the PUs so that any two PUs with consecutive indices are adjacent in the network. Because many algorithms contain steps in which data must be exchanged to PUs with consecutive indices, this is a very desirable property. A systematic way of indexing the PUs of an interconnection network is called an indexing scheme. The embedding of a circular array, which is nothing but the construction of a Hamiltonian cycle, is important because algorithms may even require data to be shifted in a cyclical way.

In any grid it is easy to construct a Hamiltonian path. A two-dimensional mesh is Hamiltonian, which means that it has a Hamiltonian cycle, iff at least one of the sides has even length. Next to the indexing schemes obtained by embedding linear and circular arrays there are several other schemes for indexing the PUs of a grid. The most important is the so-called row-major indexing scheme. In this scheme, PU_{i, j} in an n x n mesh, which is located in row i and column j, is given index i * n + j. Mostly the rows and columns are numbered starting from the upper-left corner as in a matrix. In the column-major indexing scheme, this PU has index i + j * n. The indexing obtained by embedding a linear array, starting at the upper-left corner, then always going along the rows first, is called the snake-like row-major indexing. There are also so-called shuffled indexing schemes. The most important is the shuffled row-major indexing scheme. In this scheme, PU_{i, j}, for numbers i and j with binary expansions given by (i_{k - 1}, ..., i_i, i_0) and (j_{k - 1}, ..., j_1, j_0) respectively, has an index with binary expansion (i_{k - 1}, j_{k - 1}, ..., i_1, j_1, i_0, j_0).

Indexing Schemes

On a hypercube a Hamiltonian path can be constructed by exploiting the recursive structure of the network. H_k denotes the k-dimensional hypercube. The traversal of all PUs of H_k is obtained by gluing together two traversals of H_{k - 1}. If the gluing is done correctly, this even gives a Hamiltonian cycle. The schedule is known as Gray code. It can be described by giving the order in which the PUs are visited. It is easy to verify that in the following sequences all consecutive PUs, including the first and last, are adjacent:

for H_1 we have {0, 1},
for H_2 we have {00, 01, 11, 10},
for H_3 we have {000, 001, 011, 010, 110, 111, 101, 100}.

How to generalize the above sequences? Look at the sequence of H_3. It starts with the sequence of H_2 in which all indices have an additional leading 0. Then this same sequence appears once more in reversed order in which all indices have an additional leading 1. More generally, denoting the Hamiltonian sequence for H_{k - 1} by S_{k - 1}, the Hamiltonian sequence S_k for H_k is the concatenation of 0-S_k and 1-reversed(S_k). Here, for a sequence S and b in {0, 1}, b-S denotes the sequence obtained by preceding all elements in S by a b.

Lemma: The sequences S_k defined by S_1 = {0, 1} and S_k = 0-S_{k - 1} + 1-reversed(S_{k - 1}), where + denotes concatenation, give Hamiltonian enumerations of the PUs of the k-dimensional hypercubes H_k.

Proof: The proof goes by induction. For H_0 the construction is clearly correct. So, assume it is correct for H_{k - 1}. If S_{k - 1} is a Hamiltonian cycle, then so is reversed(S_k). The sequences 0-S_{k - 1} and 1-reversed(S_{k - 1}) are paths through H_k because of the recursive way in which H_k is constructed out of two connected copies of H_{k - 1}. It remains to check that these two paths fit together. The first element of S_{k - 1} and the last element of reversed(S_{k - 1}) are the same. So, the first element of 0-S_{k - 1} and the last element of 1-reversed(S_{k - 1}) differ in exactly one position, the leading one, and are therefore the corresponding PUs are connected n the hypercube. End.

Hamiltonian Cycle in Hypercube

Embeddings provide a major way of deriving algorithms for one topology from algorithms designed for another topology. Embeddings of linear and circular arrays lead to indexings for which consecutively indexed PUs are adjacent.

Routing

Considered Problems

Communication is a precondition for any computation in which several PUs jointly solve a common task rather than each solving their own task. Even in the latter case some communication is normally needed to hand-out the tasks in the beginning and to collect the answers in the end. In principle, in the course of a computation, we may encounter any pattern of source-destination pairs. However, because most algorithms have a fairly regular structure, there are some patterns which occur particularly frequently. The most frequently occurring patterns are the following:

Each of these operations has its own importance. All-to-all routing is the pattern encountered when there is no specific pattern in the data access: the PUs compute a while on the data they have and by doing so accumulate questions to be sent to other PUs. Unless the algorithm was designed in a specific way, these questions will be more or less randomly distributed over the PUs, giving rise to a more or less balanced all-to-all routing pattern. A permutation-routing pattern arises in algorithms that where designed in such a way that the questions are not scattered over the data set but concentrated. Such a structure is desirable because it reduces the number of start-ups. Broadcasting is a very common operation performed to distribute information from a source PU to a (subset) of PUs that must work on the same data. Typically this happens at the beginning of a computation. The symmetric operation is called gathering. Gathering means that information from all PUs is combined to a single information in some specified PU. For example, this may be done in order to sum up the computed results. Gossiping is used to make local samples of the information available to all PUs. It is also used to synchronize a database and in numerical communications. Because gossiping is a very expensive operation, it will normally be performed only for a small subset of the total data set. Another much studied problem is k-k routing, in which each PU is the source and destination of at most k packets. Both permutation routing and all-to-all routing are special cases of k-k routing. In practice general k-k routing patterns are rarely encountered.

Permutation Routing

Two-Dimensional Meshes

Consider routing a permutation on a two-dimensional n x n mesh. There is a very simple algorithm which works quite well:
  1. Route all packets within their rows to their destination columns.
  2. Route all packets within their columns to their destinations.
This algorithm is called the greedy routing algorithm, also called xy routing. The description is not complete though: in addition to saying how the packets move, a routing algorithm should also specify what is done if more than one packet wants to use the same connection at the same time. Due to our model assumption, only one can be transferred, the others must wait. So, we must specify a conflict-resolution strategy. In many cases the most natural strategy works best: the packet which has to move farthest gets priority. This strategy is called the farthest-first strategy. Also it should be specified whether the first routing phase is completed before the second phase is started, or that the phases are coallesced. Normally it is understood that the phases are separated, though in practice it will be faster and simpler to coallesce them.

How good is the greedy algorithm? It is easy to see that during phase 1 the packets run without any delay: the packets moving rightwards do not change there relative distances as long as they are moving and therefore they will never encounter each other. The same is true for the leftwards moving packets. So, phase 1 takes as long as the longest distance a packet has to go. The maximum distance that may have to be covered in a row is n - 1. As a result of the routing in phase 1, the packets may have piled up in some PUs. For example if the permutation is the transposition, in which PU_{i, j}, 0 <= i, j < n, sends a packet to PU_{j, i}, then n packets reside in all diagonal PUs at the end of phase 1. So, the queue size, that is, the maximum number of packets that may have to be stored in a single PU at the same time, may be as high as n. Thus, during the second routing phase there may be conflicts. However, due to the applied farthest-first strategy, can a packet which has to move upwards to a PU at distance d from the boundary of the mesh be delayed at most d times. Because the distance this packet has tot travel is bounded by n - d - 1, it will arrive no later than in step n - 1. So, even phase 2 will be completed in n - 1 steps, and therefore the whole algorithm is completed in 2 * n - 2 steps which is optimal. Much more sophisticated algorithms have shown that it is possible to achieve the same time with constant queue size.

Higher-Dimensional Meshes and Hypercubes

On higher-dimensional meshes and hypercubes the same algorithmic idea can be applied: the packets are routed dimension-by-dimension. This routing is called dimension-ordered routing and is widely applied in practice. On two-dimensional meshes this may give long queues but the time is optimal as we have seen above. However, on higher-dimensional meshes and hypercubes it can be very bad. Consider for a k-dimensional hypercube with n = 2^k PUs a generalization of the transposition, in which the packet residing in a PU with index (i_0, ..., i_{k - 1}) has to be routed to the PU with index (i_{k - 1}, ..., i_0). If dimension-ordered routing is applied, then all packets starting in the PUs with indices (i_0, ..., i_{k / 2 - 2}, 1, 0, ..., 0) will be routed through the PU with index (0, ..., 0, 0, 0, ..., 0). That is, 2^{k / 2 - 1} = sqrt(n) / 2 packets are routed through a single PU and have to leave it over a single connection. Thus, the routing takes Omega(sqrt(n)) time, which is not better than routing on a two-dimensional grid.

In general we hope that permutation routing can be solved in a time equal to or only slightly larger than the diameter. In any case we would like to route permutations in O(diameter) time. The above algorithm for k-dimensional hypercubes achieves this for most permutations, the expected time is O(k), but there are a few permutations for which the performance is unacceptable. Further down we will see how to overcome this problem.

DMMs

On a DMM permutations can be routed in a trivial way: each PU sends its packet directly to the destination PU. Because the whole pattern is a permutation it is guaranteed that this can be performed without conflicts. If the packets have length l, this takes c_v * l + c_a time. This is clearly optimal because at least one start-up is needed and all data must be transmitted.

All-to-All Routing

All-to-all routing is the working horse of general-purpose parallel computation. Its cost depends on the bisection bandwidth of the interconnection network. If the provided bandwidth is insufficient, then whole classes of problems can not be solved efficiently. Because the routing pattern is fixed and regular, it is not hard to design optimal algorithms for most topologies.

Two-Dimensional Meshes

On an n x n mesh each PU is the source and destination of n^2 packets, one for each of the PUs. Applying the greedy algorithm works fine now. Each PU holds n packets with destinations in each of the n columns. These packets with common destination might be bundled. In each row n / 2 * n^2 / 2 = n^3 / 4 packets must be routed over the bisection. At the end of phase 1, each PU holds again n^2 packets, n of them with destination in each of the PUs of its column. These packets with common destination might again be bundled. The algorithm takes n^3 / 2 routing steps. This is twice the time given by the bisection bound, the lower bound which is obtained by dividing the number of packets that must cross the bisection in one direction (in this case n^4 / 4) by the bisection width (in this case n).

The given algorithm is uni-axial, that is, in any step either only horizontal or only vertical connections are used. If the mesh has full-port capacity, then this is wasteful. However, it is often possible to overlay two uni-axial algorithms. In this particular case the solution is very simple: all packets going from PU_{i_x, i_y} to PU_{j_x, j_y} with i_x + i_y + j_x + j_y even are routed as described, while the other half of the packets are routed first along the columns and then along the rows. Said otherwise: if the PUs are colored black and white as on a chess board, all packets routed between fields of the same color are routed xy, while the others are routed yx. Each of the problems involves half as many packets as before and therefore the whole routing takes only n^3 / 4 now, which is optimal.

This coloring idea is very useful, because it allows to concentrate on designing uni-axial algorithms. This strongly facilitates the design. Of course this idea is not limited to two-dimensional meshes, but can be applied on any network in which separate dimensions can be distinguished, most notably meshes and tori of any dimension and hypercubes.

Higher-Dimensional Meshes and Hypercubes

On a d-dimensional n x ... x n mesh, the algorithm consists of d phases. In phase i, 0 <= i < d, the routing is performed along axis i. In each routing half of the packets must cross the bisection. So, each phase takes n^{d + 1} / 4 time. In total the algorithm takes d * n^{d + 1} / 4 time. This algorithm is uni-axial. Applying a coloring with d colors it can be made optimal: a packet routed from PU_{i_0, ..., i_{d - 1}} to PU_{j_0, ..., j_{d - 1}} is given color (i_0 + ... + i_{d - 1} + j_0 + ... + j_{d - 1}) mod d. A packet with color c is routed along the (c + i) mod d axis in routing phase i, 0 <= i < d. So, the original problem is decomposed in d independent problems each involving a fraction 1 / d of the packets. So, the total routing time of this full-port algorithm is n^{d + 1} / 4.

In the context of all-to-all routing, a k-dimensional hypercube with n = 2^k PUs can be perceived as a k-dimensional mesh with side length equal to two. The uni-axial algorithm runs in k * 2^{k + 1} / 4 = k * n / 2. Overlaying k of these algorithms each taking care of a fraction 1 / k of the packets, the routing time can be reduced to n / 2.

DMMs

For all-to-all routing on a DMM with P PUs there is one standard algorithm which is the best if the packets are sufficiently large. The algorithm consists of P - 1 steps. In step t, 1 <= t < P, PU_i routes its packet to PU_{(i + t) mod P}. Possibly there is even a step 0 in which the packets with destination in the same PU are handled. If all packets have size l, then the routing takes
T_one(P, l) = c_v * l * (P - 1) + c_a * (P - 1).
This is P - 1 times the time for routing a permutation, which is not surprising because the all-to-all routing is realized by decomposing it into permutations. Of course this works only if all the packets have (approximately) the same length. Otherwise a more sophisticated algorithm should be applied which is presented further down.

Even if all packets have the same size, then it is not necessarily true that sending the packets directly to their destinations is the best one can do. This may sound surprising but is due to the relatively large number of start-ups of the given algorithm: if l is small, then the term c_a * (P - 1) will dominate and it may be profitable to reduce it even if the first term increases because of this.

As an alternative consider the simulation of the algorithm for a two-dimensional mesh. All PUs are indexed as in a mesh with a pair of indices (i, j), 0 <= i, j < sqrt(P). The algorithm consists of two phases. In phase 1, all packets are rearranged within the "rows", in the second phase within the "columns". More precisely, in step t, 1 <= t < sqrt(P) of phase 1 PU_{i, j} routes to PU_{i, (j + t) mod sqrt(P)} all packets which have destination in any PU_{i', (j + t) mod sqrt(P)}, with 0 <= i' < sqrt(P). In step t, 1 <= t < sqrt(P) of phase 2 PU_{i, j} routes to PU_{(i + t) mod sqrt(P), j} all packets which have destination in any PU_{(i + t) mod sqrt(P), j'}, with 0 <= j' < sqrt(P). So, the algorithm consists of 2 * (sqrt(P) - 1) steps in each of which l * sqrt(P) data are routed. This gives

T_two(P, l) = c_v * 2 * l * (P - sqrt(P)) + c_a * 2 * (sqrt(P) - 1).
For large l this algorithm is almost twice as slow, but it may be up to sqrt(P) / 2 times faster for very small l.

Of course we do not have to stop here. If the packets are really small, then the algorithm for a three-dimensional mesh can be simulated, or in the extreme case, the whole routing can be performed with just log(P) routing steps by simulating the hypercube algorithm. Doing this, in each step half of the data a PU holds must be routed. This gives

T_log(P, l) = c_v * log(P) * l * P / 2 + c_a * log(P).

The algorithms with several phases also require that in between two phases the received packets are decomposed into their components and recomposed for the next routing. With appropriate bucketing all this can be performed in a time that is linear in the amount of involved data. Nevertheless, if c_i is not so much smaller than c_v, as it is the case on the best parallel computers, this gives a non-negligible delay. More generally, reducing the number of start-ups by multiplying the routing volume by a factor should be limited to subproblems involving only a small fraction of the total data set. Otherwise, considering that c_v > c_i, no reasonable speed-up can be expected, and it would be better to reduce the number of processors used in order to get packets of a more substantial size.

Broadcasting

Broadcasting on a full-port network is trivial: in any step the information is sent to all neighbors. It will reach all PUs in the minimum time, which is bounded by the diameter of the network. On single-port machines the problem is non-trivial.

Linear Arrays, Meshes and Hypercubes

If on a linear array with n PUs an information from PU_0 has to be broadcast, then the trivial algorithm, sending the algorithm in step i, 0 <= i < n - 1, from PU_i to PU_{i + 1}, is optimal, taking n - 1 steps. If the source is PU_j, j < n / 2, then the packet is first routed rightwards, then leftwards.

On a two-dimensional n x n mesh, the information is first routed within the row of the source PU and then within all columns. If the source is not PU_{0, 0}, then the routing in each of the phases should start in the direction in which the distance to cover is largest. The algorithm requires 2 * n - 2 steps at most, which is optimal.

On higher-dimensional meshes and hypercubes we can apply the dimension-ordered generalization of the above algorithm. On a hypercube this means that the routing is performed along an embedded binomial tree. On a k-dimensional hypercube the algorithm takes k steps, which is optimal.

DMMs

On a DMM with P PUs we assume that PU_0 is the source. There are several options. The most natural is to simulate the hypercube algorithm: the routing conists of log(P) rounds, in each round l data have to be transferred. This gives
T_hypercube(P, l) = c_v * l * log(P) + c_a * log P.

Surprisingly this is mostly not the best. If l is slightly larger, then v * l * log(P) is a large number. In that case it is better to use an algorithm which is similar to that for the linear array. The information is divided in a large number f of flits. Flit i, 0 <= i < f, is routed in step i from PU_0 to PU_1, and then on in a linear way. It takes f + P - 2 steps before flit f reaches PU_{P - 1}. This gives

T_linear(P, l) = c_v * l / f * (f + P - 2) + c_a * (f + P - 2).
For sufficiently large f, the first term converges to c_v * l, which is optimal. Of course this produces many more start-ups.

For intermediate values of l, there are intermediate algorithms. A simple idea is to use an algorithm similar to the one for a two-dimensional mesh: first broadcasting in a pipelined way within the rows, then within the columns.

Gossiping

Broadcasting is relatively cheap because there is only one packet. Gossiping is much more expensive. Therefore, in practical applications it should be performed only for very small sets of data.

Interconnection Networks

In the context of gossiping it is customary to use other cost models. The most considered model is to assume that in any round pairs of PUs can exchange data and that in a single round they can exchange all data they want to exchange. This is called the unit-cost telephone model. So, in any round the communicating PUs are given by a matching. Therefore, a gossiping algorithm is a specification of a sequence of matchings that eventually lead to all PUs knowing everything. Such a sequence is called a gossiping schedule.

Lemma: Gossiping in the unit-cost telephone model on a network with P PUs and diameter D requires at least max{D, round_up(log(P)) + odd(P)}. Here odd(x) = 1 for odd x and 0 otherwise.

Proof: To cover a distance d, at least d rounds are required. After t >= 0 rounds any information is known in at most 2^t PUs, because in any round the number of informed PUs can at most double. If the number of PUs is odd, then there is at least one PU which remains unmatched in the first round. So, the information from this PU is known in at most 2^{t - 1} PUs after t > 0 rounds. This means that for t <= round_up(log(P)) + odd(P) - 1, the number of informed PUs is at most 2^{round_up(log(P)) - 1} < P. End.

Gossiping on d-dimensional meshes is easy as long as all side lengths are even. In that case the gossiping can be performed in a dimension-ordered way. On an n x ... x n mesh for n even, this takes d * (n - 1) rounds, which is optimal. For odd n, it is always possible to perform the gossiping in d * n rounds, but this is not necessarily optimal. For example, on a 3 x 3 mesh gossiping can be performed in 5 rounds. This is optimal, because for a 3 x 3 mesh, the number of PUs P = 9, and thus round_up(log(P)) + odd(P) = 5.

Gossiping on 3 x 3 Mesh

For small networks gossiping schedules can be described very concisely and convincingly by a series of pictures. The nodes are subdivided according to the shape of the network, marking in each subdivision whether the information from the PU at the corresponding position is already known or not. The used matchings can be highlighted.

Gossiping on 3-Dimensional Butterfly

The unit-cost model is not the only possible cost model. More flexible is the linear-cost model in which it is assumed that a data exchange between two adjacent PUs in which each sends at most l informations takes c_v * l + c_a time. In principle it is not even necessary that all PUs remain connected equally long. However, the more freedom there is, the harder it becomes to design good algorithms. Therefore it is still useful to perform the gossiping by a sequence of matchings. The cost of a gossiping schedule is given by the number of rounds, the number of consecutive matchings used, and the number of steps, the sum over all rounds of the number of transferred packets.

We consider again the above schedule for gossiping on a 3 x 3 mesh. It consists of 5 rounds which was shown to be optimal. The number of steps is 1 + 2 + 4 + 4 + 7 = 18. This is far from optimal. It has been shown (masters thesis of Rene Beier) that 11 steps is the minimum achievable for any schedule with 5 rounds, and that in general at least 10 steps are required. Both bounds can be matched. We see here, and that is a general phenomenon, that there is a trade-off between rounds and steps. So, in general the best gossiping algorithm depends on the ratio c_a / c_v. Only for some very regular networks, such as linear arrays and hypercubes, the number of rounds and steps can be minimized at the same time. In that case there is a unique optimal gossiping schedule.

Gossiping on 3 x 3 Mesh, 5 rounds 11 steps

DMMs

On a DMM with P PUs, the most natural thing is again to apply a simulated hypercube algorithm, performing the gossiping with log P communication steps. If originally each PU holds l data, in step t, 0 <= t < log P, the size of the packets is l * 2^t. Summing over all steps gives
T_hypercube(P, l) = c_v * l * (P - 1) + c_a * log P.
This is optimal: the number of start-ups is minimized, and the routing volume equals the amount of data each PU must receive. For large P the start-up costs will become negligible.

Another communication operation is gather. In a gather each PU initially has l individual data. The task is to gather all this information in PU_0. This is not the inverse of a broadcast, because in a broadcast finally all PUs know the same information. Clearly this operation requires at least log P start-ups and because l * (P - 1) data must be transferred into PU_0. So, on a DMM gathering has the same lower bound as gossiping and can be solved optimally by applying the hypercube gossiping algorithm.

Using Randomization

Using randomization is a mighty instrument which in many contexts allows to simplify algorithms or even to achieve results which otherwise cannot be achieved. This also applies to the context of routing. In the following we will present quite a surprising application of randomization and show that on meshes the randomization can be substituted by a deterministic equivalent.

Hypercubes

We have seen that routing a permutation on a hypercube with n PUs can take Omega(sqrt(n)) time. This problem is not only of theoretical interest because the permutations in which the indices are permuted are concerned. These permutations are very common. There are several ways to overcome this problem. The simplest is the following idea of Valiant.

The packets are denoted p_i. s_i is the index of the source PU of p_i, d_i is its destination PU.

  1. For each p_i choose randomly and independently an intermediate destination x_i.
  2. Route all packets p_i from PU_{s_i} to PU_{x_i}.
  3. Route all packets p_i from PU_{x_i} to PU_{d_i}.

Notice that the mapping s_i -> x_i is not a permutation. Alternatively PU_0 might select the index of a permutation and broadcast it to all PUs in k steps, but this is a waste of time. The analysis of Valiants algorithm is not hard but beyond the scope of this chapter. Both routing phases have the same time distribution, because phase 2 can be viewed as a reversal of phase 1: in phase 2 the packets are moving from a random position to a fixed position. It can be shown that each of these phases takes k + o(k) time with high probability and that the queues remain small as well. So, the whole algorithm essentially 2 * k + o(k) = O(k) steps with high probability. Using randomization all permutations become equally hard, there are no longer good and bad permutations. The identity has the same expected time as a transposition.

A disadvantage of the above approach is that we loose a factor two for all those permutations for which even the greedy algorithm performed well: the expected time of the greedy algorithm is k + o(k), which is optimal up to the lower order term. A nice idea helps to make the randomized algorithm optimal while still making it highly unlikely that much more than the expected time is required. The algorithm is the same as before, with one small change. After selecting x_i, the length of the path s_i -> x_i -> d_i is computed. If this path is longer than k, we use comp(x_i) as intermediate destination instead of x_i. Here comp() is the operation which in a binary number complements all bits. The maximum distance any packet has to travel is now bounded by k: if dist(s_i, x_i) + dist(x_i, d_i) > k, then dist(s_i, comp(x_i)) + dist(comp(x_i), d_i) = k - dist(s_i, x_i) + k - dist(x_i, d_i) = 2 * k - dist(s_i, x_i) + dist(x_i, d_i) < k. Vöcking has shown that not withstanding the reduced randomness of the intermediate destination, the delay due to congestion is still limited to o(k) with high probability.

DMMs

On DMMs the standard routing algorithms above were all given under the assumption that the packets were of comparable size. It is no problem if there are small variations but if the sizes differ by a factor two or more, then we may expect that the routing takes considerable longer than estimated. Because the routing steps are not synchronized it is not easy to tell how long the routing will take exactly.

Consider an all-to-all routing. Assume that in total each PU is sending and receiving the same number of packets, but that not each PU is sending the same number of packets to each other PU. Then we can first send each packet to a random intermediate destination like in the hypercube algorithm. This replaces a possibly unbalanced pattern by two balanced all-to-all routing patterns. A refinement is to first determine the degree of unbalance and to randomize only as much as necessary.

Meshes

Consider a k-k routing on meshes. A good strategy is to choose for each packet a random intermediate destination and to route the packets to their intermediate destinations before routing them to their actual destinations. The whole algorithm consists of four routing phase: in phase 1 and 3, the packets are routed along the rows, in phase 2 and 4 the packets are routed along the columns.

For the analysis it is necessary to understand how much time the routing on the one-dimensional subarrays takes. Without a proof we give the central lemma in this domain:

Lemma: For a routing pattern on a linear array with n PUs, let r_{i, j}, 0 <= i < j < n, denote the number of packets that must travel from a PU with index i or smaller to a PU with index j or larger. When applying the farthest-first strategy, the rightwards moving packets need exactly max_{i < j} {r_{i, j} + j - i - 1} steps to all reach their destinations.

For most distributions the maximum is either assumed for i = 0, j = n - 1, or for i = n / 2 - 1, j = n / 2. The maximum time for a regular pattern like a k-k routing is always given by the maximum of these bounds. This simplifies the analysis a lot.

In the above case we are routing random k-k patterns. The maximum distance is n - 1, and in a randomization about half of the packets want to cross the bisection. Thus, each of the phases takes max{n, k * n / 4} + o(k * n). This is not yet good, but we can exploit that the given routing is uni-axial. As for the all-to-all algorithm above, two of these algorithms can be overlaid. To make this effective, the packets must somehow be split in two. In a randomized context it is natural to use randomization for this as well. So, each packet is colored independently white or black with probability 1 / 2. The white packets are routed xy, the black packets yx. In this way we do not exactly obtain two (k / 2)-(k / 2) patterns, but that does not matter. The important point is that now in each routing phase, the expected number of packets crossing the bisection of a one-dimensional subarray is k * n / 8. Using Chernoff bounds, it may be shown that all deviations from the expected value are small with high probability.

Theorem: On a two-dimensional n x n mesh, k-k routing can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.

The above result is optimal for all k >= 8, because there the k * n / 2 bisection lower bound is matched up to lower-order terms. For k = 1, 2 k-k routing can be performed in 2 * n + o(n) time. This is optimal because it matches the distance bound, the lower bound which is obtained by considering how far the packets have to travel. For 2 < k < 8 it is not known how many steps are exactly needed for k-k routing. For example, 4-4 routing can be performed in 3 * n + o(n) steps, but it is not clear whether this is optimal or not.

Derandomization

Randomization is a handy means of obtaining good results, but there are practical and theoretical reasons to prefer deterministic algorithms if they can achieve the same bounds. In this case the only purpose of using randomization is to obtain a regular pattern. However, this can also be achieved by handing out the packets in a regular way. We first consider how to turn this idea into an algorithm for all-to-all routing on a DMM.

Assume we are performing an all-to-all routing on a DMM with P PUs, each PU sending and receiving m numbers in total, l = m / P packets traveling between any pair of PUs on average. Each PU sorts the numbers according to the index of their destination PUs. Using bucket sort this takes O(m) time. A number residing in PU_i with rang r is sent to PU_{(i + r) mod P} as intermediate destination in a first routing phase and in a second routing phase to its destination. This is a very simple algorithm. Creating the packets to be sent to each PU may be even faster than in the randomized algorithm because the calls to the random number generator are saved.

How good is the resulting distribution? Clearly in the first routing phase the packets are almost perfectly balanced, the number of packets going from one PU to another is either round_down(m / P) or round_up(m / P). In phase 2 the imbalance is larger. Denote by a_{i, j} the number of packets originating in PU_i with destination in PU_j and by b_{i, k, j} the number of packets which are send from PU_i via PU_k to PU_j. For arbitrary k and j we want to put an upper bound on sum_i b_{i, k, j}. The regular distribution gives b_{i, k, j} <= round_up(a_{i, j} / P). This gives

sum_i b_{i, k, j} <= sum_i round_up(a_{i, j} / P) <= P + sum_i a_{i, j} / P = P + m / P.
So, if m = omega(P^2), the imbalance is negligible. For the small values of P that currently are common, this is no problem, but in general this is quite a strong condition.

On n x n meshes any operation in submeshes of size o(n) takes only o(k * n) time and is therefore cheap in comparison to the Theta(k * n) time for solving most problems. Here k denotes the number of packets per PU. So, on meshes we can apply the same idea, but the allocation of intermediate destinations may be coordinated in submeshes of quite considerable size. Thereby the rounding errors can be made of minor importance even for small k. As a result, the same result that was achieved above can also be obtained deterministically.

The idea of coordinating the rounding decision also works on DMMs. Due to rounding errors, each of the rounds of phase 2 might be delayed by c_v * P, in total c_v * P^2. If first all PUs are subdivided in groups of size sqrt(P), these PUs can exchange their values b_{i, j}. These are small gossip operations, costing c_v * P^{3 / 2} + c_a * log(P) / 2 time. Then they can optimize their rounding, reducing the loss in phase 2 to c_v * P^{3 / 2}. So, now we must have m = omega(P^{3 / 2}). This is much better, but still not as good as for the randomized algorithm, for which the rounding errors are negligible for any m = omega(P * log P). Of course, in practice it is quite unlikely that the deterministic algorithm has maximal bad luck. Even though we might not assume that all a_{i, j} are equal, it is quite reasonable to assume that the rounding errors are more or less randomly distributed.

Offline Routing

In algorithmics there is a fundamental distinction between online and offline problems. In the context of routing an algorithm is said to be online if the destinations of the packets only become known at the beginning of the routing. The cost of precomputation, if any, has to be booked on the routing problem to be solved. In an offline routing algorithm, it is assumed that we are dealing with a fixed routing pattern which has to be performed many times. In this case the cost of precomputation is not considered, and the only concern is the quality of the produced routing schedule. Particularly, the precomputation is not necessarily performed on the parallel computer on which the routing must be performed.

Two-Dimensional Meshes

For two-dimensional n x n meshes there is a ingenious algorithm achieving something none of the online algorithms can achieve: the queue size is only one. That is, at all times, any PU is storing in addition to the packets moving by only a single packet. The algorithm consists of three one-dimensional routing phases, each being a permutation of the n packets standing in the rows or columns. It is a simplification of the more general algorithm for product networks in the next section which is due to Annexstein and Baumschlag.

First we construct a bipartite graph with 2 * n nodes. The sets V_1 and V_2 each have n nodes, corresponding to the source and destination rows of the packets, respectively. There is an edge from node i in V_1 to node j in V_2 if there is a packet moving from row i to row j. Because there are exactly n packets starting and finishing in each row, the graph is n-regular (all nodes have degree n). Thus, according to Hall's theorem, the graph has an edge coloring with n colors. Because this is an online algorithm, the existence of such a coloring is the only thing that matters, but actually such colorings can be computed very efficiently. That is, there is an assignment of values, which are called colors, ranging from 0 to n - 1, so that the edges incident on a node all have different colors.

These colors are used in the routing algorithm. Denote the packets by p_i and let x_i be the color attributed to the edge corresponding to this packet.

  1. Route each packet p_i along the rows to column x_i.
  2. Route each packet p_i along the column to its destination row.
  3. Route each packet p_i along the row to its destination PU.

It remains to check that the algorithm is as good as claimed above. For the analysis of phase 1 we consider the routing in an arbitrary row. Originally each PU holds one packet. Because the edges corresponding to these packets are all incident on the same node of the bipartite graph, they all have different colors, and therefore no two packets will be routed to the same PU. For the routing in phase 2, we consider the routing in an arbitrary column. Because of the previous result, initially each PU holds one packet. If two packets in the column would move to the same row, then two packets with destination in the same row would have gotten the same color, which is in contradiction with the coloring which colors all edges incident on the node in V_2 corresponding to this row differently. Therefore, at the beginning of phase 3, all PUs hold one packet. In each row each PU is the destination of one packet because the complete pattern is a pattern, and in phase 3 the packets are routed to their destinations. So, the algorithm consists of three permutations within one-dimensional subarrays. Therefore, each phase can be executed in n - 1 steps, 3 * n - 3 steps in total, and never a PU is storing more than one packet.

The queue size is optimal, the time is not. It is still open whether it is possible to route all permutations in 2 * n - 2 steps with queue size 1. A few steps more and queue size 2 is possible.

Offline Permutation Routing with Queue Size 1

Product Networks

The above idea can be generalized for any product network A = B x C, where "x" denotes the direct product of graphs: the nodes of A are pairs (b, c), where b is a node in B and c a node in C. There is an edge from (b_1, c_1) to (b_2, c_2) iff b_1 = b_2 and (c_1, c_2) is an edge in C, or c_1 = c_2 and (b_1, b_2) is an edge in B. A two-dimensional n_1 x n_2 mesh is the direct product of a linear array with n_1 and a linear array with n_2 nodes. Tori and hypercubes are also direct products.

Let A = B x C be a product network. B has n_B nodes and C has n_C nodes. A schedule for routing a permutation on A can be computed offline by first constructing a bipartite graph with node sets V_1 and V_2, each with n_C nodes. Node c, 0 <= c < n_C, in V_1 and V_2 corresponds to the subnetwork B x c in A. The bipartite graph has n_A = n_B * n_C edges. For a packet traveling in A from a node (b_1, c_1) to a node (b_2, c_2) there is an edge from node c_1 in V_1 to node c_2 in V_2. Thus, the bipartite graph is n_B-regular and can be colored with n_B colors. Packet i is denoted p_i, it is traveling from (b_i, c_i) to (b'_i, c'_i), and its color is x_i, 0 <= x_i < n_B. The routing is performed as follows:

  1. Route p_i from PU_{b_i, c_i} to PU_{x_i, c_i}.
  2. Route p_i from PU_{x_i, c_i} to PU_{x_i, c'_i}.
  3. Route p_i from PU_{x_i, c'_i} to PU_{b'_i, c'_i}.

Theorem: Permutation routing on a product network A = B * C can be performed in T_A = 2 * T_B + T_C steps, where T_B and T_C denote the number of steps for routing permutations on B and C, respectively. At any time a PU is storing at most one packet.

Proof: Phase 1 is a permutation routing in subnetworks with the structure of B, because for any subnetwork B x c, all destinations occur exactly once due to the coloring of the edges incident on the node in V_1 corresponding to this subnetwork. The coloring of the edges incident on the nodes of V_2 assures that phase 2 is a permutation routing in a network with the structure of C. Phase 3 is again a permutation routing within subnetworks with the structure of B because at the beginning each PU holds one packet, and because the whole pattern is a permutation, each PU is also the destination of one packet. End.

If T_B != T_C, then it is profitable to arrange things so that T_B < T_C. The above construction immediately gives results for d-dimensional meshes. The best is to perceive a d-dimensional n x ... x n mesh as the direct product n x (n x ... x n). Using induction this leads to

Corollary: Permutation routing on a d-dimensional n x ... x n mesh can be performed in (2 * d - 1) * n steps. At any time a PU is storing at most one packet.

DMMs

Above we have studied all-to-all routing on DMMs with P PUs. Let m = P * l be the total number of packets each PU has to send and receive. The greedy algorithm can perform very poorly and is acceptable only if we reasonably can assume that the distribution is balanced. The random and deterministic algorithms which send the packets through an intermediate destination overcome the problem of unbalances to a large extend, but are not perfect either. First solving a coloring problem for a bipartite graph the problem can be overcome entirely: the routing consists of two phases in which each PU sends exactly l packets to each of the other PUs.

The algorithm is simple: a bipartite graph with two sets of P nodes is created. For a packet moving from PU_i to PU_j there is an edge from node i in V_1 to node j in V_2. This gives an m regular graph which can be colored with m colors. Each PU first routes all packets with color x to PU_{x mod P}. In the second routing phase the packets are routed to their destinations. Because of the coloring property for the edges incident on the nodes of V_1, in the first round each PU sends exactly l packets to each PU. Because of the coloring property for the edges incident on the nodes of V_2 each PU receives one packet from each color. These stand perfectly distributed after the first routing.

Towards Online Routing

Offline routing is fine for fixed patterns, but patterns which become known only at runtime are more common. Just as it is often possible to turn randomized algorithms into deterministic algorithms with essentially the same performance, so is it often also possible to turn offline algorithms into online algorithms. Both conversions are of great importance, because they allow for a separation of concerns. Knowing that such conversions exist, one may when designing an algorithm assume some smooth distribution as it is achieved by a randomization, or one may first consider what might be achieved offline.

Clearly for a problem like permutation routing, which is not particularly hard, it appears outrageous to first solve a bipartite graph-coloring problem of a graph with one edge per packet. This problem is far more complicated then the routing itself. However, we can solve a strongly reduced problem in o(n) time, while still obtaining almost as good performance as with the offline algorithm. This idea of reducing the information content of a problem is called sparsification and is useful beyond the scope of turning offline algorithms into online algorithms.

The approach is sketched for a two-dimensional n x n mesh. The mesh is divided in m x m submeshes. m = o(n) is quite large, one should think of m = n^{1 - eps}, for some suitable eps > 0. The packets going to each of these n^{2 * eps} submeshes are counted in each submesh. The bipartite graph does not have one node for each row, but one node for each of the n^eps row bundles. A further idea, which helps to reduce the size of the graph is to not create an edge for each edge but only for each superpacket. A superpacket consists of a large number s = n^alpha packets with common destination. Because we cannot assume that s exactly divides the number of packets traveling between a pair of row bundles, in general there will be one partially filled superpacket for each destination pair. So, in each submesh the graph to color has 2 * n^eps nodes and n^{2 - 2 * eps - alpha} + n^{2 * eps}. Choosing the parameters right, we can achieve the following:

After solving the localized and sparsified coloring problems, the routing cannot be performed so simply as it was done before. Each routing phases must be preceded by a rearrangement in the submeshes, so that the packets stand well-distributed. For example, preceding a horizontal routing phase, the packets must be arranged so that in each row of a submesh the number of packets going to any other submesh is about equal. When reaching a submesh, the packets are not routed to a specific PU, but rather to the first PU in the submesh which is still storing fewer than the maximum allowed number of packets. Together these ideas assure small queues and good routing times. Because the submeshes have size o(n), the cost of each operation is a lower-order term of the total routing time, and because there are only constantly many routing phases, their cost is negligible altogether.

Reconfigurable Arrays

Model

So far we have only considered parallel computers which are connected by a fixed interconnection network. This is the most natural model and such parallel computers are almost the only ones that exist. But, consider again the complete interconnection network realized by a crossbar switch. This can also be viewed as a set of P PUs which are dynamically interconnected. Said otherwise, the technology for making dynamic interconnections is available and apparently the price for this technology is reasonable and the speed at which the connections can be made is acceptable.

The entire system consists of a number of PUs with conventional powerful arithmetic and storage and a number of switches. The switches are slightly more powerful than in a crossbar: they listen to the signals passing on the buses they are connected to and take out whatever is destined for them. A small number of values can be stored and the switches can be set conditionally as a result of a comparison between stored values. The whole system is assumed to be synchronized and the difference between receiving a signal or not receiving a signal can be detected.

Typically the switches are arranged in a rectangular grid, but in principle they might also be arranged differently. The PUs are arranged on the outside. The switches normally have four ports, denoted W, N, E and S. Sometimes it is assumed that these ports can be connected dynamically in any desirable configuration, sometimes the configurations are limited to a subset thereof. In the presented examples we will not need anything fancy. A switch setting is denoted by indicating which ports are connected. For example, WN-SE indicates the setting in which W is connected with N and S with E.

It is important to notice that the switch setting decomposes the grid into a number of subnets. Within these nets there is broadcasting possibility: as long as it is assured that only one of the connected PUs is sending, all connected PUs and switches will receive the information transmitted in the net. A system as described is called a reconfigurable processor array

One may argue that reconfigurable networks are physically not realistical. There are two major assumptions which give reason for skepticism:

The first point is a true limitation: speed of light is finite and the size of the reconfigurable network certainly increases with the number of switches. This is a limitation, but exactly the same assumption is made in all hypercube, butterfly and PRAM algorithms. Two-dimensional meshes (and to a lesser extend three-dimensional meshes) are the only interconnection networks that can be realized in physical space without scaling problems. In a two-dimensional lay-out of a hypercube with n PUs there are wires of length sqrt(n).

It is harder to estimate the importance of the second point. Radio exists, and the signal does not become noticeably weaker if the number of listeners increases. Even wire-bound broadcasting exists in the form of cable television. So, technologically, this is no problem either. Possibly the signal must be somewhat stronger than for a point-to-point transmission. Also one can imagine that especially branchings weaken a signal. In most algorithms for reconfigurable arrays, these can easily be avoided by performing the broadcasting in two steps.

Power of Reconfiguration

Parity is the problem of determining the parity of the sum of n bits. On a CRCW PRAM this problem requires Omega(log n / loglog n) time. Even taking exponentially many PUs does not help. Parity is important because it is a simplification of many other more useful problems. Surprisingly, parity can be solved on a reconfigurable array in constant time. The amount of used hardware is O(n). Hence, the algorithm is not only fast, but even efficient.

The algorithm requires n + 3 PUs and 6 * n switches arranged in an 3 x 2 * n grid. In a first round PU_i informs the six switches in the columns 2 * i and 2 * i + 1 about its value. After receiving this value, the switches set their connections in preparation for the next round. If a 0 is received, a switch is set to WE-NS. If a 1 is received the switch must be set WN-SE if it is found in column 2 * i. If the switch is in column 2 * i + 1, then it is set to WS-NE, with exception of the switch in row 1, which is set to WE-NS. Then in the second round, a signal is entered in the net from the lower-left side. If it comes out at the right side in the same row, then the parity is even, if it comes out one row higher, the parity is odd. The PU on the right side receiving the signal can either output the conclusion, or broadcast it to all other PUs.

Computing Parity of 1 + 1 + 0 + 1 + 0 + 1

Constant Time Sorting

Parity is not the only problem that can be solved in constant time on reconfigurable arrays. Actually, there are many interesting problems which can be solved in constant time. However, mostly these algorithms are far from efficient, requiring considerably more switches than the sequential time. As an example we show here how to sort n numbers using O(n^2) PUs and O(n^3) switches in O(1) time. In the chapter on sorting we will see how the number of switches can be reduced to O(n^2). This is optimal within this context, because routing a permutation of n elements in constant time requires an n x n array.

The input is a set of n numbers a_i, number i originally residing in PU_i. By taking also the index into account, we may assume that all values are different. The algorithm consists of two somewhat independent parts. In the first part, for each a_i its rank r_i within the set of numbers is computed. In the second part, a_i is routed to PU_{r_i}. Each of these operations can be performed in a constant number of steps. For each i there is a dedicated n x n grid of switches. The switches in column j of this network are set depending on a comparison of a_i with a_j similar to the construction in the parity algorithm: if a_j < a_i, a signal which is inserted at the lower-left corner is guided one row upwards. The signal finally leaves the network on the right side in row r_i of the subnetwork. Even when using only the three different settings which were used before, the whole algorithm can be implemented with just 6 steps.

Sorting in Constant Time

Exercises

  1. Show with a picture how to embed a linear array in a two-dimensional mesh. Generalize this embedding for d-dimensional meshes. Using the embedding as the basis for an indexing scheme, specify exactly the index of PU_{i_1, ..., i_d} in an n_1 x ... x n_d mesh.

  2. Show with a picture how to embed a circular array in a two-dimensional n_1 x n_2 mesh with n_1 even. Generalize this embedding for d-dimensional meshes. Considering the side lengths n_1, ..., n_d, which condition must they satisfy in order that this embedding is possible?

  3. The main importance of embeddings of linear arrays stems from the fact that they lead to indexings schemes in which PUs with consecutive indices are adjacent. More generally, we would like that any pair of PUs whose indices do not differ much are close to each other in the network. Construct an indexing scheme for a mesh so that the distance in the network between PU_i and PU_j is bounded by O(sqrt(|i - j|)).

  4. Assume each PU of a circular array with n PUs holds k packets that must be routed so that each PU receives at most k packets in total. This routing pattern is called k-k routing. Specify lower bounds on the routing time in terms of n and k. Distinguish several cases. Consider the routing in a greedy way: each packet is routed to its destination along the shortest path, applying farthest-first priority strategy. How long may the routing take? For which values of k is this algorithm optimal? Present another algorithm which is optimal for all k >= 4.

  5. Draw two- and three-dimensional CCC networks and determine their diameters.

  6. Give Cayley graphs corresponding to circular arrays, two-dimensional tori and CCC networks.

  7. The k-dimensional pancake graph is a Cayley graph defined over the permutation group S_k which has as elements the k! permutations of the elements 0, 1, ..., k - 1. The product in this group is given by the composition. The set of generators S = {swap_i| 2 <= i <= n}, where swap_i is the permutation ((i - 1) (i - 2) ... 0 i (i + 1) ... (n - 1)). So, swap_i reverses the order of the first i elements.

  8. The greedy routing algorithm on two-dimensional meshes has maximum queue size if the two routing phases are separated. If the phases are coallesced, then this queue size becomes smaller. Show that coallescing does not really help by providing a permutation for which the greedy algorithm with coallesced phases still has queue size Omega(n). The worst achievable queue size is 2 * n / 3.

  9. Give an algorithm for broadcasting on a k-dimensional butterfly network running in 2 * k + O(1) communication steps. Specify the exact number of steps used by your algorithm.

  10. Consider gossiping on a circular array with n PUs in the unit-cost telephone model. Give algorithms and specify the number of steps. Distinguish the cases that n is even and n is odd.

  11. We consider gossiping in the unit-cost telephone model on n_1 x n_2 meshes for which at least one of the sides has odd length.

  12. Show that gossiping in the linear-cost model on a linear array with P PUs requires at least 2 * P - 3 steps.

  13. Show how to gossip on a 3 x 3 mesh in the linear-cost model using only 10 steps.

  14. Give an algorithm for gossiping on a k-dimensional CCC network running in 5 / 2 * k + O(1) communication steps of the unit-cost telephone model. Specify the exact number of steps used by your algorithm.

  15. Give an algorithm for gossiping on a k-dimensional butterfly network running in 5 / 2 * k + O(1) communication steps of the uni-cost telephone model. Specify the exact number of steps used by your algorithm.

  16. In the text it was demonstrated that in the DMM model gathering has the same complexity as gossiping. Give an example of an interconnection network on which gathering is cheaper than gossiping.

  17. A torus can be embedded in a mesh and a circular array in a linear array. These embeddings imply that any torus or circular-array algorithm requiring t steps can be run on a mesh or linear array, respectively, in 2 * t steps. An interesting question is whether the reverse holds: can any problem be solved twice as fast on a torus / circular array as it can be solved on a mesh / linear array? This exercise provides a simple and convincing counter example: online 2-2 routing can be solved trivially performed in n steps on a linear array, while 2-2 routing on a circular array may require at least 2 / 3 * n steps, even when all source-destination pairs are known beforehand and the routing may be planned offline.

    In the section on "Using Randomization" a lemma was given specifying exactly the time for routing on a linear array with n PUs when using the farthest-first strategy.

  18. Consider random permutations. That is, for a processor network with P PUs a permutation pi is picked uniformly out off the P! permutations of the P indices. The packet from PU_i has to be routed to PU_{pi(i)}.

  19. Consider again the k-k routing algorithm for two-dimensional n x n meshes. It consisted of four phases. The first two sending the packets to random intermediate destinations, the last two for sending the packets to their destinations. Because each phase may take at least n steps, such a four-phase algorithm inherently requires 4 * n steps, and therefore it cannot be good for k < 8. The four phases where xy followed by xy. Of course it is just as good to follow xy by yx. In that case there are two consecutive phases of routing along the columns. These can be replaced by a single phase. This gives a three-phase algorithm which is better for k < 8. Specify how many steps this algorithm takes for all k. Take care, in this case not all phases are equally expensive.

  20. The idea of Annexstein and Baumschlag for offline routing works for any product network.

  21. Consider permutation routing on k-dimensional CCCs. A CCC is not a product network, but it is useful to view it as a hypercube with nodes replaced by cycles of length k. Above we have considered how to route on hypercubes, routing on cycles is trivial. The PUs of the CCC are indexed (i, d), with 0 <= i < 2^k and 0 <= d < k, the first index gives the position within the cube, the second index gives the position within the cycle.

  22. Consider routing permutations on a reconfigurable array. There are n PUs connected to an n x n array. PU_i holds a number a_i which must be communicated to PU_{r_i}. Show how to solve this problem in three steps. The switches can only be put in the three states that were used before: WE-NS, WN-SE and WS-NE.





Basic Problems

Starting with this chapter and throughout the rest of the text, we follow a problem-oriented approach. That is, each chapter deals with a single problem or a small set of related problems, which are studied for each of the considered cost models: WT framework and PRAM, interconnection networks and DMM. The goal is to provide insight how the cost model influences the choice of the algorithm. In all cases a minor modification of the PRAM algorithm works reasonable even for a DMM but more often than not, this does not lead to the best possible DMM algorithm. We will see that algorithms designed for interconnection networks provide just as good a source of inspiration.

Sum and Prefix Sums

We consider the problem of computing the sum of n values. More precisely for an array a[], we want to compute S = sum_{i = 0}^{n - 1} a[i]. A slightly harder problem is to compute the prefix sums, that is, to compute all values s[j] = sum_{i = 0}^j a[i] for all j, 0 <= j < n. Sequentially both problems can be computed in O(n) with a single for loop.

Linear and Circular Array

On a linear array with n PUs we assume that initially PU_i holds a[i]. In step i, 1 <= i <= n - 1, PU_{i - 1} routes its number to PU_i. Upon receiving a number, PU_i adds the received number to its value a[i]. The result is s[i]. At the end of step n - 1, PU_{n - 1} holds s[n - 1], which is the sum of all numbers.

On a circular array each number is routed from the PU where it starts in each direction for n / 2 steps. If in step t, 1 <= t <= n / 2, PU_i receives a number from PU_{(i - 1) mod n}, then it adds its value to the value it has if i >= t. If PU_i receives in step t a number from PU_{(i + 1) mod n}, then its value is added if i + t >= n.

The algorithm on the circular array runs twice as fast and minimizes the number of communication steps. However, this algorithm is not work-optimal: all PUs are busy all steps, so the total work is Theta(n^2).

WT Framework and PRAM

In the WT framework, the sum can be computed in O(log n) steps. The idea is the following modified sequential algorithm:
  int sum(int[] a, int n)
  {
    int d = n / 2;
    while (d > 0)
    {
      for (i = 0; i < d; i++)
        a[i] += a[i + d];
      d /= 2;
    }
    return a[0];
  }
For n = 2^k for some k >= 0 this computes the sum in k = log n passes of the while loop. The operations of the for loop can be executed in parallel. The work is given by the sum of the used d-values which is n / 2 + n / 4 + ... + 1 = n - 1. Because the scheduling on an EREW PRAM is no problem, this gives a PRAM algorithm running in O(log n) time with n / log n PUs, an optimal algorithm.

For the prefix sums problem there is an elegant recursive algorithm:

  void prefix(int low, int n)
  {
    if (n > 1) 
    {
      prefix(low, n / 2); 
      prefix(low + n / 2, n - n / 2); 
      for (int i = n / 2; i < n; i++) 
        a[low + i] += a[low + n / 2 - 1]; 
    }
  }
In a sequential context it makes no sense to formulate the prefix sums computation this way, but this formulation immediately leads to a parallel algorithm running in logarithmic time: T(n) = T(n / 2) + O(1). The work can be estimated as W(n) = 2 * W(n / 2) + O(n), which gives W(n) = O(n * log n). So, this algorithm is elegant but not optimal.

With a twist the work can be reduced to O(n). First consider an algorithm for solving the problem on a processor network with a tree topology consisting of 2 * n - 1 PUs arranged in log n + 1 levels with n / 2^t PUs at level t, 0 <= t <= log n. The algorithm consists of two main phases, the first going upwards, the second downwards. Initially a[i] is stored in PU_i at level 0. In step 0 of the upwards phase, each PU at level 0 routes its number to the PU at level 1 above it. In step t, 1 <= t < log n, of the upwards phase, a PU at level t routes to the PU at level t + 1 above it the sum of the values it received from the two PUs below it. In step t, 0 <= t < log n, of the downwards phase, a PU at level log n - t routes to the two PUs at level log n - t - 1: to the left it routes the value it received from the PU at level log n - t + 1 (which is supposed to be zero for t = 0); to the right it routes the sum of the value it received from above and the value it received from the left during the upwards phase. At level 0 PU_i adds the received value to a[i]. The whole algorithm consists of 2 * log n + 1 steps.

In the WT framework, it suffices to notice that there are 2 * log n + 1 time steps and that the total work is O(n), because each of the 2 * n - 1 PUs of the tree network is performing only O(1) work. Thinking about what is actually going on, we get the following sharper formulation:

  void prefix()
  {
    for (int step = 1; step <= n / 2; step *= 2)
      for (int i = step - 1; i < n; i += 2 * step) 
        a[i + step] += a[i];
    for (int step = n / 4; step >= 1; step /= 2)
      for (int i = 3 * step - 1; i < n; i += 2 * step) 
        a[i] += a[i - step];
  }
The number of additions is n / 2 + n / 4 + ... + 1 in the first loop and 1 + 3 + ... + n / 2 - 1 in the second. Because the scheduling on an EREW PRAM is no problem, this gives a PRAM algorithm running in O(log n) time with n / log n PUs, an optimal algorithm.

Click here to see this procedure integrated in a working program. The program also contains the sequential and the recursive algorithm. Running the program for large n shows that not withstanding the optimality, the parallel algorithm is considerably slower than the sequential one. One of the reasons is the much larger number of cache faults it generates. This can be overcome by a better organization. However, a second reason is that the algorithm is inherently more complex and performs twice as many additions.

DMM

Solving a sum or prefix-sum problem for an array a[] with n elements on a DMM with P << n PUs is easy and efficient. The elements of a[] are allocated to the PUs so that PU_i holds the number a_j, with i * (n / P) <= j <(1 + 1) * (n / P). First each PU computes the sum of the values its holds. Then a sum or prefix-sum problem is solved for these P values. This takes log P or 2 * log P - 3 rounds, respectively. When solving the prefix problem, it remains to add an offset to the internal values. Thus, for the sum problem
T_par(P, n) = c_i * O(n / P) + (c_v + c_a) * log P.
Similarly, for the prefix-sum problem
T_par(P, n) = c_i * O(n / P) + (c_v + c_a) * (2 * log P - 3).
Because the routing volume is so small, this algorithm works even on networks with large c_v. The condition for getting good performance is c_a * log P << c_i * n / P. This is satisfied for P << c_i / c_a * n / log n, which is an exceptionally large number. Said otherwise, even relatively small problem can be solved efficiently.

Matrix Product

WT Framework and PRAM

An easy problem with a high sequential complexity is matrix multiplication. For computing C = A * B one conventionally performs
  void matrix_product(int[][] A, int[][] B, int[][] C, int n)
  {
    for (i = 0; i < n, i++)
      for (k = 0; k < n, k++)
        C[i, k] = 0;
    for (i = 0; i < n, i++)
      for (k = 0; k < n, k++)
        for (j = 0; j < n, j++)
          C[i][k] += A[i][j] * B[j][k];
  }

It is handy to rewrite the algorithm as follows:

  int sum(int[] c, int n)
  {
    int s = 0;
    for (j = 0; j < n, j++)
      s += c[j];
  }

  void matrix_product(int[][] A, int[][] B, int[][] C, int n)
  {
    int[][][] c = new int[n][n][n];
    for (i = 0; i < n, i++)
      for (k = 0; k < n, k++)
        for (j = 0; j < n, j++)
          c[i][k][j] = A[i][j] * B[j][k];
    for (i = 0; i < n, i++)
      for (k = 0; k n, k++)
        C[i][k] = sum(c[i][k], n);
  }

This immediately gives the following WT algorithm:

  void matrix_product(int[][] A, int[][] B, int[][] C, int n)
  {
    int[][][] c = new int[n][n][n];
    for all (i, k, j, 0 <= i, k, j < n) pardo
      c[i][k][j] = A[i][j] * B[j][k];
    for all (i, k, 0 <= i, k < n) pardo
      C[i][k] = sum(c[i][k], n);
  }
The algorithm consists of two parallel instructions, of which the second requires further O(log n) instructions. The work is n^3 + n^2 * O(n) = O(n^3). We have given here the transformation in several steps and explicitly indicate which steps are to be performed in parallel. In the following such straight-forward parallelizations will not be worked out.

It is obvious how to schedule the algorithm on a PRAM with n^3 PUs. Taking P = W(n) / T(n) = n^3 / log n as suggested above, reduces the cost without substantially increasing the time. In that case C(P, n) = n^3 / log n * O(log n) = O(n^3). One may debate whether this should be called optimal or not: there are sequential matrix-multiplication algorithms running in o(n^3). On the other hand, are these not so terribly much better in practice, and the conventional version presented above is still used.

Two-Dimensional Mesh

Surprisingly, matrix multiplication can be performed very efficiently on tori and meshes. Matrix multiplication and the all-pairs shortest paths problem treated further down are among the very few non-trivial problems that can be solved optimally on meshes: on an n x n mesh, two n x n matrices can be multiplied in O(n) communication steps. Even under the assumptions of systolic computation, we get this time bound. Systolic computation means that the data are assumed to enter the mesh from the top and left side of the mesh (or something similar) and then stream at one step at a time through the network. The PUs can take out of the passing streams whatever is of their liking. The PUs can also change the values of the streaming data. The hardware which motivated the design of systolic algorithms is no longer popular, but systolic algorithms are particularly simple and therefore it continues to be worth considering them.

In our case the PU at position (i, k) computes C[i][k]. So, the data from A[][] and B[][] must be arranged so, that A[i][j] and B[j][k] reach PU_{i, k} at the same time. This is achieved when the arrival is delayed so, that A[i][j] reaches PU_{i, 0} in step i + j and then runs rightwards without delay, while B[j][k] reaches PU_{0, k} in step j + k and then runs downwards without delay. Thus, A[n - 1][n - 1] and B[n - 1][n - 1] enter the mesh in step 2 * n - 2 and reach PU_{n - 1, n - 1} in step 3 * n - 3. By then all other numbers have been processed as well. So, the whole computation takes 3 * n - 3 communication steps.

Systolic Matrix Product

On a torus the systolic algorithm can be implemented easily: the data from A[][] which in the systolic algorithm reach column 0 in step t, 0 <= t < n, are positioned in column (n - t) mod n. The data from B[][] which reach row 0 in step t are positioned in row (n - t) mod n. More precisely, A[i][j] starts in PU_{i, (n - i - j) mod n} and B[j][k] starts in PU_{(n - j - k) mod n, k}. The A[][] data move rightwards using the wrap-around connections, the B[][] data move downwards. In all PUs there is something to do every step. So, after n - 1 communication steps, the matrix product has been computed. This is optimal because there are n^3 multiplications to make with n^2 PUs.

How to implement this algorithm on a mesh? On a practical system with worm-hole routing, it is never a problem when one packet must be routed a long distance. So, cyclical shifts can always be performed at almost full speed. In the store-and-forward model, however, it is harder. Just taking the above algorithm with the indicated mapping requires (n - 1) * (n - 1) routing steps. But, using the standard embedding of a torus on a mesh, matrix multiplication can be performed on the mesh in 2 * n - 2 steps. It is not clear that this is the optimal result for computing matrix products on meshes, but there are no obvious improvements as long as we insist that each PU holds only one value from A[][] and B[][] at a time.

Three-Dimensional Mesh

There is a matrix-multiplication algorithm for three-dimensional meshes which is directly inspired by the PRAM algorithm. It uses n^3 PUs and runs in O(n) time steps. This performance is not good. The main reason to present the algorithm is that it makes it easier to understand the hypercube algorithm in the next section.

The n^3 PUs of the three-dimensional n x n x n mesh are indexed with three subscripts as in the obvious way: PU_{i, j, k} denotes the PU in row i, column j and plane k. We assume that initially A[i][j] is stored in PU_{i, j, 0} and B[j][k] in PU_{0, j, k}. Then these data are routed so that each PU_{i, j, k} receives A[i][j] and B[j][k]. This routing is trivial: starting from PU_{i, j, 0} the number A[i][j] is routed along the third axis, starting from PU_{0, j, k} the number B[j][k] is routed along the column. Hereafter PU_{i, j, k} computes c[i][j][k] = A[i][j] * B[j][k]. All numbers that must be added together, that is all numbers c[i][j][k] with the same j, stand in the same row. In another n - 1 steps these can be added up to C[i][k] and routed to the PUs PU_{i, 0, k}. So, the whole algorithm takes 2 * n - 2 steps on an all-port machine and 3 * n - 3 on a single-port machine.

The last operation, computing C[i][k], can be performed in two ways:

Both algorithms finish after n - 1 steps. However, there is a great difference between them. In the first algorithm, PU_{i, 0, k} is busy all the time, and in total Theta(n^3) packets are sent. In the second algorithm, each PU is active only one step, and the total numbers of packets is Theta(n^2).

If a single matrix product must be computed there is no difference. If, however, there are L pairs (A_l, B_l), 0 <= l < L, of matrixes to multiply, then using the first algorithm we still need at least n - 1 steps per product because the PU_{i, 0, k} are busy n - 1 steps adding the numbers they are receiving. The second algorithm can be used for a pipelined approach which requires only O(n + L) steps to compute all product matrices C_l, 0 <= l < L. Pipelining means that an excess of computing power is made useful by scheduling computations tightly after each other. This is another kind of parallelism, successfully used in any modern processor.

In our case, the numbers of A_l and B_l start being routed into the network at step l and come out of the network 2 * n - 2 steps later (assuming that the PUs can store O(n) numbers and all-port capacity). So, for computing L matrix products we need 2 * n + L - 3 steps. For L = omega(n), this is 1 + o(1) steps per product with n^3 PUs. On a two-dimensional mesh we needed 2 * n - 2 steps with n^2 PUs. The conclusion is that for computing many products the simple algorithm for three-dimensional meshes is asymptotically twice as efficient.

Hypercube

Now we consider how the above algorithm leads to a hypercube algorithm running in O(log n) time (one of the questions deals with the problem of making the algorithm optimal). For convenience we assume that n = 2^{3 * l} for some integral l. In that case, the hypercube can be perceived as a three-dimensional mesh with handy shortcuts, which allow to perform the routing within one-dimensional submeshes faster.

We are going to use the algorithm for a three-dimensional m x m x m mesh on the hypercube, for m = 2^l. PU_{i, j, k} is identified with the hypercube PU in which the numbers i, j and k are written out as binary k-bit numbers. The algorithm for the three-dimensional mesh consists of three routing operations. In the first information is sent from the PU_{i, j, 0} to all PU_{i, j, k}. Thus, information has to be spread from a single PU to all PUs in a one-dimensional submeshes. This submesh corresponds to an l-dimensional subhypercube. Using the connections of each dimension one after the other, it is trivial to perform this spread of information in l routing steps. In the second operation information is sent from the PU_{0, j, k} to all PU_{i, j, k}. This goes analogously, on a full-port machine it can even be performed in parallel with the first operation. The third operation consists of gathering and summing the information standing in all PU_{i, j, k} into the PU_{i, 0, k}. Again this can be done by performing l routing steps within l-dimensional subhypercubes. In step t, 0 <= t < l, the routing is performed along axis l + t, each PU which has a 1 at position l + t of its index sends a packet to the PU whose index is smaller by 2^{l + t}. The transferred value is the sum of the own value and all received values.

So, in the single-port model the whole algorithm can be implemented in 3 * l = log n steps. The connections along each axis are used exactly once. This last property implies that it is again trivial to pipeline matrix multiplications on a full-port machine: L products can be computed in log n + L - 1 steps.

DMM

The above algorithms inspire two DMM algorithms. The first can be considered to be a direct adaptation of the PRAM algorithm, the second is a coarse-grain version of the torus algorithm. Both perform the same number of arithmetic operations, so on a first glance they may appear to be equally good. However, the second has smaller communication overhead and is therefore superior in practice. The algorithms are compared here to illustrate that adapting PRAM algorithms often does not lead to the best DMM algorithms.

Bundle-Wise Approach

For the first algorithm we assume that P <= n and that P divides n. The matrices A and B which have to be multiplied are divided in row and column bundles, respectively. A row bundle consists of n / P rows, a column bundle of n / P columns. The bundles are numbered from 0 to P - 1. The algorithm consists of P rounds. At all times PU_i holds row bundle i of A[][]. During round t, 0 <= t < P, PU_i holds column bundle (i + t) mod P of B[][]. During round t, PU_i computes the n^2 / P^2 values of C[][] at the intersection of row bundle i and column bundle (i + t) mod P. Finally PU_i has computed all the values in row bundle i of C. Between the computation rounds, the column bundles of B[][] must be routed from one PU to the next. This is a very simple and regular routing pattern. If the network has a Hamiltonian cycle, then with an appropriate indexing only adjacent PUs are communicating. The time consumption of this algorithm is given by
T_par(P, n) = c_i * n^3 / P + c_v * n^2 + c_a * P.

Of the two communication terms, the contribution c_v * n^2 will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * n^2 << c_i * n^3 / P, which gives the condition

P << c_i / c_v * n.

Bundle-Wise Matrix Product on DMM

Block-Wise Approach

There are better ways of organizing matrix multiplication on DMMs which allow to use more PUs. The three matrices are divided into P blocks of size n / sqrt(P) x n / sqrt(P). These blocks are indexed as the positions of a sqrt(P) x sqrt(P) matrix, starting with block (0, 0) in the upper-right corner. The PUs are indexed analogously. In the following description we use n' = sqrt(P). The data in block (i, j) of A[][] start in in PU_{i, (n' - i - j) mod n'} and the data in block (j, k) of B[][] start in PU_{(n' - j - k) mod n', k}. The algorithm consists of n' steps. In step t, 0 <= t < n', each PU_{i, k} multiplies the two blocks it has and forwards the block of A[][] to PU_{i, (k + 1) mod n'} and the block of B[][] to PU_{(i + 1) mod n', k}. Summing up the computed results, PU_{i, k} computes block (i, k) of C[][]. This is the same algorithm that was given above for the torus. The only difference is that the entries which are multiplied and shifted are not integers but submatrices. The time consumption of this algorithm is given by
T_par(P, n) = c_i * n^3 / P + c_v * 2 * n^2 / sqrt(P) + c_a * 2 * sqrt(P).

Of the two communication terms, the contribution c_v * 2 * n^2 / sqrt(P) will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * 2 * n^2 / sqrt(P) << c_i * n^3 / P, which gives the condition

P << (c_i / c_v * n)^2.

Block-Wise Matrix Product on DMM

The second algorithm is much better, but still we cannot use arbitrarily many PUs. The above conditions show that the largest number of PUs that can be used in a sensible way is a very small constant fraction of n^2. Given that there is O(n^3) work to do, this means that, using this algorithm, T_par(infinity, n) = Omega(n). There is a clear discrepancy here between the bounds for the PRAM and the DMM. If one believes that the DMM model is the more realistic of the two, this implies the limited importance of the quest for polylog PRAM algorithms.

All-Pairs Shortest Paths

Sequential Algorithms

We consider the all-pairs shortest path problem for graphs with n nodes. That is, for a graph G = (V, E) with n nodes and m edges we want to compute a complete n x n distance table. Sequentially, for sparse graphs the best idea is to apply n times the best applicable algorithm for single-source shortest paths: if the graph is unweighted this is breadth-first search, if the graph is positively weighted this is Dijkstra's algorithm. These methods require O(n * (n + m)) and O(n * (n * log n + m)) time respectively. For dense graphs or if the graph also has negative-weight edges, it is better to apply a cubic algorithm solving the whole problem at once. The best choice in this case is the Floyd-Warshall algorithm which in addition to having relatively low asymptotic complexity also has the advantage of being extremely simple. This algorithm works without assumptions on the weights of the edges. Of course, if there are negative-cost cycles, the computed values do not make sense. Such problem instances can be recognized by checking the values on the diagonal: if any of them is negative, this flags a negative-cost cycle.

The Floyd-Warshall algorithm proceeds as follows. Here the matrix D[][] contains all direct distances. That is, D[i][k] gives the length of the edge from i to j. If there is no such edge, then D[i][k] = infinity. Finally D[i][k] contains the length of the shortest path from i to k.

  void floyd_warshall(int[][] D, int n)
  {
    for (j = 0; j < n, j++)
      for (i = 0; i < n, i++)
        for (k = 0; k < n, k++)
          if (D[i, j] + D[j, k] < D[i, k])
            D[i, k] = D[i, j] + D[j, k];
  }

Lemma: The Floyd-Warshall algorithm is correct.

Proof: The correctness is proven by inductively showing that after performing the second loop for all values in S_j = {l| 0 <= l <= j}, D[i, k] gives the length of the shortest path from node i to node k running through nodes in S_j. Initially this is true, and finally this implies the correctness of the algorithm. So, assume the claim holds for some j - 1. Then, the shortest path from node i to node k running through nodes in S_j can either run though node j or not. If it does not run through node j, then the shortest path is given by the shortest path through S_{j - 1}, which by the induction assumption has length D[i, k]. In the other case, the shortest path through S_j runs along a shortest path in S_{j - 1} from i to j and then along a shortest path in S_{j - 1} from j to k. The length of these two subpaths are according to the induction assumption given by D[i, j] and D[j, k] and because we assume that this path gives an improvement, we have D[i, j] + D[j, k] < D[i, k] and the assignment D[i, k] = D[i, j] + D[j, k] will be performed. End.

WT Framework and PRAM

Notice that the Floyd-Warshall algorithm is almost identical to the basic algorithm for matrix computation, the main difference being that in the matrix-multiplication algorithm the loop with running index j is ordered last. In the matrix multiplication there are n^2 independent loops, each of which could be parallelized. The correctness proof of the Floyd-Warshall requires this specific execution order, and thereby introduces a kind of sequentialization. Thus, for the Floyd-Warshall algorithm, this give a time of O(n) with O(n^3) work. On a PRAM, using n^2 PUs it is trivial to solve the problem in O(n) time. This is cost-optimal, but not very fast.

An alternative is to apply brute-force: the distances can also be computed by applying an algorithm originally designed for computing the transitive closure of a graph. The transitive closure of a graph G is the graph G^* having an edge from node i to node j iff node j can be reached from node i over a path in G. Of course this problem can be solved efficiently determining the connected components, but for a graph with n nodes one can also compute D^{n - 1}, where D is the adjacency matrix of G. Of course the matrix product here is not defined over the underlying ring of integers, but over the ring of booleans where "+" stands for "or" and "*" stands for "and". Modifying so that the entries of D[][] are integers giving the distances and not 0-1 values, and replacing "+" by "minimum" and "*" by "+", D^{n - 1} does not only give the existence of a path, but even the length of the shortest path. This can be proven by showing that if D^i and D^j give the lengths of the shortest paths using up to i and j edges, respectively, that then D^i * D^j = D^{i + j} gives the length of the shortest paths using up to i + j edges. Because we assume that all edge weights are non-negative, all shortest paths consist of at most n - 1 edges. Therefore, it is sufficient to compute D^{n - 1}.

Clearly, computing D^{n - 1} by starting with D^1 = D and then repeatedly multiplying with D leads to a correct but very inefficient way of doing. A much better idea is to compute D^n by suing D^{2^t} = D^{2^{t - 1}} * D^{2^{t - 1}}. In this way D^{n - 1} can be computed with k = round_up(log_2 (n - 1)) matrix products: 2^k >= n - 1 and because of the above argument, D^m = D^{n - 1}, for any m >= n - 1. This reduces the APSP problem to computing log n matrix products. The time for this is O(log^2 n) with O(n^3 * log n) work. On a PRAM, using n^3 / log n PUs, the APSP problem can be solved in O(log^2 n) time. This algorithm shows that APSP is in NC, but it is not cost-optimal.

Mesh Algorithm

We consider how to execute the Floyd-Warshall algorithm on a mesh so that it has running time O(n). This is not as easy as for the matrix multiplication. Denote by D^(j), -1 <= j < n, the distance matrix obtained after performing the outer loop in the Floyd-Warshall algorithm for all values up to j. D^(-1) = D, D^(n - 1) is the final matrix to compute. We assume that D[i, k] is originally stored in the PU at position (i, k). This PU is responsible for computing D^(j)[i, k] for all j. D^(-1)[i, k] = D[i, k] it knows, for the computation of the other values it needs data coming from the other PUs: for computing D^(j)[i, k] it must receive D^(j - 1)[i, j] and D^(j - 1)[j, k]. The secret to obtaining good performance, is to accept that not all values of D^(j) are computed in the same step.

The communication is performed according to the following rules:

Clearly the given algorithm is correct because of the condition that data are forwarded only as soon as D^(j)[i, k] has been computed and because eventually all D^(j)[i, j] become available to all PUs in row i and all D^(j)[j, k] to all PUs in column k. So, every PU can indeed compute all the values it must compute. The more interesting question is how long this takes.

Lemma: PU_{i, k} computes D^(j)[i, k] in step 3 * j + |i - j| + |k - j|.

Proof The proof goes by induction. The main induction goes over j. Secondary inductions go over i and k. The basis of everything is that PU_{0, 0} computes D^(0)[0, 0] in step 0. So, assume the lemma is true for i = 0 and some k, 0 <= k <= n - 2, then PU_{0, k} sends D^(-1)[0, 0] on to PU_{0, k + 1} which computes D^(0)[0, k + 1] in step k + 1. The same follows for D^(0)[k + 1, 0]. Now assume the lemma holds for j = 0 and all i and k with i < l or k < l, then it can first be proven for D^(0)[l, l] and then for all D^(0)[i, k] with i = l or k = l. Assuming the lemma has been proven for some value j and all i and k, it follows that D^(j - 1)[j, j] is computed in step 3 * j - 1. The values D^(j - 1)[j - 1, j] and D^(j - 1)[j, j - 1] have been computed in step 3 * j - 2 and reach PU_{j, j} at the end of step 3 * j - 1. So, PU_{j, j} computes D^(j)[j, j] in step 3 * j. This is the basis of the proof for the next j. End.

Theorem: The given implementation of the Floyd-Warshall algorithm on an n x n mesh solves the APSP problem in 5 * n - 4 steps.

Click here to see the above ideas integrated in a running program. In this simulation each PU has three arrays. For PU_{i, k} the array own[] indicates which values D^(j)[i, k] have been computed. The array hor[] indicates which horizontally moving information has been received, processed and forwarded. The array ver[] is the analogue for the vertically moving information.

Floyd-Warshall on Mesh

DMM

The Floyd-Warshall algorithm can be implemented on a DMM in various ways, reflecting the various algorithms presented above. The simplest idea is to use an algorithm inspired by the PRAM algorithm. The second, more elaborate, idea is to use a refined implementation of the mesh algorithm.

Bundle-Wise Approach

It is assumed that P <= n and that P divides n. The matrix D[][] giving the distances is divided in row bundles of height n / P. PU_p, 0 <= p < P is handling all rows i with p * n / P <= i < (p + 1) * n / P. It is assumed that the data are distributed according to this subdivision right from the start.

The algorithm consists of n rounds. In round j, 0 <= j < n, the PU which is responsible for row j, this is PU_{round_down(j / (n / P))}, sends the information in row j of D[][] to all other PUs. This information is used to compute all values D^(j)[i][k]. The only question is how to route the information in a row from one PU to all other PUs. This is a broadcasting problem. Applying the algorithm using log P routing operations, the time consumption is given by

T_par(P, n) = c_i * n^3 / P + c_v * n^2 * log P + c_a * n * log P.

Of the two communication terms, the contribution c_v * n^2 * log P will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * n^2 * log P << c_i * n^3 / P, which gives the condition

P * log P << c_i / c_v * n.

Block-Wise Approach

In order to implement the mesh algorithm onto a DMM, we use a block-wise subdivision of the matrix D[][] as was done above for computing ther matrix product. So, each block has size n / sqrt(P) x n / sqrt(P). The algorithm is more or less the same. However, in each step, all computations are made that can be made with the available information. This means that as soon as the necessary information D^(j - 1)[j][k] and D^(j - 1)[i][j] comes in, all n^2 / P values D^(j)[i][k] are computed. So, information progresses one block at a time. Thus, the number of steps is not 5 * n - 4, but n + 4 * (sqrt(P) - 1). In each step n / P numbers may have to be transferred in each direction. For the time consumption this gives
T_par(P, n) = (1 + 4 * (sqrt(P) - 1) / n) * (c_i * n^3 / P + c_v * 4 * n^2 / sqrt(P) + c_a * 4 * n).

The algorithm is almost optimal. We may assume that n >> sqrt(P) and therefore the factor 1 + 4 * (sqrt(P) - 1) / n will lie very close to 1. The routing volume is considerably smaller than with the bundle-wise approach. Of the two communication terms, the contribution c_v * 4 * n^2 / sqrt(P) will dominate. Thus, the condition that must be satisfied in order to obtain really good performance is c_v * 4 * n^2 / sqrt(P) << c_i * n^3 / P, which gives the condition

P << (c_i / c_v * n)^2.

If we are working on a parallel computer with exceptionally high start-up cost c_a, then we should try to reduce the number of start-ups. This can be done by not making n steps in which the value of j is increased by 1, but n / b steps in which j is increased by b. Here b is an arbitrary fraction of n / P. In each round the computation now takes c_i * b * n^2 / P and b * n / sqrt(P) data have to be sent to the neighbors. In total the algorithm now needs n / b + 4 * (sqrt(P) - 1) steps. This gives almost the same expression as before:

T_par(P, n) = (1 + 4 * (sqrt(P) - 1) * b / n) * (c_i * n^3 / P + c_v * 4 * n^2 / sqrt(P) + c_a * 4 * n / b).

The only disadvantage of this bundled-block-wise approach is that the leading factor becomes larger. In the extreme case that we take b = n / sqrt(P), there are only 5 * sqrt(P) - 4 steps, but the internal work is almost five times larger. A close to optimal choice of b can be found by equating the additional computation time with the start-up costs. Concentrating on the main contributions and leaving out the constant factors, this gives c_i * b * n^2 / sqrt(P) = c_a * n / b. This suggests to take b ~= sqrt(c_a / c_i * sqrt(P) / n).

The ideas in this whole section on the all-pairs shortest paths problem are for the largest part standard and known for a long time. New contributions on a better PRAM algorithm are given in a paper by Han, Pan and Reif (Algorithmica, 1997). Sibeyn has designed an external-memory algorithm which essentially requires the same number of internal operations as conventional matrix product, and not twice as many as alternative algorithms, including the presented algorithm above.

Exercises

  1. Consider the above given algorithms for computing sums and prefix sums consisting of O(log n) independent steps performing O(n) additions in total. Specify how to allocate the work to the n / log n PUs of a PRAM so that these algorithms can be executed in O(log n) time.

    How long does the execution of the sum algorithm take if we let PU_i, 0 <= i < n / log n, perform all computations with an assignment to some a[j], for i * log n <= j < (i + 1) * log n?

  2. Consider the given parallel prefix algorithm running in O(log n) time on a PRAM with n / log n PUs. Describe how to efficiently implement this algorithm on a DMM with P PUs. Express the running time in terms of n, P and the constants c_i, c_v and c_a. What efficiency can be expected if n is sufficiently large in relation to P?

  3. Give algorithms for computing prefix sums on meshes and hypercubes. On a sqrt(n) x sqrt(n) mesh the algorithm for an input of size n should run in O(sqrt(n)) time. On a k-dimensional hypercube with n = 2^k , for some k > 0, the algorithm should run in O(k) time.

  4. On a PRAM the matrix-multiplication algorithm could be made optimal by taking n^3 / log n PUs instead of n^3. Can we do the same on a hypercube? So, the question is whether two n x n matrices A[][] and B[][] can be multiplied in O(log n) time on a hypercube with n^3 / log n PUs. Initially each number A[i][j] and B[i][j], for all 0 <= i, j < n, is present in only one PU. This PU can be specified freely, a single PU can initially and later hold O(log n) data. The key problem is how to get the data sufficiently fast to the PUs where they are needed.

  5. Describe how to compute the product C = A * B of two n x n matrices A and B on a circular array with n PUs in n steps. For each step specify which products are to be computed and which information has to be routed. Assuming full-port capacity, A * B can even be computed in n / 2 steps. Describe how.

  6. Work out the details of the inductive proof that the Floyd-Warshall algorithm on an n x n mesh runs in 5 * n - 4 steps.

  7. Describe how to solve the APSP problem for an n x n distance matrix D[][] on a single-port linear array with n PUs. For each step specify which values D^(j)[i][k] are to be computed where and which information has to be routed.

    Adapt this algorithm to an alternative bundle-wise DMM algorithm. Specify its time consumption and compare it with the above given DMM algorithms.

  8. Clearly each step of the full-port torus can be performed in four steps of the single-port torus. However, for many problems it is possible to do better. This has practical importance, because running a torus algorithm on a DMM an all-port step means sending four packets while a single-port step means sending a single packet. In the case of the matrix product, in each step only two of the connections out off each node are used. So, the matrix-product algorithm can be performed in 2 * n steps on a single-port torus. We used this in our analysis of the DMM algorithm.

    More interesting is the Floyd-Warshall algorithm. On a full-port mesh it runs in 5 * n - 4 steps. We now want to run it on a mesh on which in each step all PUs can communicate either only in the two positive directions or only in the two negative directions. In how many steps can the Floyd-Warshall algorithm be run in this model? It may be helpful to test ideas by playing around with the program which can be downloaded here.





Basic Techniques

Divide and Conquer

In general the divide-and-conquer technique consists of three steps:
  1. Divide the problem in two or more subproblems (not necessarily of same size).
  2. Find solutions for the subproblems.
  3. Combine the solutions of the subproblems to a solution for the whole problem.
Frequently either the first or the third step is almost trivial. In a sequential context, there is also a similar class of problems in which only one of the subproblems has to be considered further, but this does not offer a source of parallelism. Divide and conquer leads to parallel algorithms in a very natural way: first all PUs are allocated to the whole problem and are involved in splitting it. Then not only the
  • WT Framework and PRAM
  • DMM

    WT Framework and PRAM

    Applying this idea leads to a simple and good parallel sorting algorithm, even though practically there are better algorithms. On a PRAM with n PUs, one PU is allocated to each number. The numbers are a_i, 0 <= i < n. The lowest numbered PU in each subset chooses at random a number s, l <= s <= h, where l and h indicate the lowest and highest index of the elements in the subset, respectively. All PUs in the subset read this number and subsequently they read a_s. PU_i compares a_i with a_s and decides in which subset to put a_i. Sequentially the division of a set goes without extra effort. Parallelly this is the hardest. It requires that a prefix problem is solved for each of the subsets, computing the new position where a number has to be stored. For a set with n elements this takes O(log n) time with O(n) work. Because the subdivision takes logarithmic time, there is no need to perform the broadcasting of the chosen number in constant time using concurrent read. So, any level of the algorithm takes O(log n) time and O(n) work. The whole sorting algorithm takes O(log^2 n) time and O(n * log n) work. Using n / log n PUs, we obtain an optimal EREW PRAM algorithm.

    DMM

    On a DMM, the algorithm can be implemented as it is. The prefix-sum problem on a network with P PUs takes O(c_i * n / P + c_a * log P) time. At the end of each round the packets must be redistributed within the subset of PUs collectively solving a subproblem. The packets from each PU have destinations in at most four different PUs: at most two PUs for the numbers smaller than a_s and at most tow for those larger than a_s. On the other hand, a PU may receive packets from many PUs. The packets may also have very different sizes. In this case it may be a good idea to first scatter the packets so as to obtain two balanced routing operations. Doing this, the routing takes O(c_v * n / P + c_a * P) time, overshadowing the cost of the prefix problem. The number of PUs dealing with a subset decreases geometrically, so the number of start-ups decreases as well. After O(log P) rounds, the subproblems are located in single PUs and no further communication is needed. In total we get
    T_par(P, n) = c_i * O(n * log n / P) + c_v * O(n * log P / P) + c_a * O(P).

    The strong point of divide-and-conquer algorithms in general is that by their localizing nature the number of start-ups is low. For the above algorithm, the start-up costs are negligible if c_a * P << c_i * n * log n / P, that is when n * log n >> c_a / c_i * P^2. If we consider how large the typical problem is for which we will apply a parallel sorting algorithm, this condition will generally be satisfied.

    The weakness of the parallel quicksort algorithm is that the routing volume is almost of the same order as the internal work: only when c_i * log n >> c_v * log P, the communication cost becomes negligible. This condition will rarely be satisfied, implying that the practical speed-up with parallel quicksort, though increasing almost linearly with P, will be small.

    Convex Hull

    The convex-hull problem is one of the most important problems from computational geometry. For a set S of points in space the task is to find the smallest convex polygon containing all points. Here we focus on the variant of the problem which deals with points lying in two-dimensional space, called the planar convex-hull problem. More precisely, the task is to compute the oredered list of all elements of S which lie on the corners of this polygon, for example by giving them in clockwise order. There are also higher-dimensional versions of the problem which are substantially harder.

    The complexity of the planar convex-hull problem is closely related to that of sorting. For a set of integer values A = {a_i| 0 <= i < n}, we can determine in linear time the minimum and maximum. Then in linear time the set A can be turned into a set of points S: a_i is mapped to s_i = (a_i, f(a_i)), where f(a_i) is the unique positive value so that s_i lies on the circle through the minimum and maximum. The convex hull of S consists of all elements. Because they must be presented ordered, solving the convex-hull problem can be used to sort. So, T_sort(n) <= T_convex_hull(n) + O(n).

    The planar convex-hull problem can also be solved by using sorting as a subroutine and is actually not much harder. The algorithm consists of log n rounds which each take O(log n) time. The total work is bounded by O(n * log n). The algorithm proceeds as follows:

    1. Sort the points according to their x-coordinates.
    2. Determine the points a and b with the smallest and largest x-coordinate, respectively.
    3. Determine all points which lie above the line through a and b and those that lie under this line. Call these subsets S_upper and S_lower.
    4. Determine the convex hull of S_upper and the hull of S_lower.
    5. Construct the convex hull of S by enumerating the hull of S_upper followed by enumerating the hull of S_lower, enumerating a and b only once.
    This construction allows to concentrate on the special case that all elements lie above a line. If this is convenient, the points may be rotated about the origin and shifted so that the line through a and b coincides with the x-axis. The special case that all elements lie above the line through a and b can be solved as follows:
    1. Select the point s so that half of the numbers have smaller and half of the numbers have larger x-coordinate.
    2. Construct the sets S_left and S_right of points which lie to the left and right of s, respectively. Add s to both sets.
    3. Recursively solve the problem for S_left and S_right.
    4. Determine the upper common tangent of the convex hulls of S_left and S_right. For two finite sets of points, an upper common tangent is a line which intersects the sets in exactly one point and which has none of the points of the sets above it. Let s_left and s_right be the points in S_left and S_right, respectively, that lie on the upper common tangent.
    5. Construct the convex hull of the whole set by enumerating the points in the hull of S_left until reaching s_left, then continue with the elements in S_right starting from s_right.
    The recursion should be terminated as soon as reaching a subset with only constantly many nodes, for which the convex hull can be determined by enumerating all possibilities. It is easy to verify that the sequential algorithm runs in O(n * log n) time. The only non-trivial point is how to determine the upper common tangent. This can be done by a variant of binary search in O(log (n_1 + n_2)) time for sets with n_1 and n_2 elements, respectively. The initial sorting is not really needed, but makes it easier to repeatedly select the median in the subsets of points. In a randomized variant which is very similar to quicksort the points do not need to be sorted first.

    Later we will consider how to perform parallel sorting. It is rather simple however to perform it in O(log^2 n) time and O(n * log n) work on an EREW PRAM with n / log n PUs. All other steps of the algorithm except for determining the convex hulls of S_upper and S_lower can be performed in O(log n) time and O(n) work or less. Because the set of numbers has been sorted before, constructing S_left and S_right is trivial. The same is true for constructing the whole convex hull in the last step. For the time being we are happy with using the sequential algorithm for finding the upper common tangent. So, the time and work for the recursive part are given by:

    T(n) = O(log n) + T(n / 2),
    W(n) = O(n) + 2 * W(n / 2).
    The solution of these recurrencies is T(n) = O(log^2 n) and W(n) = O(n * log n). Because of this T, we have not put made an effort to push the time for the non-recursive part below O(log^2 n). The work of this parallel algorithm is optimal, but the time consumption can be reduced to O(log n).

    Partitioning

    The partitioning technique consists of breaking up a problem in subproblems of almost equal size and solving these subproblems concurrently and independently. Here we will show how this idea can be used in a parallel merging algorithm.

    Merging is the process of turning two sorted arrays a[] and b[] of length n each into one array c[] of length 2 * n containing all values of a[] and b[] in sorted order. The rank of an element x in a set X, denoted rank(x, X), is defined to be the number of elements y in X with y < x. Here it is convenient to assume that all values are different, otherwise the index of the position in which a value is stored may be taken as a secondary criterion. The rank of an element x in A union B is given by rank(x, A) + rank(x, B). rank(a[i], A) = i and rank(b[i], B) = i, for all 0 <= i < n. Thus, it suffices to determine rank(a[i], B) and rank(b[i], A), for all 0 <= i < n, because once all ranks in A union B are known, it takes constant time and linear work to write the numbers of a[] and b[] in c[].

    A trivial and fast way to determine all ranks is to perform binary search: for each i, rank(a[i], B) and rank(b[i], A) can be determined in O(log n) time and O(log n) work. Thus, merging can be performed in O(log n) time and O(n * log n) work. This time is satisfactory, but the superlinear work should be reduced.

    The obvious improvement is to not determine the rank for all numbers but only a fraction. The largest fraction that can be used so that the work is still bounded by O(n) is n / log n. So, take all numbers a[j * log n] and b[j * log n], for 0 < j < n / log n, and rank them in the other set. This takes O(log n) time and O(n) work. This initial ranking partitions the problem of merging a[] and b[] into small subproblems: there are 2 * n / log n - 2 values for which we know the positions in both a[] and b[]. These subdivide each of the arrays in subsets of size at most log n which have to be merged pairwise. For each of these subproblems a sequential algorithm requires O(log n) time and work linear in the sum of the sizes of the subsets. So, the partitioning allows to solve the whole problem in O(log n) time and O(n) work.

    Merging Using Partitioning

    Pipelining

    Pipelining is the process of breaking up a task in subtasks which can be performed after each other so that several of these problems can be overlapped in a time-shifted way. We have encountered this idea already for some hypercube algorithms and also the systolic matrix product algorithm is a pipelining algorithm. The essence of pipelining is the time-shifted processing.

    As a further example we consider the problem of inserting k new numbers b_i, 0 <= i < k, into a 2-3 tree with n nodes with keys, a_i, 0 <= i < n, in sorted order. It is assume that k < n, otherwise it is more efficient to build a new 2-3 tree from scratch. Sequentially k numbers can be inserted in O(k * log n) time, and in general this cannot be done much faster, because on the one hand Omega(k * log k) is needed because as a result of the insertions the numbers get sorted, and if log k = o(log n), then it may happen that after the highest log k levels of the tree all paths are independent and must be followed to find the positions where to insert the elements. These remaining paths have length Omega(log n - log k) = Omega(log n) each. So, in our parallel algorithm we are happy with O(k * log n) work, but we would like to reduce the time as far as possible, that is, to O(log n).

    First the smallest and largest number, which can easily be determined with linear work in O(log k) time, are inserted in the normal sequential fashion. This is done to eliminate some special cases: after this operation, we know that any element to insert fits between two consecutive leafs of the tree. The insertion takes O(log n) time and work. Then for all k - 2 other b_i the leaf a_j of the tree is determined after which b_i has to be inserted. A chain is the ordered subset B_j of the b_i which fits between a_j and a_{j + 1}. Let k_j = |B_j|. sum_j k_j = k - 2. First consider the special case k_j <= 1, for all j. In this case, the nodes of the tree may get degree larger than 3, but never more than 6. So, processing the tree bottom-up processing all insertions at the same level in parallel, the whole tree can be processed in O(log n) time with O(k * log n) work. Here it is essential that any node gets split only once and that therefore the property that between any two existing nodes at most one new node has to be inserted also holds at the higher levels of the tree. We also use that any node can be processed in constant time.

    The above we call the basic algorithm which will be used as a subroutine for the general case. If k_j > 1, we do not insert all elements in one stroke, but rather first insert only the middle element of the chain. At this point the lists should have been sorted. If we are lucky all lists are short, but there might be one chain with k - 2 elements. In that case we need an optimal sorting algorithm running in logarithmic time. Inserting the middle element of a chain splits the chain in two. So, after at most log k of these operations all chains have length 1 or 0 and we are done. Thus, the general problem can be solved easily in O(log k * log n) time.

    The final improvement comes by observing that there is no neeed to wait with the next application of the basic algorithm until the previous one has been completed. These operations can also be overlayed in such a way that the next insertion starts at the lowest level as soon as the previous insertion has finished there. All these waves run up to the root in O(log n) time each. So, the whole algorithm has running time O(log k + log n) = O(log n). The work is unchanged by this overlaying.

    Accelerated Cascading

    A crucial technique, both theoretically and practically is the idea to use several algorithms to solve a single problem. The typical application of this idea is found for problems whose size is iteratively reduced. For such problems, as long as the problem size is large a work-optimal algorithm is used, but once the problem size has been reduced sufficiently another algorithm is used which is sub-optimal but runs faster. In exceptional cases even three algorithms are combined to minimize the time while still achieving optimal work. In the book "Parallel Algorithms" by JaJa, such a construction is called accelerated cascading. In the course of the text we will encounter several examples. A trivial subcase of accelerated cascading is when the last applied algorithm is a sequential one.

    Consider the problem of computing the maximum of n numbers stored in an array a[]. On a Common CRCW PRAM with n * (n - 1) PUs this problem can be solved in O(1) time. The PUs are indexed PU_{i, j}, 0 <= i < n, 0 <= j < n - 1. In addition to the integer array a[] there is a boolean array b[]. The algorithm is simple:

      1. for (all i, 0 <= i < n, using  PU_{i, 0}) do
           b[i] = true;
      2. for (all i, j, 0 <= i < n, 0 <= i < n - 1, using PU_{i, j}) do
           if (a[i] < a[j] || (a[i] == a[j] && i < j))
             b[i] = false;
      3. for (all i, 0 <= i < n, using  PU_{i, 0}) do
           if (b[i])
             max_index = i;
    
    Clearly the algorithm can be performed in constant time on the specified hardware. In step 1 each of the b[i] is written by a single PU. In step 2 b[i] may be written by up to n - 1 PUs but the written value is always false. By adding the clause a[i] == a[j] && i < j it is assured that if the maximum value occurs several times only one of them survives. In step 3 this single surviving index is written to max_index.

    The algorithm shows that the maximum can be computed very fast, but it is not optimal. However, we know optimal algorithms which can be combined with it. We summarize the algorithms:

    Suppose we have 2 * n PUs. Then for each subset of 2 elements there are 4 PUs, so the above algorithm can be applied. Hereafter there are still n / 2 numbers. For each subset of 4 PUs there are 16 PUs, so again the above algorithm can be applied. More generally, we have

    Lemma: After t >= 0 rounds there are still n_t = n / 2^{2^t - 1} numbers from among which the maximum has to be determined.

    Proof: The proof goes by induction. For t = 0, we have n_0 = n / 2^0 = n. So, assume the lemma holds for some t. The remaining n_t numbers are divided in subsets of size 2^{2^t}. For each of these subsets there are 2 * n * 2^{2^t} / (n / 2^{2^t - 1}) = (2^{2^t})^2 PUs available. So, the algorithm can be run again, and the number of candidates is reduced to 2^{2^t - 1} / 2^{2^t} = 2^{2^{t + 1} - 1}. End.

    So, with n PUs the maximum can be computed in O(loglog n) time. This is much more efficient than before and still very fast, but not yet optimal. If we take n / loglog n PUs, then first each PU can compute the maximum of 2 * loglog n numbers in O(loglog n) time. Then the fast-but-not-optimal algorithm is used to complete the task.

    Theorem: On a Common CRCW PRAM with n / loglog n PUs the maximum of n numbers can be computed in O(loglog n) time.

    The given algorithm is not such a spectacular application of accelerated cascading. It can be made more interesting by performing several rounds of the hypercube algorithm before changing to the CRCW algorithm. This saves some PRAM steps, because the reduction with the CRCW algorithm is not particularly efficient when the subsets have size 2 or 4, but it has no impact on the asymptotic running time.

    We want to point out that the given algorithm is not obtained by a simple application of the WT-scheduling principle: the work of the algorithm with n PUs is n * loglog n because all PUs are active in all rounds. So, it is not because of the scheduling principle that fewer PUs were taken. The decision rather was inspired as follows: considering the loglog algorithm, we wondered for what n' the work of this algorithm is still bounded by O(n). n' = n / loglog n is the largest such value. Then we wondered whether there is another algorithm which allows to reduce the problem size from n to n' with O(n) work in O(loglog n) time. In this case this other algorithm was the sequential algorithm applied to small subsets.

    Symmetry Breaking

    Often symmetry is a desirable property which can be exploited. At the same time symmetry can be a problem when we want to reduce the problem size by choosing a subset: how can we single out a subset if all elements look the same? This is the problem of symmetry breaking and the key to its solution is to either use randomization, or to use the indices if these are available.

    As an example we consider the problem of coloring the nodes of a directed cycle of length n. A node coloring of a graph with c colors, is an assignment of numbers c_i 0 <= c_i < c, to the nodes of the graph so that for any two adjacent nodes i and j, c_i != c_j. Coloring a list sequentially is simple: we can start at any node, coloring the nodes with colors 0 and 1 until we reach the last node, which has to be colored with color 2 if the number of nodes is odd.

    When considering how to solve the problem in parallel, there is the problem that it does not appear to make sense to start coloring everywhere. In this section we show how to efficiently construct a three-coloring in a deterministic way. Of course randomization allows to solve the problem more easily, but it is good to also know how such problems can be solved deterministically.

    The main idea of the parallel algorithm is to start with a correct coloring with many colors and to gradually reduce the number of used colors. A trivial initial coloring is given by c_i = i for all i, 0 <= i < n. The algorithm proceeds in rounds. In any round, all nodes are recolored in parallel as follows: for any node i with color c_i and successor j, the new color is given by c'_i = 2 * dif(c_i, c_j) + bit(dif(c_i, c_j), c_i). Here dif(x, y) gives the index of the least-significant bit which is not the same in x and y and bit(b, x) gives bit b of x, where bit 0 is the least significant one. The computation is simple. dif(,) is the only operation which is not a standard hardware operation, but it might have been so. Much harder operations, such as multiplication, can be performed in one clock cycle, so it is reasonable to assume that even dif(,) can be performed in constant time.

    We must check that the new coloring is correct. The proof of this goes by contradiction. So, assume that c'_i = c'_j. Let node k be the successor of j, then we must have dif(c_i, c_j) = dif(c_j, c_k) and bit(dif(c_i, c_j), c_i) = bit(dif(c_j, c_k), c_j). Substituting the first in the second gives bit(dif(c_i, c_j), c_i) = bit(dif(c_i, c_j), c_j), a contradiction with the definition of dif(,), which in particular should give a bit position for which the two numbers are different. Because the previous coloring was correct, such a bit must exist.

    Node Coloring

    If before the recoloring the largest color is c, requiring round_down(log c) + 1 bits, numbered 0, ..., round_down(log c), then afterwards the largest color is at most 2 * round_down(log c) + 1. From this it follows that in O(log* n) rounds the number of colors is reduced from n to 5. To further reduce the number of colors, we apply

      for (c = 5; c > 2; c--)
        for each node i with color c_i == c, recolor i with the 
        smallest color c'_i which does not conflict with the color
        of its predecessor or its successor.
    

    Theorem: A three-coloring of a cycle can be constructed in O(log*^ n) time and O(log^* n * n) work.

    A (purely theoretical) weakness of the above algorithm is its non-linear work. The algorithm can be made work-optimal at the expense of increasing the time consumption. The above algorithm is performed only once, reducing the number of colors to O(log n). For the special case that we have a set of n numbers with maximum value bounded by O(log n), the sorting problem can be solved in O(log n) time with O(n) work. Sorting means in this case creating buckets containing the indices of all nodes with a given color. After this sorting, the colors can be handled one-by-one as was done before to reduce the number of colors from 5 to 3. Because there are only O(log n) colors and each node has to be processed only once due to the sorting, this recoloring takes O(log n) time and O(n) work. If the super-fast algorithm is applied twice, the number of colors is reduced to O(loglog n). This has a positive impact on the cost of the final recoloring, but the sorting still takes O(log n) time and therefore this does not lead to a smaller asymptotic running time of the entire algorithm.

    Exercises

    1. Due to its simple structure parallel quicksort is very suitable for implementation on interconnection networks.
      • Specify missing details and analyze the algorithm for sorting n^2 numbers on an n x n mesh.
      • Specify missing details and analyze the algorithm for sorting n = 2^k numbers on a k-dimensional hypercube.
      • Consider making the hypercube algorithm work-optimal. In other words, consider sorting n * log n numbers on a k-dimensional hypercube. Which communication model do you need: one-port or full-port?

    2. For numbers x and n, we want to compute x^i for all 0 <= i < n. Describe how to do this in O(log n) time and O(n) work. This leads to an efficient and fast way to evaluate a polynomial of degree n - 1 given by an array of coefficients a[i]. Write down the complete algorithm.

    3. There are arrays a[] and b[] of length. a[i] contains integers and b[i] booleans. If b[i] = true, the interpretation is that a new segment starts for index i. The task is to compute the segmented prefix sums, that is, for any i, we want to compute the value s[i] = sum_{j = i_0}^i a[j], where i_0 <= i is the largest value with b[i_0] = true. Give an algorithm that solves this problem in O(log n) time and O(n) work on an EREW PRAM.

    4. Given an array a[] of length n we want to compute the prefix minima, the values m[i], 0 <= i < n, defined by m[i] = min_{j <= i} a[j]. Give an algorithm that solves this problem in O(log n) time and O(n) work on an EREW PRAM.

    5. Describe in some detail how to compute the upper common tangent of two upper hulls S_1 and S_2, with all elements of S_1 lying in the plane to the left of those of S_2. The algorithm should run in O(log n) time, where n = |S_1| + |S_2|.

    6. Give the smallest numbers n_t, 1 <= t <= 5, for which the recoloring algorithm needs t rounds before reaching a situation with 5 colors.

    7. Describe how to efficiently build a 2-3 tree from an array of n elements which stand in the array in sorted order in parallel. Specify the required PRAM submodel and the time and work of the algorithm.

    8. Show that the maximum of n numbers can be determined in O(1) time on a PRAM with n^{1 + eps} PUs for any eps > 0. Which PRAM model do you need?

    9. We consider the problem of sorting n numbers a[i], with 0 <= i < n, with values in a small range: 0 <= a[i] < m, for all i.
      • Give an algorithm for the case m = log n which runs in O(log n) time and O(n) work on an EREW PRAM. Hint: let S_x, 0 <= x < log n, be the subset of indices i, 0 <= i < n, with a[i] = x. First determine for all x and for all i in S_x, the rank of i within S_i.
      • Generalize the algorithm for arbitrary m and specify the corresponding time consumption and work in terms of n and m. Hint: use radix sort.

    
    
    
    

    Sorting

    Transposition Sort

    The first sorting algorithm of this chapter is particularly simple but not very efficient. It is a useful subroutine though and may applied in small subnetworks. It is known under the name odd-even transposition sort, mostly abbreviated to just "transposition sort". It is very similar to bubblesort, only the order in which the comparisons are performed is different.

    Linear Array

    On a linear array with n PUs, n even, of which each PU holds one number, the algorithm consists of n rounds. In each round the numbers in pairs of consecutive PUs are compared, and if the number in the PU with smaller index is larger, then they are exchanged. Comparing and possibly exchanging the numbers in PU_i and PU_j will be denoted by compare_swap(i, j). In terms of this operation, the transposition-sort algorithm can be written down concisely:
      for (t = 0; t < n / 2; t++)
      {
        for (i = 0; i < n / 2; i++) // Even
          compare_swap(2 * i, 2 * i + 1);
        for (i = 1; i < n / 2; i++) // Odd
          compare_swap(2 * i - 1, 2 * i);
      }
    

    The correctness of the algorithm can be proven by induction. The idea underlying the argument is that the largest number, once it starts moving never gets delayed. It reaches its destination after at most n - 1 steps. The second largest number may be swapped with the largest number if the first step, and then it has to wait one step. Hereafter it will never be delayed anymore, so it arrives no later then after n steps. The operations in each of the inner loops can be executed in parallel. On an interconnection network, the simplest is to implement a compare-swap operation by exchanging the numbers to be compared and to discard one of them in each of the two involved PUs. So, on a linear array, the algorithm can be implemented with n steps in which each PU sends at most one number to a neighbor. Odd-even transposition sort is simple but inefficient: as a sequential algorithm it has complexity Theta(n^2) for sorting n numbers.

    Odd-Even Transposition Sort

    DMM

    On a DMM with P PUs which each hold k = n / P numbers, there are several possible generalizations of the above algorithm. The simplest is to let each PU first sort all k numbers it holds. Then it repeatedly copies all its k numbers to one of its neighbors. The 2 * k numbers are merged and the PU with the smaller index retains the k smallest numbers, while the other PU retains the k largest. The time consumption of this algorithm is given by
    T(P, n) = c_i * O(n / P * log n + n) + c_v * n + c_a * P.
    For small P and very large n, the term c_i * O(n / P * log n) may dominate. In that case the algorithm gives good performance, but this applies only for n which are exponential in P. In the following we will see that good performance can be achieved for much smaller values of n as a function of P.

    Mesh

    Sorting n numbers on a DMM or interconnection network with P PUs consists of two subproblems:
    1. Determine for all numbers their rank;
    2. Route the numbers (and associated information) so that the number with rank r, 0 <= r < n, comes to stand in PU_{r / k}, for k = n / P. (Sometimes the number with rank r should be routed to PU_{r mod P}).
    Sometimes these problems are solved separately, but often better performance is obtained by integrating them. For example, in the odd-even transposition sort algorithm, the rank of the numbers is never determined explicitly.

    In the DMM model, the structure of the interconnection network is ignored. On interconnection networks, there is a structure, and the indexing scheme has impact on the efficiency with which sorting can be performed. Generally speaking, it is convenient if PUs for which the indices do not differ much also lie close to each other in the network. Nevertheless, sorting with respect to any indexing scheme which is fixed in advance is at least as hard as performing a k-k routing: if the hardest permutation pi routes a packet from PU_i to PU_j, then when sorting the value of the corresponding number in PU_i is set to j and this number will have to be routed from PU_i to PU_j in order to perform the sorting. The indexing scheme may have importance, but it should be noticed that the difference between the best and the worst scheme is limited: first sorting according to the best indexing scheme and then routing the numbers so that they come to stand according to the desired scheme costs at most twice as much as just sorting.

    The simplest algorithm for sorting on a mesh is obtained by embedding a linear array into it. This algorithm sorts n^2 numbers on an n x n mesh in n^2 steps. This poor performance is due to the fact that the two-dimensional structure of the mesh is ignored. Ideally, if a problem can be solved in O(n) time on a linear array with n PUs it can also be solved on an n x n mesh in O(n) time. A natural way to obtain mesh algorithms is by combining operations on rows and columns. For sorting, it appears natural to consider algorithms which consist of an alternation of one-dimensional sorting phases, using odd-even transposition sort.

    The shearsort algorithm, consists of an alternation of operations in which all columns are sorted in increasing order from top to bottom, and operations in which the even rows, that is the rows with index 0, 2, ..., n - 2, are sorted from left to right while the odd rows, with index 1, 3, ..., n - 1, are sorted from left to right. Only in the final sorting of the rows, the sorting direction can be chosen freely. Performing this operation as all others, the numbers are sorted in snake-like row-major order. When sorting all rows increasingly from left to right we get row-major order. The first alternative gives the following extremely simple algorithm:

      for (t = 0; t <= t_max(n); t++)
      {
        sort all columns from top to bottom;
        sort even rows from left to right and odd rows from right to left;
      }
    
    If all PUs hold one number, each pass of the loop takes 2 * n steps, so the whole algorithm runs in 2 * n * (t_max(n) + 1) steps. The question is how large t_max(n) should be chosen to guarantee that any distribution gets sorted. The analysis is based on two lemmas.

    0-1 Sorting Lemma: If a comparison-based sorting algorithm correctly sorts any distribution consisting of 0-s and 1-s, it correctly sorts any distribution with arbitrary numbers.

    Proof: The proof goes by contradiction. Assume there is a set of numbers X = {x_0, ..., x_{n - 1}} for which the algorithm is not correct. Let pi'() be the inverse of the permutation realized by the algorithm. That is, afterwards the numbers have been rearranged to x_{pi'(0)}, ..., x_{pi'(n - 1)}. Let pi() be the inverse of the permutation giving a correct sorting of the numbers. That is, pi() is so that x_{pi(0)} <= x_{pi(1)} <= ... <= x_{pi(n - 1)}. Let k be the smallest index so that x_{pi'(k)} != x_{pi(k)}. All numbers with values smaller than x_{pi(k)} belong to the set {x_{pi(0)}, ..., x_{pi(k - 1)}} = {x_{pi'(0)}, ..., x_{pi'(k - 1)}}, so we must have x_{pi'(k)} > x_{pi(k)}. Because pi() is a permutation, this implies that there also must be an index l > k, so that x_{pi'(l)} = x_{pi(k)}. Now define the set of 0-1 values X' = {x'_0, ..., x_{n - 1}} by x'_i = 0 if x_i <= x_{pi(k)}, x'_i = 1 if x_i > x_{pi(k)}. Because for all 0 <= i, j < n, x_i < x_j => x'_i < x_j, the algorithm works on X' as it did on X (except that possibly some elements which now have the same value do not get exchanged). So, on X', the algorithm will produce the following sequence of values as output: ..., x'_{pi'(k)}, ..., x'_{pi'(l)}, ... = ..., 1, ..., 0, ... . That is, at position k there occurs a 1 while at position l > k there occurs a 0: the sorting is not correct, a contradiction with the assumption that it correctly sorts all 0-1 distributions. End.

    A row is called clean if it only contains 0-s or only 1-s. Otherwise a row is called dirty. For the analysis it is convenient to rewrite the algorithm as follows:
      sort all columns from top to bottom;
      for (t = 0; t < t_max(n); t++)
      {
        sort even rows from left to right and odd rows from right to left;
        sort all columns from top to bottom;
      }
      sort even rows from left to right and odd rows from right to left;
    
    Hereafter we will refer to this variant.

    Lemma: For any 0-1 number distribution on an n x n mesh, after t, 0 <= t <= log n, passes through the loop in shearsort there are still at most n / 2^t dirty rows.

    Proof: The purpose of the first sorting of the columns is that it establishes the property that for all i, 0 <= i < n - 1, the number of 0-s in row is at least as large as in row i + 1. This is true because if row i + 1 would hold more 0-s than row i, there would be a position in row i with a 1 where row i + 1 has a 0, contradicting the sorting of the columns.

    For t >= 0 we use induction. For t = 0 the claim holds. So, assume that the claim holds for some t >= 0. Sorting the clean rows does not change anything. So, we may focus on the n_t dirty rows. Let z_i, 0 <= i < n_t, denote the number of 0-s in the i-th of these rows. z_i is monotonously decreasing. Let m_0 = |{0 <= i < n_t| z_i >= n / 2}| and let m_1 = n_t - m_0. We claim that after pass t + 1, at least round_down(m_0 / 2) of these rows only contain 0-s, and at least round_down(m_1 / 2) only 1-s. This is a direct consequence of the the monotonicity of z_i and the fact that even rows are sorted differently from odd rows. Thus, any two consecutive rows with z_i >= n / 2 lead to one row with only 0-s. There are several cases to distinguish. If n_t is odd, then either m_0 is even or m_1 is even and thus round_down(m_0 / 2) + round_down(m_1 / 2) = n_t / 2. If n_t is even, then if m_0 and m_1 are even, round_down(m_0 / 2) + round_down(m_1 / 2) = (m_0 + m_1) / 2 = n_t / 2. Only the case that both m_0 and m_1 are odd deserves special consideration. So consider row m_0, which is sorted from right to left and row m_0 + 1 which is sorted from left to right. If z_{m_0} + z_{m_0 + 1} >= n, then this results in one more row with only 0-s, otherwise in one more row with only 1-s. So, in this case at least round_down(m_0 / 2) + round_down(m_1 / 2) + 1 = n_t / 2 new clean rows are generated. End.

    Theorem: Shearsort sorts any distribution of n^2 numbers on an n x n mesh in (2 * log n + 2) * n steps.

    The time consumption can be improved by almost a factor two, by realizing that the sorting in the columns does not need to be performed for n steps in the later passes of the loop: if there are only n_t rows which might contain 0-s and 1-s, it suffices to perform n_t steps. Thus, all sorting in the columns can be reduced to n + n + n / 2 + ... + 2 = 3 * n - 2. So, the total sorting time can be reduced to (log n + 4) * n - 2 steps. Even this is not such a good result, but it might serve as basis case of a recursive algorithm. The question is whether it is better than the result obtained by trivially embedding a linear array. The times that should be compared are given in the following table, the results are not very impressive:

    n 2 4 8 16
    Transposition Sort 4 16 64 256
    Shearsort 8 22 54 126

    Shearsort

    Merge Sort

    Searching

    The most important ingredient in the subsequent merging algorithm is a subroutine for localizing an element in a sorted array of length n. So, for a sorted array a[] and a value x the task is to determine an index i so that a[i] <= x < a[i + 1]. Sequentially this variant of searching can be solved in O(log n) time using binary search. We could also call this the problem of ranking x in the set of elements in a[]. In binary search in each round the size of the remaining problem is reduced by a factor two. If we have P PUs, the problem size can be reduced by a factor P in each round.

    The algorithm is very simple. Initially x may lie in the whole interval ranging from 0 to n. This interval is reduced in several rounds. In each round the remaining interval is divided in P intervals I_i, 0 <= i < P, of approximately equal size. PU_i, 0 <= i < P, is in charge of I_i. Each of the PUs tests whether x lies in its interval. Assuming that the values are different (otherwise the smallest should be taken) there is a unique interval in which x lies. The PU in charge of this interval writes its index to a global variable c. All PUs read this variable and can start the next round. On an CREW PRAM, each round can be performed in O(1) time. So, the whole searching algorithm runs in O(log_P n).

    Theorem: On a CREW PRAM with P PUs, an element x can be ranked in a sorted set of size n in O(log_P n) time.

    An important special case is that P = n^eps for some eps > 0. In that case log_P n = 1 / eps = O(1). So, with sufficiently many PUs an element can be ranked in constant time.

    Merging through Ranking

    In the chapter on "Basic Techniques", we have seen how two sorted arrays of length n can be merged in O(log n) time with O(n) work. This immediately leads to a parallel sorting algorithm running in O(log^2 n) time and O(n * log n) work. In the following we will show that two sorted arrays of length n can actually be merged in O(loglog n) time. This leads to a work-optimal O(log n * loglog n) sorting algorithm.

    For two sorted arrays, a[] of length m and b[] of length n, let rank(a, b) denote the array (r_0, ..., r_{m - 1}) of length m with r_i = rank(a[i], b), the rank of a[i] in b[]. Assume that we have a CREW PRAM with n PUs, then if m = O(n^{1 - eps}), for some eps > 0, rank(a, b) can be determined in constant time, because for each of the m elements of a[] we can use n^eps PUs to perform the ranking.

    Again we apply the partitioning strategy. For merging two sorted arrays a[] and b[] each of length n with n PUs, we select the subarrays a'[] and b'[] of length sqrt(n) with a'[i] = a[sqrt(n) * i] and b'[i] = b[sqrt(n) * i] (it is convenient to add one more element to each subarray with value infinity). With the available PUs rank(a', b) and rank(b', a) can be computed in constant time. The resulting partitioning reduces the merging problem to merging subarrays of a[] and b[] of length at most sqrt(n). Before we reduced the problem in one stroke to subproblems which could be solved sequentially, now we continue recursively until the size has become constant. Each phase of the recursion takes O(1), so the whole merging takes O(loglog n) time.

    Lemma: On a CREW PRAM two sorted arrays of length n each can be merged in O(loglog n) time with linear work.

    Proof: Above we have shown how to merge two sorted arrays a[] and b[] of length n on a CREW PRAM with n PUs in O(loglog n) time. The work of this algorithm is O(n * loglog n). This algorithm can be made work-optimal by some kind of partitioning. So, we first determine subarrays a'[] and b'[] each of length n / loglog n by taking a'[i] = a[i * loglog n] and b'[i] = b[i * loglog n]. a'[] and b'[] are merged with n / loglog n PUs in O(loglog n) time. Just as in all previous partitionings, as a result it remains to merge sections of a[] and b[] each of length at most loglog n. With one PU for each subproblem, each of these can be solved with the sequential algorithm in O(loglog n) time. End.

    Using the above fast and efficient merging as a subroutine in a merge sort algorithm gives

    Theorem: On a CREW PRAM with n / loglog n PUs n numbers can be sorted in O(log n * loglog n) time.

    O(log n * loglog n) is not the best possible result: sorting can be solved in O(log n) time and O(n * log n) work on an EREW PRAM. This optimal algorithm, which is known in the literature as Cole's merge-sort algorithm, is considerably more complicated and practically a gap of O(loglog n) is hardly significant. A more serious drawback of the presented algorithm is that it is based on the parallel-search algorithm which really needs the CREW model. The parallel quick-sort algorithm can also be refined to run in O(log n) time with high probability.

    Bitonic Merge Sort

    Bitonic merge sort is based on the so-called bitonic merge operation. Except for the description and the correctness proof of this operation all is trivial: sorting n = 2^k numbers consists of k rounds, in round i, 1 < i <= k, n / 2^{i - 1} sorted subsequences of length 2^{i - 1} are pairwise merged to obtain sorted subsequences of length 2^i. This algorithm (just as the odd-even merge presented in the next section), in a version for sorting networks, was presented by Batcher already in 1968. The strong point of this sorting procedure is that it is very suitable for a hard-wired implementation using a sorting network. Even the simplest formulation of the algorithm is non-recursive and insitu.

    Algorithm

    Assume we have an array a[] of length n. Assume that the numbers in a[] from a[0] to a[n / 2 - 1] are increasing, while the numbers from a[n / 2] to a[n - 1] are decreasing. Then the numbers can be made to stand in increasing order in the whole array by running the following procedure:
      void bitonicMergeUp(int[] a, int n) {
        for (int s = n >> 1; s > 0; s = s >> 1)
          for (int i = 0; i < n; i += 2 * s)
            for (int j = i; j < i + s; j++)
              compareSwapUp(a, j, j + s); }
    
    Here compareSwapUp(a, i, j), compares the values of a[i] and a[j] and exchanges them if a[i] > a[j]. The procedure bitonicMergeDn() is analogous. These procedures integrated in a running program can be downloaded here.

    Bitonic Merge Sort

    The correctness is not proven here. The invariant that must be proven is that after round r, k >= r >= 0, all numbers in a subset of size 2^r with bit r equal to 0 are smaller than all numbers in the corresponding subset with bit r equal to 1. Because later comparisons are performed only within the subsets, the earlier established properties are preserved. Because all properties together imply a sorted sequence, proving the invariant implies the correctness.

    Work-Time Framework

    Here we give a precise expression for the number of performed comparisons. Let T_merge(2^k) be the time to merge two sorted sequences of length 2^{k - 1}. From the algorithm it immediately follows that
    T_merge(2^k) = k / 2 * 2^k, for all k >= 1.

    For T_sort(2^k), the time to sort 2^k numbers based on this merge operation, we get

    T_sort(2^1) = 1,
    T_sort(2^k) = 2 * T_sort(2^{k - 1}) + T_merge(2^k), for all k > 1.
    Using induction it is easy to verify, that the solution is given by
    T_sort(2^k) = k * (k + 1) / 4 * 2^k, for all k >= 1.
    Another important point is the depth of the schedule. For bitonic merge sort, this is simply sum_{i = 1}^k i = k * (k + 1) / 2 = T_sort(2^k) / (2^k / 2).

    Theorem: Using bitonic merge sort for sorting n numbers, the work is Theta(n * log^2 n). The time is Theta(log^2 n).

    Interconnection Networks

    On interconnection networks, the compare-swap operation for two numbers a and a' residing in PU p and p', respectively, can be performed in two substeps:
    1. PU p sends a copy of a to p', PU p' sends a copy of a' to p.
    2. PU p and p' compare a and a' and decide which number to keep and which number to discard.
    In a DMM implementation the following might be better:
    1. PU p sends a copy of a to p'.
    2. PU p' compares a and a'. If it should keep a it sends a copy of a' to p, otherwise it simply discards a and keeps a'.
    At the expense of one more routing operation, this reduces the average routing volume from 2 to 3/2 per compare-swap.

    Bitonic merge sort is very suitable for implementation on a hypercube: in all steps number a[i] is compared with numbers a[j], where j = i +- 2^r, for some r, 0 <= r < k. This means that, when mapping the numbers of a[] in a straightforward way to the 2^k PUs of a k-dimensional hypercube, all communication takes place between neighbors, and that thus all steps can be performed in constant time.

    Theorem: Using bitonic merge sort, 2^k numbers can be sorted in Theta(k^2) time on a k-dimensional hypercube.

    Asymptotically this is not the best achievable, but the algorithm is extremely simple and, as seen above, the constants hidden in the Theta-notation are very small.

    Surprisingly bitonic merge sort is asymptotically optimal on linear arays and meshes. The analysis is most easy for linear arrays. Merging two subsequences standing each in m consecutive PUs takes m + m / 2 + + ... + 1 = 2 * m - 1 steps. So, for a linear array with n = 2^k PUs all merging rounds together take sum_{0 <= s < k} (2 * 2^s - 1) = 2 * sum_{0 <= s < k} 2^s - k = 2 * n - log n - 2 < 2 * n steps. This is not bad at all! Of course, for linear arrays odd-even transposition sort performs fewer routing steps, but at the expense of far more comparisons.

    On a two-dimensional n x n mesh n^2 numbers can be sorted in 2 * n + o(n) steps, which is optimal, but the algorithm is complicated and the lower-order term makes this algorithm unsuitable for moderate values of n. How about bitonic merge sorting? It is essential that the PUs are indexed with the shuffled row-major indexing. In that case the distance between PU i and j with j = i +- 2^r, for some r, 0 <= r < k, is exactly 2^{round_down{r / 2}}. Because such PU i and j lie in the same row or column, the routing can be performed without delay. Thus, merging two subsequences of length 2^s each takes 1 + 1 + 2 + 2 + ... + 2^round_down{(s - 1) / 2} + 2^round_down{s / 2} < 4 * 2^{s / 2}. For the number of steps T(n) for the whole sorting this gives T(n) < 4 * sum_{0 <= s < k} 2^{s / 2} < 8 * sum_{0 <= s < k / 2} 2^s < 8 * n. It is worthwhile to compute T(n) exactly. T(2) = 1 + (1 + 1) = 3. T(4) = T(2) + (2 + 1 + 1) + (2 + 2 + 1 + 1) = 13. T(8) = T(4) + (4 + 2 + 2 + 1 + 1) + (4 + 4 + 2 + 2 + 1 + 1) = 37. More generally, T(n) = T(n / 2) + 3 1/2 * n - 4. The solution is T(n) = 7 * n - 4 * log n - 7. As noticed above, asymptotically this is not competitive, but for small n it might be the best achievable. It is better than transposition sort and shearsort for all n.

    DMM

    Bitonic merge sort can even be applied on a DMM. The n = 2^k numbers are distributed evenly over the P PUs. Assume P = 2^p, for some p > 0. Each PU uses an efficient algorithm to sort its n / P numbers. Then it remains to merge these together. That is, it remains to perform the last p merging rounds. During these rounds the numbers have to be communicated 1, 2, ..., p times. Alternatively, all numbers are routed twice for each merging round: once bringing all number with common leading bits together, once bringing the numbers with common trailing bits together. Doing this, the total number of routing rounds is only 2 * p - 1 and the routing volume is bounded by (2 * p - 1) * P. If the network is sufficiently powerful, that is, if t_v is not too large in comparison with t_i, some speed-up may be achieved, but for DMMs we will encounter much better sorting algorithms.

    Odd-Even Merge Sort

    Odd-even merge sort is based on the so-called odd-even merge operation. The sorting proceeds just as with the bitonic merge: 2^k numbers are sorted in k merging rounds. Asymptotically the complexity is the same as for odd-even merge, but the number of performed comparisons is slightly smaller. For small problems there is a difference of more than 20%. The easiest formulation of odd-even merge sort uses a secondary array and recursion. These are not essential, but the pattern of positions to compare is slightly less beautiful then for bitonic merge sort, which reduces its suitability for implementation on interconnection networks.

    Algorithm

    Assume we have an array a[] of length 2^k. Assume that the numbers in a[] from a[0] to a[2^{k - 1} - 1] are increasing, and that the numbers from a[2^{k - 1}] to a[2^k - 1] are increasing as well. Then the numbers can be made to stand in increasing order in the whole array by performing the following steps:
      oddEvenMerge(int a[], int m) {
        if (m > 2) {
          int[] ae = new int[m / 2];
          int[] ao = new int[m / 2];
          for (int i = 0; i < m / 2; i++) {
            ae[i] = a[2 * i];
            ao[i] = a[2 * i + 1]; }
          oddEvenMerge(ae, m / 2);
          oddEvenMerge(ao, m / 2);
          a[0] = ae[0];
          for (int i = 1; i < m / 2; i++)
            if (ae[i] <= ao[i - 1]) {
              a[2 * i - 1] = ae[i];
              a[2 * i] = ao[i - 1]; }
            else {
              a[2 * i - 1] = ao[i - 1];
              a[2 * i] = ae[i]; } 
          a[m - 1] = ao[m / 2 - 1]; }
        else // m == 2
          if (a[0] > a[1]) {
            int x = a[0]; a[0] = a[1]; a[1] = x; } }
    
    This procedure integrated in a running program can be downloaded here. This program can be made considerably more efficient, by eliminating the copying operations.

    We consider the same example as before:

      a[]  initially: 10 12 14 16 20 25 32 34  8 13 24 26 28 36 38 40
      ae[] initially: 10 14 20 32  8 24 28 38
      ao[] initially: 12 16 25 34 13 26 36 40
      ae[] merged:     8 10 14 20 24 28 32 38
      ao[] merged:    12 13 16 25 26 34 36 40
      a[]  finally:    8 10 12 13 14 16 20 24 25 26 28 32 34 36 38 40
    

    Odd-Even Merge Sort

    The correctness is not proven here, but using the 0-1 lemma this is not hard. The key observation is that the estimate of the position of a number in the global array can be predicted with an accuracy of 1 from the position in ae[] and ao[].

    Work-Time Framework

    Here we give a precise expression for the number of performed comparisons. Defining T_merge and T_sort as above, we get
    T_merge(2^1) = 1,
    T_merge(2^k) = 2 * T_merge(2^{k - 1}) + 2^{k - 1} - 1, for all k > 1.
    The solution is
    T_merge(2^k) = (k - 1) / 2 * 2^k + 1, for all k >= 1.
    The expression for T_sort(2^k) is the same as before. The solution is
    T_sort(2^k) = ((k - 1) * k / 4 + 1) * 2^k - 1, for all k >= 1.

    The following table gives the number of comparisons performed by the bitonic and the odd-even merge sort algorithms. First the difference between the algorithms increases, then it gradually decreases again. Due to their O(log^2 n * n) complexity, neither of the algorithms should be used for large values of n. On the other hand, for small n, odd-even merge sort is close to optimal.

    n 1 4 8 16 32 64 128 256
    bitonic 1 6 24 80 240 672 1792 4608
    odd-even 1 5 19 63 191 543 1471 3839
    ratio 1 1.20 1.26 1.27 1.26 1.24 1.22 1.20

    Interconnection Networks

    The merge operations consist of two types of compara-swaps. In the first the data in some PU i with bit r equal to zero are compared with the data in a PU j, with j = i + 2^r. These operations can easily be performed on butterflies, hypercubes and meshes with shuffled row-major indexing. In the latter operations, the data in some PU i which has a bit pattern *...*01...1*...* are compared with the data in PU j with a pattern *...*10...0*...*. These PUs are not directly connected in butterflies and hypercubes. Also in meshes these PUs are further apart and the routing pattern is harder.

    DMM

    On DMMs the situation is different, as distance or the availability of connections does not play a role for them. However, even here bitonic merge sort is better, because it is no longer true that most compare-swaps can be performed internally in the PUs. In all of them the data of some PU in the lower half are compared with those of a PU in the upper half. This has a modest impact on the routing volume, but implies a considerable increase of the number of routing operations.

    Sample Sort

    Deterministic Sampling

    The partitioning strategy was illustrated with a fast parallel merging algorithm. This strategy can also be applied to obtain a very efficient parallel sorting algorithm. Actually, this might very well be the best algorithm for sorting on DMMs. For a parallel computer with P PUs, the following is one of several good implementations of the idea, it uses a parameter x the value of which will be settled later:
    1. Each PU sorts its k = n / P numbers.
    2. Each PU selects out off its k numbers the x - 1 numbers with ranks i * k / x, for all i, 0 < i < x, as presplitters.
    3. Gossip the presplitters, so that all PUs receive all P * (x - 1). presplitters.
    4. Each PU sorts the received presplitters using P-way merge sort.
    5. Each PU picks from the set of presplitters the elements with ranks i * (x - 1), 0 < i < P, as splitters.
    6. Each PU partitions its numbers in P buckets using the splitters. Denote the set of numbers in bucket i of PU_j by B_{i, j}.
    7. Perform an all-to-all routing so that PU_j, for all j, 0 <= j < P, receives all numbers in the buckets B_{i, j}, for all i, 0 <= i < P.
    8. Each PU sorts all received numbers using P-way merge sort.

    We analyze the algorithm for a DMM. The internal time consumption is dominated by the time for the several sorting and bucketing operations. It is given by c_i * O(x * log x + P * x * log P + k * log n). If we take care to select x so that P * x < k, then the time is dominated by the time for sorting k numbers. There are only two communication operations. Good performance can only be achieved if the all-to-all routing is performed in a direct way. Thus, the number of start-ups is log P + P - 1. The routing volume is (1 - 1 / P) * (P * x + k). If P * x < k, then this is dominated by the second term. So, under condition P * x = o(n / P), we get

    T(P, n) = (1 + o(1)) * (T_sort(1, n / P) + c_v * n / P + c_a * P).
    This is a great result, because it means that parallel sorting can be parallelized perfectly except for routing all numbers once through the network. This is the best one reasonably can hope for. The achievable speed-up mainly depends on the ratio of c_i and c_v, but of course, for very large n, the log-factor in the sorting time also plays a certain role. The speed-up is approximately given by c'_i * n * log n / (c'_i * n / P * log n + c_v * n / P) = P / (1 + c_v / (c'_i * log n)). Here c'_i is the constant so that T_sort(1, n) ~= c'_i * n * log n, which depends both on c_i and the applied sequential sorting algorithm.

    In the above analysis we silently assumed that the all-to-all routing is balanced and that finally all PUs receive approximately the same number of packets. This is not automatically true. There are two limitations:

    The first problem is not limited to the sorting problem. It can be overcome at some extra cost by either performing a two-phase randomized routing or by the online version of the off-line routing algorithm based on coloring a bipartite graph of reduced size.

    The second problem is more specific and requires an analysis. We compute an upper bound on the number of numbers that finally may end up in a single PU. So, the question is how many numbers may lie between any two consecutive splitters. Consider two splitters s = s_{i - 1} s' = s_i. Let R_j and R'_j denote the rank of s and s', respectively, in the set of numbers initially stored in PU_j. Analogously, let r_j and r'_j denote the rank of s and s', respectively, in the set of presplitters selected by PU_j. By definition of the splitters, sum_{j = 0}^{P - 1} (r'_j - r_j) = x. The number S_j of numbers routed to PU_j is given by S_j = sum_{j = 0}^{P - 1} (R'_j - R_j). Using that in each PU exactly k / x numbers lie between any two presplitters, we get the following estimates:

    S_j <= k / x * sum_{j = 0}^{P - 1} (r'_j - r_j + 1) = n / P + n / x.
    S_j >= k / x * sum_{j = 0}^{P - 1} (r'_j - r_j - 1) = n / P - n / x.

    So, there is an "unbalance" of O(n / x). Thus, in order to obtain the above claimed results we must take x = omega(P). Because at the same time we had the condition P * x = o(n / P), this gives the condition P^3 = o(n). This is quite a tough condition which is nevertheless satisfied for most practical values of P and n. By refining the procedure by which the splitters are selected from the presplitters, performing iterated subsampling the condition can be relaxed almost to P^2 = o(n), the typical condition which must be satisfied to obtain good performance on a DMM.

    For optimal performance, x should be chosen so that the sum of all additional costs together is minimized. On the one hand there are the costs for handling the presplitters and splitters, on the other hand the costs due to an unbalanced routing and sorting. The first contribution increases with x in the same way the second decreases with x. In such cases an approximation of the optimal choice can be found by equating both cost contributions and solving x. This gives

    (c_v + c_i * log P) * P * x = (c_v + c_i * log P) * n / x.
    This suggests to choose x = sqrt(n / P). For this particular choice, the total "loss" term is bounded by O((c_v + c_i * log P) * sqrt(n * P)), to be compared with the "useful" term Theta((c_v + c_i * log P) * n / P).

    The given algorithm does not guarantee that finally each PU holds the same number of numbers. Furthermore, even though the numbers stand sorted, their ranks are not known. This can easily be overcome: counting the number of numbers in each PU and computing the prefix sum of these numbers, the ranks of the numbers can be determined cheaply. From this it follows where they have to be stored. In the normal case that n / x < n / P, this can be achieved by a routing in which a small fraction of the numbers in PU_i has to be routed to PU_{i - 1} and/or PU_{i + 1}. Of course this correction can be performed along with the final sorting operations.

    Sampling through Randomization

    The above algorithm is good and can hardly be improved for a DMM context. However a simpler algorithm is efficient as well. Furthermore, this algorithm leads to the simplest optimal sorting algorithm for circular arrays, two-dimensional meshes and tori.

    DMM

    The algorithm is first formulated for a DMM with P PUs:
    1. Each PU creates P buckets and uniformly and independently allocates each of its k = n / P numbers to one of the buckets at random. Denote bucket i of PU_j by B_{i, j}.
    2. Perform an all-to-all routing so that PU_i, for all i, 0 <= i < P, receives all numbers in the buckets B_{i, j}, for all j, 0 <= j < P. Let k_i denote the number of numbers received by PU_i.
    3. Each PU sorts all its numbers.
    4. Each PU creates P buckets. PU_i allocates the number with rank r among its k_i numbers to bucket P * r / k_i. Denote bucket i of PU_j by B_{i, j}.
    5. Perform an all-to-all routing so that PU_i, for all i, 0 <= i < P, receives all numbers in the buckets B_{i, j}, for all j, 0 <= j < P. Let k_i denote the number of numbers received by PU_i.
    6. Each PU sorts all received numbers using P-way merge sort.
    7. Each PU exchanges some packets with the PUs with index one smaller and one larger to correct minor sorting faults and to establish the desired final situation in which each PU holds the same number of numbers.
    In this algorithm there is no global sample. Rather has each PU its own sample with expected size k = n / P. On a DMM this selection method is unnecessarily expensive, but on many interconnection networks all-to-all routing requires some kind of "randomization" (possibly this is done deterministically) anyway to assure a good routing time. So, here sampling and routing are performed in an integrated way. Or said otherwise, after the randomization, each PU holds approximately a fraction 1 / P of the numbers with destination in any of the P PUs. Different from list ranking, the problem treated in the next chapter, by sorting the numbers it can be locally guessed quite accurately in which PU a number belongs. Of course this estimate is not perfect However, if k is sufficiently large, it can be almost excluded that a number does not end up in its destination PU or one of its neighbors. Furthermore, for sufficiently large k there will only be o(n) packets which do not stand in their destination PU after the second all-to-all routing. Assuming that that this is granted, the time consumption on a DMM is given by
    T(P, n) = (1 + o(1)) * (T_sort(1, n / P) + c_v * 2 * n / P + c_a * 2 * P).
    The routing time is doubled due to the randomization, but this assures that we do not have to consider the possibility of an unbalanced routing pattern, these are resolved automatically. On the other hand, the internal time consumption is essentially the same. Possibly it is even slightly better because no time is waisted on handling the splitters.

    We now analyze the minimal value of n as a function of P to assure that the above result holds with high probability. The following points must be satisfied:

    The routing condition is the lightest. The expected number of packets in B_{i, j} is n / P^2. Chernoff bounds tells us that the deviation from the expected value is negligible as soon as the expected value exceeds log n. So, we must have n / P^2 = omega(log n). The second point is satisfied as a consequence. The last point is the most interesting, because it specifically has to do with the quality of the applied sampling technique. Under the above condition the values k_i deviate insignificantly, so in the further analysis we may do as if each PU has obtained a random sample of k = n / P numbers. Consider the k numbers in one of the PUs. Let r_i, 0 <= i < k, denote the local rank of number i, that is, the rank among the k numbers in this PU. Let R_i denote the global rank of this number, that is, the rank among the n numbers distributed over all PUs. R_i is estimated as R'_i = P * r_i. This estimate is not made explicit in the algorithm, but this is the underlying justification for routing the number with rank r_i to PU_{P * r_i / k}. A number is routed to the wrong processor if round_down(R'_i / k) != round_down(R_i / k). For how many numbers this happens depends on the differences between R'_i and R_i. This deviation turns out to be maximal for the number with R_i = n / 2. For this number the expected value of r_i equals k / 2. Using the Chernoff bounds, it follows that the deviation is bounded by O(sqrt(k * log n)) with high probability. So, we may assume that any of the values |R'_i - R_i| = O(P * sqrt(k * log n)). Thus a PU receives at most this many packets that belong to a neighbor. As condition this gives k = omega(P * sqrt(k * log n)) which can be rewritten as n = omega(P^3 * log n).

    There are several ways to perform the final correction. Consider PU_i and PU_{i + 1}. Both have their numbers sorted, but possibly the values of the largest numbers in PU_i are larger than those of the smallest numbers in PU_{i + 1}. The above estimate has shown that we may assume that the number of these is at most x = c * P * sqrt(k * log n). Here c is a constant which can be determined by working out the details of the estimate. The simplest solution is to let both PUs exchange the x concerned numbers. Then each of them can merge the two subarrays of length x. PU_i keeps the smallest x and PU_{i + 1} the largest x. First exchanging samples, there is no need to make copy and the data exchange can be minimized.

    Mesh

    On a mesh we do not assume that the PUs hold many numbers. Instead we process the numbers in a sufficiently large submesh collectively. Such a submesh can be viewed as a virtual PU. If on an n x n mesh each PU holds n numbers, then collectively processing numbers in m x m submeshes has the effect of reducing the number of PUs from n^2 to (n / m)^2 and increasing the number of numbers per PU from k to k * m^2. Because on n x n meshes the cost of routing operations increases linearly with n, all operations within the submeshes have negligible cost if n' = (n / m) = o(n). Exploiting these ideas immediately leads to a good algorithm for the case k = 1 and to an optimal algorithm for the case k >= 8.

    For k = 1, a good choice is m = n^{2/3} * log^{1/6} n. Doing that, n' = n / m = n^{1/3} / log^{1/6} n. So, there are P' = n'^2 = n^{2/3} / log^{1/3} n submeshes. k' = 1 * m^2 = n^{4/3} * log^{1/3} n. Any computation within the submeshes, corresponding to steps 1, 3, 4 and 6 of the algorithm above, takes O(m) = O(n^{2/3} * log^{1/6} n) time. The choice of m is made so that P' * (k' * log n)^{1/2} = n^{2/3} / log^{1/3} n * (n^{4/3} * log^{4/3} n)^{1/2} = n^{4/3} * log^{1/3} n = k'. So, in step of the algorithm the rearrangement takes place within subsets of constantly many submeshes. This takes O(m) time as well. Step 2 and 5 amount to routing a randomization and the inverse thereof. It can easily be shown that congestion is very limited. Each routing phase can be performed in n + O(sqrt(n * log n)) time. So, in total the sorting takes 4 * n + O(n^{2/3} * log^{1/6} n) time. This is twice as much as given by the distance lower bound.

    For larger k optimality can be established. Now it is better to choose m smaller in dependency of k: the best choice is m = n^{2/3} * log^{1/6} n / k^{1/6}. This means that for very large k we will take m = 1, which sounds natural. The operations in the submeshes take O(k * m) = O(k^{5/6} * n^{2/3} * log^{1/6} n) time. Again the choice is made so that P' * (k' * log n)^{1/2} = n^{2/3} * k^{1/3} / log^{1/3} n * (n^{4/3} * k^{2/3} * log^{4/3} n)^{1/2} = n^{4/3} * k^{2/3} * log^{1/3} n = k'. In this case in step 2 and 5 half of the packets are routed XY and the other half YX. This makes that in each of the routing phases only k * n / 8 packets have to cross the bisection. Thus, together these steps take k * n / 2 + o(k * n) time.

    Theorem: On a two-dimensional n x n mesh, k-k sorting can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.

    Sampling through Shuffling

    Above we have seen how randomizing the numbers makes the selection of splitters superfluous: the distribution of the numbers in each PU may be assumed to provide a sufficiently good impression of the whole distribution to base the bucketing on. We use randomization here for establishing something that is actually much weaker: we only need that each PU holds approximately the same number of numbers with destination in the same PU. For this we do not need randomization.

    DMM

    The crucial idea is that sorting the numbers in a PU and then handing them out in this sorted order over the PUs gives an even smoother distribution of the numbers over the PUs than doing this randomly. More precisely, if each PU holds k = n / P numbers, these are sorted, and the number with rank r in PU_i is sent to PU_{i + r mod P}. This redistribution is called shuffling. In the context of eliminating the need for randomization from all-to-all routing algorithms idea was already discussed in the section on "Derandomization" in the chapter on "Interconnection Networks and Routing". The operation which is used for the second all-to-all routing in the above randomized algorithm, in which a number with rank r is sent to PU_{P * r / k} is called unshuffling.

    Based on shuffles and unshuffles the following is a simple, versatile and efficient sorting algorithm which is first formulated for a DMM with P PUs and k = n / P numbers per PU:

    1. Sort all numbers in each PU.
    2. Shuffle the numbers.
    3. Use P-way merge sort to sort the numbers in each PU.
    4. Unshuffle the numbers.
    5. Use P-way merge sort to sort the numbers in each PU.
    6. Perform data exchange between pairs of PUs with consecutive indices to establish the desired global sorting with k numbers per PU.
    The strong point of shuffles and unshuffles is that these are fixed routing patterns, for which optimal routing schedules can be computed offline (which due to their very simple nature is mostly easy). The disadvantage is one more merging.

    The analysis of the algorithm goes analogously to those above. For a number x which after the sorting in step 3 has rank r in PU_i, the PU in which it currently resides, the global rank is estimated as R' = P * r. Let R denote the actual global rank of x. It is easy to put upper bounds on |R' - R|. For this we need some additional notation. Let r_j be the rank of x in PU_i among the numbers originating from PU_j. Among the numbers which originally resided in PU_j, x falls between the number with rank j + (r_j - 1) * P and the one with rank j + r_j * P. Summing over all PUs gives

    P * (P - 1) / 2 + P * sum_{j = 0}^{P - 1} (r_j - 1) < R < P * (P - 1) / 2 + P * sum_{j = 0}^{P - 1} r_j
    So, after step 5 there may reside at most about P^2 / 2 numbers which belong in PU_i in both PU_{i - 1} and PU_{i + 1}. For the correctness this gives the condition that k > P^2 / 2. In order that the cost of the final correction is negligible, the condition becomes k = omega(P^2). Substituting k = n / P gives the condition n = omega(P^3), the same we have encountered before.

    Mesh

    Above we have formulated the algorithm as a deterministic variant of a sample-sort algorithm with implicit sampling. That is one view of the algorithm, but it is not the only view. The algorithm can also be viewed as an improved variant of shearsort which works with a constant number of one-dimensional sorting phases in total. This algorithm, which will be described next is called column sort and was presented by Leighton in 1985.

    The mesh is not square but has height n^2 and width n. Each PU holds one number. The numbers will be sorted in column-major order. The algorithm consists of a few simple one-dimensional operations:

    1. Sort all columns from top to bottom.
    2. Shuffle all numbers within the rows. That is, the number in PU_{i, j}, 0 <= i < n^2, 0 <= j < n, is routed to PU_{i, (i + j) mod n}.
    3. Sort all columns from top to bottom.
    4. Rotate the numbers in the columns. That is, the number in PU_{i, j}, 0 <= i < n^2, 0 <= j < n, is routed to PU_{(i + n * j) mod n^2, j}.
    5. Unshuffle all numbers within the rows, that is, the number in PU_{i, j}, 0 <= i < n^2, 0 <= j < n, is routed to PU_{i, (i / n - 2 * j) mod n}.
    6. Sort all even columns from top to bottom and all odd columns from bottom to top.
    7. Sort pairs of adjacent row positions. That is, compare and swap the values in PU_{i, 2 * j} and PU_{i, 2 * j + 1}, 0 <= i < n^2, 0 <= j < n / 2.
    8. Sort pairs of adjacent row positions. That is, compare and swap the values in PU_{i, 2 * j - 1} and PU_{i, 2 * j}, 0 < i < n^2, 0 <= j < n / 2.
    9. Sort all columns from top to bottom.

    Column Sort

    Without the zero-one lemma the analysis of this algorithm would be hard. So let us consider an arbitrary distribution of zeroes and ones. In several steps we show that finally this distribution has been sorted.

    Lemma: Starting with a zero-one distribution, there are at most n dirty rows after step 3.

    Proof: Let z_j, 0 <= j < n, be the number of zeroes in column j. The shuffling is so that any column either receives round_down{z_j / n} or round_up{z_j / n} of these zeroes. So, a column receives at least l = sum_j round_down{z_i / n} and at most h = sum_j round_up{z_i / n} zeroes. So, after step 3, in row 0, ..., l - 1 there are only zeroes, while in row h, ..., n^2 - 1 there are only ones. Only in row l, ..., h - 1 there may be both zeroes and ones. Because h - l <= n, the claimed result follows. End.

    Lemma: Starting with a zero-one distribution, there are at most 2 dirty columns after step 5.

    Proof: Step 4 and 5 together are so that the contents of the columns is permuted among the rows so that the smallest n numbers in each column go to column 0, and more generally that all numbers standing in row j * n, ..., (j + 1) * n - 1 are ending up in column j. Let l and h be as in the proof of the previous lemma. There are two cases to distinguish. There is a j so that j * n <= l and h < (j + 1) * n. In that case only column j receives zeroes and ones. So, column j will be the only dirty column. Otherwise, because h - l <= n, there is a j so that j * n <= l and h < (j + 2) * n. In that case column 0, ..., j - 1 receives only zeroes, column j + 2, ..., n - 1 receives only ones, and column j and j + 1 may receive both zeroes and ones. End.

    Theorem: The given column-sort algorithm correctly sorts any distribution of n^3 numbers on an n^2 x n mesh.

    Proof: Given the situation as described in the proof of the above lemma, with zeroes and ones only in column j and j + 1 for some j, 0 <= j < n - 1, sorting the columns up and down alternatingly followed by sorting pairs of adjacent row positions reduces the number of dirty columns to one. This is a special case of the argument for shearsort that the number of dirty columns is reduced by a factor two in each round. The final sorting rearranges the numbers in this dirty column into sorted order. End.

    Column-sort can also be formulated processor independent. In this generalized form it becomes immediately clear how to carry it over to interconnection networks. The numbers are divided over n^{1/3} buckets of size n^{2/3}. The role of the PUs is taken over by the buckets.

    1. Sort the numbers in each bucket.
    2. Shuffle the numbers.
    3. Sort the numbers in each bucket.
    4. Unshuffle the numbers.
    5. Sort the numbers in each bucket.
    6. Perform data exchange between pairs of buckets with consecutive indices to establish the desired global sorting with n^{2/3} numbers per bucket.
    Sorting n numbers can be performed by a combination of a constant number of operations in which n^{1/3} buckets with n^{2/3} numbers each are sorted and some regular rearrangements of the elements between these buckets.

    Theorem: On a 2-dimensional n x n mesh with k >= 8 numbers per PU, sorting can be performed in k * n / 2 + o(n) steps.

    Proof: The buckets can be chosen to consist of the k^{2/3} * n^{4/3} numbers residing in each n' x n' submesh for n' = n^{2/3} / k^{1/6}. Sorting these submeshes can be performed with any sorting algorithm which achieves the optimal time order in O(k * n') = O(k^{5/6} * n^{2/3}). The final correction also has this time order. The shuffle and unshuffle send only half of the packets over the bisection and when exploiting the all-port facility they can be performed in k * n / 4 steps each. Thus, the whole sorting can be performed in k * n / 2 + O(k^{5/6} * n^{2/3}) steps. End.

    The number of steps is the same as for routing a k-k distribution except for the lower-order term. Thus, we have obtained an optimal sorting algorithm because sorting is at least as hard as routing and k * n / 2 is a lower bound for k-k routing. Comparing with the presented randomized algorithm we see that the performance is almost the same. The lower-order term of the randomized algorithm is asymptotically marginally larger: distributing information regularly is better than randomizing.

    Constant-Time Sort

    We have seen several good sorting algorithms but the best achieved time remains O(log n). It is an interesting question to consider whether sorting, and other problems, can be performed in constant time and how much hardware is needed for that.

    PRAM

    On a Common CRCW PRAM an array a[] of length n can be sorted in constant time using n * n! PUs. The algorithm is similar to the constant-time algorithm for computing the maximum of n numbers. Unfortunately, for sorting there is no way to make the algorithm efficient.

    The n! permutations can be enumerated systematically, and permutation p, 0 <= p < n!, is denoted pi_p. The PUs are indexed PU_{p, i}, 0 <= p < n!, 0 <= i < n. The algorithm is simple:

      1. for (all p, 0 <= p < n!, using  PU_{p, 0}) do
           b[p] = true;
      2. for (all p, i, 0 <= p < n!, 0 <= i < n - 1, using PU_{p, i}) do
           if ( a[p[i]]  > a[p[i + 1]] || 
               (a[p[i]] == a[p[i + 1]] && p[i] > p[i + 1]))
             b[i] = false;
      3. for (all p, i, 0 <= p < n!, 0 <= i < n, using PU_{p, i}) do
           if (b[p]) {
             c[p[i]] = a[i];
             a[i] = c[i]; }
    
    In the last step it is essentially used that a PRAM is synchronous. On a non-synchronous machine the two instructions inside the if-clause should be separated by a synchronization to assure that the instruction c[p[i]] = a[i] has been executed before the instruction a[i] = c[i] is executed.

    Reconfigurable Array

    The above algorithm is fast and simple but extremely inefficient. The main importance is to show that in principle sorting can be performed in constant time. Can sorting be performed in constant time with a polynomial number of PUs? The answer is yes: in an earlier chapter we have seen how to sort n numbers on a reconfigurable processor array with O(n^3) PUs in 6 steps.

    Using O(n^3) hardware to sort n numbers is not particularly efficient. By how much can this be reduced while still bounding the sorting time to O(1)? We will see that O(n^2) hardware is enough. This is optimal because performing a permutation in O(1) requires a bisection bandwidth of n, while the grid also has a basis of length n.

    The main conclusion from column-sort is that sorting n numbers can be performed by several times sorting O(n^{2/3}) numbers and several rearrangements. Of course each of these sortings themselves can also be performed using column-sort. So, the idea for sorting n numbers on a reconfigurable n x n mesh is to recursively apply column sort. The recursion is terminated as soon as the size n' of the subproblem to solve satisfies n'^2 < n, because then it can be solved in constant time using the presented algorithm for sorting on reconfigurable arrays. Let n_i, i >= 0, denote the size of the subproblems at level i of the recursion. n_0 = n, n_1 = n^{2/3} and n_2 = n^{4/9}. So, the recursion tree has constant depth. Because the number of subproblems increases only by a constant factor between two levels of the recursion, there are in total only constantly many subproblems of size n^{4/9} to solve. We conclude

    Theorem: On a reconfigurable processor array with n^2 PUs, n numbers can be sorted in O(1) time.

    Exercises

    1. Give a transposition-sort algorithm for a linear array with an odd number n of PUs. How many steps do you need? Prove the correctness formally. Hint: use the 0-1 sorting lemma and induction.

    2. Consider sorting n^2 numbers on an n x n mesh. Give an algorithm running in O(n) time. The final arrangement can be freely chosen but must be fixed.

    3. Consider sorting n numbers on a k-dimensional full-port hypercube with P = 2^k PUs. Present an algorithm running in O(n / P) time. What is the minimum n for which your algorithm achieves this time order?

    4. Generalize the shearsort algorithm for three-dimensional meshes. How long does your algorithm take to sort n^3 numbers on an n x n x n mesh?

    5. Consider performing shearsort on the k-dimensional hypercube H_k. Distinguish two interpretations of such a hypercube: it can either be viewed as two interconnected copies of H_{k - 1} or as the direct product H_{k / 2} x H_{k / 2}. Give for both interpretations a rough estimate of the time consumption.

    6. Consider sorting n^2 numbers on a DMM with P PUs using a variant of shearsort. Assume that n >= P. The mesh can be mapped on to the DMM in several ways:
      • Each PU taking care of a bundle consisting of n / P rows;
      • Each PU taking care of a bundle consisting of n / P columns;
      • Each PU taking care of an n / sqrt(P) x n / sqrt(P) submesh.
      For each of these views formulate the costs in the DMM model.

    7. In the shearsort algorithm, the only purpose of the sorting in two-directions in the rows is to spread the values out. This can much more effectively be achieved by randomization. Analyze a variant of shearsort in which the sorting in the rows is replaced by performing a random permutation. A random permutation can efficiently be generated by choosing random numbers and sorting on them. The analysis should be based on the 0-1 sorting lemma. You may use that when n times picking white and black balls which are white uniformly with probability 1 / 2, that then the probability that the number of selected white balls exceeds n / 2 + h is bounded by e^{-h^2 / (c * n)}, for some small constant c.

    8. The sorting algorithm based on first randomizing all numbers among the PUs, which was analyzed above for a mesh with k >= 8 numbers per PU, can also be used for the case that each PU holds exactly one number. Specify the time consumption of the algorithm for an n x n mesh.

    9. In the text above the performance of bitonic merge sort was analyzed for linear arrays and two-dimensional meshes. Perform an analogous analysis for 3-dimensional n x n x n meshes for some n = 2^k. Most interesting is the constant of the term which is linear in n. How does this constant develop when the algorithm is applied for d-dimensional n x ... x n meshes?

    10. Rewrite the method oddEvenMerge() in the given implementation of odd-even merge sort so that it becomes insitu and iterative. Hint: first study the above picture showing how 16 numbers are sorted. Compare the new version with the original one and with the implementation of bitonic merge sort, for a suitable range of problem sizes.

    11. The presented three "sample-sort" algorithms (one with explicit sample, one randomizing the numbers and one shuffling them) can also be used for sorting on a hypercube with n PUs and one number per PU.
      • Formulate recurrency relations coarsely estimating the sorting time of each algorithm. It may be assumed that permutation routing can be performed in O(n) time.
      • Solve the recurrence relations and conclude that the differences, which before were hardly significant because they only affected the constant factor, now are more important.

    12. Consider sorting n numbers on a reconfigurable array with n^2 PUs. How many steps does it take exactly when performing column-sort recursively? Assume that n is so large that c * n^{8/9} <= n for any constant c you may need and you may ignore problems of divisibility.

    13. Assume that reconfigurable switches are relatively expensive, much more expensive than providing the same bisection bandwidth by conventional means. Then it may make sense to build a hybrid system: consisting of a reconfigurable grid which is used for computational purposes and some other network which is used for massive communication. We want to use this network for sorting n numbers on a system with n PUs. The time for the rearrangements is ignored. How big, in terms of n, must the reconfigurable array be in order to assure O(1) sorting time?


    
    
    
    

    List Ranking

    Two-Processor Algorithm

    Introduction

    A linked list, hereafter just list, is a basic data structure: it consists of nodes which are linked together, so that every node has precisely one predecessor and one successor, except for the initial node, which has no predecessor, and the final node, which has no successor. The rank of a node is defined as the number of links that must be traversed to reach the final node of the list (or sometimes also the number of links that must be traversed to reach the node when starting from the initial node). An important problem connected to the use of lists is the list ranking problem: the determination of the ranks of all nodes. Typically, the set of links constitutes a set of lists, and in that case for each node one also wants to compute the index of the last node of its list. Parallel list-ranking is a key subroutine in more complex problems such as edge coloring bipartite graphs and in the lowest-common-ancestor problem.

    We consider a set of lists with n nodes in total. This is internally represented by an array succ[] of length n. For a node u, succ[u] gives the successor of u. So, the lists are directed towards their final nodes. If u is the last node of its list, we have succ[u] = u. The task is to compute the values of the arrays last[] and dist[]. For a node u, last[u] indicates the last node of the list to which u belongs, and dist[u] gives the number of links that must be traversed in order to reach last[u] starting in u.

    List Ranking

    Sequentially list ranking can be solved by first figuring out the initial elements of the lists and then simply traversing the list while numbering the nodes.

      void sequential(int succ[], int[] frst, int dist[], int n)
      {
        boolean[] first = new boolean[n];
        for(int i = 0; i < n; i++)
          first[i] = true;
        for(int i = 0; i < n; i++)
          first[succ[i]] = false;
        for(int i = 0; i < n; i++)
          if (first[i]) // i is first node
          {
            int j = last[i] = i;
            int d = dist[i] = 0;
            while (succ[j] != j)
            {
              j = succ[j];
              d++;
              last[j] = i;
              dist[j] = d;
            }
          }
      }
    
    
    The above algorithm performs 4 * n unstructured accesses to arrays of length n. For large n, this causes close to 4 * n cache faults. As a result this algorithm is not particularly efficient. Even a fast computers requires around 1 second for ranking a list with 10^6 elements. Some simple improvements reduce the number of unstructured array accesses to 2 * n, almost halving the execution time.

    There is no obvious parallel algorithm. Parallel list ranking is much harder than parallel sorting: when sorting any two elements can be compared and this adds some information, but while list ranking, the successor of a node is typically stored in another processor and just going there is expensive. Theoretically the parallel list-ranking problem has been solved: for ranking a list of length n, there are PRAM algorithms using n / log n PUs running in O(log n) time. But, practically it is hard to achieve reasonable speed-ups. Even for very large problems and a very good interconnection network it is not easy to achieve a speed-up of more than 30 on a parallel computer with 100 PUs. For sorting it would have been easy to achieve a speed-up of 80 or 90. This is mainly due to the hard, very non-local nature of list ranking, but also to the fact that its sequential complexity is only linear, while sequential sorting has running time O(n * log n).

    The ideas in this section go back on many authors. Just to name a few: Wylie, Anderson, Cole, Miller, Vishkin, Reid-Miller, Ranade and Sibeyn.

    Pointer Jumping

    The basic parallel list-ranking algorithm is pointer jumping. The algorithm is almost trivial:
      void pointerJumping(int[] succ, int[] last, int[] dist, int n) 
      {
        // Initialization
        for (int i = 0; i < n; i++)
          if (succ[i] == i)
          {
            last[i] = succ[i] - n;
            dist[i] = 0;
          }
          else
          {
            last[i] = succ[i];
            dist[i] = 1;
          }
    
        // Performing rounds until all reached final node
        boolean flag = true;
        while (flag)
        {
          flag = false;
          for (int i = 0; i < n; i++)
            if (last[i] >= 0) // i has not yet reached last node
            {
              flag = true;
              dist[i] += dist[last[i]];
              last[i]  = last[last[i]];
            }
        }
    
        // Correction of last[] values
        for (int i = 0; i < n; i++)
          last[i] += n;
      }
    
    

    Operations of the kind last[i] = last[last[i]], for all i, are called pointer-jumping steps. In a parallel algorithm, these are performed synchronizedly. So, as long as last[u] does not yet point to the last node of a list, the value dist[u] doubles in every pass. Therefore the while loop will be executed at most log n times. The work in each pass is O(n), so the total work is O(n * log n) and the time consumption is O(log n). Pointer jumping is simple and fast, but inefficient due to its super-linear work.

    On an interconnection network, each pass requires two all-to-all routing operations, in which each question is resulting in two answers. So, in the DMM model this algorithm costs

    T(P, n) = c_i * O(n / P * log n) + c_v * 3 * log n * n / P + c_a * 2 * log n * P.
    Because of the relatively small number of all-to-all communications, this algorithm is quite reasonable for relatively small problems. For n >> c_a / c_v * P^2, the start-up costs are negligible. But, of course, the factor log n in the work and routing volume excludes really good performance.

    Independent-Set Removal

    The other classical main approach to list ranking is independent-set removal. An independent set of a graph is a subset S of the nodes that no two nodes in S are connected by an edge. For the special case that the graph is a list, a subset of the nodes is an independent set if no two consecutive nodes are in S.

    Independent Set

    The independent-set removal algorithm consists of a reduction and an expansion phase. In the reduction phase, repeatedly the nodes in an independent set are excluded. This continues until the problem has been reduced sufficiently to be solved in a more basic way. Then all excluded elements are reinserted in reversed order. For efficiently performing these operations it is practical to have a doubly linked list, that is, for every node i there is also data item pred[i] giving its predecessor next to succ[i] giving its successor.

    During the reduction phase, the following steps are iterated: somehow an independent set S is constructed, then each node i in S, is excluded from the list by performing

      succ[pred[i]]  = succ[i]; 
      dist[pred[i]] += dist[i];
      pred[succ[i]]  = pred[i];
    
    During the expansion phase these operations are reversed: if i was excluded in round r, then in the corresponding expansion round we may inductively assume that node succ[i] already knows the index of the last node of the list and the distance thereto. Thus, by performing
      last[i]  = last[succ[i]];
      dist[i] += dist[succ[i]];
    
    also the values of node i become final.

    In the work-time framework, each of the exclusion and inclusion operations can be performed in O(1) time and work that is proportional to the number of participating nodes. On an interconnection network each of these operations require a constant number of all-to-all routings, with a routing volume that is proportional to the number of participating nodes.

    The main point that remains to be specified is the selection of the independent set. This can be done both randomly and deterministically. Investing some more effort in order to get a larger independent set might be a good idea, because this will result in a faster reduction of the problem size. However, the simplest idea is to perform random coin tossing: for each node 0 or 1 is chosen independently with probability 1/2. S is the set of all nodes with value 1 whose successor has value 0. For a list of length n, the expected number of nodes selected equals (n - 1) / 4. Using the Azuma inequalities it can even be shown that at least n / 4 - o(n) nodes are selected with high probability. This means that applying this selection method, the problem size is approximately multiplied by a factor 3 / 4 in each reduction round. In the work-time framework, S can be selected in constant time, on an interconnection network, it requires one all-to-all routing to inform all predecessors.

    So, over all selection, reduction and expansion rounds, the sum of the the performed work is given by c * (1 + 3/4 + 9/16 + ...) * n <= 4 * c * n for some suitable constant c. So, it is linear in n. The time is proportional to the number of performed reduction rounds because for each reduction there is one corresponding selection and expansion. There is no need to continue reducing until the problem size is constant, though this would not be wrong. If the selection of the independent set is performed randomly, it is better to switch to another algorithm earlier. A reasonable choice is to reduce the problem size by a factor log n using independent-set removal in O(loglog n) time, and then to rank the remaining nodes using pointer jumping in O(log n) time. Both algorithms perform O(n) work.

    Theorem: On an EREW PRAM with n / log n PUs the list-ranking problem for a set of lists with n nodes can be solved in O(log n) time.

    Proof: The main problem is the allocation of work to the PUs. If initially each PU takes care of log n nodes, it is not guaranteed that this number decreases in a smooth way. However, in groups of log n PU, which initially take care of log^2 n nodes, such a smooth decrease is assured with high probability. These subsets of the PUs can be perceived as constituting (loglog n)-dimensional hypercubes. After reduction round r, each PU performs a load balancing with the PU at the other end of the (i mod loglog n)-axis of this conceptual hypercube. Doing this, it becomes highly unlikely that in round r a PU has to deal with substantially more than (3/4)^r * log n nodes. Neither independent-set removal nor pointer jumping require concurrent-memory access. End.

    On an interconnection network with P PUs, we can proceed in the same way although in practice it is better to first reduce the problem size slightly more in order to reduce the cost of the pointer jumping. Alternatively, the problem size can first be reduced by a factor P and then all remaining nodes can be concentrated in PU 0, where the subproblem can be solved internally in O(n / P) time. This is not so elegant, but minimizes the number of start-ups. In terms of the DMM the second approach gives

    T(P, n) = c_i * O(n / P) + c_v * O(n / P) + c_a * O(log P * P).

    As observed in the section on "Routing" in the chapter on "Interconnection Networks", the number of start-ups for an all-to-all routing can be strongly reduced by simulating the routing of a d-dimensional meshes or hypercubes. In general this observation is useful only for problems that are too small for the used number of PUs, because such an indirect routing multiplies the routing volume and causes overhead for rearranging packets as well. So, this may only give an improvement if the time consumption is dominated by the time for the start-ups. However, in our case, the problem size is gradually reduced, and even when the overall time-consumption is dominated by the term (c_i + c_v) * O(n / P), it may happen that in some of the all-to-all routings the start-up time dominates. So, the best performance is obtained, when at any stage the best all-to-all routing subroutine is chosen as a function of P and the size of the packets. Working out this idea more carefully, shows that the factor log P form the term c_a * O(log P * P) can be omited without increasing the leading constants hidden in the other two terms.

    The independent set can also be found deterministically. One possibility to obtain a large independent set is to deterministically construct a three-coloring as discussed in the section on "Symmetry Breaking" of the chapter on "Basic Techniques". The nodes with any of the three colors constitute an independent set. The largest of these sets has size at least n / 3 for a set with n nodes, The coloring and the computation of the sizes of the subsets can be performed in O(log n) time and O(n) work. Using this independent-set selection procedure until there is only one node left, the whole algorithm takes O(log^2 n) time and O(n) work. An immediate improvement is obtained by applying accelerated cascading: as soon as the problem size has been reduced to n / log n, that is, after at most loglog n / log (3/2) = O(loglog n) reduction rounds, we switch to pointer jumping. This idea reduces the total time consumption to O(loglog n * log n). A more refined algorithm runs in O(log n) time, but practically none of the deterministic algorithms can compete with the randomized one.

    Sparse Ruling Sets

    We now consider an algorithm that can be turned into an optimal PRAM algorithm, but was not designed for this purpose. This algorithm was designed for the DMM model. It is the best parallel algorithm for the case that the number of nodes in the lists n is much larger than the number P of available PUs.

    A ruling set in a graph is a subset S of the nodes so that each node is either in S or adjacent to a node in S. A k-ruling set is a subset S, so that each node has distance at most k from a node in S. The algorithm of this section is called the sparse-ruling-set algorithm, because it constructs a small subset of the nodes, which are called rulers, that are more or less regularly interspaced in the list. Once we have walked through the list from one ruler to the next, a new reduced list is constructed which has a node for each ruler and links with weight equal to the distance between the rulers. Once the ranks of the rulers are known, each non-ruler can ask the ruler from which it was reached for its rank and add an offset.

    The following formulation of the algorithm is given for the DMM model. With minor modifications it can be turned into a PRAM or sequential algorithm.

    1. Create an empty set of weighted links L. An entry (p, p', d) indicates a link from p to p' of length d.
    2. For each node p, if p is a final node, mark each final node p as ruler and add a link (p, p, 0) to L. Otherwise mark p as unvisited.
    3. Select s nodes randomly and uniformly from among the non-final nodes as rulers. Each of these rulers p initiates a wave: p prepares a packet (p, 0) to be sent to succ[p].
    4.   while (there are still active waves) 
        {
          Send all prepared packets.
          for (each node p' that receives a packet (p, d)) 
            if (p' is a ruler) // wave of p ends in p'
            {
              Add a link (p, p', d + 1) to L.
              if (there are still unvisited non-ruler nodes)
              {
                Randomly select such a node p''. p'' initiates a new wave:
                p'' prepares a packet (p'', 0) to be sent to succ[p''].
              }
            }
            else // wave of p continues from p'
            {
              mark p' as visited with p as ruler and d + 1 as distance 
              thereof. Prepare a packet (p, d + 1) to be sent to succ[p'].
            }
        } 
    5. Solve the weighted list-ranking problem defined by L.
    6. For each ruler p extract the value of last[p] and dist[p] from the solution of the problem defined by L.
    7. For each non-ruler node p', ask the ruler p of p' the values of last[p] and dist[p]. Set last[p'] = last[p] and dist[p'] = dist[p] - dist[p'], where dist[p'] is the previously marked distance.

    Sparse-Ruling-Set Algorithm

    In each round of the main while-loop exactly s previously unvisited nodes are reached, so all n nodes are visited in n / s rounds. Considering the number of unvisited nodes, it follows that the expected number s_t of waves that hits a ruler in round t, 0 <= t < n / s, is given by s_t = s * s / (n - t * s). Summing over t, gives the total number of newly created rulers:

      sum_t s_t = sum_{t = 0}^{n / s - 1} s * s / (n - t * s)
                = s * sum_{i = 1}^{n / s} 1 / i 
                = s * H_{n / s}.
    
    Here H_k ~= log_e k is the k-th harmonic number. Thus, the problem size is reduced from n to a value n' ~= s * log_e (n / s).

    In a DMM implementation, the algorithm performs n / s + O(1) all-to-all routings. During the main while-loop each node sends a packet containing the index of the ruler, the distance thereof and the destination, so the routing volume is 3 * n / p. In the last step, each non-ruler asks a question generating two answers, so the routing volume of the complete algorithm is 6 * n / p. So, the cost for a reduction of the problem size by a factor f = n / n' = n / s * log_e (n / s) is approximately

    T(P, n) = c_i * O(n / P) + c_v * 6 * n / P + c_a * f * log_e f.

    Taking s = o(n / (log n * loglog n)) gives n' = o(n / log n), which allows to solve the subproblem with pointer jumping at negligible costs. If s is chosen much smaller, s = o(n / (P * log P)), n' = o(n / P). In that case the data of the reduced list can be gossiped at negligible costs and the remaining problem can be solved internally. In that case the total routing volume is only 3 * n / P, but this goes at the expense of a much larger number of all-to-all routings. For small P this may be a very good idea though. If no new rulers are chosen in the course of the algorithm, the distance does not need to be sent along because it can be determined by counting the number of performed rounds. This has the disadvantage that we have to somehow deal with the initial nodes but allows to reduce the routing volume to just 2 * n / P, which appears to be the best achievable.

    In practical applications, if P is somewhat larger, the focus should lie on reducing the number of routing operations. If s is chosen larger, for example, s = n / 50, the problem size is reduced by about a factor 10 with just over 50 all-to-all routings. Applying the same once more gives a reduction by about a factor 100, with 100 all-to-all routings and a routing volume of about 6.6 * n / P. In practice the choice of s should be made as a function of P, n, c_v and c_a in combination with estimates of the cost to solve the subproblem.

    We remind that there are two main reasons why running a parallel program on a parallel computer with P PUs mostly gives much less than P speed-up:

    The key to the good practical performance of the sparse-ruling-set algorithm is that it is better than other algorithms in both respects. A sequential version of the algorithm takes only about twice as long as the conventional sequential algorithm. Its only weakness is the relatively large number of start-ups. Pointer jumping is better in that respect.

    The weakness of sending too many packets can be alleviated by applying the following idea. Instead of performing all-to-all routings, at the end of each phase (which is the basic idea of the BSP approach!), it is better to intermix the computation and the communication: Round t is decomposed in P subrounds, denoted t_p, for 0 <= p < P. In subround t_p the algorithm running on PU i sends all it has to send to PU (i + p) mod P. Then it works on all the data it received and prepares the packet for PU (i + p + 1) mod P, and so on. Doing this, the routed packets become larger: a wave now has to wait only P / 2 subrounds before it is progressing again, going twice as fast. This idea does not only give improvements for list ranking, but for any problem in which a non-local structure has to be traversed.

    The sparse-ruling-set algorithm may be considered to be a variant of independent-set removal. The difference is the speed at which the reduction is performed: independent-set removal gives a fixed small reduction factor. With the sparse-ruling-set algorithm the reduction factor can be chosen freely. A more interesting observation is that the structure of the sparse-ruling-set algorithm is very similar to that of sample sort: for a small subset of the elements the ranks are determined exactly, then the other elements are ranked between them.

    Click here to see a sequential version of the sequential algorithm, the pointer-jumping algorithm and the sparse-ruling-set algorithm. It is important to observe that in the implemented variant of pointer jumping the distance may double by more than a factor two in one round because it may happen that node succ[i] gets processed before node i.

    One-by-One Cleaning

    There is an alternative algorithm which, at the cost of more internal work and a larger routing volume, really saves on the number of packets sent. It does not perform all-to-all routing but routes permutations: in total on a network with P PUs, a PU sends only O(P) packets. This can also be achieved by other algorithms (repeatedly reducing the problem size by a factor, than concentrating the rest of the nodes in a subset of the PUs), but in a much less natural way and with worse constants. The algorithm is called one-by-one cleaning, because it gradually eliminates links pointing to other PUs.

    Two-Processor Algorithm

    We first consider how to efficiently solve list ranking on a two-processor system. Of course with just two PUs not much speed-up can be expected, but this already illustrates the main idea of the more general algorithm. The two-processor algorithm can be implemented so efficiently that if the communication can be performed at the same speed as memory access, which is the case if the two PUs have access to a shared memory, that then speed-up 1.5. This can be achieved already for fairly small problems, because only two communication operations are performed independently of the size of the problem.

    As before every node has fields succ[], last[] and dist[], where ultimately last[i] will point to the last node of the list of i and dist[i] at all times gives the distance from node i to last[i]. A final node i has succ[i] = i. There are two basic operations. The first is called autoclean, the other altroclean. If last[i] = j, we say that node i points to node j.

    Autoclean(p) means that for all nodes i stored in some PU_p for which last[i] points to a node also stored in PU_p, we perform last[i] = last[last[i]] and dist[i] += dist[last[i]] until i is pointing to a node stored in another PU. This operation does not require any communication.

    Altroclean(p, p') means that for all nodes i stored in some PU_p for which last[i] points to a node stored in PU_{p'}, we perform last[i] = last[last[i]] and dist[i] += dist[last[i]], by asking PU_{p'} for this information. An autoclean operation means that both involved PUs must send one packet. The first packet contains the index of the node the question is destined for, the second contains its last- and dist-value. In total three integers are exchanged.

    Building on these two basic operations, we get the following algorithm for PU_p, 0 <= p < 2:

    1.     for (each node i in PU_p)
            if (succ[i] == i) // i is a final node
            {
              last[i] = i - n;
              dist[i] = 0;
            }
            else
            {
              last[i] = succ[i];
              dist[i] = 1;
            } 
    2.     autoclean(p); 
    3.     altroclean(p, (p + 1) mod 2); 
    4.     autoclean(p); 
    5.     for (each node i in PU_p)
            last[i] += n; 
    PU_0 takes care of the nodes with indices 0, ..., n / 2 - 1, PU_1 of the nodes with indices n / 2, ..., n - 1. By giving the final nodes a last-value which lies outside these ranges, the autocleans continue until either having found a node in the other PU or until a final node. During the altrocleans, no question is sent to the final nodes.

    The algorithm is correct, because after the first autoclean no node points to a node in its own PU. Thus, after the altroclean, each node which does not point to a final node points to a node in the own PU. The second autoclean therefore finishes the problem, because if no node in PU_p points to a node in PU_{(p + 1) mod 2}, at the beginning of an autoclean, then it does not do so at the end of this operation. But by then no node points to a node in the own PU either, so all nodes must point to a final node.

    The autocleans perform a linear number of internal operations. Essentially an autoclean is nothing but an internal list-ranking problem in which all nodes pointing to a node in another PU are considered to be final nodes. So, typically there will many very short lists.

    T(2, n) = c_i * O(n) + c_v * 3 * n / 2 + c_a * 2.

    Two-Processor List Ranking

    DMM

    Combining autocleans and altrocleans in a systematic way the above algorithm can be generalized for P > 2. The simplest generalizations do not work. Already for P = 3, there is the problem that after having eliminated all links from PU_p to PU_{p'} by performing altroclean(p, p'), such links may be reintroduced by a subsequent altroclean(p, p'') for p'' != p'. The solution is to maintain two differently cleaned data sets. So, instead of last[] we have last_left[] and last_rght[]. For dist[] we do not need such a copy, at all times dist[i] = dist_rght[i] gives the number of links that must be traversed to reach last_rght[i] starting from node i.

    The kernel of the algorithm consists of a loop in which all other PUs are addressed in a circular way:

    1.     for (each node i in PU_p)
            if (succ[i] == i) // i is a final node
            {
              last_left[i] = last_rght[i] = i - n;
              dist[i] = 0;
            }
            else
            {
              last_left[i] = last_rght[i] = succ[i];
              dist[i] = 1;
            } 
    2.     autoclean (p, last_left);
          autoclean (p, last_rght); 
    3.     for (t = 1; t < P; t++)
          {
            altroclean(p, last_left, (p - t) mod P, last_rght);
            altroclean(p, last_rght, (p + t) mod P, last_left);
            autoclean (p, last_left);
            autoclean (p, last_rght);
          } 
    4.     for (each node i in PU_p)
            last_rght[i] += n; 
    The notation has been slightly extended: auto_clean(p, last_left) means that an autoclean is performed for the nodes in PU_p for the links stored in last_left. auto_clean(p, last_rght) is defined analogously. altroclean(p, last_left, (p - t) mod P, last_rght) means that an altroclean is performed for the nodes in PU_p asking a question to PU_{((p - t) mod P} for each node i for which node last_left[i] is handled by PU_{(p - t) mod P} and that the answer is based on the information in last_rght[]. altroclean(p, last_rght, (p + t) mod P, last_left) is defined analogously.

    We claim that at the end of the algorithm last_rght[i] gives the last node of the list of i for all i, 0 <= i < n. This can most easily be proven with invariants. We claim that after round t, 0 <= t < P, last_left[i] for all nodes i stored in PU_p is clean for p, p - 1, ..., p - t. By this we mean that last_left[i] does not point to any node stored in a PU with indices from {p, p - 1, ..., p - t}. In addition we claim that after round t, 0 <= t < P, last_rght[i] for all nodes i stored in PU_p is clean for p, p + 1, ..., p + t. For t = 0, we have performed autocleans, so in PU p, last_left and last_rght are clean for 0. More generally, assume the claims hold for some given t. So, at the beginning of round t + 1 last_left[i] is clean for p, p - 1, ..., p - t and last_rght[i] for p, p + 1, ..., p + t. Thus, if last_left[i] points to a node in PU_{p - t - 1}, putting last_left[i] = last_rght[last_left[i]] will return the index of a node which does not lie in a PU with indices in {p - t - 1, p - t, ..., p - 1}, but possibly it may lie in PU_p. Performing the autoclean establishes the claim for t + 1. In the above discussion, all PU indices are meant to be computed modulo P.

    How about the time complexity? There are 2 * P - 2 altrocleans. Each operation consists of a question followed by an answer. Sending the second question in each pass of the loop together with the first answer, the number of start-ups is reduced to 3 * P - 3, this is only as much as in three all-to-all routings, without any indirect routing. On the other hand, if we have bad luck, there may be n / P questions in each altroclean. However, if the nodes indices are randomized, then we get much better performance: we may assume that in round t, the last values are the indices of nodes which are uniformly distributed over the remaining P - t PUs. If this assumption holds, the expected number of questions for an altroclean in round t equals (n / P) / (P - t). In total this gives sum_{i = 1}^{P - 1} (n / P) / (P - t) = n / P * H_P ~= n / P * log_e P. For every left-question we get back one answer, only a new last-value, for every right-question we get back two answers. So, for the presented version of the algorithm, the total routing volume is about 5 * (n / P) * log_e P. Altogether this gives

    T(P, n) = c_i * O(n * log n) + c_v * 5 * n * log_e P + c_a * 3 * (P - 1).

    Comparing with pointer jumping, we have slightly less work and routing volume because 5 * log_e P < 3 * log n, and much fewer start-ups. So, in practice pointer jumping will be worse. Comparing with the sparse-ruling-set algorithm we have lost almost a factor log_e P in work and routing volume, but the number or start-ups is far smaller. So, one-by-one cleaning will be better for small values of n / P^2, because then the start-up costs tend to dominate. There are other algorithms which are better for intermediate values, but it is correct to say that sparse ruling sets is best for large values of n / P^2 and one-by-one cleaning for small values of n / P^2. A combination of these two algorithms always gives reasonable results.

    One-by-One Cleaning

    Circular-Array

    How about list-ranking on circular arrays? Different from before we now assume that in any communication step the PUs can send a packet of arbitrary size to both neighbors, the complexity measure being the number of these communication steps. The number of PUs is P, each holding the information related to n / P of the n list nodes.

    The optimal PRAM algorithms can quite easily be turned into algorithms performing O(P) routing steps. This can be achieved on a linear array by geometrically reducing the problem size and the number of employed PUs until the reduced problem fits in one PU. Then it can be solved internally. Reversing the reduction steps solves the problem. The constant hidden in O(P) depends on how the details are worked out. It lies between 10 and 20. It is not clear how such algorithms that contract the problem in an ever smaller subset of PUs should profit from the wrap-around connection after the first reduction rounds, so on circular arrays, the number of routing steps is only slightly smaller.

    The quite amazing thing is that on circular arrays list ranking can be solved much faster than this (and because circular arrays can be embedded on linear arrays with dilation 2, by extension even on linear arrays): P - 1 routing steps are sufficient! This can be achieved by an adaptation of the one-by-one cleaning algorithm. The idea is simple: the values of last_rght[] stay in the PU where they started, while the values of last_left[] travel around the network moving leftwards, that is, from PU_p to PU_{(p - 1) mod P}. In this way, the communication that is performed in the DMM algorithm can be resolved locally.

    The packets sent in every step have size O(n / P) and the amount of computation is of the same order. Now one may complain that using P - 1 steps with O(n / P) work in each of them gives O(n) in total. This will not give much speed-up! This is true in general, but the important thing is the idea: this is an algorithm in which all communication is local. It is a kind of systolic list-ranking algorithm: two streams of information which are only interacting where meeting. The algorithm might even have practical importance. Maybe the "PUs" actually give highly connected clusters of PUs in a structure in which the clusters are more loosely connected (NUMA platforms). Or, we may have that n, the problem size, is greater than M, the size of the memory of a PU, but that n < P * M.

    Exercises

    1. Give a sequential list-ranking algorithm that performs at most 2 * n unstructured array accesses for a list of length n.

    2. By a theoretical consideration we want to determine which algorithm in practice will be the best. The interesting point is that there is not a single algorithm dominating all others. There are several serious competitors: all of the presented algorithms and the sequential algorithm. Which algorithm is the best depends on the cost model and the values of P and n. Some algorithms which were not presented are even slightly better for certain ranges. We assume the DMM cost model with the following parameters: c_i = 10^{-9}, c_v = 10^{-8}, c_a = 10^{-5}. For a list with n nodes, estimate the number of internal operations performed by the sequential algorithm on 400 * n. The number of internal operations of the parallel algorithms can be estimated on 100 * routing_volume.
      • Which of the presented algorithms is best for very small problems? Draw a graph giving the maximum number of PUs that should be used for solving a problem of size n. Along the x-axis log_2 n increases from 14 to 34, along the y-axis log P increases from 0 to 10.
      • Compare pointer-jumping and one-by-one cleaning. Is one of them always better than the other, or do they both have their range of optimality? In the first case, indicate which algorithm is better, and prove this; in the second case, make a two-dimensional image like the one above and draw an approximate boundary.
      • Which of the presented algorithms is best for very large problems? Draw a graph of the maximum achievable speed-up as a function of n. Use again logarithmic scales along both axes. The cost of the sparse-ruling-set algorithm may be approximated by the cost for achieving a reduction by a factor P in one reduction round. Also indicate the number of PUs used for obtaining this maximum speed-up.

    3. One of the tasks in the above exercise was to determine the maximum achievable speed-up in a practical setting. Now we take a theoretical stand point. Assume that we have arbitrarily many PUs in a DMM. Assume that c_i = c_v = c_a. The problem has size n. When answering the following questions you may use asymptotic notation, neglecting constant factors.
      • What is the maximum achievable speed-up? How many PUs do you need for this? What is the efficiency? Which algorithm do you use?
      • What is the maximum achievable speed-up if we insist that the efficiency should not converge to 0 in the limit for n goes to infinity? How many PUs do you need for this? Which algorithm do you use?

    4. The list-ranking algorithm for a circular array with n PUs runs in n - 1 steps. This immediately gives an algorithm for a linear array running in 2 * n - 2 steps.
      • Show that even on a linear array n - 1 steps are sufficient.
      • Extend this idea to derive an improved algorithm for circular arrays running in n / 2 steps.


    
    
    
    

    Trees

    Euler-Tour Technique

    In this section we present a simple technique with a whole bunch of basic but crucial applications solving problems on trees such as rooting the tree and computing the depth of all nodes.

    Construction

    The idea of the Euler-tour technique is to reduce a problem on trees to a problem on lists. This is done by first replacing each undirected tree edge (u, v) by two directed edges (u, v) and (v, u). Each node u numbers its neighbors in arbitrary order. So, these can be denoted v_i, 0 <= i < deg_u, where deg_u denotes the degree of node u. This ordering gives rise to a successor relation between the directed edges: the successor suc(v_i, u) of edge (v_i, u) is defined by
    suc(v_i, u) = (v, u_{(i + 1) mod deg_u}).

    If the edges are numbered with different numbers, which can be performed by computing the prefix sums of the numbers of edges each PU is managing, then the defined successor relation simply gives a linked circular list. This can be turned into a non-circular list, by cutting it at a single place. For example, node 0 can cut the list between its first and last neighbor by setting suc(v_{deg_0 - 1}, 0) = null.

    For a tree with n nodes and n - 1 edges, on a PRAM this construction takes O(log n) time with O(n) work for the prefix computation and O(1) time with O(n) work for the linking. On a DMM with P PUs, the prefix takes O(n / P) * c_i + log P * (c_v + c_a). If we assume that the information concerning edge (v_i, u) is stored in the PU responsible for node v_i, then the linking requires that the PU which is responsible for node u sends a packet to this other PU. Thus on a DMM the cost is dominated by that of an all-to-all routing in which each PU sends and receives one information for each edge of a node it is managing. If there are no extreme node degrees and the nodes are distributed evenly, this amounts to a k-k routing with k = 2 * n / P. More generally, the edges, after numbering them, should be distributed evenly over the PUs.

    Euler-Tour Construction

    Basic Applications

    The Euler-tour technique can be used for several problems by weighting the edges of the list appropriately and then solving a weighted variant of the list-ranking problem. In a normal list-ranking problem the task is to compute for each node u the number of links on the path from the initial node or on the path to the final node, whatever is easier. A weighted list-ranking problem is very similar: the links have weights which are not necessarily 1 for all edges, and the task is to compute the sum of all weights of the links on the path from the initial node or on the path to the final node. Said otherwise, this is the problem of computing a prefix or suffix sum on a list. None of the presented list-ranking algorithms essentially depends on the links being unweighted, though in an optimized version some of them must send slightly smaller information units for the unweighted version. For example, in pointer jumping the distance regularly doubles until reaching the final node. This observation makes it unnecessary to send distance information all the time.

    Tree Rooting

    In many applications, for example as a result of a parallel algorithm for constructing a spanning tree, the tree is just an acyclic connected graph without further structure: there is no root and the edges are not directed. Choosing a root is not hard, the node with the smallest index is a good candidate. But in a forest this presupposes that each node knows to which tree it belongs and for further operations it is also handy if for each node u the node v = parent[u] is known, being the neighbor of u lying on the path from u to the root of its tree.

    Preorder and Postorder

    The preorder and postorder numbers are important for determining the structure of the tree. For example, a DFS spanning tree of a graph is characterized by the fact that there are no "forward cross edges", these are edges running from a node u to a node v, with preorder(v) > preorder(u) and postorder(v) > postorder(u). These numberings can also be used as just a numbering.

    Assume we have a rooted tree as input. Then to compute the preorder numbers, all edges pointing away from the root are given weight +1 the other edges are given weight 0. After running a weighted list-ranking algorithm, a node u can obtain its number from the rank of the edge coming from its parent. Switching the weights, that is giving weight +1 to the edges leading towards the root and 0 to the others computes the postorder numbers.

    Depth

    Computing the depth of all nodes provides important information about the structure of the tree. It also may help to compute distances in the tree. The distances are very easy to compute. Just weight the edges of the Euler tour appropriately: all edges pointing away from the root get weight +1, all other edges get weight -1. After solving the weighted list-ranking problem, each node u can obtain its depth from the rank of the edge coming from its parent.

    Size of Subtree

    The size of the subtree of a node can be determined as a by-product of the computation of the preorder numbering: if the edges pointing away from the root are given weight +1 and all other edges weight 0, then for any node u the difference of the rank of the edge leading from parent[u] to u and that of the edge leading from u to parent[u] gives the number of downward edges in the subtree of u. Each such an edge reaches a new node, so this equals the size of the subtree of u without counting u.

    Inorder

    For a binary tree without nodes of degree 1 it is also handy to have an inorder numbering. This is slightly harder than the above. In a binary tree there are four types of consecutive edges in the list of edge. The default is that an edge has weight 0. Some edges get weight 1 according to the following rules:

    Performing a weighted list ranking on the thus weighted list, a node u can find its inorder number from the rank of the first edge of the pair of edges that changed direction in u.

    Coloring Regular Bipartite Graphs

    Sequential Algorithm

    An edge coloring of a graph is an assignment of numbers to the edges so that no two edges incident upon a node have the same number. In general it is very hard to compute a coloring using a minimum number of colors, but for bipartite graphs this is much easier. Particularly it is generally true that it is possible to construct a coloring with d colors when d is the maximum degree of any node. This is a consequence of Hall's (also called König's) theorem which states that on bipartite graphs there is a matching which matches all nodes of maximum degree. This provides the step in an inductive proof: a coloring can certainly be found by repeatedly constructing such a maximum matching. Surprisingly coloring bipartite graphs can be performed much more efficiently than this. A single matching in a bipartite graph with n nodes and m edges takes O(sqrt(n) * m) time. For a regular bipartite graph of degree g, m = g * n, so repeatedly matching takes O(sqrt(n) * g * m) time. A coloring can be constructed in just O(log g * m) time.

    Consider a bipartite graph with node sets V_1 and V_2, each with n nodes. Assume that the graph is regular of degree g. A first idea for constructing a coloring is to start allocating the colors to the first node of V_1, then the second and so on. When assigning the colors of node i, we should respect the conditions imposed by earlier assigned colors. If one is lucky this may work, but in general this approach will get stuck when we must assign a color c to an edge (i, j) while node j has an edge (i', j) for some i' < i, which was already assigned color c while coloring node i'. Another "greedy" strategy may also work, but not always. The idea is that one tries to assign the colors one by one. The algorithm gets stuck when, while assigning color c, one comes to a node i for which all the uncolored edges lead to nodes which already have an edge that was given color c before.

    A working and efficient strategy is based on Euler splittings. For a g-regular graph G with g even, the graph can easily be split in two edge-disjoint subgraphs G_0 and G_1 each of degree g / 2. This is done by constructing an Euler tour of the graph, numbering the edges on the tour alternatingly 0 and 1. All edges which have been numbered 0 belong to G_0, all other edges to G_1. Because sequentially an Euler tour of a graph with n nodes and m edges can be constructed in O(n + m) time, this splitting costs time O(g * n).

    In general, for a graph of degree g = 2^k, the algorithm consists of k rounds. In round i, 0 <= i < k, the algorithm is working on 2^i subgraphs in which all nodes have degree 2^{k - i}. Each of the 2^i operations in round i takes O(2^{k - i} * n) time, so the total amount of work in any of the rounds is O(2^k * n) = O(g * n) = O(m). Thus, in total over all rounds the algorithm takes O(k * m) = O(log g * m) time. For g which are not powers of 2, the construction of such colorings is considerably harder. After much research it has been established (by Cole, Ost and Schirra in Combinatorica, Vol. 21, 2001) that even the general case can be solved in O(log g * m) time.

    Coloring Regular Bipartite Graphs

    Parallel Algorithm

    How can we turn this idea into a parallel algorithm? In the following we assume that the degree g of the graph is a power of two, the general case being much harder and unsolved as far as we know.

    The basic idea is to turn the problem of finding an Euler tour into a list-ranking problem. For every edge there is a node. These nodes have no predecessor and successor links, but for every edge e, there is a data element left[e] and rght[e]. left[e] gives the edge e' to which e is connected in its left endpoint i in V_1. This decision is made locally, without global coordination in node i. If left[e] = e', then left[e'] = e. rght[e] gives the edge e'' to which e is linked in its right endpoint j in V_2. A node of the list (corresponding to an edge of the graph) which is entered by a left[] link is left through a rght[] link and vice versa.

    The list-ranking algorithms can, with minor modifications also be applied to figure out the minimum on a circular list and the distance thereof. This is now done for both lists (not knowing the global structure, it must be done for both). Of course we do not have a conventional list structure: each node has two "successors": one labeled "left" and one labeled "right". So, what is the appropriate successor? The answer is simple: the path starting over a "left" link must at even distances from the original node traverse "left" links and at odd distances "right" links. Further, notice that, because any cycle on a bipartite graph has even length, for any node e, if the distance to the node with the smallest index is even along one direction then it is also even along the other direction. This observation makes that we now can split the set of nodes (corresponding to the edges of the graph) in two: all those at even distances go into one subset, all those at odd distances into another.

    Summarizing: first each node of the graph arbitrarily links up the edges connected to it in connected pairs; then a set of bidirectionally linked circular lists of a slightly modified nature with g * n nodes is constructed; then two list ranking problems are constructed. Clearly this whole problem can be solved in O(T_listrank(g * n)). Hereafter we have to solve two instances with each half as many edges. In a straightforward implementation this gives

    T_color(P, n, g) = O(T_listrank(P, g * n) + 2 * T_color(P, n, g/2).
    The solution depends on the list-ranking time. For a DMM, T_listrank(P, g * n) = O(g * n / P * (c_i + c_v) + c_a * P). Substituting and solving gives
    T_color(P, n, g) = O(log g * g * n / P * (c_i + c_v) + c_a * g * P).

    This is a good result except for the number of startups. Organizing things slightly differently the number of startups can be minimized: after each splitting of the problem, each of the two subproblems should be concentrated in half of the available PUs. This gives a different recurrence relation:

    T_color(P, n, g) = O(T_listrank(P, g * n) + T_color(P/2, n, g/2).
    Solving this recurrence gives
    T_color(P, n, g) = O(log g * g * n / P * (c_i + c_v) + c_a * P).
    So, we saved a factor g in the number of startups. This improved algorithm will work well if log g * g * n >> P^2, the first version required log g * n >> P^2.
    If there are two independent problems which must be solved on a DMM or an interconnection network with P Pus, then it is mostly better to solve these problems in parallel each on P / 2 PUs than to solve them after each other each on P PUs.

    On a two-dimensional m x m mesh the situation is even much better. Assuming that g * n > m^2, T_listrank(m^2, g * n) = O(T_route(m^2, g * n)) = O(g * n / m). So, we get

    T_color(m^2, n, g) = O(g * n / m) + T_color(m^2 / 2, n, g / 2).
    Because (g / 2) * n / sqrt(m^2 / 2) = (g * n / m) / sqrt(2), after each splitting the subproblems can be solved faster by a constant factor (more accurately: after splitting twice there are four subproblems which can be solved twice as fast on m / 2 x m / 2 meshes). Thus, the sum of the times of all splitting phases is proportional to the time for the first one:
    T_color(m^2, n, g) = O(g * n / m).

    Of course the speed-up of this mesh algorithm is only m = sqrt(P), but this is the normal situation on meshes. There are very few problems like matrix multiplication and all-pairs shortest paths for which full speed-up can be achieved. All but very trivial problems require that at least all data are communicated once over the bisection of the network and therefore full speed-up can be achieved only if the sequential complexity of the problem is sufficiently high. A result like the obtained one is optimal up to constant factors in a two-dimensional world. The basic assumption of the DMM model that locality plays no role is definitely not valid without limitation on P.

    Unbalanced All-to-All Routing

    Balanced all-to-all routing is mostly simple. We have seen how to perform it on a DMM. If every PU is sending and receiving the same number of packets in total, but not the same number to each other PU, the problem is harder. This is not an uncommon situation. For example in sorting algorithms each PU holds initially and finally the same number of packets, but it is not necessarily true that initially each PU holds a uniform sample of the set of numbers. If on a DMM with P PUs the largest numbers all happen to stand in PU_0 all numbers from PU_0 must be routed to PU_{P - 1}.

    We have seen how to solve the unbalanced routing problem using randomization: sending all packets first to a random intermediate destination (or by shuffling the packets in a deterministic way), any routing pattern may be replaced by two almost balanced routing patterns. The main disadvantage of this approach is that it doubles the routing volume. A coloring approach offers an alternative solution which are mostly satisfied in practical contexts.

    We consider a DMM with P PUs. Each PU has to send and receive in total k packets. Let a_{i, j} denote the number of packets PU_i has to send to PU_j. The conditions are

    0 <= a_{i, j} <= k,
    sum_{i = 0}^{P - 1} a_{i, j} = sum_{j = 0}^{P - 1} a_{i, j} = k.
    The entire routing pattern constitutes a bipartite k-regular graph with P nodes on either side. Constructing a k coloring decomposes the routing into k permutation routing problems, which can be routed in (c_v + c_a) * k time. Even when ignoring the excessive time for the coloring, this is not good, because in general this will be much more than c_v * 2 * k + c_a * 2 * P, the time for the randomized coloring.

    Some form of sparsification is the key to the solution. Possibly we might group PUs together, but this inevitably leads to routing packets several times, and then we could just as well apply the simpler randomized routing. So, bundling the packets into superpackets is the only cost reduction we can try to apply. Let s be the size of the superpackets. That is, in PU_i the a_{i, j} packets which have to be routed to PU_j are packet into b_{i, j} = round_up(a_{i, j} / s) superpackets of size s, creating one partially filled superpacket. In total PU_i creates sum_{j = 0}^{P - 1} round_up(a_{i, j} / s) <= sum_{j = 0}^{P - 1} (a_{i, j} / s + 1) = k / s + P. So, this gives a bipartite graph with maximum degree k / s + P. The nodes do not need to have the same degree, but after computing prefix sums dummy packets can be added to make the graph regular. Because within certain limits s can be chosen freely, we may assume that k / s + P is a power of two.

    On this reduced (k / s + P)-regular bipartite graph with P nodes on either side the presented Euler-tour-based coloring algorithm can be applied. This takes O(log(k / s + P) * (k / s + P) * (c_i + c_v) + c_a * P) time. This is a loss term. The second loss term comes from the fact that we have more than k / s superpackets: the subsequent routing takes (k / s + P) * (c_v * s + c_a). We want to determine a good choice for s and a condition on k and P so that all loss terms can be made negligible in comparison with the useful time c_v * k. Clearly P should be negligible in comparison with k / s. This allows to simplify the time for the coloring. Equating the loss terms gives the following equation:

    log k * k / s = P * s.
    Solving gives a close-to-optimal choice for s, namely s = sqrt(k * log k / P).

    Theorem: On a DMM with P PUs any k-k routing can be performed in (1 + o(1)) * k * c_v time if k = omega(P * log P).

    Proof: Taking s = sqrt(k * log k / P), the loss terms are bounded by O(sqrt(k * log k * P) * c_v). If k = x * P * log P, sqrt(k * log k * P) = k * sqrt((1 + log x / log P + loglog P / log P) / x). Clearly for x = omega(1), this is o(k). End.

    So, for unbalanced k-k routing optimality can be achieved for values of k which are only marginally larger than the k = omega(P) required for balanced k-k routing. Practically the only disadvantage is the rather complicated construction.

    Tree Contraction

    Problem Context

    The Euler-tour technique is powerful, but cannot be used to solve all problems on trees. Such a problem is the expression-evaluation problem. In this problem it is assumed that a number is associated with each leaf of the tree, while some operators are associated with any of the internal nodes. The task is to compute for any node u the value of the subexpression given by the subtree rooted at u. Expression evaluation can be solved by performing a tree contraction, the process of gradually reducing the tree until only the root node with the correct value remains.

    Evaluating arithmetic expressions in parallel does not appear to make sense because it is hard to imagine in which context sufficiently large expressions occur, but the expression-evaluation problem models other tree problems which are important and for which even have to be solved for large trees. An example of such a problem is that of computing the height of each node u in a tree, that is, the distance to the deepest lying leaf in the subtree of u. For this problem the Euler-tour technique does not appear to be suitable. As an expression-evaluation problem it can be solved by putting 0 in all leafs and using as operator max + 1, taking the maximum value of the children and adding 1. Another example is that of computing for each node u of a tree the minimum key value in the subtree of u. This problem can be solved as an expression-evaluation problem in which each node computes the minimum of its own value and that of its children.

    For expression trees of moderate depth the evaluation problem can be solved efficiently in parallel by allocating a processor to each node connected to two leafs. This initial processor allocation can be performed using the Euler-tour technique. For each such a node its value can be computed in constant time. When as a result a new node connected to two leafs is generated, the allocated PU is taking over this job, otherwise it becomes passive. In each step the depth of the tree is reduced by one. Thus, for a tree of depth d with n nodes, the time is O(d + log n) and the work is O(n). If the tree is deep and skinny, for example a path, this way of evaluating gives very limited parallelization.

    Raking

    Consider a rooted binary tree with root r. A tree of higher degree should first be replaced by a binary tree. This reconstruction can easily be performed in parallel by allocating to each node computational resources which are proportional to its degree. For each node u != r, parent[u] gives the parent of u and sibling[u] gives the sibling of u. Possibly sibling[u] = null. The rake operation is the basic operation we will use to perform tree contraction. It can be applied to any leaf u of the tree as far as u != r and parent[u] != r. It removes u and parent[u] from the tree and hooks sibling[u] to parent[parent[u]]. Of course it also performs some arithmetic to assure that the value expressed by this tree does not change as a result of the raking. Repeating rake operations, we ultimately get a tree consisting of r and at most two leafs for which the value can be computed in constant time.

    Many rakes can be performed in parallel, but it should be avoided to rake two leafs with the same grandparent. This is not hard to realize. The algorithm starts by consecutively numbering the leafs using the Euler-tour technique. Then the rake operation is first performed on all leafs with numbers 4 * k + 0, for some k, and then on all leafs with numbers 4 * k + 2, for some k. Hereafter each leaf with an odd numbers i is renumbered as (i - 1) / 2, and the process is repeated until there are only two leafs left. Because the raked leafs are always four numbers apart, they can never have the same grandparent. At this point we essentially use that the tree is binary. So, provided we know how to perform a rake operation, this correctly contracts the tree.

    The above tree contraction can be performed very efficiently in parallel: the Euler-tour technique was analyzed before. On an EREW PRAM it can be performed in O(log n) time with O(n) work. Assuming that raking a leaf takes constant time, each contraction phase, consisting of two parallel rakings and a trivial renumbering, takes O(1) time. In each contraction phase the number of leafs is halved, so the whole contraction can be performed in O(log n) time. For some constant c, the work during the contraction is given by sum_t c * n / 2^t = O(n). In the time estimate it is assumed that there are no internal nodes of degree one, otherwise for a tree which has very few leafs only a few nodes disappear in every contraction phase until there is one leaf left. The easiest is to add a dummy leaf to any internal node with degree one. This leaf should get a neutral value so that the evaluated value remains the same for all real tree nodes.

    Expression Evaluation

    Now that we know how to perform raking in parallel expression evaluation can be presented as a special case. Initially any node u gets a pair of parameters (a_u, b_u). If u is a leaf, a_u = 0 and b_u = c_u, where c_u is the constant of this leaf. If u is an internal node a_u = 1 and b_u = 0. Let val(u) denote the value to be computed for node u. For any internal node u, the operator of u, + or * in our case, is denoted by op(u).

    Invariant: At all times for any leaf u, a_u = 0 and b_u = val(u). For any internal node u, with children v and w, val(u) = (a_v * val(v) + b_v) op_u (a_w * val(w) + b_w).

    Initially the invariant condition is satisfied, while raking the values of the parameters must be modified appropriately. Assume we are raking a node v with parent u and sibling w. The nodes v and u are eliminated, so w should take over the values of these three. v is a leaf, thus, according to the invariant, val(u) = b_v op_u (a_w * val(w) + b_w). We distinguish two cases: Hereafter in order to establish the invariant, w plays the role of u.

    The way the algorithm is specified, the correct value will be computed in the root of the tree, but not for all other nodes. In the following examples it will be discussed how to correct this. The idea is analogous to that for computing a prefix sum: the whole algorithm consists of two phases, one running upwards, one running downwards. The difference is now that operations may take place at different levels of the tree in the same parallel operation.

    Applications

    Size of Subtree

    Above we have seen how the problem of computing for each node u of a tree the size of the subtree at u can be solved quite easily using the Euler-tour technique. To illustrate the method we here consider how this problem can also be solved by tree contraction. Practically this does not make sense because the Euler-tour technique is used as a subroutine in the tree contraction.

    This is an easy case. All nodes are given initial weight 1. In addition there are dummy leafs added to all internal nodes of degree one, these get weight 0. The only operation is the addition. It adds the value of both children to the weight. More precisely, when raking u with parent v and grandparent w which have weights a_u, a_v and a_w, respectively, then u and v are removed and a_w is replaced by a_w = a_w + a_v + a_u.

    Notice that the above construction does not compute the size of the subtree for the node v which is skipped over. This problem can be solved easily. After the tree contraction there is a tree expansion, reversing the operations. So, all nodes which are excluded in phase i, i = 1, 2, ..., t, are reinserted in phase i = t, t - 1, ..., 1. After reinsertion phase i we may assume that all nodes which have been reinserted already know their correct values. So, by the time v is reinserted, u and its sibling u' at the time of the exclusion of v have been reinserted and obtained their final values a_u and a_{u'}. So, v can compute its value by setting a_v = a_v + a_u + a_{u'}.

    Height

    The heights can be computed analogously to the size of the subtrees. The differences lie in the details. Initially all nodes are given weight 0. When raking u with parent v and grandparent w, we set a_w = max{a_w, a_u + 2}. When reinserting we should set a_v = max{a_v, a_u + 1, a_{u'} + 1}.

    Lowest Common Ancestor

    Definition and Applications

    Consider a rooted tree T with root r. For any node u, an ancestor is a node w on the path from u to r, u and r included. For a pair of nodes u and v a common ancestor is a node w which is both an ancestor of u and of v. The lowest common ancestor w of u and v is the common ancestor of u and v which lies furthest from r. Thus, this is the node w in which the path from r to u and v forks. w is called the LCA of u and v, also denoted LCA(u, v). In many contexts it is required to compute lowest common ancestors. In the following we first give some examples.

    If the tree represents some kind of evolutionistic development, then the LCA of u and v gives the historical event where two "species" went their own independent way. If the tree T is a spanning tree of some undirected graph and we want to make it more like a DFS tree, then when discovering a cross edge (u, v) it may be clever to replace one of the edges from LCA(u, v) leading to u or v by the edge (u, v).

    The distance from a node u to a node v in a tree is given by the sum of the distance from u to LCA(u, v) and that from v to LCA(u, v). The distance of u to LCA(u, v) is given by the difference of their depths in the tree. So, after computing the depths of all nodes, which can be performed in linear time, the distances between any pair of nodes can be computed in constant time plus the time for for computing the LCA. Replacing the depths by the sum of all weights on the path from the root to a node, this idea can even be used for computing the distances in weighted trees.

    An analogous problem is when for a fixed weighted tree we frequently have to determine the heaviest edge on the path from a node u to another node v in the tree. So, instead of using the sum operator we use the maximum operator here. The difference is that the maximum operator has no inverse, so we cannot simply subtract the contribution from the path between the root and LCA(u, v). In this case we first have to construct an associated tree T'. We describe how this can be done sequentially. We need a union-find data structure in which all nodes are inserted. If T has n nodes, then T' has n leafs in one-one correspondence with the nodes. The leafs all get key value 0. The edges of T are sorted according to their weight and processed in order of increasing weight. For an edge (u, v) of weight weight(u, v), a new node w of T' with two children is created. w has key value weight(u, v). w is connected to the node corresponding to the component of u and the component of v. Hereafter these components are unified and w becomes the node representing this component. Hereafter, for any pair of node u and v the weight of the heaviest edge on the path from u to v in T is given by the key of the LCA of u and v in T'.

    Finding Heaviest Edge on Path in Tree

    The above examples show that for several applications it is desirable to be able to answer the question "what is the LCA of u and v" for two nodes u and v of a tree. For any single such query the problem can be solved easily by walking towards the root from u and v marking the nodes on the path as visited. The first node that is reached for the second time is the LCA. If this walking is done by alternatingly making a step on the path from u and the path from v, the total number of visited nodes is proportional to the length of the path from u to v. The marks can be set back in the same time. This is clearly optimal and if there are several such queries, then these can be performed in parallel without problem (though concurrent reads may occur).

    However, if there are very many such queries, it is better to preprocess the tree so that the queries can be faster than by walking along the tree. The problem of preprocessing the tree so that subsequent queries can be performed in constant time is called the LCA problem. So, the topic of this section is not how to parallelize the queries, which is trivial, but to show how this preprocessing, which sequentially can be performed in O(n) time for a tree with n nodes, can be performed in parallel.

    Special Cases

    There are two special cases for which the LCA is easy. If the tree is a path, then it suffices to compute for each node its depth. LCA(u, v) = u if depth(u) <= depth(v), otherwise LCA(u, v) = v. This is not a very interesting case.

    More important is the case that the tree T is a perfect binary tree with n = 2^k - 1 nodes for some k > 0. The key idea is to perform an infix numbering of this tree, giving number 1 to the leftmost leaf. So, the node numbers are 1, ..., 2^k - 1. Let w = LCA(u, v) be the LCA of two nodes u and v. Let (u_{k - 1}, ..., u_0) and (v_{k - 1}, ..., v_0), be the binary expansions of the numbers of u and v respectively. Let x(u, v) be the index of the most-significant bit of u and v which are different. Using this notation, the binary expansion of the number of w can be expressed easily: w_i = u_i for all i > x(u, v); w_{x(u, v)} = 1; w_i = 0 for all i < x(u, v). Of course it is not necessary that the number of nodes is exactly a power of two minus one.

    LCA on Perfect Binary Tree

    Theorem: For a perfect binary tree T with n = 2^k - 1 nodes a numbering of the nodes can be computed in O(log n) parallel time and O(n) work so that based on this numbering for any pair of nodes u and v of T a query LCA(u, v) can be solved in O(1) sequential time.

    The idea for the perfect binary trees can be extended to all trees which are sufficiently like perfect binary trees. The sequential linear time algorithm is based on such an extension. We do not further consider this construction but assume the following as a fact:

    Theorem: For any tree T with n nodes a tree T' of size O(n) can be constructed in O(n) time so that based on the information contained in T' for any pair of nodes u and v of T a query LCA(u, v) can be solved in O(1) time.

    Reduction to Range-Minima Problem

    In this section the general LCA problem will be reduced to the problem of computing the minimum in subranges of an array. Interestingly, in the next section we will show how this latter problem can be solved by the special case of the LCA applied to perfect binary trees.

    The construction is again based on the Euler-tour technique. First an Euler tour is constructed running along all edges in the tree, starting and ending in the root node r. This tour is used to construct arrays index[] and depth[]. For a tree with n nodes each has length 2 * n - 1. index[i] gives the index of the node that is visited after traversing i edges of the tour. index[0] = index[2 * n - 2] = r. depth[i] gives the depth of node index[i]. It can be computed by weighting all list edges leading away from the root with +1 and all other edges with -1. Running a list-ranking algorithm, each of these arrays can be computed in O(log n) time and O(n) work.

    From index[] and depth[] we compute the arrays first[] and last[]. These arrays have length n. For all u, 0 <= u < n, first[u] = min_{0 <= i < 2 * n - 1} {index[i] = u}, that is, it gives the first occurrence of u on the tour. Analogously, last[u] gives the last occurrence of j on the tour. Because the tree is connected all these values are well defined. For a leaf u, first[u] = last[u]. For an internal node u first[u] < last[u]. On an EREW PRAM these arrays can be constructed in O(1) time and O(n) work: For all i, 0 <= i < 2 * n - 1, it is checked whether depth[i - 1] < depth[i] or not. If this is the case, we know that the tour came to node u = index[i] from above. So, this must be the first visit to node u. In that case the testing PU sets first[u] = i. If depth[i + 1] > depth[i], this is the last visit to node u = index[i] and the testing PU sets last[u] = i.

    For any pair of nodes u and v, there are five mutually exclusive possibilities:

    1. u = v;
    2. first[u] < first[v], last[v] < last[u]: u is an ancestor of v, LCA(u, v) = u;
    3. first[v] < first[u], last[u] < last[v]: v is an ancestor of u, LCA(u, v) = v;
    4. last[u] < first[v], LCA(u, v) is neither u nor v.
    5. last[v] < first[u], LCA(u, v) is neither u nor v.
    Only the last two cases need to be considered further. The last is analogous to the first, so it suffices to consider case 4: < first[u].

    Lemma: If last[u] < first[v], LCA(u, v) = index[j], where j = min_{last[u] < i < first[v]}{depth[i] = d}, where d = min_{last[u] < i < first[v]}{depth[i]}.

    Proof:

    Let d and j be as defined and let w = index[j]. All nodes on the tour between last[u] and first[v] have depth larger than w. So, the tour has not left the subtree rooted at w. Thus, all these nodes, particularly u and v are descendants of w, that is, w is a common ancestor of u and v. Now consider another node w' which also lies on the subtour lying between last[u] and first[v]. Suppose w' is an ancestor of u. The tour is constructed so that once it leaves a node to a higher lying node it does not return. This holds in particular for w'. So, only after having visited all nodes in the subtree of w' the tour reaches w. These nodes in the subtree of w' will not be visited again. We know that hereafter the tour reaches v at least one more time because first[v] > j. Thus, v does not lie in the subtree of w' and thus w' is not a common ancestor. We conclude that on this subtour w is the only common ancestor of u and v. End.

    Reduction of LCA to Range-Minima

    The lemma reduces the problem of computing LCA(u, v) to testing several simple cases, and if necessary computing the minimum value in a specified interval of an array of length O(n). Finding the minimum in a subrange of an array can be solved easily in parallel: for a subrange of length n' it can be performed in O(log n') time and O(n') work. This shows that any single LCA query can be performed in O(log n) time and O(n) work. This is not a trivial result but in the above presented context we are striving for something else: we want to preprocess the tree so that LCA queries can be performed in O(1) time. So, after the reduction this means that we should preprocess an array so that queries for the minimum in a subrange of the array can be performed sequentially in O(1) time. The question how to achieve this is called the range-minima problem.

    Range-Minima Problem

    For an array of length n there are (n over 2) = Omega(n^2) subranges. That sounds like a too big number. A clever and rather general idea is to reduce any kind of range searches to prefix and suffix searches. In this case the range-minima problem will be solved by constructing a set of arrays containing prefix and suffix minima from which the desired value can be easily recovered.

    Suboptimal Algorithm

    For an array a[] of length n, the array p[] of prefix minima is defined by p[i] = min_{0 <= j <= i}{a[i]}, for all 0 <= i < n. The array s[] of suffix minima is defined by s[i] = min_{j <= i < n}{a[i]}. Suppose we have already computed the prefix minima p_1[] for the first half of an array a[] and the prefix minima p_2[] for the second half of a[]. Then the array p[] of prefix minima of the whole array can be computed in O(1) time and O(n) work setting
    p[i] = p_1[i], for all 0 <= i < n / 2;
    p[i] = min{p_1[n / 2 - 1], p_2[i - n / 2]}, for all n / 2 <= i < n.
    In an analogous way the array of suffix minima s[] can be constructed from arrays s_1[] and s_2[] of suffix minima for each half of a[].

    Assume that n = 2^k for some k >= 0. For other n, we can if necessary round up the length of the array to the next power of two. Let p_{l, m}[], 0 <= l <= k, 0 <= m < n / 2^l, be the array of prefix minima for the subarray of a[] of length 2^l starting at position m * 2^l. Define s_{l, m}[] to be the analogous array of suffix minima. By the above recursive construction all arrays p_{l, m} and s_{l, m} can be constructed in O(log n) time and O(n * log n) work:

    T(n) = T(n / 2) + O(1);
    W(n) = 2 * W(n / 2) + O(n).
    Once all these arrays have been computed, a range-minimum query can be performed in O(1) sequential time.

    We start with an example. Suppose we want to determine min_{393 <= i <= 438}{a[i]}. Then we first write these numbers u and v in binary: u = 393 = 00110001001, v = 434 = 00110110010. We have added a few leading zeroes to show the generality of the idea. Then we determine the most-significant different bit. In our example it occurs at position 5. The number w which is obtained by taking v and setting the 5 least-significant bits to zero is w = 0011010000 = 416 = 13 * 32. The number we are looking for is min{s_{5, 12}[9], p_{5, 13}[18]}. Here 23 = 393 - 12 * 32 and 18 = 434 - 13 * 32.

    More generally, for numbers u and v, u <= v, the most-significant different bit x = x(u, v) is determined. Then the number w is constructed by taking v and setting the least-significant x bits to zero. Let d = 2^x and l = w / d. The definition of w and d is so that (l - 1) * d = w - d <= u and (l + 1) * d = w + d > v. The number we are looking for is min{s_{x, l - 1}[u - (l - 1) * d], p_{x, l}[v - l * d]}. This is so, because s_{x, l - 1}[u - (l - 1) * d] = min_{u <= j < w}{a[j]} and p_{x, l}[v - l * d] = min_{w <= j <= v}{a[i]}.

    It is interesting to notice that the number w is just the index of the LCA of nodes u and v in a perfect binary tree with inorder indexing with smallest index 1, the special case of the LCA problem treated above. This shows the close relation between the LCA problem and the range-minima problem. In an implementation it is not necessary to make this tree structure explicit, the set of prefix and suffix arrays can be managed by a double indexed array of arrays of integers. The given computation shows which two values in this construct must be accessed to compute the range minimum from. On a RAM this takes constant time.

    Lemma: For an array a[] of length n, in O(log n) time and O(n * log n) work a data structure requiring O(n * log n) storage can be created which allows it to solve range-minima queries on a[] in constant time.

    Optimal Algorithm

    The above construction is simple and once it is ready it allows to work with efficiently, but the work and the storage are too large. Using accelerated cascading the work and memory consumption can be made linear, which is optimal. Actually for the range-minima problem there is an even faster parallel algorithm, running in O(loglog n). This faster algorithm is not considered here because it does not make the complete algorithm for the LCA problem faster.

    To reduce the work and memory consumption, the array a[] of length n is divided in n / log n subarrays of length log n each. For each of these subarrays an optimal sequential algorithm is applied. For each of the subarrays this takes O(log n) time and work. So, the parallel time is O(log n) and the work is O(n). The created structure allows to perform range-minima queries inside any of the subarrays in constant time. Along with this construction we can also construct an array b[] of length n / log n consisting of the minima in each of the subarrays. The above given suboptimal algorithm is now applied to b[] rather than a[]. This takes O(log n) time and O((n / log n) * log(n / log n)) = O(n) work. So, the whole construction takes O(log n) time and O(n) work.

    It remains to specify how this two-stage structure can be used to perform range-minima queries in constant time. For numbers u and v, u <= v, belonging to subarray i_u and i_v, respectively, we distinguish two cases (with a subcase in the second):

    Now the construction is perfect except that we have not told how to sequentially preprocess an array in linear time so that range-minima queries can be performed in constant time.

    Theorem: For an array a[] of length n, in O(log n) time and O(n) work a data structure requiring O(n) storage can be created which allows it to solve range-minima queries on a[] in constant time.

    Corollary: For a tree with n nodes, in O(log n) time and O(n) work a data structure requiring O(n) storage can be created which allows it to solve LCA queries for any pair of nodes in constant time.

    Computing Most-Significant Different Bit

    The above results are based on the assumption that the most significant different bit of two numbers u and v can be computed in constant time. Computing z = u ^ v, the bitwise exor of u and v, this problem is reduced to determining the leading non-zero bit of a number, said otherwise the problem of computing intlog(z) = round_down(log_2 z). Unfortunately, this integer logarithm is not a standard operation, even though it would be easy to realize in hardware and even though it is quite important.

    For k-bit numbers, any single computation of intlog() can trivially be performed in O(k) time. Using a binary-search strategy, based on comparisons and shifts, this can easily be reduced to O(log k) time. Using bit parallelism further improvements are possible, but we do not obtain a constant-time operation this way. If we have to perform many computations of intlog(), it may be better to first compute an array intlog[] of length n = 2^k containing all values. A later call intlog(i) can then be answered in constant time by returning intlog[i].

    Sequentially all n values of intlog[] can easily be computed in O(n) time:

      void computeIntlog(int[] intlog, int n)
      {
        intlog[0] = -1; // Artificial value
        intlog[1] =  0; 
        for (int i = 2; i < n; i++)
          if ((i & (1 << (intlog[i - 1] + 1) == 0)
            intlog[i] = intlog[i - 1];
          else
            intlog[i] = intlog[i - 1] + 1;
      }
    

    The given subroutine cannot easily be parallelized. There are several alternative constructions. If optimal work would not be a major concern, then we could apply any of the sequential algorithms. For a sequential time T(n), this gives work T(n) * n. We know that T(n) = O(loglog n) or less. It is very simple to make the algorithm optimal: first compute intlog[i] for all values 0 <= i < sqrt(n). This takes T(n) time and O(sqrt(n) * T(n)) = O(n) work. Then for any number i, sqrt(n) <= i < n, set intlog[i] = intlog[i >> (k / 2)] + k / 2. This takes constant time and O(n) work.

    Theorem: If T(n) = O(loglog n) denotes the time for sequentially computing the value of intlog(i) for any number 0 <= i < n, then the array intlog[] with intlog[i] = intlog(i) for all i, 0 <= i < n, can be computed in parallel in O(T(n)) time and O(n) work.

    Exercises

    1. For a tree T the task is to number all leafs consecutively, that is, the numbering should be so that for any node u the set of all numbers of the leafs equals {l, l + 1, ..., h} for some l and h. Describe how the Euler-tour technique can be used to compute such a numbering.

    2. Sequentially, coloring a bipartite graph can also be solved by repeatedly determining a maximum matching. Matching is more expensive than coloring, but different from the Euler-tour-based splitting technique it can even be applied for graphs of odd degree.
      • Why should Euler-tour splitting not be applied for graphs of odd degree?
      • The task is to color a bipartite graph with n nodes in both subsets of nodes and degree g with the minimum number of colors. If possible Euler-tour splitting should be applied, otherwise matching. How many matchings may be needed at most?
      • Reduce the number of matchings as far as possible. Hint: two disjoint regular subgraphs of degree g_1 and g_2 respectively can be combined to one subgraph of degree g_1 + g_2.

    3. In the raking procedure it was used that the tree is binary. Trees of other degrees can be first made binary, but if the degree of any node is bounded by some constant d this is not necessary. Describe how to generalize the algorithm. Express the time consumption for a tree with n nodes in n and d.

    4. Suppose we have a tree in which all nodes have degree at most two. Each node holds a key and for each node u we would like to compute the minimum value of the keys in the subtree rooted at u, u included. In the text it was suggested to use the tree-contraction method for this. Work out the details, particularly describe how to handle the internal nodes of degree one. For a tree with n nodes the algorithm should run in O(log n) time and O(n) work on an EREW PRAM independently of the tree structure.

    5. Assume we have a black-box subroutine which allows to compute the minimum value m in a specified interval [l, h] of an array a[] of length n. Describe how this routine can be used to compute an index j, l <= j <= h so that a[j] = m. We assume that the range-minima subroutine only works on integer arrays, so that we cannot simply add the index as secondary key to the values of a[]. Distinguish the case that it is given that all values of a[] are different and the case that values may be the same.

    6. This question deals with computing the least-significant different bit. All considered numbers have at most k bits for some even k > 0. n = 2^k. It is supposed that there is an array intlog[] of length n containing for all i, 0 <= i < n, the value of round_down(log_2 i).
      • Describe how to construct in O(n) sequential time an array reverse[] of length n containing for all i, 0 <= i < n, the value of the reversal of i, reverse(i). If i_j denotes bit j of i, then reverse(i)_j = i_{k - j - 1}. Hint: Start with i = 0 and reuse earlier computed values.
      • Use the values in intlog[] and reverse[] to define a constant-time operation lsdb(,), computing the least-significant different bit for any pair of numbers.
      • Show how to compute reverse[] in parallel with at most O(n) work. Try to obtain a running time of O(loglog n).

    
    
    
    

    Graphs


    
    
    
    

    Strings


    
    
    
    

    Computational Geometry


    
    
    
    

    Numerical Algorithms


    This page was created by Jop Sibeyn.
    Last update Monday, 06 September 04 - 09:19.
    For any comments: send an email.