Preface and Notation

This document contains the text of a course in parallel structures.

As html is not suited for mathematical formulas, some additional notation is used (as used in the typographical package Latex). a_i denotes a with subscript i. a^i denotes a to the power i. <= and >= are used as in most computer languages. Curly brackets, "{" and "}", are use to group things. sum_{i = 0}^j denotes the sum for i running from 0 to j. The same may also be written sum_{0 <= i <= j}. ~= stands for "approximately equal" and ~ for "proportional to". Greek letters are written out. For logical expressions we either use the notation from C or the operators are written out in text. So, "a && !b" is the same as "a and not b". sqrt stands for the square root function and log without specifying the base number for the logarithm to basis 2. There are a few more notations, but they should be understood easily. New notions are printed bold there where they are defined. Inside the chapters particularly important ideas and short key notes are highlighted without introductory text. There are many pictures. These are intended to be self-explanatory. Typically they are placed just after the text fragment to which they belong, generally there is no direct reference to them in the text.

There are very few references to the literature. Clearly most of the presented material is not new. A large part is even common knowledge. More directly this text is based on material found in the following two books:

The first was used for most of the PRAM algorithms, the second for some results on sorting. Several students have contributed by pointing out errors and spots which required better explanation.

Notice: the following text may be overcomplete. At the examination it is expected that the students only know all that has been presented during the lectures.

Table of Contents





Introduction

Parallel Computing

Introduction

Processors become faster at a tremendous rate. For the last 30 years the clock rate has been doubling about every 18 months. However, there are convincing arguments that this trend cannot continue another 30 years: processors must have a certain minimal size and information cannot travel faster than the speed of light. Nevertheless, the demand for computing power will continue to increase. Even right now there are problems which are at the same time very urgent and very time consuming, weather forecasting is a prominent example.

To meet both kinds of demands, future and current, it is natural to consider using more than one processor for solving a single task. Because processors not only become faster but also cheaper this becomes more and more realistic. As soon as there is a demand for it, the production of multi-processor chips will start. Right now the newest processors are still expensive, but processors of the previous generation are cheap: If one strives for the maximum number of clock cycles per money unit, then clearly one should not buy the top model, but rather four with half the speed. Using P processors instead of 1, we may hope to become P times faster, but this is rarely achieved. Therefore, it is not sure that four 2 GHz processors will solve a problem faster than a single 4 GHz processor.

If we know how to exploit the computational power of a system with several processors, such a system is financially attractive and technologically possible.

Definition: A computer with several processors is called a parallel computer. Parallel computing is the (study of) the usage of parallel computers in order to solve tasks faster.

Task Parallelism

The simplest way to use parallel computers is to let several processes be solved by several processors. This can both be done for a multi-user system, as well as for a single-user system. Large data servers often have many processors: the incoming data requests are scheduled over the processors and each request is handled by a single processor. Such systems are commercially employed and may have hundreds of processors. On a single-user system, the various processes of the user may be allocated to several processors. At a lower level, the various kinds of instructions may be executed by several processors. The degree of parallelism is rather bounded this way, because a single user typically runs only one time-demanding process, and instructions cannot be subdivided arbitrarily.

Even though technologically different, this kind of parallelism can be characterized as being task parallelism. The most important theoretical question in this context is how to schedule the tasks, that is, how to allocate the tasks to the processors so that an even load balancing is achieved. The load balancing problem has several variants. For a fixed set of tasks, it is a common goal to minimize the make span, the time to complete all tasks, respecting interdependencies if any. In an online context, the goal is more reasonable to try to minimize the completion time, the time before any job has been completed, or to minimize the time before any job starts being done.

In practical applications until now task parallelism is by far the most important branch of parallel computing. However, from an algorithmic point of view it is not very interesting because there is not so much one can do. Therefore, this is not the topic of this lecture. Only here we will briefly discuss the load-balancing problem.

If the lengths of the tasks are known in advance, then the problem of how to schedule a set of tasks so that the make-span is minimized is NP-hard. A reasonable heuristic is to sort the tasks according to their lengths and to allocate in this order to the processor which until then has the least work to do.

If the lengths of the tasks are not known, then the best one can do is to always allocate a task to the processor which currently has the shortest queue. This is the normal scheduling principle applied in supermarkets. In this context there is the issue of task migration: at certain costs it may be possible for tasks to be reallocated in order to achieve a more even work load.

A problem with this approach is that it requires that the lengths of all queues are centrally known: in a huge supermarket with 30 cash-sites, one will typically not check all of them, but simply take the locally shortest queue. If the central allocating agent does not know the loads, then the best he can do is to perform the allocation in a random way. Queuing theory tells that this is nevertheless quite good: if a processor on average gets allocated k tasks, the deviation from this expected value is bounded by O(sqrt(k * P * log(k * P))). Much better balance is achieved if the allocator always first asks the queue lengths of two randomly selected processors and then allocates the next task to the one with the shorter queue.

Perfect balance is achieved if the scheduling is postponed: a processor fetches a next available task as soon as it is ready with the previous one. This is the scheduling principle applied in modern post offices. The disadvantage is clear: the waiting tasks have to be stored in a central storage and cannot be handed out immediately. Furthermore, the processor will be idle while sending the request for a next task and waiting for this task to arrive.

"True" Parallelism

The main topic in this lecture is the question how a single computational task can be efficiently solved using more than one processor. As said above, ideally, with P processors the problem is solved P times faster, but this is rarely achieved.

For an analogy consider a ship wharf. In earlier days work was done in a very labour intensive way. In the rich countries, using better equipment, the number of working hours for building a ship has been reduced considerably. Productivity is not doubling every 18 months, but even in this domain there is an exponential increase, justifying the noticeable salary increases of the past 100 years. Nevertheless, it still takes many thousands of working hours to build a supertanker. Unfortunately the company which ordered the ship normally does not want to wait 20 years. The solution is well-known: a wharf building a supertanker typically employs thousands of workers. Of course having many workers around, there must also be people not actually building the ship but leading the work, managing the company and cleaning the buildings. It also takes time to get instructions from the management down to the people on the work floor, and similarly there may be a need for workers to communicate with the management (paint is running out!) or among each other (we are ready with our section of the ship!). Even worse, the process cannot be speed-up arbitrarily, because there is a fixed order of steps, which cannot be rearranged arbitrarily. Some of these steps require many more workers than others, and it thus becomes inevitable that during part of the process some workers are idle. It may even happen that due to bad organization people are standing around though there is work they could do. Or people are working, but because they did not receive precise instructions, they are drawing cables which later have to be removed again, or which were already installed by someone else. Finally, given that one wants to minimize the construction time, the chief engineer may decide to reschedule the work. For example, it is clearly practical to not start to paint the ship before it is completely welded together, but maybe the painters can already start painting where ever something gets ready. They waste time while climbing the scaffoldings many times and cleaning there equipment.

Let us summarize the aspects which lead to the fact that P workers will not build a ship P times faster than a single worker (provided he has all the skills and is able to physically do the job):

All these aspects arise also if a single problem has to be solved by a set of P processors. Communication and how to minimize the cost of it is a central issue on computers with a distributed memory, but will be abstracted away in most of this lecture. The scheduling is an important problem in general, but we will design the algorithms so, that in most cases it is trivial how to schedule the arising subtasks. Useless work can normally be prevented, double work may be reasonable in some contexts, but will not play a big role in this lecture. The central topic of this lecture is the design of algorithms which allow to use as many processors as possible, while loosing at most a constant factor in efficiency. So, we accept that a parallel algorithm is more work-intensive by a constant factor, but normally not more than that. This does not mean that we will only look at such problems, we will see how to determine the maximum of n numbers in O(1) time using O(n^2) processors, but this is the focus. Quite often we will first only consider the time: given arbitrarily many processors, how fast can the problem be solved. Then we will try to reduce the number of processors employed so that the algorithm becomes optimal in the above sense.

Definitions

We introduce some important notions describing the performance and quality of parallel algorithms and programs. In the remainder of the text we will mostly write PU, an abbreviation for "processing unit'', instead of processor. The time of a parallel program solving a problem of size n on a computer with P PUs is typically denoted T(P, n). For an algorithm the same notation may be used to denote a O()-estimate with respect to some computational model.

The cost C(P, n) of a parallel algorithm solving a problem of size n on a computer with P PUs is defined as C(P, n) = P * T(P, n). The cost is to be distinguished from the work W(n, P) which equals the sum over all PUs of the number of performed operations. The difference is that the work does not take idle time into account.

The speed-up S(P, n) gives the number of times the parallel program is faster than the sequential program. So, S(P, n) = T(1, n) / T(P, n), for some definition of the sequential time T(1, n). For T(1, n) there are several possibilities, which are all in use:

Especially with the last notion of T(1, n), the speed-up expresses how the algorithm scales: if the speed-up is close to P (or Omega(P) in theoretical contexts), one says that the algorithm scales well. If the speed-up is much lower than P (or o(P) in theoretical contexts), one says that the algorithm scales poorly.

The efficiency E(P, n) of an algorithm is related to the speed-up: it is given by E(P, n) = S(P, n) / P = T(1, n) / (P * T(P, n)) = C(1, n) / C(P, n). So, it is the ratio of the costs, whereas the speed-up is the ratio of the times. A parallel algorithm is called optimal if E(P, n) = Omega(1), that is, the cost of the parallel algorithm is at most a constant factor larger than that of the best sequential algorithm. A study of the efficiency is especially important in practical contexts: Often the efficiency starts to decrease quite strongly from a certain P value on. The typical reason for this is that when P becomes too large for the value of n, the communication costs (which often increase with P^2) become dominating, and any further increase of P will hardly lead to a decrease of the overall time. Particularly on a multi-user system with P = 32 one should not run an application with an efficiency of under 10%.

Super-Linear Speed-Up

Above it was stated several times: the goal of parallel computation is to achieve speed-up P using P PUs and that this goal cannot be achieved due to several kinds of overhead. There are, however, contexts in which it is possible to solve a problem more than P times faster using P PUs than when using a single PU of the same type. The reason is that P PUs also have P times as much cache and P times as much main memory. On a modern computer cache misses are heavily penalized, costing several hundreds of clock cycles. Page misses even cost on the order of 10^7 clock cycles.

As an example we consider sorting 10^10 integers, requiring 40 GB of storage. On a single PU, having only 1 GB of main memory, this will cost about 12 hours. This time is estimated as follows: 40 GB must be read and written twice, a typical transfer rate between main memory and hard disc is 8 MB. So just for the data transfer we need 160 * 10^9 / (8 * 10^9) = 20,000 seconds. If we have 64 of these PUs interconnected by a switch giving 2 Gbit point-to-point connection the time becomes much better. We must assume that originally the data stand distributed over the PUs. After sending around a sample of negligible size, all data must be routed over the network once before each PU has to sort 156 * 10^6 integers. The routing takes a few seconds: each PU holds data with a total size of 5 * 10^9 bits. 2 * 10^9 bits can be routed per second. The sorting takes less than 100 seconds on a 3 GHz processor. So, we achieve a speed-up of 200 with 64 PUs.

The described phenomenon of solving a problem more than P times faster using P PUs is called super-linear speed-up. It is a true phenomenon, and it is practically important, but it does not really tell something about the quality of the algorithm. Clearly, very bad parallel algorithms will not achieve any speed-up, not even in an extreme context. But, comparing with an unreasonably slow sequential algorithm does not provide a fair basis for evaluation: if in the above example the joint 64 GB of main memory would be available to a single PU, then this PU could have performed the sorting in about 50 times the parallel time. In that case the efficiency of the parallel sorting algorithm would have lain around 0.90 and not around 5.0.

Too-good-to-be-true speed-ups are obtained by either comparing with sub-optimal sequential algorithms, or by comparing with sequential algorithms which are slowed down by slow access to a large data set.

Cost Models

Multitude of Models

In a sequential context there is a single widely accepted cost model: the von Neumann unit-cost model. The basic assumption of this model, that all operations are equally expensive, is far from exact, but in this case convenience tends to prevail over accuracy: accessing the main memory is far more expensive than operating on data residing in the registers or cache. This may be important in practice, but in theoretical studies this distinction is mostly ignored. As long as the complexity of algorithms is given in O()-notation, this is even formally correct as all operations take at most a constant number of clock cycles.

In parallel computing the situation is different: there are several cost models none of them dominating. The reason for this multitude is that the simplest of the models is felt to model reality too poorly to be acceptable. The more precise models are considered to be too hard to design algorithms for and in part also considered not to be of long lasting validity. Another reason for the multitude of models is also that the underlying hardware shows much more variability than the platform on which sequential computation is founded.

The most fundamental distinction in parallel computation is on how the memory is accessed: whether there is a shared memory to which all PUs have access in constant time, or whether there is a distributed memory so that each PU has direct access only to its own local memory. In the first case data can be exchanged by "parking" some information at some fixed position making it accessible to the other PUs. In the second case the data are exchanged by some form of communication between the PUs.

Another important distinction is whether the PUs are working synchronously or asynchronously. In a synchronous model, all PUs have access to a central clock. In practice systems are mostly asynchronous: each PU computes at its own speed. The speed differences may be small, but in a parallel computer not necessarily all PUs are of the same type. In principle the speed differences can be arbitrarily large. In practice, considering that there is mostly a non-negligible increase of the overhead with P, it normally does not make sense to combine PUs which are too different. Only for trivial problems, such as searching for extra-terrestrial life in radio signals or finding very large prime numbers, it is profitable to add 100 MHz PUs to a system that also contains 3 GHz PUs.

Historically parallel computers have been distinguished in SIMD, Single Instruction Multiple Data, and MIMD, Multiple Instruction Multiple Data, machines. On an SIMD machine, which is synchronous, all PUs execute the same instruction at all times. Because they execute these instructions on different data, it is still possible to obtain parallelism. The idea behind SIMD machines, is that if all PUs execute the same instruction, there is no need for the PUs to generate these instructions themselves: there is a central host computer connected to all PUs which tells them what instruction to execute next. This implies that the PUs can be much simpler.

In the 1980s there were many parallel computers of this type. An impressive example was the Masspar. This parallel computer consisted of a grid of very simple PUs. The PUs had primitive 4-bit arithmetic, a few kilobytes of storage and were slow. But, they were cheap. This implied that research institutes could afford to buy Masspar machines with 65536 PUs in a 256 x 256 grid. More recently no parallel computer with a comparable number of PUs has been build.

In an MIMD machine all PUs in principle may have their own program. In practice all will execute the same program, but due to conditional instructions, they will not always execute the same instructions. All modern parallel computers, in which the CPUs are mostly conventional PC CPUs, are of this type. Because most parallel algorithms become less efficient if the number P of PUs increases, it is more cost effective to buy 32 expensive but fast PUs than 1024 cheap but slow ones. Only for specific numerical applications architectures like the Masspar might still make sense.

In the following we briefly discuss the most important models. In the last section of this chapter we consider several problems from various perspectives. In the following chapter we present some basic algorithms for interconnection networks. Thereafter, we will only consider the simplest and most theoretical model, which allows to concentrate most strongly on the underlying parallelization ideas.

Distributed-Memory Model

In the distributed memory model it is assumed that the PUs are interconnected by some network without specifying the topology. More precisely, it is assumed that the topology plays a minor role and can thereby be neglected. Nevertheless any exchange of data must be performed by communication through this network.

Distributed Memory Computer with 4 PUs

It is assumed that the cost of communication is linear in the amount of data sent or received plus some constant cost. So, under this assumption the time consumption of a computation is estimated as follows:

T_tot = c_i * t_i + c_v * t_v + c_a * t_a.
Here c_i, c_v and c_a are constants whose values depend on the state of technological development. c_a is known under the name start-up latency. t_i gives the number of internal operations; t_v gives the total amount of data sent by a PU, this we will call the routing volume; and t_a gives the total number of packets sent by a PU. Typically a computation is divided in steps, and it would be more correct to define the above values as the sums over all steps of the maxima over all PUs of the number of each type of operations. This model will be referred to as DMM model, where DMM is an abbreviation for Distributed-Memory Machine.

The technological justification for the DMM model is that in modern parallel computers the communication is performed by wormhole routing. This means that if PU_i wants to send something to PU_j, a path from PU_i to PU_j is cleared, and then the data are transferred along this path flit by flit. a flit is a small data unit. The idea is that these flits run through the network like the segments of a worm: first all of them are standing in PU_i, then the first flit enters the network. By the time the first flit reaches PU_j several other flits, depending on the number of hops from PU_i to PU_j, have entered it. In PU_j the flits pile up and are recomposed. In this way the transfer time is something like d + l, where d gives the number of hops from PU_i to PU_j and l gives the number of flits. The time for the d hops is mostly negligible, much more important being the almost fixed start-up time.

Currently c_i ~= 0.3 * 10^-9. The values of c_v and c_a vary much more depending on how much money was invested on the interconnection infrastructure. In some cases the network is virtually unbounded, in that case the bounding factor is the speed of the memory bus of the individual PUs: clearly data cannot be pumped into the network faster than the rate at which they can be supplied by the memory. This means that we may have c_v = 10^-9 (when t_v indicates the number of integers). On the other hand, the PUs may also be coupled very loosely. For example, they may be connected by fast ethernet or worse, giving up to c_v = 10^-6. c_a can vary accordingly. In ideal cases we have c_a = 10^-7, but it may also be orders of magnitude larger. Notwithstanding all variations, we will work under the general assumption that c_i is considerably smaller than c_v, which in turn is orders of magnitude smaller than c_a.

t_i, the number of internal operations, cannot be estimated accurately because of the obscure internal operation of a modern processor. However, in many cases, it is possible to compare it with the number of operations performed by a sequential algorithm. For example, we will encounter sorting algorithms in which it is clear that t_i(P, n) = (1 + o(1)) * t_i(1, n) / P. Here t_i(P, n) denotes the number of internal operations performed by a PU when solving a problem of size n on P PUs. For such an algorithm, as long as we can guarantee that the communication terms are negligible, we may expect to achieve very high efficiencies.

Here we touch on the main point in the context of DMM computation: the contribution c_v * t_v + c_a * t_a to T_tot is perceived as the loss. For most applications t_i(P, n) decreases linearly with P, whereas particularly t_a(P, n) tends to increase at least linearly with P. So, in order to obtain acceptable efficiency, P must not be chosen too large in comparison to n. It depends both on the time complexity of the problem and on the development of t_a(P, n) how large P can be chosen, or said otherwise, for how small n it still makes sense to use P PUs.

The DMM model is a compromise between accurately modeling the time consumption of a parallel program and simplicity. It takes into account that sending small packets should be avoided, but neglects locality aspects.

Interconnection-Network Model

A much considered cost model is based on the assumption that the PUs are interconnected by a network in some given topology. While in the DMM model it is assumed that this topology can be ignored, it is here assumed that the structure of the network plays a significant role and algorithms will be designed with the network topology in mind.

4 PUs Connected by Ring Network

The time consumption of algorithms for interconnection networks is mostly expressed in steps. A step consists of a certain amount of computation followed by communication. Often there is no limit on the amount of computation that can be performed in a step. The communication is ruled by the precise model assumptions, but typically it is assumed that one packet of unit size can be transferred over each connection in each step. These packets can neither be divided nor composed. This cost model will be referred to as interconnection-network model.

The main assumption is that PUs can only communicate with neighbors and that longer distance communications are achieved by forwarding packets along some path connecting the source PU with the destination PU. This is called store-and-forward routing. Due to the dominance of wormhole routing, this assumption does not reflect the current technological conditions. There are nevertheless good reasons to consider a cost model based on it:

The model is further subdivided according to whether PUs can communicate with all neighbors at the same time, the all-port model, or only with one of them, the single-port model. Sometimes it is specified that the connections are uni-directional (also called half-duplex), in that case a PU cannot send to a PU from which it is also receiving. Otherwise the connections are said to be bi-directional (also called full-duplex).

The interconnection-network model neglects the aspects of computation. Thereby it allows to concentrate on the fundamental aspects of data exchange.

Shared-Memory Model

The main model of parallel computation is the shared-memory model. In this model all PUs have direct access to a common memory. Such an access is assumed to cost constant time. In addition the PUs have a local memory in which they can store local data.

Shared Memory Computer with 4 PUs

When considering computers with a shared memory, it is important to specify which types of access to the memory are allowed:

In the last case there are several possible ways of handling write conflicts, leading to a further subdivision of the model:

Depending on the underlying hardware, any of these models may be realistic. They are not far apart anyway: the most powerful of these models, a step of a Maximum CRCW machine with P PUs can be simulated on an EREW machine in O(log P) steps. The main parallel computer model considered in this text is the PRAM model. A PRAM, Parallel Random Access Machine, is a synchronous parallel computer with a shared memory. When specifying the performance of a PRAM algorithm we will mostly specify the memory access model, so we may say "On a CREW PRAM ... ".

It would be most correct to formulate a PRAM algorithm including load and store operations specifying which data have to be copied from the global memory to the local memory and vice versa. However, the existence of a local memory is mostly ignored, specifying algorithms as if all the PUs operate directly on the shared memory itself. This is similar to the practice of writing sequential algorithms without mentioning the memory management. The main reason why the PRAM model is so popular is that it allows to concentrate on the work to perform.

The shared memory model entirely neglects the aspects of data exchange. Thereby it allows to concentrate on the fundamental aspects of parallelizability.

Work-Time Framework

The PRAM model is simple, but the work-time framework allows to concentrate even more on the underlying concepts. The idea is to not specify the number of involved PUs: an algorithm is composed of steps, and in each step there are operations to be executed in parallel. Such an instruction may be of the type "for all i, 0 <= i < n, a[i] = b[i] + c[i]". In the literature one may find this expressed also by a pardo instruction.

An algorithm is evaluated with two cost parameters: the time, that is, the number of parallel steps; and the work, the sum over all steps of the number of performed operations. Specifying the performance of an algorithm in this way leaves the allocation of the work to the PUs unsolved. In principle it should be possible to perform a step in which w operations are executed on a PRAM with P PUs in round_up(w / P) time, but only if each PU knows which operations to execute.

If the work-allocation problem can be solved, then a WT algorithm gives a PRAM algorithm: if W(n) and T(n) are the work and time for a problem of size n, respectively, then

  T(P, n)  = sum_{step i} round_up(w_i / P) 
          <= T(n) + sum_{step i} w_i / P
          <= T(n) + W(n) / P,

  C(P, n)  = P * T(P, n)
          <= P * T(n) + W(n).

These relations immediately suggest a close to optimal choice of P for a given value of n: take P = P(n) = W(n) / T(n): for that value T(P, n) <= 2 * T(n) and C(P, n) <= 2 * W(n). So, for this choice of P, the completion time is at most twice the minimum value, and the cost is at most twice as large as the work. This is a good compromise: taking P larger, C(P, n) increases rapidly, without giving a substantial reduction of T(P, n). Taking P smaller gives substantially larger T(P, n), without giving a substantial reduction of C(P, n).

If there is a parallel algorithm in the work-time framework solving a problem of size n with work W(n) and time T(n), then, provided the work allocation can be solved, there is a PRAM algorithm with time O(T(n)) and cost O(W(n)).

Importance of PRAM Algorithms

PRAM Simulation

A common way of obtaining DMM algorithms, is to simulate PRAM algorithms. In this section we describe how this is done. This is a rich topic. Here we only mention how the problem can be solved in most cases occurring in practice, only touching on the topic of non-trivial memory allocations which must be used if one wants to guarantee good performance.

Suppose we have a PRAM algorithm solving a problem with P_PRAM PUs in T_PRAM time. We want to execute it on a DMM with P_DMM PUs. It is understood that P_DMM << P_PRAM. In many cases P_DMM must be fairly small in relation to the problem size in order to obtain speed-up. The easiest is to let PU_i, 0 <= i < P_DMM, of the DMM perform the work executed by the PRAM PUs with indices k, i * P_PRAM / P_DMM <= k < (i + 1) * P_PRAM / P_DMM.

Of course, this is not all. In the first place there is no guarantee that in this way the DMM PUs are equally loaded, even though they take over the work of equally many PRAM PUs. This is called the load balancing problem. For example, when simulating the PRAM algorithm for computing the sum, more and more PRAM PUs become idle. If things are arranged so that at all times the first PUs remain active, then the above processor allocation concentrates all the work unnecessarily on a few DMM PUs. In the case of the sum algorithm this can be prevented by letting the work of PU_j of the PRAM be executed by PU_{j mod P_DMM} of the DMM. However, we strive for a blind PRAM simulation, not one in which every algorithm has to be studied before we can decide how to allocate the PRAM PUs to DMM PUs. A possible solution is to randomly allocate the PRAM PUs to the DMM PUs, for example, by using a universal hash function. This does not exclude bad allocations, but makes them unlikely.

Another point is the memory allocation. In some cases one might in principle distribute the complete data set to all DMM PUs, and then compute on these. This is normally not a good practice: this requires much communication and much memory at the PUs. One of the strong points of a parallel computer is that the joint main memory is much larger than that of most single-processor machines, thereby allowing to solve large problems without exceeding the memory size. Furthermore, there are not so many problems which can be split in non-interacting subproblems. So, normally the data set is distributed over the PUs in such a way that each data item is stored in a single PU.

Often it is good enough to use a regular allocation like for the PRAM PUs. In such a regular allocation, data item i is either stored in DMM PU i mod P_DMM or in i * P_DMM / number_of_items_to_allocate. If now a PRAM PU which is allocated to DMM PU p needs a data item stored in DMM PU p', it sends a request to PU p' and waits for the reply. A typical way of progressing is that each PRAM step is decomposed in three substeps:

  1. Send requests for data
  2. Wait for the data to arrive
  3. Perform a computation step
Because each DMM PU is simulating P_PRAM / P_DMM PUs, and for most computational steps a constant number of data are needed, a common situation is that there are in total O(P_PRAM / P_DMM) requests sent by each DMM PU. In view of the high start-up costs for a routing operation, all requests going to the same PU must be bundled before sending. This requires a bucket sort with P_DMM buckets.

A possible communication schedule for sending all requests is to have P_DMM - 1 communication steps: in step t, 1 <= t < P_DMM, PU p sends to PU (p + t) mod P_DMM. In this way each PU is sending and receiving only one packet in each step. If the pattern is balanced this works fine. A pattern is said to be balanced if each DMM PU is approximately sending the same number of requests to any other PU.

If the requests are balanced, then also the answers are balanced. If there are O(P_PRAM / P_DMM) requests, then with the above schedule each transferred packet has size O(P_PRAM / P_DMM^2). So, one such communication round costs O(P_PRAM / P_DMM) * t_v + 2 * (P_DMM - 1) * t_a. From this it becomes clear that we must have P_PRAM / P_DMM^2 >> 1 if we do not want that the start-up costs dominate the overall time consumption.

The condition P_PRAM / P_DMM^2 >> 1 coincides with the condition that in most cases will guarantee that the pattern is balanced. The average packet size is P_PRAM / P_DMM^2 and if we may assume that the requests are randomly distributed over the PUs, the size of a packet is given by a random distribution with expected value S = P_PRAM / P_DMM^2. Such a distribution is tightly concentrated around its expected value. Chernoff bounds give that the deviations from the expected value are bounded by O(sqrt(S * log(P_PRAM))) with high probability.

However, a fixed data allocation can normally not guarantee that the communication pattern is balanced. An approach which makes it very likely that for P_PRAM / P_DMM^2 >> 1 all arising communication patterns are balanced is to allocate the data in a random way just as we did with the PUs. Again a hash function should be used, otherwise it would be hard to figure out where a data item is stored. Nevertheless, evaluating a hash function implies considerable overhead in comparison to a sequential algorithm which can access its data directly. If such a construction is applied, the efficiency drops considerably.

In some cases the problem is more fundamental: if the same data item is used by many PRAM PUs, then there is no way of allocating the item so that the communication pattern becomes balanced. The PU holding this item will be flooded by requests, and applying the above approach will give very poor performance, because all requests are handled sequentially. A clever algorithms designer will know whether such patterns occur in his/her application or not and mostly no action is needed.

If the problem of having many requests for a small subset of data items really occur or may occur, then one must do something about it. A practical solution is to first sort all requests on the index of the data item they are destined for. Sorting is not for free, but normally we can apply bucket sort because the range of values is limited and the amount of involved communication is also quite limited as we will see in a later chapter. Then, for all requests to the same data item one is chosen as representative. This representative is forwarded to the item. The answer is returned and replicated to all requests. Then the routing pattern of the sorting is reversed.

PRAM simulation is the link between a library of PRAM algorithms and programs running on actual machines. Only some elementary communication algorithms must be implemented in a machine-specific way.

Importance of Minimizing PRAM Time

The central question in the complexity theory of sequential algorithms is the P == NP question. The corresponding question in the complexity theory of parallel algorithms is the P == NC question. NC is the class of all problems which with a polynomial number of PUs can be solved in polylogarithmic time, that is, for a problem of size n in O(log^k n) time, for some constant k. Most of the problems considered in this text belong to NC. For example,

Above we noted that often there is a considerable discrepancy between the bounds in the PRAM and the DMM model. This does not mean that it is useless to try to design fast PRAM algorithms: simulating a PRAM algorithm with a running time T normally leads to a DMM algorithm with T communication rounds. A small number of communication rounds is a good thing to have, because this implies a smaller value of the number t_a of start-ups.

On the other hand, the number of start-ups is not the only cost factor. Consider again the formula giving the time consumption of an algorithm running on a DMM:

T_total ~= c_i * t_i + c_v * t_v + c_a * t_a.
This means that the best parallel algorithm is mostly the one that minimizes t_a without substantially increasing t_i or t_v. From this we obtain a strong and interesting reduction of the set of practice-relevant PRAM algorithms:
Only cost-optimal PRAM algorithms have potential practical relevance. Most promising are cost-optimal algorithms with minimal running time.




Interconnection Networks and Routing

Introduction

The PRAM and the DMM are simple and can be defined and described in a single paragraph. For interconnection networks the situation is different: there is a whole body of literature just describing and analyzing network topologies. Also there is extensive literature on the various communication problems on interconnection networks. In this chapter we first give a brief general introduction focusing on a small set of widely used interconnection network topologies and the most important communication problems. Then we go slightly deeper into some slightly more advanced topics which are based on some very interesting ideas with application beyond the precise scope for which they are presented. Because the communication problems which must be solved on DMMs are the same and because most algorithms are directly inspired by communication algorithms for interconnection networks, we also consider routing on DMMs.

Algorithms for interconnection networks are mostly analyzed using the interconnection-network cost model, in which it is assumed that in each step each PU may perform an unlimited amount of communication and then route one unit packet to one or more of its immediate neighbors.

Most algorithms for this cost model deal with the various communication problems, rather than to solve a general problem, say maxflow. The reason for this is that the communication problems are already hard enough, and that harder problems can often be solved quite well by taking a parallel algorithm designed for a more powerful parallel computer model, and then to simulate the operations of this computer on the interconnection network. Such a simulation is composed of two components: an allocation of the data and processes to the PUs and performing the resulting routing operations.

Next to the various communication operations also some of the most basic problems have been studied for interconnection networks. These are sorting, selection, list ranking, matrix multiplication and the like. In a following chapter we will even see how to solve the all-pairs shortest-paths problem optimally on a two-dimensional mesh.

Topologies

Here we introduce some of the most important interconnection-network topologies, fixed classes of interconnection patterns. The degree of a PU in a network is the number of other PUs it is connected to, corresponding to the degree of the node if the network is identified with the underlying graph. Interconnection networks are normally considered to be undirected: if PU_i is connected to PU_j, then PU_j is also connected to PU_i. The most important quality measures of an interconnection network are its

Complete Networks

A natural network topology is the one whose underlying graph is the complete graph. In such a complete network there is a connection between any pair of PUs. This topology is normally studied for the one-port model, while full-port capability on a complete network being unrealistically powerful.

In hardware a complete interconnection network with one-port capability can be realized with a crossbar switch. For interconnecting P PUs, a crossbar switch consist of a grid of P x P switches. These switches have four connections, W, N, E, S. The switches have two settings, either connecting W with E and N with S or connecting W with N and E with S, which can be chosen dynamically. Setting the switches appropriately, point-to-point connections can be created. Notice that using a crossbar switch each PU can send a packet to at most one PU and receive a packet from at most one PU, but that the PU to which it is sending may be different from the PU from which it is receiving. Switch-based complete network topologies are very common in practice. But, because of the increasing delays and cost such systems are typically of moderate size.

8 x 8 Crossbar Switch

Grid-Like Networks

Another very simple network topology is the linear array. In this topology each of the n PUs has one or two neighbors. Between any pair of PUs, (i, i + 1), 0 <= i <= n - 1, there is a connection. So, the degree of the linear array is 2. Its diameter is n - 1 and the bisection width is 1.

A similar network is the circular array which is also called ring. The only difference with the linear array is an additional connection between PU_0 and PU_{n - 1}. This connection is often designated as wrap-around connection. The degree is 2, the same as before. However, the other measures are better: the diameter is round_down{n / 2} and the bisection width is 2. This implies that most problems can be solved on a circular array twice as fast as on a linear array.

An n_1 x n_2 mesh consists of n_1 * n_2 PUs arranged in a grid. Each PU is connected to its up to four neighbors. As a graph, a mesh is the direct product of two linear arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter n_1 + n_2 - 2 and the bisection width is min{n_1, n_2}.

An n_1 x n_2 torus is a mesh with wrap-around connections: there are additional connections between PU_{i, 0} and PU_{i, n_2 - 1} for all i, 0 <= i < n_1, and between PU_{0, j} and PU_{n_1 - 1, j}. As a graph, a torus is the direct product of two circular arrays with n_1 and n_2 nodes, respectively. The degree is 4, the diameter round_down(n_1 / 2) + round_down(n_2 / 2) and the bisection width is 2 * min{n_1 , n_2}. On tori most problems can be solved twice as fast as on meshes.

There are also higher-dimensional meshes and tori. A d-dimensional n_1 x ... x n_d mesh consists of n_1 * ... * n_d PUs arranged in a d-dimensional grid. Each PU is connected to its at most 2 * d neighbors. The degree is 2 * d, the diameter is n_1 + ... + n_d - d and the bisection width is n_1 * ... * n_d / min{n_1, ..., n_d}. The definition and properties of d-dimensional tori are analogous.

Linear Array, Circular Array, Mesh, Torus

Tree-Like Networks

A tree is a network consisting of 2^{k + 1} - 1 PUs arranged and interconnected as a perfect binary tree of depth k. So, there are k + 1 levels of PUs, with 2^j PUs in level j, 0 <= j <= k. The root, the PU in level 0 has degree two, the leafs, the PUs in level k have degree one, all internal nodes, the PUs at all other levels have degree three. The degree is 3, the diameter 2 * k and the bisection width is 1.

The major weakness of a tree network is its very small bisection width. This means that any problem requiring a single rearrangement of the data has at least linear running time. For such problems a tree is not better than a linear array. Therefore several improved trees have been designed. The simplest is the fat tree. In a fat tree the connections between level i and level i + 1, where the leafs are considered to lie at level 0, have capacity 2^i. This means that such a connection can transfer 2^i packets instead of 1.

Alternatively tree structures have been integrated in other networks. An important example of such a network is the mesh of trees. An n x n mesh of trees is constructed from an n x n grid of not-connected nodes. The PUs in each row and column of this grid are connected by a tree. There are no direct connections between the row trees or between the column trees. The row and column trees are connected at their shared leafs. An n x n mesh of trees has 2 * n * (2 * n - 1) - n^2 = 3 * n^2 - 2 * n nodes.

8 x 8 Mesh of Trees

Cube-Like Networks

A k-dimensional hypercube consists of n = 2^k PUs arranged in a k-dimensional grid with side length 2. Each PU is connected to its k neighbors in the grid. The degree is k = log n, the diameter is k and the bisection width is n / 2.

Hypercube

A k-dimensional cube-connected cycle, abbreviated CCC, network is obtained by taking a k-dimensional hypercube and replacing each node by a cycle with k nodes. Node d, 0 <= d < k, in each cycle takes over the connection of the hypercube along the d-direction. More precisely, the b * 2^k nodes of the CCC are indexed with pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k), (i, (d - 1) mod k) and (i ^ 2^d, d). Here i ^ 2^d denotes i + 2^d if bit d of i is 0 and i - 2^d if this bit is 1. The degree is 3, the diameter is round_down(5 * k / 2) - 2 for k >= 3. A CCC is the simplest network which combines a logarithmic diameter with constant degree. The bisection width is the same as in the corresponding hypercube, 2^{k - 1}.

A k-dimensional butterfly network is similar. The k * 2^k nodes are indexed as pairs (i, d), 0 <= i < 2^k, 0 <= d < k. Node (i, d) is connected to the nodes with the following indices: (i, (d + 1) mod k) and (i, (d - 1) mod k). Furthermore there is an undirected edge from (i, d) to (i ^ 2^d, (d + 1) mod k). So, the degree is 4. The diameter is round_down(3 * k / 2). Because many router chips provide four connections, and because of the good routing properties of butterflies these networks have been used quite a lot in practice.

Four Dimensional CCC and Butterfly

Cayley Graphs

The concept of Cayley graphs gives a generic way of constructing many different graphs. For a finite group G and a generator set S which is closed under inversion, a graph (V, E) is defined as follows: Because the set S is a generator set, the graph is connected. Because it is closed under inversion, for any edge (g, g * e), there is also an edge (g * e, g * e * e^{-1}) = (g * e, g), and thus the graph is undirected.

Cayley graphs are much nicer than just connected and undirected: they are even node symmetric. A graph is said to be node symmetric, if for any pair (u, v) of nodes there is an automorphism phi = phi(u, v) of the nodes so that if (x, y) is an edge that then also (phi(x), phi(y)) is an edge. For Cayley graphs, these automorphisms are easy to construct: determine elements e_1, ..., e_k so that e_1 * ... * e_k * u = v, then phi maps any x in V to x * e_1 * ... * e_k. If (x, y) in E, this means that y = x * e for some e in S. Thus, (phi(x), phi(y)) = (phi(x), phi(x * e)) = (phi(x), e_1 * ... * e_k * (x * e)) = (phi(x), (e_1 * ... * e_k * x) * e) = (phi(x), phi(x) * e) in E.

Several of the graphs we have encountered before are Cayley graphs: circular arrays, tori, hypercubes and CCCs, though these can be presented in a more natural way. We describe how hypercubes can be obtained as Cayley graphs. For H_k, the k-dimensional hypercube, is obtained from a subgroup of the permutation group of 2 * k elements. There are k generators each having two consecutive elements exchanged. To make the description more concrete we only consider the special case k = 4. The set S of generators consists of (10234567), (01324567), (01235467) and (01234576). The subgroup consists of all elements obtained by starting with (01234567) and letting these elements work. If in the conventional indexing of the hypercube we find a 1 at position i, 0 <= i < k, the elements 2 * i and 2 * i + 1 in the indexing with a permutation are exchanged. For example, (0110) corresponds to (01325467).

Conclusion

Linear and circular arrays are very weak networks. Only for very simple problems it is possible to achieve speed-up on them. The main reason to consider them is that many algorithms for meshes and tori can be viewed as generalizations of 1-dimensional algorithms. Meshes and tori have a relatively high diameter and relatively small bisection width. O(sqrt(P)) is mostly the maximum achievable speed-up with P PUs. On the other hand are these networks easy to construct and arbitrarily scalable and extendible. A tree has small diameter, but too small bisection width. Only problems which require very little data to be exchanged can be solved efficiently on them. On hypercubes many important problems can be solved optimally. The problem with these is that the degree is not constant. Furthermore, in a two-dimensional layout, due to their large bisection width, they require an area which is quadratic in the number of PUs n, which implies that some adjacent PUs are standing sqrt(n) or more apart. So, if the physical distance signals must travel is the determining factor, hypercubes are not better than meshes. The conclusion is that none of the networks is entirely satisfying, and this is precisely the reason that all of these and many more are studied.
For interconnection networks, there is a trade-off between quality on the one hand and complexity and scalability on the other.

Embeddings

There is a multitude of considered interconnection networks and there is a multitude of problems. It would be inefficient to study each problem for each network. This is often not necessary. In the first place many algorithms can be generalized or modified. This is a common practice between networks of a similar nature such. Algorithms for two-dimensional meshes can often be generalized for higher-dimensional meshes. Algorithms for hypercubes after modification often work also for other interconnection networks with logarithmic diameter and high bisection width. A more formal way of carrying over algorithms from one network to another is by using embeddings.

An embedding of an interconnection network IN_1 with P_1 PUs into a network IN_2 with P_2 PUs is a mapping phi() of the PUs of IN_1 to the PUs of IN_2. Generally it is part of the definition that phi() should be injective. Not only the PUs should be mapped, but for any two PUs PU_i and PU_j in IN_1 connected by a link there should also be specified a path in IN_2 running between phi(PU_i) and phi(PU_j). There are three cost measures that are used when specifying the quality of an embedding:

Any step of IN_1 can be simulated on IN_2 with at most dilation * congestion steps. Thus, the work on IN_2 exceeds the work on IN_1 by at most a factor dilation * congestion * expansion. As long as all these numbers are constants, an optimal algorithm for IN_1 automatically gives an optimal algorithm for IN_2.

As a first example we consider the embedding of a torus into a mesh. This embedding is obtained by first blowing up the n x n torus to a 2 * n x n torus and folding it double along the middle row, so that one PU comes in each row, and then repeating this for a folding along the middle column. The following picture illustrates the process. Notice that each pair of PUs that previously was adjacent now is at distance two at most: the dilation is two. Also, are there at most two paths mapped to any link of the mesh: the congestion is two. For the special case of dilation and congestion both equal to two, any step of the embedded network can be simulated in two steps. So, any problem which on an n x n torus can be solved in T communication steps can be solved on a mesh in 2 * T steps.

Embedding a Torus in a Mesh with Dilation Two

Another important example is the embedding of a binomial tree with n = 2^k nodes in a hypercube. The construction is simple: a binomial tree with 2^k nodes, T_k consists of two trees T_{k - 1} whose roots are connected. A k-dimensional hypercube consists of two (k - 1)-dimensional hypercubes with connections between all corresponding nodes, in particular between the nodes (0, 0, ..., 0) and (1, 0, ..., 0). So, the embedding can be performed recursively. The dilation, congestion and expansion are all 1: the graph of the binomial tree is a subgraph of the hypercube, obtained by removing edges.

Embedding Binomial-Tree in Hypercube

Two types of embeddings are particularly important: the embeddings of the linear and circular array. The embedding of a linear array, which is nothing but the construction of a Hamiltonian path, is important because it provides a way of indexing the PUs so that any two PUs with consecutive indices are adjacent in the network. Because many algorithms contain steps in which data must be exchanged to PUs with consecutive indices, this is a very desirable property. A systematic way of indexing the PUs of an interconnection network is called an indexing scheme. The embedding of a circular array, which is nothing but the construction of a Hamiltonian cycle, is important because algorithms may even require data to be shifted in a cyclical way.

In any grid it is easy to construct a Hamiltonian path. A two-dimensional mesh is Hamiltonian, which means that it has a Hamiltonian cycle, iff at least one of the sides has even length. Next to the indexing schemes obtained by embedding linear and circular arrays there are several other schemes for indexing the PUs of a grid. The most important is the so-called row-major indexing scheme. In this scheme, PU_{i, j} in an n x n mesh, which is located in row i and column j, is given index i * n + j. Mostly the rows and columns are numbered starting from the upper-left corner as in a matrix. In the column-major indexing scheme, this PU has index i + j * n. The indexing obtained by embedding a linear array, starting at the upper-left corner, then always going along the rows first, is called the snake-like row-major indexing. There are also so-called shuffled indexing schemes. The most important is the shuffled row-major indexing scheme. In this scheme, PU_{i, j}, for numbers i and j with binary expansions given by (i_{k - 1}, ..., i_i, i_0) and (j_{k - 1}, ..., j_1, j_0) respectively, has an index with binary expansion (i_{k - 1}, j_{k - 1}, ..., i_1, j_1, i_0, j_0).

Indexing Schemes

On a hypercube a Hamiltonian path can be constructed by exploiting the recursive structure of the network. H_k denotes the k-dimensional hypercube. The traversal of all PUs of H_k is obtained by gluing together two traversals of H_{k - 1}. If the gluing is done correctly, this even gives a Hamiltonian cycle. The schedule is known as Gray code. It can be described by giving the order in which the PUs are visited. It is easy to verify that in the following sequences all consecutive PUs, including the first and last, are adjacent:

for H_1 we have {0, 1},
for H_2 we have {00, 01, 11, 10},
for H_3 we have {000, 001, 011, 010, 110, 111, 101, 100}.

How to generalize the above sequences? Look at the sequence of H_3. It starts with the sequence of H_2 in which all indices have an additional leading 0. Then this same sequence appears once more in reversed order in which all indices have an additional leading 1. More generally, denoting the Hamiltonian sequence for H_{k - 1} by S_{k - 1}, the Hamiltonian sequence S_k for H_k is the concatenation of 0-S_k and 1-reversed(S_k). Here, for a sequence S and b in {0, 1}, b-S denotes the sequence obtained by preceding all elements in S by a b.

Lemma: The sequences S_k defined by S_1 = {0, 1} and S_k = 0-S_{k - 1} + 1-reversed(S_{k - 1}), where + denotes concatenation, give Hamiltonian enumerations of the PUs of the k-dimensional hypercubes H_k.

Proof: The proof goes by induction. For H_0 the construction is clearly correct. So, assume it is correct for H_{k - 1}. If S_{k - 1} is a Hamiltonian cycle, then so is reversed(S_k). The sequences 0-S_{k - 1} and 1-reversed(S_{k - 1}) are paths through H_k because of the recursive way in which H_k is constructed out of two connected copies of H_{k - 1}. It remains to check that these two paths fit together. The first element of S_{k - 1} and the last element of reversed(S_{k - 1}) are the same. So, the first element of 0-S_{k - 1} and the last element of 1-reversed(S_{k - 1}) differ in exactly one position, the leading one, and are therefore the corresponding PUs are connected n the hypercube. End.

Hamiltonian Cycle in Hypercube

Embeddings provide a major way of deriving algorithms for one topology from algorithms designed for another topology. Embeddings of linear and circular arrays lead to indexings for which consecutively indexed PUs are adjacent.

Routing

Considered Problems

Communication is a precondition for any computation in which several PUs jointly solve a common task rather than each solving their own task. Even in the latter case some communication is normally needed to hand-out the tasks in the beginning and to collect the answers in the end. In principle, in the course of a computation, we may encounter any pattern of source-destination pairs. However, because most algorithms have a fairly regular structure, there are some patterns which occur particularly frequently. The most frequently occurring patterns are the following:

Each of these operations has its own importance. All-to-all routing is the pattern encountered when there is no specific pattern in the data access: the PUs compute a while on the data they have and by doing so accumulate questions to be sent to other PUs. Unless the algorithm was designed in a specific way, these questions will be more or less randomly distributed over the PUs, giving rise to a more or less balanced all-to-all routing pattern. A permutation-routing pattern arises in algorithms that where designed in such a way that the questions are not scattered over the data set but concentrated. Such a structure is desirable because it reduces the number of start-ups. Broadcasting is a very common operation performed to distribute information from a source PU to a (subset) of PUs that must work on the same data. Typically this happens at the beginning of a computation. The symmetric operation is called gathering. Gathering means that information from all PUs is combined to a single information in some specified PU. For example, this may be done in order to sum up the computed results. Gossiping is used to make local samples of the information available to all PUs. It is also used to synchronize a database and in numerical communications. Because gossiping is a very expensive operation, it will normally be performed only for a small subset of the total data set. Another much studied problem is k-k routing, in which each PU is the source and destination of at most k packets. Both permutation routing and all-to-all routing are special cases of k-k routing. In practice general k-k routing patterns are rarely encountered.

Permutation Routing

Two-Dimensional Meshes

Consider routing a permutation on a two-dimensional n x n mesh. There is a very simple algorithm which works quite well:
  1. Route all packets within their rows to their destination columns.
  2. Route all packets within their columns to their destinations.
This algorithm is called the greedy routing algorithm, also called xy routing. The description is not complete though: in addition to saying how the packets move, a routing algorithm should also specify what is done if more than one packet wants to use the same connection at the same time. Due to our model assumption, only one can be transferred, the others must wait. So, we must specify a conflict-resolution strategy. In many cases the most natural strategy works best: the packet which has to move farthest gets priority. This strategy is called the farthest-first strategy. Also it should be specified whether the first routing phase is completed before the second phase is started, or that the phases are coallesced. Normally it is understood that the phases are separated, though in practice it will be faster and simpler to coallesce them.

How good is the greedy algorithm? It is easy to see that during phase 1 the packets run without any delay: the packets moving rightwards do not change there relative distances as long as they are moving and therefore they will never encounter each other. The same is true for the leftwards moving packets. So, phase 1 takes as long as the longest distance a packet has to go. The maximum distance that may have to be covered in a row is n - 1. As a result of the routing in phase 1, the packets may have piled up in some PUs. For example if the permutation is the transposition, in which PU_{i, j}, 0 <= i, j < n, sends a packet to PU_{j, i}, then n packets reside in all diagonal PUs at the end of phase 1. So, the queue size, that is, the maximum number of packets that may have to be stored in a single PU at the same time, may be as high as n. Thus, during the second routing phase there may be conflicts. However, due to the applied farthest-first strategy, can a packet which has to move upwards to a PU at distance d from the boundary of the mesh be delayed at most d times. Because the distance this packet has tot travel is bounded by n - d - 1, it will arrive no later than in step n - 1. So, even phase 2 will be completed in n - 1 steps, and therefore the whole algorithm is completed in 2 * n - 2 steps which is optimal. Much more sophisticated algorithms have shown that it is possible to achieve the same time with constant queue size.

Higher-Dimensional Meshes and Hypercubes

On higher-dimensional meshes and hypercubes the same algorithmic idea can be applied: the packets are routed dimension-by-dimension. This routing is called dimension-ordered routing and is widely applied in practice. On two-dimensional meshes this may give long queues but the time is optimal as we have seen above. However, on higher-dimensional meshes and hypercubes it can be very bad. Consider for a k-dimensional hypercube with n = 2^k PUs a generalization of the transposition, in which the packet residing in a PU with index (i_0, ..., i_{k - 1}) has to be routed to the PU with index (i_{k - 1}, ..., i_0). If dimension-ordered routing is applied, then all packets starting in the PUs with indices (i_0, ..., i_{k / 2 - 2}, 1, 0, ..., 0) will be routed through the PU with index (0, ..., 0, 0, 0, ..., 0). That is, 2^{k / 2 - 1} = sqrt(n) / 2 packets are routed through a single PU and have to leave it over a single connection. Thus, the routing takes Omega(sqrt(n)) time, which is not better than routing on a two-dimensional grid.

In general we hope that permutation routing can be solved in a time equal to or only slightly larger than the diameter. In any case we would like to route permutations in O(diameter) time. The above algorithm for k-dimensional hypercubes achieves this for most permutations, the expected time is O(k), but there are a few permutations for which the performance is unacceptable. Further down we will see how to overcome this problem.

DMMs

On a DMM permutations can be routed in a trivial way: each PU sends its packet directly to the destination PU. Because the whole pattern is a permutation it is guaranteed that this can be performed without conflicts. If the packets have length l, this takes c_v * l + c_a time. This is clearly optimal because at least one start-up is needed and all data must be transmitted.

All-to-All Routing

All-to-all routing is the working horse of general-purpose parallel computation. Its cost depends on the bisection bandwidth of the interconnection network. If the provided bandwidth is insufficient, then whole classes of problems can not be solved efficiently. Because the routing pattern is fixed and regular, it is not hard to design optimal algorithms for most topologies.

Two-Dimensional Meshes

On an n x n mesh each PU is the source and destination of n^2 packets, one for each of the PUs. Applying the greedy algorithm works fine now. Each PU holds n packets with destinations in each of the n columns. These packets with common destination might be bundled. In each row n / 2 * n^2 / 2 = n^3 / 4 packets must be routed over the bisection. At the end of phase 1, each PU holds again n^2 packets, n of them with destination in each of the PUs of its column. These packets with common destination might again be bundled. The algorithm takes n^3 / 2 routing steps. This is twice the time given by the bisection bound, the lower bound which is obtained by dividing the number of packets that must cross the bisection in one direction (in this case n^4 / 4) by the bisection width (in this case n).

The given algorithm is uni-axial, that is, in any step either only horizontal or only vertical connections are used. If the mesh has full-port capacity, then this is wasteful. However, it is often possible to overlay two uni-axial algorithms. In this particular case the solution is very simple: all packets going from PU_{i_x, i_y} to PU_{j_x, j_y} with i_x + i_y + j_x + j_y even are routed as described, while the other half of the packets are routed first along the columns and then along the rows. Said otherwise: if the PUs are colored black and white as on a chess board, all packets routed between fields of the same color are routed xy, while the others are routed yx. Each of the problems involves half as many packets as before and therefore the whole routing takes only n^3 / 4 now, which is optimal.

This coloring idea is very useful, because it allows to concentrate on designing uni-axial algorithms. This strongly facilitates the design. Of course this idea is not limited to two-dimensional meshes, but can be applied on any network in which separate dimensions can be distinguished, most notably meshes and tori of any dimension and hypercubes.

Higher-Dimensional Meshes and Hypercubes

On a d-dimensional n x ... x n mesh, the algorithm consists of d phases. In phase i, 0 <= i < d, the routing is performed along axis i. In each routing half of the packets must cross the bisection. So, each phase takes n^{d + 1} / 4 time. In total the algorithm takes d * n^{d + 1} / 4 time. This algorithm is uni-axial. Applying a coloring with d colors it can be made optimal: a packet routed from PU_{i_0, ..., i_{d - 1}} to PU_{j_0, ..., j_{d - 1}} is given color (i_0 + ... + i_{d - 1} + j_0 + ... + j_{d - 1}) mod d. A packet with color c is routed along the (c + i) mod d axis in routing phase i, 0 <= i < d. So, the original problem is decomposed in d independent problems each involving a fraction 1 / d of the packets. So, the total routing time of this full-port algorithm is n^{d + 1} / 4.

In the context of all-to-all routing, a k-dimensional hypercube with n = 2^k PUs can be perceived as a k-dimensional mesh with side length equal to two. The uni-axial algorithm runs in k * 2^{k + 1} / 4 = k * n / 2. Overlaying k of these algorithms each taking care of a fraction 1 / k of the packets, the routing time can be reduced to n / 2.

DMMs

For all-to-all routing on a DMM with P PUs there is one standard algorithm which is the best if the packets are sufficiently large. The algorithm consists of P - 1 steps. In step t, 1 <= t < P, PU_i routes its packet to PU_{(i + t) mod P}. Possibly there is even a step 0 in which the packets with destination in the same PU are handled. If all packets have size l, then the routing takes
T_one(P, l) = c_v * l * (P - 1) + c_a * (P - 1).
This is P - 1 times the time for routing a permutation, which is not surprising because the all-to-all routing is realized by decomposing it into permutations. Of course this works only if all the packets have (approximately) the same length. Otherwise a more sophisticated algorithm should be applied which is presented further down.

Even if all packets have the same size, then it is not necessarily true that sending the packets directly to their destinations is the best one can do. This may sound surprising but is due to the relatively large number of start-ups of the given algorithm: if l is small, then the term c_a * (P - 1) will dominate and it may be profitable to reduce it even if the first term increases because of this.

As an alternative consider the simulation of the algorithm for a two-dimensional mesh. All PUs are indexed as in a mesh with a pair of indices (i, j), 0 <= i, j < sqrt(P). The algorithm consists of two phases. In phase 1, all packets are rearranged within the "rows", in the second phase within the "columns". More precisely, in step t, 1 <= t < sqrt(P) of phase 1 PU_{i, j} routes to PU_{i, (j + t) mod sqrt(P)} all packets which have destination in any PU_{i', (j + t) mod sqrt(P)}, with 0 <= i' < sqrt(P). In step t, 1 <= t < sqrt(P) of phase 2 PU_{i, j} routes to PU_{(i + t) mod sqrt(P), j} all packets which have destination in any PU_{(i + t) mod sqrt(P), j'}, with 0 <= j' < sqrt(P). So, the algorithm consists of 2 * (sqrt(P) - 1) steps in each of which l * sqrt(P) data are routed. This gives

T_two(P, l) = c_v * 2 * l * (P - sqrt(P)) + c_a * 2 * (sqrt(P) - 1).
For large l this algorithm is almost twice as slow, but it may be up to sqrt(P) / 2 times faster for very small l.

Of course we do not have to stop here. If the packets are really small, then the algorithm for a three-dimensional mesh can be simulated, or in the extreme case, the whole routing can be performed with just log(P) routing steps by simulating the hypercube algorithm. Doing this, in each step half of the data a PU holds must be routed. This gives

T_log(P, l) = c_v * log(P) * l * P / 2 + c_a * log(P).

The algorithms with several phases also require that in between two phases the received packets are decomposed into their components and recomposed for the next routing. With appropriate bucketing all this can be performed in a time that is linear in the amount of involved data. Nevertheless, if c_i is not so much smaller than c_v, as it is the case on the best parallel computers, this gives a non-negligible delay. More generally, reducing the number of start-ups by multiplying the routing volume by a factor should be limited to subproblems involving only a small fraction of the total data set. Otherwise, considering that c_v > c_i, no reasonable speed-up can be expected, and it would be better to reduce the number of processors used in order to get packets of a more substantial size.

Broadcasting

Broadcasting on a full-port network is trivial: in any step the information is sent to all neighbors. It will reach all PUs in the minimum time, which is bounded by the diameter of the network. On single-port machines the problem is non-trivial.

Linear Arrays, Meshes and Hypercubes

If on a linear array with n PUs an information from PU_0 has to be broadcast, then the trivial algorithm, sending the algorithm in step i, 0 <= i < n - 1, from PU_i to PU_{i + 1}, is optimal, taking n - 1 steps. If the source is PU_j, j < n / 2, then the packet is first routed rightwards, then leftwards.

On a two-dimensional n x n mesh, the information is first routed within the row of the source PU and then within all columns. If the source is not PU_{0, 0}, then the routing in each of the phases should start in the direction in which the distance to cover is largest. The algorithm requires 2 * n - 2 steps at most, which is optimal.

On higher-dimensional meshes and hypercubes we can apply the dimension-ordered generalization of the above algorithm. On a hypercube this means that the routing is performed along an embedded binomial tree. On a k-dimensional hypercube the algorithm takes k steps, which is optimal.

DMMs

On a DMM with P PUs we assume that PU_0 is the source. There are several options. The most natural is to simulate the hypercube algorithm: the routing conists of log(P) rounds, in each round l data have to be transferred. This gives
T_hypercube(P, l) = c_v * l * log(P) + c_a * log P.

Surprisingly this is mostly not the best. If l is slightly larger, then v * l * log(P) is a large number. In that case it is better to use an algorithm which is similar to that for the linear array. The information is divided in a large number f of flits. Flit i, 0 <= i < f, is routed in step i from PU_0 to PU_1, and then on in a linear way. It takes f + P - 2 steps before flit f reaches PU_{P - 1}. This gives

T_linear(P, l) = c_v * l / f * (f + P - 2) + c_a * (f + P - 2).
For sufficiently large f, the first term converges to c_v * l, which is optimal. Of course this produces many more start-ups.

For intermediate values of l, there are intermediate algorithms. A simple idea is to use an algorithm similar to the one for a two-dimensional mesh: first broadcasting in a pipelined way within the rows, then within the columns.

Gossiping

Broadcasting is relatively cheap because there is only one packet. Gossiping is much more expensive. Therefore, in practical applications it should be performed only for very small sets of data.

Interconnection Networks

In the context of gossiping it is customary to use other cost models. The most considered model is to assume that in any round pairs of PUs can exchange data and that in a single round they can exchange all data they want to exchange. This is called the unit-cost telephone model. So, in any round the communicating PUs are given by a matching. Therefore, a gossiping algorithm is a specification of a sequence of matchings that eventually lead to all PUs knowing everything. Such a sequence is called a gossiping schedule.

Lemma: Gossiping in the unit-cost telephone model on a network with P PUs and diameter D requires at least max{D, round_up(log(P)) + odd(P)}. Here odd(x) = 1 for odd x and 0 otherwise.

Proof: To cover a distance d, at least d rounds are required. After t >= 0 rounds any information is known in at most 2^t PUs, because in any round the number of informed PUs can at most double. If the number of PUs is odd, then there is at least one PU which remains unmatched in the first round. So, the information from this PU is known in at most 2^{t - 1} PUs after t > 0 rounds. This means that for t <= round_up(log(P)) + odd(P) - 1, the number of informed PUs is at most 2^{round_up(log(P)) - 1} < P. End.

Gossiping on d-dimensional meshes is easy as long as all side lengths are even. In that case the gossiping can be performed in a dimension-ordered way. On an n x ... x n mesh for n even, this takes d * (n - 1) rounds, which is optimal. For odd n, it is always possible to perform the gossiping in d * n rounds, but this is not necessarily optimal. For example, on a 3 x 3 mesh gossiping can be performed in 5 rounds. This is optimal, because for a 3 x 3 mesh, the number of PUs P = 9, and thus round_up(log(P)) + odd(P) = 5.

Gossiping on 3 x 3 Mesh

For small networks gossiping schedules can be described very concisely and convincingly by a series of pictures. The nodes are subdivided according to the shape of the network, marking in each subdivision whether the information from the PU at the corresponding position is already known or not. The used matchings can be highlighted.

Gossiping on 3-Dimensional Butterfly

The unit-cost model is not the only possible cost model. More flexible is the linear-cost model in which it is assumed that a data exchange between two adjacent PUs in which each sends at most l informations takes c_v * l + c_a time. In principle it is not even necessary that all PUs remain connected equally long. However, the more freedom there is, the harder it becomes to design good algorithms. Therefore it is still useful to perform the gossiping by a sequence of matchings. The cost of a gossiping schedule is given by the number of rounds, the number of consecutive matchings used, and the number of steps, the sum over all rounds of the number of transferred packets.

We consider again the above schedule for gossiping on a 3 x 3 mesh. It consists of 5 rounds which was shown to be optimal. The number of steps is 1 + 2 + 4 + 4 + 7 = 18. This is far from optimal. It has been shown (masters thesis of Rene Beier) that 11 steps is the minimum achievable for any schedule with 5 rounds, and that in general at least 10 steps are required. Both bounds can be matched. We see here, and that is a general phenomenon, that there is a trade-off between rounds and steps. So, in general the best gossiping algorithm depends on the ratio c_a / c_v. Only for some very regular networks, such as linear arrays and hypercubes, the number of rounds and steps can be minimized at the same time. In that case there is a unique optimal gossiping schedule.

Gossiping on 3 x 3 Mesh, 5 rounds 11 steps

DMMs

On a DMM with P PUs, the most natural thing is again to apply a simulated hypercube algorithm, performing the gossiping with log P communication steps. If originally each PU holds l data, in step t, 0 <= t < log P, the size of the packets is l * 2^t. Summing over all steps gives
T_hypercube(P, l) = c_v * l * (P - 1) + c_a * log P.
This is optimal: the number of start-ups is minimized, and the routing volume equals the amount of data each PU must receive. For large P the start-up costs will become negligible.

Another communication operation is gather. In a gather each PU initially has l individual data. The task is to gather all this information in PU_0. This is not the inverse of a broadcast, because in a broadcast finally all PUs know the same information. Clearly this operation requires at least log P start-ups and because l * (P - 1) data must be transferred into PU_0. So, on a DMM gathering has the same lower bound as gossiping and can be solved optimally by applying the hypercube gossiping algorithm.

Using Randomization

Using randomization is a mighty instrument which in many contexts allows to simplify algorithms or even to achieve results which otherwise cannot be achieved. This also applies to the context of routing. In the following we will present quite a surprising application of randomization and show that on meshes the randomization can be substituted by a deterministic equivalent.

Hypercubes

We have seen that routing a permutation on a hypercube with n PUs can take Omega(sqrt(n)) time. This problem is not only of theoretical interest because the permutations in which the indices are permuted are concerned. These permutations are very common. There are several ways to overcome this problem. The simplest is the following idea of Valiant.

The packets are denoted p_i. s_i is the index of the source PU of p_i, d_i is its destination PU.

  1. For each p_i choose randomly and independently an intermediate destination x_i.
  2. Route all packets p_i from PU_{s_i} to PU_{x_i}.
  3. Route all packets p_i from PU_{x_i} to PU_{d_i}.

Notice that the mapping s_i -> x_i is not a permutation. Alternatively PU_0 might select the index of a permutation and broadcast it to all PUs in k steps, but this is a waste of time. The analysis of Valiants algorithm is not hard but beyond the scope of this chapter. Both routing phases have the same time distribution, because phase 2 can be viewed as a reversal of phase 1: in phase 2 the packets are moving from a random position to a fixed position. It can be shown that each of these phases takes k + o(k) time with high probability and that the queues remain small as well. So, the whole algorithm essentially 2 * k + o(k) = O(k) steps with high probability. Using randomization all permutations become equally hard, there are no longer good and bad permutations. The identity has the same expected time as a transposition.

A disadvantage of the above approach is that we loose a factor two for all those permutations for which even the greedy algorithm performed well: the expected time of the greedy algorithm is k + o(k), which is optimal up to the lower order term. A nice idea helps to make the randomized algorithm optimal while still making it highly unlikely that much more than the expected time is required. The algorithm is the same as before, with one small change. After selecting x_i, the length of the path s_i -> x_i -> d_i is computed. If this path is longer than k, we use comp(x_i) as intermediate destination instead of x_i. Here comp() is the operation which in a binary number complements all bits. The maximum distance any packet has to travel is now bounded by k: if dist(s_i, x_i) + dist(x_i, d_i) > k, then dist(s_i, comp(x_i)) + dist(comp(x_i), d_i) = k - dist(s_i, x_i) + k - dist(x_i, d_i) = 2 * k - dist(s_i, x_i) + dist(x_i, d_i) < k. Vöcking has shown that not withstanding the reduced randomness of the intermediate destination, the delay due to congestion is still limited to o(k) with high probability.

DMMs

On DMMs the standard routing algorithms above were all given under the assumption that the packets were of comparable size. It is no problem if there are small variations but if the sizes differ by a factor two or more, then we may expect that the routing takes considerable longer than estimated. Because the routing steps are not synchronized it is not easy to tell how long the routing will take exactly.

Consider an all-to-all routing. Assume that in total each PU is sending and receiving the same number of packets, but that not each PU is sending the same number of packets to each other PU. Then we can first send each packet to a random intermediate destination like in the hypercube algorithm. This replaces a possibly unbalanced pattern by two balanced all-to-all routing patterns. A refinement is to first determine the degree of unbalance and to randomize only as much as necessary.

Meshes

Consider a k-k routing on meshes. A good strategy is to choose for each packet a random intermediate destination and to route the packets to their intermediate destinations before routing them to their actual destinations. The whole algorithm consists of four routing phase: in phase 1 and 3, the packets are routed along the rows, in phase 2 and 4 the packets are routed along the columns.

For the analysis it is necessary to understand how much time the routing on the one-dimensional subarrays takes. Without a proof we give the central lemma in this domain:

Lemma: For a routing pattern on a linear array with n PUs, let r_{i, j}, 0 <= i < j < n, denote the number of packets that must travel from a PU with index i or smaller to a PU with index j or larger. When applying the farthest-first strategy, the rightwards moving packets need exactly max_{i < j} {r_{i, j} + j - i - 1} steps to all reach their destinations.

For most distributions the maximum is either assumed for i = 0, j = n - 1, or for i = n / 2 - 1, j = n / 2. The maximum time for a regular pattern like a k-k routing is always given by the maximum of these bounds. This simplifies the analysis a lot.

In the above case we are routing random k-k patterns. The maximum distance is n - 1, and in a randomization about half of the packets want to cross the bisection. Thus, each of the phases takes max{n, k * n / 4} + o(k * n). This is not yet good, but we can exploit that the given routing is uni-axial. As for the all-to-all algorithm above, two of these algorithms can be overlaid. To make this effective, the packets must somehow be split in two. In a randomized context it is natural to use randomization for this as well. So, each packet is colored independently white or black with probability 1 / 2. The white packets are routed xy, the black packets yx. In this way we do not exactly obtain two (k / 2)-(k / 2) patterns, but that does not matter. The important point is that now in each routing phase, the expected number of packets crossing the bisection of a one-dimensional subarray is k * n / 8. Using Chernoff bounds, it may be shown that all deviations from the expected value are small with high probability.

Theorem: On a two-dimensional n x n mesh, k-k routing can be performed in max{4 * n, k * n / 2} + o(k * n) steps with high probability.

The above result is optimal for all k >= 8, because there the k * n / 2 bisection lower bound is matched up to lower-order terms. For k = 1, 2 k-k routing can be performed in 2 * n + o(n) time. This is optimal because it matches the distance bound, the lower bound which is obtained by considering how far the packets have to travel. For 2 < k < 8 it is not known how many steps are exactly needed for k-k routing. For example, 4-4 routing can be performed in 3 * n + o(n) steps, but it is not clear whether this is optimal or not.

Derandomization

Randomization is a handy means of obtaining good results, but there are practical and theoretical reasons to prefer deterministic algorithms if they can achieve the same bounds. In this case the only purpose of using randomization is to obtain a regular pattern. However, this can also be achieved by handing out the packets in a regular way. We first consider how to turn this idea into an algorithm for all-to-all routing on a DMM.

Assume we are performing an all-to-all routing on a DMM with P PUs, each PU sending and receiving m numbers in total, l = m / P packets traveling between any pair of PUs on average. Each PU sorts the numbers according to the index of their destination PUs. Using bucket sort this takes O(m) time. A number residing in PU_i with rang r is sent to PU_{(i + r) mod P} as intermediate destination in a first routing phase and in a second routing phase to its destination. This is a very simple algorithm. Creating the packets to be sent to each PU may be even faster than in the randomized algorithm because the calls to the random number generator are saved.

How good is the resulting distribution? Clearly in the first routing phase the packets are almost perfectly balanced, the number of packets going from one PU to another is either round_down(m / P) or round_up(m / P). In phase 2 the imbalance is larger. Denote by a_{i, j} the number of packets originating in PU_i with destination in PU_j and by b_{i, k, j} the number of packets which are send from PU_i via PU_k to PU_j. For arbitrary k and j we want to put an upper bound on sum_i b_{i, k, j}. The regular distribution gives b_{i, k, j} <= round_up(a_{i, j} / P). This gives

sum_i b_{i, k, j} <= sum_i round_up(a_{i, j} / P) <= P + sum_i a_{i, j} / P = P + m / P.
So, if m = omega(P^2), the imbalance is negligible. For the small values of P that currently are common, this is no problem, but in general this is quite a strong condition.

On n x n meshes any operation in submeshes of size o(n) takes only o(k * n) time and is therefore cheap in comparison to the Theta(k * n) time for solving most problems. Here k denotes the number of packets per PU. So, on meshes we can apply the same idea, but the allocation of intermediate destinations may be coordinated in submeshes of quite considerable size. Thereby the rounding errors can be made of minor importance even for small k. As a result, the same result that was achieved above can also be obtained deterministically.

The idea of coordinating the rounding decision also works on DMMs. Due to rounding errors, each of the rounds of phase 2 might be delayed by c_v * P, in total c_v * P^2. If first all PUs are subdivided in groups of size sqrt(P), these PUs can exchange their values b_{i, j}. These are small gossip operations, costing c_v * P^{3 / 2} + c_a * log(P) / 2 time. Then they can optimize their rounding, reducing the loss in phase 2 to c_v * P^{3 / 2}. So, now we must have m = omega(P^{3 / 2}). This is much better, but still not as good as for the randomized algorithm, for which the rounding errors are negligible for any m = omega(P * log P). Of course, in practice it is quite unlikely that the deterministic algorithm has maximal bad luck. Even though we might not assume that all a_{i, j} are equal, it is quite reasonable to assume that the rounding errors are more or less randomly distributed.

Offline Routing

In algorithmics there is a fundamental distinction between online and offline problems. In the context of routing an algorithm is said to be online if the destinations of the packets only become known at the beginning of the routing. The cost of precomputation, if any, has to be booked on the routing problem to be solved. In an offline routing algorithm, it is assumed that we are dealing with a fixed routing pattern which has to be performed many times. In this case the cost of precomputation is not considered, and the only concern is the quality of the produced routing schedule. Particularly, the precomputation is not necessarily performed on the parallel computer on which the routing must be performed.

Two-Dimensional Meshes

For two-dimensional n x n meshes there is a ingenious algorithm achieving something none of the online algorithms can achieve: the queue size is only one. That is, at all times, any PU is storing in addition to the packets moving by only a single packet. The algorithm consists of three one-dimensional routing phases, each being a permutation of the n packets standing in the rows or columns. It is a simplification of the more general algorithm for product networks in the next section which is due to Annexstein and Baumschlag.

First we construct a bipartite graph with 2 * n nodes. The sets V_1 and V_2 each have n nodes, corresponding to the source and destination rows of the packets, respectively. There is an edge from node i in V_1 to node j in V_2 if there is a packet moving from row i to row j. Because there are exactly n packets starting and finishing in each row, the graph is n-regular (all nodes have degree n). Thus, according to Hall's theorem, the graph has an edge coloring with n colors. Because this is an online algorithm, the existence of such a coloring is the only thing that matters, but actually such colorings can be computed very efficiently. That is, there is an assignment of values, which are called colors, ranging from 0 to n - 1, so that the edges incident on a node all have different colors.

These colors are used in the routing algorithm. Denote the packets by p_i and let x_i be the color attributed to the edge corresponding to this packet.

  1. Route each packet p_i along the rows to column x_i.
  2. Route each packet p_i along the column to its destination row.
  3. Route each packet p_i along the row to its destination PU.

It remains to check that the algorithm is as good as claimed above. For the analysis of phase 1 we consider the routing in an arbitrary row. Originally each PU holds one packet. Because the edges corresponding to these packets are all incident on the same node of the bipartite graph, they all have different colors, and therefore no two packets will be routed to the same PU. For the routing in phase 2, we consider the routing in an arbitrary column. Because of the previous result, initially each PU holds one packet. If two packets in the column would move to the same row, then two packets with destination in the same row would have gotten the same color, which is in contradiction with the coloring which colors all edges incident on the node in V_2 corresponding to this row differently. Therefore, at the beginning of phase 3, all PUs hold one packet. In each row each PU is the destination of one packet because the complete pattern is a pattern, and in phase 3 the packets are routed to their destinations. So, the algorithm consists of three permutations within one-dimensional subarrays. Therefore, each phase can be executed in n - 1 steps, 3 * n - 3 steps in total, and never a PU is storing more than one packet.

The queue size is optimal, the time is not. It is still open whether it is possible to route all permutations in 2 * n - 2 steps with queue size 1. A few steps more and queue size 2 is possible.

Offline Permutation Routing with Queue Size 1

Product Networks

The above idea can be generalized for any product network A = B x C, where "x" denotes the direct product of graphs: the nodes of A are pairs (b, c), where b is a node in B and c a node in C. There is an edge from (b_1, c_1) to (b_2, c_2) iff b_1 = b_2 and (c_1, c_2) is an edge in C, or c_1 = c_2 and (b_1, b_2) is an edge in B. A two-dimensional n_1 x n_2 mesh is the direct product of a linear array with n_1 and a linear array with n_2 nodes. Tori and hypercubes are also direct products.

Let A = B x C be a product network. B has n_B nodes and C has n_C nodes. A schedule for routing a permutation on A can be computed offline by first constructing a bipartite graph with node sets V_1 and V_2, each with n_C nodes. Node c, 0 <= c < n_C, in V_1 and V_2 corresponds to the subnetwork B x c in A. The bipartite graph has n_A = n_B * n_C edges. For a packet traveling in A from a node (b_1, c_1) to a node (b_2, c_2) there is an edge from node c_1 in V_1 to node c_2 in V_2. Thus, the bipartite graph is n_B-regular and can be colored with n_B colors. Packet i is denoted p_i, it is traveling from (b_i, c_i) to (b'_i, c'_i), and its color is x_i, 0 <= x_i < n_B. The routing is performed as follows:

  1. Route p_i from PU_{b_i, c_i} to PU_{x_i, c_i}.
  2. Route p_i from PU_{x_i, c_i} to PU_{x_i, c'_i}.
  3. Route p_i from PU_{x_i, c'_i} to PU_{b'_i, c'_i}.

Theorem: Permutation routing on a product network A = B * C can be performed in T_A = 2 * T_B + T_C steps, where T_B and T_C denote the number of steps for routing permutations on B and C, respectively. At any time a PU is storing at most one packet.

Proof: Phase 1 is a permutation routing in subnetworks with the structure of B, because for any subnetwork B x c, all destinations occur exactly once due to the coloring of the edges incident on the node in V_1 corresponding to this subnetwork. The coloring of the edges incident on the nodes of V_2 assures that phase 2 is a permutation routing in a network with the structure of C. Phase 3 is again a permutation routing within subnetworks with the structure of B because at the beginning each PU holds one packet, and because the whole pattern is a permutation, each PU is also the destination of one packet. End.

If T_B != T_C, then it is profitable to arrange things so that T_B < T_C. The above construction immediately gives results for d-dimensional meshes. The best is to perceive a d-dimensional n x ... x n mesh as the direct product n x (n x ... x n). Using induction this leads to

Corollary: Permutation routing on a d-dimensional n x ... x n mesh can be performed in (2 * d - 1) * n steps. At any time a PU is storing at most one packet.

DMMs

Above we have studied all-to-all routing on DMMs with P PUs. Let m = P * l be the total number of packets each PU has to send and receive. The greedy algorithm can perform very poorly and is acceptable only if we reasonably can assume that the distribution is balanced. The random and deterministic algorithms which send the packets through an intermediate destination overcome the problem of unbalances to a large extend, but are not perfect either. First solving a coloring problem for a bipartite graph the problem can be overcome entirely: the routing consists of two phases in which each PU sends exactly l packets to each of the other PUs.

The algorithm is simple: a bipartite graph with two sets of P nodes is created. For a packet moving from PU_i to PU_j there is an edge from node i in V_1 to node j in V_2. This gives an m regular graph which can be colored with m colors. Each PU first routes all packets with color x to PU_{x mod P}. In the second routing phase the packets are routed to their destinations. Because of the coloring property for the edges incident on the nodes of V_1, in the first round each PU sends exactly l packets to each PU. Because of the coloring property for the edges incident on the nodes of V_2 each PU receives one packet from each color. These stand perfectly distributed after the first routing.

Towards Online Routing

Offline routing is fine for fixed patterns, but patterns which become known only at runtime are more common. Just as it is often possible to turn randomized algorithms into deterministic algorithms with essentially the same performance, so is it often also possible to turn offline algorithms into online algorithms. Both conversions are of great importance, because they allow for a separation of concerns. Knowing that such conversions exist, one may when designing an algorithm assume some smooth distribution as it is achieved by a randomization, or one may first consider what might be achieved offline.

Clearly for a problem like permutation routing, which is not particularly hard, it appears outrageous to first solve a bipartite graph-coloring problem of a graph with one edge per packet. This problem is far more complicated then the routing itself. However, we can solve a strongly reduced problem in o(n) time, while still obtaining almost as good performance as with the offline algorithm. This idea of reducing the information content of a problem is called sparsification and is useful beyond the scope of turning offline algorithms into online algorithms.

The approach is sketched for a two-dimensional n x n mesh. The mesh is divided in m x m submeshes. m = o(n) is quite large, one should think of m = n^{1 - eps}, for some suitable eps > 0. The packets going to each of these n^{2 * eps} submeshes are counted in each submesh. The bipartite graph does not have one node for each row, but one node for each of the n^eps row bundles. A further idea, which helps to reduce the size of the graph is to not create an edge for each edge but only for each superpacket. A superpacket consists of a large number s = n^alpha packets with common destination. Because we cannot assume that s exactly divides the number of packets traveling between a pair of row bundles, in general there will be one partially filled superpacket for each destination pair. So, in each submesh the graph to color has 2 * n^eps nodes and n^{2 - 2 * eps - alpha} + n^{2 * eps}. Choosing the parameters right, we can achieve the following:

After solving the localized and sparsified coloring problems, the routing cannot be performed so simply as it was done before. Each routing phases must be preceded by a rearrangement in the submeshes, so that the packets stand well-distributed. For example, preceding a horizontal routing phase, the packets must be arranged so that in each row of a submesh the number of packets going to any other submesh is about equal. When reaching a submesh, the packets are not routed to a specific PU, but rather to the first PU in the submesh which is still storing fewer than the maximum allowed number of packets. Together these ideas assure small queues and good routing times. Because the submeshes have size o(n), the cost of each operation is a lower-order term of the total routing time, and because there are only constantly many routing phases, their cost is negligible altogether.

Reconfigurable Arrays

Model

So far we have only considered parallel computers which are connected by a fixed interconnection network. This is the most natural model and such parallel computers are almost the only ones that exist. But, consider again the complete interconnection network realized by a crossbar switch. This can also be viewed as a set of P PUs which are dynamically interconnected. Said otherwise, the technology for making dynamic interconnections is available and apparently the price for this technology is reasonable and the speed at which the connections can be made is acceptable.

The entire system consists of a number of PUs with conventional powerful arithmetic and storage and a number of switches. The switches are slightly more powerful than in a crossbar: they listen to the signals passing on the buses they are connected to and take out whatever is destined for them. A small number of values can be stored and the switches can be set conditionally as a result of a comparison between stored values. The whole system is assumed to be synchronized and the difference between receiving a signal or not receiving a signal can be detected.

Typically the switches are arranged in a rectangular grid, but in principle they might also be arranged differently. The PUs are arranged on the outside. The switches normally have four ports, denoted W, N, E and S. Sometimes it is assumed that these ports can be connected dynamically in any desirable configuration, sometimes the configurations are limited to a subset thereof. In the presented examples we will not need anything fancy. A switch setting is denoted by indicating which ports are connected. For example, WN-SE indicates the setting in which W is connected with N and S with E.

It is important to notice that the switch setting decomposes the grid into a number of subnets. Within these nets there is broadcasting possibility: as long as it is assured that only one of the connected PUs is sending, all connected PUs and switches will receive the information transmitted in the net. A system as described is called a reconfigurable processor array

One may argue that reconfigurable networks are physically not realistical. There are two major assumptions which give reason for skepticism:

The first point is a true limitation: speed of light is finite and the size of the reconfigurable network certainly increases with the number of switches. This is a limitation, but exactly the same assumption is made in all hypercube, butterfly and PRAM algorithms. Two-dimensional meshes (and to a lesser extend three-dimensional meshes) are the only interconnection networks that can be realized in physical space without scaling problems. In a two-dimensional lay-out of a hypercube with n PUs there are wires of length sqrt(n).

It is harder to estimate the importance of the second point. Radio exists, and the signal does not become noticeably weaker if the number of listeners increases. Even wire-bound broadcasting exists in the form of cable television. So, technologically, this is no problem either. Possibly the signal must be somewhat stronger than for a point-to-point transmission. Also one can imagine that especially branchings weaken a signal. In most algorithms for reconfigurable arrays, these can easily be avoided by performing the broadcasting in two steps.

Power of Reconfiguration

Parity is the problem of determining the parity of the sum of n bits. On a CRCW PRAM this problem requires Omega(log n / loglog n) time. Even taking exponentially many PUs does not help. Parity is important because it is a simplification of many other more useful problems. Surprisingly, parity can be solved on a reconfigurable array in constant time. The amount of used hardware is O(n). Hence, the algorithm is not only fast, but even efficient.

The algorithm requires n + 3 PUs and 6 * n switches arranged in an 3 x 2 * n grid. In a first round PU_i informs the six switches in the columns 2 * i and 2 * i + 1 about its value. After receiving this value, the switches set their connections in preparation for the next round. If a 0 is received, a switch is set to WE-NS. If a 1 is received the switch must be set WN-SE if it is found in column 2 * i. If the switch is in column 2 * i + 1, then it is set to WS-NE, with exception of the switch in row 1, which is set to WE-NS. Then in the second round, a signal is entered in the net from the lower-left side. If it comes out at the right side in the same row, then the parity is even, if it comes out one row higher, the parity is odd. The PU on the right side receiving the signal can either output the conclusion, or broadcast it to all other PUs.

Computing Parity of 1 + 1 + 0 + 1 + 0 + 1

Constant Time Sorting

Parity is not the only problem that can be solved in constant time on reconfigurable arrays. Actually, there are many interesting problems which can be solved in constant time. However, mostly these algorithms are far from efficient, requiring considerably more switches than the sequential time. As an example we show here how to sort n numbers using O(n^2) PUs and O(n^3) switches in O(1) time. In the chapter on sorting we will see how the number of switches can be reduced to O(n^2). This is optimal within this context, because routing a permutation of n elements in constant time requires an n x n array.

The input is a set of n numbers a_i, number i originally residing in PU_i. By taking also the index into account, we may assume that all values are different. The algorithm consists of two somewhat independent parts. In the first part, for each a_i its rank r_i within the set of numbers is computed. In the second part, a_i is routed to PU_{r_i}. Each of these operations can be performed in a constant number of steps. For each i there is a dedicated n x n grid of switches. The switches in column j of this network are set depending on a comparison of a_i with a_j similar to the construction in the parity algorithm: if a_j < a_i, a signal which is inserted at the lower-left corner is guided one row upwards. The signal finally leaves the network on the right side in row r_i of the subnetwork. Even when using only the three different settings which were used before, the whole algorithm can be implemented with just 6 steps.

Sorting in Constant Time

Exercises

  1. Show with a picture how to embed a linear array in a two-dimensional mesh. Generalize this embedding for d-dimensional meshes. Using the embedding as the basis for an indexing scheme, specify exactly the index of PU_{i_1, ..., i_d} in an n_1 x ... x n_d mesh.

  2. Show with a picture how to embed a circular array in a two-dimensional n_1 x n_2 mesh with n_1 even. Generalize this embedding for d-dimensional meshes. Considering the side lengths n_1, ..., n_d, which condition must they satisfy in order that this embedding is possible?

  3. The main importance of embeddings of linear arrays stems from the fact that they lead to indexings schemes in which PUs with consecutive indices are adjacent. More generally, we would like that any pair of PUs whose indices do not differ much are close to each other in the network. Construct an indexing scheme for a mesh so that the distance in the network between PU_i and PU_j is bounded by O(sqrt(|i - j|)).

  4. Assume each PU of a circular array with n PUs holds k packets that must be routed so that each PU receives at most k packets in total. This routing pattern is called k-k routing. Specify lower bounds on the routing time in terms of n and k. Distinguish several cases. Consider the routing in a greedy way: each packet is routed to its destination along the shortest path, applying farthest-first priority strategy. How long may the routing take? For which values of k is this algorithm optimal? Present another algorithm which is optimal for all k >= 4.

  5. Draw two- and three-dimensional CCC networks and determine their diameters.

  6. Give Cayley graphs corresponding to circular arrays, two-dimensional tori and CCC networks.

  7. The k-dimensional pancake graph is a Cayley graph defined over the permutation group S_k which has as elements the k! permutations of the elements 0, 1, ..., k - 1. The product in this group is given by the composition. The set of generators S = {swap_i| 2 <= i <= n}, where swap_i is the permutation ((i - 1) (i - 2) ... 0 i (i + 1) ... (n - 1)). So, swap_i reverses the order of the first i elements.

  8. The greedy routing algorithm on two-dimensional meshes has maximum queue size if the two routing phases are separated. If the phases are coallesced, then this queue size becomes smaller. Show that coallescing does not really help by providing a permutation for which the greedy algorithm with coallesced phases still has queue size Omega(n). The worst achievable queue size is 2 * n / 3.

  9. Give an algorithm for broadcasting on a k-dimensional butterfly network running in 2 * k + O(1) communication steps. Specify the exact number of steps used by your algorithm.

  10. Consider gossiping on a circular array with n PUs in the unit-cost telephone model. Give algorithms and specify the number of steps. Distinguish the cases that n is even and n is odd.

  11. We consider gossiping in the unit-cost telephone model on n_1 x n_2 meshes for which at least one of the sides has odd length.

  12. Show that gossiping in the linear-cost model on a linear array with P PUs requires at least 2 * P - 3 steps.

  13. Show how to gossip on a 3 x 3 mesh in the linear-cost model using only 10 steps.

  14. Give an algorithm for gossiping on a k-dimensional CCC network running in 5 / 2 * k + O(1) communication steps of the unit-cost telephone model. Specify the exact number of steps used by your algorithm.

  15. Give an algorithm for gossiping on a k-dimensional butterfly network running in 5 / 2 * k + O(1) communication steps of the uni-cost telephone model. Specify the exact number of steps used by your algorithm.

  16. In the text it was demonstrated that in the DMM model gathering has the same complexity as gossiping. Give an example of an interconnection network on which gathering is cheaper than gossiping.

  17. A torus can be embedded in a mesh and a circular array in a linear array. These embeddings imply that any torus or circular-array algorithm requiring t steps can be run on a mesh or linear array, respectively, in 2 * t steps. An interesting question is whether the reverse holds: can any problem be solved twice as fast on a torus / circular array as it can be solved on a mesh / linear array? This exercise provides a simple and convincing counter example: online 2-2 routing can be solved trivially performed in n steps on a linear array, while 2-2 routing on a circular array may require at least 2 / 3 * n steps, even when all source-destination pairs are known beforehand and the routing may be planned offline.

    In the section on "Using Randomization" a lemma was given specifying exactly the time for routing on a linear array with n PUs when using the farthest-first strategy.