To meet both kinds of demands, future and current, it is natural to consider using more than one processor for solving a single task. Because processors not only become faster but also cheaper this becomes more and more realistic. As soon as there is a demand for it, the production of multi-processor chips will start. Right now the newest processors are still expensive, but processors of the previous generation are cheap: If one strives for the maximum number of clock cycles per money unit, then clearly one should not buy the top model, but rather four with half the speed. Using P processors instead of 1, we may hope to become P times faster, but this is rarely achieved. Therefore, it is not sure that four 2 GHz processors will solve a problem faster than a single 4 GHz processor.
If we know how to exploit the computational power of a system with several processors, such a system is financially attractive and technologically possible.
Definition: A computer with several processors is called a parallel computer. Parallel computing is the (study of) the usage of parallel computers in order to solve tasks faster.
Even though technologically different, this kind of parallelism can be characterized as being task parallelism. The most important theoretical question in this context is how to schedule the tasks, that is, how to allocate the tasks to the processors so that an even load balancing is achieved. The load balancing problem has several variants. For a fixed set of tasks, it is a common goal to minimize the make span, the time to complete all tasks, respecting interdependencies if any. In an online context, the goal is more reasonable to try to minimize the completion time, the time before any job has been completed, or to minimize the time before any job starts being done.
In practical applications until now task parallelism is by far the most important branch of parallel computing. However, from an algorithmic point of view it is not very interesting because there is not so much one can do. Therefore, this is not the topic of this lecture. Only here we will briefly discuss the load-balancing problem.
If the lengths of the tasks are known in advance, then the problem of how to schedule a set of tasks so that the make-span is minimized is NP-hard. A reasonable heuristic is to sort the tasks according to their lengths and to allocate in this order to the processor which until then has the least work to do.
If the lengths of the tasks are not known, then the best one can do is to always allocate a task to the processor which currently has the shortest queue. This is the normal scheduling principle applied in supermarkets. In this context there is the issue of task migration: at certain costs it may be possible for tasks to be reallocated in order to achieve a more even work load.
A problem with this approach is that it requires that the lengths of all queues are centrally known: in a huge supermarket with 30 cash-sites, one will typically not check all of them, but simply take the locally shortest queue. If the central allocating agent does not know the loads, then the best he can do is to perform the allocation in a random way. Queuing theory tells that this is nevertheless quite good: if a processor on average gets allocated k tasks, the deviation from this expected value is bounded by O(sqrt(k * P * log(k * P))). Much better balance is achieved if the allocator always first asks the queue lengths of two randomly selected processors and then allocates the next task to the one with the shorter queue.
Perfect balance is achieved if the scheduling is postponed: a processor fetches a next available task as soon as it is ready with the previous one. This is the scheduling principle applied in modern post offices. The disadvantage is clear: the waiting tasks have to be stored in a central storage and cannot be handed out immediately. Furthermore, the processor will be idle while sending the request for a next task and waiting for this task to arrive.
For an analogy consider a ship wharf. In earlier days work was done in a very labour intensive way. In the rich countries, using better equipment, the number of working hours for building a ship has been reduced considerably. Productivity is not doubling every 18 months, but even in this domain there is an exponential increase, justifying the noticeable salary increases of the past 100 years. Nevertheless, it still takes many thousands of working hours to build a supertanker. Unfortunately the company which ordered the ship normally does not want to wait 20 years. The solution is well-known: a wharf building a supertanker typically employs thousands of workers. Of course having many workers around, there must also be people not actually building the ship but leading the work, managing the company and cleaning the buildings. It also takes time to get instructions from the management down to the people on the work floor, and similarly there may be a need for workers to communicate with the management (paint is running out!) or among each other (we are ready with our section of the ship!). Even worse, the process cannot be speed-up arbitrarily, because there is a fixed order of steps, which cannot be rearranged arbitrarily. Some of these steps require many more workers than others, and it thus becomes inevitable that during part of the process some workers are idle. It may even happen that due to bad organization people are standing around though there is work they could do. Or people are working, but because they did not receive precise instructions, they are drawing cables which later have to be removed again, or which were already installed by someone else. Finally, given that one wants to minimize the construction time, the chief engineer may decide to reschedule the work. For example, it is clearly practical to not start to paint the ship before it is completely welded together, but maybe the painters can already start painting where ever something gets ready. They waste time while climbing the scaffoldings many times and cleaning there equipment.
Let us summarize the aspects which lead to the fact that P workers will not build a ship P times faster than a single worker (provided he has all the skills and is able to physically do the job):
All these aspects arise also if a single problem has to be solved by a set of P processors. Communication and how to minimize the cost of it is a central issue on computers with a distributed memory, but will be abstracted away in most of this lecture. The scheduling is an important problem in general, but we will design the algorithms so, that in most cases it is trivial how to schedule the arising subtasks. Useless work can normally be prevented, double work may be reasonable in some contexts, but will not play a big role in this lecture. The central topic of this lecture is the design of algorithms which allow to use as many processors as possible, while loosing at most a constant factor in efficiency. So, we accept that a parallel algorithm is more work-intensive by a constant factor, but normally not more than that. This does not mean that we will only look at such problems, we will see how to determine the maximum of n numbers in O(1) time using O(n^2) processors, but this is the focus. Quite often we will first only consider the time: given arbitrarily many processors, how fast can the problem be solved. Then we will try to reduce the number of processors employed so that the algorithm becomes optimal in the above sense.
The cost C(P, n) of a parallel algorithm solving a problem of size n on a computer with P PUs is defined as C(P, n) = P * T(P, n). The cost is to be distinguished from the work W(n, P) which equals the sum over all PUs of the number of performed operations. The difference is that the work does not take idle time into account.
The speed-up S(P, n) gives the number of times the parallel program is faster than the sequential program. So, S(P, n) = T(1, n) / T(P, n), for some definition of the sequential time T(1, n). For T(1, n) there are several possibilities, which are all in use:
The efficiency E(P, n) of an algorithm is related to the speed-up: it is given by E(P, n) = S(P, n) / P = T(1, n) / (P * T(P, n)) = C(1, n) / C(P, n). So, it is the ratio of the costs, whereas the speed-up is the ratio of the times. A parallel algorithm is called optimal if E(P, n) = Omega(1), that is, the cost of the parallel algorithm is at most a constant factor larger than that of the best sequential algorithm. A study of the efficiency is especially important in practical contexts: Often the efficiency starts to decrease quite strongly from a certain P value on. The typical reason for this is that when P becomes too large for the value of n, the communication costs (which often increase with P^2) become dominating, and any further increase of P will hardly lead to a decrease of the overall time. Particularly on a multi-user system with P = 32 one should not run an application with an efficiency of under 10%.
As an example we consider sorting 10^10 integers, requiring 40 GB of storage. On a single PU, having only 1 GB of main memory, this will cost about 12 hours. This time is estimated as follows: 40 GB must be read and written twice, a typical transfer rate between main memory and hard disc is 8 MB. So just for the data transfer we need 160 * 10^9 / (8 * 10^9) = 20,000 seconds. If we have 64 of these PUs interconnected by a switch giving 2 Gbit point-to-point connection the time becomes much better. We must assume that originally the data stand distributed over the PUs. After sending around a sample of negligible size, all data must be routed over the network once before each PU has to sort 156 * 10^6 integers. The routing takes a few seconds: each PU holds data with a total size of 5 * 10^9 bits. 2 * 10^9 bits can be routed per second. The sorting takes less than 100 seconds on a 3 GHz processor. So, we achieve a speed-up of 200 with 64 PUs.
The described phenomenon of solving a problem more than P times faster using P PUs is called super-linear speed-up. It is a true phenomenon, and it is practically important, but it does not really tell something about the quality of the algorithm. Clearly, very bad parallel algorithms will not achieve any speed-up, not even in an extreme context. But, comparing with an unreasonably slow sequential algorithm does not provide a fair basis for evaluation: if in the above example the joint 64 GB of main memory would be available to a single PU, then this PU could have performed the sorting in about 50 times the parallel time. In that case the efficiency of the parallel sorting algorithm would have lain around 0.90 and not around 5.0.
Too-good-to-be-true speed-ups are obtained by either comparing with sub-optimal sequential algorithms, or by comparing with sequential algorithms which are slowed down by slow access to a large data set.
In parallel computing the situation is different: there are several cost models none of them dominating. The reason for this multitude is that the simplest of the models is felt to model reality too poorly to be acceptable. The more precise models are considered to be too hard to design algorithms for and in part also considered not to be of long lasting validity. Another reason for the multitude of models is also that the underlying hardware shows much more variability than the platform on which sequential computation is founded.
The most fundamental distinction in parallel computation is on how the memory is accessed: whether there is a shared memory to which all PUs have access in constant time, or whether there is a distributed memory so that each PU has direct access only to its own local memory. In the first case data can be exchanged by "parking" some information at some fixed position making it accessible to the other PUs. In the second case the data are exchanged by some form of communication between the PUs.
Another important distinction is whether the PUs are working synchronously or asynchronously. In a synchronous model, all PUs have access to a central clock. In practice systems are mostly asynchronous: each PU computes at its own speed. The speed differences may be small, but in a parallel computer not necessarily all PUs are of the same type. In principle the speed differences can be arbitrarily large. In practice, considering that there is mostly a non-negligible increase of the overhead with P, it normally does not make sense to combine PUs which are too different. Only for trivial problems, such as searching for extra-terrestrial life in radio signals or finding very large prime numbers, it is profitable to add 100 MHz PUs to a system that also contains 3 GHz PUs.
Historically parallel computers have been distinguished in SIMD, Single Instruction Multiple Data, and MIMD, Multiple Instruction Multiple Data, machines. On an SIMD machine, which is synchronous, all PUs execute the same instruction at all times. Because they execute these instructions on different data, it is still possible to obtain parallelism. The idea behind SIMD machines, is that if all PUs execute the same instruction, there is no need for the PUs to generate these instructions themselves: there is a central host computer connected to all PUs which tells them what instruction to execute next. This implies that the PUs can be much simpler.
In the 1980s there were many parallel computers of this type. An impressive example was the Masspar. This parallel computer consisted of a grid of very simple PUs. The PUs had primitive 4-bit arithmetic, a few kilobytes of storage and were slow. But, they were cheap. This implied that research institutes could afford to buy Masspar machines with 65536 PUs in a 256 x 256 grid. More recently no parallel computer with a comparable number of PUs has been build.
In an MIMD machine all PUs in principle may have their own program. In practice all will execute the same program, but due to conditional instructions, they will not always execute the same instructions. All modern parallel computers, in which the CPUs are mostly conventional PC CPUs, are of this type. Because most parallel algorithms become less efficient if the number P of PUs increases, it is more cost effective to buy 32 expensive but fast PUs than 1024 cheap but slow ones. Only for specific numerical applications architectures like the Masspar might still make sense.
In the following we briefly discuss the most important models. In the last section of this chapter we consider several problems from various perspectives. In the following chapter we present some basic algorithms for interconnection networks. Thereafter, we will only consider the simplest and most theoretical model, which allows to concentrate most strongly on the underlying parallelization ideas.
It is assumed that the cost of communication is linear in the amount of data sent or received plus some constant cost. So, under this assumption the time consumption of a computation is estimated as follows:
T_tot = c_i * t_i + c_v * t_v + c_a * t_a.Here c_i, c_v and c_a are constants whose values depend on the state of technological development. c_a is known under the name start-up latency. t_i gives the number of internal operations; t_v gives the total amount of data sent by a PU, this we will call the routing volume; and t_a gives the total number of packets sent by a PU. Typically a computation is divided in steps, and it would be more correct to define the above values as the sums over all steps of the maxima over all PUs of the number of each type of operations. This model will be referred to as DMM model, where DMM is an abbreviation for Distributed-Memory Machine.
The technological justification for the DMM model is that in modern parallel computers the communication is performed by wormhole routing. This means that if PU_i wants to send something to PU_j, a path from PU_i to PU_j is cleared, and then the data are transferred along this path flit by flit. a flit is a small data unit. The idea is that these flits run through the network like the segments of a worm: first all of them are standing in PU_i, then the first flit enters the network. By the time the first flit reaches PU_j several other flits, depending on the number of hops from PU_i to PU_j, have entered it. In PU_j the flits pile up and are recomposed. In this way the transfer time is something like d + l, where d gives the number of hops from PU_i to PU_j and l gives the number of flits. The time for the d hops is mostly negligible, much more important being the almost fixed start-up time.
Currently c_i ~= 0.3 * 10^-9. The values of c_v and c_a vary much more depending on how much money was invested on the interconnection infrastructure. In some cases the network is virtually unbounded, in that case the bounding factor is the speed of the memory bus of the individual PUs: clearly data cannot be pumped into the network faster than the rate at which they can be supplied by the memory. This means that we may have c_v = 10^-9 (when t_v indicates the number of integers). On the other hand, the PUs may also be coupled very loosely. For example, they may be connected by fast ethernet or worse, giving up to c_v = 10^-6. c_a can vary accordingly. In ideal cases we have c_a = 10^-7, but it may also be orders of magnitude larger. Notwithstanding all variations, we will work under the general assumption that c_i is considerably smaller than c_v, which in turn is orders of magnitude smaller than c_a.
t_i, the number of internal operations, cannot be estimated accurately because of the obscure internal operation of a modern processor. However, in many cases, it is possible to compare it with the number of operations performed by a sequential algorithm. For example, we will encounter sorting algorithms in which it is clear that t_i(P, n) = (1 + o(1)) * t_i(1, n) / P. Here t_i(P, n) denotes the number of internal operations performed by a PU when solving a problem of size n on P PUs. For such an algorithm, as long as we can guarantee that the communication terms are negligible, we may expect to achieve very high efficiencies.
Here we touch on the main point in the context of DMM computation: the contribution c_v * t_v + c_a * t_a to T_tot is perceived as the loss. For most applications t_i(P, n) decreases linearly with P, whereas particularly t_a(P, n) tends to increase at least linearly with P. So, in order to obtain acceptable efficiency, P must not be chosen too large in comparison to n. It depends both on the time complexity of the problem and on the development of t_a(P, n) how large P can be chosen, or said otherwise, for how small n it still makes sense to use P PUs.
The DMM model is a compromise between accurately modeling the time consumption of a parallel program and simplicity. It takes into account that sending small packets should be avoided, but neglects locality aspects.
The time consumption of algorithms for interconnection networks is mostly expressed in steps. A step consists of a certain amount of computation followed by communication. Often there is no limit on the amount of computation that can be performed in a step. The communication is ruled by the precise model assumptions, but typically it is assumed that one packet of unit size can be transferred over each connection in each step. These packets can neither be divided nor composed. This cost model will be referred to as interconnection-network model.
The main assumption is that PUs can only communicate with neighbors and that longer distance communications are achieved by forwarding packets along some path connecting the source PU with the destination PU. This is called store-and-forward routing. Due to the dominance of wormhole routing, this assumption does not reflect the current technological conditions. There are nevertheless good reasons to consider a cost model based on it:
The model is further subdivided according to whether PUs can communicate with all neighbors at the same time, the all-port model, or only with one of them, the single-port model. Sometimes it is specified that the connections are uni-directional (also called half-duplex), in that case a PU cannot send to a PU from which it is also receiving. Otherwise the connections are said to be bi-directional (also called full-duplex).
The interconnection-network model neglects the aspects of computation. Thereby it allows to concentrate on the fundamental aspects of data exchange.
When considering computers with a shared memory, it is important to specify which types of access to the memory are allowed:
Depending on the underlying hardware, any of these models may be realistic. They are not far apart anyway: the most powerful of these models, a step of a Maximum CRCW machine with P PUs can be simulated on an EREW machine in O(log P) steps. The main parallel computer model considered in this text is the PRAM model. A PRAM, Parallel Random Access Machine, is a synchronous parallel computer with a shared memory. When specifying the performance of a PRAM algorithm we will mostly specify the memory access model, so we may say "On a CREW PRAM ... ".
It would be most correct to formulate a PRAM algorithm including load and store operations specifying which data have to be copied from the global memory to the local memory and vice versa. However, the existence of a local memory is mostly ignored, specifying algorithms as if all the PUs operate directly on the shared memory itself. This is similar to the practice of writing sequential algorithms without mentioning the memory management. The main reason why the PRAM model is so popular is that it allows to concentrate on the work to perform.
The shared memory model entirely neglects the aspects of data exchange. Thereby it allows to concentrate on the fundamental aspects of parallelizability.
An algorithm is evaluated with two cost parameters: the time, that is, the number of parallel steps; and the work, the sum over all steps of the number of performed operations. Specifying the performance of an algorithm in this way leaves the allocation of the work to the PUs unsolved. In principle it should be possible to perform a step in which w operations are executed on a PRAM with P PUs in round_up(w / P) time, but only if each PU knows which operations to execute.
If the work-allocation problem can be solved, then a WT algorithm gives a PRAM algorithm: if W(n) and T(n) are the work and time for a problem of size n, respectively, then
T(P, n) = sum_{step i} round_up(w_i / P)
<= T(n) + sum_{step i} w_i / P
<= T(n) + W(n) / P,
C(P, n) = P * T(P, n)
<= P * T(n) + W(n).
These relations immediately suggest a close to optimal choice of P for a given value of n: take P = P(n) = W(n) / T(n): for that value T(P, n) <= 2 * T(n) and C(P, n) <= 2 * W(n). So, for this choice of P, the completion time is at most twice the minimum value, and the cost is at most twice as large as the work. This is a good compromise: taking P larger, C(P, n) increases rapidly, without giving a substantial reduction of T(P, n). Taking P smaller gives substantially larger T(P, n), without giving a substantial reduction of C(P, n).
If there is a parallel algorithm in the work-time framework solving a problem of size n with work W(n) and time T(n), then, provided the work allocation can be solved, there is a PRAM algorithm with time O(T(n)) and cost O(W(n)).
Suppose we have a PRAM algorithm solving a problem with P_PRAM PUs in T_PRAM time. We want to execute it on a DMM with P_DMM PUs. It is understood that P_DMM << P_PRAM. In many cases P_DMM must be fairly small in relation to the problem size in order to obtain speed-up. The easiest is to let PU_i, 0 <= i < P_DMM, of the DMM perform the work executed by the PRAM PUs with indices k, i * P_PRAM / P_DMM <= k < (i + 1) * P_PRAM / P_DMM.
Of course, this is not all. In the first place there is no guarantee that in this way the DMM PUs are equally loaded, even though they take over the work of equally many PRAM PUs. This is called the load balancing problem. For example, when simulating the PRAM algorithm for computing the sum, more and more PRAM PUs become idle. If things are arranged so that at all times the first PUs remain active, then the above processor allocation concentrates all the work unnecessarily on a few DMM PUs. In the case of the sum algorithm this can be prevented by letting the work of PU_j of the PRAM be executed by PU_{j mod P_DMM} of the DMM. However, we strive for a blind PRAM simulation, not one in which every algorithm has to be studied before we can decide how to allocate the PRAM PUs to DMM PUs. A possible solution is to randomly allocate the PRAM PUs to the DMM PUs, for example, by using a universal hash function. This does not exclude bad allocations, but makes them unlikely.
Another point is the memory allocation. In some cases one might in principle distribute the complete data set to all DMM PUs, and then compute on these. This is normally not a good practice: this requires much communication and much memory at the PUs. One of the strong points of a parallel computer is that the joint main memory is much larger than that of most single-processor machines, thereby allowing to solve large problems without exceeding the memory size. Furthermore, there are not so many problems which can be split in non-interacting subproblems. So, normally the data set is distributed over the PUs in such a way that each data item is stored in a single PU.
Often it is good enough to use a regular allocation like for the PRAM PUs. In such a regular allocation, data item i is either stored in DMM PU i mod P_DMM or in i * P_DMM / number_of_items_to_allocate. If now a PRAM PU which is allocated to DMM PU p needs a data item stored in DMM PU p', it sends a request to PU p' and waits for the reply. A typical way of progressing is that each PRAM step is decomposed in three substeps:
A possible communication schedule for sending all requests is to have P_DMM - 1 communication steps: in step t, 1 <= t < P_DMM, PU p sends to PU (p + t) mod P_DMM. In this way each PU is sending and receiving only one packet in each step. If the pattern is balanced this works fine. A pattern is said to be balanced if each DMM PU is approximately sending the same number of requests to any other PU.
If the requests are balanced, then also the answers are balanced. If there are O(P_PRAM / P_DMM) requests, then with the above schedule each transferred packet has size O(P_PRAM / P_DMM^2). So, one such communication round costs O(P_PRAM / P_DMM) * t_v + 2 * (P_DMM - 1) * t_a. From this it becomes clear that we must have P_PRAM / P_DMM^2 >> 1 if we do not want that the start-up costs dominate the overall time consumption.
The condition P_PRAM / P_DMM^2 >> 1 coincides with the condition that in most cases will guarantee that the pattern is balanced. The average packet size is P_PRAM / P_DMM^2 and if we may assume that the requests are randomly distributed over the PUs, the size of a packet is given by a random distribution with expected value S = P_PRAM / P_DMM^2. Such a distribution is tightly concentrated around its expected value. Chernoff bounds give that the deviations from the expected value are bounded by O(sqrt(S * log(P_PRAM))) with high probability.
However, a fixed data allocation can normally not guarantee that the communication pattern is balanced. An approach which makes it very likely that for P_PRAM / P_DMM^2 >> 1 all arising communication patterns are balanced is to allocate the data in a random way just as we did with the PUs. Again a hash function should be used, otherwise it would be hard to figure out where a data item is stored. Nevertheless, evaluating a hash function implies considerable overhead in comparison to a sequential algorithm which can access its data directly. If such a construction is applied, the efficiency drops considerably.
In some cases the problem is more fundamental: if the same data item is used by many PRAM PUs, then there is no way of allocating the item so that the communication pattern becomes balanced. The PU holding this item will be flooded by requests, and applying the above approach will give very poor performance, because all requests are handled sequentially. A clever algorithms designer will know whether such patterns occur in his/her application or not and mostly no action is needed.
If the problem of having many requests for a small subset of data items really occur or may occur, then one must do something about it. A practical solution is to first sort all requests on the index of the data item they are destined for. Sorting is not for free, but normally we can apply bucket sort because the range of values is limited and the amount of involved communication is also quite limited as we will see in a later chapter. Then, for all requests to the same data item one is chosen as representative. This representative is forwarded to the item. The answer is returned and replicated to all requests. Then the routing pattern of the sorting is reversed.
PRAM simulation is the link between a library of PRAM algorithms and programs running on actual machines. Only some elementary communication algorithms must be implemented in a machine-specific way.
Above we noted that often there is a considerable discrepancy between the bounds in the PRAM and the DMM model. This does not mean that it is useless to try to design fast PRAM algorithms: simulating a PRAM algorithm with a running time T normally leads to a DMM algorithm with T communication rounds. A small number of communication rounds is a good thing to have, because this implies a smaller value of the number t_a of start-ups.
On the other hand, the number of start-ups is not the only cost factor. Consider again the formula giving the time consumption of an algorithm running on a DMM:
T_total ~= c_i * t_i + c_v * t_v + c_a * t_a.This means that the best parallel algorithm is mostly the one that minimizes t_a without substantially increasing t_i or t_v. From this we obtain a strong and interesting reduction of the set of practice-relevant PRAM algorithms:
Only cost-optimal PRAM algorithms have potential practical relevance. Most promising are cost-optimal algorithms with minimal running time.