Preface and Notation

This document contains the text of a second course in algorithms and data structures.

As HTML is not suited for mathematical formulas, some additional notation is used (borrowed from the typesetting package LaTeX). a_i denotes a with subscript i. a^i denotes a to the power i. <= and >= are used as in most computer languages. Curly brackets, "{" and "}", are used to group things. sum_{i = 0}^j denotes the sum for i running from 0 to j. The same may also be written sum_{0 <= i <= j}. ~= stands for "approximately equal" and ~ for "proportional to". Greek letters are written out. For logical expressions we either use the notation from C or write the operators out in text. So, "a && !b" is the same as "a and not b". sqrt stands for the square root function, and log without an explicit base stands for the logarithm to base 2. There are a few more notations, but they should be understood easily. New notions are printed in bold where they are defined. Inside the chapters, particularly important ideas and short key notes are highlighted without introductory text. There are many pictures. These are intended to be self-explanatory. Typically they are placed just after the text fragment to which they belong; generally there is no direct reference to them in the text.

There are very few references to the literature. Clearly most of the presented material is not new. A large part is even common knowledge. More directly this text is based on material found in the following books:

Several students have contributed by pointing out errors and spots which required better explanation.

Notice: the following text may be overcomplete. At the examination it is expected that the students only know all that has been presented during the lectures.

Introduction

Covered Material

This text constitutes a wider introduction to the topic of algorithms and data structures. For completeness' sake some of the basic definitions are repeated, but the emphasis lies on the slightly more advanced topics.

In the chapters on data structures, one can find a choice of dictionary and priority queue implementations, not just the basic two for each of them. There is also an analysis of the union-find algorithm. Most of these data structures are only good in an amortized sense, and therefore these chapters provide a rather intensive exposure to amortized analysis.

Most chapters deal with algorithmic topics. In the first place there are chapters on topics from a particular field: numerical algorithms, text algorithms and graph algorithms. The stress, however, lies on algorithmic approaches to solving problems: greedy methods, divide and conquer, dynamic programming, backtracking and branch and bound. These methods are treated in quite some detail, offering several examples in the body of the text and investigating some others in the questions.

At the same time many topics remain untouched. There are numerous data structures for ADTs other than dictionaries and priority queues. Dynamic trees, a data structure which allows the lowest common ancestor in a changing tree to be computed efficiently, is just one example. For graphs we are just treating a small selection of problems and not even these very intensively. Important topics like planarity, tree width, minors and covering are not even mentioned. There are even whole algorithmic directions which are not considered, such as randomized, geometric, approximation, online and parallel algorithms.

Asymptotical Notation

For estimating and comparing the time consumption of programs, we need some notions to estimate the time consumption of algorithms in a computer independent way. We want to make statements like "algorithm A is better than algorithm B". By this we mean: "for sufficiently large inputs a program that implements algorithm A will run faster than a program that implements algorithm B". The following notation is standard and widely used among all computer scientists:

Actually all these notions should be written with the "element-of symbol", so one should say "T(n) = 6 * n^2 + 2 * n^3 + 345 is an element of O(n^3)", but it is very common to use the equality symbol. Nevertheless one should not be fooled by this: T(n) = O(f(n)) does not imply that O(f(n)) = T(n) (this is not even defined) or that f(n) = O(T(n)) (which might be true, but which does not need to be true).

By far the most common of these symbols is O(). This symbol gives us an upper bound on the rate of growth of a function, allowing us to make an overestimate. In the chapter on union-find we will give an analysis of the time consumption and we will conclude that certain operations can be performed very efficiently, stating T(n) = O(f(n)) for some function f() which is almost constant. However, the actual result is even sharper, even in an asymptotic sense. Most other algorithms and data structures we study in this lecture are so simple that the time consumption of operations can be specified exactly. In that case we might even write T(n) = Theta(f(n)), but because it is typically the upper bound which interests us most, we will even in these cases mostly use O().

It is easy to show, arguing purely formally, that if T_1(n) = O(f_1(n)) and T_2(n) = O(f_2(n)), then T_1(n) + T_2(n) = O(max{f_1(n), f_2(n)}). As an example we will prove this. Other relations can be proven similarly. We know T_1(n) < c_1 * f_1(n) for all n > n_1, and T_2(n) < c_2 * f_2(n) for all n > n_2. Since both f_1(n) and f_2(n) are at most max{f_1(n), f_2(n)}, this implies T_1(n) + T_2(n) < c_1 * f_1(n) + c_2 * f_2(n) <= (c_1 + c_2) * max{f_1(n), f_2(n)}, for all n > max{n_1, n_2}.

Common terminology:

It happens, even in the literature, that people mix up Omega, Theta and O. You will see that O is used where clearly one of the other two is meant. The most common usage of the above notation is that one tries to express the time complexity of an algorithm concisely as O(f(n)) for some suitable and simple function f(). For an expression f(n) + g(n), f(n) is said to be the leading term and g(n) a lower-order term when g(n) = o(f(n)). Typically the time complexity involves a number of contributions. In that case only the leading term should be retained, while the lower-order terms are dropped. The leading term should be given without constants. For example, 23 * n * log n + 2 * n^2 + 100 * n is normally written as O(n^2). In some contexts it may make sense to specify the leading constant, the constant with which the leading term is multiplied. Then this same result can be given as 2 * n^2 + o(n^2), as (2 + o(1)) * n^2, or as 2 * n^2 + O(n * log n).

In order to determine the leading term, it is handy to know a few rules which all can be proven by using the definition above:

  1. c * f(n) = Theta(f(n)), for any constant c.
  2. n^a = o(n^b), for any a < b.
  3. a^n = o(b^n), for any a < b.
  4. log^c(n) = o(n), for any constant c.
  5. n^c = o(a^n), for any constant c and any a > 1.
The first is trivial (use alpha = c and alpha = 1 / c, respectively). The second follows because n^a / n^b = 1 / n^{b - a}. If b > a, then b - a > 0. In that case lim_{n -> infinity} 1 / n^{b - a} = 0, because lim_{n -> infinity} 1 / n^c = 0, for any constant c > 0. The third can be proven analogously and is left as an exercise.

The last two relations can be proven conveniently using l'Hopital's rule. This rule states that for differentiable functions f() and g(), with lim_{n -> infinity} f(n) = infinity and lim_{n -> infinity} g(n) = infinity, lim_{n -> infinity} f(n) / g(n) = lim_{n -> infinity} f'(n) / g'(n), provided the limit on the right exists. We now prove the fourth relation, leaving the last as an exercise. For c <= 0 the claim is immediate, because log^c(n) is then bounded for n >= 2, so assume c > 0. Then lim_{n -> infinity} log^c(n) / n = 0 <=> (lim_{n -> infinity} log^c(n) / n)^{1/c} = 0 <=> lim_{n -> infinity} log(n) / n^{1/c} = 0. The last transition is not trivial but may be found in textbooks on analysis. To prove this last statement we use l'Hopital's rule. First we notice that both f(n) = log(n) and g(n) = n^{1/c} are differentiable for n > 0 and go to infinity with n. Because log denotes the logarithm to base 2, f'(n) = 1 / (n * ln 2), where ln is the natural logarithm, and g'(n) = 1/c * n^{1/c} / n. So, f'(n) / g'(n) = c / (ln 2 * n^{1/c}). For any constant c > 0, lim_{n -> infinity} f'(n) / g'(n) = 0, and therefore we conclude lim_{n -> infinity} f(n) / g(n) = 0, as was to be shown.

Computer and Cost Model

It is common, although other assumptions are made as well, to work under the so-called RAM cost model, also called the "von Neumann" model. This means that we assume that all basic instructions take equally long, and that we take the total number of basic instructions as our cost measure. So, time, measured in terms of basic instructions, and expressed in O() notation, is our main concern.

One should be aware that this model is only a coarse approximation of the reality that holds for modern computer systems. Of course, at least in theory, this is dealt with by the O() assumption: all operations are constant time, but the constants may be quite considerable. The most important aspect is the non-uniform cost of memory access. The following picture gives a very high-level view of a modern computer system.

Sketch of a memory hierarchy

There are a few registers, there is 16-64 KB of first-level cache, there is 256-1024 KB of second-level cache, there is 64-16384 MB of main memory and a hard disk with storage for 20-200 GB. Each higher level of memory has higher access costs. The registers and the first-level cache may be assumed to be accessible in 1 clock cycle. The second-level cache costs several clock cycles to access. Upon a cache miss (that is, when the required data are not available in the cache) the data are fetched from the main memory, which currently costs on the order of 200 clock cycles. This cost is partially amortized by look-ahead and loading a cache line consisting of 64 bytes, but will nevertheless slow down the computation noticeably. Much worse is a page-fault (that is, when the required data are not available in the main memory). In that case, the data are fetched from the hard disk. This is a terribly expensive operation, costing about 10 ms. Again it is attempted to amortize these costs by delivering a large page of 10 to 100 KB, but this only helps if one is indeed using the data on the page. Random access to the secondary memory (a more general name for memory like hard disks) is devastatingly slow.

Other factors may, however, be equally important. The main other factor is space. There are examples of problems that can be solved faster if we are willing to use more space. So, in that case we find a space-time trade-off. A trivial (but extreme) example is to create a dictionary for all words of at most five letters with 27^5 storage. After initialization, we can perform insertions, deletions and look-ups in constant time. Actually, applying the idea of virtual initialization (treated in one of the later chapters), the prohibitively expensive initialization does not need to be performed.
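The dictionary for words of at most five letters can be sketched in Python as follows. Several details are assumptions that are not in the text: words consist of the lowercase letters a-z, the 27th symbol serves as padding for words shorter than five letters, and the names WordSet and word_index are purely illustrative. The table is initialized explicitly here, so the sketch shows the constant-time operations but not virtual initialization.

```python
ALPHABET_SIZE = 27                      # 26 letters plus one padding code (assumption)
MAX_LEN = 5
TABLE_SIZE = ALPHABET_SIZE ** MAX_LEN   # 27^5 = 14,348,907 slots


def word_index(word: str) -> int:
    """Map a lowercase word of at most five letters to its own table slot."""
    if len(word) > MAX_LEN or not all("a" <= ch <= "z" for ch in word):
        raise ValueError("expected at most five letters a-z")
    index = 0
    for position in range(MAX_LEN):
        # Letters get codes 1..26; positions past the end of the word get code 0.
        code = ord(word[position]) - ord("a") + 1 if position < len(word) else 0
        index = index * ALPHABET_SIZE + code
    return index


class WordSet:
    """Direct-address dictionary: one byte per possible word."""

    def __init__(self) -> None:
        # This is the expensive initialization: Theta(27^5) time and space.
        self.table = bytearray(TABLE_SIZE)

    def insert(self, word: str) -> None:
        self.table[word_index(word)] = 1

    def delete(self, word: str) -> None:
        self.table[word_index(word)] = 0

    def contains(self, word: str) -> bool:
        return self.table[word_index(word)] == 1
```

Every operation computes one index with at most five multiplications and touches a single slot, which is where the constant time per operation comes from.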

A further point that should be stressed here is that when evaluating the performance of an algorithm, we consider the time the algorithm might take for the worst of all inputs. That is, we perform a worst-case analysis. Thus, if we state "this sorting algorithm runs in O(n * log n)", we mean that the running time is bounded by O(n * log n) for all inputs of size n. If on the other hand we say "this sorting algorithm has quadratic running time", we do not mean that it takes c * n^2 time for all inputs of size n, but that there are inputs (maybe not even for all values of n) for which the algorithm takes at least c * n^2 time for some constant c. This type of analysis is by far the most common in the study of algorithms, but occasionally one gets the feeling that this does not reflect the "real" behavior in practice. Maybe there are only some rare, very artificial instances for which the algorithm is slow. In that case one may also perform some kind of average-case analysis. The problem is to define what the average case is. For example, when sorting n numbers, can one assume that all n! arrangements are equally frequent? In practice there may be a tendency to have some kind of clustering. The strong point of a worst-case analysis is that it leaves no room for such discussions. Furthermore, one can easily find many contexts where one wants guaranteed performance.

Limitations of the Cost Model

Why do we need a cost model at all, and what makes a good cost model? The purpose of a cost model is twofold. Ideally, the cost model should allow one to estimate the running time of a program reasonably well before actually running it. This is maybe asking too much, but nevertheless one would like to know whether a problem of size n will be solvable within one minute or not. A much weaker requirement is that one would like to be able to estimate the running time T(n') for a problem of size n' knowing the running time T(n) for a problem of size n. Another, probably even more relevant requirement on a cost model is that it allows one to compare the quality of algorithms without implementing them. If the problem has to be solved anyway, it is not so important how long a program will run, but it should not run longer than necessary. So, the cost model should not abstract away features which are essential for the running time in practice.

The other main requirement to a model is simplicity. Clearly there is a trade-off between accuracy and simplicity. The von Neumann model is simple. In that sense it is very satisfactory, and the fact that there has been a widely accepted cost model is one of the main reasons for the rapid development of the theory of algorithms and data structures.

In the remainder of this section we consider the limitations of the von Neumann model in more detail, and give examples of problems for which these imply that it either does not give a reasonable impression of the actual running time, or that it leads to preferring an algorithm which in practice does not fulfill the expectations.

Neglecting Constants

In extreme cases, neglecting the constants may give a completely false impression of the performance of an algorithm. For example, there is an O(n) algorithm for computing the "tree-decomposition of a k-tree with n nodes for any finite k". At this point it is not important what the problem stands for. Hearing this, one might believe that computing tree-decompositions is a relatively easy problem. However, the problem is virtually unsolvable for any k > 2 and n > 10, because the complexity is something like 100 * n^2^{2^k}. This is an exceptional case, but generally, as soon as one starts to implement algorithms, the constants matter a lot.

One of the main problems with fixing a cost model is that it is abused by theoreticians: it is considered to be a great improvement to obtain an O(n * log n / loglog n) algorithm for a problem which until then could not be solved faster than O(n * log n). However, for any practical value of n, loglog n < 6. Therefore, if one wants to know whether the new algorithm is better, one should know the ratio of the neglected leading constants. In many cases such "improved" algorithms are considerably more complicated. This does not only mean that they are harder to implement, but often also that the involved constants are bigger. Another serious case is that of "improvements" based on assumptions of the model which were never intended to be stretched to the limit. The most noticeable example is the exploitation of the assumption that the computer we are working on is able to perform arithmetic on log n bits in parallel, when the size of the problem is n. With tricks of this kind the asymptotic running time of some algorithms has been reduced by a factor loglog n, but only on machines with asymptotically large word size.

Worst Case Analysis

When we are sorting n numbers with bubble sort and plot the running times, the curve shows a clear quadratic increase. If we use merge sort instead, it grows like n * log n. In these cases the statements "sorting n numbers with bubble sort takes O(n^2) time" and "sorting n numbers with merge sort takes O(n * log n) time" are meaningful, and allow one to predict correctly that for not too small values of n one should rather take merge sort for sorting. In this case the actual behavior corresponds to the worst-case behavior.

This is not always the case. The worst-case complexity of Karzanov's max-flow algorithm is O(n^3), where n is the number of nodes of the graph. However, running Karzanov's algorithm shows that for most graphs the running time is more like O(n^2). A more extreme example is encountered in the domain of linear programming. The classical algorithm for this, the simplex algorithm, has exponential running time. Nevertheless the simplex algorithm effectively solves even very large problem instances. A more recent development, the ellipsoid method, solves the linear programming problem in polynomial time. Only recently has the ellipsoid method become somewhat competitive.

Therefore, one may want to capture something like the "actual" behavior of an algorithm. The problem with this is that it is very hard to tell what inputs the algorithm is going to be used for. Nevertheless it may be reasonable to study the average-case complexity. This does not always solve the problem though, because in the first place this mostly makes it much harder to analyze the complexity, and in the second place one must define a probability space which seems to capture the features of a typical instance. Considering graph algorithms, it is often tempting to perform an analysis for some kind of random graphs. However, these have very particular properties: the internet graph or the graph corresponding to a road network has very different properties. This means that a graph algorithm which has good average-case performance on random graphs does not need to perform well at all on the internet graph or on road graphs.

The following sorting algorithm is a nice example of an average-case algorithm, that is, an algorithm that performs well under an average-case analysis. We consider sorting n numbers which are uniformly sampled from an interval [0, M>. The idea is to create n buckets of size M / n. First all elements are sorted into the buckets using a slightly generalized bucket-sorting algorithm: a number with key i goes to bucket k, 0 <= k < n, with k so that k * M / n <= i < (k + 1) * M / n. Then the elements in each bucket are sorted in a conventional O(n * log n) way. The total time T(n) is given by

T(n) = O(n) + sum_{0 <= k < n} "time to sort bucket k".

To analyze the expected running time of the algorithm, we use a basic, but extremely important fact from probability theory:

The expected value of a sum equals the sum of the expected values.

This fact, a proof of which can be found in any book on probability theory, is called linearity of expectation. So, the expected running time T_expected(n) for the sketched sorting algorithm is given by

T_expected(n) = O(n) + c * n * sum_{x >= 0} x * log x * Prob[x],

where Prob[x] denotes the probability that a bucket receives exactly x elements. It is easy to see that

Prob[x] = (n over x) * (1 / n)^x * (1 - 1/n)^{n - x}.

Prob[0] ~= Prob[1] ~= 1 / e; thereafter the probabilities decrease strongly, roughly like 1 / x!. For sufficiently large x this can be estimated to be less than 1 / (log x * 2^x), and then it immediately follows that

sum_{x >= 0} x * log x * Prob[x] = O(1).

So, for sorting n numbers which are uniformly sampled from a finite interval the above algorithm has expected running time O(n). A nice feature of the algorithm is that it is correct and efficient even for inputs which are not nicely distributed: in the worst case all elements go to a single bucket. In that case, the initial O(n) work was entirely wasted, but we still finish in O(n * log n) time. In practice, if the interval is not known a priori, its bounds can be found with one extra scan through all numbers. Whether the bucketing helps to reduce the sorting time or not depends on how evenly the numbers are distributed within the interval.
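The algorithm described above can be sketched in Python as follows. The function name and the parameter upper (playing the role of M) are illustrative assumptions, and Python's built-in list sort stands in for the "conventional" O(n * log n) sort inside each bucket.

```python
def bucket_sort_uniform(values, upper):
    """Sort numbers taken from [0, upper) using n buckets of width upper / n."""
    n = len(values)
    if n == 0:
        return []
    buckets = [[] for _ in range(n)]
    for v in values:
        # Bucket k receives the keys i with k * upper / n <= i < (k + 1) * upper / n.
        k = min(int(v * n / upper), n - 1)   # min() guards against rounding up to n
        buckets[k].append(v)
    result = []
    for bucket in buckets:
        bucket.sort()            # conventional sort; buckets are small on average
        result.extend(bucket)    # buckets are visited in increasing key order
    return result
```

If all inputs are equal, every element lands in the same bucket and the call degenerates to one conventional sort of n elements, which matches the worst case discussed above.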

Memory Usage

Most algorithms do not use substantially more memory than the size of the input (this means that the amount of needed memory is O(n), for an input of size n). In that case there is no need to specify the memory usage. However, there are many examples where using more memory may in principle lead to faster algorithms. In general one cannot assume that memory has been initialized to any specific value, and initializing f(n) memory takes Theta(f(n)) time. So, it may appear that it is hard to profit from having much memory. However, one can imagine a problem which is solved in O(n^3) with O(n) memory, and in O(n^2) with O(n^2) memory. Furthermore, there is the idea of virtual initialization, which allows one to use uninitialized memory by creating certificates for all the used memory. In this way, using O(M) memory, a dictionary ADT for a set of n indices from [0, M> can be implemented with O(1) time per operation.

Uniform-Cost Model

There is even more to say about cost measures. Counting (rather estimating) instructions is fine, but in many contexts, some instructions are more important than others. In those cases, it makes sense to count these separately. For example in

External Computation

This means that we are dealing with a problem that is so large that only a fraction of the data fits in the main memory, while most data are stored on the hard disk. Because accessing the hard disk is much more expensive than accessing the main memory, we should not count an I/O instruction as 1, even though it has constant cost. In this context, in order to get a somewhat accurate estimate of the time consumption, one should know three values: the number t_i of internal operations; the number t_v of integers that is read or written, the data volume; and the number t_a of accesses to the secondary memory. The cost formula then might look like
T_total ~= (0.3 * t_i + 1,000 * t_v + 10,000,000 * t_a) * 10^{-9}.

The importance of an accurate formula of this kind is that it allows one to rank algorithms that otherwise appear equally good. It also gives guidance when trying to design better algorithms. This formula shows that, as long as this does not lead to strong increases of t_i and t_v, the algorithm designer should focus on reducing t_a.
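The arithmetic behind this advice can be made concrete with a small Python sketch. The function name and the operation counts of the two algorithms are hypothetical; only the constants are taken from the cost formula for external computation given above.

```python
def external_cost_seconds(t_i, t_v, t_a):
    """Estimated running time in seconds for external computation.

    t_i: internal operations, t_v: integers read or written,
    t_a: accesses to the secondary memory.
    """
    return (0.3 * t_i + 1_000 * t_v + 10_000_000 * t_a) * 1e-9


# Two hypothetical algorithms with identical t_i and t_v but different t_a.
cost_many_accesses = external_cost_seconds(t_i=10**9, t_v=10**6, t_a=10**4)
cost_few_accesses = external_cost_seconds(t_i=10**9, t_v=10**6, t_a=10**2)
print(f"{cost_many_accesses:.1f} s versus {cost_few_accesses:.1f} s")
```

With these made-up counts the internal work and the data volume contribute 0.3 s and 1 s respectively in both cases, so the difference in running time comes entirely from the number of disk accesses.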

Parallel Computation

In parallel computation we try to solve a problem on P > 1 processors. If the memory is distributed over the processors, then data are exchanged by communication. In this case a reasonably accurate cost estimate can be obtained by estimating t_i as before, the number t_v of integers sent and received, and the number t_a of packets that are sent. The cost formula is similar to the one above, but the constants depend strongly on the system. A possible setting is
T_total ~= (0.3 * t_i + 10 * t_v + 10,000 * t_a) * 10^{-9}.

"Practical" Computation

Even in normal sequential computation it is no longer true that all instructions are equally expensive. In the first place there are the rather unpredictable effects of the pipeline. Pipeline optimization is mostly done in a final stage (in part by the compiler), and is typically not a theme for algorithm designers.

Much more important, and easier to understand is cache performance. On current computers cache misses cost around 600 clock cycles. Again, this is only a constant, but for any reasonable n, n * log n internal instructions cost less than n cache misses. Thus if an estimate of the cost of a sequential program is needed, it may make sense to estimate the number t_v of integers exchanged between cache and main memory and the number t_a of cache misses. For a modern fast computer we then find something like

T_total ~= (0.3 * t_i + 2 * t_v + 200 * t_a) * 10^{-9}.

As an example of algorithms with different cache behavior, we consider quick sort and bucket sort. If applicable, bucket sort should be much faster, because it is composed of only 3 or 4 simple linear time loops. Quick sort is simple too, but it performs at least log_2 n passes through the elements. Already for modest n, it performs 5 to 10 times more instructions. Nevertheless it is only slightly slower or even faster. The reason is the cache behavior: in bucket sort, the elements are traversed twice. The first time a counter is increased for every element. If the elements are not presorted, this means more or less randomly accessing the array with counters: if the array is larger than the size of the cache (on a Pentium IV the second level cache is 512 KB = 2^17 integers), this implies a cache miss for almost every element. During the second pass a counter is increased (or decreased) and an element is written to an array. This causes even 2 cache misses. So, for bucket sort we find t_a ~= 3 * n (actually slightly more, because the other operations also cause cache misses). In quick sort, the memory access is much more structured. Assume we are performing the in-situ sorting, where in every pass the array of elements is accessed from both sides, progressing towards the middle, exchanging elements that are wrongly placed. Only those little sections around the two positions in the array where the algorithm is operating must be held in cache. This is no problem. So, cache misses only arise when we come to the end of a cache line. Cache lines are not long (64 bytes = 16 integers on the Pentium IV), but still there is a large difference between handling all elements on a line and only one of them. Assuming that the splitters are chosen optimally (slightly under-estimating the costs), we can estimate t_a = log_2 n * n / b, where b is the number of integers on a cache line.
For n = 2^24 and b = 16, bucket sort causes twice as many cache misses as quick sort (notice that for this n only a very small fraction of the arrays can be held in cache, and that indeed almost every random access to an array of length n leads to a cache miss).
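The factor of two can be checked with a few lines of Python. The function names are illustrative; the two estimates are exactly the ones derived in the text (3 * n misses for bucket sort, log_2 n * n / b for quick sort with optimally chosen splitters).

```python
import math


def bucket_sort_misses(n):
    """Estimated cache misses of bucket sort: about three per element."""
    return 3 * n


def quick_sort_misses(n, b):
    """Estimated cache misses of in-situ quick sort with b integers per cache line."""
    return math.log2(n) * n / b


n = 2 ** 24
b = 16
ratio = bucket_sort_misses(n) / quick_sort_misses(n, b)
print(bucket_sort_misses(n), quick_sort_misses(n, b), ratio)
```

For these values quick sort makes 24 passes, each costing n / 16 misses, so 1.5 * n misses in total against 3 * n for bucket sort.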

Conclusion

The importance of considering alternative cost measures has continuously increased. The rate at which processors become faster (the famous "Moore's law", popularly paraphrased as saying that the clock speed doubles every 18 months) is considerably higher than the rate at which access to the main memory becomes faster, and the rate at which hard disks can be accessed increases even more slowly. This is not surprising: the main memory has certain dimensions. Information is packed denser and denser, but the amount to store increases, and in practice the boards of computers are still quite large. Building up a sufficiently high potential over a wire of 30 cm takes some time. Furthermore, there is a certain, almost constant, delay related to going through a connection. The operations within the processor can be controlled much better and here the improvements of chip design technique pay off much more directly. The hard disk is a mechanical device. The fact that it can be accessed only about 120 times per second is mainly due to rotational delay: data cannot be transferred before the read-head has reached the beginning of the requested section. On average we must wait for half a rotation. It is not easy to increase the rotational speed: forces become much larger, and this leads to increasing deformations.

The importance of considering constant factors gradually grows, because more complicated algorithms are developed in the course of time, for which it is often observed that in practice these are not better or even worse than the simple algorithms which have been used for a long time. The interest for algorithms using excessive amounts of memory is very limited, so from that side there is no danger to expect, and the analysis of the expected behavior of algorithms has greatly developed in the last 30 years, continuously contributing to a more complete understanding of the behavior and practical value of algorithms.

All this being said, in the rest of the text we will mainly deal with the von Neumann model, which still serves its purpose quite well. Anyway, the focus is on ideas rather than on specific algorithms. These ideas are quite fundamental, useful beyond the applications with which they are illustrated here, and robust to changes of the cost model.

Solving Recurrence Relations

The time consumption of recursive algorithms is mostly given as a recurrence relation. Even though this contains all information and allows one to compute the value for a concrete case, it is desirable to have an explicit form. This amounts to solving the recurrence relation. In general this is very hard, but for special cases the problem can be solved. In this section we suggest three approaches: tracing the form back to a common form which can be found in a table, guessing the correct form and verifying it by using induction, and solving it using a method which will be presented further down. The advantage of this last method is that, when applicable, it gives exact results, something which typically is hard to achieve by guessing.

Common Forms

The following list is very useful as a reference guide, covering the most commonly encountered recurrence relations:

Recurrence                                          Solution                                                  Examples
T(n) <= T(alpha * n) + c                            T(n) <= c * log_{1/alpha} n, for alpha < 1                binary search, Euclidean algorithm
T(n) <= T(alpha * n) + c * n                        T(n) <= c * n / (1 - alpha), for alpha < 1                selection
T(n) <= T(alpha * n) + T((1 - alpha) * n) + 1       T(n) <= 2 * n - 1
T(n) <= T(alpha * n) + T((1 - alpha) * n) + c * n   T(n) <= c * n * log_{1 / alpha} n, for 1/2 <= alpha < 1   merge sort
T(n) <= T(n - a) + c * n                            T(n) <= c * (n^2 / a + a)
T(n) <= a * T(n / b)                                T(n) <= n^{log_b a}                                       recursive (matrix) multiplication
T(n) <= 2 * T(n - 1) + 1                            T(n) <= 2^n

Here a, b and c are arbitrary positive constants, mostly integers.
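Entries of such a list can be checked numerically by simply evaluating the recurrence. The following Python sketch does this for two of the forms. The base cases T(1) = 1 and T(0) = 0, the choice alpha = 1/2 with rounding, and the function names are assumptions made for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def split_cost(n):
    """T(n) = T(floor(n / 2)) + T(ceil(n / 2)) + 1, with assumed base case T(1) = 1."""
    if n == 1:
        return 1
    return split_cost(n // 2) + split_cost(n - n // 2) + 1


def doubling_cost(n):
    """T(n) = 2 * T(n - 1) + 1, with assumed base case T(0) = 0."""
    t = 0
    for _ in range(n):
        t = 2 * t + 1
    return t


# Compare the evaluated recurrences with the bounds 2 * n - 1 and 2^n.
for n in range(1, 11):
    print(n, split_cost(n), 2 * n - 1, doubling_cost(n), 2 ** n)
```

With these base cases the first recurrence meets the bound 2 * n - 1 exactly, and the second evaluates to 2^n - 1, just below the listed bound.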

Intelligent Guesswork

Not all recurrences are of one of the listed forms. But sometimes there is a certain similarity, which allows one to guess the form. It may also happen that one does not have the list of forms available, while still knowing roughly what the form should look like. More generally, one can proceed as follows:
  1. Calculate the first few values
  2. Look for a pattern and guess a suitable general formula
  3. Prove that the guessed formula is the correct one using induction

This is indeed a very effective approach; one will rarely need more than this. Of course one needs considerable experience to perform step 2. As an exercise we will see how this works for the Fibonacci numbers defined by

F(0) = 0,
F(1) = 1,
F(n) = F(n - 1) + F(n - 2), for all n >= 2.
The first few values are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... . Looking at the development of these, we get the feeling that they grow exponentially. So, let us guess F(n) = x^n. If this is correct, then we should have x^n = x^{n - 1} + x^{n - 2}, which only holds when x^2 - x - 1 = 0. Of the two roots of this quadratic equation, only (1 + sqrt(5)) / 2 (the "Golden Ratio") makes sense, because the other root is negative. This guess, however, does not match F(0) = 0, but this is a minor inconvenience: without problem we can prove the following upper bound.

Lemma: F(n) <= x^n, for x = (1 + sqrt(5)) / 2, for all n.

Proof: The proof goes by induction. For n = 0 and 1, it is ok (base case). So, assume the lemma is true for n - 1 and n - 2, for some n >= 2 (induction assumption). Then, F(n) = F(n - 1) + F(n - 2) <= x^{n - 1} + x^{n - 2} = x^n (step). Here the first "=" holds because of the definition, the "<=" holds because of the induction assumption, and the second "=" because of the choice of x. End.
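The three steps (calculate values, guess a formula, verify) can also be tried out numerically before attempting the induction. The Python sketch below computes F(n) iteratively and compares it with the guessed bound x^n; the names fibonacci and GOLDEN_RATIO are chosen for this illustration, and the floating-point comparison is only a sanity check, not a proof.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2   # the positive root of x^2 - x - 1 = 0


def fibonacci(n):
    """Return F(n) with F(0) = 0, F(1) = 1 and F(n) = F(n - 1) + F(n - 2)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Tabulate F(n) next to the bound x^n from the lemma.
for n in range(12):
    print(n, fibonacci(n), round(GOLDEN_RATIO ** n, 3))
```

The table also suggests that the bound is not tight: the ratio x^n / F(n) settles near a constant larger than 1, which hints that an exact solution needs more than a single power of x.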

Homogeneous Recurrences

Now we will present a method which is more general and more precise. We first consider the simplest type of recurrence relations. These are characterized by the fact that there are constants a_i, 0 <= i <= k, so that
a_0 * t_n + a_1 * t_{n - 1} + ... + a_k * t_{n - k} = 0, for all n.
To really fix the recurrence, its value must be given for k different values of n. The very special character of this recurrence relation lies in the fact that it is a linear expression, with constant coefficients, and homogeneous: the sum equals zero.

The Fibonacci sequence is an example of such a linear homogeneous recurrence relation. We can take k = 2, with a_0 = 1, a_1 = -1, a_2 = -1. On the other hand, the expressions for the time development of a refined matrix multiplication are not homogeneous: using t_n = T(2^n), it reads t_n = 7 * t_{n - 1} + c * 2^{2 * n}, which can be rewritten as 1 * t_n - 7 * t_{n - 1} = c * 2^{2 * n}. Thus, it is linear, but not homogeneous. For these one has to go one step further (see next section).

The central concept in the study of homogeneous recurrence relations is that of the characteristic polynomial. If one assumes that the solution of the recurrence relation is given by t_n = x^n, then one should have

a_0 * x^n + a_1 * x^{n - 1} + ... + a_k * x^{n - k} = 0.
There is a trivial, uninteresting solution x = 0, which is clearly correct. If we assume x != 0, we can divide by x^{n - k}, and obtain
a_0 * x^k + a_1 * x^{k - 1} + ... + a_k = 0.
The left-hand side is the characteristic polynomial of the recurrence relation. A polynomial of degree k has k roots (not necessarily all of them different and not necessarily all of them real), which we denote r_i. Said otherwise, the polynomial can be rewritten as
a_0 * x^k + a_1 * x^{k - 1} + ... + a_k = a_0 * (x - r_1) * (x - r_2) * ... * (x - r_k)

The reason that we considered this characteristic polynomial is that these roots precisely lead us to the solutions of the recurrence relation: taking t_n = r_i^n, for all n, satisfies the relation. This can be checked easily by substituting these t_n in the original equation: after dividing by r_i^{n - k}, we find back the characteristic polynomial with r_i substituted for x, which equals 0 because r_i is a root.

Thus, we have found k different solutions. But actually, these are the generators of a k-dimensional linear space of solutions, because the linear combination of any two solutions is also a solution:

Lemma: If a homogeneous recurrence relation is satisfied by f_i and g_i, then it is also satisfied by h_i = x * f_i + y * g_i.

Proof: sum a_i * h_{n - i} = sum a_i * (x * f_{n - i} + y * g_{n - i}) = x * sum a_i * f_{n - i} + y * sum a_i * g_{n - i} = x * 0 + y * 0 = 0. End.

So, for any choice of constants c_i, 1 <= i <= k, all t_n of the form
t_n = sum_{i = 1}^k c_i * r_i^n
are solutions. The k "boundary" conditions determine which of these choices is the correct solution: we get a system of k equations with k unknowns, which (provided there is no degeneration) has a unique solution.

Theoretically this is the most important thing you should know about homogeneous recurrence relations. Practically there is the problem of finding the roots and solving the system of equations. For k = 2 both are easy though, and for k = 3 you can often guess a root and then reduce the degree to 2. As an example we consider F(n) again. The characteristic equation is x^2 - x - 1 = 0, with roots r_1 = (1 + sqrt(5)) / 2 and r_2 = (1 - sqrt(5)) / 2. Now we are looking for constants c_1 and c_2 so that

c_1 * r_1^n + c_2 * r_2^n = F(n)
for all n, including n = 0 and 1. This gives
c_1 + c_2 = 0
c_1 * r_1 + c_2 * r_2 = 1
This gives c_1 = - c_2 and c_1 * (r_1 - r_2) = 1. Which gives c_1 = 1 / sqrt(5) and c_2 = - 1 / sqrt(5):
F(n) = 1 / sqrt(5) * ( ((1 + sqrt(5)) / 2)^n - ((1 - sqrt(5)) / 2)^n )

It is quite surprising that all the factors sqrt(5) just cancel so that F(n) is an integer for all n. If one knows the theory, deriving this equation was actually very easy. Furthermore, the result is much more precise than the one obtained before, telling us the constant factor and the speed at which it converges to the function 1 / sqrt(5) * ((1 + sqrt(5)) / 2)^n.
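The closed form can be checked numerically against the recurrence. The following C sketch is only an illustration: the names fib_iter and fib_closed are chosen here, sqrt(5) is written as a constant, and the closed form is evaluated in floating point and rounded, which is assumed to be reliable only for moderate n (here n <= 40).

```c
#include <assert.h>

/* F(n) computed directly from F(n) = F(n - 1) + F(n - 2). */
long long fib_iter(int n)
{
    long long a = 0, b = 1; /* invariant: a = F(i), b = F(i + 1) */
    for (int i = 0; i < n; i++) {
        long long t = a + b;
        a = b;
        b = t;
    }
    return a;
}

/* F(n) from c_1 * r_1^n + c_2 * r_2^n with c_1 = -c_2 = 1 / sqrt(5),
   rounded to the nearest integer. */
long long fib_closed(int n)
{
    assert(n >= 0 && n <= 40);           /* assumed range for this sketch */
    const double s5 = 2.23606797749979;  /* sqrt(5) as a constant */
    const double r1 = (1.0 + s5) / 2.0;
    const double r2 = (1.0 - s5) / 2.0;
    double p1 = 1.0, p2 = 1.0;           /* r1^i and r2^i */
    for (int i = 0; i < n; i++) {
        p1 *= r1;
        p2 *= r2;
    }
    return (long long)((p1 - p2) / s5 + 0.5);
}
```

Both functions should return the same value for every n in the assumed range, for example 55 for n = 10.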

A special case occurs when a root r_i has multiplicity m_i larger than 1. In that case the construction above yields fewer than k different solutions of the form r_i^n, so these cannot generate the whole k-dimensional space of solutions. The missing solutions are found as follows: if a root r has multiplicity m, then t_n = n^j * r^n is also a solution, for all 0 <= j < m. This can be shown by taking derivatives j times. The idea is to use that a function of the form (x - r)^m * f(x) is zero at x = r, and so are its derivatives up to order m - 1. So, if there are l distinct roots r_i with multiplicities m_i (of course sum_{i = 1}^l m_i = k), then the general form of a solution is given by

t_n = sum_{i = 1}^l sum_{j = 0}^{m_i - 1} c_{i, j} * n^j * r_i^n
If all m_i = 1, we find back the earlier form.
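As an illustration of a multiple root, consider the recurrence t_n = 4 * t_{n - 1} - 4 * t_{n - 2} with t_0 = 0 and t_1 = 2. This example is not taken from the text above; it is chosen here because its characteristic polynomial x^2 - 4 * x + 4 = (x - 2)^2 has the double root 2. The general solution is (c_0 + c_1 * n) * 2^n, and the two boundary values give c_0 = 0 and c_1 = 1, so t_n = n * 2^n. The C sketch below (function names are hypothetical) compares this with the recurrence.

```c
#include <assert.h>

/* t_n from t_n = 4 * t_{n-1} - 4 * t_{n-2}, with t_0 = 0 and t_1 = 2. */
long long rep_rec(int n)
{
    assert(n >= 0 && n <= 50); /* keeps all values within 64 bits */
    long long prev = 0, cur = 2;
    if (n == 0)
        return prev;
    for (int i = 2; i <= n; i++) {
        long long next = 4 * cur - 4 * prev;
        prev = cur;
        cur = next;
    }
    return cur;
}

/* Closed form for the double root r = 2: t_n = n * 2^n. */
long long rep_closed(int n)
{
    assert(n >= 0 && n <= 50);
    return (long long)n * (1LL << n);
}
```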

Inhomogeneous Recurrences

If at the right-hand side of the recurrence relation we do not find a zero but any other value, the recurrence relation is said to be inhomogeneous. One way of solving these, is to estimate this value in terms of the values t_i, thereby reducing inhomogeneous relations to homogeneous ones. In the following we present a more accurate method not involving any guesswork for an important special case. There are several more general types of inhomogeneous recurrence relations which can be solved exactly. Methods for this are presented in any book on discrete mathematics.

We consider inhomogeneous recurrence relations of the following form:

a_0 * t_n + a_1 * t_{n - 1} + ... + a_k * t_{n - k} = b^n * p(n), for all n.
Here b is a constant and p(n) is a polynomial of degree d. The main idea is to transform this equation to a homogeneous expression with k' = k + d + 1. In the following we only consider the simplest case: d = 0.

So, assume we must solve

a_0 * t_n + a_1 * t_{n - 1} + ... + a_k * t_{n - k} = b^n, for all n.
Then we know, that
b * (a_0 * t_{n - 1} + a_1 * t_{n - 2} + ... + a_k * t_{n - k - 1}) = b^n, for all n.
Subtracting the second from the first gives a homogeneous equation:
a_0 * t_n + (a_1 - b * a_0) * t_{n - 1} + ... + (a_k - b * a_{k - 1}) * t_{n - k} - b * a_k * t_{n - k - 1} = 0, for all n.
This equation can be solved by the method of the characteristic polynomial. This gives (assuming all roots are different):
t_n = sum_{i = 1}^{k + 1} c_i * r_i^n
Not all solutions are solutions of our original recurrence relation though: we increased the degree of the characteristic polynomial by one, without having extra boundary conditions. In the original formulation, the problem was fully specified, so now we have too many degrees of freedom. The extra boundary condition can be obtained, assuming t_0, ..., t_{k - 1} are specified, by computing t_k using the original formulation of the recurrence relation. Hereafter we can compute all the c_i.

As an example we consider the recurrence relation giving a possible expression of the time consumption of multiplying large integers by a clever recursive method. T(k), the time for multiplying two numbers consisting of k digits each, is given by

T(k) = 1, for k == 1,
T(k) = 3 * T(k / 2) + 11 * k, for all k > 1.
This formulation can be brought into the format we need by substituting t_n = T(2^n). This gives
t_0 = 1
t_n - 3 * t_{n - 1} = 11 * 2^n, for all n >= 1
2 * t_{n - 1} - 6 * t_{n - 2} = 11 * 2^n, for all n >= 2, so
t_n - 5 * t_{n - 1} + 6 * t_{n - 2} = 0, for all n >= 2
The characteristic polynomial is x^2 - 5 * x + 6 = (x - 2) * (x - 3). Thus, r_1 = 2 and r_2 = 3. We are lucky, real nice numbers! So, the solution is of the form c_1 * 2^n + c_2 * 3^n, for appropriate c_1 and c_2. We now need to compute t_1 = 3 * t_0 + 11 * 2^1 = 25. Then we must solve
t_0 = 1 = c_1 + c_2
t_1 = 25 = 2 * c_1 + 3 * c_2
This gives c_1 = -22 and c_2 = 23, so the solution of the recurrence relation is given by t_n = 23 * 3^n - 22 * 2^n, which is the same as before, but now in a much more mechanical way and without need to check the result with an inductive proof.
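The result is easy to check by evaluating the recurrence directly. The C sketch below is an illustration under the assumption that k is a power of two; the function names are chosen here.

```c
#include <assert.h>

/* T(k) = 3 * T(k / 2) + 11 * k with T(1) = 1, for k a power of two. */
long long mult_time_rec(long long k)
{
    assert(k >= 1 && (k & (k - 1)) == 0); /* k must be a power of two */
    if (k == 1)
        return 1;
    return 3 * mult_time_rec(k / 2) + 11 * k;
}

/* Closed form t_n = 23 * 3^n - 22 * 2^n, where t_n = T(2^n). */
long long mult_time_closed(int n)
{
    long long p3 = 1, p2 = 1; /* 3^i and 2^i */
    for (int i = 0; i < n; i++) {
        p3 *= 3;
        p2 *= 2;
    }
    return 23 * p3 - 22 * p2;
}
```

For example, both mult_time_rec(4) and mult_time_closed(2) give 119.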

Exercises

  1. Prove that log^c n = o(n), using as few as possible facts.

  2. Assume that processor speeds continue to double every 18 months, while memory access speeds double every 24 months. If now it takes 600 clock cycles to access the memory, how many clock cycles does it take in 5, 10 and 20 years? Show your computation.

  3. Quicksort has relatively good cache performance: in the top levels cache faults are produced, but only n / b per level, where n is the size of the input and b is the number of ints per cache line. Let M denote the cache size. If n = 2^k, M = 2^17 and b = 16, then there are about 3/2 * (k - 17) levels before the subproblems fit into the cache. Thus, the number of cache faults can roughly be estimated as 3 * (k - 17) / 32 * n. Notice that this increases linearly with k. Another approach is better in this respect. The idea is to first sort all subsets of a size so that all data fit into the cache. In practice the size should be taken slightly smaller, but we assume that this gives n / M sorted subarrays. So far the algorithm has produced n / b cache faults. Now we create a heap of size n / M containing the smallest remaining element of each sorted subarray. Because we may assume that n / M < M, the heap fits in the cache. We may even assume that b * n / M < M, so for each subarray the currently interesting section resides in the cache as well. Now we repeatedly perform deletemin and reinsert an element from the subarray from which this element was coming. The whole process produces c * n / b cache faults for some small constant c. As long as the conditions hold, the number of cache faults increases only linearly with n, and not with n * log n as when using quicksort. Implement this algorithm in Java, using the heap implementation which can be downloaded here. In the implementation one should not insert the keys into the heap but the array position from where this value comes. Otherwise each entry in the heap would have to carry around this additional information. Use quicksort for sorting the subarrays. An implementation can be downloaded here. Compare the performance of the modified algorithm with that of quicksort. Test n = 2^k, starting with k = 17 and going as far as possible.

  4. An AVL tree is characterized by the property that for any node, the depth of its left and right subtree differ by at most 1. We want to compute the exact value of the minimum size s_d of an AVL tree of depth d. A minimum AVL tree of depth d is composed of a minimum tree of depth d - 1, a root and a minimum tree of depth d - 2. s_0 = 1. Formulate a recurrence relation giving s_d and solve this recurrence relation.

    Now we define a relaxed AVL tree to be a binary tree in which the depth of left and right subtree may differ by at most 2. Such a tree may have advantages because it requires less balancing. Estimate the minimum size s'_d of a relaxed AVL tree of depth d. Give an upper bound on the depth of a tree with n nodes.

  5. Solve the recurrence relation giving some generalized kind of Fibonacci numbers:
    t_0 = 0,
    t_1 = 0,
    t_2 = 1,
    t_n = t_{n - 1} + t_{n - 2} + t_{n - 3}, for all n > 2.

  6. Solve the recurrence relation
    t_0 = 0,
    t_1 = 5,
    t_n = 2 * t_{n - 1} - t_{n - 2}, for all n > 1.

  7. A very important class of recurrences are those describing the time consumption of algorithms in which a problem of size n is reduced at a cost of b * n^c by a factor a. That is,
    T(1) = 1
    T(n) = T(n / a) + b * n^c, for all n > 1.
    Give the general solution of this recurrence, paying attention to special cases.





Numerical Algorithms

Euclidean Algorithm

The greatest common divisor (gcd) of two numbers is the largest number that divides (in an integer sense) both of them. For example, for 1134 and 308, the gcd equals 14: 1134 = 81 * 14 and 308 = 22 * 14. Because 81 and 22 have no common factors, a larger number than 14 does not divide both 1134 and 308.

A possible way to compute the gcd, is to find a factorization for each of the numbers (1134 = 2 * 3 * 3 * 3 * 3 * 7, 308 = 2 * 2 * 7 * 11) and to take all common factors. This algorithm is correct, but has a terrible complexity: factoring a number n may be as expensive as trying for all of the numbers up to sqrt(n) whether they divide n. What is the complexity of this algorithm? You might answer O(sqrt(n)) (assuming that the numbers are not so large that we must account more than one time unit for a division). This is correct, and sounds not so bad. However, commonly one expresses the cost in relation to the size of the input! How big is the input here? Well, it is not much larger than we need to write down the number n. That is: x = log n bits. In terms of x, the time consumption is exponential: sqrt(n) = sqrt(2^x) = 2^{x / 2}. This sounds much less good. Maybe you think this is only a game of words, and this is right. What matters is how long it takes. But even then, it is no problem to write down a 64 bit number, but (even on a 64-bit machine) it takes quite a while to perform 2^{32} divisions.

The algorithm to efficiently compute the gcd is named after the Greek mathematician Euclid (around 300 B.C.). This does not necessarily mean that he invented it: it may also mean that it appears in his collection of mathematical work, the "Elements". The algorithm is ingenious, but not hard. Let the numbers be called a and b.

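  int gcd(int a, int b) /* assumes a > 0 and b > 0 */
  {
    int c;
    if (a < b) /* rearrange so that a >= b */
    {
      c = a;
      a = b;
      b = c;
    }
    while ((c = a % b) != 0) /* c is the remainder a mod b */
    {
      a = b;
      b = c;
    }
    return b;
  }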

For the numbers 1134 and 308, we get the following sequence:

  a_0 = 1134, b_0 = 308;
  a_1 = 308,  b_1 = 210;
  a_2 = 210,  b_2 =  98;
  a_3 = 98,   b_3 =  14. 
The algorithm stops here because 14 divides 98.

This algorithm certainly deserves a proof of correctness. We start with a lemma:

Lemma: gcd(a - x * b, b) = gcd(a, b).

Proof: First we prove that gcd(a - x * b, b) >= gcd(a, b). Assume that d divides a and b, that is, a = c_1 * d and b = c_2 * d. Then, d also divides a - x * b because a - x * b = (c_1 - x * c_2) * d. Now consider the other direction. So, assume d divides a - x * b and b. That is, a - x * b = c_3 * d and b = c_2 * d. Then d also divides a because a = (a - x * b) + x * b = (c_3 + x * c_2) * d. End.

From this it follows that gcd(b, c) = gcd(c, b) = gcd(a mod b, b) = gcd(a, b). Applying induction shows that the gcd of the two last numbers equals the gcd of the two first numbers. But for the final two a and b, gcd(a, b) = b, because b divides a. A nicer formulation of this argument by induction uses an invariant: at all times during the computation gcd(a, b) is the value we are looking for. Initially this is clear, giving the precondition, during every pass of the loop this property is maintained because of the lemma, making a step, so finally it still holds, giving the postcondition.

For the time consumption we have the following:

Lemma: The algorithm runs in O(log(max(a, b))) time.

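Proof: For proving a lemma like this with a logarithmic time bound the proof almost always goes along one of the following lines:

  1. Show that in every pass of the loop the problem size is reduced by at least a constant factor alpha > 1.

  2. Show that the problem size is reduced by at least a constant factor in every two (or some other constant number of) passes.

  3. Show that some other quantity, which lies close to the problem size, is reduced by at least a constant factor in every pass.

Remember this! Clearly you should first try the first approach, which is the easiest.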

Why does this work? If we know that in every pass of the loop the problem size n is reduced by a factor alpha, then after k passes the size is reduced from its original size to n / alpha^k. We also know that the problem size is an integer. That means that if we know for sure that it is smaller than 1, then it must be 0, and the loop will have terminated. For which k do we have n / alpha^k < 1? Precisely when alpha^k > n, that is, for k > log n / log alpha = log_alpha n. For constant alpha > 1, log alpha > 0, and thus log n / log alpha = O(log n). If the reduction by a factor is not achieved in one pass but in two (or some other constant number of) rounds, then the total needed number of passes is multiplied by two (or the other constant), still giving O(log n) rounds in total.

In order to be able to give a good proof, we need some extra notation. Whenever you are trying to give a proof of this kind, you should do the same: denote the value of a after k passes through the loop by a_k, a_0 being the original value of a. Define b_k analogously.

Let us now try to apply all these remarks to the problem at hand. So, we try to prove that the maximum, after the initial rearrangement this is a, is substantially reduced in every round. However, if b_k lies close to a_k, for example, if b_k = a_k - 1, then this does not happen: in that case a_{k + 1} = a_k - 1.

As an alternative one might try to prove that a is reduced every second round. This we can prove. We need a case distinction. The first case is the easiest: if b_k <= a_k / 2, then clearly a_{k + 2} < a_{k + 1} = b_k <= a_k / 2. Otherwise, if b_k >= a_k / 2, then a_{k + 2} = b_{k + 1} = a_k - b_k <= a_k / 2. So, in both cases we get a reduction by a factor 2 in two rounds.

Case distinction is a general proof technique. The most common is that there are two cases: small values and large values. Often for both extreme cases it is easy, but without an extra assumption the proof cannot be completed.

The lemma can also be proven using the third idea: instead of considering the maximum value, which does not always decrease sufficiently fast from round to round, we can also consider the sum of a and b. Clearly max(a, b) < a + b, so if the sum has become small, then the maximum is also small. At the same time the sum is at most twice as large as the maximum, so it lies so close to it that the estimate we hope to get will still be useful for the maximum as well.

Again we apply a case distinction. This time the cases are b_k <= 2/3 * a_k and b_k >= 2/3 * a_k. If b_k <= 2/3 * a_k, then a_k + b_k >= 5/2 * b_k. At the same time, a_{k + 1} + b_{k + 1} < 2 * b_k, so a reduction by a factor alpha_1 = 5/4. If b_k >= 2/3 * a_k, we use that a_k + b_k > 2 * b_k. At the same time a_{k + 1} + b_{k + 1} <= b_k + a_k - b_k = a_k <= 3/2 * b_k, so a reduction by a factor alpha_2 = 4/3. We conclude that for all a_k and b_k, a_k + b_k is at least reduced by a factor alpha >= 5/4. End.

Thus, even when considering the size of the input, this is only linear and not exponential, a formidable difference.
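The bound from the proof can also be checked experimentally. The C sketch below counts the passes through the loop and compares them with the smallest k for which (5/4)^k >= a + b, which by the argument above is an upper bound on the number of passes. The function names gcd_passes and pass_bound are chosen here for illustration.

```c
#include <assert.h>

/* Number of passes through the loop of the Euclidean algorithm. */
int gcd_passes(long long a, long long b)
{
    assert(a > 0 && b > 0);
    long long c;
    int passes = 0;
    if (a < b) { /* initial rearrangement, not counted as a pass */
        c = a;
        a = b;
        b = c;
    }
    while ((c = a % b) != 0) {
        a = b;
        b = c;
        passes++;
    }
    return passes;
}

/* Smallest k with (5/4)^k >= s. Each pass reduces a + b by a factor
   of at least 5/4, so this bounds the number of passes for s = a + b. */
int pass_bound(long long s)
{
    int k = 0;
    double v = 1.0;
    while (v < (double)s) {
        v *= 1.25;
        k++;
    }
    return k;
}
```

For 1134 and 308 the loop body is executed three times, far below the bound, which shows that the constant 5/4 from the proof is rather pessimistic.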

It is even true, that for any pair of positive integers a and b, there are numbers x and y (one of them negative) so that x * a + y * b = gcd(a, b). These numbers can be computed by tracing back the Euclidean algorithm. Let a_k and b_k be the values of a and b after k applications of the loop in the algorithm. Let x_k and y_k be the numbers so that x_k * a_k + y_k * b_k = gcd(a_k, b_k) = gcd(a, b). We may inductively assume that these x_k and y_k exist, because at the end, where b_k divides a_k, we can take x_k = 0 and y_k = 1. So, assuming that we know x_k and y_k, we will now compute x_{k - 1} and y_{k - 1}. a_k = b_{k - 1} and b_k = a_{k - 1} mod b_{k - 1} = a_{k - 1} - d_{k - 1} * b_{k - 1}, where d_{k - 1} = a_{k - 1} / b_{k - 1} (integer division). Substituting these expressions for a_k and b_k gives

  gcd(a, b) = x_k * a_k + y_k * b_k 
            = x_k * b_{k - 1} + y_k * (a_{k - 1} - d_{k - 1} * b_{k - 1}) 
            = y_k * a_{k - 1} + (x_k - y_k * d_{k - 1}) * b_{k - 1}. 
Thus, in order that x_{k - 1} * a_{k - 1} + y_{k - 1} * b_{k - 1} = gcd(a, b), we should take
  x_{k - 1} = y_k,
  y_{k - 1} = x_k - y_k * d_{k - 1}. 

Lemma: Numbers x and y so that x * a + y * b = gcd(a, b) can be computed in O(log(max(a, b))) time.

Proof: The above construction shows how to compute the numbers x_k and y_k so that x_k * a_k + y_k * b_k = gcd(a, b), for all k >= 0. a_0 = a and b_0 = b. So, we can take x = x_0 and y = y_0. Each step of the computation takes constant time, the number of steps is the same as in the Euclidean algorithm, which was proven to be logarithmic. End.

For the numbers 1134 and 308, we had the following sequence:
  a_0 = 1134, b_0 = 308, d_0 = 3;
  a_1 =  308, b_1 = 210, d_1 = 1;
  a_2 =  210, b_2 =  98, d_2 = 2;
  a_3 =   98, b_3 =  14, d_3 = 7. 
Applying the above rules gives the following sequence of x_k and y_k:
  x_3 =    0, y_3               =   1;
  x_2 =    1, y_2 =   0 - 1 * 2 =  -2;
  x_1 =   -2, y_1 =   1 + 2 * 1 =   3;
  x_0 =    3, y_0 =  -2 - 3 * 3 = -11. 
Indeed 3 * 1134 - 11 * 308 = 14, as it should be. Notice that (3 + k * 308) * 1134 - (11 + k * 1134) * 308 = 14, for all k, showing that the computed values of x = x_0 and y = y_0 are not unique.
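The tracing back can be expressed compactly with recursion: the recursive call handles the pair (a_k, b_k), and on return the rules x_{k - 1} = y_k and y_{k - 1} = x_k - y_k * d_{k - 1} are applied. The following C function is a minimal sketch of this idea; the name ext_gcd and the use of output pointers are choices made here, and positive inputs are assumed.

```c
#include <assert.h>

/* Returns gcd(a, b) and stores in *x and *y numbers with
   (*x) * a + (*y) * b == gcd(a, b). Assumes a > 0 and b > 0. */
long long ext_gcd(long long a, long long b, long long *x, long long *y)
{
    assert(a > 0 && b > 0);
    if (a % b == 0) { /* final pair: b divides a, take x = 0, y = 1 */
        *x = 0;
        *y = 1;
        return b;
    }
    long long xk, yk;
    /* next pair: a_k = b_{k-1}, b_k = a_{k-1} mod b_{k-1} */
    long long g = ext_gcd(b, a % b, &xk, &yk);
    *x = yk;                /* x_{k-1} = y_k */
    *y = xk - yk * (a / b); /* y_{k-1} = x_k - y_k * d_{k-1} */
    return g;
}
```

Called with a = 1134 and b = 308 this returns 14 with x = 3 and y = -11, the values derived above. If a < b, the first call has d = 0 and merely swaps the roles of the two numbers.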

Integer Operations

Multiplication

Assume we are writing a library for handling arbitrarily large numbers. Because the arithmetic operations (addition, subtraction, multiplication, division, comparison, etc.) are built in only for numbers with 32 or 64 bits, these must be programmed by the user. Addition and subtraction are rather straightforward: applying the elementary methods taught at primary school leads to algorithms running in O(n) time for operations on two n-digit numbers.

Multiplication is much more interesting. The school method is correct. Let us consider it with an example (assuming that our computer can only handle one digit at a time). Then

  83415 * 61298 =
    6 * 83415 shifted left 4 positions +
    1 * 83415 shifted left 3 positions +
    2 * 83415 shifted left 2 positions +
    9 * 83415 shifted left 1 positions +
    8 * 83415 shifted left 0 positions

How long does this take? When multiplying two n-digit numbers, there are n multiplications of a 1-digit number with an n-digit number, n shifts and n additions. Each such operation takes O(n) time, thus the total time consumption can be bounded by 3 * n * O(n) = O(n^2). Clever tricks may reduce the time in practice quite a bit, but this algorithm appears to really need Omega(n^2). This quadratic complexity is precisely the reason that it is so tedious to multiply two 4-digit numbers. There is an alternative method. It is a pearl of computer science, surprisingly simple and, for sufficiently long numbers, considerably faster, even in practice.
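To make the school method concrete, here is a minimal C sketch working on arrays of decimal digits. The representation (least significant digit first, one digit per int) and the name school_multiply are assumptions made for this illustration; a real library would use a much larger radix. The two nested loops show directly where the quadratic running time comes from.

```c
#include <assert.h>
#include <stddef.h>

/* Multiplies x (nx digits) by y (ny digits). Digits are stored least
   significant first; out must have room for nx + ny digits. */
void school_multiply(const int *x, size_t nx,
                     const int *y, size_t ny, int *out)
{
    assert(nx > 0 && ny > 0);
    for (size_t i = 0; i < nx + ny; i++)
        out[i] = 0;
    for (size_t j = 0; j < ny; j++) {
        /* add y[j] * x, shifted left j positions, to the result */
        int carry = 0;
        for (size_t i = 0; i < nx; i++) {
            int t = out[i + j] + x[i] * y[j] + carry;
            out[i + j] = t % 10;
            carry = t / 10;
        }
        out[j + nx] = carry; /* this position has not been used before */
    }
}
```

For the example above one would pass the digits of 83415 and 61298 in reversed order and an output array of length 10.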

Assume we are multiplying two n-digit numbers, for some even n (one can always add a leading dummy digit with value 0 to achieve this). Let m = n / 2. The following description is for decimal numbers, but can easily be generalized to any radix (in practice it is efficient to work with a radix of 2^16 or 2^32). Let the numbers be x = x_1 * 10^m + x_0 and y = y_1 * 10^m + y_0. That is, x_1 and y_1 are the numbers composed of the leading m digits, while x_0 and y_0 are the numbers composed of the trailing m digits. So far this is just an alternative writing, nothing deep. A correct way to write the product is now

  x * y = (x_1 * 10^m + x_0) * (y_1 * 10^m + y_0)
        = x_1 * y_1 * 10^{2 * m} + x_1 * y_0 * 10^m + x_0 * y_1 * 10^m + x_0 * y_0.

This formula suggests the following recursive algorithm:

  superlong prod(superlong x, superlong y, int n) {
    /* add(x, y) adds x to y,
       shift(x, n) shifts x leftwards n positions */

    if (n == 1)
      return x * y; /* Product of ints */

    if (n is odd)
      add a leading 0 to x and y and increase n by 1;

    compute x_1, x_0, y_1, y_0 from x and y;

    xy_11 = prod(x_1, y_1, n / 2);
    xy_10 = prod(x_1, y_0, n / 2);
    xy_01 = prod(x_0, y_1, n / 2);
    xy_00 = prod(x_0, y_0, n / 2);

    xy = xy_00;
    xy = add(xy, shift(xy_01, n / 2));
    xy = add(xy, shift(xy_10, n / 2));
    xy = add(xy, shift(xy_11, n)); 

    return xy; }

How long does this take? Is it faster than before? Let us look what happens. Instead of one multiplication of two numbers of length n, we now have 4 multiplications of numbers of length m = n / 2 plus 3 shifts plus 3 additions. The additions and the shifts take time linear in n. So, all together, the second part takes linear time. That is, there is a c, so that the running time for this part is bounded by c * n. The first part is formulated recursively. So, it makes sense to formulate the time consumption as a recurrence relation:

  T_prod(n) = 4 * T_prod(n / 2) + c * n
  T_prod(1) = 1

To solve recurrence relations it often helps to try a few values in order to get an idea:

  T(1)  = 1
  T(2)  = 4 *    1 + 7 *  2 =    18
  T(4)  = 4 *   18 + 7 *  4 =   100
  T(8)  = 4 *  100 + 7 *  8 =   456
  T(16) = 4 *  456 + 7 * 16 =  1936
  T(32) = 4 * 1936 + 7 * 32 =  7968
  T(64) = 4 * 7968 + 7 * 64 = 32320

Here we assumed c = 7 (estimating 1 * n for each linear time operation, counting the construction of the numbers x_0, x_1, y_0 and y_1 as one operation). Quite soon one starts to notice that actually this additional term c * n does not matter a lot. The main development is determined by the factor 4. Which function returns a four times larger value when taking twice as large an argument? How about n^2? Indeed, this algorithm has running time O(n^2).

Let us try to prove this, taking T(n) = d * n^2, and see whether it works, and for which value of d. Our estimate should be exact or an overestimate, so substitution should give:

  d * n^2 >= d * 4 * (n / 2)^2 + c * n.
This does not work! There is no choice for d for which this relation is true. Bad luck, we apparently estimated T(n) wrong. Still we feel that the quadratic development is essentially correct. This is not such an easy case. The idea that works is to take T(n) = d * n^2 - e * n. Substitution then gives:
  d * n^2 - e * n >= d * 4 * (n / 2)^2 - 4 * e * n / 2 + c * n.
So, we can take e = c, and d so that d * n^2 - e * n >= 1 for n = 1. That is, we can take d = c + 1. The above row of values is thus given by T(n) = 8 * n^2 - 7 * n.

Now we have found a good estimate for the time consumption, but what is this all about? Where is the gain? Performing the multiplication this way, there is no gain. However, we can also do the following:

  superlong prod(superlong x, superlong y, int n) {
    /* add(x, y) adds x to y,
       shift(x, n) shifts x leftwards n positions */

    if (n == 1)
      return x * y; /* Product of ints */

    if (n is odd)
      add a leading 0 to x and y and increase n by 1;

    compute x_1, x_0, y_1, y_0 from x and y;

    xy_11 = prod(x_1, y_1, n / 2);
    x_sum = add(x_1, x_0);
    y_sum = add(y_1, y_0);
    xy_ss = prod(x_sum, y_sum, n / 2); /* x_sum and y_sum may have n / 2 + 1 digits; this detail is ignored here */
    xy_00 = prod(x_0, y_0, n / 2);

    xy = xy_00;
    xy = add(xy, shift(xy_ss, n / 2));
    xy = subtract(xy, shift(xy_00, n / 2));
    xy = subtract(xy, shift(xy_11, n / 2));
    xy = add(xy, shift(xy_11, n)); 

    return xy; }

So, we compute x * y as x_0 * y_0 + (x_1 + x_0) * (y_1 + y_0) * 10^m - x_0 * y_0 * 10^m - x_1 * y_1 * 10^m + x_1 * y_1 * 10^n, which is just right. Clever or not? Let us write the time expression again.

  T_prod(n) = 3 * T_prod(n / 2) + c * n
  T_prod(1) = 1
Here c is somewhat larger than before. Estimating the cost of all linear-time operations as before gives c = 11. What matters much more is that now there are only three calls of the form prod(..., n / 2), giving 3 * T_prod(n / 2) instead of 4 * T_prod(n / 2).
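The three-product scheme can be tried out on ordinary machine integers. The following C sketch is an illustration only and deviates from the pseudocode in several assumed details: it works on 64-bit unsigned integers below 10^9 instead of a superlong type, it derives the number of digits from the values instead of passing n, and it stops the recursion as soon as one operand has a single digit. The name prod3 is chosen here.

```c
#include <assert.h>
#include <stdint.h>

/* Number of decimal digits of v (1 for v == 0). */
int num_digits(uint64_t v)
{
    int n = 1;
    while (v >= 10) {
        v /= 10;
        n++;
    }
    return n;
}

/* Product of x and y using three recursive products per level. */
uint64_t prod3(uint64_t x, uint64_t y)
{
    assert(x < 1000000000ULL && y < 1000000000ULL); /* result fits in 64 bits */
    if (x < 10 || y < 10)
        return x * y; /* one-digit operand: product of ints */

    int nx = num_digits(x), ny = num_digits(y);
    int m = (nx > ny ? nx : ny) / 2;
    uint64_t p = 1; /* p = 10^m */
    for (int i = 0; i < m; i++)
        p *= 10;

    uint64_t x_1 = x / p, x_0 = x % p;
    uint64_t y_1 = y / p, y_0 = y % p;

    uint64_t xy_11 = prod3(x_1, y_1);
    uint64_t xy_00 = prod3(x_0, y_0);
    uint64_t xy_ss = prod3(x_1 + x_0, y_1 + y_0);

    /* xy_ss - xy_11 - xy_00 equals x_1 * y_0 + x_0 * y_1 */
    return xy_11 * p * p + (xy_ss - xy_11 - xy_00) * p + xy_00;
}
```

The recursion terminates because in every recursive call the sum of the two arguments is strictly smaller than x + y. For such small numbers the method brings no gain, of course; the sketch only serves to check the formula.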

Let us look at a few numbers again:

  T(1)  = 1
  T(2)  = 3 *    1 + 11 *  2 =    25
  T(4)  = 3 *   25 + 11 *  4 =   119
  T(8)  = 3 *  119 + 11 *  8 =   445
  T(16) = 3 *  445 + 11 * 16 =  1511
  T(32) = 3 * 1511 + 11 * 32 =  4885
  T(64) = 3 * 4885 + 11 * 64 = 15359

Again the development is dominated by the multiplication, so essentially, when doubling n, the time is multiplied by three. That is just what happens for the function n^{log_2 3}, and indeed it can be shown that the solution to the recurrence relation is given by:
  T_prod(n) = O(n^{log_2 3}) = O(n^{1.58...}).

This can be proven as before. We have become clever, so we guess we should take T(n) = d * n^x - e * n, for x = log_2 3, and some constants d and e to determine. Substituting gives that we must have:

  d * n^x - e * n = d * n^x - 3 * e * n / 2 + c * n.
So, now e = 2 * c. T(1) = 1 again gives d = e + 1. For c = 11, this would give T(n) = 23 * n^1.58 - 22 * n. Even though the leading constant is three times larger, n^1.58 is so much smaller than n^2 for large n, that this way of multiplication is considerably faster than the conventional one.

Division

It is interesting to consider how hard division is. The first question is whether it is substantially harder than multiplication or not. The answer is no: division is only marginally harder. Assume we want to compute r = p / q. Then, most methods first compute 1 / q and then multiply by p. In this way division has been reduced to computing reciprocals. The most common way of computing reciprocals is with Newton iteration.

Newton iteration is a general method for computing a functional inverse of a function. For a function f, we want to find a value x so that f(x) = y. We know from calculus that for well-behaved functions and sufficiently small d, f(x) ~= f(x + d) - d * f'(x + d). Said otherwise: the deviation d from x can be estimated as

d ~= (f(x + d) - f(x)) / f'(x + d).
Starting with some reasonable estimate x_0, we can then find a sequence of values x_i approximating the value x better and better by setting
x_{i + 1} = x_i - (f(x_i) - y) / f'(x_i).

In our case we choose f(x) = q * x and y = 1. This gives

x_{i + 1} = x_i - (q * x_i - 1) / q.
Now we should not make the mistake to simplify this (giving us an accurate but unsolvable relation), but rather we should replace 1 / q by the approximation x_i. Then we get
x_{i + 1} = x_i - q * x_i^2 + x_i = 2 * x_i - q * x_i^2.

This is really quite effective, because if we assume that f(x_i) = q * x_i = 1 + s, then f(x_{i + 1}) = q * x_{i + 1} = 2 * (1 + s) - (1 + s)^2 = 1 - s^2. So, the deviation of f(x_{i + 1}) from 1 is quadratically smaller than the deviation of f(x_i). For the x_i this means: the number of correct positions doubles in each iteration.

If we want to solve the integer division p / q, then it is sufficient to achieve a precision proportional to the number of bits of p: the further positions will be rounded off anyway. Because the number of correct positions doubles in each iteration, O(loglog p) iterations are sufficient, provided that the first estimate is reasonable. A good value is x_0 = 1 / q', where q' is the smallest power of 2 that is not smaller than q: then 1/2 < q * x_0 <= 1, so the initial deviation s is smaller than 1/2 in absolute value. Each iteration requires the computation of two products involving numbers of size at most p (we may assume that q <= p). Thus, the complete division takes O(loglog p) times as much time as computing a product. Using the product method above, this gives O(log n * n^{log_2 3}), where n = log p.
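The iteration x_{i + 1} = 2 * x_i - q * x_i^2 is easy to try out in floating point. The C sketch below is an illustration, not a multi-precision implementation. As an assumption of this sketch, it starts from x_0 = 1 / q' with q' the smallest power of two that is not smaller than q, so that the initial deviation of q * x_0 from 1 is below 1/2. The name reciprocal is chosen here.

```c
#include <assert.h>

/* Approximates 1 / q with the given number of Newton iterations,
   using only multiplications and subtractions in the loop. */
double reciprocal(double q, int iterations)
{
    assert(q >= 1.0 && iterations >= 0);
    double x = 1.0;  /* becomes x_0 = 1 / q' */
    double pw = 1.0; /* becomes q', a power of two */
    while (pw < q) {
        pw *= 2.0;
        x *= 0.5;
    }
    for (int i = 0; i < iterations; i++)
        x = 2.0 * x - q * x * x;
    return x;
}
```

For q = 7 the deviation of q * x_i from 1 develops as 1/8, 1/64, 1/4096, ..., so a handful of iterations exhausts the precision of a double.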

In a cryptography application, we may have n = 1000. In that case, this division costs about 25 times more than multiplication. Long division can be made quite efficient but essentially requires O(n^2) time. For n = 1000 this is not so much worse than the above.

Newton iteration is also very effective for computing inverses of other functions. The nicest example is the square root. That is, we want to compute a value x so that f(x) = x^2 = y. This gives

x_{i + 1} = x_i - (x_i^2 - y) / (2 * x_i) = 1/2 * (x_i + y / x_i).
For example, when computing sqrt(100) and starting with x_0 = 1, we find x_1 = 50.5, x_2 = 26.2, x_3 = 15.0, x_4 = 10.8, x_5 = 10.03, x_6 = 10.00005.

As long as we are far away, the deviation is more or less halved in each step. That is not very good. However, as soon as the deviation is small it rapidly converges to 0. This can be quantified: if x_i = sqrt(y) * (1 + d), then x_{i + 1} = sqrt(y) * (1 + d^2 / (2 * (1 + d))). So, the next relative deviation d' = d^2 / (2 * (1 + d)).

Therefore, it is important to start with a reasonable first guess of the ultimate value. Let y' be the largest power of two smaller than y. If y' = 2^{2 * k}, it is a good idea to start with x_0 = 2^k. If y' = 2^{2 * k + 1}, we can take x_0 = 2^{k + 1}. For y = 100, y' = 64 = 2^6, so we would take x_0 = 2^3 = 8. The further sequence then becomes x_1 = 10.25, x_2 = 10.003, x_3 = 10.0000005.
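The square root iteration with a starting value of this kind can be sketched in C as follows. This is a floating-point illustration with assumed details: the name newton_sqrt is chosen here, y >= 1 is required, and the power of two y' is taken as the largest one not exceeding y, which gives the same x_0 = 8 for y = 100.

```c
#include <assert.h>

/* Approximates sqrt(y) by x_{i+1} = (x_i + y / x_i) / 2. */
double newton_sqrt(double y, int iterations)
{
    assert(y >= 1.0 && iterations >= 0);
    double x = 1.0;  /* becomes x_0 = 2^ceil(e / 2) */
    double pw = 1.0; /* becomes y' = 2^e <= y */
    int e = 0;
    while (pw * 2.0 <= y) {
        pw *= 2.0;
        e++;
        if (e % 2 == 1) /* x doubles whenever e becomes odd */
            x *= 2.0;
    }
    for (int i = 0; i < iterations; i++)
        x = 0.5 * (x + y / x);
    return x;
}
```

With y = 100 the function starts from 8 and returns 10.25 after one iteration, in agreement with the sequence given above.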

Exponentiation

Suppose we want to compute x^n. How do we do this? Clearly the following works:
  // In the following we have as invariant that at the
  // beginning of each pass through the loop c == x^i,
  // so in the end c = x^n.

  for (c = 1, i = 0; i < n; i++)
    c *= x;

Assuming that all the multiplications can be performed in unit time, this algorithm has complexity O(n). However, we can do this much faster! Supposing, for the time being, that n = 2^k, the following is also correct:

  // In the following we have as invariant that at the
  // beginning of each pass through the loop c == x^i,
  // so in the end c = x^n.

  for (c = x, i = 1; i < n; i *= 2)
    c *= c;
Here the number of passes through the loop is equal to the number of times we must double i to reach n. That is exactly k = log_2 n times. This algorithm is of the same type as binary search: there is some notion of repeated halving/doubling, which leads to logarithmic time, whereas doing the operation in a linear way gives linear time.

Now, we consider the general case. Assume that n has binary expansion (b_k, b_{k - 1}, ..., b_1, b_0). Then we can write n = sum_{i = 0 | b_i = 1}^k 2^i. So, x^n = x^{sum_{i = 0 | b_i = 1}^k 2^i} = prod_{i = 0 | b_i = 1}^k x^{2^i}. If we now first perform the above computation and store all intermediate c values in an array of length k, then x^n can be computed from them with at most log n additional multiplications and a similar number of additions. That is, the whole algorithm has running time O(log n). Actually it is not necessary to store the c-values: the final value can also be computed by taking the interesting factors when they are generated. The complete routine may look as follows:

  int exponent_1(int x, int n) 
  {
    int c, z;
    for (c = x, z = 1; n != 0; n = n / 2) 
    {
      if (n & 1) /* n is odd */
        z *= c;
      c *= c; 
    }
    return z; 
  }

It is a good idea to try how the values of z, c and n develop for x = 2 and n = 11.

A slightly different idea works as well. The idea is to start from the top-side: x^99 = x * x^98, x^98 = x^49 * x^49, x^49 = x * x^48, x^48 = x^24 * x^24, x^24 = x^12 * x^12, x^12 = x^6 * x^6, x^6 = x^3 * x^3, x^3 = x * x^2, x^2 = x * x. This idea can be turned into code most easily using recursion:

  int exponent_2(int x, int n) 
  {
    if (n == 0) /* terminal case */
      return 1;
    if (n & 1) /* n is odd */
      return x * exponent_2(x, n - 1);
    return exponent_2(x, n / 2) * exponent_2(x, n / 2); 
  }

As usual, the recursive algorithm is easy to understand and its correctness is obvious, while the iterative algorithm was rather obscure. How about the time consumption? Check what happens for n = 32. Formally the time consumption can be analyzed by writing down a recurrence relation. For numbers n = 2^k for some positive k, the time consumption T(n) is given by

T(1) = c_2,
T(n) = 2 * T(n / 2) + c_1, for all n > 1.
The solution of this is given by T(n) = (c_1 + c_2) * n - c_1. Once this relation has been found somehow, for example by intelligent guessing after trying small values of n, it can be verified using induction. So, define the function f() by f(n) = (c_1 + c_2) * n - c_1. Then T(1) = c_2 = (c_1 + c_2) * 1 - c_1 = f(1). This gives a base case. Now assume the relation holds for some n. Then we get
  T(2 * n)                          =def T()= 
  2 * T(n) + c_1                    =induction assumption=
  2 * f(n) + c_1                    =def f()=
  2 * ((c_1 + c_2) * n - c_1) + c_1 =computation=
  (c_1 + c_2) * (2 * n) - c_1       =def f()=
  f(2 * n).
Thus, assuming the equality for n = 2^k, we can prove it for 2 * n = 2^{k + 1}. Because it also holds for n = 1 = 2^0, it holds for all n which are powers of two.

So, the running time of exponent_2 is at least linear! What went wrong? The problem is that we are recursively splitting one problem of size n in two subproblems of size n / 2. At the bottom of the recursion this inevitably leads to a linear number of subproblems. For other problems this may be inevitable, but here there is an easy solution:

  int exponent_3(int x, int n) 
  {
    int y;
    if (n == 0) /* terminal case */
      return 1;
    if (n & 1) /* n is odd */
      return x * exponent_3(x, n - 1);
    y = exponent_3(x, n / 2);
    return y * y; 
  }

Algorithm exponent_3 performs the same number of multiplications as exponent_1 (the exact analysis is left as an exercise). Nevertheless, even though the difference will not be large, it will be somewhat slower because every recursive step means that the whole state vector must be pushed on the stack.

Let us now assume that the time for the multiplications increases with the size of the number. This is reasonable, because unless the numbers are small, c^n will soon become a very large number. In order to get an easy comparison, we assume that multiplying an n_1-digit number and an n_2-digit number costs O(n_1 * n_2). Under this assumption, the conventional algorithm takes

  sum_{i = 0}^{n - 1} log c * log c * i = O(log^2 c * n^2).
The cost of the improved exponentiation can be estimated as
  sum_{i = 0}^{log n - 1} (log c * 2^i)^2 = O((log c * n)^2).
So, even though we have reduced the number of products to compute from linear to logarithmic, we have not gained much when we look at the time for the whole computation, because the last few products dominate the cost.

If we are computing expo_mod, that is (c^n) mod m, then the situation is much better: generally we have (a * b) mod m = ((a mod m) * (b mod m)) mod m, and thus we can compute modulo after each multiplication: the numbers do not grow beyond the size of m, and therefore, all products cost the same (possibly with the exception of the first few). So, for expo_mod, the reduction of the number of the products gives the performance one would hope to achieve.
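A minimal sketch of expo_mod, following the structure of exponent_1 with a reduction after every product. The signature and the use of 64-bit unsigned integers are assumptions made for this illustration; the products stay within range only if m is positive and smaller than 2^32.

```c
#include <stdint.h>

/* Computes (c^n) mod m by repeated squaring.
   Assumes 0 < m < 2^32, so that every product fits in 64 bits. */
uint64_t expo_mod(uint64_t c, uint64_t n, uint64_t m)
{
    uint64_t z = 1 % m;          /* result so far; 1 % m handles m == 1 */
    c %= m;
    for (; n != 0; n /= 2)
    {
        if (n & 1)               /* n is odd: this power of c is a factor */
            z = (z * c) % m;
        c = (c * c) % m;         /* numbers never grow beyond the size of m */
    }
    return z;
}
```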

Matrix Operations

Multiplication

Matrices are of key importance far beyond linear algebra. Matrices may, for example, also stand for the connections in a graph. Matrix multiplication is the most important operation on matrices; unfortunately, it is an expensive operation.

For two n x n matrices A and B we want to compute C = A * B, defined by C_{i, k} = sum_{j = 0}^{n - 1} A_{i, j} * B_{j, k}. This definition immediately gives a simple algorithm:

  void matrix_product(int** A, int** B, int** C, int n) {
    /* computes C = A * B for n x n matrices */
    int i, j, k;
    for (i = 0; i < n; i++)
      for (k = 0; k < n; k++) {
        C[i][k] = 0;
        for (j = 0; j < n; j++)
          C[i][k] += A[i][j] * B[j][k]; } }
The complexity is clear: there is a threefold loop, each running from 0 to n - 1, so the inner loop is performed n^3 times, giving a total complexity of Theta(n^3).

Can this be done faster? The idea has the same spirit as that for the faster multiplication, a very important idea indeed:

If in a recursive algorithm the number of recursive calls is reduced at the expense of an increase of the cost of the non-recursive part, then even though the number of recursive calls is reduced only by a constant factor, this may lead to an asymptotical improvement of the time of the whole algorithm.

This idea can be applied to matrix multiplication, leading to Strassen's matrix multiplication algorithm with running time O(n^{log_2 7}), where log_2 7 ~= 2.81. The improvement is significant but not so spectacular as for multiplication. The spectacular thing is that it was possible to come below Theta(n^3) at all, which one would easily believe to be the "intrinsic complexity" of the problem. From this we should also learn how careful we have to be when stating that a problem "obviously cannot be solved faster than ...".

As for normal multiplication, matrix multiplication can be rewritten recursively. For matrix multiplication this is quite a natural thing to do (and a good idea when one considers the memory access). For an n x n matrix A, let A_00, A_01, A_10, A_11 be the m x m (as before m = n / 2, where we assume that n is even) submatrices arranged as in

   A_00 A_01
   A_10 A_11
Define B_00, ..., B_11 and C_00, ..., C_11, analogously. Then the following is correct:
  int** matrix_prod(int** A, int** B, int n) {
    /* submatrices as defined above,
       n the side length of the matrices
       add(A, B, n) adds two n x n matrices together */

    if (n == 1)
      multiply A and B (they are single ints!) and return the result.

    M_1 = matrix_prod(A_00, B_00, n / 2);
    M_2 = matrix_prod(A_00, B_01, n / 2);
    M_3 = matrix_prod(A_01, B_10, n / 2);
    M_4 = matrix_prod(A_01, B_11, n / 2);
    M_5 = matrix_prod(A_10, B_00, n / 2);
    M_6 = matrix_prod(A_10, B_01, n / 2);
    M_7 = matrix_prod(A_11, B_10, n / 2);
    M_8 = matrix_prod(A_11, B_11, n / 2);

    C_00 = add(M_1, M_3);
    C_01 = add(M_2, M_4);
    C_10 = add(M_5, M_7);
    C_11 = add(M_6, M_8);

    return C; }

As usual, the time consumption of a recursive algorithm is most easily expressed in the form of a recurrence relation:

  T(1) = 1
  T(n) = 8 * T(n / 2) + c * n^2
The additional term is due to the 4 additions, we could take c = 4. Of course this only holds for n = 2^k for some k. Trying some values gives
  T(1)  = 1
  T(2)  = 8 *      1 + 4 *    4 =     24
  T(4)  = 8 *     24 + 4 *   16 =    256
  T(8)  = 8 *    256 + 4 *   64 =   2304
  T(16) = 8 *   2304 + 4 *  256 =  19456
  T(32) = 8 *  19456 + 4 * 1024 = 159744
Again, the additional term does not really matter, the real development is determined by the recursive part: doubling n makes it eight times more expensive. That is O(n^3). The precise solution of this recurrence relation can be found as before, it is T(n) = 5 * n^3 - 4 * n^2. In practice it is much better to stop the recursion when n = 32 or 64. For these n the cubic term is already dominating, and the time comes much closer to 1 * n^3, as in the simple algorithm.

Just as the first recursive multiplication algorithm, this gives no real improvement. However, now we can start to play around. Strassen did, and found the following alternative algorithm:

  int** matrix_prod(int** A, int** B, int n) {
    /* submatrices as defined above,
       n the side length of the matrices
       + and - are used for sums and differences of matrices. */

    if (n == 1)
      multiply A and B (they are single ints!) and return the result.

    M_1 = matrix_prod(A_01 - A_11, B_10 + B_11, n / 2);
    M_2 = matrix_prod(A_00 + A_11, B_00 + B_11, n / 2);
    M_3 = matrix_prod(A_00 - A_10, B_00 + B_01, n / 2);
    M_4 = matrix_prod(A_00 + A_01, B_11,        n / 2);
    M_5 = matrix_prod(A_00,        B_01 - B_11, n / 2);
    M_6 = matrix_prod(A_11,        B_10 - B_00, n / 2);
    M_7 = matrix_prod(A_10 + A_11, B_00,        n / 2);

    C_00 = M_1 + M_2 - M_4 + M_6;
    C_01 = M_4 + M_5;
    C_10 = M_6 + M_7;
    C_11 = M_2 - M_3 + M_5 - M_7;

    return C; }
You can easily (but tediously) check that this is correct. This time the recurrence relation looks like
  T(1) = 1
  T(n) = 7 * T(n / 2) + c * n^2
The additional term is due to the 18 additions and subtractions, we could take c = 18. Of course this only holds for n = 2^k for some k. Trying some values gives
  T(1)  = 1
  T(2)  = 7 *      1 + 18 *    4 =     79
  T(4)  = 7 *     79 + 18 *   16 =    841
  T(8)  = 7 *    841 + 18 *   64 =   7039
  T(16) = 7 *   7039 + 18 *  256 =  53881
  T(32) = 7 *  53881 + 18 * 1024 = 395599
The solution is as said: O(n^{log_2 7}). Because log_2 7 ~= 2.81 is not so much smaller than 3, and because the constant factors are considerably larger this time, this algorithm is better than the other one only for quite large (but still realistic) values of n. Even here everything becomes much better when terminating the recursion for a somewhat larger value of n, for example n = 64.
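The two recurrences can be compared numerically with a few lines of C. This is a sketch of the cost model only, not of the multiplication itself; the function names are made up, and the constants c = 4 and c = 18 are the ones suggested in the text.

```c
/* Cost recurrence of the recursive algorithm with 8 subproducts:
   T(1) = 1, T(n) = 8 * T(n / 2) + 4 * n^2. n must be a power of two. */
long long t_eight_products(long long n)
{
    return n == 1 ? 1 : 8 * t_eight_products(n / 2) + 4 * n * n;
}

/* Cost recurrence of Strassen's algorithm with 7 subproducts:
   T(1) = 1, T(n) = 7 * T(n / 2) + 18 * n^2. n must be a power of two. */
long long t_strassen(long long n)
{
    return n == 1 ? 1 : 7 * t_strassen(n / 2) + 18 * n * n;
}
```

Under this simple operation count, and with the recursion going all the way down to n = 1, the Strassen recurrence is still the larger one for n = 4096 and becomes the smaller one for n = 8192. Real running times also depend on memory access and on where the recursion is terminated.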

Inversion

In the case of integers, once we had found a faster algorithm for multiplication, it was a natural question whether division has the same complexity. In that case, the answer was that it is at worst a logarithmic factor slower. Analogously, it is interesting to consider the complexity of matrix inversion. In this section we will show that Inv(n) = Theta(Mat(n)), where Inv(n) denotes the time for inverting an n x n matrix and Mat(n) denotes the time for computing the product of two n x n matrices. However, we start by considering several classical methods for computing matrix inverses.

Determinant-Based Matrix Inversion

The method which works easiest when performing computations by hand, is the method based on determinants. The determinant det(A) of an n x n matrix A is defined recursively by
det(A) = sum_{i = 0}^{n - 1} (-1)^{i + j} * a_{i, j} * det(A_{i, j}).
Here a_{i, j} denotes the entry of A at position (i, j) and A_{i, j} denotes the submatrix of A obtained by scratching out row i and column j. In the above case we say that the determinant is computed by developing along column j, 0 <= j < n. The determinant can also be computed by developing along a row. If the matrix has zeroes, then one way of developing may be more efficient than another.

The nice thing about determinants is that they can be computed by a trivial recursive program:

  int determinant(int n, int[][] a)
  {
    if (n == 1)
      return a[0][0];
    
    int det = 0;
    int sign = 1;
    int[][] b = new int[n - 1][n - 1];
    for (int i = 0; i < n; i++)
    {
      for (int j = 0; j < i; j++)
        for (int k = 1; k < n; k++)
          b[j][k - 1] = a[j][k];
      for (int j = i + 1; j < n; j++)
        for (int k = 1; k < n; k++)
          b[j - 1][k - 1] = a[j][k];
      det += sign * a[i][0] * determinant(n - 1, b);
      sign *= -1;
    }
    return det;
  }
The time complexity of this is easy to determine:
T(1) = 1,
T(n) = n * T(n - 1), for all n > 1.
The solution of this is T(n) = n!, which is very bad: Stirling's formula gives n! ~= sqrt(2 * pi * n) * (n / e)^n.

Using a method which we will later encounter under the name of dynamic programming, we can improve the computation of determinants considerably. The general idea of dynamic programming is to work with some kind of table in which already computed results are stored. As usual there are two ways of organizing this: top-down (following the recursive algorithm) and bottom-up (unwrapping the recursion). A rudimentary application of this idea we already saw in the efficient recursive computation of the exponent, where the second recursive call was saved by using the already computed result.

In this case, the bottom-up method is easiest and most efficient: first compute all needed 2 x 2 determinants, then all needed 3 x 3 determinants, and so on. There are (n over 2) 2 x 2 determinants to compute, and more generally, there are (n over k) k x k determinants to compute. Given all necessary (k - 1) x (k - 1) determinants, a k x k determinant can be computed in O(k) time. So, in total this computation takes O(sum_{k = 2}^n k * (n over k)) = O(n * sum_{k = 0}^n (n over k)) = O(n * 2^n) time. This is still exponential in n and not very good, but incomparably better than the primitive method: there are many kinds of exponential. 20! ~= 2.4 * 10^18, 20 * 2^20 ~= 2.1 * 10^7. The first cannot be computed in reasonable time, the second is no problem at all.
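The bottom-up scheme can be sketched in C as follows. This is an illustration under assumptions that are not in the text: the matrix is stored row by row in a flat array, the name det_subsets is made up, and the development runs along rows instead of columns, so that the table is indexed by the set of columns that are still present. The bound O(n * 2^n) is the same.

```c
#include <stdlib.h>

/* Determinant of the n x n matrix a (row-major, n * n entries), computed
   bottom-up. table[mask] holds the determinant of the k x k submatrix
   formed by rows 0 .. k - 1 and the k columns whose bits are set in mask.
   Needs 2^n table entries, so n should stay small (say n <= 20). */
long long det_subsets(int n, const long long *a)
{
    size_t count = (size_t)1 << n;
    long long *table = malloc(count * sizeof *table);
    if (table == NULL)
        abort();

    table[0] = 1;                           /* empty determinant */
    for (size_t mask = 1; mask < count; mask++)
    {
        int k = 0;                          /* number of columns in mask */
        for (size_t m = mask; m != 0; m &= m - 1)
            k++;

        int row = k - 1;                    /* develop along the last row */
        int pos = 0;                        /* position of col within mask */
        long long sum = 0;
        for (int col = 0; col < n; col++)
        {
            if (((mask >> col) & 1) == 0)
                continue;
            long long sign = ((row + pos) & 1) ? -1 : 1;
            sum += sign * a[(size_t)row * n + col]
                        * table[mask ^ ((size_t)1 << col)];
            pos++;
        }
        table[mask] = sum;
    }

    long long result = table[count - 1];
    free(table);
    return result;
}
```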

Based on determinants, the inverse of a matrix can be computed using Cramer's rule. Suppose A is invertible and let B = (b_{i, j}) be a matrix defined by

b_{i, j} = (-1)^{i + j} * det(A_{i, j}) / det(A).
Then A^{-1} = B^T, where B^T is the transpose of B, that is, at position (i, j) of B^T we find b_{j, i}. So, along these lines we do not only need det(A), but even det(A_{i, j}) for all i and j. The above guarantees that all these can be computed in O(n^2 * 2^n) time. The rest of the computation takes negligible time.

LU-Decomposition

Inverting a matrix with determinants is easy to program. For small n, this may indeed be a handy and even efficient approach (possibly it offers a good base case in an implementation of the recursive algorithm of the next section). However, for larger n, the exponential complexity is unacceptable. We now show that an n x n matrix can be inverted in O(n^3) time.

The idea is to rewrite A as a product of several matrices which each can be inverted easily. More precisely, finally we will have P * A = L * U. Here P is a permutation matrix, that is, a matrix with exactly one 1 in each row and column. L is a lower-triangular matrix with only ones on the diagonal and U is an upper-triangular matrix with non-zero elements on the diagonal.

L is constructed column by column and U row by row. We use additional matrices Ă^(k). Ă^(k) has zeroes in all positions of row and column 0, ..., k. Ă^(-1) = A. For k >= 0, these matrices are constructed consecutively by determining vectors l^(k) and u^(k) with zeroes in positions 0, ..., k - 1, so that

Ă^(k) = Ă^(k - 1) - l^(k) * u^(k)^T.

The construction for k = 0 goes as follows:

          (     1     )          (a_00)
          (a_10 / a_00)          (a_01)
          (a_20 / a_00)          (a_02)
  l^(0) = (a_30 / a_00)  u^(0) = (a_03)
          (     .     )          ( .  )
          (     .     )          ( .  )
          (     .     )          ( .  )

           (0          0                   0          . . .)
           (0 a_11-a_10*a_01/a_00 a_12-a_10*a_02/a_00 . . .)
   Ă^(0) = (0 a_21-a_20*a_01/a_00 a_22-a_20*a_02/a_00 . . .)
           (0 a_31-a_30*a_01/a_00 a_32-a_30*a_02/a_00 . . .)
           (.     .    .                   .          . . .)
           (.     .    .                   .          . . .)
           (.     .    .                   .          . . .)
Finally, tracing back all reductions, we have
A = sum_{k = 0}^{n - 1} l^(k) * u^(k)^T = L * U.
Here L is the matrix obtained by putting the l^(k) next to each other, and U is obtained by putting the u^(k)^T over each other:
                                        (u^(0)^T)
                                        (u^(1)^T)
                                        (u^(2)^T)
  L = (l^(0) l^(1) ... l^(n - 1)),  U = (u^(3)^T)
                                        (   .   )
                                        (   .   )
                                        (   .   )

The above construction only works when the pivot element, that is, the value ă^(k - 1)_kk at position (k, k) of Ă^(k - 1), is non-zero. If ă^(k - 1)_kk = 0, then we can be sure that ă^(k - 1)_ik != 0 for some i, k < i < n. Otherwise A would not be invertible. So, by permuting the rows of Ă^(k - 1), that is, by multiplying Ă^(k - 1) on the left with some permutation matrix, we can move a non-zero value to the desired position. Because the product of permutation matrices is a permutation matrix, we can finally write P * A = L * U and thus A = P^{-1} * L * U. This decomposition can be performed in O(n^3) time: T(n) = T(n - 1) + c * n^2, which gives T(n) = c * sum_{i = 1}^n i^2 ~= c / 3 * n^3.

If A = P^{-1} * L * U, then A^{-1} = U^{-1} * L^{-1} * P. P is known, U and L are easy to invert and then it remains to multiply them. The inverse of an upper-triangular matrix can be computed by determining value by value, using that we must have U^{-1} * U = I_n, where I_n is the n x n identity matrix. In each row, the values of the positions of U^{-1} can be determined from left to right by solving a linear equation with a single unknown. Each value can be computed in O(n) time, in total we need O(n^3). L is inverted most easily by using that L * L^{-1} = I_n.

The decomposition A = P^{-1} * L * U also provides an indirect, but efficient, way of computing the determinant of A: a theorem from linear algebra tells us that for any pair of n x n matrices A and B, det(A * B) = det(A) * det(B). The determinant of a permutation matrix is +1 or -1 and can be computed in O(n) time. The determinant of a triangular matrix is given by the product of all diagonal elements. In our case L has only ones on the diagonal, so det(L) = 1. We conclude that

det(A) = det(P^{-1}) * det(L) * det(U) = +- prod_{i = 0}^{n - 1} u_ii.
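The determinant computation through the decomposition can be sketched in C as follows. The sketch makes several assumptions that go beyond the text: the matrix is a flat row-major array of doubles that is overwritten, the name lu_determinant is made up, and in every step the row with the largest candidate pivot is swapped in (partial pivoting), whereas the construction above only requires a swap when the pivot is zero.

```c
static double abs_val(double v) { return v < 0 ? -v : v; }

/* Overwrites the n x n matrix a (row-major) with its factors: U on and
   above the diagonal, the entries of L (without the ones) below it.
   Returns det(A), or 0.0 if no non-zero pivot can be found. */
double lu_determinant(int n, double *a)
{
    double det = 1.0;
    for (int k = 0; k < n; k++)
    {
        int pivot = k;                     /* row with largest entry in column k */
        for (int i = k + 1; i < n; i++)
            if (abs_val(a[i * n + k]) > abs_val(a[pivot * n + k]))
                pivot = i;
        if (a[pivot * n + k] == 0.0)
            return 0.0;                    /* A is not invertible */

        if (pivot != k)                    /* row swap: determinant of P flips sign */
        {
            for (int j = 0; j < n; j++)
            {
                double t = a[k * n + j];
                a[k * n + j] = a[pivot * n + j];
                a[pivot * n + j] = t;
            }
            det = -det;
        }

        det *= a[k * n + k];               /* diagonal element u_kk */
        for (int i = k + 1; i < n; i++)
        {
            double l = a[i * n + k] / a[k * n + k];
            a[i * n + k] = l;              /* entry of column k of L */
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= l * a[k * n + j];
        }
    }
    return det;
}
```

The three nested loops reflect the O(n^3) bound derived above.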

Recursive Matrix Inversion

We have now seen two algorithms for inverting n x n matrices. The first is really easy, and that is why it is taught in linear-algebra courses, but computationally it is bad due to its exponential complexity. The second is slightly more involved but runs in O(n^3), which is reasonable in the context of matrix computation. This method and variants of it are really used in practice in large-scale numerical computations. Since Strassen presented his algorithm, however, we know that the product of two n x n matrices can be computed in o(n^3) time and, at least from a theoretical point of view, it is interesting to consider whether matrix product and matrix inversion are in the same complexity class. In the following we show that Inv(n) = Theta(Mat(n)), where Inv(n) gives the time for computing the inverse and Mat(n) the time for computing the product of n x n matrices. It is convenient to be able to use n^2 <= Mat(n) = O(n^3), n^2 <= Inv(n) = O(n^3). The lower bounds are obvious, n^2 is needed for reading the input. The upper bounds follow from the given algorithms.

It appears obvious that inverting a matrix is not easier than computing a product. By a tricky construction this can be shown easily. Consider the following (3 * n) x (3 * n) matrix:

           ( I_n   A    0  )
  M      = (  0   I_n   B  )
           (  0    0   I_n )
Here A and B are the n x n matrices we want to multiply. It is easy to check that M^{-1} is given by
           ( I_n  -A  A * B)
  M^{-1} = (  0   I_n  -B  )
           (  0    0   I_n )
So, Mat(n) <= Inv(3 * n) + O(n^2) <= (27 + c) * Inv(n). Here c is the constant hidden in the term O(n^2), which covers the time for creating M from A and B and the time to extract the result from M^{-1}. It is here that we are using Inv(n) >= n^2.

The other direction goes analogously, but requires a more elaborate construction. At first we only consider how to compute the inverse of a matrix A that is symmetric and positive definite. A is said to be symmetric if A equals its transpose. A is said to be positive definite if, for any n-vector x which is not the null-vector, x^T * A * x > 0. For symmetric matrices this definition is equivalent to the statement that all eigenvalues of A are positive. Any positive definite matrix is invertible, but the converse is not true. Such a matrix A can be divided into four n / 2 x n / 2 submatrices as follows:

           (  B   C^T )
  A      = (          )
           (  C    D  )

Define S = D - C * B^{-1} * C^T. The symmetry and positive definiteness of A implies that even B and D have these properties and by extension also S. Particularly this implies that B and S are invertible. The inverse A^{-1} of A can now be expressed in terms of B, C and S:

           ( B^{-1}+B^{-1}*C^T*S^{-1}*C*B^{-1}  -B^{-1}*C^T*S^{-1} )
  A^{-1} = (                                                       )
           (                  -S^{-1}*C*B^{-1}              S^{-1} )
We check the diagonal positions of A * A^{-1}:
   B * (B^{-1}+B^{-1}*C^T*S^{-1}*C*B^{-1}) - C^T * S^{-1}*C*B^{-1}
     = I_{n / 2} + C^T*S^{-1}*C*B^{-1} - C^T*S^{-1}*C*B^{-1},
  -C * B^{-1}*C^T*S^{-1} + D * S^{-1}
     = (D - C * B^{-1} * C^T) * S^{-1}
     = S * S^{-1}.
Because B and S again have the right properties, they can be inverted recursively.

All this appears to be too involved to be practical, however, the constants are good. Arranging things optimally, the number of products of n / 2 x n / 2 matrices can be bounded to 4. This gives the following recurrence:

Inv(n) <= 2 * Inv(n / 2) + 4 * Prod(n / 2) + O(n^2).
Here Prod(n) gives the time for multiplying n x n matrices. Under the assumption that Prod(n) grows at least quadratically, which can hardly be called an assumption because Theta(n^2) time is needed just to read all values of the matrices, Prod(n / 2) <= Prod(n) / 4. Thus, inductively assuming that Inv(n / 2) <= c * Prod(n / 2), we get
Inv(n) <= (2 * c + 4) / 4 * Prod(n) + O(n^2).
For c >= 2, (2 * c + 4) / 4 <= c. The sum of the contributions from the O(n^2) can be bounded to O(n^2). Thus, Inv(n) <= 2 * Prod(n) + O(n^2). The actual factor is even smaller, because Prod(n) = n^a for some a which lies closer to 3 than to 2. For a > log_2 6 ~= 2.58 it is easy to check that Inv(n) < Prod(n). The conclusion is that in practice, for sufficiently large n, for symmetric positive definite matrices the inverse can be computed faster than a single matrix product!

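For arbitrary matrices, we use that A^T * A is symmetric and positive definite whenever A is invertible. So, an arbitrary matrix may be inverted as follows: compute A^T * A, invert this product with the algorithm above, and multiply the result on the right by A^T. This is correct because
A^{-1} = (A^T * A)^{-1} * A^T.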

This shows that the cost of inverting an arbitrary matrix exceeds that of inverting a symmetric positive definite matrix by at most two matrix products.
Computing matrix products and computing matrix inverses are problems with approximately the same complexity. Not only in theory, but even in practice.

Polynomial Operations

Convolution

Consider two discrete probability distributions A and B, assigning probabilities to events 0, ..., n - 1. More precisely, there are positive numbers a_i and b_i, 0 <= i < n, interpreted as probabilities, so that sum_{i = 0}^{n - 1} a_i = sum_{i = 0}^{n - 1} b_i = 1. Let C = A + B. How is C distributed? The distribution of C is given by the convolution of A and B, that is,
c_j = sum_{i = 0}^{n - 1} a_i * b_{j - i}, for all j, 0 <= j < 2 * n.
Here, for simplicity, we assumed that a_i and b_i equal 0 for all i outside the interval of their definition. Also it should be noticed that the upper bound on j could be taken 2 * n - 2, but taking 2 * n - 1 is more convenient. The vector (c_0, ..., c_{2 * n - 1}) is called the convolution of the vectors (a_0, ..., a_{n - 1}) and (b_0, ..., b_{n - 1}). To compute the convolution element by element requires O(n^2) time.

Consider two polynomials f() and g() of degree n - 1 with coefficients a_i and b_i, 0 <= i < n, respectively, and let h() = f() * g(). How can the coefficients c_j, 0 <= j < 2 * n, of h() be expressed in the coefficients of f() and g()? Again by computing a convolution:

c_j = sum_{i = 0}^{n - 1} a_i * b_{j - i}, for all j, 0 <= j < 2 * n.
More concisely,
h(x) = sum_{j = 0}^{2 * n - 1} (sum_{i = 0}^{n - 1} a_i * b_{j - i}) * x^j.
Computing convolutions has several important applications. Therefore, it is worth considering the most efficient way of performing this task.
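For reference, the element-by-element computation with O(n^2) operations can be sketched in C as follows. The name convolve and the use of doubles are assumptions made for this illustration. Following the convention above, the result has 2 * n entries, the last of which is always 0.

```c
/* Computes c_j = sum_i a_i * b_{j - i} for 0 <= j < 2 * n.
   a and b have n entries each, c must have room for 2 * n entries. */
void convolve(int n, const double *a, const double *b, double *c)
{
    for (int j = 0; j < 2 * n; j++)
        c[j] = 0.0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            c[i + k] += a[i] * b[k];   /* a_i * b_k contributes to c_{i + k} */
}
```

With a = (1, 2) and b = (3, 4), that is, the polynomials 1 + 2 * x and 3 + 4 * x, this gives (3, 10, 8, 0), the coefficients of 3 + 10 * x + 8 * x^2.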

Evaluation

There are two simple ways of evaluating a polynomial f() of degree n - 1 defined by f(x) = sum_{i = 0}^{n - 1} a_i * x^i. The simplest is by executing:
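  int evaluate_easy(int x, int[] a)
  {
    int n = a.length; // number of coefficients
    int y = 0;        // value accumulated so far
    int z = 1;        // z == x^i at the start of pass i
    for (int i = 0; i < n; i++)
    {
      y += a[i] * z;
      z *= x;
    }
    return y;
  }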
This takes O(n) time for each evaluation. Notice that even for an empty polynomial a sensible result, 0, is returned.

Horner's rule rewrites the polynomial f() as follows:

f(x) = a_0 + x * (a_1 + x * (a_2 + x * (a_3 + x * ( ... ) ) ) ).
This leads to a more efficient evaluation scheme:
  int evaluate_horner(int x, int[] a)
  {
    int n = a.length; // number of coefficients
    int y = 0;
    for (int i = n - 1; i >= 0; i--)
      y = a[i] + x * y;
    return y;
  }
In each pass of the loop there are only two instead of three arithmetic operations, and there is only one instead of two assignments. How much faster this will be in practice is hard to say. If the numbers are large and the multiplication takes many clock cycles, then the difference may be a factor two. If the numbers are small and the processor is fast, then the time may be dominated by the time for fetching the a[i].

Interpolation

A polynomial f() of degree n - 1 is uniquely determined by giving the values y_i = f(x_i), 0 <= i < n, for n different x_i. This fact is a theorem from algebra, but it does not tell how to compute this polynomial. The process of computing a polynomial of degree n - 1 running through n specified points (x_i, y_i) is called interpolation. To be precise, by "computing a polynomial", we mean computing its n coefficients a_i, 0 <= i < n.

The most straightforward way of determining the coefficients is to solve the resulting system of equations:

  a_0 + a_1 * x_0       + ... + a_{n - 1} * x_0^{n - 1}       = y_0

                           .
                           .
                           .

  a_0 + a_1 * x_{n - 1} + ... + a_{n - 1} * x_{n - 1}^{n - 1} = y_{n - 1}
Here the a_i, 0 <= i < n, are n unknowns which have to be determined. This system can be reformulated as a matrix problem as follows:
  (1 x_0       ... x_0^{n - 1}      )   (   a_0   )   (   y_0   )
  (                                 )   (         )   (         )
  (             .                   )   (         )   (         )
  (             .                   ) * (         ) = (         )
  (             .                   )   (         )   (         )
  (                                 )   (         )   (         )
  (1 x_{n - 1} ... x_{n - 1}^{n - 1})   (a_{n - 1})   (y_{n - 1})
So, the interpolation problem can be solved by computing the inverse A^{-1} of the matrix A with entries x_i^j, which is called a Vandermonde matrix, and then computing the product of A^{-1} and the vector (y_0, ..., y_{n - 1})^T. Because we know that the system has a solution, A cannot be singular. Matrix inversion has been studied above. The conclusion was that it is a rather expensive operation.

A method which, in general, is more efficient, is to use Lagrange's formula:

f(x) = sum_{i = 0}^{n - 1} y_i * prod_{j != i} (x - x_j) / prod_{j != i} (x_i - x_j).
The correctness of this formula is easy to verify. The denominators cannot be zero because all x_i are different. So, the expression indeed gives a polynomial. Its degree is n - 1, because each term consists of n - 1 factors of degree 1. Furthermore, it can easily be verified that f(x_i) = y_i. So, f() must be the polynomial we are looking for. It remains to determine the coefficients. Working out each of the terms and then gathering all terms of the same degree takes O(n^3). Being slightly more clever the problem can be solved in O(n^2).
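One possible way of reaching O(n^2) is sketched below in C; the text does not say which trick is meant, so this is an assumption. First the product P(t) = prod_j (t - x_j) is worked out once. Then, for each i, the numerator prod_{j != i} (t - x_j) is obtained in O(n) time by dividing P(t) by (t - x_i), and the denominator is this quotient evaluated at x_i. The function name and the use of doubles are made up for the illustration.

```c
#include <stdlib.h>

/* Computes the n coefficients coef[0 .. n - 1] of the polynomial of degree
   n - 1 through the points (x[i], y[i]), n >= 1, all x[i] different.
   Returns 0 on success and -1 if memory could not be allocated. */
int lagrange_coefficients(int n, const double *x, const double *y, double *coef)
{
    double *p = calloc((size_t)n + 1, sizeof *p);  /* P(t), degree n */
    double *q = malloc((size_t)n * sizeof *q);     /* P(t) / (t - x_i) */
    if (p == NULL || q == NULL)
    {
        free(p);
        free(q);
        return -1;
    }

    p[0] = 1.0;
    for (int j = 0; j < n; j++)          /* multiply P(t) by (t - x_j) */
    {
        for (int d = j + 1; d >= 1; d--)
            p[d] = p[d - 1] - x[j] * p[d];
        p[0] = -x[j] * p[0];
    }

    for (int d = 0; d < n; d++)
        coef[d] = 0.0;
    for (int i = 0; i < n; i++)
    {
        q[n - 1] = p[n];                 /* synthetic division by (t - x_i) */
        for (int d = n - 1; d >= 1; d--)
            q[d - 1] = p[d] + x[i] * q[d];

        double denom = 0.0;              /* q(x_i) = prod_{j != i} (x_i - x_j) */
        for (int d = n - 1; d >= 0; d--)
            denom = q[d] + x[i] * denom;

        double scale = y[i] / denom;
        for (int d = 0; d < n; d++)
            coef[d] += scale * q[d];
    }

    free(p);
    free(q);
    return 0;
}
```

For the points (0, 1), (1, 3) and (2, 7) the routine returns coefficients close to (1, 1, 1), that is, the polynomial 1 + x + x^2.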

Fast Fourier Transform

A very important operation is the so-called Fourier transform. Here we only consider the discrete version of it, but replacing sums by integrals, there is an analogously defined continuous Fourier transform. The discrete Fourier transform allows us to evaluate a polynomial of degree n - 1 efficiently in n very special points. It also has an inverse which can be computed equally efficiently. One of the many applications of the Fourier transform is the fast algorithm for multiplying polynomials presented in the next section.

Define the complex number omega_n by omega_n = exp(2 * pi * i / n). Here i is the special complex number so that i^2 = -1. We do not need to know much about complex numbers. The only facts that we will need are
omega_{n / 2} = omega_n^2,
omega_n^{n / 2} = -1,
omega_n^n = 1.

Because of the last property, the number omega_n is called the n-th root of unity. Here we use the fact that e^{2 * pi * i} = 1, which goes back to the definition of the complex exponential function: e^{a + b * i} = e^a * (cos b + i * sin b).

For a polynomial f() of degree n - 1 with coefficients a_i, 0 <= i < n, the discrete Fourier transform F(a_0, ..., a_{n - 1}) is defined by

F(a_0, ..., a_{n - 1}) = (f(omega_n^0), ..., f(omega_n^{n - 1})).
The inverse Fourier transform will be the operation F^{-1} which, when composed with F either on the left or the right, gives the identity operation.

We consider how to perform the Fourier transform in an efficient way. In the following we assume that n is even, and because of the recursion we will apply, it should even be a power of two. This is no problem, because for any n' > n, a polynomial of degree n - 1 can be interpreted as a polynomial of degree n' - 1 with n' - n additional coefficients with value 0. Define

f^[0](x) = sum_{i = 0}^{n / 2 - 1} a_{2 * i} * x^i,
f^[1](x) = sum_{i = 0}^{n / 2 - 1} a_{2 * i + 1} * x^i.
Both are polynomials of degree n / 2 - 1. f^[0]() is built on all even coefficients, f^[1]() on all odd coefficients. f() can be expressed in terms of these functions:
f(x) = f^[0](x^2) + x * f^[1](x^2).

In the following we use w = omega_n. We will use that omega_{n / 2} = w^2. So, the Fourier transforms of f^[0]() and f^[1]() are given by

F(a_0, a_2, ..., a_{n - 2}) = (f^[0](w^0), f^[0](w^2), ..., f^[0](w^{n - 2})),
F(a_1, a_3, ..., a_{n - 1}) = (f^[1](w^0), f^[1](w^2), ..., f^[1](w^{n - 2})).
Suppose that these values have been computed, then all values f(w^i), 0 <= i < n, can be computed easily as well:
f(w^i) = f^[0](w^{2 * i}) + w^i * f^[1](w^{2 * i}).
For 0 <= i < n / 2, we can directly use the values from F(a_0, a_2, ..., a_{n - 2}) and F(a_1, a_3, ..., a_{n - 1}). For n / 2 <= i < n, we use that w^{2 * i + n} = w^{2 * i} and that w^{n / 2} = -1. Together this implies that f(w^{i + n / 2}) = f^[0](w^{2 * i + n}) + w^{i + n / 2} * f^[1](w^{2 * i + n}) = f^[0](w^{2 * i}) - w^i * f^[1](w^{2 * i}).

F(a_0, a_2, ..., a_{n - 2}) and F(a_1, a_3, ..., a_{n - 1}) can be computed recursively. At the bottom of the recursion we use that for a polynomial of degree 0 given by a constant a, F(a) = a. The recurrence relation giving the time consumption T(n) for computing the Fourier transform of a polynomial of degree n - 1 is simple: T(n) = 2 * T(n / 2) + O(n). This is so, because each of the n values needed for F(a_0, ..., a_{n - 1}) can be computed in constant time once the two subproblems have been solved. The solution is T(n) = O(n * log n). This recursive construction is called the Fast Fourier Transform, abbreviated FFT.

It remains to show how to compute F^{-1}(y_0, ..., y_{n - 1}) for a given set of values y_i, 0 <= i < n. That is, we must show how to find coefficients a_i, 0 <= i < n, so that F(a_0, ..., a_{n - 1}) = (y_0, ..., y_{n - 1}). This is exactly the interpolation problem we considered before, because we know that y_i = f(w^i), where f() is the polynomial with the unknown coefficients we are looking for. So, we find back a Vandermonde system with very special entries:

  (1 1          1              ... 1              )   (   a_0   )   (   y_0   )
  (1 w         w^2             ... w^{n - 1}      )   (   a_1   )   (   y_1   )
  (1 w^2       w^4             ... w^{2 * (n - 1)})   (   a_2   )   (   y_2   )
  (                                               )   (         )   (         )
  (             .                                 ) * (    .    ) = (    .    )
  (             .                                 )   (    .    )   (    .    )
  (             .                                 )   (    .    )   (    .    )
  (                                               )   (         )   (         )
  (1 w^{n - 1} w^{2 * (n - 1)} ... w^{(n - 1)^2}  )   (a_{n - 1})   (y_{n - 1})

Denote the above Vandermonde matrix by FM. Notice that for any vector (b_0 ... b_{n - 1})^T, which may be interpreted as the coefficients of a polynomial of degree n - 1, FM * (b_0 ... b_{n - 1})^T gives the Fourier transform. The reason is that computing the product of a Vandermonde matrix with a vector is equivalent to evaluating the corresponding polynomial at the points that can be found in the second column of the matrix.

In this case the inverse FM^{-1} of FM is easy to compute: FM^{-1} = conj(FM) / n, where conj(FM) denotes the complex conjugate of FM, that is, the matrix obtained by taking the complex conjugate of all its entries. For a complex number x + i * y the complex conjugate is the number x - i * y. That conj(FM) / n is indeed the inverse of FM can be checked using that the complex conjugate of w^k is given by w^{-k}. Because sum_{0 <= i < n} w^{i * k} = 0 for every k that is not a multiple of n, this implies that in the product FM * conj(FM) all but the diagonal elements disappear, while we get n on all diagonal positions. This gives our last equation:

  F^{-1}(y_0, ..., y_{n - 1}) 
    = conj(FM) * (y_0 ... y_{n - 1})^T / n 
    = conj(FM * (conj(y_0) ... conj(y_{n - 1}))^T) / n
    = conj(F(conj(y_0), ..., conj(y_{n - 1}))) / n.

So, F^{-1} of an n-vector can be computed by applying F to the complex conjugate of that vector, taking the conjugate of the result and dividing by n. In the particular case that we know that the result is real, this last conjugation may be omitted.

The Fourier transform of n coefficients and its inverse can be computed in O(n * log n) time.
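
The recursive construction and the conjugation trick for the inverse can be sketched as follows. This is only an illustration under some assumptions that are not fixed by the text: n is a power of two, complex numbers are stored as separate arrays of real and imaginary parts, and omega_n is taken to be cos(2 * pi / n) + i * sin(2 * pi / n). The names FftSketch, fft and inverseFft are chosen for this example.

```java
class FftSketch {
    // Recursive FFT of the coefficients (re[i] + i * im[i]); re.length must be
    // a power of two. Returns { real parts, imaginary parts } of
    // f(w^0), ..., f(w^{n - 1}) for w = omega_n.
    static double[][] fft(double[] re, double[] im) {
        int n = re.length;
        if (n == 1)  // polynomial of degree 0: F(a) = a
            return new double[][] { { re[0] }, { im[0] } };
        int h = n / 2;
        double[] evenRe = new double[h], evenIm = new double[h];
        double[] oddRe = new double[h], oddIm = new double[h];
        for (int i = 0; i < h; i++) {
            evenRe[i] = re[2 * i];
            evenIm[i] = im[2 * i];
            oddRe[i] = re[2 * i + 1];
            oddIm[i] = im[2 * i + 1];
        }
        double[][] f0 = fft(evenRe, evenIm);  // values f^[0](w^{2 * i})
        double[][] f1 = fft(oddRe, oddIm);    // values f^[1](w^{2 * i})
        double[] outRe = new double[n], outIm = new double[n];
        for (int i = 0; i < h; i++) {
            double angle = 2 * Math.PI * i / n;
            double wRe = Math.cos(angle), wIm = Math.sin(angle);
            // t = w^i * f^[1](w^{2 * i})
            double tRe = wRe * f1[0][i] - wIm * f1[1][i];
            double tIm = wRe * f1[1][i] + wIm * f1[0][i];
            outRe[i] = f0[0][i] + tRe;      // f(w^i)
            outIm[i] = f0[1][i] + tIm;
            outRe[i + h] = f0[0][i] - tRe;  // f(w^{i + n / 2})
            outIm[i + h] = f0[1][i] - tIm;
        }
        return new double[][] { outRe, outIm };
    }

    // Inverse transform: conjugate the input, apply fft, conjugate the
    // result and divide by n.
    static double[][] inverseFft(double[] re, double[] im) {
        int n = re.length;
        double[] conjIm = new double[n];
        for (int i = 0; i < n; i++)
            conjIm[i] = -im[i];
        double[][] r = fft(re, conjIm);
        for (int i = 0; i < n; i++) {
            r[0][i] = r[0][i] / n;
            r[1][i] = -r[1][i] / n;
        }
        return r;
    }

    public static void main(String[] args) {
        double[] a = { 1, 2, 3, 4 };  // f(x) = 1 + 2 * x + 3 * x^2 + 4 * x^3
        double[][] y = fft(a, new double[4]);
        double[][] back = inverseFft(y[0], y[1]);
        System.out.println(java.util.Arrays.toString(back[0]));
    }
}
```

Up to rounding errors, applying inverseFft to the result of fft should give back the original coefficients. Allocating new arrays in every call keeps the sketch short; a tuned implementation would work in place.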

Products of Polynomials

Above we have seen that the coefficients of the product h = f * g of two polynomials f() and g() of degree n - 1, can be determined in O(n^2) by computing the convolution of their coefficients. We are not satisfied with this time consumption and consider alternative ways of solving the problem.

h() is a polynomial of degree 2 * n - 1 (actually 2 * n - 2). By definition h(x) = (f * g)(x) = f(x) * g(x). So, h() can be determined as follows:

  1. Choose 2 * n different support points x_i, 0 <= i < 2 * n.
  2. Compute f(x_i) and g(x_i), for all x_i, by evaluation.
  3. Compute h(x_i) = f(x_i) * g(x_i), for all x_i.
  4. Compute the coefficients c_i, 0 <= i < 2 * n, of h() by interpolation.

Using conventional methods, step 2 and step 4 take at least O(n^2) each, which gives no improvement over the convolution-based computation. However, by choosing the support points in a special way each of these steps can be performed in O(n * log n) time. The idea is to evaluate f() and g() at the points omega_{2 * n}^i for all i, 0 <= i < 2 * n. This can be done by computing a Fourier transform of f() and g(), which for this purpose must be extended with n dummy coefficients with value 0. After computing the products in step 3, the interpolation is performed by applying the inverse Fourier transformation. In code, this might look as follows:

  int[] polynomial_product(int[] a, int[] b, int n)
  // Takes as input two vectors of length n, the coefficients of two
  // polynomials of degree n - 1 and returns a vector of length 2 * n, 
  // the coefficients of a polynomial of degree 2 * n - 1.
  {
    int[] a_prime = new int[2 * n];
    int[] b_prime = new int[2 * n];
    for (int i = 0; i < n; i++)
    { 
      a_prime[i] = a[i]; 
      b_prime[i] = b[i];
    }
    for (int i = n; i < 2 * n; i++)
      a_prime[i] = b_prime[i] = 0;
    complex[] alpha = F(a_prime, 2 * n);
    complex[] beta  = F(b_prime, 2 * n);
    complex[] gamma = new complex[2 * n];
    for (int i = 0; i < 2 * n; i++)
      gamma[i] = conjugate(alpha[i] * beta[i]);
    complex[] c_prime = F(gamma, 2 * n);
    int[] c = new int[2 * n];
    for (int i = 0; i < 2 * n; i++)
      // round to the nearest integer: the transforms are computed in floating point
      c[i] = (int) Math.round(real_part(c_prime[i]) / (2 * n));
    return c;
  }

The method of computing the product of polynomials with help of the Fourier transform is quite remarkable: we start and end with integers or real numbers, but at the intermediate stages we are working with complex numbers. It illustrates how results from branches of mathematics which have no apparent relation to computer science can have computational importance.

Exercises

  1. Compute the gcd of a = 409899 and b = 1206641. First try the conventional method constructing a factorization of both numbers, then apply the Euclidean algorithm. Also find numbers x and y so that a * x + b * y = 1.

  2. In the analysis of the Euclidean algorithm we have proven that the largest number a is reduced by a factor 2 every second round. We also have shown that a + b, the sum of both numbers, is reduced by a factor 5/4 every round. How many rounds do we need at most using either estimate (lower order terms may be estimated, but the constant of the leading term must be specified)? Sharpen the result by either looking further than two rounds or using a finer case distinction. Give a class of really bad examples.

  3. The given time analysis for the Euclidean algorithm is based on the assumption that all numerical operations can be performed in constant time. Let T_prod(n) denote the time for computing the product of two n-digit numbers. Assume that division is performed with Newton iteration.

  4. Solve the recurrences giving the times of the two recursive product algorithms. Do not use guessing, but apply the methods for solving inhomogeneous recurrence relations. Assuming that c = 6 and c' = 10, how large must n be for Karatsuba's method to be faster?

  5. Try to find an even more efficient algorithm for multiplying two numbers. That is, try to divide the numbers in more than two pieces with a small number of recursive calls. Notice that if there are a subproblems of size n / b, that then we get n^{log_b a}. We have succeeded with b = 2 and a = 3. Better would be b = 3 and a = 5 or b = 4 and a = 8, b = 5 and a = 11, ... . This is not easy!

  6. Write a program for dividing very long numbers. You may assume that all numbers are positive. It is suggested to use 16 bits per unsigned int, but of course this implies some waste of memory. The program should compare two alternative methods: Newton iteration or something of this kind in combination with Karatsuba's multiplication, and some more conventional implementation based on long division. Measure the time consumption as a function of the length n of the numbers (that is, the number of ints needed to store them). Consider n = 2^k, for all sensible values of k.

  7. Consider the recursive algorithm exponent_3 for computing exponents. Let f(n) denote the number of multiplications for computing x^n. Let z(n) and o(n) denote the number of zeroes and ones in the binary expansion of n, respectively. Prove, using induction, an exact expression of f(n) in terms of z(n) and o(n).

    The number of multiplications is largest for numbers n of the form n = 2^k - 1. Then it is almost 2 * log n. For this particular n, we can do much better: compute x^{n + 1} / x. The question is whether we can always come close to 1 * log n multiplications. The answer is yes. The idea is to precompute a table a[] with a[i] = x^i, for 0 <= i < m. This precomputation costs m - 1 multiplications. The best is to take m = 2^k, for some small number k. Then n = 2^k * n' + n'' for some n' and n'', 0 <= n'' < m. Thus, f(n, k) <= k + 1 + f(n', k). That is, we get a recursion f(n, k) <= k + 1 + f(n / 2^k, k). The solution of this is f(n, k) <= (1 + 1 / k) * log n. Determine a choice for k that minimizes the total number of multiplications f(n, k) + 2^k. Show that now the number of multiplications is bounded by (1 + o(1)) * log n.

    Work out a recursive and a non-recursive version of this exponentiation method. The non-recursive version should be a direct modification of exponent_1 given above.

  8. Solve the recurrences giving the times of the two recursive matrix product algorithms. Do not use guessing, but apply the methods for solving inhomogeneous recurrence relations. Assuming that c = 4 and c' = 18, how large must n be for Strassen's method to be faster?

  9. For an n x n permutation matrix it is trivial to compute the determinant in O(n^2). Now assume the matrix is presented in condensed form, that is, assume that it is given as an array a[] of length n, a[i] giving the position in column i in which the single one in this column occurs. Show how to compute the determinant in linear time. Hint: focus on the cycle structure.

  10. Consider again the matrix-inversion algorithm based on computing matrix products. Figure out how to compute all products required for the computation of A^{-1} with just 4 products of n / 2 x n / 2 matrices. In other words, show how to obtain c_1 = 4, for the constant c_1 in the cost analysis of this algorithm.

  11. Write a program in C or Java for computing matrix products and inverses of n x n matrices. The matrix product can be implemented in the easiest way. For the inverse the recursive algorithm should be used. You may assume that n = 2^k for some integral k > 0, but you cannot assume that the original matrix is symmetric or positive definite. Measure the times Mat(2^k) and Inv(2^k) for k = 6, 7, ..., until it takes more than 10 minutes. Plot the ratio and compare with the values you expected on basis of the given theoretical analysis.

    Also implement a conventional matrix-inversion routine based on LU factorization. Again measure Inv(2^k). If for sufficiently large k the recursive algorithm is faster while for small k it is slower, then it can be made even better by integrating the conventional algorithm into it and terminating the recursion earlier. Determine the optimal choice for the value k_min at which the recursion is terminated. Measure Inv(n) once more and compare again with Mat(n).

  12. We consider the problem of computing a polynomial interpolation through n points (x_i, y_i), 0 <= i < n, where all x_i are different. More precisely, the task is to compute all coefficients a_i, 0 <= i < n, so that for the polynomial f() of degree n - 1 with the a_i as coefficients, f(x_i) = y_i for all i, 0 <= i < n. Show how Lagrange's formula presented above can be used to solve this task in O(n^2) time.

  13. Compute the product of f(x) = 3 * x^3 + 2 * x^2 + 3 * x - 2 and g(x) = x^3 - 5 * x^2 - 7 * x + 1 the ordinary way and by using the described recursive Fourier transform algorithm. Show the computation. Do not work out the complex numbers, but use that omega^i * omega^j = omega^{i + j}. Of course the computer should really work with complex numbers. Why?

  14. Consider computing the product of two long integers by using FFT. A number with digits (a_{n - 1}, ..., a_0), where a_0 is the least significant, can be interpreted as the value of the polynomial f(x) = sum_{i = 0}^{n - 1} a_i * x^i evaluated for x = r, where r is the radix with respect to which the number is given. If (b_{n - 1}, ..., b_0) is a second radix-r number, then the product of them can be computed by computing the product of the polynomials and then evaluating the product polynomial for x = r. The point is that the involved numbers may be large, and therefore we cannot simply equate the running time with the number of performed arithmetic operations. Work out the details and analyze the performance. Try to achieve the best possible result, O(n * log n * loglog n) is possible.




Union-Find

Definition

The defining properties of the subset ADT are union and find. It is used to maintain sets of subsets. Initially there are n subsets each consisting of 1 element. These elements may be assumed to be the numbers 0 to n - 1. Gradually subsets get fused (union) together and become larger. At the same time there are queries (find operations) asking for the unique identifier, the name of the subset to which an element belongs. Initially the name might be taken equal to the single number in the subset. Later it might be one of the numbers in the subset, possibly, but not necessarily, the smallest number. The only important thing is that find(x) and find(y) return the same values when x and y belong to the same subset, and different values when they belong to different subsets.

The above leaves much freedom in how to perform the unions and how to choose the names. A possibility is to apply a rule like "add first to second", meaning that an operation union(x, y) is performed by adding all elements of the subset in which x lies to the subset in which y lies. The new name for this set then becomes the name of the subset in which y lies.

Union-Find: Add First to Second

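The subset ADT may sound unimportant, but it shows up everywhere. Subsets are the canonical example of the mathematical concept of equivalence relations. An equivalence relation is a binary relation "~" on the elements of a set with the following three properties:

  1. a ~ a, for all a (reflexivity).
  2. If a ~ b, then b ~ a (symmetry).
  3. If a ~ b and b ~ c, then a ~ c (transitivity).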

If we read "~" like "in the same subset as", then all three properties are satisfied. The most important example of an equivalence relation is "is reachable": in a road system with two-way roads, all three properties are satisfied. Thus, the subset ADT can be used to compute something called the "connected components" of a graph.
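
As a small illustration of this use, the following sketch counts the connected components of an undirected graph given as a list of edges. It is written for this example only: the class name ComponentsSketch and the method countComponents are not part of the text, and the unions are performed in the simplest possible way, by renaming all elements of one subset.

```java
class ComponentsSketch {
    // Counts the connected components of an undirected graph with nodes
    // 0, ..., n - 1. Each edge is given as a pair { u, v }.
    static int countComponents(int n, int[][] edges) {
        int[] a = new int[n];  // a[i] = name of the subset containing i
        for (int i = 0; i < n; i++)
            a[i] = i;
        int components = n;
        for (int[] e : edges) {
            int k1 = a[e[0]];  // find
            int k2 = a[e[1]];  // find
            if (k1 != k2) {    // union: rename the subset of k1
                for (int i = 0; i < n; i++)
                    if (a[i] == k1)
                        a[i] = k2;
                components--;
            }
        }
        return components;
    }

    public static void main(String[] args) {
        int[][] edges = { { 0, 1 }, { 1, 2 }, { 3, 4 } };
        System.out.println(countComponents(6, edges));
    }
}
```

Every edge that connects two different subsets reduces the number of components by one, so the counter ends at the number of equivalence classes of "is reachable".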

The subset ADT allows us to maintain equivalence relations dynamically: relations may be added by applying the union operation. The only limitation is that there is no de-union: once unified, the sets cannot be split anymore. This latter feature would be much more expensive to implement; in particular, it would require that all previous operations are recorded.

There are several implementations, ranging from extremely simple giving modest performance (one operation O(1), the other O(n)), to slightly less simple giving excellent performance (almost constant time for both operations).

Array-Based Implementation

Simple Approach

The simplest way to implement the subset ADT is by maintaining an integer array a[], in which for every node we have stored the current index of the subset to which it belongs. Initially a[i] = i. find(i), then simply returns a[i].

A union is more complicated: if all the elements of a subset S have to be renamed, then, in the simplest implementation, we have to scan the whole array to find those which belong to S. This takes O(n) time. Thus, for all n - 1 unions (after n - 1 non-trivial unions all elements have been unified into one set), we need O(n^2) time.

  void initialize() {
    for (int i = 0; i < n; i++)
      a[i] = i; }

  int find(int k) {
    return a[k]; }

  void union(int k1, int k2) {
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) // rename all elements in subset of k1
      for (int i = 0; i < n; i++)
        if (a[i] == k1)
          a[i] = k2; }

Array-Based Union-Find

If we are slightly more clever, we might maintain the elements that belong to a set in a linked list. In that case, we do not have to scan through all the elements when performing a union operation. A union is now performed by traversing one of the lists, accessing a[i] for each listed element i and updating it with the index of the other set. Then this list is hooked to the other list.

We consider an implementation of this. The lists are also implemented in an integer array b[]. In general the value b[i] gives the successor of i in its list. It is convenient to maintain a set of circular lists. The initial situation for this can be established by setting b[i] = i, for all 0 <= i < n. These ideas are worked out in the following piece of code:

  void initialize() {
    for (int i = 0; i < n; i++)
      a[i] = b[i] = i; }

  int find(int k) {
    return a[k]; }

  void union(int k1, int k2) {
    int i, j;
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) { // rename all elements in subset of k1
      i = k1;
      do {
        a[i] = k2;
        j = i;
        i = b[i]; }
      while (i != k1);
      // glue list of k1 into the list of k2 
      b[j] = b[k2];
      b[k2] = k1; } }

At first glance this appears to be a very good idea: the implementation is simple and does not cause too much overhead. The work of a union is now proportional to the number of renamings, that is, to the size of the subset of k1, and no longer Theta(n).

However, consider the following sequence of unions

  for (int i = 1; i < n; i++)
    union(i - 1, i);

If always the first set is joined to the second, then in total we have to rename sum_{i = 0}^{n - 1} i = (n - 1) * n / 2 = Omega(n^2) elements. This is half as much as with the trivial implementation, and in view of the extra work for each renaming, it is actually no improvement at all.
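
The count can be checked with a small experiment. The following sketch repeats the list-based implementation with the rule "add first to second" and adds a counter for the renamings; the class name ChainRenamings, the counter and the method chain are introduced only for this illustration.

```java
class ChainRenamings {
    static int[] a, b;      // a[i] = name of the subset of i, b[i] = successor of i
    static long renamings;  // number of assignments a[i] = k2 performed so far

    static void initialize(int n) {
        a = new int[n];
        b = new int[n];
        renamings = 0;
        for (int i = 0; i < n; i++)
            a[i] = b[i] = i;
    }

    // Adds the subset of k1 to the subset of k2 ("add first to second").
    static void union(int k1, int k2) {
        k1 = a[k1];
        k2 = a[k2];
        if (k1 == k2)
            return;
        int i = k1, j = k1;
        do {  // rename all elements in the subset of k1
            a[i] = k2;
            renamings++;
            j = i;
            i = b[i];
        } while (i != k1);
        // glue the list of k1 into the list of k2
        b[j] = b[k2];
        b[k2] = k1;
    }

    // Performs union(i - 1, i) for i = 1, ..., n - 1 and returns the
    // total number of renamings.
    static long chain(int n) {
        initialize(n);
        for (int i = 1; i < n; i++)
            union(i - 1, i);
        return renamings;
    }

    public static void main(String[] args) {
        System.out.println(chain(10));
    }
}
```

In this sequence the i-th union renames the i elements 0, ..., i - 1, which gives the sum stated above.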

Union by Size

The problem with the previous construction is that a large set is repeatedly joined to a small set. Performance improves tremendously if we maintain the size of the subsets and always join the smaller to the larger. In that case the number of nodes that is renamed is bounded by O(n * log n). We prove this.

Union-Find: Union by Size

How do we prove such a thing? The usual way of bounding the time of a sequence of operations is to put a bound on the time per operation. Here this approach does not work, because union operations may take linear time. In this case one should not perform an operation-based analysis, but an element-based analysis: bounding the cost per element instead of bounding the cost per operation. The idea is to consider the maximum number of times that any node may be given a new name. We show that this happens at most log n times. Then it follows that in total there are at most n * log n renamings.

Lemma: A node that has been renamed k times belongs to a set of size at least 2^k.

Proof: To be formal, we use induction. We must check a base case and a step. The base case is easy: a node that has been renamed 0 times belongs to a set of size at least 1 right from the start. Now suppose the lemma holds for all k' < k, for some k > 0. Consider a node x that has been renamed k - 1 times. The induction assumption implies that the size n_x of the subset to which x belongs satisfies n_x >= 2^{k - 1}. Assume x gets renamed again when performing union(x, y). Because unions are performed by size, this implies that n_y >= n_x, where n_y gives the size of the subset to which y belongs. For the size n_{xy} of the subset created by the union, this gives n_{xy} = n_x + n_y >= 2 * n_x >= 2 * 2^{k - 1} = 2^k. End.

Corollary: A node can be renamed at most log n times.

Proof: Assume some node x is renamed k > log n times. Then according to the lemma it belongs to a subset of size n_x >= 2^k > 2^{log n} = n, which is impossible. End.

Theorem: When performing union-by-size, the time consumption of any sequence of n - 1 unions is bounded by O(n * log n) time.

This result is sharp: there is a sequence of unions that actually requires Omega(n * log n) renamings:
  for (int i = 1; i <= log(n); i++)
    for (int j = 0; j < n / 2^i; j++)
      union(j * 2^i, j * 2^i + 2^{i - 1})
The number of renamings is log n * n / 2: in every round n/2 of the nodes get a new name. The factor two difference with the upper bound comes from the way the upper bound was proven: though every individual element may have to be renamed log n times, it is not possible that all elements are renamed that often. Mostly we do not care about such small factors.

Maintaining the size of the sets is trivial: if two sets are joined, the new size is the sum of the two old sizes. It is trivial to implement this using an additional array s[] for storing the sizes, initialized at 1. If we are very tricky (this is quite ugly hacking) then we can also do without this extra array: Normally, we find at position i of a[] the index of the subset to which node i belongs. If a[i] = i, then we can also flag this by just putting a[i] = -1 (or any other value that does not lie in the range [0, n - 1]). But then we can also store there the size of the subset of i. Here we use that we have one spare bit. If n is extremely large, needing all bits, then this idea does not work. In code this may be implemented as follows:

  void initialize() {
    for (int i = 0; i < n; i++) {
      a[i] = -1; 
      b[i] = i; } }

  int find(int k) {
    if (a[k] < 0)
      return k;
    return a[k]; }

  void union(int k1, int k2) {
    int i, j;
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) {
      if (a[k1] < a[k2]) { // the set of k1 is larger
        i = k1; 
        k1 = k2;
        k2 = i; }
      // rename all elements in subset of k1
      a[k2] += a[k1];
      i = k1; 
      do {    
        a[i] = k2;
        j = i;  
        i = b[i]; } 
      while (i != k1); 
      // glue list of k1 into the list of k2 
      b[j] = b[k2];
      b[k2] = k1; } } 
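
For comparison, here is a sketch of the more straightforward variant with a separate array s[] for the sizes, extended with a counter for the renamings. The class name BySizeRenamings, the counter and the two test sequences are introduced only for this illustration; doubling assumes that n is a power of two.

```java
class BySizeRenamings {
    static int[] a, b, s;   // name, successor in circular list, size (valid at names)
    static long renamings;  // number of renamings performed so far

    static void initialize(int n) {
        a = new int[n];
        b = new int[n];
        s = new int[n];
        renamings = 0;
        for (int i = 0; i < n; i++) {
            a[i] = b[i] = i;
            s[i] = 1;
        }
    }

    static void union(int k1, int k2) {
        k1 = a[k1];
        k2 = a[k2];
        if (k1 == k2)
            return;
        if (s[k1] > s[k2]) {  // make k1 the smaller subset
            int t = k1;
            k1 = k2;
            k2 = t;
        }
        s[k2] += s[k1];
        int i = k1, j = k1;
        do {  // rename all elements in the subset of k1
            a[i] = k2;
            renamings++;
            j = i;
            i = b[i];
        } while (i != k1);
        b[j] = b[k2];  // glue the list of k1 into the list of k2
        b[k2] = k1;
    }

    // The sequence from the lower-bound example: repeatedly join sets of equal size.
    static long doubling(int n) {
        initialize(n);
        for (int step = 2; step <= n; step *= 2)
            for (int j = 0; j < n; j += step)
                union(j, j + step / 2);
        return renamings;
    }

    // The sequence union(i - 1, i) that was bad for "add first to second".
    static long chain(int n) {
        initialize(n);
        for (int i = 1; i < n; i++)
            union(i - 1, i);
        return renamings;
    }

    public static void main(String[] args) {
        System.out.println(doubling(1024) + " " + chain(1024));
    }
}
```

With union by size the chain sequence renames only one element per union, while the doubling sequence renames n / 2 elements in each of its log n rounds.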


Implementation Alternatives

We take a slightly closer look at the above implementations. The set of linked lists is realized in an array. In the following section we will see how a set of trees (with links directed towards the roots) is realized in an array. This is highly efficient: it saves memory and time. In this case it is also reasonable to do so.

What is special about the application of union-find, that we are using arrays here to realize linked structures, whereas before we were using a structure built of list nodes linked together? The answer is that in the current case, there is a fixed number of nodes which have keys from 0 to n - 1. This means that we can apply direct addressing: the information for node k, including the information of its next field, is stored at position k of one or more arrays.

In the example above we are working with two arrays. Alternatively, we might also work with one array of objects of the following type:

  class ArrayNode
  {
    int a;
    int b;

    ArrayNode(int i)
    {
      a = -1;
      b =  i;
    }
  }
It becomes even more elegant if a boolean instance variable is added indicating whether a gives the size or the name of the list. An organization with ArrayNode objects is more object oriented than an organization with several arrays.

Which of these two organizations is more efficient depends on the memory access pattern. If there are several arrays, each array may be assumed to stand consecutively in the memory. This allows for speedy access to all information of one kind. This holds more strongly if this information is accessed in a consecutive way than if it is accessed by single accesses such as in the find operation. In an organization with one array of ArrayNodes, the information belonging to each node stands together. This makes it cheaper to access several fields of a node, as is done in the union operation. Of course using ArrayNodes causes some overhead because there is an extra indirection: accessing nodes[i].b is more involved than accessing b[i].

Memory Organization

Tree-Based Implementation

A key feature of find is that it does not need to return any specific name, it just should be the same for all elements belonging to the subset and different for elements of other subsets. Another point is that it is acceptable that a find operation takes more than constant time if this helps to reduce the time of execution for the whole set of unions and finds to perform. This allows for a lot of flexibility, which will be exploited.

Simple Approach

A suitable implementation of the disjoint-subset ADT is by using a set of trees. Initially each node has its own tree of size one. find(k) returns the index of the root of the tree of k. union(k1, k2) hooks the root of one of the involved trees to the root of the other tree, provided these are not the same anyway.

Tree-Based Union Find

This idea can be realized very simply using an array to represent the set of links:

  void initialize() {
    for (int i = 0; i < n; i++)
      a[i] = i; }

  int find(int k) {
    while (a[k] != k)
      k = a[k]; 
    return k; }

  void union(int k1, int k2) {
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) // hook k1 to k2
      a[k1] = k2; }

Here a root k of a tree is characterized by the fact that it has a[k] = k. The good thing is that with the tree-based implementation, there is no need to access all elements of a set. Thus, there is no need to have the additional list structure requiring a second array. Conceptually, using trees is a step further, but practically it is even easier than the array-based implementation.

How about the efficiency? A find requires that one runs up the tree to find the index of the root. A union, once the two finds have been performed, is trivial, it just requires that one new link is created. So, here we have reduced the cost of the union at the expense of more expensive finds. The finds can actually be arbitrarily expensive: if the tree degenerates, it can have depth close to n. In that case finds may take linear time.

Tree-Based Union-Find

Union by Size

As for the previous approach, it is a good idea to maintain for every subset its size and to join the smaller subset to the larger one. In code this requires only small modifications:
  void initialize() {
    for (int i = 0; i < n; i++)
      a[i] = -1;  }

  int find(int k) {
    while (a[k] >= 0)
      k = a[k]; 
    return k; }

  void union(int k1, int k2) {
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) {
      if (a[k1] < a[k2]) { // the set of k1 is larger
        int i = k1; 
        k1 = k2;
        k2 = i; }
      a[k2] += a[k1];
      a[k1] = k2; } }

Here a root k of a tree is characterized by the fact that it has a[k] < 0. In that case -a[k] gives the size of the tree.

Lemma: A tree of depth k has at least 2^k nodes.

Proof: For a tree T, depth(T) denotes the depth of T. The proof goes by induction. To settle the base case, we fix that a tree with one node has depth 0. Now assume the claim is correct for given k. How do the depths develop? If T_1 is joined to T_2, giving a tree T_3, and depth(T_1) < depth(T_2), then depth(T_3) = depth(T_2). If depth(T_1) >= depth(T_2), then depth(T_3) = depth(T_1) + 1. Thus, while performing unions, a new tree of depth k + 1 can only arise when a tree T_1 of depth k is joined to another tree T_2. Because of our clever joining technique, this tree T_2 must have at least as many nodes as T_1. Because of our induction assumption, T_1 has at least 2^k nodes, and thus the new tree has at least 2^k + 2^k = 2^{k + 1} nodes. End.

Corollary: Using trees and performing union-by-size, union takes O(1), while the time for a find is bounded by O(log n).

Proof: The time to perform find(x) is proportional to the depth of node x in its tree. So, assume some node x has depth k > log n. Because the depth of a tree equals the maximum of the depths of all its nodes, this implies that the depth of the tree of x is at least k. According to the lemma this implies that the size n_x of the tree satisfies n_x >= 2^k > 2^{log n} = n, which is impossible. End.

The given bound is sharp: trees of logarithmic depth may really arise when repeatedly performing union for trees of equal size. If we are mainly interested in limiting the depth of our tree, then we can just as well perform union by depth: the shallower of the two trees (if any) is hooked to the other. Doing this, it is even easier to prove that the number of nodes in a tree with depth k is at least 2^k and that consequently the depth is bounded by log n.
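
Union by depth might be implemented as in the following sketch. It uses a second array for the depths instead of the trick with negative entries; the class name UnionByDepth and the field names parent and depth are chosen for this example.

```java
class UnionByDepth {
    int[] parent;  // parent[k] == k characterizes a root
    int[] depth;   // depth[k] is the depth of the tree rooted at k (valid at roots)

    UnionByDepth(int n) {
        parent = new int[n];
        depth = new int[n];
        for (int i = 0; i < n; i++)
            parent[i] = i;  // n trees of one node each, all of depth 0
    }

    int find(int k) {
        while (parent[k] != k)
            k = parent[k];
        return k;
    }

    void union(int k1, int k2) {
        k1 = find(k1);
        k2 = find(k2);
        if (k1 == k2)
            return;
        if (depth[k1] > depth[k2]) {  // make k1 the root of the shallower tree
            int t = k1;
            k1 = k2;
            k2 = t;
        }
        parent[k1] = k2;  // hook the shallower tree to the other
        if (depth[k1] == depth[k2])
            depth[k2]++;  // the depth grows only when both depths were equal
    }

    public static void main(String[] args) {
        UnionByDepth u = new UnionByDepth(8);
        int[][] pairs = { { 0, 1 }, { 2, 3 }, { 4, 5 }, { 6, 7 }, { 0, 2 }, { 4, 6 }, { 0, 4 } };
        for (int[] p : pairs)
            u.union(p[0], p[1]);
        System.out.println(u.depth[u.find(0)]);
    }
}
```

The sequence in main repeatedly joins trees of equal depth, which is the situation in which the logarithmic depth is actually reached.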

Path Contraction

If we compare the tree-based approach with the array-based approach (both with union-by-size or by depth), then we see that we have one constant time and one logarithmic time operation in each case. This appears equally good, but one should realize that there may be arbitrarily many finds, whereas the number of non-trivial unions is limited to n - 1. So, the tree-based idea as it is should be considered to be inferior. However, it can be made much better.

The only further algorithmic idea in this domain is path contraction. That means, that when we are performing find(k), after we have found that find(k) = r, we start once more at k and link all nodes on the path directly to r. This makes the individual finds twice as expensive, but has a very positive impact on the structure of the tree. The idea that expensive operations lead to an improvement of a search structure is not limited to union-find. Similar ideas are also applied for search trees and priority queues. Notice that even a union operation involves two finds, so even a union may lead to changes in the trees more than just hooking one to the other.

Using the same initialize as before, the code for find now looks as follows:

  int find(int k) {
    int l = k;
    while (a[l] >= 0)
      l = a[l];
    // Now l == find(k)
    while (a[k] >= 0) { // node 0 may occur as a parent too
      int m = a[k];
      a[k] = l;
      k = m; }
    // Now all nodes on the path point to l
    return l; }


The idea of path contraction is that we invest something extra right now, in order to exclude that in the future we have to walk this long way again.

Union-by-size/depth and path contraction

There are alternative implementations of this idea. We can also save the second run, by keeping a trailer and just reducing the depth of the search path by a factor two. This reduces the number of elements to address and may therefore be slightly faster in practice:

  int find(int k) {
    if (a[k] < 0)
      return k;
    int l = a[k];
    while (a[l] >= 0) {
      a[k] = a[l];
      k    = l;
      l    = a[l]; }
    return l; }

The combination of union-by-size (or by height, although the height information may become inaccurate due to the finds) and path contraction leaves very little to desire. A partial analysis is given separately in the next section. Even understanding how good exactly the algorithm is is not trivial.

Theorem: The time for an arbitrary sequence of m unions and finds is bounded by O((n + m) * log* n).

Here log*, pronounced log-star, is the function informally defined as "the number of times the log function must be applied to reach 1 or less". More formally, log* n = min{i >= 0| log^(i) n <= 1}. Here log^(i) denotes the function obtained by applying the log function i times. More generally, for any function f, f^(i) is defined by
f^(0)(n) = n
f^(i)(n) = f(f^(i - 1)(n)), for i >= 1.
log* 1 = 0, log* 2 = 1, log* 4 = 2, log* 16 = 3, log* 65536 = 4, log* 2^65536 = 5. In practice log* cannot be distinguished from a constant. The actual result is even much stronger: the time for any sequence of m >= n unions and finds is bounded by O(m * alpha(m, n)), where alpha(,) is called the inverse Ackermann function. For any m slightly larger than linear, alpha(m, n) is constant.
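
The listed values can be reproduced with a few lines of code. The following sketch is restricted to integers n >= 1 and avoids floating-point logarithms: it iterates the rounded-up logarithm instead of the exact one, which gives the same number of steps because all the thresholds 1, 2, 4, 16, 65536, ... are integers. The names LogStar, ceilLog2 and logStar are chosen for this example.

```java
class LogStar {
    // Smallest integer e with 2^e >= n, for n >= 1.
    static int ceilLog2(long n) {
        return 64 - Long.numberOfLeadingZeros(n - 1);
    }

    // Number of times the logarithm must be applied to n to reach 1 or less.
    static int logStar(long n) {
        int i = 0;
        while (n > 1) {
            n = ceilLog2(n);
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        long[] examples = { 1, 2, 4, 16, 65536, 65537 };
        for (long n : examples)
            System.out.println("log* " + n + " = " + logStar(n));
    }
}
```

Since a long has only 64 bits, this sketch can never return more than 5, which illustrates once more that log* is a constant for all practical purposes.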

It costs very little extra to perform the union by size, but still one may wonder whether it is necessary. Possibly just performing the finds with path contraction might be enough. This makes the union procedure even simpler and faster. Trying examples suggests that this has almost no negative impact on the time of the finds. However, doing this, there is a sequence of n unions and n finds requiring Omega(n * log n) time. This shows that, at least in theory, one really needs the combination of union-by-size and path contraction to obtain the best achievable performance.

Analysis

In this section we partially analyze the extremely good performance of tree-based union-find using union-by-size and path contraction, proving the above theorem. However, as already announced, the actual result is even much better, being formulated in terms of the inverse Ackermann function, which will be considered first.

Ackermann Function

Ackermann's function is defined as follows:
A(1, j) = 2^j, for j > 0
A(i, 1) = A(i - 1, 2), for i > 1
A(i, j) = A(i - 1, A(i, j - 1)), for i, j > 1.
Ackermann's function grows terribly fast. It is instructive to fill in a small square of values. Now one can define alpha(m, n) = min{i >= 1| A(i, m / n) > log n}. Because of the growth rate of A(i, j), alpha(m, n) is for all practical purposes bounded by 4, even for m / n = 1.
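To get a feeling for these definitions one can evaluate them directly. The sketch below makes a few assumptions that are not part of the text: values that do not fit in a long are represented by the constant INF, m / n is rounded down, and log n is replaced by its integer part (which does not change the comparison, because A(i, j) is an integer).

```java
class Ackermann {
    static final long INF = Long.MAX_VALUE;   // stands for "too large for a long"

    // A(1, j) = 2^j, A(i, 1) = A(i - 1, 2), A(i, j) = A(i - 1, A(i, j - 1)), for i, j >= 1.
    static long ackermann(int i, long j) {
        if (j >= 63) return INF;              // A(i, j) >= 2^j no longer fits in a long
        if (i == 1) return 1L << j;
        if (j == 1) return ackermann(i - 1, 2);
        return ackermann(i - 1, ackermann(i, j - 1));
    }

    // alpha(m, n) = min{i >= 1 | A(i, m / n) > log n}, assuming m >= n >= 1.
    static int alpha(long m, long n) {
        long logN = 63 - Long.numberOfLeadingZeros(n);   // integer part of log n
        int i = 1;
        while (ackermann(i, m / n) <= logN) i++;
        return i;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 3; i++)
            for (int j = 1; j <= 3; j++) {
                long a = ackermann(i, j);
                System.out.println("A(" + i + ", " + j + ") = " + (a == INF ? "huge" : String.valueOf(a)));
            }
        System.out.println("alpha(n, n) for n = 2^20: " + alpha(1L << 20, 1L << 20));
    }
}
```

Already A(2, 4) = 2^65536 and A(3, 2) are reported as "huge" by this sketch, which is the reason why alpha(m, n) never exceeds 4 for inputs of any realistic size.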

Values of the Ackermann Function

Theorem: The time for an arbitrary sequence of m unions and finds is bounded by O(m * alpha(m, n)) provided m >= n. This bound is sharp.

The proof of this is technical and omitted. Instead we prove the weaker theorem stating that the time is bounded by O(m * log* n), which is still incredibly strong in itself. Comparing the two theorems, we see that the "weakness" of the first is most notable for slightly larger values of m: Because A(2, j) = 2^2^ ... ^2, with in total j exponentiations, we have that A(2, log* n) >= n. Thus, if m / n > log* n, then A(2, m / n) > n > log n, so alpha(m, n) <= 2. Thus we find that already for m which are only a little bit larger than n, m operations can be performed in O(m) time and not something that is super-linear in m.

Using Ranks

We first slightly modify the algorithm. Instead of hooking by size, we are going to hook by rank. The rank of a tree, maintained at its root, is the depth of the tree without considering the effect of the path compressions. That is, it gives the depth the tree would have if only the unions had been performed. It is easy to maintain the ranks: hooking a tree with rank r_1 to a tree with rank r_2 gives resulting rank r_2 if r_1 < r_2 and r_2 + 1 if r_1 = r_2. So, just like union-by-size, union-by-rank can be performed in constant time once the roots of the trees are found.

Let us summarize the algorithm:
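One possible way to write the algorithm down is the following Java sketch. It is an illustration under assumptions of our own: a root is marked by parent[k] == k (instead of the negative entries used in the array a[] earlier), the ranks are kept in a separate array, and the find uses the two-run variant of path contraction.

```java
class UnionFindByRank {
    final int[] parent;   // parent[k] == k marks a root (representation chosen for this sketch)
    final int[] rank;     // only meaningful for roots

    UnionFindByRank(int n) {
        parent = new int[n];
        rank = new int[n];                    // all ranks start at 0
        for (int k = 0; k < n; k++) parent[k] = k;
    }

    int find(int k) {
        int root = k;
        while (parent[root] != root) root = parent[root];   // first run: locate the root
        while (parent[k] != root) {                          // second run: hook the path to the root
            int next = parent[k];
            parent[k] = root;
            k = next;
        }
        return root;
    }

    void union(int x, int y) {
        int rx = find(x), ry = find(y);
        if (rx == ry) return;                                // already in the same subset
        if (rank[rx] > rank[ry]) { int t = rx; rx = ry; ry = t; }
        parent[rx] = ry;                                     // hook the lower rank below the higher
        if (rank[rx] == rank[ry]) rank[ry]++;                // equal ranks: the new root grows by one
    }

    public static void main(String[] args) {
        UnionFindByRank uf = new UnionFindByRank(8);
        uf.union(0, 1); uf.union(2, 3); uf.union(0, 2);
        System.out.println(uf.find(0) == uf.find(3));        // same subset
        System.out.println(uf.find(4) == uf.find(0));        // different subsets
    }
}
```

Note that find never touches rank[], in line with the definition of the rank as the depth the tree would have without path contraction.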

Ranks may only change with unions, so for proving results on the numbers of nodes with given ranks, we can forget the path compressions.

Lemma: When only performing a set of unions, a node of rank r has at least 2^r descendants (counting the node itself as well).

Proof: The proof goes by induction over time t. In other words, we reformulate the claim as an invariant property: at all times any tree with rank r has at least 2^r nodes. First we check that the claim is true for t = 0: initially all trees have size 1 and rank 0, so this is ok. At any later time step two trees are hooked together. If the new tree keeps the rank of its root, then it has only gained nodes, so the invariant cannot be violated. The rank only increases when two trees with the same rank r are hooked. Each of them has at least 2^r nodes because of the invariant property. The resulting tree thus has at least 2^{r + 1} nodes, preserving the invariant. End.

Lemma: The ranks decrease strictly monotonically on a path away from the root.

Proof: Because of the union-by-rank rule, this is obvious as long as no path contraction is performed. However, if a node v is a descendant of w after path contraction, then it was already a descendant of w before path contraction, and thus v must have smaller rank than w. End.

Lemma: There are at most n / 2^r nodes of rank r.

Proof: Consider a node of rank r. Without path compression it would be the root of a subtree of size at least 2^r. All these subtrees are disjoint. The path compression has no consequences for the rank, and the unions are also performed independently of them, as they only consider ranks and not the actual depths. So, there can be at most n / 2^r nodes of rank r. End.

Counting Trick

Even knowing the above, it is not easy to prove the main result directly. Mostly a proof of a bound of f(n, m) on the running time of an algorithm performing m operations on a structure with n elements can be given by doing one of the following:

In our case there is a relatively rare and interesting mixture of both proof techniques: Costs will be both accounted to the operations and to the elements. As we are interested in minimizing the sum of the two cost factors, we will at the end choose things so that both contributions are approximately equal.

Theorem: The time for an arbitrary sequence of m unions and finds is bounded by O((n + m) * log* n).

Proof: The proof is slightly technical but the idea is really simple and beautiful. The ranks are somehow divided in rank groups consisting of consecutive ranks. F(g) gives the largest rank in group g. So, group g comprises all of the F(g) - F(g - 1) ranks F(g - 1) + 1, ..., F(g). Let G(n) be the total number of rank groups.

Consider a find starting in a node v and leading to the root of its tree r. Whenever we follow a link (w, w') leading to another rank group, or when w' = r, or when w = r (the final step), we account one cost unit to the find operation. Because there are only G(n) rank groups and 2 final steps, this allocates at most G(n) + 2 cost units to any find. Notice that this result holds independently of the path-contraction we are performing.

For following all other links (that is all non-final links within a rank group), the cost unit is accounted to the node w and not to the find operation. In order to bound these costs over the course of the operations, we need that we apply path-contraction. Because of this, we know that w will get a new link, leading to a node w'' higher in the tree. From one of the lemmas above, we know that the rank of w'' is strictly larger than the rank of w'. Thus, if w belongs to rank group g, we are allocating at most F(g) - F(g - 1) cost units to w under this rule. Notice that the above argument applies just as well to the alternative path-contraction technique in which the path length is only halved.

The total cost is now bounded by

m * (G(n) + 2) + sum_{g = 0}^{G(n) - 1} #{nodes in rank group g} * (F(g) - F(g - 1))
The number of nodes in rank group g, starting with rank F(g - 1) + 1, can easily be estimated because we know that there are at most n / 2^x nodes with rank x. This gives that there are at most sum_{r = F(g - 1) + 1}^{F(g)} n / 2^r <= n / 2^{F(g - 1)} nodes in rank group g. So, our formula becomes
m * (G(n) + 2) + sum_{g = 0}^{G(n) - 1} n * (F(g) - F(g - 1)) / 2^{F(g - 1)}
which is at most
m * (G(n) + 2) + n * sum_{g = 0}^{G(n) - 1} F(g) / 2^{F(g - 1)}
A clever choice is F(0) = 0 and F(g) = 2^{F(g - 1)}. Then every term of the sum is at most 1, so the left term gives O(m * log* n), and the right term only n * G(n) = O(n * log* n). End.
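The rank groups resulting from this choice of F can be listed with a small program. The sketch below uses names of our own (F, group, numberOfGroups) and assumes that the ranks occurring for n elements are 0, ..., round_down(log n), as follows from the lemma on the number of nodes of rank r.

```java
class RankGroups {
    // Largest rank F(g) in group g: F(0) = 0 and F(g) = 2^F(g - 1).
    static long F(int g) {
        long f = 0;
        for (int i = 1; i <= g; i++)
            f = (f >= 63) ? Long.MAX_VALUE : (1L << f);   // cap values that overflow a long
        return f;
    }

    // Index of the group containing rank r: the smallest g with F(g) >= r.
    static int group(long r) {
        int g = 0;
        while (F(g) < r) g++;
        return g;
    }

    // Number of groups needed to cover the ranks 0, ..., round_down(log n), for n >= 1.
    static int numberOfGroups(long n) {
        long maxRank = 63 - Long.numberOfLeadingZeros(n);
        return group(maxRank) + 1;
    }

    public static void main(String[] args) {
        for (int g = 0; g <= 5; g++)
            System.out.println("F(" + g + ") = " + F(g));
        System.out.println("groups for n = 65536: " + numberOfGroups(65536));
    }
}
```

The program prints F(0), ..., F(5) = 0, 1, 2, 4, 16, 65536, so group 4 consists of the ranks 5, ..., 16. The number of groups equals log* n up to an additive constant.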

Rank Groups

Exercises

  1. We consider array-based union-find for a set of 8 elements, using arrays a[] and b[] as in the examples of the text. As union strategy we either use first-to-second or by-size. The following union operations are performed: (3, 5), (5, 1), (1, 2), (0, 7), (3, 1), (2, 5), (7, 6). For each of the union strategies, give the complete sequence of resulting a[] and b[] values and indicate for each operation the number of renamings.

  2. We consider tree-based union-find for a set of 8 elements, using an array a[] as in the examples of the text. As union strategy we either use first-to-second or by-size. The following union operations are performed: (3, 5), (5, 1), (1, 2), (0, 7), (3, 1), (2, 5), (7, 6). For each of the union strategies, give the complete sequence of resulting a[] values and also draw the corresponding sets of trees.

  3. For array-based union-find applying union-by-size it was shown that each element could be renamed at most log n times and that therefore the total number of renamings is bounded by n * log n. In the given example the number of renamings is bounded by n / 2 * log n. Prove that this latter value is sharp. That is, show that the maximum number of renamings is bounded by n / 2 * log n. You may assume that n = 2^l for some positive l. Hint: use an argument involving a potential function. Denoting the size of the subset of element i by n_i, the potential of a set of subsets is given by sum_{0 <= i < n} log (n / n_i) / 2. Compute the potential for the initial and the final situation and show that the number of renamings during any union operation is bounded by the decrease of the potential.

  4. Rewrite the tree-based union-find without path contraction so that the union is performed by depth and not by size. Prove that also in this case the depth is bounded by O(log n).

  5. We consider tree-based union-find for a set of 10 elements. As union strategy we use first-to-second. For the finds we use path contraction. The following union operations are performed: (3, 5), (9, 4), (5, 2), (8, 4), (0, 7), (7, 4), (4, 1), (1, 2), (2, 6). Draw the resulting tree. Now the following find operations are performed: 4, 5 and 0. Draw the tree after each operation.

  6. We consider tree-based union-find with first-to-second union strategy and finds with path contraction. The path contraction hooks all nodes on the search path directly to the root of the tree. One time unit is counted for each traversed tree link. Give a sharp upper bound for the time of any choice of m consecutive find operations. So, these finds are not interrupted by unions. Prove the correctness of the given bound. For which m is the amortized time per find operation constant?

  7. We consider tree-based union-find with first-to-second union strategy and finds with path contraction. For the path contraction the alternative strategy is applied, hooking each node on the search path to its grandparent. One time unit is counted for each traversed tree link. Give a sharp upper bound for the time of any choice of m >= n consecutive find operations. So, these finds are not interrupted by unions. Prove the correctness of the given bound. For which m is the amortized time per find operation constant?

  8. We consider tree-based union-find with first-to-second union strategy and finds with path contraction. The path contraction hooks all nodes on the search path directly to the root of the tree. Construct a sequence of O(n) unions and finds taking Omega(n * log n) time. Hint: first look for trees of depth k = 3, 4, ..., which more or less are found back after one union and one find.

  9. We consider tree-based union-find with first-to-second union strategy and finds with path contraction. For a set of n elements, the following unions are performed: (i, i + 1), for all 0 <= i <= n - 2. Show the resulting tree for n = 16. The path contraction is performed with the alternative method, traversing the path to the root only once. Draw the resulting trees after performing find(4) and find(0). In general, when performing find(k) for an element k lying at distance d from the root of its tree, at what distance does it exactly lie after the find operation?

  10. Describe a schedule of unions and finds, performed by-rank and with path compression, leading to a somewhat larger time consumption. Hint: Start with a very basic goal, like "how can we assure that during n finds at least 2 * n links are traversed?".

  11. The given program can be used for testing the performance of the four combinations of union-by-size and path contraction:

    First build tree structures, performing n - 1 union operations by picking at any time two of the remaining roots at random. Then perform k * n find instructions. Count the number of links traversed until reaching the roots. The cost measure is the average number per operation over the last n find instructions.

    Perform the above tests for one given large n, for example n = 2^25, and consider how the numbers develop for k = 1, 2, 3, ... . Perform the above tests for k = 1 and n = 2^x, for x = 10, 11, 12, ... . The experiments for small x must be repeated sufficiently often. Plot the results as a function of x. Which strategy appears to be the best choice in practice considering both performance and overhead?

  12. When performing tree-based union-find, it is no longer possible to efficiently print an overview of the elements in all subsets. In the given program, the method print has high complexity. Specify this complexity as a function of n, the number of elements. Now write a procedure, either in Java or in pseudo-code, which computes out of the information available in the array a[] an array b[]. b[] is defined as follows: b[i] contains the successor of i in a circular list containing all elements of the subset to which i belongs. The given procedure should have linear complexity.





Dictionaries

A dictionary is an ADT supporting the operations find, insert and delete. Sometimes this data structure is also called a search tree. The best-known dictionary tree implementation guaranteeing O(log n) time consumption is the AVL tree, which is not considered here. Another well-known structure is the 2-3 tree. Here we consider the immediate generalization of it, the a-b trees. The analysis of their amortized behavior is particularly enlightening. Then we consider two less organized structures which nevertheless guarantee good amortized or expected behavior.

a-b Trees

Definition

a-b trees provide an interesting alternative to AVL trees. They are characterized by the following properties: Here a and b are parameters with a >= 2 and b >= 2 * a - 1. The depth of an a-b tree with n leafs is at least round_up(log_b n) and at most round_down(log_a n). If a > 2, then some special cases arise for trees with 2, ..., a - 1 leafs. Adding dummy elements makes life easier. In the following description we first consider the case a = 2, b = 3, being the smallest values for which the idea works.

The internal nodes hold one or two keys guiding the searching process. In a binary node there is one key k, being larger than or equal to the values in the left subtree and smaller than the values in the right subtree. In a ternary node there are keys k_1 and k_2: k_1 being larger than or equal to the values in the left subtree and smaller than the values in the middle subtree; k_2 being larger than or equal to the values in the middle subtree and smaller than the values in the right subtree.

2-3 Tree

Find

Searching is just as easy as in a binary search tree. Suppose we are looking for a value x. Then, in a binary node with a single key k, we perform
  if      (x <= k)
    goleft();
  else
    goright();
In a ternary node, with keys k_1 and k_2, we perform
  if      (x <= k_1)
    goleft();
  else if (x <= k_2)
    gomiddle();
  else
    goright();
In this way we continue until we reach a leaf. There we test whether the key is equal or not and perform a suitable operation in reaction.
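The two code fragments can be combined into one search routine. The following Java sketch is an illustration only: the classes Leaf and Internal, the array layout of the keys and children, and the hand-built example tree are assumptions made here, not the implementation used in the course.

```java
class TwoThreeFind {
    static abstract class Node { }

    static class Leaf extends Node {
        final int key;
        Leaf(int key) { this.key = key; }
    }

    static class Internal extends Node {
        final int[] keys;     // one key for a binary node, two for a ternary node
        final Node[] child;   // keys.length + 1 children, from left to right
        Internal(int[] keys, Node... child) { this.keys = keys; this.child = child; }
    }

    // Walks down from v: child i is chosen for the first i with x <= keys[i], else the last child.
    static boolean contains(Node v, int x) {
        if (v == null) return false;
        while (v instanceof Internal) {
            Internal u = (Internal) v;
            int i = 0;
            while (i < u.keys.length && x > u.keys[i]) i++;
            v = u.child[i];
        }
        return ((Leaf) v).key == x;           // at the leaf we test for equality
    }

    // A small 2-3 tree with leafs 1, 3, 5, 7, 9, built by hand for the example.
    static Node sample() {
        Node left = new Internal(new int[] {1, 3}, new Leaf(1), new Leaf(3), new Leaf(5));
        Node right = new Internal(new int[] {7}, new Leaf(7), new Leaf(9));
        return new Internal(new int[] {5}, left, right);
    }

    public static void main(String[] args) {
        Node root = sample();
        System.out.println(contains(root, 5));
        System.out.println(contains(root, 4));
    }
}
```

Treating the keys as an array makes the binary and the ternary case identical, and the same loop would work for the internal nodes of any a-b tree.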

The construction and these more elaborate comparisons imply a certain overhead, but all the rest becomes much simpler because of this. When implementing a dictionary ADT based on 2-3 trees, it is a good idea to have different classes for internal nodes and leafs, as they allow very different operations and have different fields as well.

Insert

For an insertion, we search where the node should come. If it is not there, we create a new leaf with the appropriate key. If the internal node to which it should be attached has degree two so far, then everything is fine: the new leaf is attached, and we are done. Otherwise, this parent node (which was on the point of getting an illegal degree four) is split in two internal nodes of degree two each. The new internal node should be added to the internal node one level up. There we must check the degree and possibly split the node again. In this way we either find a node with degree two and exit, or ultimately split the root in two and add a new root.
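The insertion procedure can be sketched in Java as follows. Several simplifications are assumptions of this sketch rather than part of the text: a single Node class is used for leafs and internal nodes, each node stores the largest key of its subtree (so the guiding key for a child is that child's max), and the counter splits is added only for experimenting.

```java
import java.util.ArrayList;
import java.util.List;

class TwoThreeInsert {
    static class Node {
        int max;                                     // largest key in this subtree
        final List<Node> child = new ArrayList<>();  // empty for a leaf
        Node(int key) { max = key; }
        boolean isLeaf() { return child.isEmpty(); }
    }

    Node root;    // null for the empty set
    int splits;   // number of node splits performed so far

    void insert(int x) {
        if (root == null) { root = new Node(x); return; }
        if (root.isLeaf()) {                         // a single leaf: put an internal root above it
            if (root.max == x) return;
            Node leaf = root;
            root = new Node(Math.max(leaf.max, x));
            root.child.add(leaf);
            root.child.add(x < leaf.max ? 0 : 1, new Node(x));
            return;
        }
        Node sibling = insert(root, x);
        if (sibling != null) {                       // the root was split: add a new root
            Node old = root;
            root = new Node(sibling.max);
            root.child.add(old);
            root.child.add(sibling);
        }
    }

    // Inserts x below the internal node v; returns the new right sibling if v was split, else null.
    private Node insert(Node v, int x) {
        int i = 0;
        while (i < v.child.size() - 1 && x > v.child.get(i).max) i++;
        Node c = v.child.get(i);
        if (c.isLeaf()) {
            if (c.max == x) return null;             // key already present
            v.child.add(x < c.max ? i : i + 1, new Node(x));
        } else {
            Node sibling = insert(c, x);
            if (sibling != null) v.child.add(i + 1, sibling);
        }
        Node right = null;
        if (v.child.size() == 4) {                   // illegal degree four: split into 2 + 2
            right = new Node(0);
            right.child.add(v.child.remove(2));
            right.child.add(v.child.remove(2));
            right.max = right.child.get(1).max;
            splits++;
        }
        v.max = v.child.get(v.child.size() - 1).max;
        return right;
    }

    boolean contains(int x) {
        Node v = root;
        if (v == null) return false;
        while (!v.isLeaf()) {
            int i = 0;
            while (i < v.child.size() - 1 && x > v.child.get(i).max) i++;
            v = v.child.get(i);
        }
        return v.max == x;
    }

    int height() {                                   // number of links from the root to a leaf
        int h = 0;
        for (Node v = root; v != null && !v.isLeaf(); v = v.child.get(0)) h++;
        return h;
    }

    public static void main(String[] args) {
        TwoThreeInsert t = new TwoThreeInsert();
        for (int x = 1; x <= 8; x++) t.insert(x);
        System.out.println("height " + t.height() + ", splits " + t.splits);
    }
}
```

Inserting 1, ..., 8 in increasing order yields a tree of height 3 in which all internal nodes are binary. The tree only gets higher in the public insert method, at the moment the root is split.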

Delete

Deletions can be performed by just marking deleted nodes and rebuilding the structure if it gets too polluted. However, in this case, there is a rather simple inverse of the insertion operation. First we look for the element to delete. If we have found it, then there are three cases to distinguish for deleting a child w from node v:

It is important to point out that a-b trees grow or shrink by adding or deleting the root. So, changes of height arise at the top, not at the bottom as in AVL trees.

It is essential that 3 + 1 = 2 + 2 and that 1 + 2 = 3. The first implies that, once the maximum degree of a node is exceeded, it can be split in two nodes with degree within the legal range; the second implies that when two nodes are fused the resulting node cannot have too many children from the start. Thus, we could proceed similarly for a 3-5 tree, because 5 + 1 = 3 + 3 and 3 + 2 = 5. But, a 3-4 tree would be clumsy, because 4 + 1 = 5 < 3 + 3. For this reason we have required for a-b trees that b >= 2 * a - 1, assuring that b + 1 >= 2 * a and that a + a - 1 <= b.

Insert and Delete on 2-3 Tree

Amortized Performance

2-3 trees have the problem that after splitting a node (as the result of an insertion) the new nodes have degree 2, which is the minimum. Therefore, it may happen that a subsequent deletion immediately leads to a fusion again. These structural changes may go all the way up to the root every time. For example, this happens if we have a 2-3 tree with n = 2^k leafs, for some k >= 1, which has only binary internal nodes. Deleting the largest element causes all the nodes on the rightmost path to fuse; subsequently reinserting this element splits them again. Even though all this takes only logarithmic time, it still means a considerable increase of the cost.

Expensive 2-3 Tree Operations

This does not sound very efficient. The AVL trees gave us (in addition to the time for searching) only O(1) time per operation for restoring the structure. For the 2-3 trees above, it is possible that alternating inserts and deletes restructure the whole path up to the root again and again. We now show that an arbitrary sequence of n insertions and deletions into an arbitrary 2-5 tree with at most n leafs causes at most O(n) restructuring operations. This shows that, in the amortized sense, 2-5 trees share the good properties of AVL trees: O(1) time per operation (in addition to the time for searching).

The idea is simple and provides a very nice example of an amortized analysis. For a-b trees, we will call nodes with degree a or b critical. All other nodes are said to be non-critical. In the case of 2-5 trees, splitting a node happens when a child is added to a node which had degree 5 before. The result is two nodes of degree 3. These resulting nodes cannot be split or fused immediately again. So, here we have a situation where splitting a critical node results in two non-critical nodes. Likewise, when fusing a node, we originally must have had two critical nodes of degree 2 each and the result is a non-critical node of degree 3. So, splitting and fusing can be viewed as an investment: we are achieving more than just maintaining the defining property. This is a typical situation in which an amortized analysis may be effective: a single operation may be expensive, but this can be viewed as an investment. The rest is a matter of good bookkeeping.

In our case we will deposit a token on every node whose degree is changed by addition or removal of a child or when a child is stolen by a sibling. In particular this implies that all critical nodes carry tokens, because any node is non-critical upon creation. The tokens on a node can be viewed as a prepayment for its possibly later restructuring. As an invariant we want to maintain that any critical node has a token. Because initially all internal nodes may be critical, we must be willing to come with an initial investment of up to n - 1 tokens, one for each internal node.

Inserting a node may result in a sequence of splitting operations. The rules are such that only critical nodes get split, so the cost of this is covered by consuming the deposited tokens. The sequence of splittings ends either by creating a new root or by increasing the degree of a node. In each case we pay one token.

Deleting a node may result in a sequence of fusion operations. The rules are such that v and v' get fused only when neither of them had more than 2 children before the deletion. So, each of them was critical and had tokens. These tokens are consumed to cover the cost of the fusion. The sequence of fusions may end in two possible ways. Either the degree of the node v in which the fusion ends is decreased, or v steals a child from a sibling v'. In the first case a token is deposited on v, in the second case on v'.

Summarizing, we see that if we start in a situation with a token on each critical node that then

So, the time is proportional to the number of deposited tokens, which equals the number of critical nodes in the beginning + one token for each operation. For any sequence of n operations we need at most 2 * n tokens in total: 2 tokens per operation.

Alternative Analysis

The given analysis using tokens is simple and works well in this case. However, it is not always possible to work with tokens, because they cannot be used in a flexible way. A token-based analysis only works if we can assure that every node addressed has tokens; it is not good enough to know that there are tokens somewhere. The more general idea is to work with a potential, a non-negative function of the data structure. For a data structure with 0 elements the potential should be 0. In our case the potential could be the number of critical nodes. The amortized cost of an operation is defined to be
t_amortized = t_actual - potential_before + potential_after.

Lemma: If, for some data structure with n elements, t(n) gives an upper bound on the amortized time for an operation and p(n) an upper bound on the used potential function, then any sequence of m >= p(n) / t(n) operations takes at most 2 * m * t(n) time.

Proof: For any sequence of m operations with actual times t_actual(i), 1 <= i <= m, we have

    T_tot =  sum_{i = 1}^m t_actual(i)
          =  sum_{i = 1}^m (t_amortized(i) + potential(i - 1) - potential(i))
          =  sum_{i = 1}^m  t_amortized(i) + potential(0)  - potential(m)
          <= sum_{i = 1}^m  t_amortized(i) + potential(0)
          <= m * t(n) + p(n).

Here potential(i) denotes the potential after operation i, and potential(0) the potential at the start. For m >= p(n) / t(n), p(n) <= m * t(n), and thus, for these m, T_tot <= 2 * m * t(n). End.

The lemma implies, that the given definition of amortized time is closely related to the intuitive definition of amortized time as being the average time per operation over a sufficiently long sequence of operations. It also expresses the value of m that must be taken in order to assure an asymptotically optimal claim: m must be Omega(p(n) / t(n)). When defining a potential, the main objective is that it gives a small value of t(n). But, in order to obtain the strongest performance claims, even the value of p(n) should be taken into account.

In the case of the 2-5 trees, the actual time of an operation can be equated to the number of nodes addressed. When inserting, a certain number h of nodes is split and somewhere one node is addressed which is not split. So, t_actual = h + 1. These h nodes were all critical and the splitting does not introduce new critical nodes. So, this reduces the potential by h. The node at the top may be made critical, so

t_amortized_insert = h + 1 - h + 1 = 2.
When deleting a node, along a path of length h nodes get fused. These 2 * h nodes were all critical before. Here we are using in an essential way that we first try to steal a child before fusing. At the top one node is addressed which is not fused, but which may become critical. This gives
t_amortized_delete = 2 * h + 1 - 2 * h + 1 = 2.

Here both approaches can be applied and both give O(1) amortized time. In general an argument with tokens is clearer and may have didactical advantages where it can be applied. On the other hand, analyzing the amortized time with the definition based on potentials is more mathematical and can also be applied for harder problems.

Insertion Sort

Balanced search trees are suitable for sorting: numbers are inserted successively, and finally they are output in order. This sorting strategy is called insertion sort. a-b trees are particularly suited for this purpose, because it is easy to keep the leafs connected in a doubly linked list. It is convenient that all leafs lie at the same level and that the tree is not growing at the bottom. This allows range queries, operations like "output all elements with keys between x and y", to be performed efficiently. Outputting all elements in order is a special case of a range query.

Insertion sort has the interesting property that it is adaptive, that is, for easy instances it runs faster than for hard instances. An easy instance in this case means an instance which is already almost sorted. For a more precise statement we define the notion of an inversion. In a sequence of numbers S_i, an inversion is a situation in which S_i > S_j for i < j. That is, S_i and S_j are wrongly arranged. In a sorted sequence there are no inversions, and reversing a sorted sequence results in one inversion for every pair of numbers, n * (n - 1) / 2. Let generally I(S) denote the number of inversions in a sequence S. We will show that with a slightly modified search procedure a 2-3 tree can be used to perform insertion sort for a sequence S in O(n + n * log(I(S) / n)).
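The definition of an inversion translates directly into a small counting routine. The Java sketch below is meant only to make the definition concrete; it takes quadratic time, and the class and method names are chosen for this illustration.

```java
class Inversions {
    // I(S): the number of pairs i < j with s[i] > s[j].
    static long count(int[] s) {
        long inversions = 0;
        for (int i = 0; i < s.length; i++)
            for (int j = i + 1; j < s.length; j++)
                if (s[i] > s[j]) inversions++;
        return inversions;
    }

    public static void main(String[] args) {
        System.out.println(count(new int[] {1, 2, 3, 4}));   // sorted sequence
        System.out.println(count(new int[] {4, 3, 2, 1}));   // reversed sequence
        System.out.println(count(new int[] {2, 1, 4, 3}));   // almost sorted sequence
    }
}
```

For the sorted sequence the count is 0, for the reversed sequence of length 4 it is 4 * 3 / 2 = 6, and for the almost sorted sequence 2, 1, 4, 3 it is 2.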

There is no need to start the search for an element at the root: if we already have an idea where it should be inserted, then we can try to reduce the time for the searching by starting close to the expected target leaf. If this is successful, then we can get very good performance, because the amortized time for insertions and deletions is constant. In case we want to reduce the time for performing insertion sort on almost sorted sequences, it is natural to assume that the elements are supplied in increasing order. So, then we should start the search at the rightmost leaf. From there the search moves up until the new value is no longer smaller than all keys in the node and then moves down along the usual path.

In an a-b tree all internal nodes have degree at least a >= 2. So, a subtree of height h contains at least 2^h leafs. The search for a value x moves up h + 1 levels to a node v only if all the leafs in the right subtree of v have values larger than x. So, this accounts for at least 2^h inversions. The total time for all searches is given by O(n + sum_i h_i), where h_i + 1 is the number of levels the search for element S_i moves up. We know that

sum_i 2^{h_i} <= I(S).

So, we want to put an upper bound on sum_i h_i under this condition. Because 2^{h_i} is a convex function of h_i, the sum of the h_i is maximized when all h_i have the same value (this argument can be formalized using the Lagrange multiplier method for computing extremal values on a surface). In that case, for all i, 2^{h_i} <= I(S) / n, which implies that h_i <= log(I(S) / n), and thus sum_i h_i <= n * log(I(S) / n).
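As an alternative to Lagrange multipliers, the same bound can be obtained in one line from Jensen's inequality. The following derivation is an added sketch; it only uses that log is a concave function and that sum_i 2^{h_i} <= I(S).

```latex
\[
  \sum_{i} h_i \;=\; \sum_{i} \log 2^{h_i}
  \;\le\; n \cdot \log\Bigl( \frac{1}{n} \sum_{i} 2^{h_i} \Bigr)
  \;\le\; n \cdot \log \frac{I(S)}{n}
\]
```

The first inequality is Jensen's inequality applied to the n values 2^{h_i}; the second uses the condition on the sum and the monotonicity of log.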

A practical disadvantage is that the tree now must have links in both directions. In a normal a-b tree this is not needed, because the way back can be pushed on a stack while searching forward: as splitting and fusion operations are performed only along the search path, this is all one needs to know. So, we need quite an elaborate data structure for solving a simple problem like sorting. On the other hand, this is a result which cannot easily be achieved in another way: most sorting methods (merge sort, quick sort, heap sort) are not adaptive at all. Bubble-sort is adaptive, but in a much weaker sense: bubble-sort is fast when each element stands close to its final position: if the maximum distance is d, then it can be implemented in O(n * d) time. In this case the number of inversions is bounded by O(d * n) and insertion sort takes O(n + n * log d), which is better for any non-constant d. Furthermore, bubble-sort is really bad if a single element has to move far.

Splay Trees

A splay tree is a very weakly balanced binary search tree, offering an alternative to AVL trees. The basic balancing operation of splay trees is the rotation just as for AVL trees. However, no balance information is maintained. The advantage is that one saves overhead, both for storing and for updating. So splay trees are self-adjusting: some kind of balancedness is established without enforcing it. Any single operation (insert, find or delete) on a splay tree with up to n nodes may cost O(n), but, performing m >= n operations, the amortized operation time is bounded by O(log n).

Self-adjustment typically implies that one cannot prevent a bad situation from building up, but once one is performing an exceptionally expensive operation, the structure gets improved. A well-known example of a self-adjusting structure is the union-find structure with path contraction: long paths cannot be excluded, but once one has to traverse such a long path, performing an expensive find operation, all the nodes on the path are hooked directly to the root node, implying that a subsequent find for any of the nodes on the path is much cheaper.

Simple Strategy

Splay trees may have any binary structure, in particular they may consist of a chain of n nodes. Because in this case accessing the deepest node takes linear time, the structure must be changed, because otherwise this operation might be repeated again and again. A simple idea is to turn the accessed element into the root of the tree by a sequence of single rotations. The rationale behind this is that if an element is accessed, it is likely that it will be accessed again. In practice access sequences are mostly not random: rather a small subset of elements is accessed much more frequently than others (as an example one can think of the names the police type into their database).

Splay-Tree Rotation

Unfortunately, this strategy is too simple. If first the keys 0, 1, ..., n - 1 are inserted in increasing order, a tree of depth n - 1 will be the result: every new node is first inserted as a right child of the root, and then a single rotation turns it into the root. This sequence of insertions is fine: it takes O(n) time. However, now accessing the keys in increasing order is very expensive: the time of accessing key i is proportional to n - i, for a total of O(sum_{i = 1}^n i) = O(n^2). The reason is that after searching 0 the depth is still n - 1, and more generally after searching i the depth is n - 1 - i.
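This behavior is easy to observe experimentally. The following Java sketch implements the simple strategy with recursive single rotations and counts the nodes visited by the accesses. All names are chosen for this illustration, and access assumes that the key it looks for is present.

```java
class RotateToRoot {
    static class Node {
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    static long visited;   // number of nodes visited by the accesses so far

    static Node rotateRight(Node t) { Node l = t.left; t.left = l.right; l.right = t; return l; }
    static Node rotateLeft(Node t) { Node r = t.right; t.right = r.left; r.left = t; return r; }

    // Inserts x as in an ordinary search tree and rotates it up to the root.
    static Node insert(Node t, int x) {
        if (t == null) return new Node(x);
        if (x < t.key) { t.left = insert(t.left, x); return rotateRight(t); }
        if (x > t.key) { t.right = insert(t.right, x); return rotateLeft(t); }
        return t;
    }

    // Searches x, which is assumed to be present, and rotates it up to the root.
    static Node access(Node t, int x) {
        visited++;
        if (x < t.key) { t.left = access(t.left, x); return rotateRight(t); }
        if (x > t.key) { t.right = access(t.right, x); return rotateLeft(t); }
        return t;
    }

    static int depth(Node t) {
        return t == null ? -1 : 1 + Math.max(depth(t.left), depth(t.right));
    }

    // Inserts 0, ..., n - 1 and then accesses them in the same order.
    static long sequentialAccessCost(int n) {
        Node root = null;
        for (int i = 0; i < n; i++) root = insert(root, i);
        visited = 0;
        for (int i = 0; i < n; i++) root = access(root, i);
        return visited;
    }

    public static void main(String[] args) {
        for (int n = 100; n <= 800; n *= 2)
            System.out.println("n = " + n + ": " + sequentialAccessCost(n) + " nodes visited");
    }
}
```

Doubling n roughly quadruples the number of visited nodes, as expected for a quadratic total.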

Expensive Splay-Tree Operations

Good Strategy

The above single rotations are apparently too simple to achieve any kind of self-adjustment. We would like expensive accesses to deep-lying nodes to reduce the depth of most nodes on the search path. Cheap accesses to shallow nodes need not lead to an improved structure: we cannot hope to always improve the structure, because if the tree is perfectly balanced, any restructuring can only make it worse.

Now we distinguish three cases:

Notice that the zig and zig-zag cases are treated in the same way as in the simple strategy: in the zig case a single rotation is performed, in the zig-zag case a so-called double rotation, which has the same effect as performing two single rotations. Only in the zig-zig case is the performed operation different. At first glance it is not clear at all that this gives an improvement: it appears to be an unnecessarily unbalanced operation.

Splay Operations for Node X

The above strategy is called splaying. Without providing a proof we can see already now that splaying is much better than performing single rotations. As an example we consider the tree that is constructed by inserting the elements 0, 1, ..., n - 1, in order. This generates a tree of depth n - 1, a chain with only left children. Accessing these nodes now in the same order does not only turn the elements to the root, but also reduces the depth of the tree by almost a factor two with each access.
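The effect on a chain can be tried out with the following Java sketch of bottom-up splaying. It assumes nodes with parent links and uses names of our own; the zig, zig-zig and zig-zag steps are expressed through one helper rotate(x), which moves x one level up.

```java
class SplayDemo {
    static class Node {
        final int key;
        Node left, right, parent;
        Node(int key) { this.key = key; }
    }

    // Single rotation that moves x one level up, over its parent.
    static void rotate(Node x) {
        Node p = x.parent, g = p.parent;
        if (p.left == x) {
            p.left = x.right;
            if (x.right != null) x.right.parent = p;
            x.right = p;
        } else {
            p.right = x.left;
            if (x.left != null) x.left.parent = p;
            x.left = p;
        }
        p.parent = x;
        x.parent = g;
        if (g != null) {
            if (g.left == p) g.left = x; else g.right = x;
        }
    }

    // Moves x to the root and returns it.
    static Node splay(Node x) {
        while (x.parent != null) {
            Node p = x.parent, g = p.parent;
            if (g == null) rotate(x);                                            // zig
            else if ((g.left == p) == (p.left == x)) { rotate(p); rotate(x); }   // zig-zig
            else { rotate(x); rotate(x); }                                       // zig-zag
        }
        return x;
    }

    static int depth(Node t) {
        return t == null ? -1 : 1 + Math.max(depth(t.left), depth(t.right));
    }

    // Chain with only left children: node[n - 1] is the root, node[0] the deepest node.
    static Node[] chain(int n) {
        Node[] node = new Node[n];
        for (int k = 0; k < n; k++) {
            node[k] = new Node(k);
            if (k > 0) { node[k].left = node[k - 1]; node[k - 1].parent = node[k]; }
        }
        return node;
    }

    public static void main(String[] args) {
        Node[] node = chain(101);
        System.out.println("depth before: " + depth(node[100]));
        Node root = splay(node[0]);
        System.out.println("depth after splaying 0: " + depth(root));
    }
}
```

The only difference from the simple strategy is the order of the two rotations in the zig-zig case: first the parent is rotated over the grandparent, then x over the parent.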

Let us state more precisely how the three dictionary operations, find, insert and delete are performed:

Splay-Tree Operations

Amortized Performance

Any of the operations, find, insert and delete, has cost proportional to the cost of the involved splaying operations (one for find and insert, two for delete). Thus, in order to put a bound on the amortized time of the splay-tree operations, it suffices to put a bound on the amortized time of the splaying operation. The cost of the splaying operation is proportional to the number of accessed nodes: accessing node x has cost proportional to the number of nodes on the path leading to x.

It is not easy to bound these costs in an amortized argument. Somehow we must define a potential function which does not increase too much (that is, not more than O(log n)) in any operation, and at the same time strongly decreases when accessing a deep lying node. A first idea might be to take as potential for a tree T the function p(T) = sum_{u in T} depth(u), where depth(u) gives the depth of node u in T. However, this potential may increase too strongly: adding a new smallest or largest value pushes all existing nodes one level deeper. Thus, if there were n nodes, p(T_new) - p(T_old) = n, giving an amortized time of at least n.

An idea which at first glance works better is to take p(T) = sum_{u in T} log(depth(u) + 1). Even this potential may increase too strongly: if the tree T with n nodes is perfectly balanced, then there are (n + 1) / 2 nodes at depth d = log(n + 1) - 1. When a new smallest or largest element is added, all these nodes are pushed one level down. So, just considering these nodes, p(T_new) - p(T_old) >= n / 2 * (log(d + 2) - log(d + 1)) = n / 2 * log(1 + 1 / (d + 1)), which is proportional to n / d and thus far larger than O(log n).

A potential function that really works, is given by

p(T) = sum_{u in T} log(s(u)).
Here s(u) gives the size of the subtree rooted at u, counting u itself as well. Defining the rank of a node u as r(u) = log(s(u)) simplifies the formulation of p():
p(T) = sum_{u in T} r(u).
A very nice property of ranks based on the size of the subtree is that a rotation changes the ranks only for the nodes involved in the rotation. The depths of the nodes in the subtrees change, but their ranks remain the same. Clearly the rank of any node of a tree with n nodes is at most log n. So, p(T) <= n * log n.
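To get a feeling for this potential, the following sketch computes p(T) for a chain and for a balanced tree. The class PotentialSketch and its helper methods chain and balanced are hypothetical names used only here; keys are omitted because only the shape of the tree matters.

```java
final class PotentialSketch {
  static final class Node {
    Node left, right;
  }

  private static double sumOfRanks;

  // Returns s(u) and adds r(u) = log_2 s(u) to the running sum.
  private static int size(Node u) {
    if (u == null) return 0;
    int s = 1 + size(u.left) + size(u.right);
    sumOfRanks += Math.log(s) / Math.log(2);
    return s;
  }

  // p(T) = sum over all nodes u of r(u).
  static double potential(Node root) {
    sumOfRanks = 0;
    size(root);
    return sumOfRanks;
  }

  // A chain with only left children, as produced by sorted insertions.
  static Node chain(int n) {
    Node root = null;
    for (int i = 0; i < n; i++) {
      Node u = new Node();
      u.left = root;
      root = u;
    }
    return root;
  }

  // A tree with n nodes that is as balanced as possible.
  static Node balanced(int n) {
    if (n == 0) return null;
    Node u = new Node();
    u.left = balanced((n - 1) / 2);
    u.right = balanced(n - 1 - (n - 1) / 2);
    return u;
  }

  public static void main(String[] args) {
    int n = 1023;
    System.out.println("chain:    " + potential(chain(n)));
    System.out.println("balanced: " + potential(balanced(n)));
    System.out.println("n log n:  " + n * Math.log(n) / Math.log(2));
  }
}
```

The chain has a potential close to the upper bound n * log n, whereas the potential of the balanced tree is only linear in n. This is the stored "credit" that pays for the expensive accesses to deep nodes.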

Theorem: The amortized time to access any node of a tree with n nodes is bounded by 2 + 3 * log n = O(log n).

Proof (outline): Recall that the amortized time of an operation is given by the actual time plus the change of potential. The change of potential equals the potential after the operation minus the potential before the operation. In our case the total potential is the sum of the potentials of all nodes. Because the potential of the nodes changes only along the path from the root to the node x we are accessing, the amortized time can be written as

t_amortized = sum_{u in path to x} (1 + r_new(u) - r_old(u))
Here we account one cost unit for each accessed node.

We want to rewrite this sum as a sum over the operations: each zig-zig and zig-zag operation is attributed a real cost of 2 cost units, a zig operation is attributed 1 cost unit, and handling the root stands for one cost unit as well. The amortized cost of any of these operations is equal to this cost plus the sum of the new potentials of all involved nodes minus the sum of their old potentials. Thus, by regrouping the sum giving t_amortized, we get

t_amortized = 1 + sum_{operation} amortized_cost_of_operation

The trick is that we will show that the amortized cost of a zig-zag or a zig-zig operation for any node x on the path is bounded by 3 * (r_new(x) - r_old(x)) and that of a zig operation by 1 + 3 * (r_new(x) - r_old(x)). This estimate takes into account the cost for the other involved nodes and the changes of their potentials. Such an estimate is very effective in this case, because even though the splaying gets higher and higher in the tree, it is always the same node x we are working on. Thus, all but two terms of the sum cancel each other: only the negative contribution from the first summand and the positive contribution from the last summand remain. This gives the result of the theorem:

t_amortized <= 1 + 1 + 3 * (r_after(x) - r_before(x)) <= 2 + 3 * log n.
End.

Lemma: The amortized time for a zig operation at node x is bounded by 1 + 3 * (r_new(x) - r_old(x)).

Proof: The amortized time of a zig operation involving nodes x and y as in the picture illustrating the operations, can be written as

t_amortized_zig = 1 + r_new(x) + r_new(y) - r_old(x) - r_old(y).
Using that r_new(y) < r_old(y) gives
t_amortized_zig <= 1 + r_new(x) - r_old(x).
Because r_new(x) > r_old(x) we have r_new(x) - r_old(x) >= 0, and therefore we may add 2 * (r_new(x) - r_old(x)) on the right-hand side. This amount is added to get inequalities of a common form for all three operations. It is desirable to have inequalities of the same form because this allows us to estimate t_amortized by telescoping the sum. End.

For the analysis of the zig-zag and zig-zig operations we need a simple inequality which helps us to estimate the logarithmic contributions:

Lemma: If a, b > 0 and a + b <= c, then log a + log b <= 2 * log c - 2.

Proof: Because the log function is monotonically increasing, it suffices to check the inequality for c = a + b. In that case, the relation to check can be rewritten as log (4 * a * b) <= log ((a + b)^2). This is satisfied if and only if 4 * a * b <= a^2 + b^2 + 2 * a * b, that is, if and only if f(a, b) = a^2 + b^2 - 2 * a * b >= 0. This holds because f(a, b) = (a - b)^2, which assumes its minimum value 0 for a = b. End.

Lemma: The amortized time for a zig-zag operation at node x is bounded by 3 * (r_new(x) - r_old(x)).

Proof: The amortized time of a zig-zag operation involving nodes x, y and z as in the picture illustrating the operations, can be written as

t_amortized_zig-zag = 2 + r_new(x) + r_new(y) + r_new(z) - r_old(x) - r_old(y) - r_old(z).
s_old(z) = s_new(x) and s_old(y) > s_old(x). Thus, also r_old(z) = r_new(x) and r_old(y) > r_old(x). This can be used to eliminate r_old(z) and r_old(y):
t_amortized_zig-zag <= 2 + r_new(y) + r_new(z) - 2 * r_old(x).
Because s_new(y) + s_new(z) < s_new(x), we get, using the above estimate, log(s_new(y)) + log(s_new(z)) < 2 * log(s_new(x)) - 2, which is equivalent to r_new(y) + r_new(z) < 2 * r_new(x) - 2. Substitution gives
t_amortized_zig-zag <= 2 * (r_new(x) - r_old(x)).
Because r_new(x) > r_old(x) we have r_new(x) - r_old(x) >= 0, and therefore we may add r_new(x) - r_old(x) on the right-hand side. This amount is added to get inequalities of a common form. End.

Lemma: The amortized time for a zig-zig operation at node x is bounded by 3 * (r_new(x) - r_old(x)).

Proof: The amortized time of a zig-zig operation involving nodes x, y and z as in the picture illustrating the operations, can be written as

t_amortized_zig-zig = 2 + r_new(x) + r_new(y) + r_new(z) - r_old(x) - r_old(y) - r_old(z).
s_old(z) = s_new(x), s_new(y) < s_new(x), s_old(y) > s_old(x). Thus, also r_old(z) = r_new(x), r_new(y) < r_new(x), r_old(y) > r_old(x). This can be used to eliminate r_old(z), r_new(y) and r_old(y):
t_amortized_zig-zig < 2 + r_new(x) + r_new(z) - 2 * r_old(x).
Because s_old(x) + s_new(z) < s_new(x), we get, using the above estimate, log(s_old(x)) + log(s_new(z)) < 2 * log (s_new(x)) - 2, which is equivalent to r_old(x) + r_new(z) < 2 * r_new(x) - 2. Using this to eliminate r_new(z) gives the claimed result. End.

In this case it appears that we cannot easily replace the argument with the potential function by an equivalent argument involving tokens. The reason is that tokens are supposed to be integral, whereas here we may have a large number of small contributions which together are bounded by O(log n).
Splay trees offer a self-adjusting alternative to AVL and other balanced search trees. An additional advantage is that recently visited nodes stand close to the root. A disadvantage is the large number of performed rotations.

Amortized Cost for Splay Tree

Skip Lists

Description

There are many different dictionary data structures and several of them have practical importance. Some are used because they are simple, others because they have good guaranteed performance, others because they are fast in practice. Splay trees exist because they are simple to implement and because their operations have less overhead than those of AVL trees. There is no longer a worst-case guarantee, but at least we could prove good amortized behavior. In this section we go one step further. We present a data structure with extremely simple operations, requiring very little memory. However, the performance guarantee is even weaker: we will show that the expected time for an operation is bounded by O(log n), but there is no hard guarantee, not even for a sequence of operations.

In binary search the nodes whose values are inspected are found by computation, but the same search pattern may be obtained by using a linked structure with nodes of different degrees. For supporting find operations on the values of a sorted array a[] of length n = 2^k, this structure consists of n nodes. Node 0 has degree k. More generally, the degree of node i > 0 is given by the number of zeroes at the lower end of the binary expansion of i. That is, if i = sum_{0 <= l < k} b_l * 2^l, b_l in {0, 1} for all l, 0 <= l < k, then the degree of node i equals max{j | 0 <= j < k and b_l = 0 for all l <= j} + 1, where the maximum over the empty set is taken to be -1, so that odd i have degree 0. A node of degree j is called a level-j node. A level-j node i has links to the nodes s_l(i), 0 <= l < j, with s_l(i) = i + 2^{j - l - 1}. For example, node 0 has links to the nodes s_0(0) = n / 2, s_1(0) = n / 4, ..., s_{k - 1}(0) = 1. Node n / 2 has links to the nodes s_0(n / 2) = n / 2 + n / 4, s_1(n / 2) = n / 2 + n / 8, ..., s_{k - 2}(n / 2) = n / 2 + 1.

There is 1 level-k node and there are 2^{k - j - 1} level-j nodes for all j, 0 <= j < k. So, the total number of links is k + sum_{j = 1}^{k - 1} j * 2^{k - j - 1} = k + 1 * n / 4 + 2 * n / 8 + 3 * n / 16 + ... + (k - 1) * 1 = n - 1, which is not much: on average there is less than one link per node. This result could also have been obtained by noticing that the obtained structure is a binomial tree with 2^k nodes. Any tree with n nodes has n - 1 links, because each node has indegree 1, except for the root which has indegree 0.

Once the structure has been constructed, it can be used for performing find operations without any further computation by starting at node 0 and then testing the child at which to continue the search. These ideas can be worked out as follows:

  class Node
  {
    int    v;
    int    j;
    Node[] s;
  
    Node(int i, int v, int k)
    {
      this.v = v;
      if (i == 0)
        j = k;
      else
      {
        j = 0;
        while ((i & 1) == 0)
        {
          j++;
          i = i >> 1;
        }
      }
      s = new Node[j];
    }
  
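    boolean find(int x)
    {
      // s[0] is the farthest successor, s[j - 1] the nearest one:
      // continue at the first successor whose value does not exceed x.
      int l = 0;
      while (l < j && s[l].v > x)
        l++;
      if (l == j)
        return x == v; // no successor qualifies, so x can only be here
      return s[l].find(x);
    }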
  }
  
  class FindTree
  {
    int  k;
    Node root;
  
    FindTree(int[] a, int k)
    { // a[] has length 2^k
      this.k = k;
      int n = 1 << k;
      // Work on a sorted copy, so that a[] itself is left untouched.
      int[] b = java.util.Arrays.copyOf(a, n);
      java.util.Arrays.sort(b);
      Node[] nodes = new Node[n];
      for (int i = 0; i < n; i++)
        nodes[i] = new Node(i, b[i], k);
      for (int i = 0; i < n; i++)
      {
        int j = nodes[i].j;
        Node[] s = nodes[i].s;
        for (int l = 0; l < j; l++)
          s[l] = nodes[i + (1 << (j - l - 1))];
      }
      root = nodes[0];
    }
  
    boolean find(int x)
    {
      return root.find(x);
    }
  }

Click here to see the above code fragment integrated in a working Java program.

In the case of binary heaps (presented in the chapter on priority queues) an explicit representation of the tree, based on pointers, is usually replaced by an implicit representation, computing the indices of the children. This is done in order to reduce the memory consumption and to assure that the data of nodes whose indices differ little lie close together in memory. These facts together are the main reason that, in practice, binary heaps are among the best priority queue implementations. Here we do the opposite: a binary search, which can be viewed as a traversal of an implicit tree, is replaced by a search on an explicit tree. The reason to present this alternative search, which on most computer systems will be slower, is that a similar data structure can be used to even support inserts and deletes. In such a dynamic context, binary search cannot be used anymore, because it requires that the array is kept sorted at all times. Deletions might be performed lazily, but keeping an array sorted under insertions may lead to operations taking linear time.

The presented structure is too static for updates. The idea is to relax the precise condition on the degree of the nodes: instead of fixing it based on the index, the level of a node is selected at random when it is inserted, in such a way that the frequency with which level-j nodes occur is the same as before. That is, a newly inserted node should become a level-j node with probability 2^{-j - 1}. This can be realized easily by generating a random k-bit number and determining its number of trailing zero bits. Suppose that when inserting a value x we have chosen to construct a level-j node. Then every link at level j or lower on the search path that points to a value larger than x is redirected to the new node with key value x, and the new node takes over the old target of that link. It is handy to start with two sentinel nodes: one has key value -infinity, the other has key value +infinity. The first of these should have a height which is at least equal to the maximum height of all real nodes.
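The random level selection described above might be sketched as follows. The class and method names are hypothetical, and the cap k <= 30 is an assumption made so that a k-bit number fits in a non-negative int.

```java
import java.util.Random;

final class RandomLevelSketch {
  // Returns level j, 0 <= j < k, with probability 2^{-j - 1}, and level k
  // with the remaining probability 2^{-k}. Assumes 1 <= k <= 30.
  static int randomLevel(Random rnd, int k) {
    int bits = rnd.nextInt(1 << k);          // a random k-bit number
    if (bits == 0)
      return k;                              // all k bits are zero, like index 0
    return Integer.numberOfTrailingZeros(bits);
  }

  public static void main(String[] args) {
    Random rnd = new Random();
    int k = 16;
    int[] count = new int[k + 1];
    for (int t = 0; t < 1 << 20; t++)
      count[randomLevel(rnd, k)]++;
    for (int j = 0; j <= 5; j++)
      System.out.println("level " + j + ": " + count[j]);
  }
}
```

The printed counts should roughly halve from one level to the next, mirroring the frequencies of the static structure.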

Deleting a node with key value x appears to be harder, because now we must find all nodes pointing to a certain level-j node with key x. But this can be achieved with a slightly modified search: normally, if the key of the next node equals the value we are searching for, the search goes there directly, but now the search continues downwards until the node with key x is reached through a link at the lowest level. Every node on this path that has a link to the node to remove takes over the corresponding link of that node.

Analysis

How about the performance? The analysis of the memory usage is simple: the expected number of level j nodes is the same as before. Using that the expected value of a sum equals the sum of the expected values, it follows immediately that the expected number of links is n - 1 = O(n). In an implementation we may choose to increase the degree of all nodes by 1, assuring that in any case each node has a link to the next node in the list. Even when doing this, the number of links is bounded by O(n). The running time is harder to bound.

Lemma: The probability that the height of a skip list with n elements exceeds 2 * log n is at most 1 / n.

Proof: The node levels are chosen independently at random and distributed according to the geometric distribution with parameter 1 / 2. Thus, the probability that a given node has level l > j is at most 2^{-j}. Using that Prob(A or B) <= Prob(A) + Prob(B), we get Prob(maximum height > j) <= n / 2^j. Taking j = 2 * log n, this probability is at most n / n^2 = 1 / n. End.

Theorem: The expected time for an operation on a skip list with n elements is bounded by O(log n).

Proof: Because the time for an insert and a delete is of the same order as a find, we limit our analysis to that of a find. We must show that the path from the node with key -infinity to the node with the value x we are looking for has expected length O(log n). Let X be the random variable giving the length of this path. In any case this X = O(n), so the cases that there are more than 2 * log n levels contribute at most O(1) to the expected value. So, we may assume that there are at most 2 * log n levels. We show that the expected number of steps on any given level is O(1). At each level one vertical step is performed going to the next lower level. So, the vertical steps contribute at most 2 * log n. Let L_i, i >= 1, denote the number of horizontal steps on level i, going from a node at level i to the next node on level i. Because Exp[sum_{i >= 1} L_i] = sum_{i >= 1} Exp[L_i] due to the linearity of expectation, it suffices to prove that there is a constant c so that Exp[L_i] <= c, for all i >= 1. For proving this, it is very handy to view the construction of the skip lists in a different way. Instead of choosing the level of each node independently, it may also be considered to be a repeated selection of subsets: at level 1 the set is S_1 = S, containing all elements. In general, each of the elements in the subset S_i, i >= 1, is also present in S_{i + 1} independently of the others with probability 1 / 2. This gives the same distribution as before. If the search moves through k nodes at level i, then this means that none of the elements corresponding to these was promoted to S_{i + 1} and that the element in position k + 1 was promoted. The probability that this happens is 1/2^{k + 1}. So, the expected number of elements visited at level i is given by 0 * 1/2 + 1 * 1/4 + 2 * 1/8 + ... = 1. End.

A nice feature of skip lists is that memory usage can be traded against performance. Choosing the probability that a level-j node is created smaller, for example 3^{-j}, the memory usage goes down. At the same time the expected access time goes up. A generalization of the above proof shows that when choosing a node to be at level j with probability a^{-j}, a >= 2, the average time for a search is proportional to (a - 1) / log a * log n with an expected memory usage proportional to a / (a - 1) * n. Taking a = 4, the searches become about 50% more expensive, but the memory usage drops from 2 * n to 4 / 3 * n.
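The two factors in this trade-off are easy to tabulate. The following small sketch only evaluates the two expressions given above; the class and method names are made up for the illustration.

```java
final class SkipListTradeOff {
  // Relative search cost (a - 1) / log_2 a; equals 1 for a = 2.
  static double searchFactor(double a) {
    return (a - 1) / (Math.log(a) / Math.log(2));
  }

  // Expected number of links per node, a / (a - 1).
  static double linksPerNode(double a) {
    return a / (a - 1);
  }

  public static void main(String[] args) {
    for (int a = 2; a <= 8; a++)
      System.out.printf("a = %d: search factor %.2f, links per node %.2f%n",
                        a, searchFactor(a), linksPerNode(a));
  }
}
```

For a = 4 this gives a search factor of 1.5 and 4 / 3 links per node, in agreement with the figures quoted above.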

Skip List

Implementation

The main argument in favor of skip lists is their simplicity in combination with low overhead. To check whether there are no hidden complications, we consider an implementation in Java. The nodes consist of a key and an array of pointers. This array has length equal to the level of the node. This gives the following implementation:
  class Node
  {
    int    key;
    int    level;
    Node[] next;
  
    public Node(int key, int level)
    {
      this.key   = key;
      this.level = level;
           next  = new Node[level];
    }
  }
So, a node has a size of level + 3 words. The above analysis has shown that the expected level equals 2, so the expected size of a node is 5 words. This includes the key. This is not much, but it is not better than an AVL tree.

The operations are indeed really simple. The method find is about as complex as a find on a binary tree. At any point of the search there are three possible actions: go one level down in the same node; go to the next node, which is found by following the next-link at this level; quit.

  public boolean find(int x)
  {
    // Checks whether there is a node with key x.
    Node node  = head;
    int  level = maxLevel - 1;
    while (level >= 0)
    {
      while (node.next[level].key < x)
        node = node.next[level];
      if (node.next[level].key > x)
        level--;
      else
        return true;
    }
    return false;
  }

Inserts and deletes are simpler than operations on balanced binary trees because there are no restructurings. So, the cost of an insert or delete is equal to that of a find, except for some assignments. There are two assignments for every level of the node to insert or delete, so the expected number of assignments is 4 per insertion. This is almost negligible in comparison to the cost of testing and following the links.

  public void insert(int x)
  {
    // Inserts a node with key x. 
    // Does not test whether key x exists.
    int l = randomLevel();
    Node node = head;
    Node nwnd = new Node(x, l);
    if (l > maxLevel)
      maxLevel = l;
    int level = maxLevel - 1;
    while (level >= 0)
    {
      while (node.next[level].key < x)
        node = node.next[level];
      if (level < l)
      {
        nwnd.next[level] = node.next[level];
        node.next[level] = nwnd;
      }
      level--;
    }
  }

  public void delete(int x)
  {
    // Deletes a node with key x. 
    // Correct when there is no such node.
    // Each key should occur at most once.
    Node node = head;
    int level = maxLevel - 1;
    while (level >= 0)
    {
      while (node.next[level].key < x)
        node = node.next[level];
      if (node.next[level].key > x)
        level--;
      else
        node.next[level] = node.next[level].next[level];
    }
  }

Click here to see the above code fragments integrated in a working Java program. This implementation is rather slow. The main reason appears to be that there are many indirections: every step the key is compared with the key of a successor node, rather than with the key of the current node as is done in AVL trees. Skip lists also have some very nice features which are harder to realize with other dictionary implementations. Because at the bottom level the structure is simply a linear list, it is trivial to perform a range query in expected time O(log(number_of_entries_in_tree) + number_of_elements_in_range). Because of this it is also no problem to have elements with the same key.
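The fragments above rely on a head node, a counter maxLevel and a method randomLevel that are not shown. The following sketch fills these in under explicit assumptions (a fixed cap MAX_LEVEL, sentinels with keys Integer.MIN_VALUE and Integer.MAX_VALUE, level l chosen with probability 2^{-l}) and adds a hypothetical method range for the range query mentioned above. It is meant as an illustration, not as the course program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SkipListSketch {
  static final int MAX_LEVEL = 32;           // assumed cap on the node levels

  static final class Node {
    final int key;
    final Node[] next;
    Node(int key, int level) {
      this.key = key;
      this.next = new Node[level];
    }
  }

  // Sentinels: real keys must lie strictly between their key values.
  private final Node head = new Node(Integer.MIN_VALUE, MAX_LEVEL);
  private final Node tail = new Node(Integer.MAX_VALUE, MAX_LEVEL);
  private final Random rnd = new Random();
  private int maxLevel = 1;

  SkipListSketch() {
    for (int level = 0; level < MAX_LEVEL; level++)
      head.next[level] = tail;
  }

  // Level l >= 1 is chosen with probability 2^{-l} (up to the cap).
  private int randomLevel() {
    int l = 1;
    while (l < MAX_LEVEL && rnd.nextBoolean())
      l++;
    return l;
  }

  // Same logic as the insert fragment above.
  void insert(int x) {
    int l = randomLevel();
    Node nwnd = new Node(x, l);
    if (l > maxLevel)
      maxLevel = l;
    Node node = head;
    for (int level = maxLevel - 1; level >= 0; level--) {
      while (node.next[level].key < x)
        node = node.next[level];
      if (level < l) {
        nwnd.next[level] = node.next[level];
        node.next[level] = nwnd;
      }
    }
  }

  // Returns all keys k with lo <= k <= hi in increasing order.
  List<Integer> range(int lo, int hi) {
    Node node = head;
    for (int level = maxLevel - 1; level >= 0; level--)
      while (node.next[level].key < lo)
        node = node.next[level];
    List<Integer> result = new ArrayList<>();
    // From here on the bottom level is walked like an ordinary linked list.
    for (node = node.next[0]; node != tail && node.key <= hi; node = node.next[0])
      result.add(node.key);
    return result;
  }
}
```

The descent to the first key that is at least lo costs as much as a find; every further reported key costs one step along the bottom-level list.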

Skip lists offer a conceptually simple dictionary data structure. Their time and memory consumption are comparable to those of more sophisticated data structures such as AVL trees.

Exercises

  1. Show that the time for the restructuring operations needed when performing an arbitrary sequence of k >= n insertions (and no deletions) on a 2-3 tree with at most n leaves is bounded by O(k).

  2. Show that the time for the restructuring operations needed when performing an arbitrary sequence of k >= n insertions and deletions on a 2-4 tree with at most n leaves is bounded by O(k).

  3. In a-b trees, before fusing two nodes, it is first considered whether a child can be borrowed. This idea is required to assure that the degree condition holds after fusing. However, it also has a positive impact on the amortized number of performed restructuring operations. It turns out to be advantageous to even try to give away a child before splitting a node.

  4. a-b trees are also particularly interesting for large values of a and b. This offers the possibility to reach any element by traversing very few levels of the tree. This can bring substantial improvements in a non-uniform memory, having several levels of cache, a main memory and a hard disk with ever larger access times. Of course maintaining an a-b tree the conventional way, with an array or linked list of values in the internal nodes, implies that finding out the next node to visit costs O(b) time. So, the total time for a search in an a-b tree becomes O(b * log_a n). However, the values in the internal nodes can be maintained in some tree-like dictionary structure as well.

  5. In the chapter on union-find we have been considering the performance of several strategies. One of them was find with path contraction and simple first-to-second unions. This strategy, and the others as well, can also be analyzed in a very clean way using a suitably defined potential function.

  6. The analysis of insertion sort was given for an a-b tree. Repeat the analysis for an AVL tree, showing that even for them the time to perform insertion sort can be bounded by O(n + n * log(I(S) / n)). Formulate a general condition that a tree-based dictionary must satisfy in order to assure this result.

  7. Perform the following operations on a splay tree obtained by first performing insert(0), ..., insert(8): find(4), find(5), insert(9), delete(3). Show the situation after each complete operation.

  8. It is considered how many operations must be performed on a splay tree with n elements in order to guarantee an amortized time of O(log n).

  9. The rank of an element x in a set S is the number of elements in S that are smaller than x, where it is assumed that all elements are different. For dictionaries we consider the additional operations compute_rank and find_rank. The first determines for a key x its rank r in the set of keys stored in the dictionary T. The second finds the key x corresponding to a specified rank r.

  10. The standard operations on dictionaries are find, insert and delete, but there are other interesting operations as well. In the question above on high-degree a-b trees, we considered using a tree-like dictionary for maintaining the up to b - 1 values which have to be stored in a node. In this context it was important that this data structure also supports the operations fuse and split. Describe how to perform these operations on skip lists and specify the corresponding time consumption.

  11. It is considered how to construct and merge dictionaries.

  12. Related to merging is the operation insert_batch. This operation inserts m new elements into a dictionary with initially n elements. Clearly this can be performed by m conventional insertions. This takes O(m * log n) time using any reasonable dictionary implementation. In general it will be hard to come below this bound, because most data structures somehow sort. However, assuming that the m numbers are provided in sorted order, we can hope to do as well as O(m + log n).

  13. Constructing a skip list with n nodes by performing n insertions takes O(n * log n) expected time. In the case of general key weights, this is optimal, because this construction implies sorting the keys, but if for some reason the sorting can be performed faster, we may hope to do better. Investigate whether it is possible to construct a skip-list structure from a sorted array containing n key values in O(n) time.

  14. The performance of skip lists depends on the probability distribution used for selecting the level of newly inserted nodes. It was suggested that the standard choice of creating an l-level node with probability 2^{-l} is not necessarily optimal. Use the given program to investigate this. First perform tests with the current setting, measuring time, followed links and memory consumption for n = 2^k, for k = 10, 12, ... . Then modify the program so that an l-level node is generated with probability 4^{-l} and test again.

  15. In the program it is specified for the delete method that the keys should occur at most once.





Hashing

Search trees are a good realization of the dictionary ADT. Using some kind of balanced search trees the operations find, insert and delete can all three be realized in O(log n) worst-case/amortized/expected time. In addition most of these data structures allow several other important operations such as finding the smallest element, enumerating all elements in linear time and range queries. However, if really only these three operations must be supported, hashing offers an alternative which is much simpler and in general even much faster.

Basic Approaches

Chaining

The simplest idea for managing a set of n elements is to create an array of length n. The elements of this array are nodes. A node has one or more data fields and one field of type node. For each element a node is created and they are allocated using a hash function f. All elements allocated to the same array position are linked together in a chain. This is called chaining. If the array has length m, the average chain length is n / m. The probability p_k that a chain has length k, is given by p_k = (n over k) * (1 / m)^k * (1 - 1 / m)^{n - k}.

For m = n, large n and small k, p_k ~= 1 / k! * 1 / e. In that case the expected maximum chain length is about log n / loglog n. Informally, this can be verified by noticing that k! ~= (k / e)^k, so that for k = log n / loglog n the probability 1 / k! is about 1 / n. It can even be shown that the maximum chain length is Theta(log n / loglog n), with high probability (this means that the probability that a smaller or larger maximum size occurs is bounded by n^{-alpha}, for some alpha > 0).

Hashing with Chaining
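As an illustration, a minimal sketch of hashing with chaining for int keys could look as follows. The class name ChainingSketch and the multiplicative hash function f are assumptions of this sketch; any function that scatters the keys well would do.

```java
final class ChainingSketch {
  private static final class Node {
    final int key;
    Node next;
    Node(int key, Node next) {
      this.key = key;
      this.next = next;
    }
  }

  private final Node[] table;                // table[h] is the head of chain h

  ChainingSketch(int m) {
    table = new Node[m];
  }

  // Assumed hash function: multiply by an odd constant, reduce modulo m.
  private int f(int key) {
    return Math.floorMod(key * 0x9E3779B1, table.length);
  }

  boolean find(int key) {
    for (Node u = table[f(key)]; u != null; u = u.next)
      if (u.key == key)
        return true;
    return false;
  }

  void insert(int key) {
    if (find(key))
      return;                                // keys are stored at most once
    int h = f(key);
    table[h] = new Node(key, table[h]);      // new node becomes head of chain
  }

  void delete(int key) {
    int h = f(key);
    if (table[h] == null)
      return;
    if (table[h].key == key) {
      table[h] = table[h].next;
      return;
    }
    for (Node u = table[h]; u.next != null; u = u.next)
      if (u.next.key == key) {
        u.next = u.next.next;
        return;
      }
  }

  int maxChainLength() {
    int max = 0;
    for (Node head : table) {
      int length = 0;
      for (Node u = head; u != null; u = u.next)
        length++;
      max = Math.max(max, length);
    }
    return max;
  }
}
```

Calling maxChainLength after inserting n random keys into a table with m = n positions gives an impression of how slowly the maximum chain length grows with n.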

A disadvantage of chaining is the additional memory needed for the array and for the pointers. The memory overhead is m + n. For any m <= n this is bounded by O(n). For large data elements this overhead is negligible, but if the data elements consist of a single int it may be profitable to take m somewhat smaller, saving memory at the expense of longer chains. A more serious disadvantage of chaining is that following a link between two nodes typically causes a cache fault (unless n is small). For an unsuccessful find, the expected number of cache faults equals 1 + n / m: one for accessing the array, the rest for the nodes.

Open Addressing

When performing open addressing, the keys of the elements are stored in the array itself. The advantage is that there are no pointers, but now collision handling becomes an issue. For an element which is hashed to some position h, the simplest idea is to try the positions h, h + 1, h + 2, ... (cyclically) and to allocate it to the first which is free. This collision-handling strategy is called linear probing. A major problem with linear probing is clustering. What is even worse, is that a small cluster has a tendency to rapidly grow in size, because any other element which is hashed to any position of the cluster will be inserted at the end of it. This phenomenon is called primary clustering. As soon as the array is occupied for more than 50%, the performance starts to deteriorate seriously, the average time for finds starts to go up.

Hashing with Linear Probing
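Linear probing can be sketched in a few lines. The class below is a hypothetical illustration: it uses the key modulo the array length as hash value, marks occupied positions in a separate boolean array, and leaves out deletions, which need extra care under open addressing.

```java
final class LinearProbingSketch {
  private final int[] keys;
  private final boolean[] used;

  LinearProbingSketch(int n) {
    keys = new int[n];
    used = new boolean[n];
  }

  // Assumed hash function: the key modulo the array length.
  private int h(int key) {
    return Math.floorMod(key, keys.length);
  }

  // Returns the position of key, or -1 if it is not present.
  int positionOf(int key) {
    int p = h(key);
    for (int probes = 0; probes < keys.length && used[p]; probes++) {
      if (keys[p] == key)
        return p;
      p = (p + 1) % keys.length;             // try h, h + 1, h + 2, ... cyclically
    }
    return -1;
  }

  boolean find(int key) {
    return positionOf(key) >= 0;
  }

  // Returns false if key was already present.
  boolean insert(int key) {
    int p = h(key);
    for (int probes = 0; probes < keys.length; probes++) {
      if (!used[p]) {
        used[p] = true;
        keys[p] = key;
        return true;
      }
      if (keys[p] == key)
        return false;
      p = (p + 1) % keys.length;
    }
    throw new IllegalStateException("table is full");
  }
}
```

With an array of length 8, the keys 1, 9 and 17 all hash to position 1 and end up in positions 1, 2 and 3: a small cluster. Any later key hashing to one of these positions is appended at the end of it.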

Not only does the average time for finds go up, the maximum time increases even more strongly when the load factor increases. The attached program allows one to study this. The following table gives the maximum chain length for various values of n and the load factor r = m / n (here n denotes the length of the array and m the number of stored elements). Indicated are the maximum and average values over 1000 experiments for each pair of n and r values.

max / avg  n = 10^3      n = 10^4      n = 10^5      n = 10^6      n = 10^7      n = 10^8
r = 0.1    7 / 3.2       9 / 4.6       10 / 6.1      12 / 7.5      13 / 9.0      14 / 10.5
r = 0.2    13 / 5.3      15 / 7.7      18 / 10.1     20 / 12.6     21 / 15.2     25 / 17.8
r = 0.3    18 / 7.9      23 / 11.6     28 / 15.4     28 / 19.5     44 / 23.7     38 / 27.7
r = 0.4    26 / 11.7     32 / 17.6     40 / 23.6     55 / 30.0     55 / 36.5     60 / 42.7
r = 0.5    38 / 16.3     50 / 26.4     63 / 36.0     83 / 46.6     83 / 56.8     102 / 67.6
r = 0.6    77 / 26.5     95 / 41.9     139 / 58.9    131 / 76.9    165 / 95.1    157 / 113
r = 0.7    114 / 44.1    187 / 73.4    198 / 105     268 / 141     295 / 174     325 / 210
r = 0.8    233 / 80.7    391 / 146     558 / 227     520 / 304     656 / 391     724 / 475
r = 0.9    492 / 198     1236 / 444    1775 / 738    1861 / 1082   2794 / 1437   2997 / 1815

It appears that for a fixed value of r, the maximum chain length increases logarithmically as a function of n. As a function of r the increase is very strong, for larger r this may imply considerable maximum response times. The experiments also show that the maximum chain lengths fluctuate considerably around their average value.

The clustering problem is inherent to open addressing, but with quadratic probing or double hashing it can be alleviated considerably. Applying double hashing, good performance may be expected up to a load factor of about 80%. However, unless memory usage is the principal consideration, linear probing is to be preferred for the following reasons:

Cuckoo Hashing

The presented approaches are simple and will work fine in general. It is nevertheless worthwhile to consider alternatives. Both chaining and open addressing give non-negligible maximum search times. Accepting slightly larger times for inserting, the maximum time for find can be made constant.

The idea is to use two hash functions h_1 and h_2. Each of these should behave like randomly scattering the elements. For an element x, h_1(x) is computed. If position h_1(x) is free, then x is stored in position h_1(x). If h_1(x) is taken, then position h_2(x) is considered. If it is free, x is stored in position h_2(x). If both h_1(x) and h_2(x) are occupied, an element y stored at one of these positions is thrown out and x is stored instead. If y was allocated with h_1, it is reallocated with h_2 and vice-versa. This reallocation may cause a chain of further reallocations. Because elements are kicked out of their slots, this idea is known as cuckoo hashing.

The strong point is that any element x is stored either at position h_1(x) or at position h_2(x) and thus can be found with at most two probes. The weakness is that quite often the array may have to be accessed in two very different positions, causing two cache misses, and requiring two function evaluations. Assuming that the functions behave randomly, it is easy to estimate the insertion time as a function of the array size n and the number m of stored elements: each probe hits an occupied position with probability m / n, thus the expected number of probes is n / (n - m). The maximum time is bounded by O(log_{n / m} n) with high probability.

There are some important details. Which element is kicked out when h_1(x) and h_2(x) are both occupied? Should we always first try h_1? What should be done if three elements x, y and z are mapped to the same set of possibilities, that is, {h_1(x), h_2(x)} = {h_1(y), h_2(y)} = {h_1(z), h_2(z)}? The probability that this happens is so small that we can afford to rehash.
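The insertion scheme with evictions and rehashing can be sketched as follows. This is only a minimal illustration under several assumptions that are not taken from the text: the class name CuckooSet, the multiplicative hash functions, the bound MAX_KICKS on the length of a reallocation chain, the rule of always evicting the element at h_1(x), and the choice to keep the load factor below 1/3 by doubling the array.

```java
import java.util.Random;

/** Sketch of a cuckoo hash set for int keys: one array, two hash functions. */
class CuckooSet {
    private static final int MAX_KICKS = 32;   // assumed bound on a reallocation chain
    private final Random rnd = new Random(1);  // fixed seed, for reproducibility only
    private Integer[] table;                   // null marks a free position
    private int a1, b1, a2, b2;                // parameters of h_1 and h_2
    private int size = 0;

    CuckooSet(int capacity) {
        table = new Integer[Math.max(capacity, 4)];
        pickFunctions();
    }

    private void pickFunctions() {
        a1 = rnd.nextInt() | 1; b1 = rnd.nextInt();
        a2 = rnd.nextInt() | 1; b2 = rnd.nextInt();
    }

    // Multiplicative hashing; ">>> 8" drops the weak low bits and makes the value non-negative.
    private int h1(int x) { return ((a1 * x + b1) >>> 8) % table.length; }
    private int h2(int x) { return ((a2 * x + b2) >>> 8) % table.length; }

    int size() { return size; }

    /** At most two probes: x can only be stored at h_1(x) or h_2(x). */
    boolean contains(int x) {
        Integer p = table[h1(x)], q = table[h2(x)];
        return (p != null && p == x) || (q != null && q == x);
    }

    /** Returns false if x was present already. */
    boolean insert(int x) {
        if (contains(x)) return false;
        if (3 * (size + 1) > table.length) rebuild(2 * table.length, null);
        Integer homeless = place(x);
        if (homeless != null) rebuild(table.length, homeless);  // unresolved: rehash
        size++;
        return true;
    }

    /** Stores x, evicting residents if needed; returns null on success, else the key left over. */
    private Integer place(int x) {
        int cur = x;
        int pos = h1(cur);
        if (table[pos] == null) { table[pos] = cur; return null; }
        if (table[h2(cur)] == null) { table[h2(cur)] = cur; return null; }
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            int victim = table[pos];                      // throw out the resident
            table[pos] = cur;
            cur = victim;
            pos = (h1(cur) == pos) ? h2(cur) : h1(cur);   // it moves to its other position
            if (table[pos] == null) { table[pos] = cur; return null; }
        }
        return cur;
    }

    /** Rehashes all keys (plus an optional extra one) with fresh functions until this succeeds. */
    private void rebuild(int capacity, Integer extra) {
        Integer[] old = table;
        boolean ok;
        do {
            table = new Integer[capacity];
            pickFunctions();
            ok = (extra == null) || place(extra) == null;
            for (int i = 0; ok && i < old.length; i++)
                if (old[i] != null) ok = place(old[i]) == null;
        } while (!ok);
    }
}
```

A find never looks at more than two positions, whereas an insert may trigger a chain of evictions or, rarely, a complete rehash.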

Best of Two Chaining

Also when applying chaining the maximum queue sizes can be reduced considerably at modest cost. The idea is again to use more than one hash function: using d hash functions, h_0, ..., h_{d - 1}, for inserting a number x, the values y_i = h_i(x), 0 <= i < d, are computed and the lengths of the chains starting at the positions y_i are considered. x is inserted where the chain is shortest. This simple idea gives a strong reduction of the maximum chain length. It can be shown that for d = 2, the maximum chain length is bounded by O(loglog n), an exponential improvement over O(log n / loglog n). Taking d larger gives a more gradual reduction.

Theoretically this is a highly appealing idea, but practically its importance is limited. In practice 2 * loglog n is not much smaller than log n / loglog n. Thus, even when the maximum time for finds is the main factor the improvement is modest. For the average time for finds, this approach implies a clear deterioration: using two hash functions, a find may have to traverse two chains. Because the average chain length is the same as before, this means that the average time for finds is doubled (considering that there is even a second evaluation of the hash function).

Nevertheless, the observation that a multiple-choice strategy gives much smaller maximum values than a single-choice strategy is highly interesting and has been applied mostly in the domain of load balancing. Azar, Broder, Karlin and Upfal have proven the following:

Theorem: Throwing n balls in n bins using the best-of-d strategy sketched above gives a maximum load of ln ln n / ln d +- Theta(1), with high probability.

Here high probability means a probability of at least 1 - n^{-c}, for some constant c > 0.
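The effect of the best-of-d strategy is easy to observe experimentally. The following is a minimal simulation sketch; the class and method names are chosen here for illustration, and ties between equally loaded bins are simply resolved in favor of the candidate drawn first.

```java
import java.util.Random;

class BestOfD {
    /** Throws n balls into n bins; each ball goes to the least loaded of d random bins.
     *  Returns the maximum load over all bins. */
    static int maxLoad(int n, int d, Random rnd) {
        int[] load = new int[n];
        int max = 0;
        for (int ball = 0; ball < n; ball++) {
            int best = rnd.nextInt(n);
            for (int i = 1; i < d; i++) {
                int y = rnd.nextInt(n);
                if (load[y] < load[best]) best = y;   // ties: keep the earlier candidate
            }
            max = Math.max(max, ++load[best]);
        }
        return max;
    }
}
```

Calling maxLoad(n, 1, rnd) gives the single-choice behavior, so the two regimes can be compared directly for the same n.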

More recently Vöcking has studied a variant which, without extra cost, is even better. The idea is to partition the set of n bins into d subsets S_i, 0 <= i < d, of size n' = n / d each. Let load(i, j) denote the load of bin j of subset S_i. For allocating a ball, d random numbers y_i, 0 <= i < d, in the range 0, ..., n' - 1 are generated (in the case of hashing a value x, y_i = h_i(x) as before). For all i, 0 <= i < d, the values l_i = load(i, y_i) are looked up. If j, 0 <= j < d, is an index such that l_j <= l_i for all i, 0 <= i < d, then the ball is allocated to bin y_j of S_j.

There is one detail, which must be fixed. How to handle the situation that there are k > 1 indices j_0, ..., j_{k - 1}, with l_{j_0} = ... = l_{j_{k - 1}} <= l_i, for all i, 0 <= i < d? There are two natural strategies:

The first is simpler, the second appears to be better. Even when not subdividing, a leftist tie-breaking can be applied. This gives four possible strategies in total:
  1. Non-subdivided with uniform tie-breaking.
  2. Non-subdivided with leftist tie-breaking.
  3. Subdivided with uniform tie-breaking.
  4. Subdivided with leftist tie-breaking.
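Strategy 4 can be sketched as follows. The sketch assumes that leftist tie-breaking means choosing, among the candidates with minimum load, the one in the subset with the smallest index; it further assumes that n is divisible by d. The class and method names are illustrative only.

```java
import java.util.Random;

class AlwaysGoLeft {
    /** Throws n balls into d subsets of n / d bins each; a ball goes to the least loaded
     *  of its d candidate bins, ties being broken in favor of the smallest subset index.
     *  Returns the maximum load over all bins. */
    static int maxLoad(int n, int d, Random rnd) {
        int groupSize = n / d;                    // n' = n / d, n assumed divisible by d
        int[][] load = new int[d][groupSize];     // load[i][j] = load(i, j)
        int max = 0;
        for (int ball = 0; ball < n; ball++) {
            int bestGroup = 0;
            int bestBin = rnd.nextInt(groupSize); // y_0
            for (int i = 1; i < d; i++) {
                int y = rnd.nextInt(groupSize);   // y_i
                // Strict comparison: on a tie the candidate further to the left wins.
                if (load[i][y] < load[bestGroup][bestBin]) {
                    bestGroup = i;
                    bestBin = y;
                }
            }
            max = Math.max(max, ++load[bestGroup][bestBin]);
        }
        return max;
    }
}
```

Replacing the strict comparison by a random choice among equally loaded candidates would give strategy 3.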

Vöcking has shown that the combination of a subdivision with leftist tie-breaking is superior to all others:

Theorem: Throwing n balls in n bins using the best-of-d strategy sketched above using a division in d subsets and leftist tie breaking gives a maximum load of ln ln n / (d * ln phi_d) +- Theta(1), with high probability.

Here phi_d depends on d, but not on n. phi_2 = (1 + sqrt(5)) / 2 ~= 1.61, phi_3 ~= 1.83, phi_4 ~= 1.92, and for larger d, phi_d increases monotonically to lim_{d -> infinity} phi_d = 2. The essential difference with the result by Azar et al. is the linear decrease with d. Even more interesting than the improvement, however, is the surprising conclusion that an asymmetric strategy may be better than a symmetric one.
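The constants phi_d can be computed numerically. The sketch below assumes that phi_d is the generalized golden ratio, that is, the root in (1, 2) of x^d = x^{d - 1} + ... + x + 1; this assumption agrees with the value given for phi_2 and with the limit 2. The class and method names are illustrative.

```java
class GeneralizedGoldenRatio {
    /** Root in (1, 2) of x^d = x^(d-1) + ... + x + 1, for d >= 2, found by bisection. */
    static double phi(int d) {
        double lo = 1.0, hi = 2.0;
        for (int iter = 0; iter < 60; iter++) {
            double mid = (lo + hi) / 2;
            double sum = 0, pow = 1;
            for (int i = 0; i < d; i++) {
                sum += pow;
                pow *= mid;
            }
            // Now pow = mid^d and sum = 1 + mid + ... + mid^(d-1).
            if (pow < sum) lo = mid; else hi = mid;
        }
        return lo;
    }
}
```

The computed values can be substituted in ln ln n / (d * ln phi_d) to compare the leading terms of the two theorems for concrete n and d.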

All four are implemented in a program which can be downloaded here. Trying several values of n and d gives the following results:

n = 10^3 n = 10^4 n = 10^5 n = 10^6 n = 10^7 n = 10^8
d = 1 5.53 6.69 7.73 8.79 9.75 10.74
d = 2 3.01 / 3.01 3.06 / 3.07 3.46 / 3.45 4.00 / 4.00 4.00 / 4.00 4.00 / 4.00
-- 3.01 / 2.99 3.06 / 3.00 3.45 / 3.00 4.00 / 3.04 4.00 / 3.40 4.00 / 3.99
d = 3 2.41 / 2.39 2.99 / 3.00 3.00 / 3.00 3.00 / 3.00 3.00 / 3.00 3.00 / 3.00
-- 2.38 / 2.02 3.00 / 2.09 3.00 / 2.65 3.00 / 3.00 3.00 / 3.00 3.00 / 3.00
d = 4 2.03 / 2.03 2.20 / 2.20 2.91 / 2.90 3.00 / 3.00 3.00 / 3.00 3.00 / 3.00
-- 2.02 / 2.00 2.19 / 2.00 2.90 / 2.00 3.00 / 2.00 3.00 / 2.00 3.00 / 2.01

All given values are averages over 1000 experiments. For d > 1, the four values for the four strategies are given in the order they are listed above.

The experiments show that any of the multiple-choice strategies gives much smaller maximum values than using a single choice (but recall the above note on the practical relevance for hashing). Strategies 1, 2 and 3 give more or less the same values and strategy 4 is better. So, it is really the combination of the two ideas which gives the improvement. The experiments also show that sometimes, for example for n = 10^6 and d = 3, all strategies are equally good: all experiments give exactly the same value. Using the best-of-4 with strategy 4 gives a maximum value of exactly 2 for nearly all tests, even for very large n.

Exercises

  1. Estimate the expected maximum chain length as a function of n and m even for m != n.

  2. Compare the expected number of cache misses for successful and unsuccessful finds, when applying either linear probing or double hashing. It is assumed that the array is 50% full, that the cache lines consist of 16 ints, and that with linear probing, due to clustering, on average one more position has to be accessed than with double hashing.

  3. Study in more detail the development of the maximum chain length of open addressing with linear probing as a function of the load factor r = m / n. Suggest a formula for this development and try to justify it by a theoretical consideration. Additional values can be computed with the given program.

  4. When using linear probing, the maximum time for an unsuccessful find is one larger than the maximum chain length. However, the average time is practically at least as relevant.

  5. Estimate the probability that using two random functions with values in {0, 1, ..., n - 1}, there is a triple x, y, z which is mapped to the same set of two values. Is this the only type of unresolvable collision when applying cuckoo hashing? Estimate the probability of an unresolvable collision when allocating n / 2 elements to an array of length n.

  6. Are the theorems of Azar et al. and Vöcking useful for predicting the maximum loads actually arising in practice?





Priority Queues

A priority queue is an ADT supporting insert and deletemin. The best-known priority queue guaranteeing O(log n) time consumption is the binary heap. For completeness it is repeated here. However, the binary heap is by far not the only efficient implementation of a priority queue. We present several alternative structures. Especially for another commonly needed operation, decreasekey, these perform better, having O(1) amortized time.

Heaps

Definition and Operations

We now introduce one more fundamental data structure: the heap. The underlying structure of a heap is a tree, so it looks similar to a search tree. However, the defining property is different, and as a consequence a heap has different properties and different uses.
A heap is a tree for which any node v has a key that is smaller than or equal to the keys of all its children (if any).
The above property will be referred to as the heap property. It clearly implies that the smallest key must stand in the root of the tree, and that the second smallest element is one of its children. Thus, findmin can be done in O(1): just return the key of the root.

Heap-Ordered Tree

A deletemin is slightly harder. It is not hard to remove the root, but we should write another element instead of it, so that afterwards the tree is again a heap. But, this is not too hard either. If the root r of a heap is deleted, it may be replaced by the smallest of its children. In this way the heap property is preserved at the level of the root. Recursively deleting the roots of the heaps at the lower levels gives a correct deletion. When reaching a leaf node, the recursion stops, deleting the whole node. This process can be viewed as a free place, a hole, moving downwards until exiting the tree. Therefore, this is also called deletion by performing a percolate down. In pseudo-code the deletemin looks as follows:

  void percolateDown(Node v) 
  {
    if (v has children) // v is not a leaf
    {
      determine the child w of v with the smallest key;
      v.key = w.key; // maybe even other data to copy
      percolateDown(w); 
    } 
    else
      remove v;
  }

  int deleteMin(Node r) 
  {
    int x = r.key;
    percolateDown(r);
    return x; 
  }

An insert is similar. The new node v can in principle be attached to any node w. If the key of v is smaller than the key of w, then the heap property is restored by exchanging v and w, but possibly v may have to bubble up even further. At a certain level, possibly at the root, v will have found a place in which it is no longer violating the heap property and we are done. This operation in which the inserted element is bubbling upwards through the tree is most commonly called a percolate up.

  void percolateUp(Node w) 
  {
    if (w has a parent v) // w is not the root
      if (v.key > w.key) 
      {
        int x = v.key; v.key = w.key; w.key = x;
        percolateUp(v); 
      }
  }

  void insert(int x) 
  {
    create a new node w;
    w.key = x;
    attach w to an appropriate node v;
    percolateUp(w); 
  }

Percolate up and down are symmetric, but there are also important differences: when percolating up, the key value needs to be compared with only one other value at each level. In a percolate down it must be compared with the minimum of the children, which must be determined first. So, the cost of an insert is O(depth_of_tree), while the cost of a deletemin is O(depth_of_tree * degree_of_nodes). Furthermore, while percolating up, we move along the unique path leading towards the root. When percolating down, it is not a priori known which path will be followed. Another important difference is that a deletemin always goes the whole way until reaching a leaf, while an insert stops as soon as the new key has reached a level where it does not conflict with the key of its parent.

Binary Heaps

In the literature it is sometimes assumed that the tree is binary and perfectly balanced, however, the structure of the tree has no implications for the way the operations are done. One should not think that it is part of the definition of a heap that it is realized as a balanced binary tree. The balanced-tree property is only needed for efficiency reasons: otherwise the tree might degenerate into a structure that resembles a path with depth close to n. Because the time consumption of insert and deletemin is proportional to the depth of the tree, this is highly undesirable.

From the above, we have seen that one has a lot of freedom in how to do things. This will be exploited to come up with a very simple and very efficient implementation. The tree will always be kept perfectly balanced: that is, it will always be a binary tree with all levels completely filled, except for possibly the lowest.

Perfect Binary Trees with 1, ..., 10 Nodes

This means, that if we are adding a node, performing insert, we must insert it at the next free place in the lowest level. If the last level is just full, we must create a new level, inserting the element as the left child of the leftmost node in the bottom level.

A deletemin cannot be performed as before, because then we cannot control where the hole is going. Therefore, we modify the routine. The last position in the bottom level is freed, possibly cancelling the whole level. The key of this node is tentatively placed in the root, and then it percolates down by exchanges with the smaller child. The whole deletemin now looks like

  void percolateDown(Node v) 
  {
    if (v has children) // v is not a leaf
    {
      determine the child w of v with the smallest key;
      if (w.key < v.key)
      {
        int x = v.key; v.key = w.key; w.key = x;
        percolateDown(w); 
      }
    } 
  }

  int deleteMin(Node r) 
  {
    int x = r.key;
    let v be the rightmost node at the bottom level of the tree;
    r.key = v.key; // maybe even other data to copy
    remove v;
    percolateDown(r);
    return x; 
  }

Heap Operations

Lemma: The deletemin procedure is correct: it removes the entry with minimum key value; preserves the heap property and returns the minimum key value.

Proof: Because we may assume that the heap property held before the operation, the root is the entry with minimum key. This value is overwritten and returned. It remains to check that the heap property is preserved. A crucial observation is that it can only be disturbed along the processed path. A formal correctness proof goes by induction. The hypothesis is that at any time, the current node v is the only node of the tree in which the heap property may possibly be violated. Once this is proven, the correctness follows, because when the process terminates, the heap property is assured in v: either v.key <= w.key, w being the node with minimum key value among the children of v, or v is a leaf, for which the heap condition is void. Initially the tree is unchanged except for the root, the node at which percolateDown starts. So, assume the hypothesis holds at the beginning of some step. Then if w.key < v.key, the keys of v and w are exchanged. After this, v.key < w.key and because w was the node with minimum key among the children of v, it is even true that after swapping v.key <= w'.key for all other children w' of v. So, the heap property has been established in v. The only other node whose key has changed is w, the node which is considered in the next round. End.

Using a perfect binary tree, a heap with n entries has depth round_down(log n), so both operations can be performed in O(log n) time. In a really efficient implementation we do not perform exchanges but keep the element for which the position in the heap still has to be determined in an additional memory position and shift the elements on the path simply one level up or down. Doing this, the number of assignments is reduced from 3 * length_of_path to 1 * length_of_path + 2.

Another observation that is essential for very efficient implementations in practice is that a perfect binary tree can very well be maintained in an array, avoiding all pointers. The idea is to number the nodes level by level from left to right, starting with the root which gets number 0. In that case, for any node with index i, the left child has index 2 * i + 1 and the right child has index 2 * i + 2. This makes it possible to access the children of a node by a simple computation, which requires two clock cycles (maybe even one because often additions and multiplications can be executed in parallel), which is certainly not more than the cost for fetching the address of the left child. At the same time it gives a considerable reduction of the memory consumption, saving n pointers. If we start with index 1 for the root, then the left child of node i has index 2 * i and the right child 2 * i + 1, saving half of the additions. This indexing idea even works for d-ary heaps which are based on perfect d-ary trees. A perfect tree which is maintained in an implicit way in an array without any pointers is called an implicit tree.

Node Numbering in Implicit Trees
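Combining the array layout with the idea of shifting keys instead of exchanging them leads to a very compact implementation. The following is a minimal sketch for int keys with the root at index 0; the class name and the doubling of the array when it is full are choices made here for illustration.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;

/** Array-based (implicit) binary min-heap of ints, root at index 0. */
class ImplicitBinaryHeap {
    private int[] a = new int[16];
    private int n = 0;                     // number of stored keys

    boolean isEmpty() { return n == 0; }

    void insert(int x) {
        if (n == a.length) a = Arrays.copyOf(a, 2 * n);
        int i = n++;                       // hole at the next free position
        while (i > 0 && a[(i - 1) / 2] > x) {
            a[i] = a[(i - 1) / 2];         // shift the parent one level down
            i = (i - 1) / 2;               // the hole moves one level up
        }
        a[i] = x;                          // one final assignment places x
    }

    int deleteMin() {
        if (n == 0) throw new NoSuchElementException("heap is empty");
        int min = a[0];
        int x = a[--n];                    // key of the last node, to be placed again
        int i = 0;                         // hole starts in the root
        while (2 * i + 1 < n) {
            int c = 2 * i + 1;             // left child
            if (c + 1 < n && a[c + 1] < a[c]) c++;   // right child is smaller
            if (a[c] >= x) break;          // x fits in the hole
            a[i] = a[c];                   // shift the smaller child one level up
            i = c;
        }
        a[i] = x;
        return min;
    }
}
```

Each step on the path costs one assignment, in line with the count given above for the variant without exchanges.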

Expected Insertion Time

For the implementation with a perfect binary tree, the time for insert and deletemin is bounded by O(log n). This is an upper bound of the time consumption. There are many problems though, for which the upper bound is unnecessarily pessimistic, the observed behavior in practice being better. How about our operations?

When performing a deletemin, the element that is put tentatively in the root will typically have a rather large key because before it was a leaf. This is not always so: it is possible that many nodes with small keys stand deep in the tree, but in general this will not be the case. Therefore, typically this element will be percolated down rather far before the process stops: even in practice the cost of a deletemin is proportional to log n.

The situation for insertions is different: better. In practice it turns out that insertions go very fast. Much faster than O(log n). The reason is simple, a precise analysis is quite hard. Of course an analysis requires an assumption: practice cannot be analyzed. So, we assume that the keys are sampled uniformly from an interval, say they are reals in [0, 1]. Let us try to estimate the expected number of calls to the routine percolateUp under this assumption.

Consider the case that we are only performing inserts and no deletemin operations. Randomly and uniformly select a key. It is essential that the previous nodes were sampled from the same probability distribution. The node is only moving up k levels, if it has the smallest key in its subtree of depth k - 1. The lowest level of this tree may have been empty except for the node itself, but all the other levels are full. So, only if the node is the smallest among 2^k or more nodes (also counting the node itself) it is moving up k levels. This means that the expected distance it is moving upwards can be estimated as follows: the probability that the node is moving up exactly k levels is at most 1 / 2^k for all k. Denoting the upwards movement by the random variable X, we get

Exp[X] <= sum_{k > 0} k / 2^k = 1 * 1/2 + 2 * 1/4 + 3 * 1/8 + ... = 2.
Here Exp[X] denotes the expected value of X, which is defined by
Exp[X] = sum_{all possible values i of X} i * Prob[X == i].

The above analysis is not entirely correct: the keys in the lowest levels of the heap are not entirely uniformly distributed. The fact that they are standing there implies that they are somewhat larger than average. However, this dependency is very weak (because there are so few elements at the higher levels of the tree), and the analysis is correct up to a small correction. The computed constant is not that important. The important point to remember is that the expected time for an insert is O(1).
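The claim can be checked with a small experiment. The sketch below builds an implicit binary heap by inserting uniformly distributed keys and counts over how many levels the keys move up; the class name, the method name and the use of a seeded java.util.Random are assumptions made for this illustration.

```java
import java.util.Random;

class InsertCostExperiment {
    /** Inserts n uniformly random keys into an initially empty implicit binary heap
     *  and returns the average number of levels a new key moves up. */
    static double averageLevelsMovedUp(int n, long seed) {
        Random rnd = new Random(seed);
        double[] a = new double[n];
        long moves = 0;
        for (int size = 0; size < n; size++) {
            double x = rnd.nextDouble();   // key sampled uniformly from [0, 1)
            int i = size;                  // next free position in the lowest level
            while (i > 0 && a[(i - 1) / 2] > x) {
                a[i] = a[(i - 1) / 2];     // parent moves down, the new key moves up
                i = (i - 1) / 2;
                moves++;
            }
            a[i] = x;
        }
        return (double) moves / n;
    }
}
```

For increasing n the returned average should stay bounded by a small constant rather than grow like log n.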

Buildheap

How long will it take to build a heap consisting of n nodes? Doing this by performing n inserts, may lead to a worst-case time of O(n * log n). This happens if the elements are inserted with decreasing key values: in that case the i-th element has to percolate-up about log i positions, so the total time is
sum_{i = 1}^n log i > sum_{i = n / 2}^n log i > sum_{i = n / 2}^n log (n / 2) = n/2 * (log n - 1) = Omega(n * log n).

Can we hope to do better? Yes. The fact that the expected cost for an insertion is O(1) hints, but not more than that, that we may hope to do it in O(n) time. Notice the fundamental difference with a search-tree structure: because the elements of a search tree with n elements can be output in sorted order in O(n) time (by running an inorder traversal), the Omega(n * log n) lower bound on sorting implies that the construction of any search tree, balanced or not, takes Omega(n * log n) time. For heaps there is no such fundamental obstacle against an efficient construction: the elements are only very weakly sorted.

A first idea is to randomize the input and then perform n inserts. This overcomes the problem that the elements may stand in the wrong order: with high probability (meaning that the probability of failure is bounded by O(n^-alpha) for some constant alpha > 0) the whole sequence of insertions costs only O(n) time.

However, this bound can also be established deterministically by a rather simple algorithm. The idea is that we do not maintain a heap at all times (this is not necessary as we are not going to do deletemins during the buildheap). We simply create a perfect binary tree with n nodes, and then heapify it. That is, we are going to plow through it until everything is ok.

How to proceed? Our strategy must in any case guarantee that the smallest element appears in the root of the tree. This seems to imply that the root must (also) be considered at the end, to guarantee that it is correct. From this observation one might come with the idea to work level by level bottom up, guaranteeing that after processing level i, all subtrees with a root at level i are heaps. Let us consider the following situation: we have two heaps of depth d - 1, and one extra node with a key x, which connects the two heaps:

Two Heaps Connected by a Root

How can we turn this efficiently into a heap of depth d? This is easy: x is percolated down. So, two heaps of depth d - 1 + one extra node can be turned into a heap of depth d in O(d) time.

Now, for the whole algorithm we start at level 1 (the leaves being at level 0) and proceed up to level log n. In pseudo-code this gives the following algorithm:

  void heapify(Heap h, int n) 
  {
    // The nodes at level 0 constitute heaps of depth 0
    for (l = 1; l <= round_down(log_2 n); l++)
      for (each node v of h at level l) do
        percolateDown(v); // Now v is the root of a heap of depth l
      // Now all nodes at level l are the root of a heap of depth l
    // Now the root is the root of a heap of depth round_down(log_2 n)
  }
The correctness immediately follows from the easily checkable claims (invariants) written as comments within the program, which hold because of the above observation about the effect of percolating down.

How about the time consumption? At level l we are processing fewer than n / 2^l nodes, and each operation takes O(l) time. Let c be the constant so that the time for processing a node at level l is bounded by c * l, then the total time consumption can be estimated as follows:

sum_{l = 1}^{log n} c * l * n / 2^l < c * n * sum_{l >= 1} l / 2^l = 2 * c * n = O(n).
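The value of the series used in the last step can be verified by writing each term l / 2^l as l copies of 1 / 2^l and exchanging the order of summation. In LaTeX notation, with S introduced here as a name for the series:

```latex
\begin{align*}
S = \sum_{l \ge 1} \frac{l}{2^l}
  = \sum_{l \ge 1} \sum_{j = 1}^{l} \frac{1}{2^l}
  = \sum_{j \ge 1} \sum_{l \ge j} \frac{1}{2^l}
  = \sum_{j \ge 1} \frac{1}{2^{j - 1}}
  = 2 .
\end{align*}
```

The inner sum is a geometric series starting at 1 / 2^j, which adds up to 1 / 2^{j - 1}; summing this over all j >= 1 gives 2.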

The given algorithm can easily be generalized for any kind of heaps, but for heaps in which all nodes have the same degree implemented in arrays, it can be written even simpler. Assume an array a[] of length n should be turned into a binary heap, then we can do the following:

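  // Restores the heap property in the subtree rooted at index i of a[0 .. n - 1],
  // assuming that the subtrees rooted at the children of i are heaps already.
  void percolateDown(int i)
  {
    int j = i;              // index of the smallest key seen so far
    int k = (i << 1) + 1;   // left child of i
    if (k < n && a[k] < a[j])
      j = k;
    k++;                    // right child of i
    if (k < n && a[k] < a[j])
      j = k;
    if (j != i)
    { 
      int x = a[j]; a[j] = a[i]; a[i] = x;
      percolateDown(j);
    }
  }

  void buildHeap()
  {
    // Nodes with index >= n / 2 are leaves; process the others bottom-up.
    for (int i = (n >> 1) - 1; i >= 0; i--)
      percolateDown(i);
  }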
Click here to see the above piece of code integrated in a working Java program.

Heapification of a Binary Tree

d-Heaps

Of course we do not have to limit our studies to heaps that are built out of binary trees. Taking trees of degree 3, 4 or more generally of degree d is possible as well. The heap property remains the same: the key of any node is smaller than or equal to that of all its children. Deletemin is simple: remove the root, replace it by the last leaf and perform a percolate down, now considering d children. An insert is also easy: add a new leaf and let the new key percolate up.

Practically there are reasons to choose d to be a power of two: in that case the array implementation requires only bit shifts for the computation of the location of the parent and the children (and no divisions which might be more expensive). For d-heaps, mapping the nodes to consecutive numbers in such a way that the indices of the children can be computed easily works the same as before: start with the root, and continue numbering level by level. Giving number 0 to the root, the children of a node with index i are the nodes with indices d * i + 1, ..., d * i + d.

We prove that this is correct. Denote by f_d(k, i) the index in a perfect d-ary tree at position i of level k (the root being at position (0, 0)). We should prove that the children of node with index f_d(k, i) have indices d * f_d(k, i) + 1, ..., d * f_d(k, i) + d. This can be shown by first analyzing the relation between the indices of the leftmost nodes. f_d(k + 1, 0) = sum_{l = 0}^k d^l = d * sum_{l = 0}^{k - 1} d^l + 1 = d * f_d(k, 0) + 1. Now consider the other nodes. Node f_d(k, i) has index f_d(k, 0) + i and its children have indices f_d(k + 1, 0) + d * i, ..., f_d(k + 1, 0) + d * i + d - 1. Substituting f_d(k + 1, 0) = d * f_d(k, 0) + 1 the result follows.
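The index computations for an implicit d-ary tree can be written down directly. In the following sketch the class and method names are chosen for illustration; firstOfLevel corresponds to f_d(k, 0) from the proof above.

```java
class DaryIndex {
    /** Index of the j-th child (0 <= j < d) of the node with index i. */
    static int child(int d, int i, int j) {
        return d * i + 1 + j;
    }

    /** Index of the parent of the node with index i > 0. */
    static int parent(int d, int i) {
        return (i - 1) / d;
    }

    /** f_d(k, 0): index of the leftmost node of level k, the sum of d^l for 0 <= l < k. */
    static int firstOfLevel(int d, int k) {
        int index = 0;
        int width = 1;                     // number of nodes in the current level
        for (int l = 0; l < k; l++) {
            index += width;
            width *= d;
        }
        return index;
    }
}
```

For d = 2 this reduces to the familiar indices 2 * i + 1 and 2 * i + 2 for the children and (i - 1) / 2 for the parent.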

Choosing a tree of degree d reduces the depth of the tree from log_2 n to log_d n. Thus, a deletemin now takes O(d * log_d n). This is more than before. On the other hand, the insert has become cheaper: it only takes O(log_d n). Practically this is not such an interesting improvement as even degree 2 gives us expected time O(1), but theoretically it might be. For example, if we take d = log n, then the cost for the inserts has been reduced to O(log n / loglog n), which is asymptotically faster.

A more important reason, just as for d-ary search trees, is that every access of a node means a cache or page fault. If the tree is shallow, then the number of these accesses is reduced, which in practice will imply a reduction of the time to perform the operations. The right choice of d depends on the type of application. As long as all data fit in the main memory a good choice might be d = 4: this reduces the depth of the tree by a factor two at the expense of few extra operations.

If we consider an application in which the data do not fit into the main memory, then most accesses imply a page fault. In that case the tree should be kept as flat as possible by taking d very large. In that case a good idea is to take d = sqrt(n), assuring that the whole tree has depth 2. More generally, for any d = n^eps, for some constant eps > 0, the depth is constant, assuring that inserts can be performed in constant time.

A problem with such large d is, of course, that when percolating downwards, the minimum has to be selected out of d elements which becomes rather costly: a deletemin takes O(n^eps), which is not good. A solution is to maintain the children of a node not in an array or list, but in a priority queue. Keeping these priority queues up-to-date is not trivial, but clearly any percolate-step, both up and down, can be performed in O(log d) time when using conventional binary heaps for these priority queues of size d. The time for both inserts and deletemins is then bounded by log_d n * O(log_2 d) = O(log_d n * log_2 d) = O(log n). This is the same as before, and due to the more complicated structure this will actually be slower if the data set is small. However, if we have a problem whose size exceeds that of the main memory by a factor 100, then with this approach everything can be organized with at most 2 page faults per operation, whereas otherwise we would need 6 or 7.

Binomial Heaps

Most ADTs are based on a small number of underlying actual data structures. The most important examples are the following:

We will add one more great data structure to this list: the binomial forest structure. It efficiently supports the two priority-queue operations plus the extra operation of merging. Merging two priority queues means that out of two of them we create one new one containing all the elements. With heaps realized with perfect binary trees, this is hard to achieve. Of course, when there are n elements in total, it can be done in O(n) by building a new heap, but this is not what we are looking for. Using the binomial queue structure, all three operations can be performed in O(log n) time, and, this is also a very interesting property, insertions can be performed in O(1) amortized time. So, binomial queues really offer us some features that none of the previous data structures was offering.

Binomial Trees

A binomial tree has a very special recursively defined structure:

Lemma: A binomial tree of depth d has 2^d nodes.

Proof: The proof goes by induction. The lemma is ok for d = 0 because 2^0 = 1. This is the basis. So, assume the lemma is ok, for all depths up to d - 1. Then, the tree of depth d has 1 + sum_{i = 0}^{d - 1} 2^i = 2^d nodes, because sum_{i = 0}^{d - 1} 2^i = 2^d - 1 (this might again be proven by induction). End.

There is an alternative definition of binomial trees, which gives rise to the same structures:

Smallest Binomial Trees

Binomial Forests

To create a structure with n nodes for some n which is not a power of two, we simply use the binomial trees corresponding to the ones in the binary expansion of n. For example, for n = 45, which is 101101 in binary, we would take BNT_5, BNT_3, BNT_2 and BNT_0. Here BNT_d denotes the binomial tree of depth d. Such a structure with at most one binomial tree of each depth is called a binomial forest.

Binomial Forest with 45 Nodes
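The shape of a binomial forest with n nodes can thus be read off from the bits of n. A minimal sketch, with class and method names chosen for illustration and n assumed to be non-negative:

```java
import java.util.ArrayList;
import java.util.List;

class BinomialDecomposition {
    /** Depths d of the trees BNT_d in a binomial forest with n >= 0 nodes, smallest first. */
    static List<Integer> treeDepths(int n) {
        List<Integer> depths = new ArrayList<>();
        for (int d = 0; (n >> d) != 0; d++)
            if (((n >> d) & 1) == 1)       // bit d of n is set: the forest contains a BNT_d
                depths.add(d);
        return depths;
    }
}
```

For n = 45 this yields the depths 0, 2, 3 and 5, matching the example above.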

Using the second definition of binomial trees and the binary addition, it is easy to merge a binomial forest with n_1 nodes with a binomial forest with n_2 nodes: starting with the smallest trees in each forest, if there are two or three trees of the same depth d, two of them are linked to one tree of depth d + 1. The number of these operations is bounded by the length of the binary expansions of n_1 and n_2 and is thus bounded by O(log (n_1 + n_2)). In the literature the operation of combining two data structures to a single one is mostly called melding rather than merging.

As an example we consider n_1 = 22 = 10110 and n_2 = 23 = 10111 (both written in binary). There is a single BNT_0, which remains unchanged. The two BNT_1 are linked and give a BNT_2. Now there are three BNT_2, two of which are linked and give a BNT_3. This is the sole BNT_3 and it survives. Finally, the two BNT_4 are linked to one BNT_5. The result is a binomial forest with 45 nodes.

Merging Two Binomial Forests

Binomial Heaps

The explanation of binomial trees and forests so far was not specific to priority queues. Possibly these interesting structures may also be useful for realizing other ADTs. Now we will see how binomial forests can be used for a simple and efficient implementation of priority queues called binomial heaps.

Each node of the forest is used for storing one entry. Each tree is organized as a heap (here we encounter an example of heaps with a non-uniform structure), but there is no condition on how the keys are distributed over the trees. As a result the smallest element may stand in any of the trees. We give an example: a priority queue with 29 entries can be realized as a binomial forest with 29 nodes, binomial trees of size 16, 8, 4 and 1, each tree being a heap:

Binomial Heap with 29 Nodes

How to perform the operations? Findmin is easy, it can be performed in O(log n) time, by determining the minimum value of the keys stored in the roots of all of the at most log n trees in the forest.

The other operations are built on the merge operation, so let us first consider how a merge can be performed efficiently. We already know how to merge two binomial forests into one new binomial forest. The only open question is how to assure that all resulting trees have the heap property afterwards. However, this is trivial: when joining two BNT_d to one BNT_{d + 1}, the idea is to always hook the tree with the larger root to the tree with the smaller root. In this way the heap property holds for the root of the new tree and because the remaining structure is unchanged it holds for all nodes. Thus, each of these join operations can be performed in O(1) time and thus two forests with n_1 and n_2 nodes, respectively, can be merged in O(log (n_1 + n_2)) time.

Insert and Deletemin

Now that we know how to perform merges, all other operations are easy! For inserting we just create a new binomial forest with a single node. This takes constant time. Then we merge it with the existing forest. This takes O(log n) time.

For a deletemin, we look at the at most log n roots of the binomial trees. The minimum is the minimum of these roots. This minimum element is removed. If this is the root of a BNT_d, removing the root results in a bunch of new trees, a BNT_0, a BNT_1, ..., a BNT_{d - 1}. Each of these trees is a heap itself, and thus they constitute a binomial forest with 2^d - 1 nodes in their own right. This forest is merged with the rest of the binomial forest to obtain the resulting binomial forest. Finding the correct root and removing it takes O(log n) time, and merging the two forests also takes O(log n) time. So, even deletemin can be performed in O(log n) time.
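The operations described so far can be put together in a few lines. The following is a minimal sketch, not the implementation belonging to this text: the class BinomialHeapSketch, its methods and the choice to store the forest as a list indexed by d are assumptions made for illustration. The merge works like adding two binary numbers, with a linked tree playing the role of the carry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a binomial heap; all names are illustrative assumptions.
class BinomialHeapSketch {
    // Root of a binomial tree; children.get(i) is the root of a BNT_i.
    static final class Tree {
        final int key;
        final List<Tree> children = new ArrayList<>();
        Tree(int key) { this.key = key; }
    }

    // trees.get(d) is the BNT_d of the forest, or null if there is none.
    private List<Tree> trees = new ArrayList<>();

    // Join two BNT_d into one BNT_{d+1}: hook the larger root under the smaller one.
    static Tree link(Tree a, Tree b) {
        if (a.key > b.key) { Tree t = a; a = b; b = t; }
        a.children.add(b);
        return a;
    }

    // Merge two forests like adding two binary numbers; carry is a linked tree.
    static List<Tree> merge(List<Tree> f, List<Tree> g) {
        List<Tree> result = new ArrayList<>();
        Tree carry = null;
        int len = Math.max(f.size(), g.size());
        for (int d = 0; d < len || carry != null; d++) {
            Tree[] candidates = {
                d < f.size() ? f.get(d) : null,
                d < g.size() ? g.get(d) : null,
                carry
            };
            Tree keep = null;
            carry = null;
            for (Tree t : candidates) {
                if (t == null) continue;
                if (keep == null) keep = t;                  // first or third BNT_d stays
                else { carry = link(keep, t); keep = null; } // two BNT_d give a BNT_{d+1}
            }
            result.add(keep);
        }
        return result;
    }

    // Insert: merge with a forest consisting of a single BNT_0.
    void insert(int key) {
        List<Tree> single = new ArrayList<>();
        single.add(new Tree(key));
        trees = merge(trees, single);
    }

    // Deletemin: remove the smallest root and merge its children back in.
    int deleteMin() {
        Tree min = null;
        for (Tree t : trees)
            if (t != null && (min == null || t.key < min.key)) min = t;
        if (min == null) throw new IllegalStateException("empty heap");
        trees.set(trees.indexOf(min), null);
        trees = merge(trees, min.children); // children are BNT_0, ..., BNT_{d-1}
        return min.key;
    }

    // Bit d is set iff a BNT_d is present; this equals the number of entries.
    int rankMask() {
        int mask = 0;
        for (int d = 0; d < trees.size(); d++)
            if (trees.get(d) != null) mask |= 1 << d;
        return mask;
    }
}
```

After inserting 29 keys, rankMask() returns 29 = 11101, reflecting the trees of size 16, 8, 4 and 1 from the example above.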

Inserts can also be performed directly without relying on the merge routine. The idea is that we look for the smallest index j so that b_j = 0 (referring to the binary expansion of n). Then we know that the trees T_0, ..., T_{j - 1} together with the new element have to be merged into one new tree with 2^j nodes, which replaces the smaller trees in the binomial forest. One can do this in several ways. One possible way is to add the new element as a new root and then to percolate down. This is correct but not very efficient: at the top level, we have j comparisons, at the next level up to j - 1, and so on. The whole operation takes O(j^2) operations. So, it appears better to nevertheless stick more closely to the merging pattern: first we add the new element to T_0 and create a new tree T'_1, which is added to T_1. This gives a new tree T'_2, which is added to T_2. Etc. This is simple and requires only O(j) time.

Operations on Binomial Forests

Expected Insertion Time

In this section n denotes the number of nodes in the binomial forest and d = round_down(log n). Above we have shown that inserts and deletemins can be performed in O(log n) time. In practice, however, the inserts are mostly much faster.

We analyze the expected time for an insert. The time for inserting an element to a binomial forest with n elements depends on the binary expansion of the number n. Let this be (b_d, ..., b_3, b_2, b_1, b_0), and let z_n be the smallest index so that b_{z_n} = 0. Then the insert involves only the trees BNT_i with i < z_n. Thus, such an insert can be performed with z_n comparisons. If we have just an arbitrary forest, whose number of elements is uniformly distributed, then b_j = 0 with 50% probability for any j, so that z_n = j with probability at most 1 / 2^j. Thus, the expected number of comparisons for an insert can be given by

T_exp <= sum_{j >= 0} j / 2^j = 2.
This shows that the expected time for insertion is constant, O(1).
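The bound can be checked numerically. The sketch below, with the hypothetical class name ExpectedInsertCost, computes z_n as the number of trailing ones of n and averages it over all sizes n with a fixed number of bits. The observed average stays just below 1, well within the bound of 2 derived above.

```java
// Hypothetical helper for experimenting with z_n; not part of the original program.
class ExpectedInsertCost {
    // z_n: index of the lowest zero bit of n, that is, the number of trailing ones.
    static int z(int n) {
        return Integer.numberOfTrailingZeros(~n);
    }

    // Average of z_n over all forest sizes n with 0 <= n < 2^bits.
    static double averageZ(int bits) {
        long total = 0;
        for (int n = 0; n < (1 << bits); n++) total += z(n);
        return (double) total / (1 << bits);
    }

    public static void main(String[] args) {
        System.out.println("z_22 = " + z(22) + ", z_23 = " + z(23));
        System.out.println("average over 20-bit sizes: " + averageZ(20));
    }
}
```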

The above result assumes nothing about the distribution of the keys; it only assumes that we have no a priori knowledge about the size n of the binomial forest. Therefore, this is already much stronger than the earlier result for binary heaps, where we needed that the keys were uniformly distributed, a fact which lies outside the influence of the programmer: for certain inputs it is good, for others it is not.

Amortized Insertion Time

We now analyze what happens if a sequence of consecutive insertions is performed. Even though we cannot exclude that a single operation requires O(log n) time, these unlucky events do not cluster: we will prove that any sequence of m >= log n insertions takes at most O(m) time.

For this analysis we need some theory. First we consider a problem from daily life with the same spirit. Consider a person who wants to keep track of his or her expenses. There are numerous smaller and larger expenses, so this requires quite a considerable bookkeeping and it is likely that some expenses are forgotten. Assume this person has a regular income of 1000 units per month and had 1270 units in the account at the beginning of the year and 490 units at the end of the year. Then without knowing how much was spent when and where, we can immediately conclude that the sum of all expenses during the year has been 12 * 1000 + 1270 - 490 = 12780.

When trying to determine cost functions in computer science quite often one has to perform "clever bookkeeping". Costs are allocated to operations that did not really cause them in order to later not have to care when they arise. This idea will prove effective here too. It is quite common to make this bookkeeping explicit using tokens. A token is a cost unit. It costs one unit to deposit a token. This can be viewed as a prepayment for future operations: consuming a token, that is, executing an operation for which a token has been deposited earlier, is considered to be free. More precisely, the amortized time is given by

t_amortized = t_actual + number_of_deposited_tokens - number_of_consumed_tokens.
The total amount of deposited tokens gives the potential of the data structure. The amortized time equals the actual time plus the change of the potential.

If the amortized time as defined above for operations on a data structure of size n can be bounded to t(n) and p(n) gives an upper bound on the potential, then any sequence of m operations takes at most m * t(n) + p(n) time, which means that for m >= p(n) / t(n) the average time per operation for any sequence of operations is bounded by (m * t(n) + p(n)) / m <= 2 * t(n) = O(t(n)). So, the intuitive notion of amortized time as being the average time over a sufficiently long sequence of operations asymptotically coincides with the formalized definition in terms of a potential.

Lemma: The amortized time for performing insertions on a binomial forest is constant. For a structure with n nodes, the used potential has maximum value O(log n).

Proof: Above we noticed that an insert on a forest of size n costs O(1 + z_n) time. Thus, the real cost of an operation is proportional to 1 + z_n. Here z_n gives the number of ones in the binary expression of n which have to be turned into zeroes. This number can be as large as log n. However, for any number n, there is exactly one position in which n has a zero where n + 1 has a one. So, it does not cost much to deposit one token for every newly created one. Furthermore, if we start with one token for every one, we can assume that at all times there is a token available for each one in the binary expression of n. Said otherwise, as potential we use the number of ones in the binary expression of n. For the amortized time this gives t_amortized = 1 + z_n + 1 - z_n = 2. End.

Corollary: Any sequence of m >= log n consecutive insertions to a binomial forest with n elements can be performed in O(m) time.
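The token argument can be replayed on concrete numbers. In the following sketch (the class AmortizedInsertCost and the cost model of exactly 1 + z_n units per insert are assumptions for illustration) the total actual cost of m consecutive inserts starting from size n is summed up. Because every insert has amortized cost 2 and the potential is the number of ones in the binary expression, the total must equal 2 * m + ones(n) - ones(n + m).

```java
// Hypothetical cost model: one insert into a forest of size n costs 1 + z_n units.
class AmortizedInsertCost {
    static int insertCost(int n) {
        return 1 + Integer.numberOfTrailingZeros(~n);
    }

    // Total actual cost of m consecutive inserts, starting from a forest of size n.
    static long totalCost(int n, int m) {
        long total = 0;
        for (int i = 0; i < m; i++) total += insertCost(n + i);
        return total;
    }

    // Prediction of the potential argument: 2 * m minus the change of the potential.
    static long predictedCost(int n, int m) {
        return 2L * m + Integer.bitCount(n) - Integer.bitCount(n + m);
    }

    public static void main(String[] args) {
        // Starting at n = 1023 the first insert is expensive, the following ones are cheap.
        System.out.println(totalCost(1023, 10) + " = " + predictedCost(1023, 10));
    }
}
```

For n = 1023 and m = 10 the first insert costs 11 units, but the total is only 27, below the bound 2 * m + log n.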

Leftist Heaps

Merging Heaps

Suppose that we have two heap-ordered trees rooted at the nodes u and v, respectively. These can be merged by the following recursive procedure:
  node merge(node u, node v) {
    if (u.key <= v.key) {
      if (u has less than the maximum number of children)
        hook v directly to u;
      else
        recursively merge v with one of the children of u;
      return u; }
    else {
      if (v has less than the maximum number of children)
        hook u directly to v;
      else
        recursively merge u with one of the children of v;
      return v; } }

This formulation is recursive, but in an actual implementation it is better to take the recursion out because the recursion may have large depth, possibly causing problems with the size of the recursion stack. Furthermore, because the heap operations may be the time-determining factor in a program using them, the intrinsic inefficiency of recursion is also a matter of concern.

The resulting structure has the heap property again, because nodes with larger keys are always hooked to nodes with smaller keys. The problem is that the structure of the resulting heap may violate imposed conditions. For example, generally the result will not be a binary heap with all leaves on at most two levels, not even when both u and v are the root of such structures.

Just as for binomial heaps, if we know how to perform merge, the other operations are easy: insert is performed by merging a new 1-node tree with the existing tree; deletemin is performed by deleting the root node and merging the two resulting heaps; findmin is trivial: just return the key of the root of the tree. So, by fixing more precisely how to perform merge, we can obtain several priority-queue data structures.

Merging Binary Heaps

Merging Leftist Heaps

Leftist heaps are one of the possible concretizations of the above idea. A leftist heap is a binary heap-ordered tree for which the operations are based on a special way of performing the merge operation. Because these binary trees do not have a fixed structure it appears very hard to use arrays for the implementation and to compute the addresses of the children or the parent of a node. Instead, leftist heaps and the other heap structures presented in the remainder of this chapter are based on linked structures. Linking the nodes together facilitates finding the children or parent nodes, but requires additional memory. This is a substantial disadvantage even if there is sufficient memory, because in practice, due to additional cache faults and inefficient table look-up, using more memory typically means that everything goes slower. Nevertheless, some of these data structures outperform binary heaps even in practical contexts, particularly if many decreasekey operations have to be performed.

In leftist heaps the merge proceeds along the right path of the trees. Here the right path of a subtree rooted at a node u is the path obtained by starting at u and then always taking the right branch of the nodes until a node has no right child. Because it takes constant time to process any node on the right paths and no other nodes have to be processed, the time of a merging operation is proportional to the sum of the lengths of the right paths of the heaps rooted at u and v.

The idea to proceed along the right paths is simple, but in itself it does not promise good performance. On the contrary, as a result of the merges the right path becomes longer. Therefore certain balancing operations will be performed which guarantee that the right path of a tree with n nodes has length at most log n. Once we have proven this, it follows that merging trees with n_1 and n_2 nodes, respectively, can be performed in O(log n_1 + log n_2) time. Because the time for inserts and deletemins is of the same order, this will imply that all operations on a heap of maximum size n can be performed in O(log n) time.

Definition: For any node u of a tree T, the null path length of u, denoted npl(u), is the distance of the shortest path from u to a node v in the subtree of u which has less than the maximum allowed number of children.

So, for binary trees the null path length of a node u is the minimum distance to a node with 0 or 1 children. Observe that the definition of npl(u) as a minimum path length implies that for any node u of a tree T, npl(u) = 0 if u has less than the maximum allowed number of children, and that otherwise npl(u) = min_{v is child of u} npl(v) + 1.

Definition: A binary tree T is said to be leftist, if for any node u of T with left and right child v_l and v_r, respectively, npl(v_r) <= npl(v_l). Here it is conventionally assumed that npl(empty) = -1.
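The two definitions translate directly into recursive tests. The sketch below uses a hypothetical class NullPathLength with a bare Node type without keys. It recomputes npl from scratch at every node, which is fine for checking small examples but too slow for use inside a heap, where npl is stored in the nodes.

```java
// Hypothetical helper class for checking the definitions on small trees.
class NullPathLength {
    static final class Node {
        Node left, right;
        Node(Node left, Node right) { this.left = left; this.right = right; }
    }

    // npl(empty) = -1; otherwise one more than the smaller npl of the two children.
    static int npl(Node u) {
        if (u == null) return -1;
        return 1 + Math.min(npl(u.left), npl(u.right));
    }

    // Leftist: at every node the right child has the smaller or equal npl.
    static boolean isLeftist(Node u) {
        if (u == null) return true;
        return npl(u.right) <= npl(u.left) && isLeftist(u.left) && isLeftist(u.right);
    }

    // Length (number of edges) of the right path starting at u.
    static int rpl(Node u) {
        int length = 0;
        while (u != null && u.right != null) { u = u.right; length++; }
        return length;
    }
}
```

A node with fewer than two children gets npl 0, because the minimum over its children includes the value -1 of the empty subtree.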

Leftist Tree

Let rpl(u) denote the length of the right path starting from a node u.

Lemma: For a leftist tree, for any node u, rpl(u) = npl(u).

Proof: The lemma is correct for any node u which does not have degree 2. If the degree of u is 0, this is obvious. If the degree of u is 1, due to the leftist property, the right subtree of u must be the empty one. For any node u which has children v_l and v_r, respectively, rpl(u) = 1 + rpl(v_r). Applying induction over the value of rpl, we may assume that rpl(v_r) = npl(v_r). Using the leftist property of u, we get rpl(u) = 1 + rpl(v_r) = 1 + npl(v_r) = 1 + min{npl(v_l), npl(v_r)} = npl(u). End.

Lemma: A leftist tree with root r has size at least 2^{rpl(r) + 1} - 1.

Proof: For a tree with a single node r, rpl(r) = 0, and 2^{0 + 1} - 1 = 1 as it should be. Now assume the lemma is correct for all leftist trees with rpl(r) <= k. Consider a leftist tree T with rpl(r) = k + 1. Let v_l and v_r be the left and right child of r, respectively. Because of the above lemma and the leftist property, rpl(v_l) = npl(v_l) >= npl(v_r) = rpl(v_r) = rpl(r) - 1 = k. This implies that the size of T is at least (2^{k + 1} - 1) + (2^{k + 1} - 1) + 1 = 2^{rpl(r) + 1} - 1. End.

In other words, a leftist tree with root r is at least as big as a perfect binary tree of depth rpl(r).

Corollary: For a leftist tree with n nodes rpl(r) <= round_down(log (n + 1)) - 1.

So, leftist trees can be merged in logarithmic time, but the merging may disturb the structural property. Therefore, the value npl(u) is maintained at every node. The merging goes down the right paths and after the recursive calls it is tested whether to exchange the children or not. Then it is also time to update the value of npl.

Implementation

The above ideas have been worked out in a Java program which can be downloaded here. The basic object class is Node. A node has four instance variables: the null path length, a key and two nodes. The kernel of the program is the method for merging two heaps in the class Node. It is made static in order to express the symmetry of the operation and to prevent mixing up things.
  static Node merge(Node u, Node v)
  {
    Node dummy;

    // Assure that the key of u is smallest
    if (u.key > v.key)
    {
      dummy = u;
      u = v;
      v = dummy;
    }

    if (u.right == null) // Hook v directly to u
      u.right = v;
    else // Merge recursively
      u.right = merge(u.right, v);

    // Conditionally swap children of u
    if (u.left == null || u.right.npl > u.left.npl)
    {
      dummy = u.right; 
      u.right = u.left;
      u.left = dummy; 
    }

    // Update npl values
    if (u.right == null)
      u.npl = 0;
    else
      u.npl = Math.min(u.left.npl, u.right.npl) + 1;

    return u; 
  }

The methods insert and deletemin are implemented in the class LeftistHeap. A LeftistHeap has a single instance variable root, giving the root of the heap. This class also contains printing and testing methods, which, using the access at the root, are handed down to the class Node. The methods are trivial except for the fact that we must be careful with null pointers.

  public void insert(int key)
  {
    if (root == null)
      root = new Node(key);
    else
      root = Node.merge(root, new Node(key));
  }

  public int deleteMin()
  {
    if (root == null)
    {
      System.out.println("EMPTY HEAP !!!");
      return Integer.MAX_VALUE;
    }
    else
    {
      int x = root.key;
      if (root.right == null) // Also covers case of single node
        root = root.left;
      else
        root = Node.merge(root.left, root.right);
      return x;
    }
  }
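For experimenting without the downloadable program, the methods above can be condensed into one self-contained class. The following is a sketch under assumptions: the class LeftistHeapDemo, the null-tolerant helper npl(Node), the size counter and the exception on an empty heap are additions for illustration and need not agree with the original program. The method bound computes round_down(log (n + 1)) - 1 from the corollary, so that the length of the right path can be compared with it.

```java
// Hypothetical condensed leftist heap; names are illustrative assumptions.
class LeftistHeapDemo {
    static final class Node {
        int key, npl;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    Node root;
    int size;

    // Null-tolerant access to the stored null path length.
    static int npl(Node u) { return u == null ? -1 : u.npl; }

    static Node merge(Node u, Node v) {
        if (u == null) return v;
        if (v == null) return u;
        if (u.key > v.key) { Node t = u; u = v; v = t; } // u gets the smaller key
        u.right = merge(u.right, v);                     // proceed along the right path
        if (npl(u.right) > npl(u.left)) {                // restore the leftist property
            Node t = u.left; u.left = u.right; u.right = t;
        }
        u.npl = npl(u.right) + 1;
        return u;
    }

    void insert(int key) {
        root = merge(root, new Node(key));
        size++;
    }

    int deleteMin() {
        if (root == null) throw new IllegalStateException("empty heap");
        int x = root.key;
        root = merge(root.left, root.right);
        size--;
        return x;
    }

    // Number of edges on the right path starting at the root.
    int rightPathLength() {
        int length = 0;
        for (Node u = root; u != null && u.right != null; u = u.right) length++;
        return length;
    }

    // round_down(log (n + 1)) - 1 for n >= 1.
    static int bound(int n) {
        return (31 - Integer.numberOfLeadingZeros(n + 1)) - 1;
    }
}
```

Inserting keys in any order and comparing rightPathLength() with bound(size) after every operation is a simple way to see the corollary at work.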

Leftist Heap Operations

Decreasekey

The leftist heaps offer an interesting alternative priority queue data structure. It efficiently supports merge, insert, findmin and deletemin. But how about the decreasekey operation? This operation is of central importance in several priority-queue applications. On binary heaps, d-heaps or binomial heaps decreasekey can be performed by an assignment of the new key value followed by a percolate-up. This takes time proportional to the depth at which the node is located. For all these heaps this depth is logarithmic. A delete operation can be performed by setting the key to -infinity and then performing a deletemin. An increasekey can always be performed as a delete followed by an insert with a new larger key. On heaps with degree larger than 2 this is faster than a percolate-down.

On a leftist heap a percolate-up may take Omega(n) time. In itself this is not that serious as long as the amortized time would be good. But, if the tree consists of a chain of n nodes all lying on the left path from the root, repeatedly giving a new minimum value to the deepest lying node, results in an arbitrarily long sequence of operations each taking Omega(n) time. This is unacceptable.

There is an alternative way to perform a decreasekey for node u in a tree T rooted at r. If the new key of u violates the heap order, then we simply cut the link between u and its parent v. As a result we get two trees, which are merged. A practical disadvantage of this idea is that now every node must have a link to its parent as well. More serious is that the cutting may upset the leftist-tree property. Only nodes on the path from r to v have lost nodes from a subtree, and therefore only these nodes may violate the leftist-tree property. Thus, this property can be reestablished for all nodes in the tree by walking up the path from v and swapping the children if necessary.

This is simple but in principle we might still be running up all the way to the root. However, this is unnecessary. If the deletion of u from a subtree rooted at a node v' at distance k from v still has a shortening impact on the null path length of v', this implies that originally npl(v') >= k. Because this subtree is leftist, it follows that it has at least 2^{k + 1} - 1 nodes, and therefore, there is no need to check nodes v' at distances more than round_down(log(n + 1)) - 1 from v. So, the decreasekey operation has been decomposed into two operations which each can be performed in logarithmic time.

Theorem: A decreasekey operation can be performed on a leftist heap with n elements in O(log n) time.
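The cut-and-merge idea can be sketched as follows. This is an illustration under assumptions, not the code of the original program: the class LeftistDecreaseKey, the parent field and the helpers fix and valid are hypothetical. Instead of counting the distance from v, the walk up the tree stops at the first node whose npl does not change, because the nodes above it are then unaffected.

```java
// Hypothetical sketch of decreasekey on a leftist heap with parent links.
class LeftistDecreaseKey {
    static final class Node {
        int key, npl;
        Node left, right, parent;
        Node(int key) { this.key = key; }
    }

    static int npl(Node u) { return u == null ? -1 : u.npl; }

    // Swap the children of u if necessary and update npl; report whether npl changed.
    static boolean fix(Node u) {
        if (npl(u.right) > npl(u.left)) { Node t = u.left; u.left = u.right; u.right = t; }
        int old = u.npl;
        u.npl = npl(u.right) + 1;
        return u.npl != old;
    }

    static Node merge(Node u, Node v) {
        if (u == null) return v;
        if (v == null) return u;
        if (u.key > v.key) { Node t = u; u = v; v = t; }
        u.right = merge(u.right, v);
        u.right.parent = u;
        fix(u);
        return u;
    }

    // Returns the root of the heap after giving u the smaller key newKey.
    static Node decreaseKey(Node root, Node u, int newKey) {
        u.key = newKey;
        Node v = u.parent;
        if (v == null || v.key <= newKey) return root; // heap order still holds
        if (v.left == u) v.left = null; else v.right = null; // cut u from v
        u.parent = null;
        for (Node w = v; w != null && fix(w); w = w.parent) { } // repair upwards
        return merge(root, u);
    }

    static Node deleteMin(Node root) {
        if (root.left != null) root.left.parent = null;
        if (root.right != null) root.right.parent = null;
        return merge(root.left, root.right);
    }

    // Checks heap order, parent links, the leftist property and the stored npl values.
    static boolean valid(Node u) {
        if (u == null) return true;
        for (Node c : new Node[] {u.left, u.right})
            if (c != null && (c.parent != u || c.key < u.key)) return false;
        return npl(u.right) <= npl(u.left) && u.npl == npl(u.right) + 1
            && valid(u.left) && valid(u.right);
    }
}
```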

Decreasekey on Leftist Heap

Skew Heaps

In the chapter on search trees we have seen splay trees. Splay trees are binary trees with the search-tree property. But, unlike AVL trees, there is no bound on the degree of unbalance they may have. Nevertheless, due to the splaying operation we could prove good amortized running time. Splay trees are self-adjusting in the sense that they achieve this without maintaining information about the balances in the tree. The relation between skew heaps and leftist heaps is the same as that between splay trees and AVL trees: a skew heap is a self-adjusting variant of a leftist heap, which may grow arbitrarily unbalanced, but for which we nevertheless can prove O(log n) amortized time for the operations. This is achieved without maintaining balancing information.

Merging Skew Heaps

Actually skew heaps are much more similar to leftist heaps than splay trees are to AVL trees. The only difference between a skew heap and a leftist heap is that in a skew heap when merging, the swapping of the children of a visited node on the right path is performed unconditionally. So, now the merging looks as follows:
  static Node merge(Node u, Node v, int d)
  {
    Node dummy;

    // Assure that the key of u is smallest
    if (u.key > v.key)
    {
      dummy = u;
      u = v;
      v = dummy;
    }

    if (u.right == null) // Hook v directly to u
    {
      u.right = v;
      System.out.println("Reached depth = " + d);
    }
    else // Merge recursively
      u.right = merge(u.right, v, d + 1);

    // Swap children of u
    dummy = u.right; 
    u.right = u.left;
    u.left = dummy; 

    return u; 
  }

This method is found in a Java program which can be downloaded here. It can be used for playing around with the data structure, for example when trying to construct an input sequence leading to a long right path.

The purpose of the swapping is to keep the length of the right path bounded. This cannot prevent the right path of a skew heap from growing to a length of Omega(n). Nevertheless it is quite effective. The reason is that the insertions are made on the right side, creating a rather heavy right side. Then everything is swapped, creating a relatively light right side. So, the rather good behavior of skew heaps is due to the combination of the choice to always go right and the swapping.

The self-adjustment in skew heaps has decisive advantages over the balancing in leftist heaps:

Skew-Heap Operations

Amortized Performance

The critical point in amortized analysis is how to choose the potential function, or saying it differently, to determine when the algorithm should pay a token. In the case of the a-b trees, a token was paid when creating a critical node. In the case of the binomial forests, a token was paid for creating a tree of a previously not occurring size. In the case of skew heaps, it is less easy, and it will not be sufficient to pay a constant number of tokens.

The general point in the previous analyses, and that is common to all amortized analyses, is that one introduces some additional cost measure: number of critical nodes, number of trees in the forest. Here we introduce the cost measure number of heavy nodes. A node is said to be heavy if the size of its right subtree is strictly larger than the size of its left subtree. A node which is not heavy is said to be light. The essential point is that swapping the children of a heavy node turns it into a light node. The potential is now given by the number of heavy nodes. Or said otherwise: a token must be deposited for creating new heavy nodes.

Light and Heavy Nodes

The cost of merging two heaps H_1 and H_2 with n_1 and n_2 nodes, respectively, is bounded by the sum of the lengths of their right paths. We know that this path can be very long, so we cannot put a reasonable bound on this. However, as an invariant we will maintain that the cost for all heavy nodes has already been covered by earlier deposited tokens: at all times each heavy node holds at least one token. So, it suffices to bound the number of light nodes on these paths.

Lemma: A tree with k light nodes on its right path has at least 2^k - 1 nodes.

Proof: The proof goes by induction. For a tree consisting of a single node, which is light, the lemma is correct, because 2^1 - 1 = 1. So, assume the lemma is correct for some k > 0. Consider a tree T with k + 1 light nodes on its right path. Let r be the root of T and let u be the first node on the right path of T which is light. Let v_l and v_r be the left and right child of u, respectively. The right child v_r has exactly k light nodes on its right path. So, the subtree rooted at v_r has at least 2^k - 1 nodes. Because u is a light node, the subtree rooted at v_l must be at least as large. So, the subtree rooted at u has at least 2 * (2^k - 1) + 1 = 2^{k + 1} - 1 nodes. T is at least as large as this subtree. End.

In other words, a tree with k light nodes on its right path is at least as large as a perfect binary tree of depth k - 1. Thus, the total number of light nodes on the right paths of H_1 and H_2 is bounded by round_down(log(n_1 + 1)) + round_down(log(n_2 + 1)).
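The lemma can be observed on small skew heaps. The sketch below is an illustration under assumptions: the class SkewHeapLightNodes and its helper methods are hypothetical, and the subtree sizes are recomputed recursively, which is only acceptable for experiments. The merge is the skew-heap merge with unconditional swapping, written with null checks instead of the depth parameter.

```java
// Hypothetical sketch for counting light nodes on the right path of a skew heap.
class SkewHeapLightNodes {
    static final class Node {
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    // Skew-heap merge: go down the right path and swap the children unconditionally.
    static Node merge(Node u, Node v) {
        if (u == null) return v;
        if (v == null) return u;
        if (u.key > v.key) { Node t = u; u = v; v = t; }
        Node merged = merge(u.right, v);
        u.right = u.left; // the old left child becomes the right child
        u.left = merged;  // the merged right side ends up on the left
        return u;
    }

    static int size(Node u) {
        return u == null ? 0 : 1 + size(u.left) + size(u.right);
    }

    // Heavy: the right subtree is strictly larger than the left subtree.
    static boolean isHeavy(Node u) {
        return size(u.right) > size(u.left);
    }

    static int lightOnRightPath(Node root) {
        int k = 0;
        for (Node u = root; u != null; u = u.right)
            if (!isHeavy(u)) k++;
        return k;
    }

    public static void main(String[] args) {
        Node root = null;
        for (int i = 0; i < 100; i++) root = merge(root, new Node((i * 37) % 101));
        int k = lightOnRightPath(root);
        System.out.println(size(root) + " nodes, " + k + " light nodes on the right path");
    }
}
```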

It remains to prove a similar bound on the number of tokens to deposit in any merge operation. For this one should wonder how new heavy nodes arise. A new heavy node can only be a node on one of the right paths of the heaps H_1 and H_2 to merge, because these are the only nodes for which the balance changes. Furthermore, it is important that the merge increases the size of the right subtree or leaves it unchanged, but does not decrease it. Thus, before the swapping heavy nodes are still heavy. So, afterwards all of them become light: only formerly light nodes may become heavy. The number of them was bounded above.

For an extension to the insertion operation it is important to notice that the single node in a one-node heap is light. For an extension to the deletemin operation it is important to notice that removing the root node of a heap has no impact on the number of heavy nodes among the remaining nodes. This operation might thus reduce the number of heavy nodes by one, but not increase it.

So, starting from nothing, any sequence of m operations can be performed in O(m * log n) time, where n is the maximum heap size. Starting with a given tree, the cost of performing m operations is bounded by O(n + m * log n). For m >= n / log n, the average cost per operation is bounded by O(log n). This result is slightly stronger than that for splay trees. Because for splay trees the potential function has maximum value Omega(n * log n), n operations may have to be performed in order to amortize a possibly bad initial configuration.

Fibonacci Heaps

It is justified to say that the Fibonacci heaps are the climax of this chapter. Many of the previously introduced ideas are used here in some form. One might even say that the main reason for presenting some of the earlier structures was to facilitate the presentation of the Fibonacci heaps. A Fibonacci heap is a relaxed variant of a binomial heap. A decreasekey is realized by the cutting operation that is also used for leftist heaps. The only idea we have not encountered before is to perform the merges lazily. The performance of Fibonacci heaps is impressive: insert takes O(1) time (worst-case), decreasekey takes O(1) amortized time and deletemin takes O(log n) amortized time. Whenever the time consumption of Dijkstra's algorithm for finding the shortest-paths on a graph with n nodes and m edges is stated to be O(n * log n + m), implicit or explicit reference is made to the Fibonacci heaps.

Lazy Merging

Denote by T_k the binomial tree with 2^k nodes. A lazy binomial heap is a set of heap-ordered trees T_k. Several of these trees may have the same size. Adding a new key x to a lazy binomial heap is trivial: create a new tree T_0 with one node and key x and add it to the set of trees. This insertion has worst-case time O(1). The earlier constructions achieved O(1) amortized time at best.

When performing a deletemin, we scan the roots of all trees and find the minimum element. The corresponding root node is removed, generating some more trees. Then all trees are merged to a binomial forest, with at most one tree T_k for each k >= 0. This merging must be performed systematically: while scanning all trees for finding the minimum, the trees are sorted in buckets according to their size. If there are n nodes, then there are at most round_down(log n) + 1 different sizes. Then the merging proceeds as follows:

  for (k = 0; k < log(n); k++)
    while (there are two or more trees in bucket k)
    {
      merge the first two trees T and T' in bucket k to a tree T'';
      add T'' to bucket k + 1;
    }
Each merge takes constant time and reduces the number of trees by 1. In addition there is constant overhead for each bucket. Thus, the running time is bounded by O(log n + number_of_trees).
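The lazy strategy with buckets can be written out as follows. This is a minimal sketch under assumptions: the class LazyBinomialHeap and its methods are hypothetical names, the roots are kept in a plain list, and the buckets are created on demand instead of being sized by log n in advance. Which two trees of a bucket are merged first does not matter, so the sketch takes the last two.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a lazy binomial heap; names are illustrative assumptions.
class LazyBinomialHeap {
    static final class Tree {
        final int key;
        final List<Tree> children = new ArrayList<>();
        Tree(int key) { this.key = key; }
        int rank() { return children.size(); }
    }

    private List<Tree> roots = new ArrayList<>();

    int numberOfTrees() { return roots.size(); }

    // Lazy insert: just add a one-node tree, no merging at all.
    void insert(int key) { roots.add(new Tree(key)); }

    static Tree link(Tree a, Tree b) {
        if (a.key > b.key) { Tree t = a; a = b; b = t; }
        a.children.add(b);
        return a;
    }

    private static void addToBucket(List<List<Tree>> buckets, Tree t) {
        while (buckets.size() <= t.rank()) buckets.add(new ArrayList<>());
        buckets.get(t.rank()).add(t);
    }

    int deleteMin() {
        if (roots.isEmpty()) throw new IllegalStateException("empty heap");
        Tree min = roots.get(0);
        for (Tree t : roots) if (t.key < min.key) min = t;

        // Sort the children of the removed root and all other trees into buckets.
        List<List<Tree>> buckets = new ArrayList<>();
        for (Tree t : min.children) addToBucket(buckets, t);
        for (Tree t : roots) if (t != min) addToBucket(buckets, t);

        // Merge pairs of equal size until every bucket holds at most one tree.
        roots = new ArrayList<>();
        for (int k = 0; k < buckets.size(); k++) {
            List<Tree> bucket = buckets.get(k);
            while (bucket.size() >= 2) {
                Tree first = bucket.remove(bucket.size() - 1);
                Tree second = bucket.remove(bucket.size() - 1);
                addToBucket(buckets, link(first, second)); // goes to bucket k + 1
            }
            roots.addAll(bucket);
        }
        return min.key;
    }
}
```

After 29 inserts the heap consists of 29 one-node trees. The first deletemin leaves 28 = 11100 entries in three trees.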

Because there can be as many as n trees (when performing n inserts before performing the first deletemin), deletemin can take Omega(n) time. The amortized time is better. As potential we use the number of trees. This is equivalent to depositing a token whenever a new tree is created and consuming it whenever a tree is removed. Then the amortized cost of insert is still constant, because the potential is increased by 1, in addition to the constant execution time. The amortized time of a deletemin is given by

  t_amortized =  t_actual + potential_after - potential_before
              =  log n + number_of_merges + trees_after - trees_before
              <= 2 * log n = O(log n)
Here we used that removing the root replaces one tree by at most log n new trees and that each merge operation reduces the number of trees by one, so that number_of_merges + trees_after - trees_before <= log n.

Decreasekey

Now we would like to also perform decreasekey. A simple solution is to perform these by percolating up. Binomial trees have logarithmic depth, so this will cost O(log n) for a structure with n nodes. This is a reasonable result, but not better than achievable with binary heaps. We are striving for more. Therefore, we apply the decreasekey idea that was used for leftist heaps: when decreasing the key of some node u, it is first tested whether this violates the heap property. If not, no further action is needed. Otherwise, we simply cut the link between u and its parent v. The new tree rooted at u is added to the forest.

The only problem with this simple approach is that it destroys the structural property of the binomial trees. This cannot, as for leftist heaps, be restored by simply swapping some children. The structure in itself is not something we care about much, but we are concerned about the effect of this on the time to merge. If branches are arbitrarily cut out of trees, then it may happen that there are trees with a high root degree which nevertheless have only few nodes.

The root degree of a tree is called the rank of a tree. A binomial tree of rank k has 2^k nodes. This implies that in the whole forest the highest occurring rank is round_down(log n). The fact that the maximum rank is O(log n) was an essential ingredient in the proof that the amortized time for deletemin is O(log n). If a rank k tree might have as few as k + 1 nodes, then there is no such bound on the ranks. The idea is that we do not want to enforce perfect structure, we only want that the consequence is assured: the number of nodes in a tree of rank k must be exponential in k. This gives that the largest rank is O(log n).

An important observation is that it does not matter so much that a node loses a child as long as it does not lose too many of them. In Fibonacci heaps this idea is taken into account by distinguishing marked and unmarked nodes: An unmarked internal node which loses a child is marked. A marked node u which loses a child is cut from the tree. By this u becomes the root of a new tree and its mark is removed. So, roots are never marked. Internal nodes have a mark if since becoming internal they have lost exactly one child. Notice that by cutting u, its father v loses a child, which may imply that v gets marked or cut from the tree in turn. Therefore, a decreasekey leading to a cutting may lead to a cascade of further cuttings, possibly running all the way up to the root.
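The marking rule and the cascade of cuts can be sketched in a few lines. The class FibonacciCutSketch and the helpers addRoot and addChild are hypothetical and only serve to build small examples; the sketch covers the cutting part of decreasekey, not a complete Fibonacci heap. The children are kept in an ArrayList for brevity, so removing a child here takes time proportional to the rank; an implementation aiming at O(1) per cut would presumably link the siblings in a doubly linked list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the cutting rule; names are illustrative assumptions.
class FibonacciCutSketch {
    static final class Node {
        int key;
        boolean marked;
        Node parent;
        final List<Node> children = new ArrayList<>();
        Node(int key) { this.key = key; }
    }

    final List<Node> roots = new ArrayList<>();

    Node addRoot(int key) {
        Node u = new Node(key);
        roots.add(u);
        return u;
    }

    Node addChild(Node parent, int key) {
        Node u = new Node(key);
        u.parent = parent;
        parent.children.add(u);
        return u;
    }

    // Decreases the key of u and returns the number of cuts that were performed.
    int decreaseKey(Node u, int newKey) {
        u.key = newKey;
        if (u.parent == null || u.parent.key <= newKey) return 0; // heap order intact
        int cuts = 0;
        Node x = u;
        while (true) {
            Node p = x.parent;
            p.children.remove(x); // cut x from its father p
            x.parent = null;
            x.marked = false;     // roots are never marked
            roots.add(x);
            cuts++;
            if (p.parent == null) break;               // p is a root: nothing to mark
            if (!p.marked) { p.marked = true; break; } // first lost child: mark p
            x = p;                                     // second lost child: cut p as well
        }
        return cuts;
    }
}
```

In a chain r, a, b where b has two children c and d, decreasing the key of c below that of b cuts c and marks b. Decreasing the key of d afterwards cuts d and then b, and marks a.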

We summarize the features of a Fibonacci heap.

Fibonacci-Heap Operations

Amortized Performance

The analysis of the amortized performance is slightly harder than before. Simply taking as potential the number of trees in the forest does not work, because the cutting increases the number of trees. Because there may be many cuttings, this would not give an amortized time of O(1) for the decreasekey. However, a cutting reduces the number of marks. Thus, the marks should be worth more than the tokens deposited for the trees. This is the inspiration behind taking as potential number_of_trees + 2 * number_of_marks. Said otherwise, for the creation of a tree we must deposit one token, whereas for cutting the first child from an internal node we pay two tokens.

Let n denote the number of nodes in a Fibonacci heap. Insert takes O(1) actual time and increases the number of trees by 1, so the amortized cost of an insert is O(1). In the following we will show that the maximum rank of any tree in a forest with n nodes is bounded by O(log n). Because merging works only on roots of trees, the number of marks remains unchanged when merging. Thus, the amortized time for merging is bounded by O(log n). The amortized time for deletemin is similar: the removal of a root r creates at most O(log n) new trees to process, but does not create new marks. On the contrary, if any of the children of r had a mark, it is removed. So, the amortized time for deletemin exceeds the amortized time for merge by at most O(1 + log n). The amortized time of a decreasekey in which c cuts are performed is given by:

  t_amortized =  t_actual + potential_after - potential_before
              =  O(1) + c + 2 * (marks_after - marks_before)
                          + 1 * (trees_after - trees_before)
              <= O(1) + c - 2 * (c - 2) + c = O(1).
Here we took into account that the node whose key is decreased need not be marked, so that only the c - 1 subsequent cuts of the cascade are certain to remove a mark. Furthermore, every cascade of cuts stops at some node v. If v is not a root, then v is marked. So, the number of marks decreases by at least c - 2 when c cuts are performed.

It remains to prove that the maximum rank in a Fibonacci tree with n nodes is bounded by O(log n). The notion of rank is extended to all nodes: for a node u, rank(u) gives the number of children of u. The children of node v are added one-by-one. Therefore, for any node u we can define older(u) = rank of the father at the time u was added. If u is a root, then older(u) = 0.

Lemma: For any node u at any time, rank(u) >= max{0, older(u) - 1}.

Proof: If u is a root, then the statement is void. So, assume u is an internal node and let v be the father of u. At the time u was added as a child to v, we had rank(u) = rank(v) = older(u). Later u may have lost at most one child, but, of course, only if u had children to start with. End.

The Fibonacci numbers Fib() are defined by Fib(0) = 0, Fib(1) = 1, Fib(k) = Fib(k - 1) + Fib(k - 2), for all k >= 2.

Lemma: For any node u the subtree rooted at u, including u, has size at least Fib(rank(u) + 2).

Proof: Let S_k denote the size of the smallest tree with rank k. Clearly S_0 = 1 = Fib(2) and S_1 = 2 = Fib(3). So, assume the lemma is correct up to k - 1. Consider a node u with rank(u) = k. Denote the children, in the order they were added to u, by v_i, 0 <= i < k. Because of this definition, u certainly had at least i children at the time v_i was added. Possibly it even had one more child. So, older(v_i) >= i, for all i. The above lemma gives that rank(v_i) >= i - 1, for all i > 0, and rank(v_0) >= 0. Thus,

S_k = 1 + S_0 + S_0 + S_1 + ... + S_{k - 3} + S_{k - 2}.
Here the 1 takes the root node into account. This formula can also be understood as follows: a tree with rank k was composed by taking a root and connecting it to trees with ranks 0, 1, ..., k - 1. Each of the trees with rank >= 1 may later have lost at most one child, reducing their ranks by one. For k >= 2, the same formula, with all indices smaller by one also holds for S_{k - 1}:
S_{k - 1} = 1 + S_0 + S_0 + S_1 + ... + S_{k - 4} + S_{k - 3}.
Substituting in the above equation gives
    S_k = S_{k - 1}  + S_{k - 2}
        = Fib(k + 1) + Fib(k)
        = Fib(k + 2).
  

A slightly different proof can be obtained by using that 1 + sum_{i = 0}^{k - 2} Fib(i) = Fib(k), which can be proven by induction. Using this, we get

    S_k = 1 +       S_0       +  S_0   +  S_1   + ... + S_{k - 2}
        = 1 +      Fib(2)     + Fib(2) + Fib(3) + ... +  Fib(k)
        = 1 + Fib(0) + Fib(1) + Fib(2) + Fib(3) + ... +  Fib(k) 
        = Fib(k + 2).
  
End.

Corollary: The maximum rank of any node in a Fibonacci heap is bounded by O(log n).

Proof: For all k, Fib(k + 2) > Fib(k) ~ 1.618^k. Thus, the ranks are bounded by about log_1.618 n = 1.440 * log n. End.
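The bound can be checked numerically. The following is a minimal sketch, not part of the original course material: the class name FibRankBound and its methods are hypothetical. It computes, for a given n, the largest rank k with Fib(k + 2) <= n and compares it with 1.440 * log n.

```java
// Sketch under stated assumptions: FibRankBound, fib and maxRank are
// illustrative names, not taken from the course program.
final class FibRankBound {
    // Returns Fib(k) with Fib(0) = 0 and Fib(1) = 1.
    // Assumes k is small enough that the result fits in a long (k <= 92).
    static long fib(int k) {
        long a = 0, b = 1; // a = Fib(i), b = Fib(i + 1)
        for (int i = 0; i < k; i++) {
            long t = a + b;
            a = b;
            b = t;
        }
        return a;
    }

    // Largest rank k that a node can have in a heap with n >= 1 nodes,
    // that is, the largest k with Fib(k + 2) <= n.
    static int maxRank(long n) {
        int k = 0;
        while (fib(k + 3) <= n)
            k++;
        return k;
    }

    public static void main(String[] args) {
        for (long n = 10; n <= 1_000_000; n *= 10) {
            double bound = 1.440 * Math.log(n) / Math.log(2);
            System.out.println("n = " + n + ": maximum rank " + maxRank(n)
                + ", 1.440 * log n = " + bound);
        }
    }
}
```

For example, for n = 1000 the largest possible rank is 14, because Fib(16) = 987 <= 1000 < 1597 = Fib(17), while 1.440 * log 1000 is about 14.35.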

Pairing Heaps

A pairing heap is a strongly simplified self-adjusting variant of a Fibonacci heap. It is considered to be one of the most efficient priority queue implementations. Because of this and its simplicity it is much used in practical applications. A pairing heap is a tree without any restrictions on the degrees of the nodes. The only requirement is that this tree is heap ordered, that is, for any node u, the key of u is not larger than the key of any of its children.

Operations

For binary heaps all operations are based on percolations, and for binomial heaps all operations are based on the merge operation. Likewise, for pairing heaps all operations are based on the linking operation. For two heaps H_1 and H_2, with roots h_1 and h_2 with key values k_1 and k_2, respectively, the linking operation is performed as follows: k_1 and k_2 are compared. If k_1 < k_2, h_2 is added to the set of children of h_1. If k_1 >= k_2, h_1 is added to the set of children of h_2. Clearly a linking operation can be performed in constant time. For implementational reasons, a new child is usually added leftmost. We have encountered the linking operation before: it was also used to fuse two binomial trees of the same size into the next larger binomial tree.

Linking Two Pairing Heaps

The other operations are based on the linking operation. Inserting a new node h_2 into a heap H_1 is done by creating a heap H_2 with h_2 as its single node and linking H_1 and H_2. When decreasing the key of a node h_2 in a heap H_1, two cases are distinguished. If h_2 is not the root of H_1, then the heap H_2 rooted at h_2 is removed from H_1, the key of h_2 is decreased, and H_1 and H_2 are linked. If h_2 is the root of H_1, then its key is simply assigned the new smaller value. The same might be done for a non-root node if the new key value does not violate the heap property, but testing this requires that the nodes have links to their parents. In an efficient implementation there are no such links. Deletemin is slightly more complicated and offers room for variants. Removing the root node h from the heap H creates a set of l heaps, where l is the number of children of h. Denote these by H_i, for 0 <= i < l. These H_i must somehow be linked together. One correct way, which guarantees a good amortized running time, is the following:
  1. Link all pairs (H_0, H_1), (H_2, H_3), ... . Denote the new heaps by H'_i. If l is odd, then also H_{l - 1} is added to the set of new heaps. So, now there are l' = round_up(l / 2) heaps.
  2. Start by linking H'_{l' - 1} and H'_{l' - 2}. Then repeatedly link the resulting heap with the heap with next smaller index. So, finally H'_0 is linked to the result of linking all others.

Operations on Pairing Heaps
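The linking, insert and two-step deletemin described above can be sketched as follows. This is a simplified illustration and not the course program: the class PairingHeapSketch is a hypothetical name, the nodes carry only a key (no index and no position array), the sibling lists are null-terminated instead of using sentinels, and decreasekey is left out.

```java
// Sketch under stated assumptions: all names are illustrative.
final class PairingHeapSketch {
    static final class Node {
        final int key;
        Node next;     // next sibling in the list of children of the parent
        Node children; // leftmost child
        Node(int key) { this.key = key; }
    }

    Node root; // null for an empty heap

    // Links two heaps: the root with the larger key (h1 in case of a tie)
    // becomes the leftmost child of the other root. Takes constant time.
    static Node link(Node h1, Node h2) {
        if (h1 == null) return h2;
        if (h2 == null) return h1;
        if (h1.key < h2.key) {
            h2.next = h1.children;
            h1.children = h2;
            return h1;
        }
        h1.next = h2.children;
        h2.children = h1;
        return h2;
    }

    boolean isEmpty() { return root == null; }

    void insert(int key) { root = link(root, new Node(key)); }

    // Removes and returns the smallest key. The heap must not be empty.
    int deleteMin() {
        int min = root.key;
        // Step 1: link the children in pairs (H_0, H_1), (H_2, H_3), ...
        // The results are pushed on a list, so the last pair comes first.
        Node pairs = null;
        Node h = root.children;
        while (h != null) {
            Node a = h;
            Node b = h.next;
            h = (b == null) ? null : b.next;
            a.next = null;
            if (b != null) b.next = null;
            Node linked = link(a, b); // for odd l the last heap stays alone
            linked.next = pairs;
            pairs = linked;
        }
        // Step 2: starting with the heap of largest index, repeatedly link
        // the result with the heap of next smaller index.
        Node result = null;
        while (pairs != null) {
            Node rest = pairs.next;
            pairs.next = null;
            result = link(pairs, result);
            pairs = rest;
        }
        root = result;
        return min;
    }

    public static void main(String[] args) {
        PairingHeapSketch heap = new PairingHeapSketch();
        for (int key : new int[] {17, 4, 6, 8, 2, 23, 12, 14})
            heap.insert(key);
        while (!heap.isEmpty())
            System.out.print(heap.deleteMin() + " ");
        System.out.println();
    }
}
```

Running main removes the keys in increasing order. Because the pairs are collected on a list in reverse order, step 2 can process them from the largest index downwards without any extra bookkeeping.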

Implementation

The above formulation does not tell anything about the implementation. However, considering that the number of children of a node is arbitrary, it is natural to use a linked list for this. So, every node has several fields: a key, an index, a pointer to the next node in the list of its siblings, and a pointer to the list of its children. In addition, in order to be able to find the nodes, we need an array with pointers: at position h this array contains a link to the node with index h. There is no need to have links pointing upwards, from a child to its parent.

In the described setting, it is easy to add a child h_2 to a node h_1. It is particularly easy to add a leftmost child: h_2 is inserted at the beginning of the list of children of h_1:

  h2.next     = h1.children;
  h1.children = h2;

Removing a node h_1 is more complicated: because there are neither links to the parent node, nor to the node preceding h_1 in the list of siblings, it is not possible to relink the list so that h_1 is excluded. Instead we exclude h_2, the next node of h_1 in the list of siblings. To do this, the contents from h_1 and h_2 are exchanged and h_1 is set to point to the next node of h_2:

  // If the key and the pointer to the children are also part of the
  // fields of a node, then these should be exchanged here as well.
  position[h1.index] = h2; position[h2.index] = h1;
  int y = h2.index; h2.index = h1.index; h1.index = y;
  h1.next = h2.next;
This works fine, unless h_1 is the last node of the list. One solution is to let the last node point back to the parent node, it has a free pointer anyway. This saves memory, but requires extra testing and dealing with special cases. It is much more convenient to add a sentinel to all children lists. For a sentinel s we set s.next = s. This is easy to test and prevents running out of a list.

Deleting Nodes from Lists

Click here to see the above ideas and some more integrated in a Java program. Running the program reproduces the above example illustrating the operations, showing the heap after each of the operations.

Amortized Performance

Clearly inserts and decreasekeys can be performed in constant time. On the other hand, due to these operations the structure of the heap may become arbitrarily bad, and therefore a deletemin on a heap with n nodes may take Theta(n) time. However, expensive deletemins tend to improve the structure. This is not only true in some imprecise sense, but can be made precise: any sequence of x >= n operations on a pairing heap with at most n nodes can be performed in O(x * log n) time. In other words, the amortized time of all operations is bounded by O(log n).

For a long time it was believed that pairing heaps were as good as Fibonacci heaps, which have constant amortized time for the decreasekey operations. However, Fredman has shown (Journal of the ACM, Vol 46, pp. 473-501, 1999), that the amortized time for the decreasekey operation (attributing the cost resulting from the structure degradation to the decreasekey operation which causes it) is Omega(loglog n). Notice that it is not known whether this result is tight.

Exercises

  1. Prove that the heap-property implies that the element with the smallest key must stand in the root. For a binary heap, indicate with small sketches all positions where the second and third smallest element can stand.

  2. Consider a binary heap. Initially it is empty. Then a number of operations is performed: insert 17, 4, 6, 8, 2, 23, 12, 14 followed by 3 times deletemin. Draw the resulting heaps after each operation.

  3. Consider the example program. Make some modifications so that the number of calls to percolateUp and percolateDown can be determined. For n = 4^k, k = 6, ..., 11, determine the number of these calls performed when building a heap, when repeatedly inserting elements and when removing them one by one. Can n insertions indeed be performed in linear time? Match a function f(n) = a + b * n through the number of percolate ups, so that the relative deviation with the observed values is minimized.

  4. Consider the example program. In this program the percolates are implemented in a recursive fashion. Furthermore, the elements on the path are exchanged instead of using the more efficient idea to keep a free position, to shift the elements and finally to close the free position. Rewrite the procedures in a non-recursive version using this idea. Compare the time consumptions for n = 1,000,000.

  5. The underlying idea of buildheap is not limited to perfect binary trees. However, for that particular case an upper bound on the number of accessed nodes could be computed rather easily. In general by heapifying, we mean to allocate the keys in an arbitrary way to a tree with the desired structure, and then to work upwards level-by-level, performing percolate down on all nodes in the levels.
    1. Give the exact maximum value for the cost of heapifying a full perfect d-ary tree of depth k. The cost measure is the number of key values considered. Hint: first give the required expression for d = 2 and 3. Do not forget that when percolating down the keys of all children have to be considered.
    2. For the same number of nodes, is it cheaper or more expensive to heapify a d-ary tree for d > 2 than a binary tree?

  6. This question deals with the problem of how to build a binomial heap with n nodes for a given set of n elements.
    1. Estimate the time when performing n times an insert to an initially empty binomial heap.
    2. Consider a binomial tree of depth d. Assume that the heap property holds for all nodes except for the root. How can this tree be turned into a heap and how long does this take at most. Hint: be careful, the operation is more expensive than you might think at first.
    3. Suggest an algorithm for building a binomial heap which is analogous to the algorithm for building a binary heap in O(n) time.
    4. Show the most important stages of the above construction for the following set of 13 keys: 7, 98, 3, 5, 16, 15, 1, 17, 75, 22, 2, 23, 8.
    5. Prove an upper bound on the time consumption of your algorithm.

  7. How many merge operations are performed, when performing a sequence of insertions on a binomial forest changing the number of nodes from n_0 to n_1? Give an exact expression.

  8. Consider a sequence of n_i insertions and n_d deletemins on a binomial forest. Let n_0 be the initial number of nodes. Assume that the insertions and deletemins are randomly mixed. Uniformly select any of the insertions at random with probability 1 / n_i. Give a bound on the expected number of merges performed during this insertion.

  9. There are several ways to perform an increasekey operation. One possibility is to assign the new larger key value followed by a percolate down. How much does this increasekey algorithm cost for a binomial heap with n nodes? Describe a more efficient way of performing increasekey.

  10. Write a non-recursive version of the merge operation used in skew heaps. Integrate it in the program which can be downloaded here and compare the time consumption for 10,000, 100,000 and 1,000,000 insertions of random keys followed by the same number of deletemins. Draw a conclusion: recursion should definitely be avoided in any time-critical application / using recursion has no major impact on the running time.

  11. Give a sequence of n insertions to an initially empty skew heap which leads to a heap with a right path of length Omega(n). Try to come up with a simple general construction. If the length of a path is defined as the number of links to cross in order to travel from the first to the last node on the path, then it is actually possible to create a right path of length (n - 1) / 2 for all odd values of n, where n is the number of keys stored in the skew heap.

  12. We have shown that a decreasekey operation on a leftist heap with n nodes can be performed in O(log n) time. Adapt the proof of this theorem to show that on a skew heap with n nodes decreasekeys can be performed in O(log n) amortized time.

  13. Generalize the notion of leftist trees to trees of degree d. Describe how to merge two d-ary leftist heaps. How long can the right path of a d-ary leftist heap with n nodes be? Specify the time consumption of your merging algorithm.

  14. The leftist property depends on the null path lengths. In the analysis we argue about the size of the subtrees. The analysis of the skew heaps is more directly based on the sizes, using the notions of light and heavy edges. The topic of this question is this difference.

  15. Let M_r denote the minimum-sized Fibonacci tree with rank r.

  16. The definition of Fibonacci heaps can be relaxed. We can allow that a node may lose two children before it is cut from the tree. This reduces the number of cuts. At the same time, the size of the trees may decrease.

  17. Draw the structure of a pairing heap after the following sequence of operations: insert(7, 12), insert(3, 15), insert(5, 8), insert(2, 6), insert(9, 20), insert(0, 16), insert(8, 24), insert(4, 13), insert(1, 5), insert(6, 14). Here, the first value gives the index of the node, the second its key. Then perform decreaseKey(3, 10), decreaseKey(2, 4) and draw again. The format is the same. Finally draw the resulting heap after performing one deletemin operation.

  18. On binary heaps one can even perform increasekey operations on a heap with the minimum value in the root: simply perform a percolate down. On pairing heaps, this operation is more complicated: simply changing the key value of the concerned node h may violate the heap property. Percolating h to its correct position may be too costly. Give a high-level description of an increasekey operation, which is guaranteed to run in O(log n) amortized time. Take care of the special case that h is the root of the heap.

  19. Even on binary heaps we can obtain O(1) insert time by performing lazy inserts: the inserts to make are just entered in a list. Then, when a deletemin is performed, all accumulated inserts are performed in one stroke.

  20. The program implementing pairing heaps is based on a combination of arrays and linked lists. As it is, it uses 5 words of memory per node (ind, nxt, key, pos and chl). This can be improved. Using no linked lists, but a pure array-based structure we can save: four integers per node are enough.





Text

Alphabets and Strings

An alphabet A is a set of symbols. For convenience we will assume that the alphabet is finite. The symbols of an alphabet are called characters. A string is an ordered sequence of characters. The length of a string is the number of characters in the string. The length of a string S is denoted |S|. The empty string has no characters and is denoted epsilon.

For a string S = (s_0, ..., s_{n - 1}) of length n, a string Q = (q_0, ..., q_{k - 1}) with k <= n is called a substring of S if q_i = s_{i + a} for some constant a and all 0 <= i < k. Q is a prefix of S if q_i = s_i for all 0 <= i < k, and Q is a suffix of S if q_i = s_{i + n - k} for all 0 <= i < k. The empty string is a substring, prefix and suffix of any string S.

Finding Substrings

An important feature of any editor and browser is the possibility to search for a word or a sequence of several words in a text. In our terminology, this is the problem of finding a specified substring Q in a string S.

Trivial Algorithm

The following trivial algorithm is correct but inefficient: for each of the n characters of S check whether Q can start here:
  int findSubstring(char[] s, int n, char[] q, int k)
    // Returns the index of the first start of q[] 
    // in s[]. -1 is returned if q[] does not occur.
  {
    // i == n - k is the last position at which q[] can still start
    for (int i = 0; i <= n - k; i++)
    {
      int j = 0;
      while (j < k && q[j] == s[i + j])
        j++;
      if (j == k)
        return i;
    }
    return -1;
  }

The running time of this algorithm is bounded by O(n * k), and a running time proportional to the product of n and k actually occurs if S = (a, a, ..., a, a) and Q = (a, a, ..., a, b). Of course in most practical instances this will not happen, and if the elements of S and Q are randomly sampled, then the expected running time is O(n), because the expected time for each search is constant. If Q is long, then it may be profitable to add some special character as a sentinel to the end of Q. Doing this, the first test in the while loop is not needed anymore.

Optimal Algorithm

Due to its simplicity and good performance for all but very unlucky problem instances, the above algorithm might be acceptable in practice. Nevertheless, we would like to have a better algorithm. The algorithm by Knuth, Morris and Pratt is such an algorithm: it uses only O(n + k) time, which is clearly optimal. Somehow we already knew that this was a simple task: searching a word in a text is done extremely fast and we never noticed a dependency on the length of the word. It is important to notice that the time and memory consumption of this algorithm are independent of the size of the alphabet. There is an alternative algorithm by Boyer and Moore, which has the same worst-case performance but tends to be faster in practice. Both algorithms were proposed in 1977. In this chapter we only consider the first, which is the simpler of the two.

If we look at the method findSubstring, then we see that valuable information is wasted every time we increase i by 1 and set j to 0 again after a mismatch. Possibly, we could increase i by more or maybe there is no need to set j to 0. A mismatch of s[] and q[] for certain values i and j in findSubstring means that

q[l] == s[i + l], for all 0 <= l < j,
q[j] != s[i + j].
An increase of i is called a shift. A shift by d is an admissible shift, if
q[l] == s[i + d + l], for all 0 <= l < j - d.
A shift by j or more is always admissible, because then the above condition is void. Making an admissible shift over d > 1 positions saves in two ways: in the first place we have progressed by d positions; in the second place there is no need to test the first j - d positions again.

Lemma: If d is not an admissible shift, then the string given by q[] cannot start as a substring at position i + d of the string given by s[].

Proof: The fact that d is not an admissible shift implies that there is an l, 0 <= l < j - d, so that q[l] != s[i + d + l], but if q[] would start as a substring at position i + d of s[], then these two values should be the same. End.

A safe shift is a minimum admissible shift. The lemma guarantees that a safe shift can be made without possibly missing the string we are looking for. It is the largest shift with this property.

Consider s[] = (... a, a, a, b, a, a, a, b, a, a, a, c, ...) and q[] = (a, a, a, b, a, a, a, b, a, a, a, b). A mismatch is found for j = 11. Admissible shifts are 4, 8, 9, 10 and 11. The safe shift is 4.

Admissible Shifts
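The admissible shifts of the example can be verified with a few lines of code. The sketch below is not part of the original text and the class name AdmissibleShifts is hypothetical. It uses the fact that after a mismatch at position j we know that s[i + l] == q[l] for all 0 <= l < j, so the condition q[l] == s[i + d + l] can be tested on q[] alone.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch under stated assumptions: all names are illustrative.
final class AdmissibleShifts {
    // A shift by d after a mismatch at position j is admissible if
    // q[l] == s[i + d + l] for all 0 <= l < j - d. Because the first j
    // characters matched, s[i + d + l] equals q[d + l] in this range.
    static boolean isAdmissible(char[] q, int j, int d) {
        for (int l = 0; l < j - d; l++)
            if (q[l] != q[l + d])
                return false;
        return true;
    }

    // Returns all admissible shifts d, 1 <= d <= j, in increasing order.
    static List<Integer> admissibleShifts(char[] q, int j) {
        List<Integer> shifts = new ArrayList<>();
        for (int d = 1; d <= j; d++)
            if (isAdmissible(q, j, d))
                shifts.add(d);
        return shifts;
    }

    public static void main(String[] args) {
        char[] q = "aaabaaabaaab".toCharArray();
        System.out.println(admissibleShifts(q, 11)); // prints [4, 8, 9, 10, 11]
    }
}
```

The first element of the returned list is the safe shift. This brute-force test takes O(j^2) time for a single value of j, so it only serves to check small examples.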

The above immediately suggests the following improved algorithm:

  int findSubstring(char[] s, int n, char[] q, int k)
  {
    int i = 0;
    int j = 0;
    while (i <= n - k)
    {
      while (j < k && q[j] == s[i + j])
        j++;
      if (j == k)
        return i;
      if (j == 0)
        i++;
      else
      {
        int d = safeShift(j);
        i += d;
        j -= d;
      }
    }
    return -1;
  }

The correctness is clear because no solution is skipped and because a positive result is returned only when there is a complete match. Apart from the computation of the safe shift, the time consumption is also clear. There are two loops. The outer loop is executed at most n - k + 1 times. So, except for the time spent in the inner loop, the time consumption is bounded by O(n * time_of_safeShift). For the inner loop, it is helpful not to consider the value of j, but the value of i + j. Initially i + j = 0. Finally i + j <= n. In the inner loop i + j increases by 1, elsewhere it does not decrease. So, the body of the inner loop is executed at most n times, taking O(n) time in total. In the remainder of this section we describe how to preprocess q[] in O(k) time so that the safe shifts can be looked up in a table, which takes constant time.

An important observation is that if d is an admissible shift for certain values i and j, then (q[0], ..., q[j - d - 1]) is not only a prefix of (q[0], ..., q[j - 1]), which is trivial, but also a suffix. The latter follows because on the one hand we know that q[l] == s[i + l], for all 0 <= l < j, and on the other hand we know that q[l] == s[i + d + l], for all 0 <= l < j - d. The latter is equivalent to q[l - d] == s[i + l], for all d <= l < j. So, q[l] == q[l + d], for all 0 <= l < j - d. In general for a string Q, a border of Q is a string which is both a prefix and a suffix of Q. The empty string epsilon and Q itself are trivial borders of Q. We are only interested in borders different from Q. The longest such border is called the overlap of Q.

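Lemma: Let Q_j = (q[0], ..., q[j - 1]) and let overlap[j] denote the length of the overlap of Q_j. The lengths of the borders of Q_j other than Q_j itself, in decreasing order, are given by overlap[j], overlap[overlap[j]], ..., continuing until the value is 0.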

Proof: By definition the longest border has length overlap[j]. Let j' = overlap[j]. The next shorter border is a prefix and suffix of Q_j of length smaller than j'. But then it is actually a prefix and suffix of Q_{j'}, which is longer and which is a suffix and prefix of Q_j. So, it is a border of Q_{j'}, and because we are looking for the longest border shorter than j', it is the longest border of Q_{j'}, which has length overlap[j']. The general proof can be given using induction. End.

There is a one-to-one correspondence between admissible shifts and borders: above we have shown that for any admissible shift d there is a border of length j - d. This can also be reversed: for any border b of length |b|, a shift by d = j - |b| is admissible. So, the overlap corresponds to a minimum admissible shift, which is the safe shift we are looking for. The problem has now been reduced to computing the lengths of the overlaps of (q[0], ..., q[j - 1]), for all 1 <= j <= k - 1. Denote these values by overlap[j]. Once they have been computed, a table safeShift[] can be constructed in O(k) time: safeShift[j] = j - overlap[j]. For j = 0, the case that a mismatch occurs at the first character, safeShift[j] = 1.

Computing Overlaps

We consider a string Q = (q[0], ..., q[k - 1]) and want to compute the overlap B_j of all prefixes Q_j = (q[0], ..., q[j - 1]) of Q. An overlap is the longest true substring which is both a prefix and a suffix. Performing all tests separately takes O(k^3) time. However, with a clever computation order O(k) time is sufficient. This may be considered a special case of dynamic programming, which is discussed in detail in another chapter.

First we consider an example. Let Q = (q[0], ..., q[11]) = (a, a, a, b, a, a, a, b, a, a, a, b). Then we get the following values for Q_j, B_j and overlap[j] for 1 <= j <= 12:

Q_1  = (a)                                   B_1  = epsilon                   overlap[1]  = 0
Q_2  = (a, a)                                B_2  = (a)                       overlap[2]  = 1
Q_3  = (a, a, a)                             B_3  = (a, a)                    overlap[3]  = 2
Q_4  = (a, a, a, b)                          B_4  = epsilon                   overlap[4]  = 0
Q_5  = (a, a, a, b, a)                       B_5  = (a)                       overlap[5]  = 1
Q_6  = (a, a, a, b, a, a)                    B_6  = (a, a)                    overlap[6]  = 2
Q_7  = (a, a, a, b, a, a, a)                 B_7  = (a, a, a)                 overlap[7]  = 3
Q_8  = (a, a, a, b, a, a, a, b)              B_8  = (a, a, a, b)              overlap[8]  = 4
Q_9  = (a, a, a, b, a, a, a, b, a)           B_9  = (a, a, a, b, a)           overlap[9]  = 5
Q_10 = (a, a, a, b, a, a, a, b, a, a)        B_10 = (a, a, a, b, a, a)        overlap[10] = 6
Q_11 = (a, a, a, b, a, a, a, b, a, a, a)     B_11 = (a, a, a, b, a, a, a)     overlap[11] = 7
Q_12 = (a, a, a, b, a, a, a, b, a, a, a, b)  B_12 = (a, a, a, b, a, a, a, b)  overlap[12] = 8

The example shows that the values of overlap[] are not monotonic. We may also notice that for all j, 2 <= j <= 12, overlap[j] <= overlap[j - 1] + 1. We also observe that, in accordance with the above lemma, the lengths of all borders of Q_11 are the elements of the following sequence: (overlap[11], overlap[overlap[11]], ..., 0) = (7, 3, 2, 1, 0). These borders of Q_11 correspond to the admissible shifts (4, 8, 9, 10, 11).
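The values in the example can be reproduced by performing all tests separately, as mentioned above. The following sketch is not part of the original text, and the names BorderChain, overlapBruteForce and borderLengths are hypothetical. It computes overlap[j] by trying all candidate lengths and then follows the chain overlap[j], overlap[overlap[j]], ... from the lemma.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch under stated assumptions: all names are illustrative.
final class BorderChain {
    // Length of the longest border of Q_j = (q[0], ..., q[j - 1]) that is
    // different from Q_j, found by testing every candidate length.
    // Takes O(j^2) time, so O(k^3) when applied to all prefixes.
    static int overlapBruteForce(char[] q, int j) {
        for (int len = j - 1; len > 0; len--) {
            boolean border = true;
            for (int l = 0; l < len && border; l++)
                border = q[l] == q[j - len + l];
            if (border)
                return len;
        }
        return 0; // only the empty string is left
    }

    // Lengths of all borders of Q_j other than Q_j itself, in decreasing
    // order, obtained by following the chain of overlaps. Requires j >= 1.
    static List<Integer> borderLengths(char[] q, int j) {
        List<Integer> lengths = new ArrayList<>();
        int l = overlapBruteForce(q, j);
        while (l > 0) {
            lengths.add(l);
            l = overlapBruteForce(q, l);
        }
        lengths.add(0); // the empty string is always a border
        return lengths;
    }

    public static void main(String[] args) {
        char[] q = "aaabaaabaaab".toCharArray();
        for (int j = 1; j <= q.length; j++)
            System.out.println("overlap[" + j + "] = " + overlapBruteForce(q, j));
        System.out.println(borderLengths(q, 11)); // prints [7, 3, 2, 1, 0]
    }
}
```

Subtracting each of the printed border lengths from j = 11 gives the admissible shifts 4, 8, 9, 10 and 11.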

Idea

The idea is to compute the values overlap[j] one after the other, starting with j = 1. In the computation of overlap[j], the value of overlap[j - 1] is used. The following lemma plays a central role:

Lemma: (q[0], ..., q[l - 1]), 1 <= l < j, is a border of Q_j if and only if (q[0], ..., q[l - 2]), is a border of Q_{j - 1} and q[j - 1] == q[l - 1].

Proof: If (q[0], ..., q[l - 1]) is a prefix and suffix of Q_j, then (q[0], ..., q[l - 2]) is a prefix and suffix of Q_{j - 1}. The fact that (q[0], ..., q[l - 1]) is a suffix of Q_j in particular implies that q[l - 1] = q[j - 1]. On the other hand, if (q[0], ..., q[l - 2]) is a suffix of Q_{j - 1}, then q[i] = q[j - l + i], for all 0 <= i < l - 1. If in addition q[l - 1] == q[j - 1], then (q[0], ..., q[l - 1]) is a suffix of Q_j of which it is trivially a prefix as well. End.

Corollary: If B'_j is a border of Q_j, then either B'_j = epsilon, or B'_j = B'_{j - 1} + q[j - 1] for some border B'_{j - 1} of Q_{j - 1} (here "+" adds a character to a string).

overlap[1] = 0, because the empty string is a prefix and a suffix of Q_1 and because the overlap must be a true substring. The corollary, together with the earlier lemma stating that the lengths of all borders of Q_{j - 1} are given by the sequence overlap[j - 1], overlap[overlap[j - 1]], ..., implies that overlap[j] = l + 1, where l = overlap[ ... [overlap[j - 1]] ... ] is the largest number in this sequence so that q[l] == q[j - 1]. If there is no such l, then overlap[j] = 0.

Implementation

The above ideas lead to the following simple implementation:
  void computeBorder(int k, char[] q, int[] overlap)
  // Method is called for k >= 1
  // q[] contains the string
  // overlap[] is allocated by the caller with at least max(k, 2) entries
  {
    overlap[0] =  -1; // Sentinel, saves some tests
    overlap[1] =   0;
    for (int j = 2; j < k; j++)
    {
      int l = overlap[j - 1];
      while (l >= 0 && q[l] != q[j - 1])
        l = overlap[l];
      overlap[j] = l + 1;
    }
  }
The complexity of this algorithm is bounded by O(k^2): because for all l, 0 <= l < k, overlap[l] < l, the inner loop is executed at most j times. Considering that k is mostly smaller than sqrt{n}, O(k^2) will rarely dominate the total time consumption, but here we strive for an optimal implementation.

Actually the given implementation runs in O(k), however, this is not so easy to see. Rewriting the loops facilitates the analysis:

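  void computeBorder(int k, char[] q, int[] overlap)
  {
    overlap[0] =  -1; // Sentinel, saves some tests
    overlap[1] =   0;
    int l = 0;        // At the start of each iteration l == overlap[j - 1]
    for (int j = 2; j < k; j++)
    {
      while (l >= 0 && q[l] != q[j - 1])
        l = overlap[l];
      l++;
      overlap[j] = l;
    }
  }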
The for loop is executed at most k - 2 times. The value of l is initially 0 and it is at least 0 after every pass of the for loop. Each time the for loop is executed l is increased by 1 and each time the body of the while loop is executed l is decreased by at least 1. It follows that the body of the while loop is executed at most k - 2 times in total. Thus, the condition of the while loop is tested at most 2 * k - 4 times, and the whole precomputation takes at most O(k) time.

Theorem: Determining whether a string Q of length k is a substring of a string S of length n can be performed in O(n) time after a precomputation phase which can be performed in O(k) time. The algorithm requires O(k) additional memory.
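Putting the pieces together gives the complete algorithm. The following is a minimal sketch and not the course program: the class name KmpSearch and the method computeOverlap are hypothetical, and the array lengths are taken from the arrays themselves instead of being passed as n and k. The safe shift is not stored in a separate table but computed on the spot as j - overlap[j].

```java
// Sketch under stated assumptions: all names are illustrative.
final class KmpSearch {
    // overlap[j] is the length of the longest border of (q[0], ..., q[j - 1])
    // different from this prefix itself, for 1 <= j < k. overlap[0] = -1 is
    // a sentinel. Runs in O(k) time.
    static int[] computeOverlap(char[] q) {
        int k = q.length;
        int[] overlap = new int[Math.max(k, 2)];
        overlap[0] = -1;
        overlap[1] = 0;
        int l = 0; // l == overlap[j - 1] at the start of each iteration
        for (int j = 2; j < k; j++) {
            while (l >= 0 && q[l] != q[j - 1])
                l = overlap[l];
            l++;
            overlap[j] = l;
        }
        return overlap;
    }

    // Returns the index of the first start of q[] in s[], or -1 if q[] does
    // not occur. Runs in O(n + k) time.
    static int findSubstring(char[] s, char[] q) {
        int n = s.length, k = q.length;
        int[] overlap = computeOverlap(q);
        int i = 0, j = 0;
        while (i <= n - k) {
            while (j < k && q[j] == s[i + j])
                j++;
            if (j == k)
                return i;
            int b = overlap[j];  // -1 if the mismatch is at the first character
            i += j - b;          // safe shift: j - overlap[j], which is 1 for j == 0
            j = Math.max(b, 0);  // the first b characters need not be tested again
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] s = "aaabaaabaaacaaabaaabaaab".toCharArray();
        char[] q = "aaabaaabaaab".toCharArray();
        System.out.println(findSubstring(s, q)); // prints 12
    }
}
```

In the example the first mismatch occurs for i = 0 and j = 11. The safe shifts 4, 4, 1, 1, 1 and 1 then move i to position 12 without any character of s[] before position 11 being inspected again.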

Tries

Tries are a special kind of search trees, particularly useful for storing words of a text. The name "trie" comes from retrieval, but is usually pronounced like "try".

In a trie the information is stored in the leaves. We first consider tries in which the maximum degree of the nodes is given by the size of the alphabet |A|. Using arrays of length |A| for the set of children, the characters of the word we are searching can be used to jump to the correct child in constant time. So, at any level direct addressing is used. If at any level i of the trie there is no entry for which character i + 1 of the word we are searching is c, then there is no node at level i + 1 with label c. Thus, when searching a word, hitting a null link before reaching the end of the word indicates that the word we are looking for is not an entry of the trie. If for a word of length l, we reach a leaf after l steps, then we have found the word.

The above description of a trie assumes that there are no words which are prefixes of other words. This is guaranteed if all words have the same length and are different. If words may be prefixes of other words, then there must be a way to recognize this. One solution is to add a marker to the nodes indicating whether they correspond to the end of a word. However, this violates the property that all information is stored in the leaves, and would add a third kind of node. Therefore, it is a good idea to add a special end-of-word symbol as a sentinel to each word. Without having to distinguish cases, this guarantees that a word end is only reached in a leaf.

Trie with 13 Entries
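A trie with direct addressing and an end-of-word sentinel can be sketched as follows. This is an illustration under assumptions that are not in the original text: the class name ArrayTrie is hypothetical, the alphabet is restricted to the lower-case letters 'a' to 'z', and the end-of-word symbol is given index 26.

```java
// Sketch under stated assumptions: all names are illustrative and the
// words may only contain the characters 'a' to 'z'.
final class ArrayTrie {
    private static final int END = 26;    // index of the end-of-word symbol
    private static final int DEGREE = 27; // 'a' .. 'z' plus end-of-word

    private static final class Node {
        // Direct addressing: one slot per symbol. For simplicity leaves
        // also get such an array, which stays empty.
        final Node[] children = new Node[DEGREE];
    }

    private final Node root = new Node();

    // Symbol number i of the word, where position word.length() holds the
    // sentinel that is virtually appended to every word.
    private static int symbol(String word, int i) {
        return i == word.length() ? END : word.charAt(i) - 'a';
    }

    void insert(String word) {
        Node u = root;
        for (int i = 0; i <= word.length(); i++) {
            int c = symbol(word, i);
            if (u.children[c] == null)
                u.children[c] = new Node();
            u = u.children[c];
        }
    }

    // Hitting a null link means that the word is not an entry of the trie.
    boolean contains(String word) {
        Node u = root;
        for (int i = 0; i <= word.length(); i++) {
            u = u.children[symbol(word, i)];
            if (u == null)
                return false;
        }
        return true; // reached the leaf behind the end-of-word symbol
    }

    public static void main(String[] args) {
        ArrayTrie trie = new ArrayTrie();
        trie.insert("trie");
        trie.insert("tries");
        trie.insert("try");
        System.out.println(trie.contains("trie")); // prints true
        System.out.println(trie.contains("tri"));  // prints false
    }
}
```

Because of the sentinel, "trie" and "tries" can be stored together without a marker in the internal nodes: the search for "tri" fails because the node reached after three characters has no child for the end-of-word symbol.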

Tries have several advantages: all operations are simple and natural. The search time for short words, which are most common, is small. Using a complete dictionary of English for spellchecking purposes, the average search time is given by the average length of the words in the text. This value probably lies below 5 for most texts. If we would use a balanced binary search tree, the average depth of the entries would lie around 15. Furthermore, we would have to compare words and not characters at each level.

Tries are an example of a radix search structure. In radix search, the search key is not taken as a whole and compared with some key stored in the nodes of a tree, but rather it is considered piece-by-piece. When using keys like strings, which cannot be compared in constant time, but have to be compared in several steps anyway, this is a logical choice. The name radix search is inspired by the analogy with radix sort, where the numbers are also not considered as a whole, but digit-by-digit for digits given in a number system with some radix. The simplest form of radix search is to consider the search key bit-by-bit. In that case each node is binary. There is no need to compare: if the bit is 0, we go left, else we go right. Just as bucket sort and radix sort are based on direct addressing and work without comparisons, radix search structures can be used to find an element without comparing keys.

The major disadvantage of tries as presented is that they use much memory: the direct addressing mechanism only works if the children are implemented with an array of length |A|, even if the actual number of children is far smaller. If the degree of the trie is 26, the above problem may not be that serious. But for larger alphabets, of size 256 or 65536, it is undesirable to use 1 KB or more for each node, even if the actual degree is low.

A possible solution is to maintain for the children of each node u a hash table whose size is proportional to the actual degree of u. Under insertions and deletions, the table is doubled whenever it becomes too small (the number of children is more than 60% of the size of the table) and halved whenever it becomes too large (the number of children is less than 20% of the size of the table). In this way the memory usage is asymptotically optimal, while a leaf at depth l is still reached in O(l) time in most cases. At the same time this ensures that the amortized time for restructuring is constant for each accessed node. So, even when taking the restructuring into account, the amortized time to search/insert/delete a word with l characters is O(l).

In general hash tables and direct addressing perform worse than one might hope, because each access implies a cache miss. If the data structure is large, cache misses cannot be prevented. However, when using hash tables for accessing the children of the nodes of a tree of bounded degree, the access is not essentially different from a normal tree search: we may reasonably assume that the top levels of the structure reside in cache, and that only the few accesses at the deepest levels are expensive. At the same time these one-character hash functions can be very simple: a simple modulo computation will be good enough. If instead we hashed whole words, the hash function would have to be chosen much more carefully to prevent clustering. Furthermore, when hashing the whole word S, S may have to be compared with several words in the array, which may be costly. Only some kind of radix search guarantees that we are not processing the same character many times.

Using nodes with (virtual) degree equal to the size of the alphabet is natural, but may lead either to excessive memory usage or to a complication of the structure. It may therefore be better to consider the words as bit strings rather than as character strings. Of course, this multiplies the search time by log |A| for an alphabet of size |A|. On the other hand, every node now has constant size.

Whether using binary keys or characters, it is still not true that the number of nodes is proportional to the number of words stored in the dictionary, because there may be many nodes of degree 1. For example, after "charac" the only possible continuation may be "ter". In patricia tries, this phenomenon is taken into account: nodes of degree 1 are contracted. In a graphical representation of patricia tries, substrings are written along the edges. Because in a patricia trie all internal nodes have degree at least 2, the number of internal nodes is at most n - 1 for a trie with n leaves, that is, for a dictionary in which n words are stored.

Patricia Trie with 13 Entries

Finding Substrings Repeatedly

Basic Ideas

Searching for key words in a set of previously processed words can be solved by any of the search-tree methods presented before, including tries. These dictionary data structures allow for updates: the time for inserts and deletes is comparable to that of searching. However, a body of characters cannot always be interpreted as a set of key words. Particularly in bioinformatics contexts, it is rather to be viewed as a single long string, and searching for specified substrings is one of the common problems. Above we have seen how this problem can be solved by the Knuth-Morris-Pratt algorithm in O(k + n) time, where k is the length of the substring and n the length of the text. If there is only a single search to perform, this is certainly the best achievable, but for repeated searches, one may want to do better.

One possibility is to perform a batched search. If we want to search for a set of strings Q_i, 0 <= i < m, in a string S, then the strings can be preprocessed by constructing their lexicographical search tree. Then at any position of S we can start running through the tree guided by the characters in S. When exiting from the tree before hitting a leaf, we can proceed to the next starting position in S. The worst-case performance of this algorithm is O(k_max * n), where k_max = max_i{|Q_i|}. If the string S is random, then for an alphabet with two characters (bit strings), the expected running time is O(log m * n).

The strong point of the above approach is that it is simple and that it is still possible to change the text S. On the other hand, it cannot always be assumed that the searches are offered as one batch. If we want to perform frequent single searches, then the above is no solution. A central idea in computer science is to reduce the cost of repeated operations by preprocessing. In general these preprocessing operations may have rather high complexity and may require extra storage, because in most cases, in the real application they will be only applied for small subproblems. Often certain small substructures are preprocessed, serving as bottom of a recursive algorithm, thus allowing a gain by a log- or loglog-factor. Sometimes also, the preprocessing is applied to the whole problem, and then the computational and storage complexity is more important. Such a case we encounter here.

The simplest idea is to extract all O(n^2) substrings of S and to enter them in a trie or another suitable search structure. This allows searching for any string Q of length k in O(k) time if a trie is used, and in O(k * log n) time with a balanced search tree, which needs O(log n) string comparisons of up to k characters each. However, this requires Omega(n^3) storage and time. Considering that n is large, this is far too much. In the following we present a better idea which still allows searching in O(k) time, but reduces the preprocessing time and storage to O(n^2), which is somewhat more acceptable. A refinement will bring memory and time consumption down to O(n), leaving nothing to be desired.

Suffix Tries

The main idea is that we do not need to store all substrings, but only all suffixes of S in a trie. If the string Q = (q_0, ..., q_{k - 1}) equals (s_i, ..., s_{i + k - 1}) for some i, then Q is found as a prefix of the suffix (s_i, ..., s_{n - 1}). S has only n suffixes and the sum of their lengths is O(n^2). Starting with the longest suffix, all of S, it is no problem to construct the trie in O(n^2) time.
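The quadratic construction and the O(k) search can be sketched as follows. This is a minimal illustration, not the improved construction discussed later: it assumes 7-bit ASCII input, ignores allocation failures, and uses the hypothetical names SNode, suffix_trie_build and suffix_trie_contains.

```c
#include <stdlib.h>
#include <string.h>

#define SIGMA 128   /* 7-bit ASCII; an assumption for this sketch */

typedef struct SNode {
    struct SNode *child[SIGMA];
} SNode;

static SNode *snode_new(void) { return calloc(1, sizeof(SNode)); }

/* Insert all n suffixes of s one by one: O(n^2) time and space. */
static SNode *suffix_trie_build(const char *s) {
    SNode *root = snode_new();
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++) {          /* suffix starting at position i */
        SNode *u = root;
        for (size_t j = i; j < n; j++) {
            unsigned char c = (unsigned char)s[j];
            if (!u->child[c]) u->child[c] = snode_new();
            u = u->child[c];
        }
    }
    return root;
}

/* q occurs in s iff q is a prefix of some suffix: follow k = |q| links. */
static int suffix_trie_contains(const SNode *root, const char *q) {
    const SNode *u = root;
    for (; *q && u; q++) u = u->child[(unsigned char)*q];
    return u != NULL;
}
```

For the string abbabbababbaa, for example, the query "babab" succeeds after five steps, while "aaa" leaves the trie at the third character.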

Suffix Trie for abbabbababbaa

We now describe a construction which in itself is asymptotically not faster than simply inserting the suffixes one-by-one in a trie, but which can be improved to run in O(n) time. It is assumed that S ends with a special character '$' which otherwise does not occur in the text, so no suffix is a prefix of any other suffix. In the following T^(i), 0 <= i <= n, denotes the trie containing all suffixes of S^(i) = (s_0, ..., s_{i - 1}), the prefix of S of length i. T^(0) consists of a single node. T^(i + 1) is constructed out of T^(i). T^(i) is accessed at the leaf corresponding to the substring S^(i), either directly using an additional pointer or by walking down the path from the root. This node is given a new child which is reached by a link with label s_i. Then the nodes corresponding to all other suffixes of S^(i) are accessed in order of decreasing length: if the node has no link with label s_i, it is processed as before, adding a new child reached by a link with label s_i; otherwise nothing has to be done.

Efficiently visiting the nodes of all suffixes of S^(i) in T^(i) in order of decreasing length requires that there are additional cross links from the node corresponding to (s_j, ..., s_{i - 1}) to the node corresponding to (s_{j + 1}, ..., s_{i - 1}), for all 0 <= j <= i - 2. In this way each node can be accessed in O(1) time. Creating the new node together with the cross link takes O(1) as well. So, T^(i + 1) can be constructed out of T^(i) in O(i) time. Thus, the entire construction takes O(n^2) time.

Actually, as soon as the first node is reached which has a link with label s_i, there is no need to visit any further node, because they all will have such a link. This can be seen as follows: if (s_0, ..., s_{i - 1}) contains a substring which is identical to (s_j, ..., s_i), for some j > 0, then it also contains a substring which is identical to (s_{j + 1}, ..., s_i). If we are keeping the cross links for all nodes, these are already pointing to the correct nodes. If, on the other hand, the cross links are maintained only for the nodes corresponding to the suffixes, all of them have to be updated.

Suffix Trie Construction

The above construction does not only require Omega(n^2) time, but also creates a trie of quadratic size. This size can easily be reduced by using the patricia idea: all nodes of degree 1 are contracted. Because there are at most n leaves, this gives a trie with at most n - 1 internal nodes, so there are at most 2 * n - 1 nodes. Now it is not a good idea to write the substrings along the links, because then we would not gain anything: if there are eps * n^2 links in the original trie, for some eps > 0, then there would still be eps * n^2 characters to store in the patricia trie. However, because all of these substrings are of the form (s_l, ..., s_h), it suffices to store two integers, the numbers l and h, for each of them.

Patricia Suffix Trie

The final improvement is to construct this trie of size O(n) without making a quadratic detour. This is quite an elaborate construction and not treated here. It is important to know that it can be done, because suffix tries are typically used for large n values for which an O(n^2) running time would be too large. Summarizing we get

Theorem: For a text S of length n a suffix trie of size O(n) can be constructed in O(n) time.

Data Compression

Introduction

In most texts certain characters are much more frequent than others. Not only the vowels a and e, but also blanks, digits and end-of-line symbols are often frequent. In standard encoding, for each character the same amount of storage is used: 8 or 16 bits. In the context of data transmission and long-term storage, it makes sense to consider more efficient ways of encoding data. In the following we only speak of characters, but one might also apply the same techniques to pairs of characters. A simple technique which exploits the non-uniformity of the character frequencies is the Huffman encoding. On typical texts it gives savings from 25 up to 60% on sufficiently large files. However, frequency non-uniformity is not the only aspect that can be exploited for data compression. If certain characters locally cluster, then the move-to-front technique can be used to turn the distribution non-uniformity into a frequency non-uniformity. The Burrows-Wheeler encoding might be used to enhance the clustering of characters. Other familiar data-compression techniques, which will not be discussed here, are those by Lempel-Ziv and Lempel-Ziv-Welch.

All of the compression techniques considered in this section are loss-free, that is, they have an inverse which allows reconstructing the original string exactly as it was. For texts this appears a natural thing to require, but for images, there is no need to be able to reconstruct features that cannot be distinguished anyway.

For any loss-free data compression strategy, there is at least one string which cannot be compressed at all: if all strings S of length n over an alphabet A with a characters could be compressed, then we would have an invertible mapping from a set with a^n elements to the set of all strings of length less than n, which has fewer than a^n elements. This is impossible, because such a mapping cannot be injective. In general data compression is even harmful: giving some strings a shorter representation will make other strings longer, to an extent that the sum of all lengths increases. This shows that in general it makes no sense to apply data compression techniques to purely random strings: somewhat paradoxically, no string is harder to describe than a random string.
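The counting behind this argument can be written out explicitly. The following is a worked version under the assumption that the alphabet has a >= 2 characters and that the compressed strings are written over the same alphabet:

```latex
\[
  \#\{\text{strings of length} < n\}
  \;=\; \sum_{i=0}^{n-1} a^{i}
  \;=\; \frac{a^{n}-1}{a-1}
  \;<\; a^{n}
  \;=\; \#\{\text{strings of length } n\}
  \qquad (a \ge 2).
\]
```

For bit strings (a = 2) there are 2^n - 1 strings shorter than n, exactly one too few to give each of the 2^n strings of length n its own shorter code.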

The compression techniques discussed in this section can be implemented in time O(|A| + |S|) or slightly more. For encoding strategies with higher complexity, it is always an option to subdivide S into several shorter strings and to encode each of them separately. Of course, this gives a weaker data compression, but may nevertheless be better than applying a less sophisticated compression on all of S. After the above discussion, it should be clear that any data compression technique can at best have experimental quality in the sense "on most texts in English the technique gives a reduction of 40%" or the like. One should also be aware that a certain compression is trivial to achieve: most texts consist of fewer than 128 different characters, so these can be rewritten with 7 bits per character.

Huffman Encoding

The Huffman encoding is performed by constructing a full binary tree. The leaves of the tree correspond to the characters. A left branch in the tree gives a 0 in the code, a right branch gives a 1. Thus, the length of the code of a character c is given by the depth of the leaf corresponding to c. For an alphabet A and a string S, the task is to determine a tree T so that the following cost function is minimized:
cost(T, A, S) = sum_{c in A} f_c(S) * d_c(T).
Here f_c(S) denotes the number of occurrences of character c in S and d_c(T) the depth in T of the leaf corresponding to c.

The numbers f_c(S) are determined by traversing the text and counting. For a small alphabet, this is most easily done by using an array with one position for each character. If the alphabet is large, and we suspect that most possible letters do not occur, then this may be too costly. A large "alphabet" occurs in a natural way if we consider groups of several letters. In that case any data structure which supports the operations insert and find can be used, for example a hash table.

Once the numbers f_c(S) have been determined, the tree is constructed by starting with a forest of single nodes. There is one node for each character that occurs in the text. These nodes have weights f_c(S). Each of these nodes can be viewed to be the root of a tree. The algorithm works by repeatedly connecting a pair of roots by a new root node. At any time the two roots are selected which have smallest weights. The weight of the new root is the sum of the weights of the connected roots. This algorithm can be classified as a greedy algorithm in the following sense:

Huffman Encoding

The sketched algorithm is simple and efficient. It is not a good idea to first sort all weights, because later new weights have to be handled as well. The appropriate data structure is a priority queue. Initially the a = |A| letters of the alphabet are inserted with their weights as keys. Alternatively, the operation buildheap can be called for them. Then repeatedly two elements are removed and one gets reinserted. So, after a - 1 of these operations there is only one element left. In total 2 * (a - 1) deletemins and a - 1 inserts are performed for a total time of O(a * log a), when using a binary heap.
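The construction with a priority queue can be sketched in C as follows. The sketch only computes the code lengths d_c and the resulting cost, not the bit patterns. It assumes a byte alphabet of 256 symbols and uses a hand-written binary heap; all names (huffman_lengths, heap_push, heap_pop and the static arrays) are hypothetical, and ties between equal weights are broken arbitrarily.

```c
#include <stddef.h>

#define MAXSYM 256                     /* byte alphabet; an assumption */

static unsigned long weight[2 * MAXSYM];   /* weight of every tree node */
static int parent[2 * MAXSYM];             /* parent[v] = -1 for a root */
static int heap[2 * MAXSYM], heap_size;    /* min-heap of node indices */

static void heap_push(int v) {             /* insert */
    int i = heap_size++;
    while (i > 0 && weight[heap[(i - 1) / 2]] > weight[v]) {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = v;
}

static int heap_pop(void) {                /* deletemin */
    int top = heap[0], last = heap[--heap_size], i = 0;
    for (;;) {
        int c = 2 * i + 1;
        if (c >= heap_size) break;
        if (c + 1 < heap_size && weight[heap[c + 1]] < weight[heap[c]]) c++;
        if (weight[heap[c]] >= weight[last]) break;
        heap[i] = heap[c];
        i = c;
    }
    heap[i] = last;
    return top;
}

/* Fills len[c] with the code length of every symbol occurring in text and
   returns cost(T) = sum of f_c * d_c, the size of the encoded text in bits.
   A text with a single distinct symbol yields length 0 in this sketch. */
static unsigned long huffman_lengths(const unsigned char *text, size_t n,
                                     int len[MAXSYM]) {
    unsigned long freq[MAXSYM] = {0}, cost = 0;
    int leaf[MAXSYM], nodes = 0;
    for (size_t i = 0; i < n; i++) freq[text[i]]++;   /* first pass: count */
    heap_size = 0;
    for (int c = 0; c < MAXSYM; c++) {
        len[c] = 0;
        leaf[c] = -1;
        if (freq[c]) {                     /* one single-node tree per symbol */
            leaf[c] = nodes;
            weight[nodes] = freq[c];
            parent[nodes] = -1;
            heap_push(nodes++);
        }
    }
    while (heap_size > 1) {                /* a - 1 merges of the two lightest roots */
        int u = heap_pop(), v = heap_pop();
        weight[nodes] = weight[u] + weight[v];
        parent[nodes] = -1;
        parent[u] = parent[v] = nodes;
        heap_push(nodes++);
    }
    for (int c = 0; c < MAXSYM; c++) {     /* depth of each leaf = code length */
        if (leaf[c] < 0) continue;
        for (int v = leaf[c]; parent[v] >= 0; v = parent[v]) len[c]++;
        cost += freq[c] * (unsigned long)len[c];
    }
    return cost;
}
```

For the text "abracadabra", with frequencies 5, 2, 2, 1 and 1, the merges create roots of weight 2, 4, 6 and 11, so the encoded text takes 23 bits.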

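More interesting is the question whether this code is optimal. The proof is based on the following observations:

  1. In an optimal tree, a less frequent character never lies at smaller depth than a more frequent one: if f_{c_1} < f_{c_2}, then d_{c_1} >= d_{c_2}.
  2. In an optimal tree, no node has degree one: every internal node has two children.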

If the first observation were not true, then a cost reduction could be obtained by exchanging the nodes corresponding to c_1 and c_2 in the tree. If any node had degree one, contracting it would give a cost reduction. From these observations we get

Lemma: For any alphabet A there is an optimal encoding so that the nodes of the two least frequent characters lie at the deepest level and are children of the same parent node.

This lemma tells us that the first step of the construction was no mistake.

Theorem: The greedily constructed tree leads to the smallest possible encoding using a fixed code for each character.

Proof: The constructed tree is optimal for |A| <= 2. So, inductively we may assume that the constructed tree is optimal for any alphabet of size |A| - 1. For the alphabet A, let A' denote the alphabet in which the two least frequent characters c_1 and c_2 are replaced by a single metacharacter c' with frequency f_{c'} = f_{c_1} + f_{c_2}. Let T be the tree constructed by the algorithm for A, and let T' be the tree constructed for A'. The greedy construction is such that T and T' are identical, except that the leaf of c' in T' is replaced in T by an internal node with the leaves of c_1 and c_2 as children. So, cost(T) = cost(T') + f_{c'}. Let T_opt be an optimal tree for A. By the lemma above we may assume that the nodes of c_1 and c_2 are children of the same parent. So, from T_opt we can obtain a tree T'_opt for A' by replacing the subtree with c_1 and c_2 by a single node corresponding to c'. The construction of T'_opt from T_opt implies that cost(T_opt) = cost(T'_opt) + f_{c'}. Because according to our induction assumption T' is optimal for A', we get

    cost(T)  = cost(T') + f_{c'} 
            <= cost(T'_opt) + f_{c'}
             = cost(T_opt).
  
So, T must be optimal as well. End.

For small texts, this type of encoding does not bring profit, because the encoding itself must also be transmitted or stored. The size of the encoding is O(|A|). Another point of consideration is that the text must be processed twice: once to count the numbers f_c, then to perform the actual encoding. In other words, Huffman encoding is a two-pass technique. Some of the other text compressions are one-pass techniques. If the text is a large database, performing two passes means an undesirable doubling of the running time. A practical solution is to base the encoding on a sample of the whole text. This sample should not just consist of the first section, because this need not be representative (think of mathematical books, which mostly start with a foreword containing very little mathematical notation). Checking in total 100 stretches each consisting of 1/10000 of the text will mostly do. If some rare symbols are not sampled this way, they can always be added to the table with some special codes (for this one leaf, for example 1, ..., 1, should always be kept free).

Move-to-Front Encoding

In the Huffman encoding, a character is replaced by numbers using a fixed mapping. Once computed, the code of a character can be looked-up in a table of size |A|. A different approach is to use encoding numbers which dynamically develop as a result of the characters encountered so far. Initially the characters are arranged in some linear order, for example alphabetically. The characters are inserted in a linear list, with one character in each node of the list. The rank of a character is the distance from the first node of the list, so the first character has rank 0. The text is traversed, and the current rank of a character c is used as its code. In addition to generating the code, upon encountering c the node of c is cut from the list and reinserted at its beginning. This is called the move-to-front encoding.

The encoding is simple. How about decoding? This is just as simple. Again we start with the characters in alphabetical order. The numbers of the encoding describe which operations have been performed at the list, so at any time we know how the list looks, and the characters of the decoding are those that stand in the first position after performing an operation.
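Encoding and decoding can be sketched with a plain array in place of the linked list. This is the simplest possible version, assuming a byte alphabet of 256 symbols in numerical initial order; each character costs time proportional to its rank, so the sketch does not address efficiency. The names mtf_encode and mtf_decode are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define ALPHA 256   /* byte alphabet; an assumption of this sketch */

/* Encode: out[i] is the current rank of in[i]; the symbol then moves to the front. */
static void mtf_encode(const unsigned char *in, size_t n, unsigned char *out) {
    unsigned char list[ALPHA];
    for (int c = 0; c < ALPHA; c++) list[c] = (unsigned char)c;   /* initial order */
    for (size_t i = 0; i < n; i++) {
        int r = 0;
        while (list[r] != in[i]) r++;          /* linear search for the rank */
        out[i] = (unsigned char)r;
        memmove(list + 1, list, (size_t)r);    /* shift the r symbols in front of it */
        list[0] = in[i];
    }
}

/* Decode: replay the same list operations, driven by the ranks. */
static void mtf_decode(const unsigned char *in, size_t n, unsigned char *out) {
    unsigned char list[ALPHA];
    for (int c = 0; c < ALPHA; c++) list[c] = (unsigned char)c;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = list[in[i]];         /* symbol currently at that rank */
        out[i] = c;
        memmove(list + 1, list, (size_t)in[i]);
        list[0] = c;
    }
}
```

With ASCII input, "aaabbb" is encoded as 97, 0, 0, 98, 0, 0: every repetition of a character yields rank 0.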

Move-to-Front Encoding

Computationally it is slightly less simple. Using an additional array with one pointer for every character of the alphabet, it is no problem to find an element in the list in constant time. But it is not easy to compute the ranks. The easiest is to work with a binary search tree instead of a linked list. The key of the tree node corresponding to a character c is the position of the latest occurrence of c in the text. If the order in the tree is reversed (so that for any node all nodes in its left subtree have larger and all nodes in its right subtree smaller keys), then the rank of c is given by the inorder number of the corresponding node. Using an additional array with pointers to the nodes in the tree, the node corresponding to c can be located in constant time. Storing in each node the size of its subtree, the inorder numbers can be computed in time bounded by the depth of the tree. Thus, if the tree is kept balanced, all operations can be performed in O(log |A|) time for a tree with |A| nodes. This allows encoding a text string S over an alphabet A in O(|S| * log |A|) time. The decoding can be performed in the same time.

Move-to-Front Implementation

A faster encoding can be obtained by modifying the move-to-front rule. Instead of moving an encountered character to the head of the list, we can also move it one position towards the head of the list. If there are pointers to the nodes in the list, and if the list has links in both directions, then such a swap of two characters can easily be performed in constant time, including the time for updating their ranks. This gentle variant is less adaptive than the radical variant; on the other hand, it is less sensitive to incidental occurrences of characters in a context which is dominated by a small subset of other characters. Thus, because there is no clear superiority of the radical move-to-front rule, it appears natural to first try this gentle rule, which allows encoding and decoding a string S in O(|S|) time.
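The gentle rule can be sketched with two arrays instead of a doubly linked list: one mapping ranks to symbols and one mapping symbols to ranks. This representation is an assumption of the sketch (the text describes a linked list with pointers), as are the byte alphabet and the names Gentle, gentle_promote, gentle_encode and gentle_decode.

```c
#include <stddef.h>

#define ALPHA 256   /* byte alphabet; an assumption of this sketch */

typedef struct {
    unsigned char list[ALPHA];   /* list[r] = symbol with rank r */
    unsigned char rank[ALPHA];   /* rank[c] = current rank of symbol c */
} Gentle;

static void gentle_init(Gentle *g) {
    for (int c = 0; c < ALPHA; c++) g->list[c] = g->rank[c] = (unsigned char)c;
}

/* Swap the symbol at rank r with its predecessor (no-op at the head): O(1). */
static void gentle_promote(Gentle *g, int r) {
    if (r == 0) return;
    unsigned char c = g->list[r], d = g->list[r - 1];
    g->list[r - 1] = c; g->rank[c] = (unsigned char)(r - 1);
    g->list[r] = d;     g->rank[d] = (unsigned char)r;
}

/* Each character is handled in constant time, so encoding takes O(n). */
static void gentle_encode(const unsigned char *in, size_t n, unsigned char *out) {
    Gentle g;
    gentle_init(&g);
    for (size_t i = 0; i < n; i++) {
        out[i] = g.rank[in[i]];
        gentle_promote(&g, out[i]);
    }
}

static void gentle_decode(const unsigned char *in, size_t n, unsigned char *out) {
    Gentle g;
    gentle_init(&g);
    for (size_t i = 0; i < n; i++) {
        out[i] = g.list[in[i]];
        gentle_promote(&g, in[i]);
    }
}
```

With ASCII input, "bbb" is encoded as 98, 97, 96: the character approaches the head one position at a time, which shows how slowly this variant adapts.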

In itself the move-to-front transformation does not achieve any compression. However, if characters cluster, so that the same character frequently occurs in small neighborhoods, then the move-to-front rule will generate a code in which the small numbers are overrepresented. So, even when in the original text all symbols are equally frequent, so that Huffman encoding gives no saving, the move-to-front transformation may give us a compressible code. In other words: the move-to-front approach transforms a distributional non-uniformity into a frequency non-uniformity.

Burrows-Wheeler Encoding

The Burrows-Wheeler encoding can be used to enhance or create a distributional non-uniformity. With this technique a string S can be encoded and decoded in O(|S|) time. It is less clear how effective the approach is: the effectiveness strongly depends on the structure of the text. It is claimed to be most effective for natural texts, in which some specific substrings can be followed only by one or very few characters. For example, "pr" is almost always followed by a vowel.

We first describe the approach without paying attention to its implementation and the efficiency thereof. For a string S = (s_0, ..., s_{n - 1}) with n characters, S^(i), 0 <= i < n, denotes the string which is obtained from S by shifting the characters cyclically i positions to the left. So, S^(i) = (s_i, ..., s_{n - 1}, s_0, ..., s_{i - 1}). These n strings are sorted lexicographically. Let Q^(i) be the string which in this sorted order occurs at position i. Let x be the value so that Q^(x) = S^(0) = S. In the following, q^(i)_j denotes the character at position j of Q^(i). For a set of n strings like the strings Q^(i), a transversal cut is a string (q^(0)_j, ..., q^(n - 1)_j), for any specified value j, 0 <= j < n. Denote this string by R^(j). The output of the encoding consists of (R^(n - 1), x).
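The definition can be turned into a direct, unoptimized sketch: sort the starting positions of the n cyclic shifts with the library function qsort and read off the last characters. This is not the O(|S|) method mentioned above; comparing two shifts may take n steps. The sketch assumes n >= 1, stores only O(n) extra data, and uses the hypothetical names bw_encode and bw_cmp.

```c
#include <stdlib.h>
#include <string.h>

static const unsigned char *bw_s;   /* string being sorted (qsort passes no context) */
static size_t bw_n;

/* Compare the cyclic shifts starting at *a and *b character by character. */
static int bw_cmp(const void *a, const void *b) {
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    for (size_t k = 0; k < bw_n; k++) {
        unsigned char ci = bw_s[(i + k) % bw_n], cj = bw_s[(j + k) % bw_n];
        if (ci != cj) return ci < cj ? -1 : 1;
    }
    return 0;
}

/* Writes R^(n-1) to out and returns x, the sorted position of S itself.
   Returns (size_t)-1 if memory cannot be allocated. */
static size_t bw_encode(const unsigned char *s, size_t n, unsigned char *out) {
    size_t *rot = malloc(n * sizeof *rot), x = 0;
    if (!rot) return (size_t)-1;
    for (size_t i = 0; i < n; i++) rot[i] = i;     /* rot[i] = shift amount */
    bw_s = s;
    bw_n = n;
    qsort(rot, n, sizeof *rot, bw_cmp);
    for (size_t i = 0; i < n; i++) {
        out[i] = s[(rot[i] + n - 1) % n];   /* last character of shift rot[i] */
        if (rot[i] == 0) x = i;             /* Q^(x) = S^(0) */
    }
    free(rot);
    return x;
}
```

For S = "banana" the sorted shifts are abanan, anaban, ananab, banana, nabana, nanaba, so the output is ("nnbaaa", 3).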

The encoding is straightforward, though it may not be clear how to perform it in less than O(n^2) time. Whether it is effective is even more unclear. What concerns us here is the decoding. We denote character i in string R^(j) by r^(j)_i. Notice that r^(j)_i = q^(i)_j. The strings Q^(i) are a permutation of the strings S^(i). Because the S^(i) are cyclically shifted copies of S, each transversal cut through the S^(i), and thus also each transversal cut R^(j) through the Q^(i), gives a permutation of the elements of S. This is the key observation that allows decoding.

A permutation pi is defined as follows: pi(i) = rank(r^(n - 1)_i), the rank of the character c at position i of R^(n - 1), when sorting the characters of this string without changing the order of the characters that occur more than once. That is, we should use a stable sorting algorithm. It is easy to determine pi() in O(n) time using a bucket-sort variant. Let S' = (r^(n - 1)_{pi^{n - 1}(x)}, ..., r^(n - 1)_{pi^1(x)}, r^(n - 1)_{pi^0(x)}). Here pi^i(x) is defined by pi^i(x) = pi(pi^{i - 1}(x)) for all i > 0 and pi^0(x) = x. Once pi() has been computed, each further application of pi() can be performed in constant time, so the string S' can be computed in O(n) time. The following theorem states that S' equals S and that thus the decoding can be performed in O(n) time.
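The decoding can be sketched as follows, with pi() computed by a stable counting sort over a byte alphabet. The alphabet of 256 symbols, the requirement n >= 1 and the name bw_decode are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Reconstruct S (length n) from r = R^(n-1) and the index x.
   Returns 0 on success and -1 if memory cannot be allocated. */
static int bw_decode(const unsigned char *r, size_t n, size_t x,
                     unsigned char *s) {
    size_t start[256] = {0}, count[256] = {0};
    size_t *pi = malloc(n * sizeof *pi);
    if (!pi) return -1;
    for (size_t i = 0; i < n; i++) count[r[i]]++;
    /* start[c] = number of characters in r that are smaller than c */
    for (int c = 1; c < 256; c++) start[c] = start[c - 1] + count[c - 1];
    /* Stable counting sort: equal characters keep their order in r. */
    for (size_t i = 0; i < n; i++) pi[i] = start[r[i]]++;
    /* s_{n-1-i} = r[pi^i(x)]: fill S from its last character to its first. */
    size_t j = x;
    for (size_t i = n; i-- > 0; ) {
        s[i] = r[j];
        j = pi[j];
    }
    free(pi);
    return 0;
}
```

For the pair ("nnbaaa", 3) the permutation is pi = (4, 5, 3, 0, 1, 2), and following it from x = 3 yields a, n, a, n, a, b, that is, "banana" read backwards.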

Theorem: S' = S.

Proof: The proof goes by induction. The basis of the induction is given by r^(n - 1)_{pi^0(x)} = r^(n - 1)_x = q^(x)_{n - 1} = s^(0)_{n - 1} = s_{n - 1}. Assume we have established that s_{n - 1 - i} = r^(n - 1)_{pi^i(x)}. This is an element of R^(n - 1), namely r^(n - 1)_j = q^(j)_{n - 1} for j = pi^i(x). Because Q^(j) is a cyclically shifted copy of S, this implies that s_{n - 2 - i} = q^(j)_{n - 2}. Let Q^(j') be the copy of S which is obtained from Q^(j) by shifting it cyclically one position to the right. Then q^(j')_{n - 1} = q^(j)_{n - 2} = s_{n - 2 - i}, the element we are looking for. Thus, it remains to show that pi(j) = j'.

The index j' is so that q^(j')_0 = q^(j)_{n - 1}, because Q^(j') is obtained from Q^(j) by shifting it one position. The Q^(l), 0 <= l < n, are lexicographically sorted. In particular, this implies that their first elements, the elements of R^(0), of which q^(j')_0 is one, appear in sorted order. Because R^(0) is a permutation of R^(n - 1), j' is the rank of the element q^(j)_{n - 1} in the sorted order of R^(n - 1). That is, j' = pi(j), as was to be shown.

If the character c = q^(j)_{n - 1} occurs several times, we still must verify that pi() maps j to precisely the right value. For any index l < j, if q^(l)_{n - 1} = c, then replacing this character by a smaller character c' has no impact on the rank of the character q^(j)_{n - 1}. It is here that we essentially use that the sorting is stable. Likewise, for any index l > j, if q^(l)_{n - 1} = c, then replacing this character by a larger character c'' leaves the rank of q^(j)_{n - 1} unchanged. Thus, without loss of generality we may assume that all characters in S are different. End.

Burrows-Wheeler Encoding

Exercises

  1. Compute the overlap values for the string Q = (aaaabbbaaaaabaaaabbbaaa).

  2. Detecting a string Q in a text S is a problem which can also be performed by a finite automaton.

  3. Above it was stated that the permutation pi which is used in the decoding of the Burrows-Wheeler encoding can be computed in O(n) time for a string S of length n. This is not true in full generality. Formulate a sufficient condition for this claim to hold.

  4. Investigate the truth of the following claim: "for any string S, first applying move-to-front encoding followed by Huffman encoding, is always at least as good as when directly applying Huffman encoding to S". If the claim is true, you should formally prove it. If it is not true, you should try to determine how much worse the resulting encoding can become by first applying move-to-front.

  5. The Burrows-Wheeler encoding for the 45-character string S = "adadadfdadaafeedeaeaeddhhaabaaabbabbhbgccaacc" is S' = "bhacdaabadcddeeaabaahbccgaafaaeeaddaaefdabhbd". Apply the move-to-front encoding followed by the Huffman encoding to S'. Is the result better or worse than when applying this double encoding directly to S?

  6. Write a Java program implementing the trivial algorithm for finding a substring q[] in a string s[] and the better algorithm by Knuth, Morris and Pratt. Test it on a relatively long text for several substrings of various lengths. For example, take the text of one of the longer chapters of these lecture notes and search for 10 words of 3 letters, 10 words of 6 letters and 10 words of 12 letters each. Take words which do not occur in the text. Measure the times and compare them.

  7. Compute a Huffman encoding for the characters appearing in the text of this question. Do not forget to distinguish capitals from small letters, and take symbols into account as well. How many bits do you need for the encoded text? What saving ratio do you achieve?

  8. Write a Java program for computing Huffman encodings. Apply it to a sufficiently long text; take, for example, the text of one of the longer chapters of these lecture notes. Determine the compression ratio. Now apply it again, interpreting groups of k consecutive characters as characters of an alphabet with 256^k characters. Compute the compression ratios for several small k values in order to determine the optimal choice of k and the maximum achievable compression ratio. The approach can be refined by only processing lower-case letters in groups.

  9. Write a Java method for processing a character c during the move-to-front encoding. The header of this subroutine should read "int processChar(char c, int i, Node[] array, BalancedTree tree)". i gives the position of c in the input string S. The array contains pointers to the nodes of the tree, the tree is constructed as above. The class BalancedTree supports the basic operations find, insert and delete. It has an instance variable root of type Node, a pointer to the root of the tree. In the tree smaller values are stored to the left. The nodes have instance variables left and right of type Node, but there are no links upwards. The returned int should give the rank of c. Considering the performed operations, how would the structure of the tree develop if it would not be balanced?

  10. Write a Java program for performing the gentle variant of the move-to-front encoding in which an encountered character is moved x positions towards the head of the list. Combine it with the program for Huffman encoding and determine the optimal choice of x. Characters are not grouped together. Try various source texts. How much additional compression does the move-to-front encoding bring?

  11. Write a Java program for performing the Burrows-Wheeler encoding. You do not need to implement suffix trees, for a string S, the lexicographical sorting of the shifted strings may be performed in O(n^2) time, for n = |S|. On the other hand, the memory usage of your program should be bounded by O(n). Combine the program with the move-to-front and Huffman encodings. Try various source texts. How much additional compression does the Burrows-Wheeler encoding bring?





Divide and Conquer

Old Problems

We have already seen several divide-and-conquer algorithms:

It is slightly arbitrary what is exactly divide-and-conquer. It also depends on the way the algorithm is formulated. As an example we consider again the three algorithms for computing exponents.

  int exponent_1(int x, int n) 
  {
    int c, z;
    for (c = x, z = 1; n != 0; n = n / 2) 
    {
      if (n & 1) /* n is odd */
        z *= c;
      c *= c; 
    }
    return z; 
  }


  int exponent_2(int x, int n) 
  {
    if (n == 0) /* terminal case */
      return 1;
    if (n & 1) /* n is odd */
      return x * exponent_2(x, n - 1);
    return exponent_2(x, n / 2) * exponent_2(x, n / 2); 
  }


  int exponent_3(int x, int n) 
  {
    int y;
    if (n == 0) /* terminal case */
      return 1;
    if (n & 1) /* n is odd */
      return x * exponent_3(x, n - 1);
    y = exponent_3(x, n / 2);
    return y * y; 
  }

All three algorithms are based on the same idea, and therefore they should all be classified in the same way. The basic recursive variant, exponent_2, expresses the divide-and-conquer idea very clearly and therefore it appears logical to call all of them divide-and-conquer algorithms, even though in the non-recursive algorithm the structure is camouflaged quite well, and even though in the second recursive algorithm the second subproblem is not treated separately for efficiency reasons. In general, if occurring subproblems are the same or almost the same, then it is mostly a good idea to solve them only once and to reuse the solution.

General Pattern

The general structure of a divide-and-conquer algorithm consists of three steps:
  1. Divide the problem into two or more subproblems (not necessarily of the same size).
  2. Find solutions for the subproblems.
  3. Combine the solutions of the subproblems to a solution for the whole problem.

This structure is clearly visible in the recursive algorithms for multiplying large integers and matrices. In the case of binary search, one of the two subproblems is void and not considered further. If there is only one subproblem to further consider, one may rather speak of simplification than of divide-and-conquer. The second recursive exponentiation algorithm might also be considered to be an algorithm that works by simplification.

Recursive algorithms are appealing because often they can be formulated very easily. They can mostly also be given in a functional way, which implies that their correctness is generally easier to verify than that of iterative algorithms. Recursion causes overhead, but this problem can be alleviated by not continuing all the way to a base case. Recursive algorithms even have computational advantages:

Of course, there is more to computer science than recursive algorithms, and even though recursive algorithms behave nicely in the light of a memory hierarchy or on a parallel computer, algorithms which are designed for these purposes may perform even much better. An example is sorting. Quick sort is the natural recursive approach to sorting. It works fine even for sorting large data sets and it can also be parallelized quite reasonably. However, multiway merge sort is a much better algorithm for external sorting, and sample sort is a much better idea for parallel sorting.

Any recursive algorithm should also specify one or more well-defined base cases which can be solved in an alternative way. From any possible starting situation it must be guaranteed that the algorithm reaches such a base case, otherwise the algorithm may not terminate. The algorithm used for solving the base cases does not need to have particularly good asymptotic performance, as long as it is fast for the small problems to which it is applied (this is most important if there are many small subproblems to solve). If there are finitely many instances of the base cases, it may be profitable to solve all of them in a precomputation round and to store their solutions.

As an example of the precomputation idea, we consider a special kind of matrix multiplication. The entries of the matrices are 0, 1 and 2. The operation + is defined by x + y = max{x, y}. The operation * is defined by x * y = min{x, y}. Such a definition makes sense in the context of communication: when sending a signal through two channels, the result will not be better than when sending it over the weakest of the two channels. When there are alternatives, the best of these can be chosen. Using the conventional recursive algorithm, the number of arithmetic operations is dominated by the number of products, which equals n^3. However, there are only 3^9 different 3 x 3 matrices. So, there are only 3^18 < 4 * 10^8 products of 3 x 3 matrices. The results of these products can be precomputed and stored (each in one word). The entries of the matrices for which the product must be computed can be used for indexing (using direct addressing), so with a few shifts and additions the product of 3 x 3 matrices can be computed, an operation which otherwise would require 27 products. For large matrices this may make the computation several times faster.
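
The indexing idea can be sketched on a smaller scale. The following Java fragment is only an illustration under simplifying assumptions: it uses 2 x 2 blocks instead of 3 x 3 blocks, so that the table has just 81 * 81 entries, and it encodes a block as a number in base 3. The class MaxMinTable and all method names are hypothetical.

```java
// Sketch: precomputed products of 2 x 2 matrices over {0, 1, 2} with + = max and * = min.
class MaxMinTable
{
  static final int CODES = 81; // 3^4 different 2 x 2 matrices with entries 0, 1, 2

  // TABLE[x * CODES + y] holds the code of the product of the matrices with codes x and y
  static final byte[] TABLE = new byte[CODES * CODES];

  static
  {
    for (int x = 0; x < CODES; x++)
      for (int y = 0; y < CODES; y++)
        TABLE[x * CODES + y] = (byte) encode(multiply(decode(x), decode(y)));
  }

  // Read the four entries as the digits of a number in base 3
  static int encode(int[][] m)
  {
    return m[0][0] + 3 * m[0][1] + 9 * m[1][0] + 27 * m[1][1];
  }

  // Inverse of encode
  static int[][] decode(int code)
  {
    return new int[][] {{code % 3, code / 3 % 3}, {code / 9 % 3, code / 27}};
  }

  // Direct product with x + y = max{x, y} and x * y = min{x, y}
  static int[][] multiply(int[][] a, int[][] b)
  {
    int[][] c = new int[2][2]; // all entries 0, the neutral element of max
    for (int i = 0; i < 2; i++)
      for (int j = 0; j < 2; j++)
        for (int k = 0; k < 2; k++)
          c[i][j] = Math.max(c[i][j], Math.min(a[i][k], b[k][j]));
    return c;
  }

  // Product by direct addressing: one table access replaces the 8 min operations
  static int[][] lookupProduct(int[][] a, int[][] b)
  {
    return decode(TABLE[encode(a) * CODES + encode(b)]);
  }
}
```

In a real implementation the blocks would be kept in encoded form throughout, so that encode and decode are not called for every product.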

Sorting

Quick Sort

Quick sort is the purest example of a divide-and-conquer algorithm:
  static final java.util.Random random = new java.util.Random();

  static void sort(int[] a, int n)
  // sort n elements standing in positions 0, ..., n - 1 of a[]
  {
    if (n > 1)
    {
      int i;
      int[] a_smaller = new int[n], a_equal = new int[n], a_larger = new int[n];
      int n_smaller = 0, n_equal = 0, n_larger = 0;
      int s = a[random.nextInt(n)]; // splitter: a random element of a[]

      // Split the set of elements in three subsets using s
      for (i = 0; i < n; i++)
        if      (a[i] < s)
          a_smaller[n_smaller++] = a[i];
        else if (a[i] == s)
          a_equal[n_equal++] = a[i];
        else
          a_larger[n_larger++] = a[i];

      // Solve two recursive subproblems
      sort(a_smaller, n_smaller);
      sort(a_larger, n_larger);

      // Combine the results
      for (i = 0; i < n_smaller; i++)
        a[i] = a_smaller[i];
      for (i = 0; i < n_equal; i++)
        a[i + n_smaller] = a_equal[i];
      for (i = 0; i < n_larger; i++)
        a[i + n_smaller + n_equal] = a_larger[i];
    } 
  }

Here we can clearly distinguish a subdivision, a phase in which two subproblems are solved and the final recombination. The element s is called the splitter. It is important that this splitter is chosen at random: if we simply took the first element, the running time would be quadratic for sorted inputs. It is also important that the splitter is selected from the set of numbers in the array and not from the domain of values.

How long does the algorithm take? Here we have the problem of analyzing a randomized algorithm. There are several ways to prove that the running time is O(n * log n) in some sense. We first consider the expected running time.

Theorem: The expected time for sorting n numbers using quick sort is O(n * log n).

Proof: Let t_i be the number of times the number a[i] is involved in a splitting operation. The time consumption is proportional to T = sum_i t_i. Thus the expected time is proportional to Exp[T] = Exp[sum_i t_i] = sum_i Exp[t_i] = n * Exp[t_0]. Here we used the linearity of expectation, which allows us to take a summation out of the computation of the expected value. This holds even when the random variables are not independent. In the last equality we used that all t_i are identically distributed. This implies that Exp[t_i] = Exp[t_j] for all i and j, 0 <= i, j < n. Thus, the sum can be replaced by the product of the number of terms and any of the values. So, it remains to estimate the expected number of times any number is involved in a splitting operation.

The central concept in the analysis is that of a successful split. The split of a subset of size n' is said to be successful if the size of the largest of the two subsets resulting after the split is at most 3/4 * n'. Independently of n' this happens with probability about 1/2, because a successful split results precisely if the splitter s is chosen equal to any element with rank between n' / 4 and 3/4 * n'.

Any element is involved in at most log_{4/3} n successful splits, because after this many reductions of the problem size to a fraction 3/4, the problem size certainly has been reduced to 1 (for any alpha, (1 / alpha)^{log_alpha(n)} = 1 / alpha^{log_alpha(n)} = 1 / n). Let X be the random variable giving the number of splits a number is involved in before there have been log_{4/3} n successful splits. Let x_i be the number of splits between successful split i - 1 and i. X = sum_i x_i, and so Exp[X] = Exp[sum_i x_i] = sum_i Exp[x_i] = 2 * log_{4/3} n, because Exp[x_i] = sum_{j > 0} j / 2^j = 2. This gives an upper bound on the expected number of splits any element is involved in. End.
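
The bound can be compared with a measurement. The following Java sketch counts, for one random run, in how many splitting operations an element is involved on average and computes the bound 2 * log_{4/3} n for comparison. It is an illustration only: the class SplitCount, its methods and the fixed seed are hypothetical, and the routine only counts, it does not store the sorted result.

```java
import java.util.Random;

// Sketch: measures T = sum_i t_i for one run of randomized quick sort.
class SplitCount
{
  static long involved; // T = sum_i t_i, the total number of (element, split) pairs

  // Performs the splits of quick sort on a[0 .. n - 1] without recombining
  static void split(int[] a, int n, Random random)
  {
    if (n <= 1)
      return;
    involved += n; // each of the n elements is involved in this split
    int[] smaller = new int[n], larger = new int[n];
    int nSmaller = 0, nLarger = 0;
    int s = a[random.nextInt(n)];
    for (int i = 0; i < n; i++)
      if (a[i] < s)
        smaller[nSmaller++] = a[i];
      else if (a[i] > s)
        larger[nLarger++] = a[i];
    split(smaller, nSmaller, random);
    split(larger, nLarger, random);
  }

  // Average number of splits per element for n random numbers
  static double averageSplits(int n, long seed)
  {
    Random random = new Random(seed);
    int[] a = new int[n];
    for (int i = 0; i < n; i++)
      a[i] = random.nextInt();
    involved = 0;
    split(a, n, random);
    return (double) involved / n;
  }

  // The upper bound 2 * log_{4/3} n from the proof
  static double bound(int n)
  {
    return 2 * Math.log(n) / Math.log(4.0 / 3.0);
  }
}
```

For n = 10000 the bound is about 64. The measured average is expected to lie clearly below it, because the analysis with successful splits is pessimistic.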

The above is a fine result, but it does not tell us how often the running time may be much larger. The Markov inequality gives only a very weak bound: it states that the probability of a running time of at least c * n * log n is bounded by O(1 / c). This is not very satisfactory. In the following we prove that the running time of quick sort is bounded by O(n * log n) even with high probability. In general a claim on the complexity of a problem of size n holds with high probability, when the probability that it is not satisfied is bounded by n^{- eps}, for some eps > 0.

Extremely useful tools in the analysis of randomized algorithms are the so-called Chernoff bounds, which can be used to estimate the tail probabilities of the distribution of the sum of independent Bernoulli trials:

Lemma: let X be the random variable corresponding to throwing a biased coin: with probability p the result is 1, with probability 1 - p the result is 0. Let S be the random variable giving the sum of the outcomes of n independent experiments of the above type. Then

Prob[S > p * n + h] <= e^{- h^2 / (3 * p * n)}

We do not prove this lemma (it is a consequence of a clever application of the Markov inequality). We want to stress the importance of the independence of the experiments. Often this condition is not satisfied. The lemma implies that the deviation from the expected value p * n is not larger than O(sqrt(p * n * log n)) with high probability (as long as this is at least log n). Notice that by taking h slightly larger, the non-success probability becomes much smaller: for any failure probability O(n^{-k}), with k constant, it suffices to take h = O(sqrt(p * n * log n)).
The typical deviation from the expected value of the sum of n independent Bernoulli random variables grows approximately as sqrt(n).

In the analysis of the time of quick sort we do not need such a strong result (though it does not harm). The following much weaker result, which is easier to prove, is also effective for our purposes:

Lemma: let X be the random variable corresponding to throwing an unbiased coin: with probability 1/2 the result is 1, with probability 1/2 the result is 0. Let S be the random variable giving the sum of the outcomes of n independent experiments of the above type. Then

Prob[S < n / 4] <= n / 4 * 0.91^n.

Proof: The probability P_t that among the n trials there are exactly t successes is given by

P_t = (n over t) * 2^{-n}.
Away from the expected value n / 2 these probabilities are monotonically decreasing. Thus, for h <= n / 2, sum_{t < h} P_t <= h * P_h. Using Stirling's formula, we find that for all n and k, 1 <= k <= n:
(n / k)^k <= (n over k) <= (n * e / k)^k.
So, P_t <= 2^{-n} * (n * e / t)^t. Substituting h = t = n / 4 gives
Prob[S < n / 4] <= n / 4 * ((4 * e)^{1/4} / 2)^n <= n / 4 * 0.91^n,
because (4 * e)^{1/4} / 2 ~= 0.908.
End.

Thus, when expecting n / 2 successes, we may almost exclude that there are fewer than n / 4 successes.
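
For small n the lemma can be checked numerically by summing the binomial probabilities exactly. The Java sketch below does this; it is an illustration only and the class TailBound and its method names are hypothetical.

```java
// Sketch: compares the exact value of Prob[S < n / 4] with the bound n / 4 * 0.91^n.
class TailBound
{
  // Exact value of Prob[S < n / 4] for n independent throws of an unbiased coin
  static double exactTail(int n)
  {
    double p = Math.pow(0.5, n); // P_0 = 2^{-n}
    double sum = 0;
    for (int t = 0; 4 * t < n; t++)
    {
      sum += p;                  // add P_t
      p = p * (n - t) / (t + 1); // P_{t + 1} = P_t * (n - t) / (t + 1)
    }
    return sum;
  }

  // The bound given by the lemma
  static double lemmaBound(int n)
  {
    return n / 4.0 * Math.pow(0.91, n);
  }
}
```

For n = 4 the exact probability is 1 / 16, while the bound gives about 0.69. The bound is far from sharp, but it decreases exponentially in n, which is all that is needed.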

Theorem: Quick sort sorts n numbers in O(n * log n) time, with high probability.

Proof: In this case we are considering the depth d of the whole tree of recursions. Because at every level at most O(n) work is performed, the total work is bounded by O(d * n). We will show that d = O(log n) with high probability, which gives the claimed result.

Consider the depth of the branch of the recursion tree in which a specific element x lies. At every level the size of the subset to which x belongs is reduced, and the recursion ends when the size of this subset has been reduced to 1, consisting of x itself only.

A sequence of at most log_{4/3} n successful splits (intermixed with any number of non-successful splits) results in a set of size one. The chance that a split is successful is independent of any of the previous actions. Thus, we can use the Chernoff bounds to state that we need only marginally more than 2 * log_{4/3} n = O(log n) splits to guarantee that with high probability log_{4/3} n of them are successful. Applying our weaker lemma we conclude that from among 4 * log_{4/3} n splits at least log_{4/3} n are successful with high probability.

This argument is correct but not complete. One still should address the following point: we have now shown that the branch with element x as a leaf does not lie deeper than O(log n) with high probability. However, the recursion tree has n leaves and we must prove that all of them lie at depth bounded by O(log n). Thus, we have n experiments which are not independent. In the analysis we use that

Prob[a or b] = Prob[a] + Prob[b] - Prob[a and b] <= Prob[a] + Prob[b].
This holds even when a and b are dependent! For a set a_i, 1 <= i <= n, of n events with the same probability this observation gives
Prob[a_1 and ... and a_n] = 1 - Prob[not (a_1 and ... and a_n)] = 1 - Prob[(not a_1) or ... or (not a_n)] >= 1 - n * Prob[not a_1].
Thus, if we can bound the non-success probability of a single experiment to n^{- (1 + eps)}, for some eps > 0, then we can guarantee that with high probability even n of these experiments end successfully. This last condition is satisfied in our case (and hardly ever a problem, it should just be remarked) because 0.91^{4 * log_{4/3} n} ~= 0.91^{9.63 * log_2 n} ~= 0.40^{log_2 n} ~= n^{-1.31}. End.

Merge Sort

Even merge sort might be considered to be a divide-and-conquer algorithm, but this appears somewhat artificial: the normal way of doing merge sort is not the efficient reformulation of a recursive algorithm (as the exponentiation algorithm): it rather feels like a bottom-up approach than like a top-down approach. But, of course merge sort can be formulated very elegantly, with few special cases to consider and obvious correctness as follows:
  static private void merge(int[] a, int l, int m, int h)
  {
    int[] b = new int[h - l + 1];
    int i = l, j = m + 1, k = 0;
    while (i <= m && j <= h)
      if (a[i] <= a[j])
        b[k++] = a[i++];
      else
        b[k++] = a[j++];
    while (i <= m)
      b[k++] = a[i++];
    while (j <= h)
      b[k++] = a[j++];
    i = l; 
    k = 0;
    while (i <= h)
      a[i++] = b[k++];
  }

  static private void sort(int[] a, int l, int h)
  // Sort h - l + 1 elements of a[] running from l to h
  {
    if (h > l) // At least two elements
    {
      int m = (l + h) / 2;
      sort(a, l, m);
      sort(a, m + 1, h);
      merge(a, l, m, h);
    }
  }

  static void sort(int[] a, int n)
  {
    sort(a, 0, n - 1);
  }


A general point with divide-and-conquer algorithms and more generally with recursive algorithms is that often it is quite easy to verify their correctness. On the other hand, they are mostly less efficient than the corresponding bottom-up algorithms. The efficiency can often be improved a lot by stopping at a slightly higher level. For example, in the given merge-sort algorithm, the number of calls to the recursive subroutine sort can be reduced a lot by applying bubble-sort on all sequences of length less than 8: in the current implementation, sort is called at most 2 * n times, in the alternative one far fewer calls remain. Adding such trivial improvements, often very good performance can be obtained at the same time as a clear and convincing structure.
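
The improvement can be sketched as follows. The Java fragment repeats the merge routine given above and adds a bubble sort for short sequences. The class CutoffMergeSort, the constant CUTOFF and the counter calls are hypothetical names introduced for illustration only.

```java
// Sketch: merge sort that stops the recursion for sequences of length less than 8.
class CutoffMergeSort
{
  static final int CUTOFF = 8; // shorter sequences are sorted by bubble sort
  static int calls;            // number of calls to the recursive routine

  // Bubble sort on a[l .. h]
  static void bubbleSort(int[] a, int l, int h)
  {
    for (int i = h; i > l; i--)
      for (int j = l; j < i; j++)
        if (a[j] > a[j + 1])
        {
          int x = a[j]; a[j] = a[j + 1]; a[j + 1] = x;
        }
  }

  // Merge the sorted sequences a[l .. m] and a[m + 1 .. h]
  static void merge(int[] a, int l, int m, int h)
  {
    int[] b = new int[h - l + 1];
    int i = l, j = m + 1, k = 0;
    while (i <= m && j <= h)
      if (a[i] <= a[j])
        b[k++] = a[i++];
      else
        b[k++] = a[j++];
    while (i <= m)
      b[k++] = a[i++];
    while (j <= h)
      b[k++] = a[j++];
    for (i = l, k = 0; i <= h; i++, k++)
      a[i] = b[k];
  }

  static void sort(int[] a, int l, int h)
  {
    calls++;
    if (h - l + 1 < CUTOFF) // base case: short sequence
      bubbleSort(a, l, h);
    else
    {
      int m = (l + h) / 2;
      sort(a, l, m);
      sort(a, m + 1, h);
      merge(a, l, m, h);
    }
  }

  static void sort(int[] a, int n)
  {
    calls = 0;
    sort(a, 0, n - 1);
  }
}
```

For n = 1024 this version performs 511 calls of the recursive routine, against 2 * n - 1 = 2047 calls when the recursion continues down to single elements.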

Selection

Finding the maximum of a set is easy: just traverse the elements and keep track of the currently largest element. This takes O(n) time. Finding the second largest element is still simple: keep track of the two largest elements and update for each new scanned element. This can still be implemented in O(n) time, be it with worse constants. More generally, for any constant number k, the k-th largest element can be determined in O(n) time. Using a priority queue, we might generally find the k largest in O(n + k * log n) time. First building a heap in linear time and then performing a variant of Dijkstra's algorithm searching through this heap, the time can be further reduced to O(n + k * log k). Nevertheless, for k = eps * n, for any constant eps > 0, the time becomes Theta(n * log n). Practically, considering the efficient algorithms for sorting, already for modest k it will be more efficient to sort the numbers and then to pick the right element.

The selection problem is to select out of a set of numbers S (we can assume that they are integers that are all different), the element with rank k: the number that would appear at position k (starting at position 0) if the elements of S were sorted.

For a set S stored in an array a[] of length n most algorithms for selecting the element with rank k are of the following form:

  int select(int[] a, int n, int k) 
  {
    int[] asm = new int[n]; 
    int[] aeq = new int[n]; 
    int[] alg = new int[n]; 
    int nsm = 0;
    int neq = 0;
    int nlg = 0;
    int s = selectSplitter(a, n);
    for (int i = 0; i < n; i++) 
      if      (a[i] <  s)
      { asm[nsm] = a[i]; nsm++; }
      else if (a[i] == s)
      { aeq[neq] = a[i]; neq++; }
      else    
      { alg[nlg] = a[i]; nlg++; }
    if (nsm > k)
      return select(asm, nsm, k);
    if (nsm + neq > k) 
      return s;
    return select(alg, nlg, k - nsm - neq); 
  }

The only difference in these algorithms is how selectSplitter(,) is implemented. Of course it is not necessary to use additional memory. If it is allowed to rearrange the array a[], the algorithm can be made in-situ in the same way as quick sort.
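
An in-situ variant might look as follows. This Java sketch uses a random splitter and a three-way partition with swaps. It is one possible way of doing it, not the program belonging to this text, and the class InPlaceSelect and its method names are hypothetical.

```java
import java.util.Random;

// Sketch: selection without additional arrays; the array a[] is rearranged.
class InPlaceSelect
{
  static final Random random = new Random();

  // Returns the element with rank k, 0 <= k < n, among a[0 .. n - 1]
  static int select(int[] a, int n, int k)
  {
    int lo = 0, hi = n - 1; // the element searched for lies in a[lo .. hi]
    while (true)
    {
      int s = a[lo + random.nextInt(hi - lo + 1)];
      // Afterwards: a[lo .. lt - 1] < s, a[lt .. gt] == s, a[gt + 1 .. hi] > s
      int lt = lo, gt = hi, i = lo;
      while (i <= gt)
        if (a[i] < s)
          swap(a, lt++, i++);
        else if (a[i] > s)
          swap(a, i, gt--);
        else
          i++;
      if (k < lt)
        hi = lt - 1;  // continue with the smaller elements
      else if (k > gt)
        lo = gt + 1;  // continue with the larger elements
      else
        return s;     // position k holds an element equal to s
    }
  }

  static void swap(int[] a, int i, int j)
  {
    int x = a[i]; a[i] = a[j]; a[j] = x;
  }
}
```

Because positions are kept absolute, the rank k does not have to be adjusted when continuing with the larger elements, and the recursion turns into a simple loop.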

Randomized Algorithm

Algorithm

There is a simple randomized realization of the above idea: the splitter s is selected as follows:
  int selectSplitter(int[] a, int n)
  {
    return a[random.nextInt(n)];
  }
Here random.nextInt(n) is supposed to pick uniformly at random any of the values from {0, 1, ..., n - 1}; random is assumed to be an instance of java.util.Random.

This algorithm is effective also in practice, but the worst-case bound is bad: if we split off one element every time, we may need quadratic time. Once n has been reduced to n' for some sufficiently small n', further recursions can be saved by sorting the remaining n' elements and outputting the element with the correct rank. If n' = o(n / log n), the sorting cost is o(n).

Analysis

Let n_t denote the size of the subproblem the algorithm is working on after t reductions. n_0 = n. The recursion terminates at the latest when n_t = 1. The total time consumption is proportional to sum_t n_t, so estimating the n_t gives a bound on the time consumption T. For the expected time consumption we use that Exp[T] ~ Exp[sum_t n_t] = sum_t Exp[n_t]. So, we should put a bound on Exp[n_t] for all t.

The value of Exp[n_1] depends on k. If k = 0, Exp[n_1] = sum_{i = 0}^{n - 1} i / n = n * (n - 1) / (2 * n) = (n - 1) / 2 ~= n / 2. If k = n / 2, Exp[n_1] ~= 2 * sum_{i = 0}^{n / 2} (n - i) / n ~= 3/4 * n. This is the worst case (it is not hard to write Exp[n_1] as a function of k), so we conclude Exp[n_1] <= 3/4 * n. In the same way Exp[n_2] can be computed and by induction we get Exp[n_t] <= (3/4)^t * n. Thus we get

  Exp[sum_t n_t] =  sum_t        Exp[n_t] 
                 <= sum_{t >= 0} (3/4)^t * n 
                 <= 4 * n.

This argument is correct, but requires some further explanation. Let k_t be the value of k after t passes and let r_t = n_{t + 1} / n_t. r_t is the reduction factor in round t. The expected value of r_t depends on k_t, but not on n_t: repeating the above analysis of Exp[n_1] as a function of n_0 and k_0 it can easily be checked that Exp[r_t] assumes its maximal value 3/4 for k_t = n_t / 2, independently of n_t. So, the r_t are independent of each other.

Independence is essential, because even though we always have Exp[sum_i X_i] = sum_i(Exp[X_i]), there is no general analogue for the expectation of a product of random variables. Only if the X_i are independent, we may take a product out of the Exp[]. We show this for two random variables, the general proof goes by induction:

  Exp[X * Y] = sum_i sum_j i * j * Prob[X = i and Y = j]
             = sum_i sum_j i * j * Prob[X = i] * Prob[Y = j]
             = sum_i i * Prob[X = i] * sum_j j * Prob[Y = j]
             = Exp[X] * Exp[Y].

Because in our case the r_t are independent, we may conclude that Exp[n_t] = Exp[n_0 * Prod_{i <= t} r_i] = n_0 * Exp[Prod_{i <= t} r_i] = n_0 * Prod_{i <= t} Exp[r_i] <= n_0 * (3/4)^t. It was a very good idea to introduce the r_t. Arguing about the n_t, which are clearly dependent on each other, would have been much harder.
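
The bound Exp[sum_t n_t] <= 4 * n can be compared with a simulation. The Java sketch below assumes that all elements are different, so that only ranks matter: the subproblem always consists of the elements with ranks lo, ..., hi and the splitter is a uniformly chosen rank from this range. The class SelectCost, its methods and the fixed seed are hypothetical.

```java
import java.util.Random;

// Sketch: simulates randomized selection on ranks and returns sum_t n_t.
class SelectCost
{
  // One run of selecting rank k out of n different elements
  static long cost(int n, int k, Random random)
  {
    long sum = 0;
    int lo = 0, hi = n - 1; // the subproblem holds the ranks lo, ..., hi
    while (true)
    {
      sum += hi - lo + 1;                       // n_t
      int s = lo + random.nextInt(hi - lo + 1); // rank of the random splitter
      if (s == k)
        return sum;
      if (k < s)
        hi = s - 1; // continue with the smaller elements
      else
        lo = s + 1; // continue with the larger elements
    }
  }

  // Average of sum_t n_t over a number of independent runs
  static double averageCost(int n, int k, int trials, long seed)
  {
    Random random = new Random(seed);
    long total = 0;
    for (int i = 0; i < trials; i++)
      total += cost(n, k, random);
    return (double) total / trials;
  }
}
```

For the median, k = n / 2, the average is expected to stay below 4 * n, in line with the derivation above.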

So, this sounds very good: the expected time is O(n) with very small constants. Can we even prove that the running time is bounded by O(n) with high probability? Maybe surprisingly, the answer is no. The probability that r_i >= 1 - alpha is at least alpha. Thus, the probability that we have t of these small reductions in a row is at least alpha^t. For constant alpha this is not so serious, but we can also take alpha = 1 / loglog n and t = loglog n (this is not the worst possible choice but it works well for our purpose). For these choices, (1 - alpha)^t = (1 - 1 / loglog n)^{loglog n} ~= 1 / e, so n_t >= n / e. So, sum_{i = 0}^{t - 1} n_i >= t * n_t >= loglog n * n / e = Omega(n * loglog n). The probability that this happens is at least alpha^t = (1 / loglog n)^{loglog n} = 2^{-loglog n * logloglog n}. Because loglog n * logloglog n = o(log n), this is larger than n^{-eps} = 2^{-eps * log n} for any eps > 0.

Thus, with non-negligible probability, the algorithm has a running time of Omega(n * loglog n). Notice that this does not contradict the earlier proof of linear expected time. However, this provides a good example of an algorithm with a very nice expected running time, that nevertheless fluctuates almost in the full range allowed by the Markov inequality.

Improvements

The given algorithm is not entirely satisfying in two respects:

Both problems can be overcome. The better randomized algorithms are based on randomly selecting a somewhat larger subset, sorting these elements (recursively performing selection might also do) and either picking the middle one, or picking two elements that are almost guaranteed to be smaller and larger respectively than the element we are looking for. This second idea is the best. Working out the details it can be shown that selecting an element with specified rank from a set with n elements can be performed with n + o(n) comparisons with high probability.

The first idea can only overcome the first problem but has the advantage of being very simple:

  1. Uniformly at random select 4 * log n elements as presplitters.
  2. From these presplitters select (recursively or by sorting) the element s with rank 2 * log n as splitter.
With high probability s has rank at least n / 4 because otherwise there would be 2 * log n presplitters with rank smaller than n / 4, where we expect only 1 * log n. In the same way, the rank of s is at most 3/4 * n. Thus, this choice of s reduces the size of the problem in every round to at most 3/4 of the original size with high probability.
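
The two steps can be sketched in Java as follows. The presplitters are drawn with replacement and sorted with the library routine; both are choices made for this illustration. The class PresplitterSelect and the extra parameter random are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: splitter selection using 4 * log n random presplitters.
class PresplitterSelect
{
  static int selectSplitter(int[] a, int n, Random random)
  {
    // round_up(log n), taken to be at least 1 so that there always is a presplitter
    int logN = Math.max(1, 32 - Integer.numberOfLeadingZeros(n - 1));
    int[] pre = new int[4 * logN];
    for (int i = 0; i < pre.length; i++)
      pre[i] = a[random.nextInt(n)]; // step 1: uniformly selected presplitters
    Arrays.sort(pre);
    return pre[2 * logN];            // step 2: the presplitter with rank 2 * log n
  }
}
```

For an array holding the values 0, ..., n - 1 the value of the returned splitter equals its rank, which makes it easy to observe that it rarely falls outside the range from n / 4 to 3/4 * n.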

Deterministic Algorithm

Algorithm

We have seen that it is rather easy to come up with an efficient randomized algorithm for selection. The expected running time is O(n), and an improved variant even achieves this with high probability. So, apart from the possibly high cost to compute good random numbers, this algorithm is satisfactory for all practical purposes. But, in general, and particularly for such an important problem as selection, it is a fundamental theoretical question whether a deterministic algorithm can achieve the same as a randomized one. This is not always the case: there are problems for which the lower bound for a deterministic algorithm is higher than the running time achieved by a randomized algorithm. So, at first one cannot tell how the situation will be for the selection problem. But, actually, there is a relatively simple algorithm that runs in O(n). It is due to Blum et al. and was presented in 1973. The algorithm is another realization of the generic selection algorithm. The only point which is worked out in an ingenious deterministic way is the selection of the splitter element s.

There are several possible ways. Very cheap, only O(1), is to pick the first element. But this element may be very bad. Considering that the for-loop in which all elements are traversed takes O(n) anyway, it is better to perform a more expensive selection that leads to a more substantial reduction. Our only hope for obtaining an O(n) algorithm along the lines of the above algorithm is to find in c * n time an s that guarantees that the size of the subproblem on which we recurse is bounded by alpha * n, for some alpha < 1. In that case we would find the following recurrence relation:

  T(n) <= c * n + T(alpha * n)
       <= c * n * sum_{t >= 0} alpha^t 
       <= c / (1 - alpha) * n.

So, our goal has been reduced from performing selection to finding a high-quality splitter in O(n). Not even this is easy. We assume that n is divisible by five, the other cases give no special problems. For any set S = {s_i| 0 <= i < n} with n numbers which are not necessarily all different a median is a number s_j, 0 <= j < n, so that |{i| 0 <= i < n, s_i <= s_j}| >= n / 2 and |{i| 0 <= i < n, s_i >= s_j}| >= n / 2.

  1. Divide the set S = {a[i] | 0 <= i < n} in subsets S_j, 0 <= j < n / 5 of five elements each.
  2. For all j, 0 <= j < n / 5, determine the median of S_j and compose a set S' consisting of all these n / 5 medians.
  3. Determine the median of the elements in S' and return this value.

The division in step 1 can be performed arbitrarily, but the easiest is to take S_j = {a[i] | 5 * j <= i < 5 * (j + 1)} for all j, 0 <= j < n / 5. The median of five elements can be determined by sorting. Step 3 is solved by applying the selection algorithm recursively. In code this looks as follows:

  int selectSplitter(int[] a, int n)
  {
    // n must be a multiple of 5.
    int[] b = new int[n / 5];              // b[] contains S'
    for (int j = 0; j < n / 5; j++)        // Loop over all subsets
    {                                      //
      Sort.shortSort(a, 5 * j, 5 * j + 4); // Sort elements in subset
      b[j] = a[5 * j + 2];                 // Save median of S_j
    }                                      //
    return select(b, n / 5, n / 10);       // Select median out of b[]
  }

Click here to see a slightly more general version of the above piece of code integrated in a working Java program. In this program S is divided in n / l subsets of size l each.
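
For experimenting, the generic selection routine and the deterministic splitter selection can be combined into one self-contained class. The following Java sketch makes a few assumptions that are not taken from the text: sets of at most 10 elements are solved by sorting (the base case), the at most four elements that do not fit in a group of five are ignored during splitter selection, and java.util.Arrays.sort takes the place of Sort.shortSort. The class name MedianOfMedians is hypothetical.

```java
import java.util.Arrays;

// Sketch: deterministic selection with groups of five elements.
class MedianOfMedians
{
  // Returns the element with rank k, 0 <= k < n, among a[0 .. n - 1]; rearranges a[]
  static int select(int[] a, int n, int k)
  {
    if (n <= 10) // base case: solve by sorting
    {
      Arrays.sort(a, 0, n);
      return a[k];
    }
    int s = selectSplitter(a, n);
    int[] asm = new int[n];
    int[] alg = new int[n];
    int nsm = 0, neq = 0, nlg = 0;
    for (int i = 0; i < n; i++)
      if (a[i] < s)
        asm[nsm++] = a[i];
      else if (a[i] == s)
        neq++;
      else
        alg[nlg++] = a[i];
    if (nsm > k)
      return select(asm, nsm, k);
    if (nsm + neq > k)
      return s;
    return select(alg, nlg, k - nsm - neq);
  }

  // Returns the median of the medians of the groups of five elements
  static int selectSplitter(int[] a, int n)
  {
    int groups = n / 5;                 // the last n % 5 elements are ignored
    int[] b = new int[groups];          // b[] contains S'
    for (int j = 0; j < groups; j++)
    {
      Arrays.sort(a, 5 * j, 5 * j + 5); // sort a[5 * j .. 5 * j + 4]
      b[j] = a[5 * j + 2];              // save median of S_j
    }
    return select(b, groups, groups / 2);
  }
}
```

The splitter is always an element of a[], so both subproblems are strictly smaller than n and the recursion terminates.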

Analysis

A set of constant size can be sorted in constant time, so in total the n / 5 sorting operations take n / 5 * O(1) = O(n). If T(n) gives (an upper bound on) the time for selecting an element with specified rank from a set of n elements, then the recursive call takes T(n / 5).

How about the quality of this splitter s? Do we have any guarantee? Yes!!! We are going to put a bound on the number nsm of elements in S that may be smaller than s. The same bound also holds for the number nlg of elements that may be larger.

Some elements are certainly not smaller. s is the median of the n / 5 medians of groups of five elements. So, there are at least n / 10 medians with value at least s. Each of these n / 10 medians has another two elements within its group of five elements S_i that are even larger. So, here we can point out at least 3/10 * n elements which are not smaller than s. It follows that nsm <= n - 3/10 * n = 7/10 * n.

Figure: at least 3/10 * n elements are excluded.

The recurrence relation is now slightly different from before:

T(n) <= c * n + T(n / 5) + T(7/10 * n),
where c is mainly determined by the time we need for sorting the groups of size 5, but also for the rearrangements. If we guess that T(n) is linear in n (after all, that is what we hope to prove), then T(n / 5) + T(7/10 * n) = T(9/10 * n). Now we find back the above recurrence with alpha = 9 / 10. So, the solution is T(n) = 10 * c * n, which can be verified by substitution: c * n + 10 * c * n / 5 + 10 * c * 7/10 * n = (1 + 2 + 7) * c * n = 10 * c * n.
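
The solution of the recurrence can also be checked numerically. The Java sketch below evaluates the recurrence with c = 1, with T(0) = 0 and with all divisions rounded down; these choices are assumptions made for the illustration. The class name SelectRecurrence is hypothetical.

```java
// Sketch: evaluates T(n) = n + T(n / 5) + T(7 * n / 10) with T(0) = 0.
class SelectRecurrence
{
  static long t(long n)
  {
    if (n == 0)
      return 0;
    return n + t(n / 5) + t(7 * n / 10); // integer divisions round down
  }
}
```

Because of the rounding the computed values stay below 10 * n, in agreement with the solution T(n) = 10 * c * n.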

Improvements

Assume we are not directly interested in minimizing the time, but want to minimize the number of comparisons made by the algorithm. We can try to optimize the algorithm by not dividing the numbers in groups of 5, but in groups of size l. In that case we get the following recurrence relation:
T_l(n) = c_l * n + T_l(n / l) + T_l((3 * l - 1) / (4 * l) * n).
Assuming linearity allows us to take the two recursive terms together:
T_l(n) = c_l * n + T_l((3 * l + 3) / (4 * l) * n).

The solution of this is given by T_l(n) = d_l * n for

  d_l = c_l / (1 - (3 * l + 3) / (4 * l)) 
      = c_l * 4 * l / (l - 3).
Now it becomes clear why we should not take l = 3 (in that case T_l(n) = O(n * log n)).

With increasing l, the factor becomes smaller, but c_l increases. How many comparisons do we make in a round? For all n / l groups of size l the median must be determined. Sorting l numbers requires at least round_up(log l!) comparisons. In general one needs slightly more comparisons, but for the smallest values of l, this bound is sharp: for l = 5, 7, 9, 11, we need 7, 13, 19, 26 comparisons, respectively. In addition to sorting the n / l subsets, the elements must be redistributed. This takes n comparisons with s. In total c_l = 1 + round_up(log l!) / l.

Trying various values of l gives

   l   c_l    d_l
   5   2.40   24.0
   7   2.86   20.0
   9   3.11   18.7
  11   3.36   18.5
  13   3.54   18.4
  15   3.73   18.7

So, the number of comparisons is minimized for l = 13.
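
The table can be recomputed from the formulas c_l = 1 + round_up(log l!) / l and d_l = c_l * 4 * l / (l - 3). The Java sketch below does this under the assumption, also made in the table, that round_up(log l!) comparisons suffice for sorting l numbers. The class GroupSize and its method names are hypothetical.

```java
// Sketch: recomputes c_l and d_l for odd group sizes l.
class GroupSize
{
  // round_up(log_2 l!), the lower bound on the comparisons for sorting l numbers
  static int sortBound(int l)
  {
    double log2Factorial = 0;
    for (int i = 2; i <= l; i++)
      log2Factorial += Math.log(i) / Math.log(2);
    return (int) Math.ceil(log2Factorial);
  }

  // Comparisons per element in one round: redistribution plus sorting the groups
  static double c(int l)
  {
    return 1 + (double) sortBound(l) / l;
  }

  // Factor in the solution T_l(n) = d_l * n
  static double d(int l)
  {
    return c(l) * 4 * l / (l - 3);
  }

  public static void main(String[] args)
  {
    for (int l = 5; l <= 15; l += 2)
      System.out.printf("%2d %.2f %.1f%n", l, c(l), d(l));
  }
}
```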

Notice that the number of comparisons is smaller than for sorting only for fairly large values of n. Of course our estimates were all pessimistic, but nevertheless the importance of this deterministic algorithm is mainly of a theoretical nature. Furthermore, practically, it is awful to perform sorting with the minimum number of comparisons for any l > 5. Rather one would choose to use insertion sort requiring l * (l - 1) / 2 comparisons. On the other hand, on a modern processor, basic operations such as comparisons on data which reside in the cache can be performed at a much higher rate than fetching data. So, a few extra comparisons do not really matter so much.

The cache performance of the algorithm is more important. Even in that respect the deterministic algorithm is worse: the randomized algorithm gets its splitter almost for free, while the deterministic splitter selection traverses all numbers and in addition performs a recursion. Let T_rnd(n) and T_det(n) denote the number of integers that are brought into the cache by the randomized and deterministic algorithm, respectively. Assuming that both splitter selection procedures in practice are about equally good and both reduce a problem of size n to a problem of size alpha * n (even an improved randomized splitter selection has negligible cost), we get

  T_rnd(n) = T_rnd(alpha * n) + n
           = n / (1 - alpha),
  T_det(n) = T_det(alpha * n) + T_det(n / l) + 2 * n
           = 2 * n / (1 - alpha - 1 / l).
This explains why for large n the deterministic algorithm is more than twice as slow as the randomized one and why the optimal choice of l is larger than computed above notwithstanding the applied quadratic sorting algorithm.

Lower Bounds

Theoretically there is an interesting difference between sorting and selection. For sorting there is the trivial lower bound round_up(log n!) = n * log n - 1.44 * n + o(n). This has almost been matched by an algorithm requiring only n * log n - 1.33 * n + o(n) comparisons. For selection there is nothing comparable. It is rather easy to prove that 3/2 * n comparisons are needed; this bound was given in the original paper by Blum et al. With much effort this has been improved to 2 * n + o(n). On the other hand, the best upper bound, established by Dor and Zwick giving an improvement of an algorithm by Schönhage, Paterson and Pippenger, is still 2.95 * n + o(n). So, for selection there is a considerable gap between upper and lower bound.

In the remainder of this section we first prove that the selection problem requires at least n comparisons and then that even 3/2 * n comparisons are needed. This is interesting for two reasons: the arguments are instructive and it shows that for this problem a deterministic algorithm cannot match a randomized algorithm, because with randomization the problem can be solved with just n + o(n) comparisons.

We focus on selecting the median because this subproblem is the hardest. We may also assume that all values are different. Notice that once the median m = a_j for some j, 0 <= j < n, out of a set of n numbers has been determined, for any other number a_i it must be known whether a_i < m or a_i > m. If this were not known, then a_i could just as well be the median itself. Once we have accepted this, we can apply an information-theoretic argument. The n numbers have to be split in 3 subsets: one consisting of 1 element and 2 of (n - 1) / 2 elements. There are n * ((n - 1) over ((n - 1) / 2)) such arrangements. Thus, the decision tree corresponding to an arbitrary algorithm has at least one leaf at depth at least round_up(log_2 (n * ((n - 1) over ((n - 1) / 2)))) ~= n, using Stirling's formula. For n = 3, 5, 7, 9 this gives lower bounds of 3, 5, 8, 10, respectively.

For all but o(n) numbers the best randomized algorithms indeed only figure out whether they are smaller or larger than a bound on the median. Deterministic algorithms can be kept busy longer. The argument uses an adversary which only gradually fixes the input. We use the notion of a crucial comparison of two elements. Each element y other than the median must at some point during the algorithm be compared with an element x for which it is known that x >= m with the result y > x, or with an element x' for which it is known that x' <= m with the result y < x'. Such a comparison is called crucial for y. When two elements x and y are compared, the adversary tries to place x and y on either side of m, which makes the comparison non-crucial. Because (n - 1) / 2 elements can be placed on each side, the adversary can create at least (n - 1) / 2 non-crucial comparisons. Because in addition there must be at least n - 1 crucial comparisons, it follows that any algorithm must perform at least 3 * (n - 1) / 2 comparisons. For n = 3, 5, 7, 9 this gives lower bounds of 3, 6, 9, 12, which is better than the previous result, but still not close to being tight.
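
The two lower bounds can be evaluated for small odd n with a few lines of code. The following is only a sketch for checking the numbers quoted above; the class and method names are chosen here for illustration and do not come from the course programs.

```java
import java.math.BigInteger;

class SelectionBounds {
    // Binomial coefficient C(n, k), computed exactly.
    static BigInteger binomial(int n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            // After step i the value equals C(n - k + i, i), so the division is exact.
            result = result.multiply(BigInteger.valueOf(n - k + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }

    // Information-theoretic bound round_up(log_2 (n * C(n - 1, (n - 1) / 2))) for odd n.
    static int informationBound(int n) {
        BigInteger arrangements =
            BigInteger.valueOf(n).multiply(binomial(n - 1, (n - 1) / 2));
        // For an integer x >= 1, round_up(log_2 x) is the bit length of x - 1.
        return arrangements.subtract(BigInteger.ONE).bitLength();
    }

    // Adversary bound 3 * (n - 1) / 2 for odd n.
    static int adversaryBound(int n) {
        return 3 * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // Prints "3: 3 3", "5: 5 6", "7: 8 9", "9: 10 12".
        for (int n = 3; n <= 9; n += 2)
            System.out.println(n + ": " + informationBound(n) + " " + adversaryBound(n));
    }
}
```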

Exercises

  1. For finding the splitter in quick sort we can also select three numbers uniformly at random as pre-splitters, and take the median of these three as splitter. This best-of-three technique is frequently applied in practice and often has a positive impact on the sorting time. Show that this technique also in theory appears to be good by proving a reduced upper bound on the expected number of performed comparisons. Hint: suitably redefine the notion of a successful split.

  2. Reformulate the subroutine merge from the merge sort program so that there is no need for allocating extra arrays. In other words: give an in-situ merge routine. Of course it should still run in linear time. Hint: the notion in-situ is here used in the sense that a problem with input size s is solved using s + o(s) memory in total. o(s) can be actually quite large, not necessarily O(1).

  3. Suppose that the splitter s in the deterministic selection algorithm has a very small rank, for example 3/10 * n for the case s = 5. Would it then be a good idea to select another splitter s' from the set of n / (2 * s) medians which are larger than s? Hint: Consider the smallest median s' larger than s. Consider how large the rank of this s' might be.

  4. Add a few lines to the program performing randomized selection so that it also computes the value N = sum_k n_k where n_k is the size of the problem in round k. Run the program a number of times for sufficiently large n with k = n / 2. 50 tests with n = 2,000,000 will do. Mark the running times T and the values of N. Compute the average value of N / n. Make a plot of T as a function of N. Does the correlation of N and T justify the usage of N as cost measure?

  5. Give schedules for sorting 5 numbers with 7 comparisons. Also give a schedule for sorting 7 numbers with 13 comparisons if you dare. It is good to draw a decision tree and to choose at any leaf the comparison that splits the set of remaining possibilities as evenly as possible in two.

  6. Consider whether for small l selection of a median out of l numbers can be performed with fewer comparisons than by sorting them. More concretely: consider the numbers l = 3, 5, 7, 9 and either prove that the selection requires as many comparisons as sorting or give an algorithm solving this problem with fewer comparisons.

  7. In the text it was considered how to reduce the maximum number of comparisons performed by the deterministic selection algorithm by optimizing the choice of l. The disadvantage was that optimal performance requires sorting sets of size 11 with 26 comparisons, which requires much code. There is a refinement which either achieves linear running time while sorting only sets of three elements, or achieves even slightly better performance when sorting large subsets as before. Instead of taking the median of all n / l medians of groups of l elements, we can also take the median of all n / l' weak-medians of groups of l' elements. A weak-median in a set S with n elements is an element for which the rank is guaranteed to lie close to n / 2. For example, for n = 9, a weak-median might be any number with rank 3, 4 or 5. Of course working with weak-medians gives weaker bounds on the number of elements that are split off in each round, but this may be more than outweighed by the fact that determining a weak-median is cheaper. The splitter which is selected as the median of n / 5 medians of 5 elements is also a weak-median: it is guaranteed to have rank between 3 / 10 * n and 7 / 10 * n.

    Study the quality of the weak-medians selected by first determining the medians in all groups of 3 elements and then taking the median from each group of l of these medians. The splitter x is finally selected by determining the median of these n / l' medians, for l' = 3 * l.

  8. Consider a tree T with n nodes. The edges of the tree have weights. We want to preprocess the tree so that afterwards for any pair of nodes u and v, a query dist(u, v), asking for the distance between u and v, can be answered efficiently.

    It should be pointed out that the problem of this question can also be solved by rooting the tree and computing for each node u the sum depth[u] of the weights on the path from the root to u. Using an algorithm for computing the lowest common ancestor w of u and v, the distance between u and v is given by depth[u] + depth[v] - 2 * depth[w]. This idea can be worked out to an algorithm requiring O(n) preprocessing time and storage while the queries can be performed in O(1) time.

  9. Consider an array a[] of n integers, both positive and negative. An interesting problem is to determine the maximum subsequence sum. Define the subsequence-sum function S() by S(l, h) = sum_{l <= i < h} a[i]. The task is to determine a choice of l and h for which S() achieves its maximum value. The trivial algorithm, turning the definition into a threefold nested loop, takes O(n^3) time. It is easy to improve this to O(n^2). A divide-and-conquer algorithm performs better.

  10. The subsequence-sum problem, computing for an array a[] of n integers the value S(l, h) = sum_{l <= i < h} a[i], can trivially be solved in O(h - l) time. However, if many such queries are going to be performed, it may be worthwhile to preprocess the array so that each query can be executed faster. So, we are looking for a data structure which allows to efficiently compute the values S(l, h).

  11. The basic recursive maximum-subsequence-sum algorithm gives a nice application of the divide-and-conquer approach, but it is not optimal. A non-trivial improvement of the O(n^3) algorithm has linear running time. However, even a recursive algorithm can solve the problem in linear time. The idea is to return not only a value of S(), but also some other values, so that S(0, n) can be computed faster once S(0, n / 2) and S(n / 2, n) have been computed. Work out the details of this improved recursive algorithm.





Greedy Algorithms

In this chapter several algorithms are considered, all of which can be classified as greedy. Greedy algorithms are algorithms for optimization problems which make decisions consecutively, at any given time making the decision which at that moment appears best according to some relatively simple criterion. Earlier decisions are never undone. It depends on the problem whether a greedy algorithm leads to an optimum solution or not. For harder problems, for which greedy algorithms do not necessarily find an optimum solution, it is often the case that the greedy solution gives a good approximation in the sense that its quality is guaranteed to be within a constant factor from the optimum.

Minimum Spanning Forests

The first problem is the minimum-spanning-forest problem. The input is an undirected graph G with n nodes and m edges. The edges are weighted: every edge e has a weight w(e). The weight w(G') of a subgraph G' is the sum of the weights of the edges of the subgraph. A tree is a connected graph without cycles. A forest is a collection of trees. For a connected graph, a spanning tree is a subgraph that is a tree and contains all nodes. For a disconnected graph G, a spanning forest is a subgraph consisting of a spanning tree for each connected component of G. A minimum-weight spanning forest is a spanning forest with minimum weight. The task is to find such a minimum-weight forest.

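If the forest should be light, why not start with the lightest edge? This we may repeat. So, the idea is to do the following:

  1. Start with an empty set of edges F.
  2. Consider the edges one by one in order of increasing weight.
  3. Add the considered edge to F if this does not create a cycle in F; otherwise discard it.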

This algorithm, known as Kruskal's algorithm, guarantees that we finally have a forest, because the property that F is a forest is maintained throughout. On the other hand, it is not obvious that the algorithm even gives a spanning forest and that this spanning forest has minimum weight.

Kruskal's MST Algorithm

Lemma: F is a spanning forest.

Proof: The proof goes by contradiction. We assume that there are nodes u and v in the same connected component of the graph which are not connected by a path in F. Let C(u) be the set of nodes reachable from u by edges in F. Consider a path from u to v in the graph. Because v is not in C(u), there must be two nodes u' and v' on this path, connected by an edge (u', v'), with u' in C(u) and v' not in C(u). At some step of the algorithm, the edge (u', v') was considered. The only reason why it may not have been added to F is that u' and v' already were connected by a path in F. But because edges are never removed from F, this implies that u' and v' are connected by a path in F at the end of the algorithm, contradicting the fact that v' is not in C(u). End.

Path from u to v

Lemma: F is a minimum-weight spanning forest.

Proof: For the proof we need the concept of a promising subset. A subset of the edges is called promising (other names are in use as well) when there is a minimum spanning forest containing these edges. The empty set is promising. If the final forest F is promising, then it is a minimum-weight spanning forest, because F is itself a spanning forest and no spanning forest properly contains another one. So, assume that F is not promising. Then, at a certain step of the algorithm, by adding an edge e = (u, v), we passed from a promising subset S to a non-promising subset S + {e}. Let F' be a minimum spanning forest containing S (such a forest exists because S is promising). Because u and v lie in the same component of G, there must be a path in F' from u to v. Let P be the set of edges on this path. Let P' be the subset of P consisting of the edges that do not belong to S. P is not a subset of S, because otherwise e would have closed a cycle and would not have been added. Therefore, P' cannot be empty. All edges in P' were considered after e: an edge of F' considered before e cannot have been rejected, because together with the edges selected before it, which all belong to S and thus to F', it would close a cycle in F'. Because the edges were considered in sorted order, this implies that all edges in P' are at least as heavy as e. Now consider the spanning forest F'', obtained from F' by removing one of the edges e' in P' and adding e instead. F'' still contains S, because e' does not belong to S. The weight of F'' is not larger than the weight of F' and thus, because F' is minimum, so is F''. But this shows that there exists a minimum spanning forest containing S + {e}, in contradiction with the assumption that S + {e} is not promising. End.

Minimum Spanning Tree F'

Now that we know that the algorithm really solves our problem, we can wonder about its efficiency. Typically in an implementation of the above basic idea first all edges are sorted according to their weight. Then, when processing the edges in the order of increasing weights, a union-find structure is used to test whether an edge (u, v) gives a connection between two nodes which were not yet connected by any of the earlier processed edges, or whether it creates a cycle. In the first case, the index of (u, v) is written on a list of selected edges and the components of u and v are unified. The union-find structure is initialized with n isolated nodes.

The time for sorting depends on the keys. In many cases, for example if all edge weights are polynomial in n or when they are uniformly distributed, the sorting can be performed in O(m) time, but in general we can bound it by O(m * log m) = O(m * log n). Entering the selected edges in a list takes O(n). The 2 * m find operations and the at most n union operations take O(m * alpha(m, n)) time, which is negligible in comparison with the time for the sorting. Only in case the sorting can be performed in linear time, this last term dominates the overall time consumption.

The algorithm has been integrated in a running Java program. In this program the weighted graph is represented by three arrays of length m: a triple (srce[i], targ[i], wght[i]) stands for an edge between srce[i] and targ[i] with weight wght[i]. This is a convenient way of representing a weighted undirected graph. The sorting of the edges is performed by quicksort until the subproblems are sufficiently small, then a variant of insertion sort takes over. The sorting is performed by rearranging a copy of the array with weights along with an array of edge indices.
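
The implementation outlined above can be sketched as follows. This is a minimal sketch and not the Java program referred to in the text: it borrows the array names srce, targ and wght, assumes integer weights, sorts an array of edge indices with the library sort instead of the quicksort variant, and uses a simplified union-find (path halving without union by rank), so the O(m * alpha(m, n)) bound is not claimed for it. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KruskalSketch {
    // Returns the representative of the component of u, halving the path on the way.
    static int find(int[] parent, int u) {
        while (parent[u] != u) {
            parent[u] = parent[parent[u]];
            u = parent[u];
        }
        return u;
    }

    // Returns the indices of the selected edges in order of increasing weight.
    static List<Integer> minimumSpanningForest(int n, int[] srce, int[] targ, int[] wght) {
        int m = srce.length;
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) order[i] = i;
        // Sort the edge indices by weight; the arrays themselves stay untouched.
        Arrays.sort(order, (a, b) -> Integer.compare(wght[a], wght[b]));

        int[] parent = new int[n];          // union-find with n isolated nodes
        for (int u = 0; u < n; u++) parent[u] = u;

        List<Integer> selected = new ArrayList<>();
        for (int i : order) {
            int ru = find(parent, srce[i]);
            int rv = find(parent, targ[i]);
            if (ru != rv) {                 // edge i connects two components
                selected.add(i);
                parent[ru] = rv;            // unify the components
            }                               // otherwise edge i would close a cycle
        }
        return selected;
    }

    public static void main(String[] args) {
        int[] srce = {0, 1, 0, 2}, targ = {1, 2, 2, 3}, wght = {1, 2, 3, 4};
        // Edge 2 closes the cycle 0-1-2 and is skipped: prints [0, 1, 3].
        System.out.println(minimumSpanningForest(4, srce, targ, wght));
    }
}
```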

A simplified version of the algorithm can be used for computing a spanning tree for an unweighted graph: an unweighted graph is a graph in which all edges have the same weight. So, the sorting can be left out. All the rest is the same. The first lemma above shows that we really get a spanning forest, which for a connected graph is a spanning tree.

Applying Kruskal's algorithm for computing a spanning tree of an unweighted graph requires O(n + m * alpha(m, n)) time, which is superlinear for m which are just a little bit larger than n. Thus, asymptotically it is slightly less efficient than BFS or DFS. However, for larger m, this algorithm has a much more structured access to the memory than BFS or DFS: the edges are traversed once from the first to the last; unstructured memory access is performed only on the union-find data structure, which requires O(n) memory. If the values of n and m are such that O(n) data fit in the cache but not O(m) data, this is an advantage. In case O(n) data fit in the main memory but not O(m) data, the advantage is much larger and then the variant of Kruskal's algorithm is much faster than the conventional algorithm. Both algorithms have been integrated in a running Java program.

Matching

Now we consider a similar problem. We have a weighted undirected graph as before. A subset of the edges in which no two edges have a common endpoint is called a matching. The task is to select a matching with maximum weight. The analogue of the previous algorithm is the following: start with an empty set M, consider the edges in order of decreasing weight, and add an edge to M if neither of its endpoints is already an endpoint of an edge in M.

Clearly we finally have a matching. Nevertheless, no one has become famous with this algorithm. The reason is that the resulting matching M is not necessarily a maximum-weight matching. The proof is simple: consider a graph with four nodes arranged in a cycle. If the edges are weighted (around the cycle) as 4, 3, 1, 3, then the algorithm will select the edges with weight 4 and 1, for a total weight of 5, while one could also select the two edges with weight 3, for a total weight of 6.

Weighted Matching
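
The counterexample on the cycle with four nodes can be replayed in code. The sketch below assumes the same edge-array representation as for the spanning-forest program (srce, targ, wght) and compares the greedy result with an exhaustive search over all edge subsets, which is of course feasible only for tiny graphs. All names are chosen for this illustration.

```java
import java.util.Arrays;

class GreedyMatchingSketch {
    // Greedy: heaviest edge first, select it if both endpoints are still unmatched.
    static int greedyMatchingWeight(int n, int[] srce, int[] targ, int[] wght) {
        int m = srce.length;
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Integer.compare(wght[b], wght[a]));
        boolean[] matched = new boolean[n];
        int total = 0;
        for (int i : order) {
            if (!matched[srce[i]] && !matched[targ[i]]) {
                matched[srce[i]] = matched[targ[i]] = true;
                total += wght[i];
            }
        }
        return total;
    }

    // Exhaustive search over all 2^m edge subsets; only for very small m.
    static int maximumMatchingWeight(int n, int[] srce, int[] targ, int[] wght) {
        int m = srce.length, best = 0;
        for (int mask = 0; mask < (1 << m); mask++) {
            boolean[] matched = new boolean[n];
            boolean feasible = true;
            int total = 0;
            for (int i = 0; i < m && feasible; i++) {
                if ((mask & (1 << i)) == 0) continue;      // edge i not in this subset
                if (matched[srce[i]] || matched[targ[i]]) {
                    feasible = false;                      // two edges share an endpoint
                } else {
                    matched[srce[i]] = matched[targ[i]] = true;
                    total += wght[i];
                }
            }
            if (feasible) best = Math.max(best, total);
        }
        return best;
    }

    public static void main(String[] args) {
        // Cycle 0-1-2-3-0 with weights 4, 3, 1, 3 around the cycle.
        int[] srce = {0, 1, 2, 3}, targ = {1, 2, 3, 0}, wght = {4, 3, 1, 3};
        System.out.println(greedyMatchingWeight(4, srce, targ, wght));   // prints 5
        System.out.println(maximumMatchingWeight(4, srce, targ, wght));  // prints 6
    }
}
```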

Not even the unweighted matching problem can be solved optimally by a greedy algorithm. The unweighted matching problem consists of finding a matching of maximum cardinality in an unweighted undirected graph. Processing the edges in arbitrary order and selecting any edge which does not violate the matching condition does not necessarily select a maximum-cardinality subset.

Unweighted Matching
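
A path with three edges already shows the effect of the processing order. In the sketch below (names chosen for illustration), the edges are given as pairs of endpoints and are processed in the order in which they appear in the array.

```java
class MaximalMatchingSketch {
    // Processes the edges in the given order and selects every edge
    // whose two endpoints are both still unmatched.
    static int greedyCardinality(int n, int[][] edges) {
        boolean[] matched = new boolean[n];
        int count = 0;
        for (int[] e : edges) {
            if (!matched[e[0]] && !matched[e[1]]) {
                matched[e[0]] = matched[e[1]] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Path 0 - 1 - 2 - 3. Starting with the middle edge blocks both others.
        System.out.println(greedyCardinality(4, new int[][] {{1, 2}, {0, 1}, {2, 3}})); // prints 1
        System.out.println(greedyCardinality(4, new int[][] {{0, 1}, {1, 2}, {2, 3}})); // prints 2
    }
}
```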

General Pattern

Both algorithms presented are examples of greedy algorithms. In general greedy algorithms are characterized by the following points:

For most optimization problems one can easily formulate a greedy algorithm. Because of the testing, we are guaranteed to get a feasible solution (for example a tree when trying to construct a spanning tree). Though greedy algorithms are easy and fast, they do not always lead to the maximum/minimum value of the objective function. In this context it is quite conventional to call the solution of a greedy algorithm optimal and not optimum and the values maximal/minimal instead of maximum/minimum. So, for the above matching problem 5 is a maximal value (it need not be unique), whereas 6 is the maximum value (there may be several matchings achieving the same value, but the value is unique).

For ranking the possible decisions there are two common strategies:

Notwithstanding the static weights in the case of MST, it might be better not to sort all edges at the start: in the case of general weights this inevitably brings the cost to Omega(m * log n), and we might hope to do better. A good idea is to first determine the m' <= m lightest edges, to sort these and to consider them in order of increasing weight. Then, the other edges are considered and only those which do not create a cycle are retained, sorted and considered in order of increasing weight. Taking m' ~= 5 * n, this considerably reduces the number of edges to sort for many classes of graphs and works fine in practice as well. This idea is worked out more generally and analyzed in one of the exercises.

Directed Semi-Matching

We consider directed weighted graphs. The task is to find a maximum weight subset of the edges so that no two starting points are the same.

A greedy algorithm might first sort all edges according to their weight in decreasing order; then consider all edges in this order, selecting an edge (i, j) if and only if so far no edge (i, j') has been selected. Using an array to mark the matched nodes, the tests can be performed in constant time, so the whole algorithm runs in O(m * log n) time. Effectively, this greedy algorithm selects for every node the outgoing edge which has maximum weight. This observation leads to an even more efficient implementation: for each node the maximum-weight outgoing edge can be selected in linear time, so the whole problem can be solved in O(n + m) time. This solution reaches the optimum, because the choice for some node u has no impact on the choice for any other node. So, in this case the global maximum is obtained as a sum of the local maxima.

More formally the optimality can be proven by contradiction. Denote the weight of an edge e by w(e). Let S be the subset of the edges which maximizes the objective function. That is, for each node x in V there is at most one edge e = (x, y) in S, and W = w(S) = sum_{e in S} w(e) is the maximum achievable for any subset of the edges with this property. Assume there is an edge (u, v) in S and an edge (u, v') in E, so that w(u, v') > w(u, v). Let S' = S - {(u, v)} + {(u, v')}, the set of edges obtained by substituting (u, v') for (u, v) in S. The feasibility of S' follows from that of S, because for all x != u, the outdegree of x in S' is the same as in S, and for x = u, we have replaced (u, v) by (u, v'), leaving the outdegree unchanged even for u. But, W' = w(S') = sum_{e in S'} w(e) = sum_{e in S} w(e) - w(u, v) + w(u, v') = W + w(u, v') - w(u, v) > W, in contradiction with the assumption that S was a maximum solution.

Maximum-Weight Directed Semi-Matching
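
The linear-time solution amounts to one pass over the edges. The following sketch assumes positive integer weights and the edge-array representation with srce[i] as the starting point of edge i; the endpoints of the edges play no role, so they are not passed. The method names are made up for this example.

```java
import java.util.Arrays;

class SemiMatchingSketch {
    // best[u] is the index of a maximum-weight edge leaving u, or -1 if u has no outgoing edge.
    // Runs in O(n + m) time.
    static int[] heaviestOutgoingEdges(int n, int[] srce, int[] wght) {
        int[] best = new int[n];
        Arrays.fill(best, -1);
        for (int i = 0; i < srce.length; i++) {
            int u = srce[i];
            if (best[u] == -1 || wght[i] > wght[best[u]]) best[u] = i;
        }
        return best;
    }

    // Sum of the weights of the selected edges.
    static long totalWeight(int[] best, int[] wght) {
        long total = 0;
        for (int i : best)
            if (i != -1) total += wght[i];
        return total;
    }

    public static void main(String[] args) {
        int[] srce = {0, 0, 1, 2, 2};
        int[] wght = {5, 7, 2, 1, 1};
        int[] best = heaviestOutgoingEdges(4, srce, wght);
        System.out.println(Arrays.toString(best));      // prints [1, 2, 3, -1]
        System.out.println(totalWeight(best, wght));    // prints 10
    }
}
```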

Minimum Spanning Paths Problem

We consider weighted undirected graphs. The task is to find a maximum-cardinality minimum-weight subset of the edges constituting a set of paths. This sounds very much like the minimum-spanning-forest problem. However, it is much harder. Actually this problem is polynomial-time equivalent to the traveling salesman problem (TSP).

A greedy algorithm might proceed as follows: start with the lightest edge. At all times expand the current path with the cheapest edge connected to one of its endpoints. It is easy to find examples where the solution is not optimum. The solution found does not even need to maximize the number of selected edges. An alternative approach, more like Kruskal's MST algorithm, is not optimal either: first sort the edges according to their weights, and then consider them one-by-one in the sorted order. An edge is selected if each of its endpoints has at most one other selected incident edge and the edge does not close a cycle.

Minimum Spanning Paths

Coloring with Rectangles

Suppose we want to specify which pixels of a two-dimensional picture are white and black and suppose that this is done by telling which rectangles of pixels are black and which are white. Each pixel must be covered by at least one of the rectangles. If a pixel is covered by several rectangles, the last one determines its color. The task is to find a minimum coloring according to the above rules. The motivation is that each specified rectangle corresponds to a call to a coloring procedure and the fewer of these calls there are, the faster the coloring can be performed and the smoother an evolving image can be rendered.

It appears to be very hard to determine an optimum solution for the described optimization problem. The corresponding decision problem, answering the question whether there is a correct coloring by a sequence of at most m rectangles is in NP, because for any specified solution it can easily be checked that it is correct.

For an n x n square, a greedy algorithm solves the problem as follows: repeatedly select the rectangle that gives the largest increase of the number of correctly colored pixels. If we consider all pixels which have not yet been colored at all to be wrongly colored, then the first operation is to color the whole square white or black depending on whether there are more white or more black pixels in the picture that must be established. Until all pixels are correct, it is always possible to increase the number of correctly colored pixels by one, by taking a 1 x 1 rectangle. A rectangle is defined by its upper-left and lower-right corner. So, there are fewer than n^4 rectangles to consider in each round and the algorithm terminates after at most n^2 rounds. This is not particularly efficient, but at least it is polynomial.

Greedy Coloring by Sequence of Rectangles
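
The greedy rule can be tried out on small pictures with the brute-force sketch below. It is written for clarity, not speed: every rectangle and both colors are tested in each round and the gain is recounted pixel by pixel, which costs O(n^6) time per round. Ties are broken in favor of the rectangle found first; this choice, like all names in the code, is an assumption of this example.

```java
class RectangleColoringSketch {
    // target[r][c] is true for black and false for white; the picture is n x n.
    // Returns the number of rectangles used by the greedy rule.
    static int greedyRectangleCount(boolean[][] target) {
        int n = target.length;
        boolean[][] color = new boolean[n][n];    // current color of each pixel
        boolean[][] painted = new boolean[n][n];  // unpainted pixels count as wrong
        int rounds = 0;
        while (true) {
            int bestGain = 0;
            int[] best = null;                    // {r1, c1, r2, c2, color as 0 or 1}
            for (int r1 = 0; r1 < n; r1++)
            for (int c1 = 0; c1 < n; c1++)
            for (int r2 = r1; r2 < n; r2++)
            for (int c2 = c1; c2 < n; c2++)
            for (int b = 0; b <= 1; b++) {
                int gain = 0;                     // change in number of correct pixels
                for (int r = r1; r <= r2; r++)
                    for (int c = c1; c <= c2; c++) {
                        boolean wasCorrect = painted[r][c] && color[r][c] == target[r][c];
                        boolean willBeCorrect = target[r][c] == (b == 1);
                        gain += (willBeCorrect ? 1 : 0) - (wasCorrect ? 1 : 0);
                    }
                if (gain > bestGain) {
                    bestGain = gain;
                    best = new int[] {r1, c1, r2, c2, b};
                }
            }
            // A wrong pixel always allows a gain of 1, so no gain means all pixels are correct.
            if (best == null) return rounds;
            for (int r = best[0]; r <= best[2]; r++)
                for (int c = best[1]; c <= best[3]; c++) {
                    color[r][c] = best[4] == 1;
                    painted[r][c] = true;
                }
            rounds++;
        }
    }
}
```

For a 3 x 3 picture that is white except for a black center pixel, the sketch first colors the whole square white and then the center black, so it returns 2.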

The quality of the achieved coloring is not obvious, but it is easy to construct examples for which the constructed sequence is longer than necessary. In some rare cases greedy algorithms find optimum solutions, but in general they can be viewed as approximation algorithms. An approximation algorithm is an algorithm which, for an optimization problem (a problem like weighted matching, in which a feasible solution of optimum cost has to be constructed), finds a correct but not necessarily optimum solution.

The approximation ratio is a very important notion in the approximation-algorithm context. In general this ratio is a function r(n) of the input size n. For a minimization problem it is defined by

  r(n) = max_{all inputs I of size <= n} cost_app(I) / cost_opt(I).

Here cost_app(I) denotes the cost of the solution for I found by the approximation algorithm, while cost_opt(I) denotes the (maybe not exactly known) cost of the optimum solution. For maximization problems, the approximation ratio is defined analogously:

  r(n) = max_{all inputs I of size <= n} cost_opt(I) / cost_app(I).

An algorithm with an approximation ratio of k, that is, an algorithm which is guaranteed to find a solution whose cost lies within a factor k from the optimum, is said to be k-optimal. The approximation ratio is used to divide approximation algorithms in categories:

Theoretically the existence or non-existence of a PTAS (polynomial-time approximation scheme) for a problem is an important question. For many problems it has been proven that the existence of a PTAS would imply that NP = P. The first such result came as quite a shock, because it practically implies that there are problems which cannot even be approximated arbitrarily well.

The greedy matching algorithm for unweighted graphs is two-optimal. This can be proven easily. Consider the maximal matching M found by the greedy algorithm and a maximum matching M'. For every edge in M we have two tokens. If for an edge (u, v) in M, there is an edge (u, v') in M', then the first token is laid on (u, v'). If there is an edge (u', v) in M', then the second token is laid on (u', v). The total number of tokens deposited is at most twice the number of edges in M. At the same time we claim that each edge in M' receives at least one token, so that |M'| <= 2 * |M|. Assume that there is an edge (u', v') in M' which receives no token. Then neither u' nor v' is matched by the edges in M, so (u', v') can be added to M without violating the matching condition, which is in contradiction with the fact that M is maximal. In an analogous way it can be shown that even for weighted graphs the greedy matching algorithm is two-optimal: for each matched edge at most two other edges, which are not heavier, cannot be taken.

Matroids

Definitions

We want to formulate a more general framework which allows us to understand better what we are talking about and facilitates showing that a problem can indeed be solved optimally by a greedy algorithm.

A subset system S = (E, I) is a finite set E together with a collection I of subsets of E closed under inclusion. That is, if A in I and A' is a subset of A, then also A' in I. The elements of I are called independent subsets. Given a weight function w() mapping E to the positive numbers, the combinatorial optimization problem associated with S is to find an independent subset A for which sum_{e in A} w(e) is maximized.

An example of a subset system is given by E equal to the set of edges of a graph and I all subsets of the edges which constitute forests. For this subset system, the weight function might be either w(e) = 1 for all e, for unweighted graphs, or w(e) = W - weight(e), where weight(e) gives the weight of the edge e and W is a number larger than max_{e in E} weight(e), so that all w(e) are positive. In the first case the optimization problem constructs a spanning forest, in the second case a minimum spanning forest. Another example is that E is a set of vectors of length n and I all subsets of linearly independent vectors. With w(e) = 1 for all e, the problem is to find a maximum-cardinality independent set of vectors. Even for this case one can imagine weighted versions.

For a subset system, the greedy algorithm can be formulated as follows:

  Set greedySelection(Set E, Set I)             // I is a set of sets
  {
    Set candidates = new Set(E);                // start with candidates = E
    Set selection  = new Set();                 // start with selection = {}
    while (candidates.notEmpty())               // any untested elements left?
    {
      Element e = candidates.deleteMaxWeight(); // remove a maximum-weight element
      Set enlarged = selection.addElement(e);   // new set: enlarged = selection + {e}
      if (I.isElementOf(enlarged))              // is enlarged independent?
        selection = enlarged;                   // accept e in selection
    }
    return selection;                           // selection is final
  }

This algorithm is a straightforward generalization of the MST algorithm. A subset system is called a matroid if the greedy algorithm finds an optimum solution for every weight function w().

Actually we have one problem for each input, but we typically want to speak about MST as a problem and not MST on a specific graph. We will refer to all these collectively as the MST problem. Also we will more correctly say that a subset system is a matroid when the greedy algorithm solves any instance of it.
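
The greedy procedure for subset systems can be made runnable as follows. This is a sketch under two assumptions that are not in the text: the collection I is not stored explicitly but is given by an independence test (here a Predicate named isIndependent), and the weights are given by a map named weight whose keys are the elements of E. The helper isMatching is included only to have a concrete subset system to try.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class SubsetSystemGreedy {
    // Tests the elements in order of decreasing weight and keeps every element
    // that leaves the selection independent.
    static <T> Set<T> greedySelection(Map<T, Integer> weight, Predicate<Set<T>> isIndependent) {
        List<T> candidates = new ArrayList<>(weight.keySet());
        candidates.sort((a, b) -> Integer.compare(weight.get(b), weight.get(a)));
        Set<T> selection = new HashSet<>();
        for (T e : candidates) {
            selection.add(e);                       // try selection + {e}
            if (!isIndependent.test(selection))
                selection.remove(e);                // reject e again
        }
        return selection;
    }

    // Example independence test: the edges with the given indices form a matching.
    static boolean isMatching(int[] srce, int[] targ, Set<Integer> edges) {
        Set<Integer> used = new HashSet<>();
        for (int i : edges)
            if (!used.add(srce[i]) || !used.add(targ[i])) return false;
        return true;
    }
}
```

Applied to the cycle with edge weights 4, 3, 1, 3 from the section on matching, with isMatching as the independence test, the procedure selects the edges of weight 4 and 1. This is the non-optimum result found before, so the matchings of that graph do not form a matroid.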

Main Theorem

The following theorem is extremely handy for proving that the combinatorial optimization problem associated with a subset system can be tackled by the greedy algorithm:

Theorem: For a subset system M = (E, I), the following statements are equivalent:

  1. M is a matroid.
  2. If S_1 and S_2 are elements of I, and |S_2| = |S_1| + 1, then there is an e in S_2 so that S_1 + e in I.
  3. For any subset A of E, all maximal independent subsets of A have the same cardinality.

Proof: The equivalence is proven by showing that (1) implies (2), (2) implies (3) and (3) implies (1).

Assume that M is a matroid, (1), but that (2) does not hold. So, there are independent sets S_1 and S_2 with |S_1| = p and |S_2| = p + 1 such that S_1 + e is not in I for any element e in S_2 - S_1. Under these assumptions we can define a weight function on the elements for which the greedy algorithm does not find the optimum solution, giving a contradiction with (1). These weights are w(e) = p + 2 for all e in S_1, w(e) = p + 1 for all e in S_2 - S_1 and w(e) = 0 for all other e. The greedy algorithm first picks all elements of S_1. Then it considers the elements from S_2 - S_1, but according to the assumption, none of these can be added. The other elements have weight 0, so they do not improve the result. Thus, the weight of the greedy solution equals p * (p + 2) = p^2 + 2 * p. Taking all elements of S_2 gives a total weight of at least (p + 1) * (p + 1) = p^2 + 2 * p + 1, because each of the |S_2| = p + 1 elements of S_2 has weight p + 1 or more.

Extendibility of Independent Sets

In order to prove that (2) implies (3), we assume that (2) holds but not (3) and derive a contradiction. So, let S_1 and S_2 be two maximal independent subsets of A. Assume |S_1| < |S_2|. Construct a subset S'_2 of S_2 with |S'_2| = |S_1| + 1 by leaving out some elements if necessary. Because M = (E, I) is a subset system, S'_2 is independent. So, (2) can be applied to S_1 and S'_2, implying that there is an element e in S'_2 so that S_1 + e is independent, contradicting the maximality of S_1.

Assuming property (3) we finally prove (1). Again the proof goes by contradiction. So, assume the greedy algorithm picks S, while there is another independent set S' with larger weight. Because both S and S' are maximal solutions, their cardinality is the same according to (3). So, let S = (e_1, ..., e_i) and S' = (e'_1, ..., e'_i). We may assume that the elements are ordered so that w(e_1) >= w(e_2) >= ... >= w(e_i) and w(e'_1) >= w(e'_2) >= ... >= w(e'_i). We claim that w(e_j) >= w(e'_j) for all j, 1 <= j <= i. This claim implies sum_{j = 1}^i w(e_j) >= sum_{j = 1}^i w(e'_j), giving a contradiction with the assumption that the weight of S' is larger than that of S.

The claim is proven by induction. For j = 0, it trivially holds. Now assume the claim holds for all indices up to some j < i. Suppose it does not hold for j + 1. That is, suppose w(e'_{j + 1}) > w(e_{j + 1}). Let A = {e in E | w(e) >= w(e'_{j + 1})}. The set X = {e_1, ..., e_j} is a subset of A, because w(e_k) >= w(e'_k) >= w(e'_{j + 1}) for all k <= j, and it is a maximal independent subset of A. X is maximal, because if there were an element e in A, that is, an element with w(e) >= w(e'_{j + 1}) > w(e_{j + 1}), so that X + e is independent, then the greedy algorithm would have chosen e instead of e_{j + 1} for its next addition. However, X' = {e'_1, ..., e'_j, e'_{j + 1}} is an independent subset of A with larger cardinality, so A has maximal independent subsets of different cardinality, contradicting (3). End.

So far we have encountered the following three matroids: the graphic matroid (the forests of a graph), the matrix matroid (the linearly independent subsets of a set of vectors) and the partition matroid (the edge subsets of a directed graph in which no two edges have the same starting point).

It is a useful exercise to prove that these are matroids by checking that either condition (2) or (3) holds for them. For the matrix matroid, a central theorem from linear algebra states that all maximal independent sets of vectors have the same cardinality, namely the rank of the system. Also for the partition matroid condition (3) can be used: any maximal independent set contains as many elements as there are nodes of non-zero outdegree.

For the graphic matroid, condition (2) is most convenient. Let T be a forest with l edges and T' a forest with l' > l edges, then T can be extended as follows:

  Set enlarge(Set T, Set T')
  // For forests T and T' with |T'| > |T| this 
  // procedure constructs a forest T'' containing 
  // all edges of T with |T''| >= |T'|.
  {
    T'' = T';
    for (all edges e in T)
    {
      T'' += e;
      if (T'' contains a cycle)
      {
        determine an edge e' in T' - T on the cycle;
        T'' -= e';
      }
    }
    return T'';
  }

The only point to check is that on any cycle that may arise there is an edge e' in T' - T. But this is clear because otherwise all edges on the cycle are in T, which contradicts the fact that T is a forest. Thus, finally we find a forest T'' which contains all edges in T. Because the number of elements in T'' is not decreasing, finally we have l'' = |T''| >= |T'| = l' > l. For the special case l' = l + 1, this gives l'' >= l + 1. Hence |T'' - T| > 0, which implies that T'' - T contains at least one edge e in T' so that T + {e}, a subset of the forest T'', is a forest.
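
For experiments it is convenient to be able to find such an edge e directly. The sketch below does not implement the procedure enlarge; it uses a simpler method instead: the components of T are determined with a union-find structure, and then the edges of T' are scanned for one that connects two different components of T. Such an edge exists when |T'| > |T|, because a forest whose edges all lie inside the components of T has at most |T| edges. The forests are given as arrays of node pairs; all names are assumptions of this example.

```java
class ForestExchangeSketch {
    // Returns the representative of the component of u, halving the path on the way.
    static int find(int[] parent, int u) {
        while (parent[u] != u) {
            parent[u] = parent[parent[u]];
            u = parent[u];
        }
        return u;
    }

    // Returns the index of an edge e of tPrime for which t + {e} is a forest,
    // or -1 if there is no such edge.
    static int extendingEdge(int n, int[][] t, int[][] tPrime) {
        int[] parent = new int[n];
        for (int u = 0; u < n; u++) parent[u] = u;
        for (int[] e : t) {                          // build the components of T
            int ru = find(parent, e[0]);
            int rv = find(parent, e[1]);
            parent[ru] = rv;
        }
        for (int i = 0; i < tPrime.length; i++)      // look for an edge between two components
            if (find(parent, tPrime[i][0]) != find(parent, tPrime[i][1])) return i;
        return -1;
    }

    public static void main(String[] args) {
        int[][] t = {{0, 1}, {2, 3}};
        int[][] tPrime = {{0, 1}, {1, 2}, {1, 3}};
        // Edge (0, 1) lies inside a component of T, edge (1, 2) does not: prints 1.
        System.out.println(extendingEdge(4, t, tPrime));
    }
}
```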

Intersection of Matroids

If (E, I) is a subset system and (E, I') is a subset system, then (E, I & I') is also a subset system, where "&" gives the intersection of two sets. This is true, because if A is in I & I' then A is in I and in I'. So, any set A' which is a subset of A is in I and in I' as well because both are subset systems, and thus A' is in I & I'. Unfortunately, if (E, M) and (E, M') are matroids, then in general it is not true that (E, M & M') is a matroid as well. In the following we consider some examples.

For a directed graph G = (V, E), let I be the set of all subsets of the edges so that each node in V is the starting point of at most one edge in the subset. Let I' be the analogous set of subsets so that each node is the endpoint of at most one edge. (E, I) is the partition matroid, and (E, I') is a partition matroid of the same kind, with the roles of starting points and endpoints exchanged. For an undirected bipartite graph in which all edges run between a set V_1 and a set V_2, a directed graph G can be obtained by directing each edge (u_1, u_2) with u_1 in V_1 and u_2 in V_2 from V_1 to V_2. For this graph G the subset system I & I' consists of all matchings of the underlying bipartite graph. Because the maximum-cardinality matching problem cannot be solved in a greedy way, it follows that (E, I & I') cannot be a matroid.

Intersection of Two Partition Matroids

For a directed unweighted graph G = (V, E), let I be the set of subsets so that when disregarding the direction of the edges these constitute a forest. Let I' be the set of all subsets so that each node in V is the endpoint of at most one edge. The intersection of this graphic and partition matroid gives a subset system, the elements of which are known as branchings: acyclic subgraphs without shortcuts in which all nodes have indegree at most one. Constructing a branching establishes some kind of command structure. The task is to find a maximum-cardinality branching. This is not a matroid: greedily considering all edges and accepting those that do not violate the conditions does not always lead to a maximum-cardinality subset.

Branching

The examples make clear that combinatorial optimization problems associated with the intersection of two matroids can, in general, not be solved by a greedy algorithm. Nevertheless, there is a general method for solving these. The method is rather complicated, analogous to the solution of the general matching problem, but it runs in polynomial time. In complexity there is a great difference between the unweighted case, like computing a maximum-cardinality matching, and the weighted case, like computing a maximum-weight matching.

Combinatorial optimization problems associated with the intersection of three matroids are probably considerably harder: the directed Hamiltonian path problem, which is NP-complete, is associated with the intersection of three matroids. The directed Hamiltonian path problem asks whether a given directed graph has a directed path traversing each node of the graph exactly once. For a graph with n nodes, the answer is positive if and only if there is a branching with n - 1 edges in which, in addition, all nodes have outdegree at most one. So, it is associated with the intersection of two partition matroids and one graphic matroid. Therefore, a general polynomial-time algorithm for solving problems associated with the intersection of three matroids would prove P = NP.

Alternative Solutions

The greedy algorithm for MST and weighted semi-matching is fine, but there are alternatives. MST can also be solved with Prim's algorithm, which is similar to Kruskal's algorithm, but for the next edge it only considers edges which originate from nodes which have been reached earlier. So, it maintains a tree at all times. Prim's algorithm can be viewed as a variant of Dijkstra's algorithm for solving the single-source shortest-path problem.

For weighted semi-matching the greedy algorithm considers all edges as elements of a big pool. Sorting or repeatedly selecting the next heaviest element costs O(m * log n) in general. In this case, there is an alternative way to solve the problem, which never takes more than O(n + m) time. The idea is to first create adjacency lists. This can be done by performing a bucket sort on the first index of the edges. This takes O(n + m) time. Then, for each node, the heaviest outgoing edge must be selected. For node i, if the length of its adjacency list is denoted l_i, this takes O(1 + l_i) time. Thus, for some constant c, all maxima can be determined in sum_i c * (1 + l_i) = c * (n + sum_i l_i) = c * (n + m) = O(n + m) time. The problem can be solved even more simply: an array a[] of length n is created and initialized with a[u] = 0, for all 0 <= u < n (assuming that all weights are positive). Then all edges are processed. If w > a[u] for an edge (u, v) with weight w, we set a[u] = w. Finally a[] contains the values we are looking for.
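
The array-based idea can be sketched as follows. This is only an illustration under stated assumptions: the class name SemiMatching, the method heaviestOutgoing and the edge encoding {u, v, w} are hypothetical, and all weights are assumed to be positive so that the initial value 0 is smaller than any weight. In addition to the weights, the sketch records which edge achieved the maximum.

```java
import java.util.Arrays;

class SemiMatching {
    // edges[k] = {u, v, w}: a directed edge from u to v with weight w > 0.
    // Returns best[u] = index of a heaviest edge leaving u, or -1 if u has no
    // outgoing edge. One pass over the edges: O(n + m) time.
    static int[] heaviestOutgoing(int n, int[][] edges) {
        int[] a = new int[n];            // a[u] = heaviest weight seen so far, initially 0
        int[] best = new int[n];
        Arrays.fill(best, -1);
        for (int k = 0; k < edges.length; k++) {
            int u = edges[k][0];
            int w = edges[k][2];
            if (w > a[u]) {              // strictly heavier: remember this edge
                a[u] = w;
                best[u] = k;
            }
        }
        return best;
    }
}
```

Ties are resolved in favor of the edge that is processed first, which does not influence the total weight.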

In general: greedy algorithms are mostly very efficient, but the details of the implementation may nevertheless be important.

Customer Scheduling

The common notion of a greedy algorithm is more general than the above. In the following we consider some problems which appear not to fall in the category of a subset system with a weight function on the elements.

A server has to execute n jobs. The jobs require different amounts of time, let t_i, 0 <= i < n, be the time for job i. The task is to find a schedule that minimizes the average completion time, or equivalently the sum of all completion times. Here a schedule is nothing more than a permutation pi(), specifying the order in which the jobs are executed. The completion time of job i is the sum of the times of all jobs that are scheduled before it plus t_i. Let s_i, 0 <= i < n, be the time of the job scheduled after i other jobs have been executed. In terms of the s_i, the objective function has the following value:

P(pi) = s_0 + (s_0 + s_1) + (s_0 + s_1 + s_2) + ... = sum_{0 <= k < n} (n - k) * s_k.

What order to choose? Lightest first scheduling, performing the jobs in order of increasing execution time t_i, is simple and requires only O(n * log n) time to determine.

Lemma: Lightest first scheduling minimizes the average completion time.

Proof: Assume that a schedule pi'() in which the jobs appear in non-sorted order is optimal. Thus, there are two jobs i and j, with t_i > t_j, with pi'(i) < pi'(j). Now consider the schedule pi''() which equals pi'() except for pi''(i) = pi'(j) and pi''(j) = pi'(i). Let x = t_i - t_j. Under pi''(), job i completes at the same time as job j under pi'(). Under pi''(), job j completes x steps earlier than job i under pi'(). Any job k with pi'(i) < pi'(k) < pi'(j) completes x steps earlier under pi''(). For all other jobs the completion time remains the same. So P(pi'') = P(pi') - x * (pi'(j) - pi'(i)) < P(pi'), because x > 0 and pi'(j) - pi'(i) > 0. Thus, pi''() is better than pi'(), contradicting the optimality of pi'(). End.
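
A minimal sketch of lightest first scheduling, for experimenting with the lemma. The class and method names are hypothetical; the job times are assumed to be non-negative integers.

```java
import java.util.Arrays;

class LightestFirst {
    // Sum of all completion times when the jobs are executed in the order
    // s[0], s[1], ...; this equals sum_{0 <= k < n} (n - k) * s[k].
    static long totalCompletionTime(int[] s) {
        long total = 0;
        long clock = 0;
        for (int time : s) {
            clock += time;       // completion time of the current job
            total += clock;
        }
        return total;
    }

    // Value of the objective function for the lightest first schedule.
    static long lightestFirst(int[] t) {
        int[] s = t.clone();     // leave the caller's array untouched
        Arrays.sort(s);          // O(n * log n)
        return totalCompletionTime(s);
    }
}
```

For the times 3, 1, 2 executed in this order the completion times are 3, 4 and 6, giving 13, whereas the sorted order 1, 2, 3 gives 1 + 3 + 6 = 10.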

Making Change

Consider a country, one of the many, with coins and bank notes in denominations 1, 2, 5, 10, 20, 50, etc. The task is to design an algorithm to pay a specified amount using the minimum number of coins and bank notes assuming that all of these are available in sufficient numbers. In the following we will only speak of coins, though certainly the larger ones will be made of paper.

How about a greedy algorithm? That is,

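  // Pays amount greedily with the n denominations coins[0], ..., coins[n - 1],
  // which must be sorted in increasing order. Requires java.util.Stack.
  Stack<Integer> makeChange(int amount, int[] coins, int n) 
  {
    Stack<Integer> stack = new Stack<>();   // Creates an empty stack
    while (n > 0 && amount > 0) 
    {
      n--;
      int coin = coins[n];                  // Largest coin not yet considered
      while (coin <= amount)                // Use it as often as possible
      {
        stack.push(coin);
        amount -= coin;
      }
    }
    if (amount > 0)
      System.out.println("Failed to find a paying schedule !!!");
    return stack;
  }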

Because there are coins of denomination 1, this algorithm will always be able to pay the required amount. It is efficient: if the algorithm composes a pile of c coins, then it takes O(n + c) time, which may well be optimal. The algorithm requires (apart from the output) only O(1) memory. The remaining question is about the quality. Is the number of used coins minimum?

Lemma: For a coin system with denominations x * 10^i, for x in {1, 2, 5}, and all i >= 0, the greedy algorithm is optimal.

Proof: We claim that for paying an amount N < 10 it is optimal to use the following decompositions: --, 1, 2, 2 + 1, 2 + 2, 5, 5 + 1, 5 + 2, 5 + 2 + 1, 5 + 2 + 2. For the amounts that are paid with 0 or 1 coin, N in {0, 1, 2, 5}, this is clear. For the amounts which are paid with two coins, N in {3, 4, 6, 7}, it suffices to verify that there is no single coin of this value. For the amounts which are paid with three coins, N in {8, 9}, it must be checked that these cannot be obtained as the sum of two denominations. These sums are 1 + 1 = 2, 1 + 2 = 3, 1 + 5 = 6, 2 + 2 = 4, 2 + 5 = 7 and 5 + 5 = 10. The given schedules are the ones constructed by the greedy algorithm, so for N < 10 it is optimal.

Consider now some N >= 10 and assume there is an optimal schedule consisting of n coins with values c_0, ..., c_{n - 1}. Let k be the number of digits in the decimal expansion of N (omitting leading zeroes). Let d_j = sum_{0 <= i < n| 10^j <= c_i < 10^{j + 1}} c_i, for all j, 0 <= j < k. That is, d_j is the contribution of all coins with values from 10^j up to, but not including, 10^{j + 1}. Assume there is a j, 0 <= j < k, with d_j >= 10^{j + 1}. Because the coins in the considered range have values 1 * 10^j, 2 * 10^j or 5 * 10^j, this implies that there must be a subset of these which has value x * 10^j, for some x in {10, 11, 12, 13, 14}. Using coins of values 1, 2 and 5, the minimum number of coins to obtain 10, 11, 12, 13 and 14, is 2, 3, 3, 4 and 4, respectively. Using a single coin of value 10^{j + 1}, at least one coin can be saved, in contradiction with the supposed optimality of the schedule. So, in the following we may assume that any optimal schedule has d_j < 10^{j + 1} for all j, 0 <= j < k. This implies that if N = sum_{0 <= j < k} b_j * 10^j, the contributions b_j * 10^j must be obtained as an optimal combination of values x * 10^j, for x in {1, 2, 5}. This is precisely what the greedy algorithm gives, treating the digits in order of decreasing importance. End.

Now consider a system with denominations x * 10^i, for x in {1, 4, 6} and all i >= 0. The greedy algorithm will pay 8 units as 6 + 1 + 1, which is clearly not as good as 4 + 4. Of course this observation is a good reason to conclude that a system based on 1, 4 and 6 is inferior to a system based on 1, 2 and 5, but odd systems do exist and we would like to construct good schedules even for them. We return to the problem of making change again later on.
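
The difference between the two systems can be checked with a few lines of code. The following sketch only counts the coins used by the greedy algorithm; the class name GreedyChange and the method greedyCount are hypothetical, and the denominations are assumed to be sorted in increasing order with coins[0] = 1.

```java
class GreedyChange {
    // Number of coins the greedy algorithm uses for paying amount.
    // coins[] must be sorted in increasing order and start with 1.
    static int greedyCount(int amount, int[] coins) {
        int count = 0;
        for (int i = coins.length - 1; i >= 0; i--) {
            count += amount / coins[i];   // use the largest coin as often as possible
            amount %= coins[i];           // what remains to be paid
        }
        return count;
    }
}
```

For the amount 8, greedyCount returns 3 in both systems: 5 + 2 + 1 is optimal for denominations 1, 2, 5, but 6 + 1 + 1 is worse than 4 + 4 for denominations 1, 4, 6.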

Exercises

  1. A maximum spanning tree is a spanning tree with maximum weight. Consider whether a greedy algorithm works. Give a proof or a counter example.

  2. The presented greedy algorithm for unweighted matching picked the edges in arbitrary order. We consider a more greedy algorithm that appears to perform better. The degree of an edge (u, v) is defined to be the sum of the degrees of u and v. The edges are sorted in increasing order according to their degrees, and then processed in this order. Either provide a graph for which this approach does not find a maximum matching or prove that it is optimal.

  3. Consider a set of n vectors. The weight of vector i (for example its Euclidean length), 0 <= i < n, is given by l_i. The problem is to select a maximum-weight set of linearly independent vectors, that is, a set S of vectors, so that L = sum_{i in S} l_i is maximized. Formulate a greedy algorithm and consider whether it always finds an optimum solution.

  4. We consider again the suggested improvement of the greedy algorithm for the MST problem. This idea is not limited to the MST problem: for any matroid, one might perform the following steps (assuming that the matroid is a maximization problem):
    1. Select, with a linear time selection algorithm, the element s with rank x * n, for some suitable x. Here the rank is defined by the weight function w().
    2. Traverse all elements in E and split the set of elements E in subsets E_> = {e in E| w(e) >= w(s)} and E_< = {e in E| w(e) < w(s)}.
    3. Sort the elements in E_> according to their weight in decreasing order.
    4. Perform the greedy algorithm for all elements in E_>. The set of selected elements is denoted by S.
    5. Traverse all elements in E_< and construct the set E'_< = {e in E_<| S + e in I}.
    6. Sort the elements in E'_< according to their weight in decreasing order.
    7. Perform the greedy algorithm for all elements in E'_<.

    After step 5, there is no need to further consider the elements in E''_< = E_< - E'_<. The elements in E''_< are said to be filtered out. These elements contribute to the cost only during the selection of s in step 1 and during the filtering in step 5, but they do not take part in a sorting operation. The total time consumption of this algorithm is given by O(n + m) + T_sort(x * n) + T_sort(|E'_<|) + m * T_test + |E'_<| * T_test <= O(n + m) + T_sort(x * n + |E'_<|) + 2 * m * T_test. So, if the 2 * m tests can be performed in O(m) time or slightly more, the time consumption is determined by the time to sort x * n + |E'_<| weights. The simple greedy algorithm for weighted semi-matching takes Theta(m * log n) time. This time consumption is due to the fact that all edges are sorted. In this particular case, we have seen how to solve the problem in O(n + m) time for all graphs. However, due to its simplicity, this problem is nevertheless interesting for studying further.

  5. Consider the weighted matching problem on graphs. Prove that the greedy algorithm is 2-optimal. Give examples showing that this bound is tight. In other words, give examples for which the greedy algorithm constructs a matching which is only half as heavy as possible.

  6. The time consumption of the suggested greedy algorithm for the black-and-white coloring problem depends on the time for determining the rectangle which gives the largest increase of the number of correctly colored pixels. This is an instance of a two-dimensional variant of the maximum-subsequence-sum problem, which we will call the maximum-submatrix-sum problem and which is of independent interest. The input to this problem is an n x n matrix a[][] with positive and negative values. Define the submatrix-sum function S() by S(x_l, y_l, x_h, y_h) = sum_{x_l <= i < x_h, y_l <= j < y_h} a[i][j]. The task is to determine a choice of x_l, y_l, x_h and y_h for which S() achieves its maximum value.

  7. We consider special cases of the maximum-subsequence-sum and maximum-submatrix-sum problem. All positive numbers have value 1, all negative numbers have value -infinity. In this case a subsequence or submatrix for which the sum of all values is positive cannot contain any negative values, because a single of these outweighs all positive values. The subsequence problem can be solved trivially by traversing the array once. The submatrix problem is algorithmically more interesting. Applications of the submatrix problem arise whenever a fault-free rectangle with maximum area should be cut out of an n x n slice of raw material. Therefore it will be called the maximum-perfect-rectangle problem. A three-dimensional variant of the problem, which may be called the maximum-perfect-block problem, has even more applications. The value of a large flawless diamond is much higher than that of several smaller diamonds even if their total weight is larger. So, starting with a raw imperfect diamond, the main task is to find a cutting that maximizes the size of the largest resulting flawless diamond.

  8. Give an example of a black and white picture for which the greedy coloring algorithm does not construct a minimum-length sequence of rectangles. Show that the approximation ratio of this greedy algorithm can be arbitrarily poor. That is, show that there is a class of inputs with increasing sizes, for which the performance ratio goes to infinity.

  9. Describe an optimal algorithm for the black-and-white coloring problem for an n x n picture. Express the complexity of your algorithm in terms of n and the number of needed coloring operations r.

  10. Let E be a finite set. Let C = {S_1, ..., S_m} be a collection of subsets of E. Let T = {e_1, ..., e_t} be a subset of E. T is said to be transversal if there is an injective mapping f: {1, ..., t} -> {1, ..., m} so that e_i in S_f(i) for all i. Show that M = (E, I) with I the set of transversal subsets is a subset system. Show, by using the most suitable of the three equivalent formulations, that this M, with equal weights for all elements, is a matroid. That is, show that a maximum-cardinality transversal subset can be constructed in a greedy way (though it might be hard to test whether a set selection + e belongs to I).

  11. Consider a variant of a matching problem: the nodes have weights and the quantity to maximize is the sum of all weights of all nodes incident on a matched edge. Consider whether this problem can be solved by a greedy algorithm: either indicate why this is not possible or present an algorithm together with a proof of its optimality.

  12. Consider a weighted variant of the customer-scheduling problem. Job i, 0 <= i < n, takes t_i steps and brings a profit of x_i - f_i * c_i, where x_i and f_i are parameters and c_i is the completion time of job i. All jobs must be executed. The task is to maximize the total profit, that is, to find a permutation pi, so that P(pi) = sum_{0 <= i < n} x_i - sum_{0 <= i < n} f_i * sum_{j| pi(j) <= pi(i)} t_j. Because the sum of the x_i is independent of pi(), this is equivalent to minimizing sum_{0 <= i < n} f_i * sum_{j| pi(j) <= pi(i)} t_j. The earlier problem is the special case that f_i = 1, for all 0 <= i < n. Formulate a suitable greedy algorithm and consider whether it is optimal.

  13. This question deals with the problem of determining an optimal coin system. Quality measures are that there should not be too many different denominations; that the greedy algorithm works; that the average required number of coins is small; and that an amount can even be paid with few coins if one denomination is missing. First consider coin systems with denominations 10^i, x * 10^i and y * 10^i, for all i >= 0.

  14. We are used to a decimal coin system, but other choices might be equally good or better. In the following we consider a system with denominations 16^i, x * 16^i and y * 16^i, for all i >= 0. The advantage of such a hexadecimal system is that the number of needed denominations is smaller. For a complete system up to 100,000, a decimal system needs 5 * 3 = 15 denominations, whereas a hexadecimal system needs only 4 * 3 = 12, because 16^4 = 65536, so even if x = 2, there is no need for a coin with value 2 * 16^4. It is to be expected that this comes at the price of needing more coins on average. Let c_10(N) be the minimum number of coins for paying an amount N with the optimal decimal schedule determined in the previous question. Let C_10(100,000) = sum_{0 <= N < 100000} c_10(N). Define c_16(N) and C_16(100,000) analogously.

  15. Give some general conditions under which the problem of making change can be solved optimally in a greedy way. Prove your claims.

  16. Study the arguments in the proof that the greedy algorithm is optimal for a money system with coins of values x * 10^i, for x in {1, 2, 5}.





Approximation and Linear Programming

An approximation algorithm for an optimization problem is a polynomial-time algorithm for a hard, mostly NP-hard, problem, which finds a feasible solution for which it can be proven that the value of the objective function lies within a fixed factor of the optimum value for all inputs. In this latter sense approximation algorithms are more than just heuristics, which may come without any guarantee. For an approximation algorithm A, let f_A(I) denote the value achieved on an instance I, while f_*(I) denotes the optimum value. For a maximization problem, the performance ratio of A is defined to be the maximum over all I of f_*(I) / f_A(I). For a randomized algorithm A, f_A(I) is replaced by the expected achieved value, Exp[f_A(I)]. An algorithm with performance ratio c is also said to be a c-approximation.

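There is a general method for obtaining approximation algorithms for an optimization problem P. It works as follows: formulate P as an integer linear program (ILP); drop the integrality conditions, which gives a linear program (LP); compute an optimum solution S' of the LP; round the variables in S' so that a feasible integral solution S is obtained.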

Many famous optimization problems can be tackled this way, even though it is easy to invent optimization problems which cannot be formulated as an ILP. By estimating the error that is made when rounding the variables in S' the deviation from the optimum may be bounded. The rounding can be performed in several ways, which may be problem specific.

We consider a simple example, which can be interpreted geometrically. Consider the following system of linear inequalities:

  x >= 0,
  y >= 0,
  x + 4 * y <= 16,
  6 * x + 4 * y <= 30,
  2 * x - 5 * y <= 6.
The objective function is f(x, y) = 6 * x + 5 * y. The goal is to find the feasible solution (x'_0, y'_0) for which f is maximal.

The five inequalities define a closed two-dimensional subset I of R^2. Because it is the intersection of convex subsets, it is convex itself. Moving along any line segment within I, the value of f is either constant, or increases in one direction and decreases in the other. This implies that the maximum of f over I cannot be assumed only in the interior of I: it is also assumed on the boundary. Arguing on, it follows that there must even be a vertex of the polytope on which f assumes its maximum value. For the given example, this reduces the problem to checking the value of f for the five vertices, which is not much work: (x'_0, y'_0) = (2.8, 3.3), with f(x'_0, y'_0) = 33.3.
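
The vertex argument can be turned into a brute-force check for this small example. The following sketch intersects every pair of boundary lines, keeps the intersection points that satisfy all five inequalities, and evaluates f there. It is only meant for tiny two-dimensional systems and is not one of the LP algorithms discussed in this text; the class name TinyLp, the row encoding and the tolerance EPS are assumptions made for the illustration.

```java
class TinyLp {
    // Each row {a, b, c} encodes the inequality a * x + b * y <= c.
    static final double[][] ROWS = {
        {-1, 0, 0},      // x >= 0
        {0, -1, 0},      // y >= 0
        {1, 4, 16},
        {6, 4, 30},
        {2, -5, 6}
    };
    static final double EPS = 1e-9;   // tolerance for rounding errors

    static boolean feasible(double x, double y) {
        for (double[] r : ROWS)
            if (r[0] * x + r[1] * y > r[2] + EPS) return false;
        return true;
    }

    // Returns {x, y, f(x, y)} for a vertex maximizing f(x, y) = 6 * x + 5 * y.
    static double[] bestVertex() {
        double[] best = null;
        for (int i = 0; i < ROWS.length; i++)
            for (int j = i + 1; j < ROWS.length; j++) {
                double[] p = ROWS[i], q = ROWS[j];
                double det = p[0] * q[1] - p[1] * q[0];
                if (Math.abs(det) < EPS) continue;             // parallel lines
                double x = (p[2] * q[1] - p[1] * q[2]) / det;  // Cramer's rule
                double y = (p[0] * q[2] - p[2] * q[0]) / det;
                if (!feasible(x, y)) continue;                 // not a vertex of I
                double f = 6 * x + 5 * y;
                if (best == null || f > best[2]) best = new double[] {x, y, f};
            }
        return best;
    }
}
```

Up to rounding errors, bestVertex() returns the point (2.8, 3.3) with value 33.3, the intersection of x + 4 * y = 16 and 6 * x + 4 * y = 30.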

If in addition there are conditions that x and y should be integral, then any of the highlighted points is a candidate for giving the maximum value of f. These points have the special property that, starting from the optimum (2.8, 3.3) of the LP, they are non-dominated in the sense that there are no other integral points in I which lie closer to (2.8, 3.3) with both their x- and their y-value. For integral points on one line segment, only the endpoints have to be considered: because f is linear, it can only assume its maximum value in an intermediate point such as (3, 2), which lies between (2, 3) and (4, 1), if the value there is the same as in both endpoints. These non-dominated points in I are also called Pareto optima. The notion of Pareto optimality is named after the Italian economist Vilfredo Pareto (1848-1923). It is widely used in game theory and economics. In the latter context, an allocation of resources is Pareto optimal if there is no way that some individual could be made better off without making any other individual worse off. Checking the integral points in I, we find that f is maximal for (x_0, y_0) = (3, 3), which satisfies all five inequalities, the fourth one with equality because 6 * 3 + 4 * 3 = 30. f(3, 3) = 33, which is 0.3 less than the value achieved by the LP. In this example the integral optimum lies right next to (2.8, 3.3), but in general the optimum of the ILP does not need to lie close to the optimum of the LP in a geometric sense, and the rounding must be done with care: rounding both coordinates down gives (2, 3) with f(2, 3) = 27, which is 6 less than the optimum, whereas rounding both coordinates up gives (3, 4), which is not feasible.

Linear Programming

In its extreme simplicity, the given example expresses many important ideas: for any LP, the maximum, when finite, is (also) assumed on a vertex of the polytope bordering the set of feasible solutions. The algorithms for efficiently solving LPs consist of methods to rapidly converge to this vertex by performing a coordinated walk from vertex to vertex. The classical simplex method rapidly finds a solution for most problem instances, but one may construct pathological inputs for which it visits an exponential number of vertices before reaching the optimum. The more recent ellipsoid method has polynomial running time on all inputs, but is more elaborate and not necessarily better in practice.

The solution of the LP limits the search for the optimum of the ILP to the Pareto optima. For Pareto optima which lie on a line segment only the endpoints of the line segment need to be tested. For some problems this approach makes it possible to find optimum solutions in polynomial time:

However, in general there remain exponentially many points to check. Therefore, the search is limited to a polynomial subset, for example by simply picking the feasible solution which lies closest to the optimum solution of the LP in some sense. If x = (x_0, ..., x_{n - 1}) is the optimum of the LP and x' = (x'_0, ..., x'_{n - 1}) is any feasible solution with |x_i - x'_i| < 1 for all i, 0 <= i < n, then |f(x) - f(x')| < sum_{0 <= i < n} |a_i|, where a_i is the coefficient of the i-th coordinate in the objective function. So, the absolute difference from f(x) can be bounded easily, but the relative difference may be arbitrarily large.
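
The bound on the absolute difference follows from the triangle inequality. As a short derivation, assuming that the objective function is written as f(x) = sum_{0 <= i < n} a_i * x_i and that at least one a_i is non-zero (so that the last inequality is strict):

```latex
|f(x) - f(x')|
  = \Bigl| \sum_{0 \le i < n} a_i \, (x_i - x'_i) \Bigr|
  \le \sum_{0 \le i < n} |a_i| \, |x_i - x'_i|
  < \sum_{0 \le i < n} |a_i| .
```

The bound does not depend on the size of f(x) itself, which is why the relative difference can still be large when f(x) is small.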





Dynamic Programming

Binomial Coefficients

Assume we want to compute the binomial coefficients (n over k) = n! / ((n - k)! * k!). Sometimes Stirling's formula will do, but we may also need the exact value. Computing it directly from the definition requires about n multiplications. Instead of this multiplicative approach we may use the following formulation:

(n over k) = (n - 1 over k - 1) + (n - 1 over k), for 0 < k < n, with (n over 0) = (n over n) = 1.

This immediately suggests a recursive algorithm. The only operation used is addition, so this appears nice. However, the time consumption becomes terrible: if we use N = n + k, then we see that T(N) = T(N - 1) + T(N - 2) + something. This we have seen before!

Just as one should not compute Fibonacci numbers top-down recursively, so should one not compute binomial coefficients recursively. But, of course, just as Fibonacci numbers could also be computed in a bottom-up fashion, so can one do this here: just compute the entries of Pascal's triangle row by row, always saving the last row.
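
A minimal sketch of the row-by-row computation, keeping only the entries 0, ..., k of the current row. The class name Binomial and the method choose are hypothetical; java.math.BigInteger is used because the values quickly exceed the range of the built-in integer types.

```java
import java.math.BigInteger;
import java.util.Arrays;

class Binomial {
    // Computes (n over k) by building Pascal's triangle row by row,
    // storing only the entries 0..k of the most recent row.
    static BigInteger choose(int n, int k) {
        if (k < 0 || k > n) return BigInteger.ZERO;
        BigInteger[] row = new BigInteger[k + 1];
        Arrays.fill(row, BigInteger.ZERO);
        row[0] = BigInteger.ONE;                     // row 0 of the triangle
        for (int i = 1; i <= n; i++)
            // Go from right to left, so that row[j - 1] still holds the
            // value of row i - 1 when row[j] is updated.
            for (int j = Math.min(i, k); j >= 1; j--)
                row[j] = row[j].add(row[j - 1]);
        return row[k];
    }
}
```

For example, choose(12, 5) evaluates to 792. The update from right to left is a design choice that avoids a second array for the previous row.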

Doing this, the whole computation takes Theta(n * k) time and Theta(k) space, which is clearly much better than the recursive algorithm. If we take the size of the numbers into account, then we get the following sharp estimate. (n over k) can be approximated rather accurately by Stirling's formula. This gives

(n over n/4) ~= (n / e)^n / ( (n/4 / e)^{n/4} * (3/4 * n / e)^{3/4 * n}) = 4^{n / 4} * (4 / 3)^{3 / 4 * n} ~= 2^{0.81 * n}.
We know that (n over k) >= (n over n/4) for all n/4 <= k <= 3/4 * n. So, it takes Omega(n^2) bits to write down all the numbers in row n of Pascal's triangle. We also know that (n over k) < 2^n, because sum_k (n over k) = 2^n, so it takes O(n^2) bits to write down all these numbers. Conclusion: the numbers in row n require Theta(n^2) bits to write down. Adding these numbers pair-wise to compute row n + 1 implies that additions involving in total Theta(n^2) bits are performed, which (assuming constant size word length) takes Theta(n^2) time. The total time for evaluating the triangle up to row n becomes sum_{i = 1}^n c * i^2 ~= c/3 * n^3 = Theta(n^3).

Is this even better than the trivial algorithm? Unfortunately, unless k is small, this appears not to be the case: even though n! is a very large number, we may assume that n and k themselves are ordinary integers. So, each multiplication multiplies a large number by a small one and can be performed in a time that is linear in the size of the large number, just as the computation of a sum. Thus, this takes sum_{i = 1}^n log(i!) ~= sum_{i = 1}^n i * log i = Theta(n^2 * log n).

Nevertheless, this algorithm clearly shows the features of dynamic programming:

Top-Down vs. Bottom-Up

The dynamic-programming idea can be worked out in two ways:
Top-down:
starting from the problem to solve, simpler problems are solved recursively. The recursion stops when coming to a basic case or when coming to a problem for which the solution has been computed before and is still available.
Bottom-up:
starting from trivial instances, solutions for more complex instances are built, until eventually the solution for the instance we are trying to solve is constructed. The advantage is that the construction order is fixed and can be planned ahead. The disadvantage is that one may even solve many instances which contribute in no way to the solution of the whole problem.

If a complete table is computed with the bottom-up algorithm, then it can also be used for solving other problems. On the other hand, for the hard problems we are considering, the full table is often far too large, and therefore it is often infeasible to store the complete table. For example, if we are computing (n over k) for large n and k, then one might not want to store the whole table of size Theta(n * k), or even Theta(n^2), but rather only the at most k values from the last row which are needed in order to compute the values of the next row.

For the binomial coefficients, the bottom-up approach is the one we have presented: for determining (n over k), start with (1 over 1), then row-by-row compute all values (n' over k'), with n' <= n and k' <= k. The top-down approach is somewhat better, but still most values have to be computed. Because of the very regular structure, this saving could also have been achieved for a bottom-up algorithm. The following table gives all the values that were computed with the original bottom-up algorithm and (highlighted) those that are computed by the top-down algorithm. All these are really needed for determining the value (12 over 5).

Computing (12 over 5) Using a Table

For many other problems, it is clear that most of all the possible values for smaller problems are not needed, so it would be wasteful to compute all of them. At the same time it may be hard to tell explicitly which values must be computed to find the answer for the whole problem. The top-down approach with a table works fine for such cases. This argument has one flaw: how can we test whether an entry is already computed or not? Basically there are three options:

There are several reasons why the idea with initialized values may work well in practice. The first is that assigning a fixed value, for example -1, goes considerably faster than computing a value. In the second place, it may happen that we want to evaluate the function for many values. In that case, it is a very good and practical idea to create a table of sufficient size for the largest problem that will be solved. This table is initialized before the first computation. Then when the computation is performed, we keep track of all the entries that are used in a list (implemented as array). After the routine all these positions are reset. In this way the computation becomes only marginally slower, and the initialization costs can be amortized over several computations.
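
The idea with initialized values and a list of used entries can be sketched as follows for the top-down computation of binomial coefficients. All names and the bound MAX are hypothetical. For binomial coefficients the table entries would remain valid and could simply be kept; the reset is shown only to illustrate how the table can be reused for a different computation without initializing it again.

```java
import java.util.Arrays;

class MemoBinomial {
    static final int MAX = 64;       // assumed bound; all values fit in a long
    static final long[][] table = new long[MAX + 1][MAX + 1];
    // Positions of all table entries written since the last reset.
    static final int[] usedN = new int[(MAX + 1) * (MAX + 1)];
    static final int[] usedK = new int[(MAX + 1) * (MAX + 1)];
    static int used = 0;

    static {
        for (long[] row : table) Arrays.fill(row, -1);   // -1 means "not computed"
    }

    // Top-down computation of (n over k) for 0 <= k <= n <= MAX.
    static long choose(int n, int k) {
        if (k == 0 || k == n) return 1;                  // basic cases
        if (table[n][k] != -1) return table[n][k];       // computed before
        long value = choose(n - 1, k - 1) + choose(n - 1, k);
        table[n][k] = value;
        usedN[used] = n;                                 // remember the position
        usedK[used] = k;
        used++;
        return value;
    }

    // Resets only the entries that were actually used.
    static void reset() {
        while (used > 0) {
            used--;
            table[usedN[used]][usedK[used]] = -1;
        }
    }
}
```

The cost of reset() is proportional to the number of entries that were computed, not to the size of the table.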

Making Change

The problem of making change was considered before. There are coins with several denominations, sufficiently many coins of any denomination, and the task is to pay a specified amount N with the fewest possible coins. For certain sets of coin values, the greedy algorithm, always paying the largest coin whose value does not exceed the remaining amount to pay, achieves the optimum, but for others it does not.

Dynamic Programming is Optimal

The problem with the greedy approach is not the idea to construct the solution coin by coin, but the fact that at any time only one choice is considered. Denote the number of different denominations by n and denomination i by d_i, for 0 <= i < n. The amount to pay is denoted N and c(n, N) gives the minimum number of coins selected from the available denominations to pay this amount. c(n, N) can easily be expressed recursively:
c(n, N) = 1 + min_{0 <= i < n} {c(n, N - d_i)}.
This says nothing more than that a large sum can be paid optimally by starting to pay that coin that leads to a sum that requires the fewest further coins. The recursive expression must be completed by fixing the value of c(n, N) for N <= 0: c(n, 0) = 0, c(n, N) = infinity for N < 0.

This observation immediately leads to algorithms constructing optimum schedules. It is clearly not a good idea to do this recursively without added cleverness: the time would be exponential, similar to Fibonacci (but even worse). The real way of doing it is, of course, dynamic programming. One can do this either bottom-up or top-down, as discussed above. Either way, this requires Theta(n * N) time in the worst case, because the first argument of c() never changes, so that the table has only N + 1 entries, each of which takes Theta(n) time to compute.
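
A bottom-up sketch of the recursive expression for c(n, N). The class name ChangeDp, the method minCoins and the constant INF, which stands in for infinity, are hypothetical. Because the first argument of c() never changes, the table is indexed by the amount only.

```java
import java.util.Arrays;

class ChangeDp {
    static final int INF = Integer.MAX_VALUE / 2;   // stands for "infinity"

    // Minimum number of coins with denominations d[] for paying amount,
    // or INF if the amount cannot be paid at all.
    static int minCoins(int[] d, int amount) {
        int[] c = new int[amount + 1];
        Arrays.fill(c, INF);
        c[0] = 0;                                   // c(n, 0) = 0
        for (int m = 1; m <= amount; m++)
            for (int coin : d)
                // Amounts m - coin < 0 correspond to c = infinity and are skipped.
                if (coin <= m && c[m - coin] + 1 < c[m])
                    c[m] = c[m - coin] + 1;
        return c[amount];
    }
}
```

For denominations 1, 4, 6 the amount 8 is paid with two coins (4 + 4), one fewer than the greedy algorithm uses.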

There is an alternative way of finding the schedule. It is based on the observation that it does not matter in which order the coins are paid out. So, for example, we can pay the coins in order of decreasing denomination. This is just as in the greedy algorithm, the difference being that the largest used coin does not need to be the largest possible coin. Thus, we pay 78 as 60 + 10 + 4 + 4 and not as 4 + 10 + 60 + 4. But the greedy algorithm would pay 60 + 10 + 6 + 1 + 1. This implies that at any stage of constructing the paying schedule, we must decide whether to use the largest remaining coin or never to use it again. Defining c(i, N) to be the minimum number of coins for paying an amount N using only the i smallest coins, those with values d_0, ..., d_{i - 1}, we get

c(i, N) = min{1 + c(i, N - d_{i - 1}), c(i - 1, N)}
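
As an illustration, the recurrence can be evaluated bottom-up as in the following sketch. It is not the program referred to in this text; the class name MakingChange and the constant INF are assumptions made for the example. The boundary values c(i, 0) = 0 and c(0, N) = infinity for N > 0 are filled in explicitly, and all denominations are assumed to be positive.

```java
class MakingChange {
  static final int INF = Integer.MAX_VALUE / 2;  // stands for "cannot be paid"

  // d[] holds the denominations; returns c(n, N), or INF if N cannot be paid.
  static int minCoins(int[] d, int N) {
    int n = d.length;
    int[][] c = new int[n + 1][N + 1];           // c[i][0] = 0 by default
    for (int a = 1; a <= N; a++) c[0][a] = INF;  // without coins only 0 is payable
    for (int i = 1; i <= n; i++)
      for (int a = 1; a <= N; a++) {
        int skip = c[i - 1][a];                              // never use d[i - 1]
        int use = a >= d[i - 1] ? 1 + c[i][a - d[i - 1]] : INF;  // pay d[i - 1] once more
        c[i][a] = Math.min(INF, Math.min(use, skip));
      }
    return c[n][N];
  }
}
```

For the system 1, 4, 6, 10 this gives minCoins(d, 18) = 3, in agreement with the schedule 10 + 4 + 4.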

How much time does this cost? Even after this improvement many values must be computed. In general one cannot conclude anything better than Theta(n * N). It is important to notice that Theta(n * N) is really not a very good result, because N may be a very large number. Expressed in terms of the size of the input, n * N may even be exponential: in many cases the input has size O(n * log N) (there is no need to have coins with values larger than N). If n = O(log N), which for example happens in the common case that there is a constant number of coins for each factor 10, then O(n * log N) = O(log^2 N).

Table for Paying 18 in a System With 1, 4, 6, 10

In the special case that the system is based on a smaller set of n' denominations multiplied by powers of 10, we can be sure, because we are considering the denominations in decreasing order, that for a denomination x * 10^i, at worst we will consider all N - j * 10^i, for 0 <= j <= N / 10^i. So, in that case, the total number of positions that might have to be evaluated is bounded by n' * N + n' * N / 10 + ... <= 10/9 * n' * N <= 10 * N = O(N), using n' <= 9. This is a real asymptotic improvement, albeit not a decisive one, because N is typically the large number, not n.

The above ideas have been implemented in a program. As was to be expected, for large N the computation is not particularly fast. In addition, due to the extreme memory consumption, it cannot be used for really large N. On a 2.66 GHz PC with 2 GB main memory, the maximum is about 12,345,678, for which the bottom-up computation takes 3.4 seconds and the top-down computation 2.0 seconds. Apparently the smaller number of values to compute outweighs the extra cost due to a less structured memory access, extra testing and overhead due to recursion.

Once the table has been computed, schedules can be computed much faster: for every position we store which move lies on the optimal path: e.g. 1 means "pay this coin and continue at c(i, N - d_{i - 1})", while 0 means "continue at c(i - 1, N)". In some cases both alternatives lead to optimal schedules, for example 18 can be paid as 10 + 4 + 4 or as 6 + 6 + 6, but there is no need to express this. Given the table, computing the schedule for N takes O(n + c(n, N)) time. Hereafter the storage is reduced to n * N bits.
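
A minimal sketch of this reconstruction is given below. The names ChangeSchedule, schedule and pay are hypothetical, and for simplicity the sketch recomputes the table of c-values itself instead of keeping only the bits. In case of a tie it records 0, so for 18 it may return 6 + 6 + 6 rather than 10 + 4 + 4.

```java
import java.util.ArrayList;
import java.util.List;

class ChangeSchedule {
  static final int INF = Integer.MAX_VALUE / 2;

  // Returns the coins of one optimal schedule for N, or null if N cannot be
  // paid. If d[] is increasing, the coins are listed largest first.
  static List<Integer> schedule(int[] d, int N) {
    int n = d.length;
    int[][] c = new int[n + 1][N + 1];
    boolean[][] pay = new boolean[n + 1][N + 1];  // the one bit per position
    for (int a = 1; a <= N; a++) c[0][a] = INF;
    for (int i = 1; i <= n; i++)
      for (int a = 1; a <= N; a++) {
        int skip = c[i - 1][a];
        int use = a >= d[i - 1] ? Math.min(INF, 1 + c[i][a - d[i - 1]]) : INF;
        pay[i][a] = use < skip;                   // true: "pay this coin"
        c[i][a] = Math.min(use, skip);
      }
    if (c[n][N] >= INF) return null;
    List<Integer> coins = new ArrayList<>();
    int i = n, a = N;
    while (a > 0) {                               // O(n + c(n, N)) steps
      if (pay[i][a]) { coins.add(d[i - 1]); a -= d[i - 1]; }
      else i--;                                   // continue at c(i - 1, a)
    }
    return coins;
  }
}
```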

Schedules in a System With 1, 4, 6, 10

Dynamic Programming is Not Efficient

The time O(n * N) to find a schedule for an amount N in a system with n different denominations is really bad. If for example N = 12,345,678, with the 1, 4, 6 system we considered before, then n = 22. So, the table has about 3 * 10^8 entries. This takes much memory and time, even on a very fast computer. Before we mentioned that theoretically O(n * N) does not even need to be polynomial in the size of the input which is O(n * log N).

In this case, but not for all problems of this kind, there is a much simpler solution, which is even much more efficient. The problem of computing an optimal paying schedule can also be viewed as a shortest path problem on a directed unweighted graph. The concerned graph has N + 1 nodes and edges from node j, 0 <= j <= N, to all nodes j + d_i, for 0 <= i < n. The path we are looking for is a shortest path from node 0 to node N. Such a path can be found by running a BFS algorithm.

This idea in itself does not lead to a big improvement of the running time. However, because of the regular structure of the graph, there is no need to store it explicitly: the graph can be maintained implicitly. This gives considerable savings in the memory consumption, even though running BFS still requires an array of length N + 1 for marking the visited nodes and their distances from the start node, and a queue which contains all reached but not yet processed nodes. So, the memory consumption is reduced from O(n * N) to O(N).
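
A BFS on the implicit graph might look as follows. This is a sketch under the assumption that all denominations are positive; the class name ChangeBfs is made up for the example and this is not the downloadable program. The neighbors of node j are generated on the fly as j + d_i, so only the distance array and the queue are stored.

```java
import java.util.Arrays;

class ChangeBfs {
  // Returns the minimum number of coins for paying N, or -1 if impossible.
  static int minCoins(int[] d, int N) {
    int[] dist = new int[N + 1];    // dist[j] = -1 means "not yet reached"
    Arrays.fill(dist, -1);
    int[] queue = new int[N + 1];   // every node enters the queue at most once
    int head = 0, tail = 0;
    dist[0] = 0;
    queue[tail++] = 0;
    while (head < tail) {
      int j = queue[head++];
      if (j == N) return dist[j];
      for (int coin : d) {          // implicit edges (j, j + coin)
        int next = j + coin;
        if (next <= N && dist[next] < 0) {
          dist[next] = dist[j] + 1;
          queue[tail++] = next;
        }
      }
    }
    return -1;
  }
}
```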

For the s-t shortest path problem, the problem of finding a shortest path from node s to node t, there is a heuristic, which is provably effective for random graphs: on a graph with n nodes, the expected number of visited nodes is reduced from something close to n to about O(sqrt(n)). This heuristic turns out to be quite effective for the graph we are considering as well and leads to a considerable reduction of the running time.

The class of random graphs G_{n, p} consists of graphs with n nodes. Each of the edges is present independently of any other edges with probability p. So, the expected degree of any node is p * n. It is even true, that for any subset S of size s, the expected number of neighbors is p * n * s, and more interestingly, the expected number of neighbors that do not lie in S itself is p * n * s * (1 - s / n). This implies that the cardinality s_d of the set S_d(i) of nodes at distance d from node i is approximately (n * p)^d as long as s_d is only a small fraction of n. The probability that two randomly picked sets of size x from a set of in total n elements are non-intersecting is given by (1 - x / n)^x <= e^{- x^2 / n}. So, an intersection is likely for all x >= sqrt(n), and for x = sqrt(n * log n) there is an intersection with high probability. This implies that for two nodes i and j, S_d(i) and S_d(j) intersect with high probability as soon as d is so large that s_d >= sqrt(n * log n).

The above considerations immediately lead to an efficient algorithm for finding s-t paths on random graphs. The idea is to search both from s and from t and to stop as soon as there is a node that is reached from both sides. One has to be slightly careful to avoid off-by-one errors: the algorithm should first process all nodes at a given distance d from s, taken from the queue of nodes reached from s, before processing all nodes at distance d from t. Then it proceeds with the nodes at distance d + 1 from s, and so on. The rest is easy. Of course there is slightly more overhead than in a simple BFS, but this costs at most a factor of two. The saving is enormous: from considering an expected number of n edges before finding t from s, this expected number is now down to O(sqrt(n)).
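
Applied to the money graph, the two-sided search can be sketched as below. The class and method names are hypothetical and this is not the downloadable program. The search from s = 0 follows edges j -> j + d_i, the search from t = N follows them backwards, and the two sides alternate one complete distance level at a time. Because complete levels alternate, the first node found to be reached from both sides already gives the length of a shortest path.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

class BidirectionalChange {
  // Returns the minimum number of coins for paying N, or -1 if impossible.
  static int minCoins(int[] d, int N) {
    if (N == 0) return 0;
    int[] distS = new int[N + 1];   // distances from node 0
    int[] distT = new int[N + 1];   // distances to node N
    Arrays.fill(distS, -1);
    Arrays.fill(distT, -1);
    ArrayDeque<Integer> qS = new ArrayDeque<>();
    ArrayDeque<Integer> qT = new ArrayDeque<>();
    distS[0] = 0; qS.add(0);
    distT[N] = 0; qT.add(N);
    while (!qS.isEmpty() && !qT.isEmpty()) {
      int r = expandLevel(qS, distS, distT, d, N, +1);
      if (r >= 0) return r;
      r = expandLevel(qT, distT, distS, d, N, -1);
      if (r >= 0) return r;
    }
    return -1;                      // one side ran out of nodes
  }

  // Processes all nodes of the current distance level of one side.
  // Returns the path length if a node known to the other side is reached.
  static int expandLevel(ArrayDeque<Integer> q, int[] mine, int[] other,
                         int[] d, int N, int sign) {
    int levelSize = q.size();
    for (int k = 0; k < levelSize; k++) {
      int u = q.poll();
      for (int coin : d) {
        int v = u + sign * coin;
        if (v < 0 || v > N || mine[v] >= 0) continue;
        mine[v] = mine[u] + 1;
        if (other[v] >= 0) return mine[v] + other[v];
        q.add(v);
      }
    }
    return -1;
  }
}
```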

More generally, this heuristic is effective for all graphs that sufficiently behave like a random graph. There are two properties a graph must have in order for the heuristic to be effective:

The first property implies that there are not too many edges to already visited nodes to consider. The second implies that only a small fraction of the nodes must be visited. These properties are not satisfied for all types of graphs. For example for the graph given by a road map, most edges will be local. Taking two cities far apart, will probably require that a constant fraction of the whole graph is traversed.

In our case the situation is better: the values d_i may be assumed to range from small to large. Because of these large values the graph does not have too much of a local character. Applying the BFS algorithm on a 2.66 GHz Pentium IV processor to the 1-4-6 money graph for N = 12,345,678 computes the distance for 8,181,430 nodes and tests 179,991,460 edges which takes 6.7 seconds. The heuristic starting from both sides computes the distance for 136,282 nodes and tests 2,998,204 edges which takes 0.23 seconds. The program can be downloaded here.

Further Improvements

Considerable savings can also be achieved without completely changing the original approach. So, we apply dynamic programming based on the recursive formulation c(i, N) = min{1 + c(i, N - d_{i - 1}), c(i - 1, N)} but do not solve the second subproblem if it can be estimated that the returned value will be of no interest anyway. This idea is most effective if the values d_i are sorted so that d_i > d_{i - 1}, for all i, 1 <= i < n. For any regular money system, the values can be generated in such an order; otherwise the numbers must be sorted, which takes O(n * log n) time. In that case it follows that
  c(i - 1, N) >= min_{0 <= j < i - 1} {round_up(N / d_j)} 
               = round_up(N / d_{i - 2}). 
So, if for c(i, N) we have already computed c(i, N - d_{i - 1}) and find that 1 + c(i, N - d_{i - 1}) <= round_up(N / d_{i - 2}), then we know that c(i, N) = min{1 + c(i, N - d_{i - 1}), c(i - 1, N)} = 1 + c(i, N - d_{i - 1}). Thus, there is no need to compute c(i - 1, N).
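
The test can be added to a top-down computation as in the following sketch. It is an illustration only: the class name PrunedChange is hypothetical, and a java.util.HashMap is used as the set of computed values, which is an assumption of this sketch and not the data structure of the course program. The denominations are assumed to be strictly increasing. Since the recursion depth grows with N / d_{i - 1}, the sketch is meant for moderate N.

```java
import java.util.HashMap;
import java.util.Map;

class PrunedChange {
  static final int INF = Integer.MAX_VALUE / 2;
  private final int[] d;                          // strictly increasing
  private final Map<Long, Integer> memo = new HashMap<>();

  PrunedChange(int[] d) { this.d = d; }

  int computedValues() { return memo.size(); }

  // c(i, N): minimum number of coins for N using d[0], ..., d[i - 1].
  int c(int i, int N) {
    if (N == 0) return 0;
    if (i == 0) return INF;
    long key = ((long) i << 32) | N;
    Integer known = memo.get(key);
    if (known != null) return known;
    int coin = d[i - 1];
    int use = N >= coin ? Math.min(INF, 1 + c(i, N - coin)) : INF;
    int result = use;
    if (i > 1) {
      int bound = (N + d[i - 2] - 1) / d[i - 2];  // round_up(N / d_{i - 2})
      // Only if "use" exceeds the lower bound can c(i - 1, N) be better.
      if (use > bound) result = Math.min(use, c(i - 1, N));
    }
    memo.put(key, result);
    return result;
  }
}
```

A call such as new PrunedChange(d).c(d.length, N) returns the optimum, and computedValues() then shows how few positions were evaluated.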

Because the denominations often make quite considerable steps and because the greedy algorithm is quite good in most cases, in very many cases this makes it possible to reduce the degree of a node of the recursion tree from two to one. That is, in such nodes there is no branching at all. The process of reducing the degree of a node of a recursion tree is called pruning. Actually this is the first example of a branch-and-bound approach, which is discussed in more detail in a later chapter.

Table for Paying 88 in a System With 1, 4, 6, 10, 40, 60

The pruning technique helps reduce the time consumption, but how about the memory usage? Just adding a test to the recursive procedure does not save memory. Furthermore, all memory must still be initialized, so we can hope to gain at most a constant factor in the time consumption. In such cases it may be profitable to only maintain a conceptual table, actually using some more advanced set data structure.

These ideas have been implemented in a program which can be downloaded here. The program contains both versions: a table-based and a set-based one. The set is implemented as an array of sorted linear lists: there is one linear list for each row of the table. This is asymptotically not the best possible solution, but it is simple and effective. The set supports three operations: insert, isElement and getValue. The memory consumption of this data structure is bounded by O(n + t), where t gives the number of computed values.

The pruning turns out to be extremely effective. For N = 12,345,678 only 374 table values have to be computed, compared with the 15,089,135 values computed by the version without pruning. This also has a positive effect on the time consumption: the table-based version requires 1.4 seconds, most of which is spent initializing the table. The set-based version is therefore much faster: it takes only 5 milliseconds! This version is even much faster than the BFS-based algorithms and its memory usage is negligible.

Conclusion

Let us summarize the impressive achievements of this section. We started with a problem which at first glance was intractable. Dynamic programming made it possible to solve it for all small values of n and N. It turned out that for this problem the top-down approach is faster than the bottom-up approach. However, for large N this required a lot of memory and time: the complexity of the algorithm is exponential in the size of its input. An alternative approach, interpreting the problem as a search problem on unweighted graphs, brought down the memory consumption considerably, and using a nice heuristic even the time became much smaller. Adding some more intelligence to the original algorithm, pruning branches which cannot possibly lead to an improved solution, brings even more. It computes very few values, and by replacing the table by a set data structure, the memory consumption becomes proportional to the number of actually computed values.

Principle of Optimality

Dynamic programming can be used quite generally provided that one condition is satisfied: it should be possible to find the global optimum by taking the best small step combined with the best solution of the remaining problem. This is what is called the principle of optimality. It is often trivially satisfied: clearly, if there are n denominations, the best way of paying an amount is to pay a coin of one of them first and then to pay the remainder optimally.

But what if we only have one 20-cent piece? 23 cents can be paid as 20 + 1 + 1 + 1; this is optimal, so N(23) = 4. How many coins do we need for 43? It is no longer true that N(43) = min{N(23) + 1, N(33) + 1, N(38) + 1, N(42) + 1} = 5. We must use two 10-cent coins (if we have them) to pay the remaining 23, so N(43) >= 6 > 5 = N(23) + 1.

In general the principle of optimality cannot be applied (or only with extra care) if there is a shared resource. Such a shared resource can also be found in the following problem: "Find the longest simple path from node u to node v in a graph". Here a simple path is a path visiting any node only once. Clearly, this problem cannot be split up: solving a subproblem does not take into account that one is not allowed to visit the nodes that are visited in the other subproblem.

Knapsack

Knapsack is a classical optimization problem and one of the best known. It is NP-hard. Nevertheless we will here present an apparently polynomial-time solution for it. We come back to this point after we have seen it.

There are n objects, each with a value v_i and a weight w_i. Unfortunately we can only carry objects with a total weight not exceeding W. Which objects to take along and which ones to leave behind? The choice of objects can be formulated using a 0-1 function x on the objects: for all i, 0 <= i < n, x_i = 0 indicates that object i is not taken, while x_i = 1 indicates that it is taken along. So, expressed mathematically, the task is to determine the values of the x_i so that

V(x) = sum_{0 <= i < n} x_i * v_i
is maximized under condition that
W(x) = sum_{0 <= i < n} x_i * w_i <= W.
Simply trying all possible choices, computing the maximum achieved value over all feasible solutions, is simple but very work intensive: there are 2^n choices of the x-values, and each test takes O(n) time. So, such an exhaustive search takes O(n * 2^n) time. This method makes it possible to solve the problem for n up to about 30.

A greedy solution, maybe the one you intuitively apply yourself when packing for a trip, is to pick the objects in order of decreasing value per weight. This leads to a simple greedy algorithm:

  int greedyKnapsack(int[] w, int[] v, int n, int W)
  // Computes the sum of the values of the objects that are 
  // taken along when applying a greedy selection strategy.
  {
    double[] a = new double[n]; // a[] gives qualities
    int[]    b = new int[n];    // b[] gives indices
    for (int i = 0; i < n; i++)
    {
      a[i] = (double) v[i] / w[i];
      b[i] = i;
    }
    Sort.invSort(a, b, n); // sorts a[] in decreasing order, permuting b[] along with it
    int V = 0;
    int X = 0;
    for (int i = 0; i < n; i++)
      if (X + w[b[i]] <= W)
      {
        V += v[b[i]];
        X += w[b[i]];
      }
    return V;
  }

Is it good? Yes, it is good, but in general it is not optimal. As an example we consider a set of six objects with weights 1, 2, 3, 4, 5, 6 and values 1, 8, 10, 10, 19, 25, respectively. W = 10. The greedy algorithm sorts the packets according to their usefulness and considers them in that order. In this way the selected packets are those with weights 6, 2 and 1, giving V = 34. The optimal selection consists of the packets with weights 5, 2 and 3, giving V = 37.

The above formulation of the knapsack problem is an example of an ILP. Replacing the condition x_i in {0, 1} by 0 <= x_i <= 1, for all 0 <= i < n, gives the corresponding LP. This LP can be solved in a greedy way: if the objects are arranged so that v_i / w_i >= v_{i + 1} / w_{i + 1}, for all 0 <= i < n - 1, the optimal choice of the x_i is (1, ..., 1, a, 0, ..., 0), for some a, 0 <= a < 1, and with k ones, for some k so that sum_{0 <= i < k} w_i + a * w_k = W. The optimality of this can be proven by showing that a hypothetical optimal solution with a different choice of the x_i can be improved. This analysis exposes the reason why the greedy algorithm is not optimal for the ILP: after taking the first k packets, some space is left. This space, possibly a large fraction of W, is filled with objects with less value per weight unit.
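
For comparison, the greedy solution of the LP can be sketched as follows. The class name FractionalKnapsack is made up for this illustration, and the sorting is done with the standard library instead of the Sort class used above. For the six objects of the example with W = 10, the objects with weights 6 and 2 are taken entirely and a = 2/5 of the object with weight 5, which gives 25 + 8 + 7.6 = 40.6. This is an upper bound on the optimum 37 of the ILP.

```java
import java.util.Arrays;

class FractionalKnapsack {
  // Optimal value of the LP relaxation: objects may be taken fractionally.
  static double maxValue(int[] w, int[] v, int W) {
    int n = w.length;
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) order[i] = i;
    // Sort the indices by decreasing value per weight unit.
    Arrays.sort(order, (x, y) ->
        Double.compare((double) v[y] / w[y], (double) v[x] / w[x]));
    double value = 0;
    int room = W;
    for (int k = 0; k < n && room > 0; k++) {
      int i = order[k];
      if (w[i] <= room) {             // x_i = 1
        value += v[i];
        room -= w[i];
      } else {                        // x_i = a = room / w_i, then stop
        value += (double) v[i] * room / w[i];
        room = 0;
      }
    }
    return value;
  }
}
```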

A good approach to this problem is using dynamic programming. The approach is very similar to what we have done with the coins. Let V(i, j) be the maximum achievable value of V when the weight limit is j and we are allowed to use only the objects with number less than i. The maximum achievable value we are looking for is given by V(n, W). We are creating an n x (W + 1) table, with rows for i = 1, ..., n and with V(0, j) = 0 for all j, which is filled using the rule

V(i, j) = max{V(i - 1, j - w_{i - 1}) + v_{i - 1}, V(i - 1, j)}.
Here the first alternative is only available if j >= w_{i - 1}. Notice that we incorporated the fact that each object can be taken only once by writing V(i - 1, j - w_{i - 1}) instead of V(i, j - w_{i - 1}). This is clever. There is no need to arrange the items in any specific order; in particular, they do not need to be sorted according to weight or profit per weight unit.
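
A bottom-up version of this rule might look as follows. This is a sketch, not the downloadable program; the class name KnapsackTable is an assumption. Row i of the table decides about the object stored at array index i - 1, and row 0, in which no object is allowed, is left at zero.

```java
class KnapsackTable {
  // Returns V(n, W), the maximum total value with total weight at most W.
  static int maxValue(int[] w, int[] v, int W) {
    int n = w.length;
    int[][] V = new int[n + 1][W + 1];   // V[0][j] = 0: no objects allowed
    for (int i = 1; i <= n; i++)
      for (int j = 0; j <= W; j++) {
        V[i][j] = V[i - 1][j];           // leave object i - 1 behind
        if (j >= w[i - 1])               // take it along, at most once
          V[i][j] = Math.max(V[i][j], V[i - 1][j - w[i - 1]] + v[i - 1]);
      }
    return V[n][W];
  }
}
```

For the six objects of the example above and W = 10 this returns 37, whereas the greedy selection achieves 34.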

The values can be computed bottom-up proceeding either row-by-row or column-by-column. Alternatively V(n, W) can be computed top-down, computing only those values that are really required, using the table to prevent computing the same value several times. Whether this is actually faster or not depends on the details of the programming: the computation is less regular, and therefore it is not certain that computing 31 values can be performed faster than computing 66 values. Clearly, the real question is how the saving ratio develops as a function of n and W. In general the savings need not be particularly large: in row n - i, we may have to compute up to 2^i values, so already for quite small i we may have to compute (almost) all values. The greedy approach and the top-down dynamic-programming version have been implemented in a program which can be downloaded here. As long as there are not too many packets it runs extremely fast.

Packing a Knapsack of Size 10

Knapsack Packing Schedules

Is this a polynomial-time solution? It is polynomial in n and W, but the definition of polynomial is that it must be polynomial in the input size. The input consists of O(n) numbers of at most log W bits each. So, the input has size O(n * log W) bits. If W is not polynomial in n, for example W = 2^n, the time is not polynomial in the input size. Such an algorithm is called pseudo-polynomial. Because knapsack is only hard for very large numbers (unlike many other NP-hard problems), it is a member of a slightly less intractable subclass of NP. We will see the problem again later.

Chained Matrix Multiplication

Assume we have n matrices M_i and want to compute the chained product M_1 . ... . M_n. If all matrices are m x m, then there is not so much to do, except for choosing the best routine for computing the products. If the matrices have different sizes, however (assuming they "fit"), it matters in which order the products are computed, that is, how the brackets are put. Matrix product is not commutative, but it is associative, so brackets can be placed any way we like. Let M_i be a d_{i - 1} x d_i matrix. Assuming that we use the conventional product, computing the product of an a x b and a b x c matrix takes a * b * c operations and the result is an a x c matrix. If n = 3, d_0 = 10, d_1 = 40, d_2 = 50 and d_3 = 30, then (M_1 . M_2) . M_3 takes 10 * 40 * 50 + 10 * 50 * 30 = 35000 multiplications. M_1 . (M_2 . M_3) takes 40 * 50 * 30 + 10 * 40 * 30 = 72000 multiplications. The differences can be arbitrarily large.

Chained Matrix Product

So, one can try to compute the best schedule first. Again one can apply greedy approaches. For example, one can always start by multiplying those two matrices that have the longest common side length of all those that are still available. In this way this largest number will show up in only one product, but in the given example this idea does not work. An alternative greedy approach might be to make the cheapest multiplication first. This does not work either. An example is given by 3 matrices with d_0 = 40, d_1 = 40, d_2 = 60 and d_3 = 50. The suggested greedy approach would give 40 * 40 * 60 + 40 * 60 * 50 = 216000, the other schedule gives 40 * 60 * 50 + 40 * 40 * 50 = 200000. Of course one can enumerate all possible schedules, which, considering the enormous time complexity of the product itself, will in general not be a problem. However, even though the precise number of possible ways to compute the product is not obvious, it is clearly exponential, and for n = 100, this idea does not work.

So, we must try to find a better approach. The idea is to consider where to make a cut. Making a cut between M_i and M_{i + 1} means that we somehow compute the products M' = M_1 . ... . M_i and M'' = M_{i + 1} . ... . M_n first, before computing M' . M''. Clearly, there are n - 1 possible cuts to make. Letting T(n) denote the number of possible ways to put the brackets, this gives

T(n) = sum_{i = 1}^{n - 1} T(i) * T(n - i), for all n > 1,
T(1) = 1.
These numbers are called Catalan numbers and grow very fast. T(15) = 2,674,440. More generally, it has been shown that T(n) = Omega(4^n / n^2).
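
The recurrence for T(n) is easily evaluated with a table, as in the following small sketch. The class name Bracketings is made up for the example; with 64-bit integers the result is exact only for moderate n, roughly up to n = 36.

```java
class Bracketings {
  // Number of ways to put the brackets in a product of n >= 1 matrices.
  static long count(int n) {
    long[] T = new long[n + 1];
    T[1] = 1;
    for (int m = 2; m <= n; m++)
      for (int i = 1; i < m; i++)     // cut between M_i and M_{i + 1}
        T[m] += T[i] * T[m - i];
    return T[n];
  }
}
```

This gives count(4) = 5 and count(15) = 2674440.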

Simple recursion is again out of the question, because, analogously to the recursive Fibonacci computation, the time is again at least the number of possible ways to bracket the whole expression. So, dynamic programming might be wise. Create an n x n table, where T(i, j) gives the minimum required number of multiplications for the partial product M_i . ... . M_j. So we are only interested in computing the table entries T(i, j) with j >= i, the values on and above the diagonal. The answer we are looking for is given by T(1, n).

The values in the table must be computed in a somewhat clever way. Which ones are easy? Of course: T(i, i) = 0 for all i, 1 <= i <= n, because there is nothing to multiply for a single matrix. Other trivial values are the T(i, i + 1), which are given by T(i, i + 1) = d_{i - 1} * d_i * d_{i + 1}. In general, the matrix can be filled up diagonal by diagonal, using that

T(i, i + s) = min_{i <= k < i + s} { T(i, k) + T(k + 1, i + s) + d_{i - 1} * d_k * d_{i + s} }.
The expression inside the minimization corresponds to the cost when cutting between M_k and M_{k + 1}.

On diagonal s, there are n - s elements to compute, each being the minimum over s values. So, the computation takes time proportional to sum_{s = 1}^n (n - s) * s = n * sum_{s = 1}^n s - sum_{s = 1}^n s^2 ~= n^3 / 2 - n^3 / 3 = n^3 / 6 = O(n^3). If for each computed value T(i, i + s) we also mark the minimizing k-value, the optimal schedule can be found in O(n) time once the table has been computed, but probably it is more efficient to compute it separately afterwards in O(n^2) time, not slowing down the main computation.
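
The diagonal-by-diagonal computation, including the marking of the minimizing k-values, can be sketched as follows. The class MatrixChain and its method names are hypothetical. The array d has length n + 1 and matrix M_i is d[i - 1] x d[i]; the indices of the table run from 1 to n as in the text.

```java
class MatrixChain {
  // Fills cost[i][j] for 1 <= i <= j <= n and returns the minimizing cuts.
  static int[][] fill(int[] d, long[][] cost) {
    int n = d.length - 1;
    int[][] cut = new int[n + 1][n + 1];
    for (int s = 1; s < n; s++)              // diagonal s; diagonal 0 is all 0
      for (int i = 1; i + s <= n; i++) {
        int j = i + s;
        cost[i][j] = Long.MAX_VALUE;
        for (int k = i; k < j; k++) {        // cut between M_k and M_{k + 1}
          long c = cost[i][k] + cost[k + 1][j] + (long) d[i - 1] * d[k] * d[j];
          if (c < cost[i][j]) { cost[i][j] = c; cut[i][j] = k; }
        }
      }
    return cut;
  }

  static long minMultiplications(int[] d) {
    int n = d.length - 1;
    long[][] cost = new long[n + 1][n + 1];
    fill(d, cost);
    return cost[1][n];
  }

  static String bracketing(int[] d) {
    int n = d.length - 1;
    return build(fill(d, new long[n + 1][n + 1]), 1, n);
  }

  // Follows the marked cuts; visits every matrix once.
  private static String build(int[][] cut, int i, int j) {
    if (i == j) return "M" + i;
    int k = cut[i][j];
    return "(" + build(cut, i, k) + " . " + build(cut, k + 1, j) + ")";
  }
}
```

For d = (10, 40, 50, 30) this yields 35000 multiplications with the bracketing ((M1 . M2) . M3), and for d = (40, 40, 60, 50) it yields 200000.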

Table for Chained Matrix Product

Chained matrix product is one of the nicest examples of the power of dynamic programming: the simple recursive solution takes exponential time while dynamic programming requires polynomial time (which is negligible in comparison to the subsequent matrix-multiplication problem), and we do not, as for Fibonacci numbers and binomial coefficients, know an easy, even more efficient alternative. Furthermore, there is no obvious greedy algorithm which performs almost equally well for most problem instances, as there is for knapsack.

Drawing Trees Compactly

Normally trees are drawn with the root at the top, its children one level lower and so on. Trees drawn in this way typically get very wide. In this section we look at alternative ways of drawing trees, so that they get nicer shape or smaller area in the following sense: The bounding box is the smallest axis-parallel rectangle that can be drawn around a figure. The area of the bounding box can be considered to be the area of a figure (because when drawing it on paper, this is the minimum area of the rectangular sheet). The normal drawing style produces a drawing with Theta(n * log n) area for a perfect binary tree.

Here we restrict ourselves to a very special kind of drawings:

One can study several problems. We will try to compute the minimum area of the bounding box for a given binary tree. We further specify the problem by stating that the tree T comes together with a specified root, and that the root of any subtree should be drawn in its upper-left corner. Furthermore, here we will only consider perfect binary trees with n = 2^k - 1 nodes for k > 0. All these conditions can be relaxed, leading to stepwise more complicated solutions. Even this strongly simplified problem appears to be extremely hard. At first it is not obvious that it can be solved in polynomial time at all. The dynamic-programming solution we present is not trivial, but it has low polynomial complexity.

Compact Tree Drawing

The idea is to work with two matrices. W(k, h) gives the minimum width for such a tree with height bounded by h. Analogously, H(k, w) gives the minimum height for a tree with width bounded by w. Some of these values are easy to determine: W(k, 0) = H(k, 0) = infinity for all k > 0, and W(1, h) = 1 for all h > 0, H(1, w) = 1 for all w > 0. Once all values W(k, h) have been determined, the minimum area A(k) can be computed in linear time as follows:

A(k) = min_{h > 0} {h * W(k, h)}
This takes linear time because W(k, h) reaches its minimum value for a value h that is bounded by n.

Possible Placements

The above restrictions on how a tree can be drawn imply that, for a tree with root v and left and right subtrees rooted at u and w, respectively, there are only two ways to construct a drawing of the whole tree given rectangular drawings of the subtrees: the drawings of the subtrees can either be placed next to each other, or over each other. We define W_hor(k, h) to be the minimum achievable width when only considering the placement of the drawings next to each other. W_ver(k, h) is defined analogously to be the minimum achievable width when only considering the placement of the drawings over each other. H_hor(k, w) and H_ver(k, w) are defined analogously as the minimum achievable heights when only considering the placement next to and over each other, respectively. Clearly

W(k, h) = min{W_hor(k, h), W_ver(k, h)},
H(k, w) = min{H_ver(k, w), H_hor(k, w)}.

Some of these restricted values are easy to express in terms of smaller values:

W_hor(k, h) = W(k - 1, h - 1) + W(k - 1, h),
H_ver(k, w) = H(k - 1, w) + H(k - 1, w - 1).
How can we compute W_ver(k, h) and H_hor(k, w)? Here we use a trick. W_ver(,) can be computed from the values of H_ver(,). Suppose we have H_ver(k, w) = (I, I, I, 12, 8, 6, 5, 5, 5, 4, 4, 4, 4, 4, 4, ...) for w = 0, 1, ... , where I denotes infinity. H_ver(k, 4) = 8 tells us that the optimal vertical placement with width 4 has height 8. Inverting, this implies that with height 8 a width 4 can be achieved, that is W_ver(k, 8) = 4. For height 9, 10 and 11 we cannot do better, but H_ver(k, 3) = 12 tells us that W_ver(k, 12) = 3. In this way we find W_ver(k, h) = (I, I, I, I, 9, 6, 5, 5, 4, 4, 4, 4, 3, 3, 3, ...). It may appear that we are dealing with infinite arrays. However, if we know that H(k - 1, w) reaches its minimum for some value w = w_min, it follows that H_ver(k, w) reaches its minimum value for w = w_min + 1, because then both terms have reached their minimum value. So, everything is finite.
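
The inversion trick amounts to computing, for every height h, the smallest width w with H_ver(k, w) <= h. A possible sketch is given below; the class name StaircaseInverse is hypothetical, infinity is represented by Integer.MAX_VALUE, and the input array is assumed to be non-increasing and to have reached its minimum value at its last entry. Because the pointer w only moves to the left, the whole inversion takes time linear in the lengths of the two arrays.

```java
class StaircaseInverse {
  static final int INF = Integer.MAX_VALUE;

  // H[w] is non-increasing in w. Returns the array with entry h equal to
  // the smallest w such that H[w] <= h, or INF if there is none,
  // for h = 0, ..., maxHeight.
  static int[] invert(int[] H, int maxHeight) {
    int[] result = new int[maxHeight + 1];
    int w = H.length;                  // H.length means "no width found yet"
    for (int h = 0; h <= maxHeight; h++) {
      while (w > 0 && H[w - 1] <= h) w--;   // extend to smaller widths
      result[h] = w < H.length ? w : INF;
    }
    return result;
  }
}
```

Applied to the values of H_ver(k, w) given above, the sketch reproduces the listed values of W_ver(k, h).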

Minimum Hamiltonian Cycle

Definitions

For a graph G = (V, E), a Hamiltonian path is a path visiting all nodes in such a way that no node is reached more than once. If such a path exists, G is said to be Hamiltonian. A similar problem is that of constructing a Hamiltonian cycle, a tour visiting all nodes exactly once. At first glance these problems sound similar to the problems of constructing an Euler path or an Euler tour, a path or tour, respectively, traversing all edges. Such Euler paths or tours can be constructed in O(n + m) time, where n = |V| and m = |E|, and their existence can be tested in O(n) time if the degrees of the nodes are given. In contrast, there is no known polynomial-time algorithm for determining Hamiltonicity: it is one of the most famous NP-complete problems.

In this section the weighted version of the above problems is considered. For a weighted graph G, a minimum Hamiltonian path, is a Hamiltonian path of minimum weight. A minimum Hamiltonian cycle is a Hamiltonian cycle of minimum weight. The problems of finding such a path or cycle will be denoted MHP and MHC, respectively. Without loss of generality, it may be assumed that a weighted graph is complete by, if necessary, completing the graph with edges of infinite weight. A graph is Hamiltonian if and only if the minimum Hamiltonian path of the corresponding complete graph has finite weight.

MHC is also known as the traveling salesman problem, TSP. Slightly different is the Chinese postman problem, CPP, in which a minimum weight tour has to be found visiting all nodes at least once. The unweighted version of CPP is uninteresting, because such a tour exists if and only if the graph is (strongly) connected. The weighted version of CPP is NP-hard. Computationally it is even harder than MHC because there is more freedom. If the triangle inequality holds, CPP is the same problem as TSP, because then the shortest path from a node u to a node v is always given by the edge (u, v), and thus there is no need to revisit a node which has been visited before. All problems can be considered for directed and undirected graphs. Because an undirected graph can be viewed as a directed graph with two oppositely directed and equally-weighted edges for any undirected edge, algorithms for directed graphs are more general. However, for undirected graphs there may be better approaches, not doubling the number of edges and not introducing many equally-weighted edges.

Computationally MHP and MHC are closely related. Using T_MHP(n, m) and T_MHC(n, m), the time consumption for solving MHP and MHC, respectively, on graphs with n nodes and m edges, we have

Lemma: For directed graphs, T_MHP(n, m) <= T_MHC(n + 1, m + 2 * n).

Proof: For a graph G = (V, E) with n nodes, let G' = (V', E') be the graph with n + 1 nodes obtained from G by adding a node u with zero-weight edges to and from all nodes of G. That is, V' = V + {u} and E' = E + {(u, v) | v in V} + {(v, u) | v in V}. It is easy to verify that a minimum Hamiltonian cycle of G' corresponds to a minimum Hamiltonian path of G: Let C be a minimum Hamiltonian cycle of G' and let P be the path obtained from C by removing node u and its two incident edges. Suppose there is a Hamiltonian path P' in G with weight(P') < weight(P). Let s and t be the beginning and endpoint of P' and let C' be the cycle obtained by adding to P' node u and the edges (t, u) and (u, s). Because these edges have weight zero, weight(C') = weight(P') < weight(P) = weight(C), contradicting the minimality of C. Because the cost of constructing G' is negligible, this implies T_MHP(n, m) <= T_MHC(n + 1, m + 2 * n). End.

Lemma: For directed graphs, T_MHC(n, m) <= T_MHP(n + 1, m).

Proof: Let G'' = (V'', E'') be the graph obtained from G by choosing an arbitrary node u in V and splitting it into nodes u_in and u_out. More precisely, V'' = V - {u} + {u_in, u_out}. E'' is the same as E for all edges not incident on u. E'' has an edge (v, u_in) for each edge (v, u) in E and an edge (u_out, v) for each edge (u, v) in E. Any Hamiltonian path in G'' must necessarily run from u_out to u_in, because u_out has no incoming and u_in no outgoing edges, and therefore there is a one-to-one correspondence between Hamiltonian paths in G'' and Hamiltonian cycles in G. End.

Algorithm

In this section a dynamic-programming MHC algorithm is presented. The graph G has n nodes. The nodes are identified with their indices, which run from 0 to n - 1. G is assumed to be complete. The weight of an edge (i, j) is denoted weight(i, j). Let MHP_{s, S, t} denote the length of the shortest path from a node s to a node t that passes through all nodes in the set S exactly once and through no other nodes. MHC(G) denotes the length of a minimum Hamiltonian cycle on G.

Like the previous algorithms, the starting point is a simple recursive expression of the quantity we want to compute:

MHP_{s, empty, t} = w(s, t),
MHP_{s, S, t} = min_{v in S} {MHP_{s, S - {v}, v} + w(v, t)},
This expresses nothing more than that the shortest path from s to t running through S passes through some node v in S before traversing the edge (v, t). MHC can be solved using
MHC(G) = MHP_{n - 1, {0, 1, ..., n - 2}, n - 1}.

Simply performing recursive calls, the running time would be something like O(n!). Dynamic programming means that we should store computed results in order not to compute the same value twice. How should this be organized efficiently? The idea is to identify a subset S with a number x written in binary: bit i of x is 1 if and only if node i belongs to S. Using an (n - 1) x 2^{n - 1} table, there is a row for all i < n - 1, and a column for all subsets of {0, 1, ..., n - 2}. The value MHC(G) can then be computed by taking the minimum over n - 1 values. In total there are O(n * 2^n) values to compute and each value takes O(n) time. So, the total time consumption is bounded by O(n^2 * 2^n). The outlined algorithm is integrated in an applet.
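The table-based computation outlined above can be sketched as follows. This is a minimal sketch and not the code of the applet: the class name HeldKarp, the method name minHamiltonianCycle and the representation of the graph as a weight matrix w[][] are assumptions made for this illustration. The table is filled for increasing x, so that the entries for S - {v} are available when the entry for S is computed.

```java
class HeldKarp {
    // Weight of a minimum Hamiltonian cycle in the complete directed graph
    // with weight matrix w (w[i][j] = weight of edge (i, j)).
    static long minHamiltonianCycle(int[][] w) {
        int n = w.length;
        if (n == 1) return 0;            // convention chosen here for a single node
        int s = n - 1;                   // fixed start and end node
        int m = n - 1;                   // nodes 0 .. n - 2 may appear in S
        long inf = Long.MAX_VALUE / 4;   // marks unused table entries
        // mhp[t][x] = MHP_{s, S, t}, where S = { i : bit i of x is 1 } and t is not in S.
        long[][] mhp = new long[m][1 << m];
        for (int x = 0; x < (1 << m); x++) {
            for (int t = 0; t < m; t++) {
                if ((x & (1 << t)) != 0) {
                    mhp[t][x] = inf;                 // t must not be an intermediate node
                } else if (x == 0) {
                    mhp[t][x] = w[s][t];             // MHP_{s, empty, t} = w(s, t)
                } else {
                    long best = inf;
                    for (int v = 0; v < m; v++)      // v is the last node before t
                        if ((x & (1 << v)) != 0)
                            best = Math.min(best, mhp[v][x & ~(1 << v)] + w[v][t]);
                    mhp[t][x] = best;
                }
            }
        }
        // Close the cycle: minimum over the n - 1 candidates for the last node v.
        int all = (1 << m) - 1;
        long result = inf;
        for (int v = 0; v < m; v++)
            result = Math.min(result, mhp[v][all & ~(1 << v)] + w[v][s]);
        return result;
    }

    public static void main(String[] args) {
        int[][] w = {{0, 10, 15, 20}, {10, 0, 35, 25}, {15, 35, 0, 30}, {20, 25, 30, 0}};
        System.out.println(minHamiltonianCycle(w));   // prints 80
    }
}
```

For the small symmetric example in main, the three different tours have weights 95, 80 and 95, so the minimum is 80. The table has (n - 1) * 2^{n - 1} entries, which limits this sketch to rather small n.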

Solving MHC with Dynamic Programming

Exercises

  1. Consider the computation of binomial coefficients with help of dynamic programming. For computing (n over k), with k <= n / 2, how many entries of the table are computed at most as a function of n when building up the table row by row until hitting (n over k) (in terms of the picture given above, the question is how many black and red numbers there are at most)? How many entries must be computed (in terms of the picture given above, the question is how many red numbers there are at most)?

  2. Give a non-recursive procedure computing (n over k) along the lines of the bottom-up dynamic programming solution requiring at most O(k) memory and minimizing the number of computed entries (so, in terms of the picture given above, the task is to only compute the red numbers).

  3. Give a divide-and-conquer algorithm for computing n! and analyze its time consumption when using the best multiplication subroutines. Show that (n over k) can be computed in the same time order.

  4. Consider the problem of making change. In the example with coins 1, 4, 6, 10, 40, 60, ..., the greedy algorithm might use 3 * k coins instead of the optimum of 2 * k. Can it be worse? How bad can it be? Give a class of examples achieving the claimed bound.

  5. Give an infinite monetary system with values d_i for which the top-down computation of the minimum number of coins for paying any amount N requires the computation of only O(N) values.

    Give a system which is particularly bad: as a function of N, (1 - o(1)) n * N values must be computed, where n is the number of coins with values smaller than N.

  6. Consider the problem of making change. In the example for paying 18 with coins 1, 4, 6, 10, we have seen that there are only very few instances for which the greedy strategy is not good. If the schedule-computing algorithm is going to be implemented in a cash register we might want to minimize memory consumption. So, the idea is to store all pairs (N, i) of values for which the algorithm should not apply a greedy strategy in some searchable structure. For this one can use a sorted array or a hash table. For composing a schedule with c coins to be selected out of n denominations, this implies that at most c + n values must be inspected. For practical values of c and n and the current speed of processors, this marginal slowdown does not matter.

  7. Consider the program for computing paying schedules using the BFS method. How does the maximum number of dequeued nodes develop as a function of N? Try sufficiently many values of N in each factor-10 range. With N you should go until the limit of the memory of your computer.

  8. Consider the program for computing paying schedules using dynamic programming in combination with pruning. How does the maximum number of computed set values develop as a function of N? Try sufficiently many values of N in each factor-10 range. The table-based version should be commented out when going with N to the maximum.

  9. How bad can the greedy algorithm for knapsack be? Consider the performance ratio, the value of V_opt / V_greedy, where V_opt gives the best achievable V value and V_greedy the V value achieved by the greedy algorithm for the same problem. Either prove an upper-bound or demonstrate by a parametrized example that it is unbounded.

    Formulate a simple condition on the weights w_i and the values v_i that guarantees that the greedy algorithm is optimal.

  10. Compute the optimal schedule for a chained matrix product M_1 x M_2 x ... x M_8, where M_i, 1 <= i <= 8 has size d_{i - 1} x d_i, for the following values of d_i: (5, 4, 7, 8, 4, 2, 5, 8, 5). How many computations would it cost to perform the trivial order of computation, starting to multiply from the left, that is (...((M_1 x M_2) x M_3) x ...) x M_8?

  11. Consider a knapsack problem with W = 12 and 7 packets with the following (w_i, v_i) pairs: (6, 25), (2, 8), (5, 19), (7, 26), (3, 10), (4, 10), (1, 1). Use dynamic programming to compute the maximum achievable value of V. Give the complete table, highlighting the values that must be computed when applying a top-down computation. Also give the table with packing schedules, highlighting the choices which are not greedy.

  12. Give an exact expression for the width and height of the bounding box of a tree drawing when applying the subtrees alternatingly next to and over each other. Without loss of generality, we assume that for even k, the tree T_k with 2^k leaves is obtained by putting two trees T_{k - 1} next to each other. Use these results to give an expression for the area A(k). Compute A(6).

  13. We consider the above-presented algorithm for computing the minimum area A(k) of the bounding box of a rectangular tree layout for a perfect binary tree T_k with 2^k leaves.

  14. Implement the above-presented algorithm for computing the minimum area A(k) of the bounding box of a rectangular tree layout for a perfect binary tree with n = 2^k - 1 nodes. What is the complexity of the algorithm in terms of k? Can this time be further reduced? Use this implementation to compute A(k) for sufficiently many k. How does A(k) develop? Try to give a recurrence relation for A(k) and solve it.

  15. A string A = (a_0, ..., a_{n - 1}) is an ordered sequence of n characters from some finite alphabet. String S = (s_0, ..., s_{k - 1}) is said to be a subsequence of A, if S can be obtained from A by deleting n - k characters without rearranging the remaining ones. For strings A and B, S is called a common subsequence if S is a subsequence of both A and B. A longest common subsequence is a common subsequence of maximum length. For strings A and B of length n and m, respectively, give an O(n * m) algorithm to compute the length L of a longest common subsequence. Hint: apply dynamic programming, computing values l(i, j) in a suitable order. Here l(i, j) is defined to be the length of the longest common subsequence of A_i = (a_0, ..., a_{i - 1}) and B_j = (b_0, ..., b_{j - 1}), prefixes of length i and j of A and B, respectively.

  16. An integer array a[] of length n, indexed from 0 to n - 1, is said to be sorted, if a[i - 1] <= a[i] for all 0 < i < n. A subsequence of an array is defined in the same way as that of a string. Particularly, the elements do not need to be consecutive. Give a dynamic-programming algorithm for computing the longest sorted subsequence S of a[]. Specify the time complexity of your algorithm. Hint: it is not hard to obtain an algorithm with running time bounded by O(n^2), but this is not the best possible.




Backtracking

Backtracking is an efficient way of performing exhaustive search for problems for which we do not know a better algorithmic solution. So, typically the complexity will be exponential in the input size, but the kind of exponentiality may become so much better that one can now find solutions for non-trivial problem sizes, whereas a dumb approach would not have found any interesting result. The classical example of applying backtracking is the n-queens problem: backtracking allows us to solve problems that are about twice as large. This we will not prove, but we will see the technique and cite some numbers.

Queens Problem

One of the more famous problems in computer science (and before!) is the problem of placing queens on a chess board so that no two of them can "see" each other. The traditional version tries to place eight queens on an 8 x 8 board, but in general we may try to place n queens on an n x n board. A queen is active along rows, columns and diagonals, so they must be placed so that no two share a row, column, diagonal or anti-diagonal.

8-Queens Problem

The dumbest idea is to try all subsets of n squares of the board. This implies testing (n^2 over n) ~= (e * n)^n possibilities, quite outrageous already for n = 8. Slightly better is to realize that the queens must be in different rows. So, a solution is given by a vector (v_0, ..., v_{n - 1}) in which v_i indicates the column in which the queen in row i is positioned. This leaves n^n vectors to test. If we now also realize that all v_i must be different, then the number of tests is reduced to n! ~= (n / e)^n. Substantially better, but still not good at all.

A shortcoming of these methods is that we first generate a complete solution, and then test whether it is feasible. Many solutions with a common impossible prefix are generated and tested. Here backtracking comes in and brings large savings (which are hard to quantify other than by experiment). This results in the following recursive method:

  void placeQueen() 
  // Recursive method which tries to place queen k in row k.
  {
    if (k == n) 
    // All queens have been placed. 
    {
      h++;    
      if (h <= hmax)
        printSolution();
    }
    else    
      for (int i = 0; i < n; i++) 
        if (coln[i] && main[i - k + n - 1] && anti[i + k]) 
        // Place queen k at position i of row k.
        {
          pstn[k] = i;
          coln[i] = main[i - k + n - 1] = anti[i + k] = false;
          k++;
          placeQueen();
          k--;
          coln[i] = main[i - k + n - 1] = anti[i + k] = true; 
        }
  }

Here coln[], main[] and anti[] are boolean arrays which have been initialized with all positions equal to true. They mark the columns, diagonals and anti-diagonals, respectively, that are still free: an entry is set to false as long as a queen occupies the corresponding line. The variables k, n, h and hmax and the array pstn[] are fields of the class. The complete program can be downloaded here. The time consumption of the program is O(n * c(n)) where c(n) gives the number of calls to the method placeQueen().
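For readers without access to the complete program, the following self-contained sketch shows how the method above can be embedded in a class. It is not the downloadable program: the class name Queens, the counters solutions and calls, the field first, and the choice to pass k as a parameter instead of keeping it in a field are assumptions made for this illustration. The array main[] is called diag[] here.

```java
import java.util.Arrays;

class Queens {
    final int n;
    final boolean[] coln, diag, anti;   // true = column / diagonal / anti-diagonal is free
    final int[] pstn;                   // pstn[k] = column of the queen in row k
    long solutions = 0;                 // number of complete placements found
    long calls = 0;                     // number of calls to placeQueen, c(n) in the text
    int[] first = null;                 // first solution found, if any

    Queens(int n) {
        this.n = n;
        coln = new boolean[n];
        diag = new boolean[2 * n - 1];
        anti = new boolean[2 * n - 1];
        pstn = new int[n];
        Arrays.fill(coln, true);
        Arrays.fill(diag, true);
        Arrays.fill(anti, true);
    }

    // Tries to place queen k in row k, given that rows 0 .. k - 1 are filled.
    void placeQueen(int k) {
        calls++;
        if (k == n) {                   // all queens have been placed
            solutions++;
            if (first == null) first = pstn.clone();
            return;
        }
        for (int i = 0; i < n; i++)
            if (coln[i] && diag[i - k + n - 1] && anti[i + k]) {
                pstn[k] = i;
                coln[i] = diag[i - k + n - 1] = anti[i + k] = false;
                placeQueen(k + 1);
                coln[i] = diag[i - k + n - 1] = anti[i + k] = true;
            }
    }

    public static void main(String[] args) {
        Queens q = new Queens(8);
        q.placeQueen(0);
        System.out.println(q.solutions + " solutions, first: " + Arrays.toString(q.first));
    }
}
```

For n = 8 this reports 92 solutions, the first one being (0, 4, 7, 5, 2, 6, 1, 3). The counter calls can be used to reproduce measurements of c(n) on one's own machine.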

First Steps when Backtracking for 8-Queens Problem

For n = 12, 12! = 479,001,600. During the entire backtracking the method placeQueen is called c(12) = 856,189 times, and the first solution is found after 262 calls to this method. On a fast computer the entire test for n = 12 takes about 100 milliseconds. Because the program requires only O(n) memory and n is small, the whole computation fits in the first level cache and can therefore be performed at the full speed of the processor. With this more efficient program solutions can be found for problems which otherwise would have been out of reach.

A Solution for 16-Queens Problem

General Pattern

In general we assume that the solutions to the problems we consider can be formulated as a vector v. This vector may give the numbers of the items to put in the backpack or the indices of the columns in which to place the queens. There are two tests that can be performed on these vectors. The first is for testing whether a vector of length k might still lead to a feasible solution by checking that none of the specified rules are violated. A vector which passes this test is called k-promising. The second is for testing whether the vector gives a solution. Because it is not always so that all solutions have the same size, this might be more involved than testing that the vector is k-promising and that k has reached a certain value.

The space of partial solutions is represented as a graph. The nodes are situations, and the edges link together situations that are one step away. Typically the graph is very large or even infinite, and is not completely constructed at the beginning, but only as far as needed: when the search reaches a k-promising situation (v_0, ..., v_{k - 1}), the algorithm generates the adjacent vectors (v_0, ..., v_{k - 1}, v_k) of length k + 1. The graph is thus not explicit but implicit.

Backtracking is a method for searching such large implicit graphs. It tries to minimize unnecessary work by never expanding a path from a node that corresponds to a vector which is not promising. The following simple procedure finds all solutions:

  void backTrack(int[] v, int k)
  // v is an array of sufficient length. 
  // The first k elements of v[] constitute a k-promising vector.
  {
    if (isSolution(v, k))
      printVector(v, k);
    else
      for (all possible values for x for v[k])
      {
        v[k] = x;
        if (isPromising(v, k + 1))
          backTrack(v, k + 1);
      }
  }

If there are solutions which are extensions of other solutions, then the "else" should be removed. Depending on the problem the tests and the graph are different. The definition of k-promising may also include extra conditions, to guide the search, for example to prevent that equivalent vectors are considered more than once.

Knapsack

We consider the same knapsack problem as before: there are n objects with weights w_i and values v_i. The goal is to maximize the value of the objects selected, respecting a limit W on the sum of the weights of the selected objects. The vectors give the indices of the objects in order of their selection. In principle any vector (i_0, ..., i_{k - 1}) with sum_{j = 0}^{k - 1} w_{i_j} <= W is k-promising, but in order to assure that each selection is considered only once, we will impose as well that the i_j are increasing. So, we are searching a graph with in total 2^n nodes (all subsets of n elements). Actually, this graph is a tree. All leaves are potential solutions.

Because this is an optimization problem, the notion of solution is not entirely adequate: backtracking is rather designed for decision problems, in which one should answer questions of the type "is there a feasible solution?" or "is there a feasible solution achieving a value of at least V?". If we would ask the latter question, we could output a selection of objects as soon as sum_{j = 0}^{k - 1} v_{i_j} >= V. In this case the extension of a solution might again be a solution.

The search starts with the empty set, which is a 0-promising set. Then the algorithm adds one element and tests whether the weight does not exceed W. If so, it recurses and tries to add a second element. In this way it continues until there are no further elements to add (given the increasing order of the indices), or adding an element would violate the weight limit.

This approach is applied to the example with weights 1, 2, 3, 4, 5, 6 and values 1, 8, 10, 10, 19, 25, respectively, and weight limit W = 10. The greedy algorithm finds V = 34 picking {1, 2, 6}, whereas the optimum selection is {2, 3, 5}, giving V = 37. The backtracking tree has 30 nodes. Each node is generated in constant time. The simple alternative algorithm tests all 2^6 = 64 subsets. Each test takes O(n) time. So even for such a tiny problem with quite a high weight bound there is already a considerable saving.
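The search described above can be sketched as follows for the example. This is a minimal sketch under assumptions: the class name KnapsackBacktrack and its fields are invented for this illustration, and objects are identified by 0-based array indices, so the optimum selection {2, 3, 5} of the text appears as the indices 1, 2, 4. Because every promising selection is a candidate, the best value is updated in every node and not only in the leaves.

```java
import java.util.ArrayList;
import java.util.List;

class KnapsackBacktrack {
    final int[] w, v;                    // weights and values of the objects
    final int capacity;                  // the weight limit W
    int bestValue = 0;                   // best value found so far
    List<Integer> bestSelection = new ArrayList<>();
    private final List<Integer> current = new ArrayList<>();   // increasing indices

    KnapsackBacktrack(int[] w, int[] v, int capacity) {
        this.w = w;
        this.v = v;
        this.capacity = capacity;
    }

    // 'current' is a promising selection with total weight 'weight' and total
    // value 'value'. It is extended with objects of index >= from only, so that
    // every subset is generated at most once.
    void search(int from, int weight, int value) {
        if (value > bestValue) {
            bestValue = value;
            bestSelection = new ArrayList<>(current);
        }
        for (int i = from; i < w.length; i++)
            if (weight + w[i] <= capacity) {        // extension is still promising
                current.add(i);
                search(i + 1, weight + w[i], value + v[i]);
                current.remove(current.size() - 1); // undo the choice
            }
    }

    public static void main(String[] args) {
        int[] w = {1, 2, 3, 4, 5, 6};
        int[] v = {1, 8, 10, 10, 19, 25};
        KnapsackBacktrack k = new KnapsackBacktrack(w, v, 10);
        k.search(0, 0, 0);
        System.out.println(k.bestValue + " " + k.bestSelection);   // prints 37 [1, 2, 4]
    }
}
```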

Solving the Knapsack Problem with Backtracking

Optimizing Backtracking

The reason that backtracking improves over trivial approaches is two-fold: The first point is most fundamental to backtracking, but the second point may largely contribute to the performance as well. For example, in the common case that the task is to pick k out of n elements each solution occurs in k! different arrangements. In such cases, the second point is assured by only considering the lexicographically first arrangement. This means that, when trying possible choices c[i] for element i, 0 <= i < k, it suffices for i >= 1 to only consider c[i] > c[i - 1].

Based on symmetry considerations it is often possible to exclude even more possibilities. In the case of the queens problem, it is clear that for n = 8, if (0, 4, 7, 5, 2, 6, 1, 3) is a solution, then so is (7, 3, 0, 2, 5, 1, 6, 4). That is, for c[0] it suffices to consider values <= 3. This trivial change gives a 50% improvement. Further symmetry considerations, which are harder to implement, can bring more. In principle each configuration occurs in eight variants: there are rotations over 0, 90, 180 and 270 degrees, which can be combined with the considered reflection in a vertical line through the middle.

In an optimization problem such as knapsack or when searching for all queen-placements, the order in which the choices are tested is of no importance: eventually the whole tree will be tested and the order has no impact on the size of the tree. However, for the queens problem and many game-like problems, one will be happy with any solution achieving the specified requirements. In that case it may be wise to use a non-trivial order of the choices. In general a modified probing order may make it hard to assure that each possibility is tested only once. However, if the order is the same at all levels of the recursion, then we can first construct an integer array choice[] of length n, giving a permutation of the indices, and then apply exactly the same algorithm, with c[i] > c[i - 1] for all i >= 1, using (choice[c[0]], ..., choice[c[k - 1]]) as choice vector. Now suppose we are trying to solve a problem like the queens problem for increasing values of some parameter n, which has something to do with the size. From the complete solutions for smaller values of n we may gain an impression in which subsets of the space of possible solutions the solutions lie particularly dense. Then for larger n it may be extremely useful to direct the search first to the regions where the solutions are supposedly lying densest.

The above sketched focusing idea is worked out for the queens problem. The 8-queens problem has 92 solutions: 4 each with v_0 = 0 and v_0 = 7, 8 each with v_0 = 1 and v_0 = 6, 16 each with v_0 = 2 and v_0 = 5, and 18 each with v_0 = 3 and v_0 = 4. This suggests that, when not taking into account symmetries, it is a good idea to choose for v_0 the testing order (n / 2 - 1, n / 2, n / 2 - 2, n / 2 + 1, ..., 0, n - 1). Looking further one gets the impression that the solutions of the queens problem lie densest around a diamond-shaped figure with v_{i + 1} >= n / 2 if v_i < n / 2 and v_{i + 1} < n / 2 if v_i >= n / 2. The modified program can be downloaded here. So far this idea is just based on some superficial observation and some basic intuition; it does not have to be effective. Luckily, trying shows that it indeed gives great improvements: before, the problem could be tackled for n up to about 30, now we come twice as far. Considering the resulting solutions for larger values of n, it immediately appears that the first solution found always starts with (n/2 - 1, n/2 + 1, n/2 - 4, n/2 + 4, n/2 - 7, n/2 + 7, n/2 - 10, n/2 + 10, ...). Fixing such a pattern as start (the program can be downloaded here) gives even much better results, as can be seen from the following table. Here basic, better and best denote the number of calls to the method placeQueen for finding the first solution with the first, second and third variant of the program.
For n = 160, the first solution is (79, 81, 76, 84, 73, 87, 70, 90, 67, 93, 64, 96, 61, 99, 58, 102, 55, 105, 52, 108, 49, 111, 46, 114, 43, 117, 40, 120, 37, 123, 34, 126, 31, 129, 28, 132, 25, 135, 22, 138, 19, 141, 16, 144, 13, 147, 10, 150, 7, 153, 4, 156, 1, 159, 131, 26, 134, 23, 133, 24, 136, 29, 127, 21, 137, 20, 128, 27, 125, 30, 130, 33, 112, 36, 124, 45, 119, 44, 110, 51, 106, 54, 98, 60, 97, 63, 92, 62, 95, 65, 86, 72, 83, 75, 80, 78, 48, 82, 42, 100, 56, 103, 59, 113, 47, 116, 53, 104, 50, 101, 35, 109, 57, 118, 41, 115, 18, 107, 32, 122, 11, 139, 38, 140, 17, 145, 39, 142, 8, 152, 157, 2, 154, 158, 155, 143, 3, 9, 0, 151, 88, 121, 77, 71, 94, 91, 69, 6, 89, 85, 68, 66, 74, 148, 12, 15, 149, 14, 146, 5).

     n        basic      better          best
    26      397,700          97            26
    28    3,006,299       1,892            28
    30   56,429,620         206         1,746
    32   87,491,426       2,643           473
    54          ---     291,892            55
    56          ---   6,345,996           801
    58          ---   4,439,731           227
    60          ---  74,669,904           680
   120          ---         ---           772
   130          ---         ---       149,696
   140          ---         ---    40,552,241
   150          ---         ---     1,098,291
   160          ---         ---    3.6 * 10^9

We see: a tiny twist in the organization brings us much more than any symmetry consideration / implementation optimization / parallelization could ever have brought us. Nevertheless, even this improvement brings us only somewhat further: for n > 160 even the best approach requires too many evaluations and reaches its limit. The effect of the improvements of the solutions of the n-queens problem is typical for improvements of exponential-time algorithms: in the best case one reduces the exponentiality, for example from x^n to x^{n / c}. By this one can solve c times larger problems in the same time as before. Thus, the maximum size for which a problem can be solved in "acceptable" time becomes c times larger. For the n-queens problem, the step from testing all n! possibilities to the backtracking tree is such an improvement. Hereby the maximum solvable problem size increases from about 15 to 30. The presented focusing strategy brings this to about 160. A further clever idea might give another factor, but it will probably not allow us to solve problems for n = 1000.

Sometimes thinking brings more than any algorithmic optimization: for any n = 6 * k + 4 with k >= 1, the following gives a solution:

c[0] = n / 2,
c[n / 2] = n / 2 + 3,
c[i] = (c[i - 1] + 2) mod n, for all other i.
A similar schedule works for n = 6 * k + 5.
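The schedule for n = 6 * k + 4 can be checked mechanically. The following sketch builds the vector c[] from the three rules above and tests it against the queens conditions. The class name QueensPattern and the method names construct and isSolution are assumptions made for this illustration, and c[n / 2] is reduced modulo n. The check is only a spot test for individual values of n, not a proof. For n = 4 the rules yield (2, 0, 1, 3), which is not a solution, so the sketch is meant for k >= 1.

```java
class QueensPattern {
    // Builds the placement given by the rules above: c[i] is the column of
    // the queen in row i.
    static int[] construct(int n) {
        int[] c = new int[n];
        for (int i = 0; i < n; i++) {
            if (i == 0)
                c[i] = n / 2;
            else if (i == n / 2)
                c[i] = (n / 2 + 3) % n;
            else
                c[i] = (c[i - 1] + 2) % n;
        }
        return c;
    }

    // Returns true if no two queens share a column, diagonal or anti-diagonal.
    static boolean isSolution(int[] c) {
        int n = c.length;
        boolean[] col = new boolean[n];
        boolean[] diag = new boolean[2 * n - 1];
        boolean[] anti = new boolean[2 * n - 1];
        for (int i = 0; i < n; i++) {
            int d = c[i] - i + n - 1;    // index of the diagonal through (i, c[i])
            int a = c[i] + i;            // index of the anti-diagonal through (i, c[i])
            if (col[c[i]] || diag[d] || anti[a])
                return false;
            col[c[i]] = diag[d] = anti[a] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        for (int n = 10; n <= 28; n += 6)
            System.out.println(n + ": " + isSolution(construct(n)));
    }
}
```

For n = 10 the constructed vector is (5, 7, 9, 1, 3, 8, 0, 2, 4, 6), and the check succeeds for n = 10, 16, 22 and 28.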

General Patterns for Queens Problem

The presented backtracking algorithms explore the graph in a depth-first manner, as this is exactly what recursion gives us. In the given examples, it does not matter in what order the graph is traversed, so there we can apply DFS, given that this is what one gets easiest. However, there are problems for which even nodes at very large recursion depth may be feasible. If we perform DFS for such problems, it may take unnecessarily long (infinitely) to return from a branch without solutions. For such instances, if we want to find a single solution, it is better to apply an alternative graph traversal such as BFS. Later we will see even more subtle strategies, using a priority queue instead of a simple stack or queue, the goal being to more rapidly find really good solutions.

Exercises

  1. Determine the largest values of n for which the queens problem can be solved in 0.1, 1, 10 and 100 seconds, respectively. Now imagine some optimization of the implementation by which the algorithm runs 10 times faster, or imagine that we are buying a better computer with a 10 times faster processor or that we have access to a parallel computer with 10 processors. How much bigger problems does this allow us to solve? What does this tell us about the importance of optimizing the implementation / buying a faster computer or a parallel computer?

  2. Consider the program, solving the queens problem.

  3. For which values of n has the n-queens problem no solutions? Prove your answer. How many solutions are there for other n? Trying the program suggests that the number of solutions strongly increases with n.

  4. Above we discussed the issue of making the queens placement faster by exploiting symmetries. Except for the reflection symmetry, this is not easy to implement. The reversed question is easier: how many non-symmetric solutions are there? Approximately 1/8 of the solutions remain, but not exactly, because some are multi-symmetric. For example, for n = 8, there are in total 92 solutions in 12 non-symmetric variants.

  5. Generalize the queens problem for a three-dimensional board. Now a queen can move along x-, y- and z-axis, along x-y, x-z and y-z diagonals and along x-y-z diagonals. Design a modified algorithm along the lines of the given one and turn it into a program. Try to determine the smallest n > 1 for which n^2 queens can be placed.

  6. Backtracking is much used in games. An example is a game in which initially 32 pins are placed on a board with 33 fields. There is one player. In a legal move the player jumps with one pin over another. The pin that is jumped over is removed. There comes a moment that no further legal moves are possible. The task is to minimize the number of pins left; ideally a single pin is left in the middle position.

    Game to Solve with Backtracking

    Design a backtracking solution for this problem. For the time complexity it is particularly important to assure that the same situation is not handled again and again. Recognizing symmetries may help to further reduce the number of evaluations.

    Turn the developed algorithm into a program. The complexity measure is the number of jumps performed.

  7. Consider a knapsack problem with W = 12 and 7 packets with the following (w_i, v_i) pairs: (6, 25), (2, 8), (5, 19), (7, 26), (3, 10), (4, 10), (1, 1). Use backtracking to compute the maximum achievable value of V. Give the complete tree.

  8. Consider the problem of computing a minimum vertex cover. A vertex cover of an undirected graph G = (V, E) is a subset S of the nodes so that for any edge (u, v) either u in S or v in S (or both).

    Graph with 8 Nodes

  9. Consider the exact-sum problem. This is a variant of knapsack in which the task is to find a subset of the items so that the sum of their weights is exactly some specified value W or to conclude that no such subset exists.

  10. Prove the correctness of the presented general solutions of the queens problem for n = 6 * k + 4. Construct similar solutions for other values of n and prove their correctness. Hint: it may be necessary to distinguish between n which are a multiple of 4 and those which are not. It appears that there are actually 12 cases to distinguish, not just 6.





Branch-and-Bound

General Pattern

We have seen three methods for solving optimization problems already and we will add one more here. They are

So, branch-and-bound is in the first place a method to prune the tree corresponding to feasible partial solutions of an optimization problem. However, in the context of an optimization problem, it is only natural to also try to focus the search on branches that appear most promising. That is, to first develop branches that, based on their lower and upper bounds and other criteria, promise to lead to the best solutions. The rationale is that developing such branches probably leads to finding improved lower or upper bounds, which allows us to prune more of the other branches. Thus, in general one does not perform a pure DFS or BFS search, but rather a tuned search, using a priority queue to maintain the list of generated but not yet completely explored nodes. At every stage we proceed with the node at the head of the priority queue.

Applying branch-and-bound requires specifying at least the following four points:

In a maximization problem, for any newly computed lower bound it is considered whether it is larger than the maximum value found so far. If yes, this maximum value is updated. If for any node of the branching tree the upper bound is less than or equal to this maximum value, there is no need to further expand the tree from this node. Typically we do not only want to know the maximum achievable value but also the corresponding solution. However, this is never a problem: if we mark in which node the maximum value was established, then either this node already corresponds to a complete solution itself, or for this node upper and lower bound were both equal to the maximum value. In that case the construction that was giving the lower bound estimate, for example using some greedy algorithm, can be used to obtain the complete solution in little additional time.

Knapsack

In the knapsack problem there are n objects, also called packets. The packets have weights given by a vector w and values given by a vector v. The task is to determine a subset of the packets for which the sum of the values is maximal under the condition that together they do not weigh more than W. The maximum achievable value is denoted by V. A solution is represented by a bit vector x, for which x_i = 1 means that object i is taken along, while x_i = 0 denotes that object i is not taken along.

We start with an empty knapsack. Then the objects are considered in order one by one, branching on the decision either to take or not to take an object along. In other words, we are fixing the x_i one by one, starting with x_0. After settling the first j x-values, the residual weight capacity is given by W - sum_{i = 0}^{j - 1} x_i * w_i. The tree is developed in a DFS manner. If the packets are indexed in order of decreasing v_i / w_i, then the packets that appear most interesting are considered first. Quite good lower bounds are obtained by filling the residual weight capacity of the knapsack in a greedy way only selecting packets which have neither been selected before nor discarded. An upper bound can be obtained by using that after considering the first j packets, none of the further packets can contribute more than v_j / w_j value units per available weight unit. That is, if x_0, ..., x_{j - 1} are specified, then

V(x_0, ..., x_{j - 1}) <= sum_{i = 0}^{j - 1} x_i * v_i + round_down((W - sum_{i = 0}^{j - 1} x_i * w_i) * v_j / w_j).
Here V(x_0, ..., x_{j - 1}) denotes the maximum achievable V-value under condition that the first j values of the vector x are chosen as specified. Rounding down is correct because the actual value is an integer. If for an integer k and a real y, we know k <= y, then even k <= round_down(y).

For the knapsack problem there are some easy improvements which help to reduce the number of computed bounds. When developing the tree, at any stage, it is considered whether there are induced decisions, before evaluating lower and upper bounds. By induced decision we mean that only one of the two alternatives is feasible. For the considered problem it may be that the weight of the next most attractive packet exceeds the residual weight capacity. In that case the only option is to not take the packet. When branching, the upper bounds are evaluated first. This saves some lower bound evaluations if we have reached a bound-situation. Lower bounds have to be recomputed only if a non-greedy step has been made since the latest computation.

We consider the same example as in the chapter "Dynamic Programming": n = 6, w = (1, 2, 3, 4, 5, 6), v = (1, 8, 10, 10, 19, 25) and W = 10. In sorted order the (w_i, v_i) pairs are (6, 25), (2, 8), (5, 19), (3, 10), (4, 10) and (1, 1). Assume that we have reached the node which might be designated with (1, 0, 0), that is, we have decided to take (6, 25), but not (2, 8) or (5, 19). There are 4 weight units left, and the greedy algorithm will fill this residual capacity by taking (3, 10) and (1, 1). Because (6, 25) was already taken, this gives a lower bound of 36. For the upper bound we get 25 + round_down(4 * 10 / 3) = 38.
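The two bounds of this example can be recomputed with the following sketch. The class name KnapsackBounds, the method names lowerBound and upperBound and the parameter conventions (x holds the j decisions taken so far, packets sorted by decreasing v_i / w_i) are assumptions made for this illustration. The integer division implements the rounding down.

```java
class KnapsackBounds {
    // Greedy lower bound: the packets j, j + 1, ... are taken whenever they
    // still fit in the residual weight capacity.
    static int lowerBound(int[] w, int[] v, int cap, int[] x, int j) {
        int value = 0, residual = cap;
        for (int i = 0; i < j; i++)
            if (x[i] == 1) {
                value += v[i];
                residual -= w[i];
            }
        for (int i = j; i < w.length; i++)
            if (w[i] <= residual) {
                value += v[i];
                residual -= w[i];
            }
        return value;
    }

    // Simple upper bound: no remaining packet contributes more than v_j / w_j
    // value units per weight unit.
    static int upperBound(int[] w, int[] v, int cap, int[] x, int j) {
        int value = 0, residual = cap;
        for (int i = 0; i < j; i++)
            if (x[i] == 1) {
                value += v[i];
                residual -= w[i];
            }
        if (j == w.length)
            return value;                            // nothing left to add
        return value + residual * v[j] / w[j];       // integer division rounds down
    }

    public static void main(String[] args) {
        int[] w = {6, 2, 5, 3, 4, 1};                // sorted by decreasing v_i / w_i
        int[] v = {25, 8, 19, 10, 10, 1};
        int[] x = {1, 0, 0};                         // the node (1, 0, 0)
        System.out.println(lowerBound(w, v, 10, x, 3));   // prints 36
        System.out.println(upperBound(w, v, 10, x, 3));   // prints 38
    }
}
```

At the root (j = 0) the same methods give a lower bound of 34, the value found by the greedy algorithm, and an upper bound of round_down(10 * 25 / 6) = 41.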

Branch-and-Bound for Knapsack

Improving Branch-and-Bound

For small problems it does not matter, but for larger problems it is crucial to minimize the number of branching nodes. The maxim is
Once you start branching, you are lost!

One extra branching node early in the tree may multiply the whole search time by a factor. Considering that the search time typically is exponential in the input size, it becomes clear that almost any polynomial effort is justified if this leads to fewer branching nodes. The principal ways to achieve this are:

The first will almost always be profitable. Possibly the optimal strategy is to adapt the methods for computing bounds to the size of the problem to solve and the depth of the node in the tree: the larger the problem and the higher the node, the more can be saved by reducing the degree of this node. The second might be even more important, because it helps to reduce large problems to moderate sized problems. Often several problem reduction methods can be used alternatingly, until no further reduction is possible. In a good branch-and-bound algorithm a long sequence of induced decisions is followed by a single branching.

We consider how the branch-and-bound algorithm for the knapsack problem can be improved. In the current implementation, the upper bound is computed in an unnecessarily simple way. After settling x_0, ..., x_{j - 1}, only v_j / w_j is considered. However, if object j is lighter than the residual weight capacity of the knapsack, we will certainly need further objects to fill it up and these may give smaller value per weight unit. If we use linear time for computing the lower bound, then there is no need to only use constant time for the upper bound. On the contrary: in maximization problems the upper bound is far more important than the lower bound. For knapsack the upper bound can be determined with a modification of the greedy algorithm: the residual weight capacity is filled with objects in greedy order until coming to an index k so that

  W' = sum_{i = 0}^{j - 1} x_i * w_i + sum_{i = j}^{k - 1} w_i <= W,
       sum_{i = 0}^{j - 1} x_i * w_i + sum_{i = j}^k       w_i >  W.
So, 0 <= W - W' < w_k. An upper bound is given by
  V' = sum_{i = 0}^{j - 1}            x_i * v_i 
     + sum_{i = j}^{k - 1}                  v_i 
     +                     (W - W') / w_k * v_k
For small j this will give a much better estimate. Now it also becomes useful to make induced decisions as early as possible: a heavy object with high average value per weight unit that no longer fits would otherwise still enter the computation and make the computed upper bound unnecessarily large. So, we suggest always throwing out immediately all packets whose weights are larger than the residual weight capacity.

Even for the small problem considered before, this leads to a considerably smaller tree. After including (6, 25) and (2, 8), the residual weight capacity is 2. Thus, all other objects except for (1, 1) are excluded. Upper and lower bound are equal at 34 and no further branching is needed. Also, after excluding (6, 25), the greedy algorithm finds a lower bound of 37, the same as found by the improved upper-bound algorithm.
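The improved upper bound V', combined with the immediate exclusion of packets that are too heavy, might be sketched as follows. This is an illustration under our own assumptions: the names are hypothetical, the items are given in sorted order, and the fractional part is rounded down, which is allowed because all values are integers.

```java
// Sketch: fractional-greedy upper bound with induced exclusions (names are illustrative).
class ImprovedKnapsackBound {
    // Items are assumed sorted by decreasing value per weight unit.
    // decided[i] is 1 (taken) or 0 (rejected) for the first decided.length items.
    static int upperBound(int[] w, int[] v, int[] decided, int capacity) {
        int value = 0, residual = capacity;
        for (int i = 0; i < decided.length; i++) {
            value += decided[i] * v[i];
            residual -= decided[i] * w[i];
        }
        final int fixedResidual = residual;   // capacity left after the fixed decisions
        for (int i = decided.length; i < w.length; i++) {
            if (w[i] > fixedResidual) continue;           // induced decision: can never fit
            if (w[i] <= residual) {                       // fits completely
                value += v[i];
                residual -= w[i];
            } else {                                      // index k: add a fraction and stop
                return value + (residual * v[i]) / w[i];  // rounded down
            }
        }
        return value;                                     // everything that was left fits
    }
}
```

For the sorted example, the call with decisions {1, 1} returns 34 and the call with decisions {0} returns 37, in agreement with the values given above. For the node (1, 0, 0) it returns 37 instead of the 38 obtained with the simple bound.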

Improved Branch-and-Bound for Knapsack

Steiner Minimum Trees

Problem Description

The Steiner tree problem has several variants of which we only consider one: the Steiner problem on graphs. The input consists of a graph G = (V, E) with a weight w_e for each edge e in E, and a subset N of V whose nodes are called terminals. We must compute a subgraph (which will be a tree) connecting all the nodes in N. If the selection is given by a 0-1 function x on the edges, then the task is to find the selection x minimizing the sum of the weights W(N, x) of the selected edges defined as follows:
W(N, x) = sum_{e in E} x_e * w_e.

There are some special cases. If N = V, that is, if all nodes are terminals, then the problem is to find a minimum spanning tree. This problem has several easy solutions. If |N| = 1, the selection can be taken empty and W = 0. If |N| = 2, the task is to find a shortest path in a weighted graph, which can be solved with Dijkstra's algorithm. For all other |N| the problem is hard in general. An approximation is obtained by determining a minimum spanning tree and omitting the edges which are not needed for connecting the terminals.

We consider an example with 10 nodes, 14 edges and 4 terminals. The terminals are marked with double circles. In the example the reduced minimum spanning tree has weight 32, whereas the Steiner minimum tree, connecting all terminals, has weight 31. For some approaches it is important to distinguish non-terminal nodes in which the Steiner tree branches. Such nodes are called Steiner points.

The Minimum Steiner Problem on Graphs

Notice that even though in this example the weight of the Steiner minimum tree is only slightly smaller than the weight of the reduced minimum spanning tree (in the exercises an example has to be given with an arbitrarily large difference between the two), the choice of edges is very different. Particularly, the Steiner minimum tree does not need to consist of edges which are particularly light.

The Steiner minimum tree problem is NP-hard, so polynomial time solutions are unlikely. The problem has been tackled with dynamic programming, but these approaches were not particularly successful. Branch-and-bound supported by many clever heuristics has proven to be much more effective, and now the problem can be solved exactly for problems with thousands of nodes. Much work has also been done on computing approximate solutions in polynomial time. It is not hard to compute a two-approximation. In 1993 Zelikovsky went beyond this, presenting an 11/6-approximation algorithm. After several further improvements, Hougardy and Prömel presented a 1.598-approximation algorithm in 1999, which is still far from 1. At the time Zelikovsky presented his algorithm, exact solutions could be found only for tiny problems. Since then, due to the tremendous progress in the domain of exact algorithms, approximations have lost most of their practical relevance.

Branch-and-Bound Solution

The branching criterion is here a decision to either certainly put an edge into the solution, or to certainly not take an edge into the solution. So, we are beginning with the set of all possible trees (this set is of course not explicitly constructed). Then gradually, we make more and more decisions, finally boiling down to one specific tree. This is done by maintaining at any point of the search tree two sets: IN and OUT. IN (OUT) gives the edges that are certainly (not) in the solution tree. At any point we compute upper and lower bounds and use these to prune the branching tree.

The edges in OUT are removed from the graph, while the graph is contracted along the edges in IN. Contraction means: removing the edge and replacing the two nodes connected by the edge by a single new node. If at least one of the endpoints of a contracted edge is a terminal, then the new fused node is a terminal as well. If both endpoints of a contracted edge are terminals, then the contraction reduces the number of terminals by one. As soon as we reach a situation with only one terminal, we are done. As soon as we reach a situation with two terminals, the problem can be solved by finding a shortest path in the remaining graph. For estimating the upper and lower bounds, we should not forget to add the weights of the edges in IN.

In the following we only give a few very elementary ideas. The theory around Steiner trees fills several books. Keeping the above maxim in mind, in any good strategy, the focus should lie on minimizing the number of branching operations by complementing the branch-and-bound with methods to reduce the graph. As a heuristic for the selection of the edge to branch upon we use the following rule:

Find the terminal t for which the difference between lightest and next lightest edge is maximal, then pick the lightest edge incident on t.

A very simple lower bound is derived from a lower bound on the weight of spanning trees: a spanning tree has weight at least equal to the sum of the weights of the cheapest edges connected to each node minus the weight of the cheapest edge in the whole graph. This is so, because if we consider a spanning tree and root it at an endpoint u of the cheapest edge, then the edge weights can be attributed to the nodes at the deeper end of the tree edges. So, each node v except for the root node u gets attributed a weight that is at least as much as the cheapest edge connected to v. Analogously, a lower bound on the weight of any Steiner tree is obtained by determining for each terminal the lightest adjacent edge and summing the cost of these. If there are adjacent terminals, the weight of the lightest edge must be subtracted. An upper bound can be obtained simply by computing a spanning tree of all nodes using all remaining edges. Then, this spanning tree can be reduced to those edges that are actually lying on a path connecting the nodes of N. The smaller |N| is, the less accurate these bounds tend to be.
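The lower bound can be computed in one pass over the edges. The following sketch follows the arithmetic of the worked example further down, in which the smallest of the terminal values is subtracted; the edge-list representation and all names are our own assumptions. Subtracting the smallest value is safe: if the Steiner tree is rooted at the terminal with the smallest value, every other terminal has its own tree edge leading towards the root.

```java
// Sketch: simple lower bound for the Steiner problem on graphs (names are illustrative).
class SteinerLowerBound {
    // edges[i] = {u, v, cost}; terminal[u] tells whether node u is a terminal.
    // Every terminal is assumed to have at least one incident edge.
    static int lowerBound(int[][] edges, boolean[] terminal) {
        int n = terminal.length;
        int[] lightest = new int[n];
        java.util.Arrays.fill(lightest, Integer.MAX_VALUE);
        for (int[] e : edges) {
            lightest[e[0]] = Math.min(lightest[e[0]], e[2]);
            lightest[e[1]] = Math.min(lightest[e[1]], e[2]);
        }
        int sum = 0, smallest = Integer.MAX_VALUE, count = 0;
        for (int u = 0; u < n; u++) {
            if (!terminal[u]) continue;
            sum += lightest[u];
            smallest = Math.min(smallest, lightest[u]);
            count++;
        }
        // With at most one terminal the empty selection suffices.
        return count <= 1 ? 0 : sum - smallest;
    }
}
```

On a hypothetical star graph in which four terminals are attached to a central non-terminal node by edges of cost 2, 8, 5 and 7, the method returns 2 + 8 + 5 + 7 - 2 = 20, whereas the only Steiner tree has cost 22.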

There are a few obvious graph reductions:

  1. Self loops, edges of the form (u, u), can be removed.
  2. Parallel edges, edges running between the same nodes, can be replaced by the single edge with the smallest cost.
  3. Edges leading to a non-terminal node of degree one can be removed.
  4. A pair of edges (u, v) and (v, w), with v a non-terminal node of degree two can be replaced by a single edge (u, w) whose cost is the sum of the costs of (u, v) and (v, w).
  5. If an edge (u, v) is the only edge leading to a terminal node u, then it can be put in IN without branching.

In a branch-and-bound algorithm the sets IN and OUT mostly remain implicit. At any point of the recursive algorithm it is only necessary to know the graph on which we are working, the earlier achieved value and the sum of the weights of the included edges. That some edges have been excluded is not further relevant, they are just not there anymore. In the easiest implementation, for each node of the branching tree a whole new graph is created with the edges that remain after the graph reductions. This requires quite some memory, but because the branching depth cannot be very large, this is not that serious as long as the tree is developed in a DFS manner. Of course this also implies considerable copying of large parts of the graph which remain unchanged. In earlier days, when the graphs for which the algorithm could be applied had fewer than 100 nodes, this was of minor importance, but when we think of graphs with thousands of nodes it may make sense to work on a single graph while maintaining in each node of the branching tree only a list of all changes that have been made, changes which have to be undone when returning from the recursion.

Example

We consider what these rules mean for the tree of the above picture. The reduced minimum spanning tree has cost 32. The initial lower bound is 2 + 8 + 5 + 7 - 2 = 20. For all terminals the cost of the cheapest and the second cheapest edge differs by 1. So, edge (0, 1) is the first edge to branch on. Here we assume that if there are several equally good alternatives the first is taken. If (0, 1) is included, IN = {(0, 1)} and OUT = {}. The lower bound is 2 (for the included edge) + 3 + 8 + 5 + 7 - 3 = 22.

After Including (0, 1)

Now the terminal node 0, which consists of node 0 and 1 fused, has two edges of cost 3. So, the largest difference is found for another terminal: the next edge to branch on is (3, 2). If (3, 2) is included, IN = {(0, 1), (3, 2)} and OUT = {}. The lower bound is 2 + 8 (for the included edges) + 3 + 2 + 5 + 7 - 2 = 25.

After Including (3, 2)

Now the terminal node 3, which consists of node 3 and 2 fused, has cheapest edges of cost 2 and 3, which is the maximum difference. So, the next edge to branch on is (3, 6). If (3, 6) is included, IN = {(0, 1), (3, 2), (3, 6)} and OUT = {}. The lower bound is 2 + 8 + 2 (for the included edges) + 3 + 1 + 5 + 7 - 1 = 27.

After Including (3, 6)

Now the largest difference is still found for terminal node 3, which consists of node 3, 2 and 6 fused. So, the next edge to branch on is (3, 5). If (3, 5) is included, IN = {(0, 1), (3, 2), (3, 6), (3, 5)} and OUT = {}. The lower bound is 2 + 8 + 2 + 1 (for the included edges) + 3 + 3 + 5 + 7 - 3 = 28.

After Including (3, 5)

The largest difference is still found for terminal node 3, which consists of node 3, 2, 6 and 5 fused. So, the next edge to branch on is (3, 0). If (3, 0) is included, IN = {(0, 1), (3, 2), (3, 6), (3, 5), (3, 0)} and OUT = {}. The lower bound is 2 + 8 + 2 + 1 + 3 (for the included edges) + 3 + 5 + 7 - 3 = 28. Including this edge fuses the terminal nodes 0 and 3.

After Including (3, 0)

By the reductions, the graph has now strongly changed. It happens that the two cheapest edges with the largest difference are now found at terminal node 8, so the next edge to branch on is (8, 9). If (8, 9) is included, IN = {(0, 1), (3, 2), (3, 6), (3, 5), (3, 0), (8, 9)} and OUT = {}. The lower bound is 2 + 8 + 2 + 1 + 3 + 7 (for the included edges) + 4 + 5 + 4 - 4 = 32.

After Including (8, 9)

The lower bound equals the already established upper bound, so there is no reason to continue here. That is, the recursion backtracks one level. If (8, 9) is excluded, IN = {(0, 1), (3, 2), (3, 6), (3, 5), (3, 0)} and OUT = {(8, 9)}. The edges (3, 9) and (9, 7) are contracted because node 9 has degree 2, and the new edge (3, 7) with cost 9 is parallel to the edge (3, 7) with cost 6 and eliminated. Because now terminal node 7 has degree one, the edge (7, 3) with cost 6 is added to IN without branching. Hereafter terminal 7 continues to have degree one and the edge (7, 8) with cost 11 (this is the former edge (3, 8)) is added to IN without branching. The result is that only terminal node 7 is left, having reached a cost of 33.

After Excluding (8, 9)

Hereafter, the recursion backtracks again, not having found a new better solution. If (3, 0) is excluded, IN = {(0, 1), (3, 2), (3, 6), (3, 5)} and OUT = {(3, 0)}. Terminal node 0 (which consists of node 0 and 1) has degree 1, and thus edge (0, 4) is added to IN without branching. This brings the lower bound to 2 + 8 + 2 + 1 + 3 (for the included edges) + 4 + 4 + 5 + 7 - 4 = 32.

After Excluding (3, 0)

Because this equals the earlier established upper bound, there is no need to proceed on this branch. The operations continue in this way, traversing the whole branching tree which has 53 nodes in total. The optimum solution is only found shortly before the end in the branch which excludes edge (0, 1).

The Minimum Steiner Problem with Branch and Bound

The program which was used to perform the sketched traversal of the branching tree can be downloaded here. The output, giving a concise overview of the situation in each node and the performed operations, can be downloaded here. One should notice that here, as in the above discussion, because of the fusion of nodes and contraction of edges, the listed edges are not necessarily those of the original graph.

The number of visited nodes is to be compared with the huge number of trees that connect the four terminals. Furthermore, it is hard to systematically generate such trees. The easiest is to generate all 2^14 = 16384 subsets of the edges and then to check whether they give a feasible solution and to keep track of the minimum cost of all feasible solutions.
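The exhaustive alternative is easy to write down, which makes it useful for checking small examples. The sketch below is only an illustration with names of our own choice: it enumerates all subsets of the edges as bit masks and tests with a union-find structure whether the terminals are connected. It assumes at least one terminal and fewer than 31 edges.

```java
// Sketch: exhaustive search for the Steiner problem on graphs (names are illustrative).
class SteinerExhaustive {
    // Union-find root lookup with path halving.
    static int find(int[] parent, int u) {
        while (parent[u] != u) {
            parent[u] = parent[parent[u]];
            u = parent[u];
        }
        return u;
    }

    // edges[i] = {u, v, cost}. Returns the minimum cost of an edge subset
    // connecting all terminals, or -1 if no subset connects them.
    static int minimumCost(int n, int[][] edges, int[] terminals) {
        int m = edges.length;
        int best = -1;
        for (int mask = 0; mask < (1 << m); mask++) {
            int[] parent = new int[n];
            for (int u = 0; u < n; u++) parent[u] = u;
            int cost = 0;
            for (int i = 0; i < m; i++) {
                if ((mask & (1 << i)) != 0) {
                    cost += edges[i][2];
                    parent[find(parent, edges[i][0])] = find(parent, edges[i][1]);
                }
            }
            boolean connected = true;
            int root = find(parent, terminals[0]);
            for (int t : terminals) {
                if (find(parent, t) != root) connected = false;
            }
            if (connected && (best < 0 || cost < best)) best = cost;
        }
        return best;
    }
}
```

As a hypothetical test case one may take three terminals 0, 1 and 2 that are pairwise connected by edges of cost 5 and that are each connected to a non-terminal node 3 by an edge of cost 3. The search returns 9, the cost of the tree through node 3, whereas any tree using only terminals costs 10.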

Further Graph Reductions

We consider one more simple idea to reduce the graph:
Edges violating the triangle inequality can be removed. An edge (u, v) with cost c violates the triangle inequality if there are edges (u, w) and (w, v) with costs c' and c'', respectively, with c >= c' + c''. Notice that (u, v) can be removed even when c = c' + c''.

In the above example, this leads to an improved situation after including edge (3, 0): edge (3, 8) with cost 11 can be removed because of the edges (3, 9) with cost 4 and edge (9, 8) with cost 7. Hereafter terminal node 8 has degree one, so edge (8, 9) is added to IN without branching. This brings the lower bound to 2 + 8 + 2 + 1 + 3 + 7 (for the included edges) + 4 + 5 + 4 - 4 = 32, and so there is no need to proceed on this branch, immediately saving two nodes of the tree.

Graph After Including (3, 0)

Adding this graph reduction reduces the total number of nodes from 53 to 31. This is typical: each further graph reduction tends to give a further strong reduction of the number of nodes of the branching tree. The important point is that one reduction may not be applicable first, but once another reduction has been applied it may find something to do again. In the given example it was the new reduction which created a situation in which reduction 5 could become active again. Ideally, this process keeps rolling until reaching a bounding instance. The output of the program with the added graph reduction can be downloaded here.

Branching Tree when Applying one more Reduction

Time-Efficient Programming

The implementation of the used program is primitive: the used algorithms are the simplest one can think of, in many cases they do not achieve the optimum time order. Of course this is not something to be proud of, but at the same time it is not so serious either. We consider in what context this may actually be the best one can do, touching on a crucial practical point.

For research purposes it might be good to invest every conceivable effort in using the best known method for every detail. One might even think of an implementation on a parallel computer to gain another factor 10. However, this requires a tremendous effort and is only a good idea if programming time plays no role at all. A more common situation is that one decides that a problem should be solved by using a computer program, because using pen and paper becomes too time consuming, too boring and too error-prone, and that one can spend at most one day on this. After this day, there should be a running and rather good program.

So assume we have a finite amount of time and want to come up with quite a good program tackling a hard combinatorial problem. How should we proceed? This is the problem of time-efficient programming. So, here we do not strive for minimizing the computation time, but rather for a weighted minimum of programming and computation time, putting considerable weight on the programming time. Exhaustive search requires very little programming time, but is so terribly inefficient that one can only solve the smallest problems. For the Steiner minimum problem, one will not get much further than m = 30 in this way. So, this is probably not what we want, though, if the program is only intended for verifying the correctness of the solutions of small examples like the one considered above, this actually might be good enough.

If one implements a branch-and-bound algorithm, then typically there are instances which may require visiting an exponential number of nodes. This one should accept as a fact. However, by adding heuristics this exponentiality may be strongly reduced or even eliminated for important classes of inputs (such as random graphs or planar graphs or any graph which is sufficiently dense or sparse). Instead of adding additional reduction strategies, we might also have spent the little time we have on optimizing the implementation. What should we do?

Of course there is no general answer to the above question because it has not been specified how bad the basic implementation is or how much can be gained by adding heuristics. So, let us assume that the time for processing a single node of the branching tree with a simple implementation takes O(t_1(n)) while this can be reduced to O(t_2(n)) with a more elaborate implementation. Also assume that a basic version of the branch-and-bound algorithm processes on average f_1(n) nodes, while by adding some heuristics this can be reduced to f_2(n). Here the average is meant to be computed over some relevant subset of problems. In the available time we can either write a program running in O(t_1(n) * f_2(n)) or O(t_2(n) * f_1(n)). What is better?

As an example we consider t_1(n) = n^2, t_2(n) = n, f_1(n) = 2^n and f_2(n) = 2^{n / 2}. These values are arbitrary but not untypical. For any T >= 0, let n_1 = n_1(T) and n_2 = n_2(T) be values so that 2^{n_1} * n_1 = T and 2^{n_2 / 2} * n_2^2 = T, respectively. For a certain maximum waiting time T, n_1(T) and n_2(T) are the maximum sizes of the problems that can be tackled when optimizing the implementation and when spending more time on adding heuristics, respectively. For large T, the solutions are n_1 ~= log T - loglog T, n_2 ~= 2 * (log T - 2 * loglog T). For example, for T = 10^12, solving the two equations numerically gives n_1 ~= 34.7 and n_2 ~= 56.5.
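The two defining equations have no closed-form solution, but they are easily solved numerically. The following sketch uses bisection on the logarithm of the running time; the class name, the interface and the search interval from 1 to 1000 are our own assumptions.

```java
// Sketch: largest tractable problem size for a given time budget (names are illustrative).
class ProblemSizeEstimate {
    // Logarithm to basis 2 of the running time for problem size n.
    interface Cost { double log2Time(double n); }

    // Largest n in [1, 1000] with log2Time(n) <= log2T, found by bisection.
    // The running time is assumed to be increasing in n.
    static double maxSize(Cost c, double log2T) {
        double lo = 1, hi = 1000;
        for (int it = 0; it < 100; it++) {
            double mid = (lo + hi) / 2;
            if (c.log2Time(mid) <= log2T) lo = mid; else hi = mid;
        }
        return lo;
    }

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    public static void main(String[] args) {
        double log2T = log2(1e12);
        double n1 = maxSize(n -> n + log2(n), log2T);          // 2^n * n = T
        double n2 = maxSize(n -> n / 2 + 2 * log2(n), log2T);  // 2^{n / 2} * n^2 = T
        System.out.println("n_1 = " + n1 + ", n_2 = " + n2);
    }
}
```

Working with logarithms avoids overflow for large n and makes it easy to try other choices of t_1, t_2, f_1 and f_2.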

When striving for the most efficient branch-and-bound implementation while having limited programming time, it is generally better to spend time on adding better methods for computing bounds and reducing the size of the problem than on optimizing the implementation.

Traveling Salesman Problem

Possibly the most famous of all NP-hard problems, the major example of a hard problem one finds in popular-scientific articles, is the traveling salesman problem. The problem is appealing because of the contrast between its simple formulation which is understandable even to non-scientists and its near-intractability.

Problem Description

For a graph, which may be directed or undirected, a tour is a path starting and ending in the same node: it is given by a sequence of nodes (u_0, ..., u_k), so that all pairs (u_i, u_{i + 1}), 0 <= i < k, give an edge of the graph while u_0 = u_k. The feasible solutions to the traveling salesman problem, TSP, are given by all tours which visit all nodes at least once: {u_0, u_1, ..., u_k} = {1, 2, ..., n}, that is, interpreted as a set, the multiset of visited nodes contains the set of all node indices. In the literature one may find the additional condition that all u_i are different. The task is to find such a tour which for a weighted graph minimizes the sum of the weights of the used edges.

The formulation of the problem is simplified if it is considered on a complete graph for which the triangle-inequality holds. From any given graph G, such a complete graph G' can be constructed by giving an edge (u, v) in G' length equal to the shortest path from u to v in G. Of course the graph must be connected and there should be no negative-cost cycles, otherwise the whole problem is not well-defined.
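The construction of G' amounts to an all-pairs shortest-path computation. A minimal sketch using the Floyd-Warshall algorithm is given below; this choice of algorithm, the matrix representation and the constant INF for missing edges are our own assumptions, and the sketch assumes non-negative edge lengths.

```java
// Sketch: completing a graph to one satisfying the triangle inequality (names are illustrative).
class MetricClosure {
    static final int INF = Integer.MAX_VALUE / 2;   // marks a missing edge, safe to add once

    // d[u][v] is the length of edge (u, v) or INF if there is no such edge.
    // Returns the matrix of shortest-path lengths between all pairs of nodes.
    static int[][] close(int[][] d) {
        int n = d.length;
        int[][] c = new int[n][];
        for (int u = 0; u < n; u++) {
            c[u] = d[u].clone();
            c[u][u] = 0;
        }
        for (int k = 0; k < n; k++)
            for (int u = 0; u < n; u++)
                for (int v = 0; v < n; v++)
                    if (c[u][k] + c[k][v] < c[u][v]) c[u][v] = c[u][k] + c[k][v];
        return c;
    }
}
```

If an entry of the result still equals INF, the input graph is not connected and the problem is not well-defined.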

On such complete graphs, among the optimum solutions to a TSP there is also a solution which visits each node only once: a subtour (u, v), (v, w), where v is a node which is visited also elsewhere, can be replaced by (u, w), because the triangle-inequality guarantees that (u, w) is not longer than (u, v) and (v, w) together. A tour visiting all nodes exactly once can be represented as a permutation of the node indices. This limits the number of feasible solutions to n! for a graph with n nodes.

Improved Greedy Solutions

TSP is an extremely well-studied problem. Correspondingly, there are many good heuristic solution methods. Most of them only give approximations. A well known heuristic is the following greedy strategy: start at one of the nodes, repeatedly use the shortest outgoing edge leading to a node which has not been visited before.
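The greedy strategy might be sketched as follows. The names are illustrative, the cost matrix is assumed to describe a complete graph, and ties are broken in favor of the node with the smallest index, which is our own choice.

```java
// Sketch: greedy (nearest neighbor) tour construction (names are illustrative).
class GreedyTour {
    // d is the cost matrix of a complete graph.
    // Returns the order in which the nodes are visited, beginning with start;
    // the tour is closed by returning from the last node to start.
    static int[] tour(int[][] d, int start) {
        int n = d.length;
        boolean[] visited = new boolean[n];
        int[] order = new int[n];
        order[0] = start;
        visited[start] = true;
        for (int i = 1; i < n; i++) {
            int cur = order[i - 1], next = -1;
            for (int v = 0; v < n; v++) {
                if (!visited[v] && (next < 0 || d[cur][v] < d[cur][next])) next = v;
            }
            order[i] = next;
            visited[next] = true;
        }
        return order;
    }
}
```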

This heuristic is not very good, but (especially on undirected graphs) it can subsequently be improved by two-swaps: a two-swap means that in the sequence of edges (S_0, (u_i, u_{i+1}), S_1, (u_j, u_{j+1}), S_2) the edge (u_i, u_{i+1}) is removed and replaced by (u_i, u_j), that the subsequence S_1 is reversed, that the edge (u_j, u_{j+1}) is removed and replaced by (u_{i+1}, u_{j+1}), while the rest of the sequence of edges remains unchanged. On the permutation this works as follows:

  (u_1,     ..., u_{i-1}, u_i,     u_{i+1}, 
   u_{i+2}, ..., u_{j-1}, u_j,     u_{j+1}, u_{j+2}, ..., u_n) --> 
  (u_1,     ..., u_{i-1}, u_i,     u_j, 
   u_{j-1}, ..., u_{i+2}, u_{i+1}, u_{j+1}, u_{j+2}, ..., u_n)
A general method to find a minimal (not minimum!) solution is to perform cost-reducing two-swaps until there are no cost-reducing two-swaps to perform anymore. Such a solution will be called two-swap optimal.
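On the permutation a two-swap is just the reversal of a subsequence, and its effect on the cost can be evaluated in constant time. The following sketch illustrates this for a symmetric cost matrix; the names are our own and the order in which the pairs of edges are tested is an arbitrary choice.

```java
// Sketch: two-swaps on a tour stored as a permutation (names are illustrative).
class TwoSwap {
    // Replaces the edges (s[i], s[i+1]) and (s[j], s[j+1]) by (s[i], s[j]) and
    // (s[i+1], s[j+1]) by reversing s[i+1..j]. Requires i < j.
    static void apply(int[] s, int i, int j) {
        for (int a = i + 1, b = j; a < b; a++, b--) {
            int t = s[a];
            s[a] = s[b];
            s[b] = t;
        }
    }

    // Change of the tour cost caused by the two-swap; negative means cheaper.
    // d is assumed to be symmetric; the tour is closed from s[n-1] back to s[0].
    static int delta(int[][] d, int[] s, int i, int j) {
        int n = s.length;
        int a = s[i], b = s[i + 1], c = s[j], e = s[(j + 1) % n];
        return d[a][c] + d[b][e] - d[a][b] - d[c][e];
    }

    // Performs cost-reducing two-swaps until none is left and returns their number.
    static int optimize(int[][] d, int[] s) {
        int n = s.length, count = 0;
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int i = 0; i < n - 1; i++) {
                for (int j = i + 1; j < n; j++) {
                    if (delta(d, s, i, j) < 0) {
                        apply(s, i, j);
                        count++;
                        improved = true;
                    }
                }
            }
        }
        return count;
    }
}
```

Because every performed two-swap strictly reduces an integer cost, the loop terminates, and the resulting tour is two-swap optimal.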

A special class of graphs are Euclidean graphs, in which the distance between any pair of nodes u and v is given by their distance in the two-dimensional Euclidean metric. These graphs have practical importance because mostly road distances are closely related to Euclidean distances and because only for graphs of this type we can easily construct good solutions by looking at the map giving the positions of all nodes, which then typically are called cities. In the following we use a Euclidean graph with 10 nodes as example. The greedy algorithm constructs the tour (0, 3, 4, 5, 9, 8, 2, 1, 6, 7, 0).

Greedy Solution for a Euclidean TSP Problem

It is clearly advantageous to perform a two-swap for (4, 5) and (8, 2), giving the tour (0, 3, 4, 8, 9, 5, 2, 1, 6, 7).

After Two-Swap((4, 5), (2, 8))

The next profitable swap is for (0, 3) and (1, 6), giving the tour (0, 1, 2, 5, 9, 8, 4, 3, 6, 7).

After Two-Swap((0, 3), (1, 6))

The next profitable swap is for (3, 6) and (7, 0), giving the tour (0, 1, 2, 5, 9, 8, 4, 3, 7, 6).

After Two-Swap((3, 6), (7, 0))

The tour (0, 1, 2, 5, 9, 8, 4, 3, 7, 6) is a minimal solution: there are no cost-reducing two-swaps to perform anymore. At the same time it is not the optimum: the tour (0, 1, 3, 4, 2, 5, 9, 8, 7, 6) is slightly shorter.

Optimum Solution

Some approaches occasionally allow a cost-increasing two-swap in order to escape from a local minimum. Another, more systematic way to find better solutions is to perform k-swaps instead of two-swaps. A k-swap removes k edges and glues the resulting k subsequences together again in any of the 2 * 4 * ... * (2 * k - 2) = (k - 1)! * 2^{k - 1} different possible ways, one of them being the original gluing. In the given example, the optimum solution is obtained by a three-swap: remove the edges (1, 2), (4, 8) and (3, 7) and replace them by (1, 3), (4, 2) and (8, 7). When performing k-swaps, there are (n over k) * (k - 1)! * 2^{k - 1} possibilities to test. Each test can be performed in O(k) time, thus, for constant k, a profitable k-swap, if any, can be found in O(n^k) time. A solution which cannot be improved by making k-swaps is called k-swap optimal. Any k-swap optimal solution is also (k - 1)-swap optimal (because one of the edges which is taken out can be put back in its own place), but because of the strong increase of the complexity with k, the following will be a good idea:

  int performSwaps(int[][] d, int[] s, int n, int k)
  {
    // Performs all possible cost-reducing k-swaps on s[]
    // and returns their number
  }

  void swapOptimal(int[][] d, int[] s, int n, int k)
  // d[][] is the cost matrix
  // s[]   gives the preliminary solution, which is improved in place
  {
    if (k == 2)
      performSwaps(d, s, n, k);
    else
      do
        // First make the solution (k - 1)-swap optimal with the cheaper tests
        swapOptimal(d, s, n, k - 1);
      while (performSwaps(d, s, n, k) > 0);
  }

The quality of the found solutions improves considerably with k. The length of a 5-swap optimal solution is often hardly longer than that of an optimum one. On the other hand, a guarantee to find the optimum cannot be given unless k almost equals n.

Branch-and-Bound Solution

TSP is one of the problems which are most suitable for treatment with branch-and-bound, because we have good bounds in both directions. We have already seen how rather good (even tunable to a large extent) upper bounds can be obtained. Rather good lower bounds can also be obtained easily: Because a complete tour in particular contains a spanning tree as a subgraph, a lower bound is given by the cost for a minimum spanning tree. In the following we describe the method in more detail.

For TSP it is natural to develop the branching tree in a more subtle way than before: it is a good idea to use a priority queue in which all the reached leaves are maintained. These leaves correspond to partial tours. The algorithm starts by entering the starting node into the priority queue. Because there is only one entry, the key is irrelevant. In general the entries in the priority queue correspond to partial tours and the entered keys are the upper bound values computed for them. As associated information the entries also contain a lower bound estimate. As long as the priority queue is not empty, the entry with smallest key is removed with a deletemin operation. Let (u_0, ..., u_k) be the partial tour corresponding to this entry, where u_0 indicates the starting node. If the lower bound corresponding to this entry exceeds the current best achieved value, then this partial tour is of no further interest and we can proceed with the next entry. Otherwise, for all nodes v not in {u_0, ..., u_k} lower and upper bounds are computed for the partial tour (u_0, ..., u_k, v). This upper bound is used as key when inserting a new entry with all this information into the priority queue. This insertion is only performed when the lower bound is less than the current best achieved value. Therefore, it makes sense to compute lower bounds before computing upper bounds.

For a partial tour (u_0, u_1, ..., u_{k-1}, u_k) a lower bound is given by taking the sum of the costs of the edges (u_0, u_1), ..., (u_{k-1}, u_k) and adding the cost of a spanning tree spanning all non-visited nodes plus u_0 and u_k. An upper bound is given by taking the sum of the costs of the edges (u_0, u_1), ..., (u_{k-1}, u_k) and adding the cost to complete (u_0, u_1, ..., u_{k-1}, u_k) in a greedy way to a tour.
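The lower bound might be computed as sketched below. The spanning tree is computed with a simple quadratic version of Prim's algorithm; the names are illustrative and the cost matrix is assumed to describe a complete graph.

```java
// Sketch: spanning-tree lower bound for a partial tour (names are illustrative).
class TspLowerBound {
    // Cost of a minimum spanning tree on the nodes u with use[u] == true (Prim).
    static int mstCost(int[][] d, boolean[] use) {
        int n = d.length, total = 0;
        int[] dist = new int[n];
        boolean[] done = new boolean[n];
        java.util.Arrays.fill(dist, Integer.MAX_VALUE);
        int first = -1;
        for (int u = 0; u < n; u++) {
            if (use[u]) { first = u; break; }
        }
        if (first < 0) return 0;
        dist[first] = 0;
        while (true) {
            int best = -1;
            for (int u = 0; u < n; u++) {
                if (use[u] && !done[u] && (best < 0 || dist[u] < dist[best])) best = u;
            }
            if (best < 0) return total;     // all used nodes are in the tree
            done[best] = true;
            total += dist[best];
            for (int v = 0; v < n; v++) {
                if (use[v] && !done[v] && d[best][v] < dist[v]) dist[v] = d[best][v];
            }
        }
    }

    // Cost of the edges of the partial tour plus the cost of a minimum spanning
    // tree on the non-visited nodes together with the first and the last node.
    static int lowerBound(int[][] d, int[] partial) {
        int n = d.length, cost = 0;
        boolean[] use = new boolean[n];
        java.util.Arrays.fill(use, true);
        for (int i = 0; i + 1 < partial.length; i++) cost += d[partial[i]][partial[i + 1]];
        for (int i = 1; i + 1 < partial.length; i++) use[partial[i]] = false;  // interior nodes
        return cost + mstCost(d, use);
    }
}
```

As a hypothetical test case one may take the four corners of a square with sides of length 10 and diagonals of length 14, numbered around the square. For the partial tour (0, 2, 1) the bound is 24 + 20 = 44, while the only tour extending it has length 48.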

If the partial tour (u_0, u_1, ..., u_{k-1}, u_k) consists of n - 1 nodes, then no new entries are generated. Instead it is determined how much it costs to complete (u_0, u_1, ..., u_{k-1}, u_k) to a tour by going via the single remaining node back to u_0. On this occasion, and whenever generating an upper bound, it is considered whether a new best achieved value is obtained, and if so this value is updated.

The algorithm is designed so that we may expect to rapidly discover a small achievable value, which then may be used to discard most of the entries deleted from the priority queue and to save adding even more. In principle tours are started from both directions. This might be prevented, but because the greedy algorithm works differently from both directions, this may also mean that good paths are found later. Possibly a different priority strategy leads more rapidly to a good bound and thus more pruning. Maybe one should give priority to the entry with the smallest lower bound or some kind of mixed strategy. Whatever one does, there is no guarantee. At best one finds a heuristic that on the kind of graphs one wants to investigate works well.

Example

The branch-and-bound algorithm is illustrated with an example derived from the graph considered above. In order to slightly limit the complexity we have reduced the set of edges to those that appear most useful. The weights are more or less proportional to the actual distances in the two-dimensional plane, but we have rounded them to integers and made them different from each other as far as possible, maintaining their relative order. All non-indicated edges are as long as the shortest path in the graph.

Input Graph for TSP with Branch-and-Bound

Starting from 0 we consider including the edges (0, 1), (0, 3), (0, 6) and (0, 7). In all cases, and this is generally true, the constructed minimum spanning tree is the same. Its edges have weights 4 + 4 + 4 + 5 + 5 + 5 + 6 + 6 + 7 = 46. In addition we get the cost of the included edges, which gives different lower bounds for each of the choices.

Minimum Spanning Tree

Running the greedy algorithm starting from node 1, 3, 6 and 7, gives 4 rather different paths with, by coincidence, rather similar lengths: (0, 1, 3, 4, 5, 9, 8, 2, 6, 7, 0) with cost 7 + 5 + 4 + 5 + 4 + 7 + 10 + 15 + 4 + 10 = 71; (0, 3, 4, 5, 9, 8, 2, 1, 6, 7, 0) with cost 6 + 4 + 5 + 4 + 7 + 10 + 8 + 13 + 4 + 10 = 71; (0, 6, 7, 4, 3, 1, 2, 5, 9, 8, 0) with cost 7 + 4 + 10 + 4 + 5 + 8 + 5 + 4 + 7 + 17 = 71; (0, 7, 6, 3, 4, 5, 9, 8, 2, 1, 0) with cost 10 + 4 + 8 + 4 + 5 + 7 + 7 + 10 + 8 + 7 = 70. The current best achieved value is 70.

Greedy paths Starting from Node 0

Thus, we insert four new entries into the priority queue: (71, 53, (0, 1)), (71, 52, (0, 3)), (71, 53, (0, 6)) and (70, 56, (0, 7)). Let us fix that from the entries in the priority queue with the same key, the one that is entered first is taken out first again (FIFO order). Thus, in sorted order the priority queue looks as follows:

  (70, 56, (0, 7)), 
  (71, 53, (0, 1)), 
  (71, 52, (0, 3)), 
  (71, 53, (0, 6)).
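A standard binary heap does not by itself return entries with equal keys in FIFO order. One possible way to obtain this behavior, sketched here with names of our own choice, is to give each entry a sequence number which is used as secondary key.

```java
// Sketch: priority queue of partial tours with FIFO order among equal keys
// (names are illustrative).
class TourQueue {
    static final class Entry implements Comparable<Entry> {
        final int upper;       // key: upper bound
        final int lower;       // associated lower bound
        final int[] partial;   // partial tour
        final long seq;        // insertion number, breaks ties

        Entry(int upper, int lower, int[] partial, long seq) {
            this.upper = upper;
            this.lower = lower;
            this.partial = partial;
            this.seq = seq;
        }

        public int compareTo(Entry o) {
            if (upper != o.upper) return Integer.compare(upper, o.upper);
            return Long.compare(seq, o.seq);
        }
    }

    private final java.util.PriorityQueue<Entry> queue = new java.util.PriorityQueue<>();
    private long counter = 0;

    void insert(int upper, int lower, int... partial) {
        queue.add(new Entry(upper, lower, partial, counter++));
    }

    // Removes and returns the entry with the smallest key, or null if the queue is empty.
    Entry deleteMin() { return queue.poll(); }

    boolean isEmpty() { return queue.isEmpty(); }
}
```

Inserting the four entries in the order in which they were generated and calling deleteMin repeatedly returns them in the sorted order listed above.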

In the next step, (70, 56, (0, 7)) is removed from the priority queue. In order to compute the new lower bounds, we compute the cost of a minimum spanning tree spanning the nodes {0, 1, 2, 3, 4, 5, 6, 8, 9}. It is the same as before except for the edge (6, 7). It has weight 42. Now we consider all adjacent nodes. These are reached over the edges (7, 3), (7, 4), (7, 6) and (7, 8). The respective lower bounds are 63 = 10 + 42 + 11, 62 = 10 + 42 + 10, 56 = 10 + 42 + 4 and 64 = 10 + 42 + 12. None of these exceeds the current best achieved value 70. The new upper bounds are found by running the greedy algorithm starting from node 7. These paths have lengths 69, 66, 57 and 66, respectively, giving upper bounds of 79, 76, 67 and 76. Here we find a new current best achieved value: 67.

Greedy paths Starting from Node 7

Adding the four new entries to the priority queue makes it look as follows:

  (67, 56, (0, 7, 6)), 
  (71, 53, (0, 1)), 
  (71, 52, (0, 3)), 
  (71, 53, (0, 6)),
  (76, 62, (0, 7, 4)), 
  (76, 64, (0, 7, 8)), 
  (79, 63, (0, 7, 3)).

In the next step, (67, 56, (0, 7, 6)) is removed from the priority queue. A minimum spanning tree spanning the nodes {0, 1, 2, 3, 4, 5, 8, 9} consists of the same edges as before except for the edges (6, 7) and (0, 6). It costs 35. Starting from 6, 1 and 3 are the only non-visited neighbors. The respective lower bounds are 62 = 14 + 35 + 13 and 57 = 14 + 35 + 8. The greedy paths from these have lengths 61 and 53, respectively, giving upper bounds of 75 and 67.

Greedy paths Starting from Node 6

Adding these to the priority queue gives
  (67, 57, (0, 7, 6, 3)), 
  (71, 53, (0, 1)), 
  (71, 52, (0, 3)), 
  (71, 53, (0, 6)),
  (75, 62, (0, 7, 6, 1)), 
  (76, 62, (0, 7, 4)), 
  (76, 64, (0, 7, 8)), 
  (79, 63, (0, 7, 3)).

In the next step, (67, 57, (0, 7, 6, 3)) is removed from the priority queue. A minimum spanning tree spanning the nodes {0, 1, 2, 4, 5, 8, 9} is different from before. For the graphical representation it is helpful to remove the nodes which do not need to be visited. Applying Kruskal's or Prim's algorithm this is established easily by just ignoring the edges leading to the visited nodes, but of course in a real implementation one might also gradually reduce the graph. One should not forget that actually we are working on a complete graph, in which all missing edges have the lengths of the shortest path. The minimum spanning tree has weight 35.
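
Computing the minimum spanning tree on the remaining nodes only can be sketched with Prim's algorithm on a distance matrix. The following is a minimal sketch under the assumption that dist[u][v] already holds the shortest-path distances of the complete graph; the function name mst_weight and the small 4-node matrix are hypothetical and do not correspond to the example graph.

```python
def mst_weight(dist, nodes):
    """Weight of a minimum spanning tree of the complete graph on `nodes`.

    Prim's algorithm in O(k^2) time for k nodes. Visited nodes are ignored
    simply by not listing them in `nodes`.
    """
    nodes = list(nodes)
    if len(nodes) <= 1:
        return 0
    # Cheapest known connection from the growing tree to each outside node.
    best = {v: dist[nodes[0]][v] for v in nodes[1:]}
    total = 0
    while best:
        v = min(best, key=best.get)   # nearest node outside the tree
        total += best.pop(v)
        for u in best:                # v may offer cheaper connections
            if dist[v][u] < best[u]:
                best[u] = dist[v][u]
    return total

# Hypothetical symmetric distance matrix on 4 nodes.
dist = [
    [0, 2, 9, 4],
    [2, 0, 3, 7],
    [9, 3, 0, 1],
    [4, 7, 1, 0],
]
full = mst_weight(dist, [0, 1, 2, 3])   # all nodes
reduced = mst_weight(dist, [0, 2, 3])   # node 1 has been visited
print(full, reduced)
```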

Minimum Spanning Tree

Now we consider all adjacent non-visited nodes. These are reached over the edges (3, 1), (3, 2) and (3, 4). The respective lower bounds are 62 = 22 + 35 + 5, 64 = 22 + 35 + 7 and 61 = 22 + 35 + 4. None of these exceeds the current best achieved value 67. The new upper bounds are found by running the greedy algorithm starting from node 3. These paths have lengths 46, 46 and 45, respectively, giving upper bounds of 68, 68 and 67.

Greedy paths Starting from Node 3

Adding these to the priority queue gives
  (67, 61, (0, 7, 6, 3, 4)), 
  (68, 62, (0, 7, 6, 3, 1)), 
  (68, 64, (0, 7, 6, 3, 2)), 
  (71, 53, (0, 1)), 
  (71, 52, (0, 3)), 
  (71, 53, (0, 6)),
  (75, 62, (0, 7, 6, 1)), 
  (76, 62, (0, 7, 4)), 
  (76, 64, (0, 7, 8)), 
  (79, 63, (0, 7, 3)).

Branch and Bound for TSP with Priority Queue

In this way we can continue. As long as we do not discover a better current best achieved value, there is not much to prune. However, we want to point out that there is actually more pruning than it appears: in the above example, in every step we only considered the nearest neighbors. In the most basic implementation, however, there is no such limitation: the graph is complete and all non-visited nodes are considered. Many of the resulting entries have high lower bounds. For example, upon removing (67, 57, (0, 7, 6, 3)) from the priority queue we only considered the neighbors 1, 2 and 4 of 3, inserting three new entries in the priority queue. In general the lower bound for such an entry is given by 22 + 35 + distance_from_node_3. If we had considered all nodes, we would also have found the entries

  (70, 66, (0, 7, 6, 3, 5)),
  (--, 68, (0, 7, 6, 3, 8)),
  (--, 70, (0, 7, 6, 3, 9)).
The first would be entered in the priority queue, but become obsolete after the next improvement of the current best achieved value. The other two would be filtered out before even computing their upper bounds.
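
The filtering step can be sketched as follows. The distances from node 3 are those implied by the lower bounds given above (for example 9 = 66 - 22 - 35 for node 5); the function name candidate_entries is hypothetical. An entry is dropped only if its lower bound exceeds the current best achieved value, as in the text.

```python
def candidate_entries(path_cost, mst_weight, distance, best_value):
    """Split the candidate next nodes into kept and pruned ones.

    The lower bound for a candidate is path_cost + mst_weight + distance to it.
    Candidates whose lower bound exceeds best_value are pruned before any
    upper bound is computed for them.
    """
    kept, pruned = {}, {}
    for node, d in distance.items():
        lower = path_cost + mst_weight + d
        (pruned if lower > best_value else kept)[node] = lower
    return kept, pruned

# Partial path (0, 7, 6, 3): cost 22 so far, remaining tree weight 35, best value 67.
distance_from_3 = {1: 5, 2: 7, 4: 4, 5: 9, 8: 11, 9: 13}
kept, pruned = candidate_entries(22, 35, distance_from_3, 67)
print("kept:", kept)
print("pruned:", pruned)
```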

Alternative Approaches

The branch-and-bound approach presented here is natural and simple, but there are other approaches as well. In the described approach, we are gradually extending a partial tour to a tour. The branching is on the selection of the next node on the tour. As a result a node at depth k of the branching tree has degree n - k - 1. The selection criterion is trivial: always take the last node on the partial path. Alternatively, just as we did for the Steiner problem, we can branch on the selection or exclusion of an edge. This leads to a binary branching tree. In a first implementation, the edge to branch on can be selected as in the Steiner problem.

The two approaches are not equivalent, but they are not as different as it may appear. The relation between them is similar to the relation between Prim's and Kruskal's algorithm for computing a minimum spanning tree: the first starts with a single node, for example node 0, and then repeatedly adds the edge to the node which lies nearest to the current tree; the second starts with an empty set and repeatedly adds the cheapest edge which connects two nodes that are not yet connected by earlier selected edges.

Assignment Problem

Problem Description

The assignment problem is a basic optimization problem that nevertheless already has many interesting aspects. In the assignment problem, there are n agents and n tasks. In principle each agent can perform any of the tasks, but at different costs. The task is to allocate one task to every agent, so that the sum of the costs is minimized. This problem is also known under the name bipartite weighted matching problem.

The assignment problem is important because of its numerous practical applications. It can for example be used to model the problem of pairing men and women, each matching having a certain desirability, or for allocating teachers to classes, or for allocating new students to housing facilities. In general the assignment problem can be used to model all problems of matching two parties, where the different possible matchings have different weights. Impossible matches (like a teacher of history who should not teach a physics class) can be modeled by giving infinite value to the corresponding cost.

Mathematically the problem can be formulated as follows. Given an n x n cost matrix M, we should determine a permutation P of n numbers, so that C(P), the cost with respect to P, given by

C(P) = sum_i M(i, P(i))
is minimized. P(i) can be interpreted as the task performed by agent i. Alternatively the problem can be formulated as follows: find values x_{ij} so that
x_{ij} = 0 or 1;
sum_i x_{ij} = 1, for all j;
sum_j x_{ij} = 1, for all i;
which are minimizing the value of C(x) defined as
C(x) = sum_i sum_j x_{ij} * M_{ij}.
This is a linear programming formulation. In general linear programming (LP) is quite an easy problem which can be solved in polynomial time (with the famous ellipsoid method) and in practice often even more efficiently with methods such as the simplex algorithm. However, because of the first condition the above formulation is actually an integer linear programming (ILP) problem. In general ILP is NP-hard, so from this formulation alone we cannot expect to solve the problem in reasonable time by simply turning the wheel of a known problem-solving machine.
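
The two formulations can be related by a small sketch. It evaluates C(P) for a permutation, converts P into the 0/1 values x_{ij}, and finds the optimum by trying all n! permutations, which is only feasible for very small n. The 3 x 3 matrix and all function names are hypothetical illustrations.

```python
from itertools import permutations

def assignment_cost(M, P):
    """C(P) = sum_i M(i, P(i)): agent i performs task P[i]."""
    return sum(M[i][P[i]] for i in range(len(M)))

def as_indicator(P):
    """The 0/1 matrix x with x[i][j] = 1 exactly when P[i] == j."""
    n = len(P)
    return [[1 if P[i] == j else 0 for j in range(n)] for i in range(n)]

def indicator_cost(M, x):
    """C(x) = sum_i sum_j x_{ij} * M_{ij}."""
    n = len(M)
    return sum(x[i][j] * M[i][j] for i in range(n) for j in range(n))

def brute_force(M):
    """Try all n! permutations; returns (minimum cost, permutation)."""
    return min((assignment_cost(M, P), P) for P in permutations(range(len(M))))

M = [
    [4, 2, 8],
    [4, 3, 7],
    [3, 1, 6],
]
best_cost, best_P = brute_force(M)
x = as_indicator(best_P)
print(best_cost, best_P, x)
```

Every row and every column of x sums to 1, which is exactly the pair of conditions in the second formulation.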

Branch-and-Bound Solution

If one does not want to perform a deep problem-specific study, then branch-and-bound is one of the approaches one would soon think of. As always, this implies that the four main questions must be settled: lower bound, upper bound, choice criterion and tree development strategy.

A usable lower bound is obtained as the maximum of the sum of the minima in the rows and the sum of the minima in the columns. Once a certain number of choices are made, the upper bounds and lower bounds are of course determined by only looking at the remaining submatrix of M, adding the already incurred costs.

A relatively good upper bound is obtained by taking the value achieved by one of the following two greedy algorithms:

The value achieved by each of these algorithms is the same.

For the assignment problem, the choices are to either allocate an agent to a task or to definitely not assign it. An assignment of agent i to task j means that row i and column j are scratched from the cost matrix. Deciding that agent i should not be assigned to task j can be modeled by setting M(i, j) = infinity. At any stage of the processing, we branch on the minimum remaining value in the cost matrix M. As a result at first the greedy approach will be tried, which is reasonable. For simplicity we will perform the search in DFS order.
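
The following is a minimal sketch of this scheme, not the program belonging to this text. It uses the lower bound described above, branches on the minimum remaining entry and searches in DFS order. As a simplifying assumption, no separate greedy upper bound is computed: the best complete assignment found so far serves as the upper bound. Excluded entries are kept in a set instead of writing infinity into the matrix. All names and the 4 x 4 matrix are hypothetical.

```python
import math

def solve_assignment(M):
    """Return (minimum cost, tuple of (agent, task) pairs) by DFS branch-and-bound."""
    n = len(M)
    best = {"cost": math.inf, "pairs": None}

    def lower_bound(rows, cols, banned):
        # Maximum of the sum of row minima and the sum of column minima
        # in the remaining submatrix; banned entries count as infinity.
        def c(i, j):
            return math.inf if (i, j) in banned else M[i][j]
        row_sum = sum(min(c(i, j) for j in cols) for i in rows)
        col_sum = sum(min(c(i, j) for i in rows) for j in cols)
        return max(row_sum, col_sum)

    def dfs(rows, cols, banned, cost, pairs):
        if not rows:                                  # complete assignment
            if cost < best["cost"]:
                best["cost"], best["pairs"] = cost, pairs
            return
        if cost + lower_bound(rows, cols, banned) >= best["cost"]:
            return                                    # bound: cannot improve
        # Branch on the minimum remaining entry.
        value, i, j = min((M[i][j], i, j) for i in rows for j in cols
                          if (i, j) not in banned)
        # Left branch: assign agent i to task j (scratch row i and column j).
        dfs(rows - {i}, cols - {j}, banned, cost + value, pairs + ((i, j),))
        # Right branch: agent i is definitely not assigned to task j.
        dfs(rows, cols, banned | {(i, j)}, cost, pairs)

    everything = frozenset(range(n))
    dfs(everything, everything, frozenset(), 0, ())
    return best["cost"], best["pairs"]

example = [
    [9, 2, 7, 8],
    [6, 4, 3, 7],
    [5, 8, 1, 8],
    [7, 6, 9, 4],
]
cost, pairs = solve_assignment(example)
print(cost, sorted(pairs))
```

Because the left branch is always explored first, the first leaf reached corresponds to the greedy solution, as described above.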

Example

As an example we consider an assignment problem with n = 5 and the following cost matrix:

Cost Matrix for Assignment Problem

Applying the described branch-and-bound approach to this cost matrix results in a branching tree with 23 nodes. So, there are 12 leaves, which is not much if we consider that there are 5! = 120 possible assignments.

Branch-and-Bound for Assignment

Improvements

Tolerance-Based Branching

The greedy rule for constructing a solution and for deciding which edge to branch on is simple to implement and works quite well. However, theoretically it is not so well founded: if the two smallest values in the cost matrix are almost equal, then taking the lightest edge is not so much more motivated than taking the second lightest. It is much more important which consequences this has for the further selections. One edge must be selected from each row and column anyway, so there is no reason to be too greedy.

Earlier we considered the minimum-Steiner-tree problem. There we could also have picked the lightest edge incident upon any terminal, but we did not do so. Rather we picked the edge which was most "tolerant" in the sense that not taking it would imply taking a considerably more expensive edge. Also for the assignment problem it is better to base the selection of the entry to branch on on some kind of tolerance measure. The tolerance of an entry should quantify the negative implication of not taking it. Said in a more mathematical way, the tolerance of an entry is the amount by which its value can be increased before it is no longer the first choice. For various problems this notion should be worked out differently. In the case of the assignment problem the following is very reasonable and easy to compute:

Definition: The tolerance of an entry (i, j) of the cost matrix a[][] of an assignment problem is given by row_second[i] + col_second[j] - 2 * a[i][j]. Here row_second[i] and col_second[j] give the second smallest values in row i and column j, respectively.

The branching is performed on the entry for which the tolerance is maximal. The smallest entry in the cost matrix has a non-negative tolerance (positive if it is unique) because in particular it is also the smallest entry in its row and column. However, in general this smallest entry does not have maximal tolerance. The above given 5 x 5 cost matrix provides an example: the smallest entry is (a, 1), which has value 20 and tolerance 23 + 21 - 2 * 20 = 4, but the tolerance is maximal for (b, 3), which has value 25 and tolerance 28 + 36 - 2 * 25 = 14.
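
The definition can be sketched as follows. The 3 x 3 matrix is a hypothetical one, chosen so that the same numbers 4 and 14 appear as in the text; it is not the 5 x 5 matrix from the figure. The function names are illustrative, and n >= 2 is assumed so that second smallest values exist.

```python
def tolerances(a):
    """tolerance[i][j] = row_second[i] + col_second[j] - 2 * a[i][j]."""
    n = len(a)
    row_second = [sorted(row)[1] for row in a]
    col_second = [sorted(a[i][j] for i in range(n))[1] for j in range(n)]
    return [[row_second[i] + col_second[j] - 2 * a[i][j] for j in range(n)]
            for i in range(n)]

def branching_entry(a):
    """Position (i, j) of the first entry with maximal tolerance."""
    t = tolerances(a)
    n = len(a)
    return max(((i, j) for i in range(n) for j in range(n)),
               key=lambda pos: t[pos[0]][pos[1]])

a = [
    [20, 40, 23],
    [30, 25, 28],
    [21, 36, 30],
]
t = tolerances(a)
print(t[0][0])             # tolerance of the smallest entry, value 20
print(branching_entry(a))  # entry with maximal tolerance, value 25
```

Here the smallest entry 20 has tolerance 23 + 21 - 40 = 4, whereas the entry 25 has tolerance 28 + 36 - 50 = 14 and is selected for branching.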

It is a good idea to also apply this selection strategy when computing an upper bound by picking edges one-by-one without backtracking. If this is done, the upper bound along any left path is constant, which considerably reduces the number of calls to the method for computing upper bounds.

If we apply the branch-and-bound algorithm with the modified selection rule to the 5 x 5 cost matrix considered before, the tree develops quite differently. Even for this very small problem the tree becomes considerably smaller. It may be expected that for larger values of n the advantages of this tolerance-based selection become even more evident.

Better Branching for Assignment

Better Lower Bounds

The main weakness of the current algorithm is the primitive way the lower bounds are computed. As these, in a minimization problem, are the bounds which decide whether a branch is cut or developed further, improved lower bounds will have a great impact on the number of processed nodes. Of course, we must find a trade-off between effort and quality. In the following we present a bound which can be computed with only a constant factor of extra work.

So far the lower bounds were taken as the maximum of the sum of the row minima and the sum of the column minima. We consider in more detail the sum of the row minima. Assume l > 1 row minima are located in the same column j of the cost matrix. Let the rows for which this happens have indices i_0, ..., i_{l - 1}. Clearly only one of them can actually choose this minimum value. This observation can be worked out in many ways. The simplest is to let l - 1 contribute the second smallest value. Only the entry for which the difference between smallest and second smallest value is maximal contributes its smallest value. In a formula for these l rows this gives the following contribution:

  sum_{k = 0}^{l - 1} row_second[i_k] -
  max_{k = 0}^{l - 1} (row_second[i_k] - a[i_k][j]).
Looking at all rows, we see that the contribution of row i with minimal value at position a[i][j] is now given by row_second[i] unless there are no further rows which have their minimum in column j or unless row_second[i] - a[i][j] is maximal among all those rows for which the minimum lies in column j.
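
The improved bound on the row side can be sketched as follows. The formula for a group of rows is applied to every column that contains at least one row minimum; for a group consisting of a single row it simply yields the row minimum. The function names and the 3 x 3 matrix are hypothetical, and n >= 2 is assumed.

```python
from collections import defaultdict

def simple_row_bound(a):
    """Sum of the row minima."""
    return sum(min(row) for row in a)

def improved_row_bound(a):
    """Sum of row_second over all rows, minus, for each column holding
    row minima, the largest gap row_second[i] - a[i][j] among its rows."""
    gaps = defaultdict(list)          # column j -> gaps of rows with minimum in j
    total = 0
    for row in a:
        j = min(range(len(row)), key=row.__getitem__)   # column of the row minimum
        second = sorted(row)[1]                         # row_second
        total += second
        gaps[j].append(second - row[j])
    # In each group only the row with the largest gap keeps its smallest value.
    return total - sum(max(g) for g in gaps.values())

a = [
    [1, 5, 9],
    [2, 3, 8],
    [4, 6, 7],
]
print(simple_row_bound(a), improved_row_bound(a))
```

In this matrix all three row minima lie in column 0. The simple bound is 1 + 2 + 4 = 7, the improved bound is (5 + 3 + 6) - 4 = 10, and the optimum assignment costs 11.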

It depends on the nature of the matrix how big the difference between the simple and the improved lower bound will be. If the weights are more or less uniformly distributed, then the expected number of columns with l minimal values is given by n / (e * l!). So, in that case the expected number of rows in which the second smallest value is taken is given by n / e * sum_{l >= 1} (l - 1) / l! = n / e * (sum_{l >= 1} l / l! - sum_{l >= 1} 1 / l!) = n / e * (e - (e - 1)) = n / e. Here we used that sum_{l >= 1} l / l! = sum_{l >= 1} 1 / (l - 1)! = sum_{l >= 0} 1 / l! = e. This is quite good in itself, but the improvement is much larger if there are columns with many small values (attractive apartments, handsome men, well-paid jobs): then almost all rows are forced to accept a second-best choice.
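
The estimate n / e can be checked numerically. The sketch below first evaluates the series sum_{l >= 1} (l - 1) / l!, which should equal 1, and then runs a small simulation. The simulation assumes that for uniformly distributed weights the column of each row minimum is uniformly random and independent, so it draws these columns directly instead of generating whole matrices. The function names, the seed and the parameters are arbitrary choices.

```python
import math
import random

def series(terms=30):
    """Partial sum of sum_{l >= 1} (l - 1) / l!, which converges to 1."""
    return sum((l - 1) / math.factorial(l) for l in range(1, terms))

def simulated_fraction(n, trials, seed=0):
    """Average fraction of rows that must take their second smallest value."""
    rng = random.Random(seed)
    second_best = 0
    for _ in range(trials):
        # Column in which each of the n rows has its minimum.
        columns_hit = {rng.randrange(n) for _ in range(n)}
        # In every column that is hit, exactly one row keeps its minimum.
        second_best += n - len(columns_hit)
    return second_best / (n * trials)

fraction = simulated_fraction(n=200, trials=50)
print(series(), fraction, 1 / math.e)
```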

If we apply the branch-and-bound algorithm again to the same cost matrix applying both improvements, we see that the tree develops in the same way, as it should, but that right from the start all lower bounds are larger, and that this helps to cut off the left-most branch at its root, saving four more nodes.

Better Lower Bounds for Assignment

The algorithm has been worked out into a running Java program which can be downloaded here. The program asks the value of n, then the values of the cost matrix can be entered. Alternatively a random n x n cost matrix is generated. Both selection approaches and the improved lower bounds have been implemented. In this program for each node of the branching tree a new cost matrix is created. This leads to an extremely simple program, but it is not the best one can do.

The above idea for computing improved lower bounds can be further refined. In the current implementation, it is only assured that in each row or column at most one first choice can be selected. A good reason for this is computational: the improved lower bound as given can still be computed in O(n^2) time, just as the basic lower bound. However, in general there will even be many second choices hitting upon each other and on first choices. Taking into account all collisions between the first k choices means that a weighted bipartite matching problem with two sets of n nodes each and (k + 1) * n edges must be solved. It is to be expected that already for very small values of k this gives almost perfect lower bounds. Of course this comes at quite considerable costs and implementation effort.

Branch-and-Bound is Exponential

It is not hard to construct inputs for which any of the above presented branch-and-bound strategies has exponential running time. Even for the assignment problem, for which polynomial-time solutions exist, this is the case. Without pushing our example to the limit, we show that the time consumption is Omega(2^{n / 2}).

Let A_x be the following 2 x 2 matrix:

Matrix A_x

The cost matrix M is now composed of n / 2 copies of A_x on the diagonal (so that the diagonal of each A_x lies on the diagonal of M) for suitable values of x. Number the 2 x 2 submatrices A_x on the diagonal from the upper-left corner to the lower-right corner and denote the value of x in submatrix i, 0 <= i < n / 2, by x_i. The values x_i are chosen so that x_{n / 2 - 1} = 1 and x_i = 2^{n / 2 - 2 - i} for all smaller i. All other values are equal to the sum of all values x_i, that is, they are 2^{n / 2 - 1}. These are large values, but the total size of the input is polynomial in n, because it is bounded by n^2 * log (2^{n - 1}) < n^3.
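
The values x_i and the size estimate can be checked with a few lines. The sketch only computes the x_i; the matrix A_x itself is given in the figure and is not reconstructed here. The function name x_values is hypothetical, and n is assumed to be even.

```python
def x_values(n):
    """x_{n/2 - 1} = 1 and x_i = 2^{n/2 - 2 - i} for all smaller i."""
    half = n // 2
    return [1 if i == half - 1 else 2 ** (half - 2 - i) for i in range(half)]

n = 10
xs = x_values(n)
filler = sum(xs)                       # value used for all other entries
bits = n * n * filler.bit_length()     # rough size of the input in bits
print(xs, filler, bits)
```

For n = 10 this gives the values 8, 4, 2, 1, 1, whose sum is 16 = 2^{n / 2 - 1}.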

Hard Cost Matrix

At this point it is important to assume that the branching and the upper-bound computation choose the first of several equally good alternatives (though using slightly larger numbers it is easy to construct an example in which the greedy algorithm has a unique choice). So, if we have M_11 = 0 in a matrix M with all M_ij >= 0, the first choice will be to take a = 0. Under this assumption, for every A_{x_i}, the greedy algorithm will always first choose the assignments so that we get cost 0 + x_i, and only after evaluating the whole subtree find the better assignment with cost 0 + 0. Then it will repeat all the wrong decisions at the lower level, before finding the right choices again. Here it is important that the x_i become larger for small i, otherwise the improved upper bound would allow cutting a large part of the later visited nodes.

Thus, we get a full binary tree of depth n / 2. Such a tree consists of Theta(2^{n / 2}) nodes. This is exponential in the size of the input, which is bounded by O(n^3). Of course, for this particular case it would not be hard to modify the strategy so that the algorithm would immediately find the optimal solution. However, a different cost matrix would be bad for this modified strategy. Branch-and-bound is an effective method for solving hard problems which can be programmed extremely easily, but we should not hope to obtain guaranteed polynomial-time solutions when using it.

Bad Case of Assignment with Branch and Bound

Polynomial Time Solutions

Branch-and-bound as a solution for the assignment problem may be fine in practice even though we know that there are instances for which it is exponential. However, the fact that branch-and-bound is exponential for the assignment problem does not imply that the problem cannot be solved in polynomial time: interpreting it as a bipartite weighted matching problem, we see that it is the intersection of two matroids, for which there are general polynomial-time algorithms. For this specific problem there are even slightly more efficient algorithms. One of the best-known is the so-called Hungarian algorithm, which has running time O(n^3).

So, should we rather use this polynomial-time algorithm for solving the assignment problem? Maybe yes, but not certainly: the branch-and-bound algorithm is very simple and often the number of visited nodes is actually quite small. With good management the upper and lower bounds can be computed efficiently, so there is reasonable hope that the branch-and-bound approach is more efficient.

The most famous case of such a dualistic situation is for solving linear programming problems, LPs, (optimization problems with conditions formulated as linear inequalities). The simplex algorithm works fine for most inputs, but one can construct inputs for which it takes exponential time (the algorithm makes progress in every iteration, but for special cases this progress can be very small). The ellipsoid method guarantees polynomial time but has gained more practical interest only recently.

The conclusion is that in general polynomial-time algorithms are what we are striving for, but that for some problems, for which the polynomial-time algorithms are rather elaborate and inefficient, it may in practice nevertheless be better to apply a strategy like branch-and-bound even though it is inherently exponential.

Trying the current implementation of the branch-and-bound algorithm solving the assignment problem shows that the time consumption strongly increases with n, too strongly to be competitive against an O(n^3) algorithm. However, this implementation is only a first step. All the ideas that have helped to make the branch-and-bound algorithm for the Steiner-tree problem more efficient can be applied here as well: refined analysis may help to obtain induced decisions; better upper and lower bounds may help to cut more branches. Of course, then the simplicity argument does not apply anymore.

Making Change Revisited

Now that we know what "branch-and-bound" means, we can try to classify the improved algorithm for making change which in many cases was recursively evaluating only one of the branches because for the second branch it could be concluded that it could never give a shorter paying schedule. In a certain sense this is a branch-and-bound algorithm. However, there are important differences as well.

In a pure branch-and-bound algorithm it is possible that there are different paths to the same position and that from there the same computation is performed twice. If for example in the 1-4-6 money system the amount to pay is 1899, then the path (1000, not 1000, 400, 400, not 400, not 100) brings us to the same table position as (not 1000, 600, 600, 600, not 600, not 400, not 100). In a normal branch-and-bound algorithm this does not lead to bounding for the second path, because the lower bound for this node is 5 and the best computed upper bound is 9. So, there is no reason to stop here.

Only the use of a table allows saving recomputing earlier computed values. In the above example both paths have the same length, but in general it may be that the second time we reach a node the path leading there is shorter, and therefore it cannot be excluded that the second visit leads to a new optimal solution. For example, if there is also a coin of 900, then there is a non-greedy path with two coins to this node, which is leading to the optimal solution. So, the value of this node either must be still available or it must be recomputed.
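
The effect of the table can be made visible with a small sketch. It is not the improved algorithm for making change discussed earlier, and it does no bounding: it only branches on "take the current coin" versus "never take it again" and counts how many positions are evaluated with and without a table. The coin system 1, 4, 6, 10, 40, 60 and the amount 99 are a small hypothetical instance; the function name min_coins is illustrative.

```python
from functools import lru_cache

def min_coins(amount, coins, use_table=True):
    """Return (minimum number of coins, number of evaluated positions)."""
    coins = tuple(sorted(coins, reverse=True))
    calls = 0

    def solve(k, rest):
        # Position (k, rest): only coins[k:] may still be used to pay `rest`.
        nonlocal calls
        calls += 1
        if rest == 0:
            return 0
        if k == len(coins):
            return float("inf")
        best = solve(k + 1, rest)                        # branch "not coins[k]"
        if coins[k] <= rest:
            best = min(best, 1 + solve(k, rest - coins[k]))  # branch "coins[k]"
        return best

    if use_table:
        # The table: every position is evaluated at most once.
        solve = lru_cache(maxsize=None)(solve)
    return solve(0, amount), calls

small = min_coins(8, (1, 4, 6))
with_table = min_coins(99, (1, 4, 6, 10, 40, 60), use_table=True)
without_table = min_coins(99, (1, 4, 6, 10, 40, 60), use_table=False)
print(small, with_table, without_table)
```

Both variants return the same number of coins, but the variant without a table evaluates positions such as "79 left, only coins of 1 allowed" once for every path leading there.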

It depends on the problem whether it is profitable to store values for earlier computed subproblems. For some problems it can be excluded that the same subproblem can be reached over different paths through the branching tree. In that case there is nothing to gain by storing values. In other cases it cannot be excluded but is highly unlikely. For example, for knapsack with weights which range over a large interval, it is unlikely that the sums of the weights of two different subsets of packets are equal. So, in this case it is probably cheaper to recompute a few values than to maintain the additional data structure.

In games like chess in a sequence of three moves starting with a white move it mostly does not matter in which order white performs its two moves, that is, (w_1, b, w_2) is mostly equivalent to (w_2, b, w_1), where w_1 and w_2 are moves by white and b is a move by black. This case is somewhat different though. Here the problem can be overcome by only considering the lexicographically first of several permutations. But even in chess there are examples of coming to the same situation along really different paths: if both players move back and forth the same piece (Kh5-h6, Tf5-f6, Kh6-h5, Tf6-f5), then the sequence is not a permutation of the empty sequence, but the situation is the same (except for the fact that two more moves have been made), and there is no need to consider these nodes separately.

Storing the values may slow down the computation by at most a constant factor. In the light of the possibly large savings this is no big deal. The extra memory consumption is more serious. A pure branch-and-bound algorithm with DFS tree-evaluation order uses memory proportional to the depth of the tree. By saving values, the memory becomes proportional to the number of evaluated nodes in the tree. In general this means an exponential increase. A solution is to impose a maximum on the number of stored values. Once this maximum has been reached either no new values can be added or one stored value has to be discarded for each added value. A reasonable criterion for managing the set of stored values is the least-recently-used (LRU) strategy.

Storing values so that it can cheaply be figured out which values are stored, while also allowing the least-recently-used value to be discarded, requires a dual data structure. For finding the values we need a dictionary ADT, which supports insert, delete and find. A node of the dictionary must contain the key which corresponds to the subproblem it stands for and the previously computed value. In addition there should be a queue for implementing the LRU strategy. From the entry in the dictionary there must be a pointer to the corresponding position in the queue, and from the queue there must be a pointer to the entry in the dictionary. It is not necessary to store anything else in the queue. When coming to a subproblem which corresponds to a key k, k is searched in the dictionary. There are two possible situations:

If the queue is implemented with a doubly linked list all queue operations can be performed in O(1) time. The dictionary can be realized as a balanced search tree or as a hash table. The first gives guaranteed performance, the second is generally more efficient.
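
A minimal sketch of such a bounded table is given below. It relies on Python's collections.OrderedDict, which combines a hash table with a doubly linked list and therefore plays the role of dictionary and queue at the same time. The handling of the two situations is an assumption of this sketch: a key that is found is moved to the back of the queue, and a key that is not found is inserted after its value has been computed, discarding the least-recently-used entry if the table is full. The class name LRUTable is hypothetical, and the capacity is assumed to be at least 1.

```python
from collections import OrderedDict

class LRUTable:
    """Table of computed values holding at most `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # front = least recently used

    def find(self, key):
        """Return the stored value, or None if the key is not stored."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # the entry is now the most recently used
        return self._entries[key]

    def insert(self, key, value):
        """Store a newly computed value, discarding the LRU entry if full."""
        if key in self._entries:
            self._entries.move_to_end(key)
        elif len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)   # discard least recently used
        self._entries[key] = value

table = LRUTable(capacity=2)
table.insert("a", 1)
table.insert("b", 2)
table.find("a")          # "a" becomes more recent than "b"
table.insert("c", 3)     # the table is full, so "b" is discarded
print(table.find("a"), table.find("b"), table.find("c"))
```

All operations take expected O(1) time, corresponding to the hash-table variant mentioned above.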

LRU-Based Entry Management

It is a matter of taste whether one designates these algorithms as branch-and-bound algorithms enriched with a table, or as dynamic-programming algorithms which use bounding in order not to compute values which cannot be better than an already known value. This classification will mostly depend on how the algorithm was developed. If, as in the case of making change, the algorithm is obtained by stepwise refining a dynamic-programming algorithm, it is natural to designate the final algorithm as a refined dynamic-programming algorithm.

Exercises

  1. Consider a knapsack problem with W = 12 and 7 packets with the following (w_i, v_i) pairs: (6, 25), (2, 8), (5, 19), (7, 26), (3, 10), (4, 10), (1, 1). Use branch-and-bound to compute the maximum achievable value of V. Give the complete tree and indicate all values that must be computed.

  2. Consider the problem of computing a minimum vertex cover. A vertex cover of an undirected graph G = (V, E) is a subset S of the nodes so that for any edge (u, v) either u in S or v in S (or both). Earlier, greedy and backtracking algorithms were considered for this problem. Branch-and-bound allows the optimum solution to be found far more efficiently than backtracking.

  3. Give an example of a class of graphs in which the weight of the reduced minimum spanning tree becomes larger than the weight of the Steiner minimum tree by an arbitrarily large factor.

  4. Consider the following heuristic for computing a Steiner tree T of a graph G (first described by Kou, Markowsky and Berman in Acta Informatica 15, pp. 141-145, 1981): Compute the all-pairs shortest-paths distance table for the set of terminals. This table can be interpreted as the weights of the edges of a complete graph G'. Determine a minimum spanning tree T' of G'. Each edge of T' corresponds to a path in G. Let T be the union of all these paths.

  5. Prove that, under a natural condition on the edge weights, the optimum TSP solution is at most twice as expensive as the cost of a minimum spanning tree. What condition should be satisfied? Give an example graph for which this estimate is sharp. Give another example showing that the cost of a TSP tour may be only marginally higher than that of a minimum spanning tree. The conclusion is that taking twice the cost of a minimum spanning tree gives a two-approximation to TSP, and that this bound cannot be sharpened. In general, an approximation algorithm is said to give a k-approximation if its estimate lies within a factor k from the real answer.

  6. How good is the greedy algorithm for finding approximations to the TSP? It clearly gives an upper bound, but how much more expensive than the optimum solution might a greedy solution be? Prove that this method gives a k-approximation for some constant k, or present a class of examples showing that there is no such constant k.

  7. Prove that a two-swap on Euclidean graphs is cost-reducing if and only if the involved edges cross each other. From this it follows that any crossing-free tour is optimal with respect to two-swaps. Knowing this, present a class of graphs with tours which are two-swap optimal, but nevertheless do not give a k-approximation for any constant k.

  8. Give a class of graphs and a tour on them, which for any sufficiently large size cannot be improved by any k-swap, for some constant k, but which nevertheless are not giving optimum solutions to the traveling salesman problem.

  9. This question considers an implementation of the presented branch-and-bound algorithm for TSP and improvements thereof.

  10. Use the program solving the assignment problem using branch-and-bound for performing some tests.

  11. In the program solving the assignment problem using branch-and-bound for every node of the branching tree a new matrix with the correct values is created from the matrix one level higher in the tree. This was done for simplicity reasons, but is not very efficient. In this question you are asked to analyze how serious this is, how the implementation can be improved, to implement the improvement and to measure the impact on the running time. The cost matrix has size n x n.

  12. It appears that computing the upper bounds for the assignment problem is more expensive than computing the lower bounds. Such an unbalanced situation is never good: at most doubling the cost for processing the nodes we can apply an alternative lower bound procedure which costs the same as computing the upper bounds. Alternatively, using a simpler upper-bound procedure, the cost for the nodes can be reduced a lot. A simpler method for finding the upper bound is based on a simpler greedy algorithm, trivially running in O(n^2) time. The idea is to assign the agents in order to the cheapest task which is not yet taken by any of the agents with smaller indices. The branching choice is now also simplified: at top level we are branching on agent a, on the next level on agent b and so on.

  13. Considering that the computation of the upper bounds for the assignment problem is rather an elaborate procedure even though it can be improved considerably, it may make sense to invest more for computing the lower bounds as well: in a minimization problem the lower bounds are more important than the upper bounds. The current lower bounds are obtained by replacing the intersection of two graphic matroids by computing the value of each of these matroids separately and taking the maximum. The following is an alternative which may or may not be better: reduce the number of nodes in each of the sets of the bipartite graph from n to a value n' = n / x by fusing subsets of x nodes on either side to a single node. The resulting graph has many parallel edges but at first only the lightest one is visible. So, we have a complete bipartite graph G'_0 with n' nodes on either side. Compute a minimum-weight matching for G'_0. If an edge (i, j) is selected, then it is replaced by the next lightest edge between i and j. This gives a new complete bipartite graph G'_1. More generally: x matchings are computed in the graphs G'_0, ..., G'_{x - 1}. For computing the matchings in the graphs G'_i, we may either use the algorithm recursively or apply the Hungarian algorithm which has running time O(n'^3).

  14. In the current implementation of the program solving the assignment problem using branch-and-bound, the lower bound is recomputed in each node of the tree from scratch. Because the computation of the upper bounds is even more expensive, this is not that serious, but above we considered how to make this cheaper, so it may be worth even to optimize the computation of the lower bounds.

  15. The example that the branch-and-bound algorithm for the assignment problem is exponential was based on the assumption that of several equally good possibilities the first is taken. Modify the example so that the greedy algorithm at any step only has one choice. The branching tree should be the same as before, all weights should be integral. The size of the input should still be polynomial in n and this should be demonstrated.

  16. We have seen that performing branch-and-bound to the assignment problem may result in a tree with 2^{n / 2 + 1} - 1 nodes. Consider whether there are even much worse problem instances or argue that this is about the worst one can get. Notice that there is still a huge gap between 2^{n / 2} and n!.

  17. The bad example for the branch-and-bound assignment algorithm was using numbers that were exponential in n. Consider the time for cost matrices of this form with x_i = 1 for all i. Prove that now the time is polynomial in n.

  18. The bad example for the branch-and-bound assignment algorithm was using numbers that were exponential in n. Construct a class of inputs for which the branch-and-bound algorithm requires time exponential in n for which all entries of the cost matrix are polynomial in n, or argue that one really needs big numbers.

  19. Give a dynamic-programming formulation of the assignment problem for an n x n cost matrix M which requires at most O(2^n) memory. Give an estimate for the computation time. Conclude that, even though this is not very good, it is nevertheless much better than trying all permutations, which takes Theta(n * n!) = Theta(n^{3/2} * (n/e)^n). Illustrate the algorithm by applying it to the 5 x 5 cost matrix considered above.

  20. Consider a variant of the queens problem. We consider the slightly more realistic problem that several queens are of the same color, that is, they can be placed in arbitrary positions as long as no two queens are placed in the same position. A position of the board is called covered if one of the queens can reach this position in one move, going horizontally, vertically or diagonally. The task is to compute MaxCov(n, k). Here MaxCov(n, k) gives the maximum number of covered positions when placing k queens on an n x n board.

    Sub-Optimal Solutions to the MinCov(8, 8) Problem

    Clearly k must be at most n^2. For k >= n, MaxCov(n, k) = n^2 in a trivial way. But how about for example MaxCov(n, n / 2)? What is the smallest k for which MaxCov(n, k) = n^2? Write a branch-and-bound program which can be used to answer these questions for modest values of n, for example n = 8, 10 and 12. In principle there are (n^2 over k) placements. This is a very large number in itself. It is crucial that the same configuration is considered only once, some simple symmetry arguments may help to save even slightly more.

    Consider the positions in a fixed order and develop the tree in DFS order. Strong lower bounds are obtained by greedily placing the remaining queens, placing each queen at the position which maximizes the number of additionally covered positions. For the bounding it is more important to have a good upper bound.

    After placing j queens an upper bound is obtained as the sum of the number of currently covered positions and an upper estimate of the number of positions that can be covered with the remaining k - j queens. Such an estimate is obtained as the sum of four greedy contributions: the maximum numbers of positions additionally covered when greedily placing k - j pieces that either only move horizontally or only vertically or only normal-diagonally or only anti-diagonally.

    Unfortunately the given bound is often a considerable over-estimate. A much stronger bound is obtained by first greedily placing k - j horizontally moving pieces, then k - j vertically moving pieces, then k - j normal-diagonally moving pieces and finally k - j anti-diagonally moving pieces. In this way many double countings are prevented. However, this greedy placement does not really give an upper bound. This leads to unjustified cutting and this may even concern the branch with the optimum. If one adds a correction (2 * (k - j - 1) is a good choice) this problem may be alleviated, but at best one obtains a very good approximation in this way. The advantage is that the computation becomes much faster. For time-consuming problems one might even consider first computing a strong lower bound with this approach before running the exact algorithm.

  21. Consider a variant of the above problem: now we want to compute MinCov(n, k), the minimum number of covered positions when placing k queens on an n x n chess board.

    Sub-Optimal Solutions to the MinCov(8, 8) Problem

    An interesting question is how many queens can be placed so that MinCov(n, k) < n^2. Another question is how MinCov(n, n) develops as a function of n. Write a branch-and-bound program which can be used to answer these questions for modest values of n, for example n = 8, 10 and 12.

    Consider the positions in a fixed order and develop the tree in DFS order. Strong upper bounds are obtained by greedily placing the remaining queens, placing each queen at the position which minimizes the number of additionally covered positions.

    After placing j queens a lower bound is obtained as the sum of the number of currently covered positions and a lower estimate of the number of positions that will be covered with the remaining k - j queens. A good lower bound has a strong impact on the computation time. At the same time it is not clear how to obtain such a bound. A modest idea is the following: determine for all positions how many free positions would be covered by placing a queen there. Sort these numbers in an array a[]. Number a[k - j - 1] gives a lower bound.

  22. In the chapter on backtracking we considered the game with the 32 pins on a 33-position board. The backtracking solution can easily be turned into a branch-and-bound solution by adding upper and lower bounds. In a minimization problem, a good lower bound is most important.

    A simple but effective bound is obtained by considering the number of components. A component is a subset of the pins which cannot possibly come in contact with the pins of any other component. Clearly, eventually at least one pin from every component will survive and thus the number of components gives a lower bound on the number of remaining pins.

    How does one determine the components? It does not suffice to only consider the adjacency structure. One must also consider the jumping possibilities. Therefore, to the existing real pins we must add virtual pins: a virtual pin is added to any empty position which can be reached by a jump, that is, when there are pins in the two positions immediately to the north, south, west or east of it. The process of adding virtual pins must be iterated until no more can be added. Because in every round at least one pin is added, this takes at most O(n^2) time, for a game board with n positions. Once the virtual pins are placed, the components can easily be determined by any connected components algorithm, particularly easy is the algorithm based on union-find.

    One can impose an upper bound of 1, because we know that this is possible. Even if we did not know it already, such a bound can be imposed to test faster whether it can be achieved: an upper bound of 1 allows to immediately bound the expansion of the tree as soon as two components arise, so this gives the strongest possible pruning. On the other hand, it is somewhat unnatural to impose this bound from the start. Particularly, playing the same game on a different game board may not have an optimum of 1. Therefore, it is more elegant to use as upper bound the best that has been achieved so far. In practice even this will soon become a very small number (achieving 1 is hard, but achieving 3 is easy). This has the additional advantage that one soon gets a good solution, before finding the best one after a possibly long search.





Graph Algorithms

As we have seen, the assignment problem can be solved with branch-and-bound, but there are also polynomial-time solutions for this problem. In this chapter we will consider these. Branch-and-bound is simpler, more general and possibly faster in practice, but inherently exponential. With the study of matching we touch on the area of problems called combinatorial optimization. Many of these problems can be formulated as problems of graphs. Therefore, it is essential to know how to solve a basic set of graph problems: computing minimum spanning trees, shortest paths, matchings and maximum flow. These and some other problems are treated in this chapter.

Minimum Spanning Trees

Prim's Algorithm

In the chapter on greedy algorithms we have studied Kruskal's algorithm for finding a minimum spanning forest. There is an alternative algorithm with comparable performance, called Prim's algorithm. It only works for connected graphs; for a graph consisting of several components, it should be run for each component separately.

The idea in Prim's algorithm is to grow the minimum spanning tree from a single node: Initially the tree is empty. The set of reached nodes consists of a single node s, which we will call the starting node. At any later time, the lightest edge leading from a tree node to a non-tree node is added.

Prim's MST Algorithm

This algorithm is similar, but not identical to Dijkstra's algorithm: it is not true that for any node v all nodes on the shortest path from s to v are in the spanning tree. Nevertheless, we can implement Prim's algorithm just like Dijkstra's algorithm. In this case the entry of a non-tree node v in the priority queue is not the length of the shortest discovered path from s to v, but the weight of the lightest edge leading from a tree node u to v.

This algorithm does not need a union-find data structure: for testing whether edges are potentially interesting, it is sufficient to maintain an array of bits marking the tree nodes. When using a priority queue, in the course of the algorithm at most n inserts, n deletemins and m decreasekey operations are performed. With an appropriate priority queue, these operations can be performed in O(n * log n + m) time. For graphs with m > n * log n, this is less than the time O(m * log n) of Kruskal's algorithm.
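The idea can be sketched as follows. This is a minimal sketch, not the program referred to in this text: the class PrimSketch and its methods are hypothetical names, and because java.util.PriorityQueue has no decreasekey, outdated entries are simply left in the queue and skipped when they surface. Under that assumption the running time is O(m * log n) rather than O(n * log n + m).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class PrimSketch {
    // Returns the weight of a minimum spanning tree of a connected undirected
    // graph. adj.get(u) holds a pair {v, weight} for every edge (u, v).
    static long mstWeight(List<List<int[]>> adj, int s) {
        int n = adj.size();
        boolean[] inTree = new boolean[n];   // array of bits marking the tree nodes
        // Queue entries are {weight of an edge leading to v, v}.
        PriorityQueue<int[]> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        pq.add(new int[] {0, s});
        long total = 0;
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            int v = top[1];
            if (inTree[v]) continue;         // outdated entry: v was reached earlier
            inTree[v] = true;
            total += top[0];
            for (int[] e : adj.get(v))       // offer all edges to non-tree nodes
                if (!inTree[e[0]]) pq.add(new int[] {e[1], e[0]});
        }
        return total;
    }

    // Adds the undirected edge (u, v) with weight w.
    static void addEdge(List<List<int[]>> adj, int u, int v, int w) {
        adj.get(u).add(new int[] {v, w});
        adj.get(v).add(new int[] {u, w});
    }

    public static void main(String[] args) {
        List<List<int[]>> adj = new ArrayList<>();
        for (int i = 0; i < 4; i++) adj.add(new ArrayList<>());
        addEdge(adj, 0, 1, 1); addEdge(adj, 1, 2, 2); addEdge(adj, 2, 3, 3);
        addEdge(adj, 0, 3, 4); addEdge(adj, 0, 2, 5);
        System.out.println(mstWeight(adj, 0));
    }
}
```

The array inTree[] plays the role of the array of bits mentioned above; no union-find structure is needed.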

Click here to see the algorithm integrated in a running Java program. In this program the priority queue is implemented using a binary heap. The heap has been implemented using arrays. The array key[] contains the keys and the array ind[] contains the indices of the elements in the heap. The array pos[] gives the position of an element with specified index in the heap. So, for each element i represented in the heap, ind[pos[i]] == i. Here we exploit that the indices of the entries in the priority queue are from a finite domain. So, we can use direct addressing. In general one should use a dictionary ADT for this, for example by using hashing. In the given implementation deletemin and decreasekey take O(log n), most other operations take constant time. So, the overall running time is O((n + m) * log n). However, in practice the cost of the decreasekeys will tend to take much less than m * log n because not all edges result in a decreasekey, and because nodes do not move every time from the bottom of the heap to its top. The file also contains two alternative MST algorithms. They lead to fewer percolate operations and for sparse graphs, for example for graphs with n = m, this reduced number of percolates even leads to a reduced time consumption.
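A heap organized with the three arrays key[], ind[] and pos[] might look as follows. This is only a sketch under the assumption that the elements are numbered 0, ..., capacity - 1; the class name IndexedMinHeap and the method names are chosen here for illustration and need not agree with the program mentioned above.

```java
import java.util.Arrays;

class IndexedMinHeap {
    private final int[] key;   // key[i]: current key of element i
    private final int[] ind;   // ind[p]: index of the element at heap position p
    private final int[] pos;   // pos[i]: heap position of element i, -1 if absent
    private int size = 0;

    IndexedMinHeap(int capacity) {
        key = new int[capacity];
        ind = new int[capacity];
        pos = new int[capacity];
        Arrays.fill(pos, -1);
    }

    boolean isEmpty() { return size == 0; }

    boolean contains(int i) { return pos[i] >= 0; }

    void insert(int i, int k) {            // assumes that i is not in the heap
        key[i] = k;
        place(i, size);
        percolateUp(size++);
    }

    void decreaseKey(int i, int k) {       // assumes contains(i) and k <= key[i]
        key[i] = k;
        percolateUp(pos[i]);
    }

    int deleteMin() {                      // returns the index with the smallest key
        int min = ind[0];
        size--;
        if (size > 0) { place(ind[size], 0); percolateDown(0); }
        pos[min] = -1;
        return min;
    }

    // Puts element i at position p, maintaining ind[pos[i]] == i.
    private void place(int i, int p) { ind[p] = i; pos[i] = p; }

    private void percolateUp(int p) {
        int i = ind[p];
        while (p > 0 && key[ind[(p - 1) / 2]] > key[i]) {
            place(ind[(p - 1) / 2], p);    // move the parent one level down
            p = (p - 1) / 2;
        }
        place(i, p);
    }

    private void percolateDown(int p) {
        int i = ind[p];
        while (2 * p + 1 < size) {
            int c = 2 * p + 1;             // pick the child with the smaller key
            if (c + 1 < size && key[ind[c + 1]] < key[ind[c]]) c++;
            if (key[ind[c]] >= key[i]) break;
            place(ind[c], p);              // move the child one level up
            p = c;
        }
        place(i, p);
    }
}
```

Because pos[] is a plain array, finding an element for a decreasekey takes constant time; the percolation itself takes O(log n).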

Of course, in many practically important cases the time for Kruskal's algorithm can be reduced to either O(n * log n + alpha(n, m) * m) or even to O(alpha(n, m) * m). Here O(alpha(n, m) * m) is the time for performing m find operations on a structure of size n. Already for m a little bit larger than n, alpha(n, m) is a constant. The first bound applies when the trick of starting with a small subset of the edges is successful, the second when the weights of the edges are polynomial in n, so that the sorting can be performed in linear time.

The proof of correctness for Prim's algorithm goes analogously to that of Kruskal's algorithm. Assuming that until a certain point of time the partial solution is promising, we can concentrate on the first mistake. So, assume that until step t the constructed partial solution S is promising and that adding the edge e in step t leads to a non-promising partial solution S + e. Let F be the minimum spanning tree containing S. F + e contains a cycle. Let e' be the other edge in F connecting S and F - S. e' has weight at least as large as e, so F' = F - e' + e is not heavier than F and must therefore also be a minimum spanning tree. But F' contains S + e, in contradiction with the assumption that S + e is not promising.

Boruvka's Algorithm

Both Kruskal's and Prim's algorithm add one edge at a time. However, we can add many more edges without risk of losing optimality. This idea is used in an algorithm known as Boruvka's algorithm, which lends itself well to implementation on a parallel computer. The algorithm works in rounds. For a graph G, in every round the following steps are performed:
  1. Each node u traverses its adjacency list and determines the lightest edge (u, v). It sets next[u] = v. This induces a directed graph G' with n nodes and edges (u, next[u]).
  2. The cycles of length 2 in G' are resolved: each node u with next[next[u]] == u and u < next[u] sets next[u] = u. All edges (u, next[u]) with next[u] != u are added to the set S of edges in the minimum spanning tree.
  3. Each node u determines the root root[u] of the tree in G' to which it belongs.
  4. G is transformed and reduced. Any edge (u, v) is replaced by (root[u], root[v]) (but it should be remembered from which actual edge this edge stems). All nodes with root[u] != u are removed. From the new adjacency lists self-loops are removed. Of a set of parallel edges only the lightest is kept.
To understand how the algorithm works, we should consider the structure of the graph G'. G is undirected, so u ranges among the possible choices for next[v], where v = next[u]. This implies that the edge (v, next[v]) is certainly not heavier than the edge (u, next[u]). In other words, along any path in G' the weights of the edges are weakly decreasing. If we assume that the weights are all different (ties can always be broken by adding the index of the nodes as a secondary weight criterion), this means that the weights along a path must be strictly decreasing until we reach a cycle of length 2. Thus, G' is a directed forest, with cycles of length 2 at the roots.

Boruvka's MST Algorithm

Because each node chooses an edge, and because each edge is chosen at most twice, the number of different chosen edges is at least n / 2. Thus, at most n / 2 nodes survive a round, which means that after at most log n rounds only one node is remaining.

The correctness of Boruvka's algorithm can be proven with an extension of the argument of Prim's algorithm. The running time is more interesting. The number of nodes gets strongly reduced in every round, but this is not necessarily true for the number of edges. Each round can be performed in O(m) time, so the whole algorithm takes O(m * log n) time.
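The rounds can be sketched as follows. Note that this sketch deviates from the description above in one respect, which is an assumption made here for brevity: instead of explicitly building the reduced graph, the components are maintained in a union-find structure, and every round scans all m edges. Ties are broken by the index of the edge. The class and method names are hypothetical.

```java
import java.util.Arrays;

class BoruvkaSketch {
    // Find with path halving on the array parent[].
    static int find(int[] parent, int u) {
        while (parent[u] != u) { parent[u] = parent[parent[u]]; u = parent[u]; }
        return u;
    }

    // Edge i is lighter than edge j; ties are broken by the edge index.
    static boolean lighter(int[][] edges, int i, int j) {
        return edges[i][2] < edges[j][2] || (edges[i][2] == edges[j][2] && i < j);
    }

    // edges[i] = {u, v, weight}. Returns the weight of a minimum spanning
    // tree (of a minimum spanning forest if the graph is not connected).
    static long mstWeight(int n, int[][] edges) {
        int[] parent = new int[n];
        for (int u = 0; u < n; u++) parent[u] = u;
        long total = 0;
        int components = n;
        while (components > 1) {
            int[] best = new int[n];       // best[c]: lightest edge leaving component c
            Arrays.fill(best, -1);
            for (int i = 0; i < edges.length; i++) {
                int a = find(parent, edges[i][0]), b = find(parent, edges[i][1]);
                if (a == b) continue;      // self-loop in the reduced graph
                if (best[a] < 0 || lighter(edges, i, best[a])) best[a] = i;
                if (best[b] < 0 || lighter(edges, i, best[b])) best[b] = i;
            }
            boolean merged = false;
            for (int c = 0; c < n; c++) {
                if (best[c] < 0) continue;
                int a = find(parent, edges[best[c]][0]);
                int b = find(parent, edges[best[c]][1]);
                if (a == b) continue;      // this edge was chosen from both sides
                parent[a] = b;
                total += edges[best[c]][2];
                components--;
                merged = true;
            }
            if (!merged) break;            // no edges left between the components
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1, 1}, {1, 2, 2}, {2, 3, 3}, {0, 3, 4}, {0, 2, 5}};
        System.out.println(mstWeight(4, edges));
    }
}
```

Because all edges are scanned in every round, this variant takes O(m * log n) time up to the nearly constant cost of the find operations; it does not profit from a shrinking number of edges.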

The worst-case time consumption is not better than that of the above algorithms, but this method is nevertheless important. The most important aspect is that this algorithm can be parallelized quite well. Observing that a graph with n nodes can never have more than n * (n - 1) / 2 edges, it also follows that for very dense graphs the number of edges does decrease along with the number of nodes. This bounds the running time by O(n^2).

More generally, for a graph with m = n^2 / 4^k, the time can be bounded by O(log (n^2 / m) * m). We prove this. Denote by n_t the number of nodes after round t, m_t is defined analogously. n_0 = n, m_0 = m. We know that n_t <= n / 2^t and that m_t <= min{m, n_t^2}. Thus, using k = log_4 (n^2 / m), so that n^2 / 4^t = m / 4^{t - k}, we get

  T_total  = sum_{t >= 0} (a * n_t + b * m_t )
          <= sum_{t >= 0} (a * n / 2^t + b * min{m, n^2 / 4^t})
           = O(n) + b * (sum_{t = 0}^k m + sum_{t > k} m / 4^{t - k})
           = O(n) + O(k * m) + O(m).

Depending on the graph structure, Boruvka's algorithm may in practice be even better. As an example we consider the problem of computing a minimum spanning tree on a planar graph. On planar graphs the nodes can have arbitrary degree (a star graph with one central node and n - 1 neighbors is planar), but, on planar graphs without self-loops and parallel edges, in total the number of edges is linear in n (it is at most 3 * n - 6). Fusing adjacent nodes does not impair the planarity of the graph. So, for planar graphs we may assume that at all times m_t = O(n_t), thus the whole algorithm is running in O(n) time. This holds independently of the weights or the details of the structure. Neither Kruskal's nor Prim's algorithm achieves this.

Shortest Paths

The shortest path problem has several variants. Here we consider the following two:

In the unweighted case the SSSP Problem can be solved using breadth-first search in O(n + m) time. In the weighted case with all edge weights positive, Dijkstra's algorithm is the standard solution to the problem, requiring O(log n * (n + m)) time when using binary heaps and O(n * log n + m) time when using Fibonacci heaps.

Bellman-Ford Algorithm

If there are negative weights, then Dijkstra's algorithm is not necessarily correct. The problem is that one cannot be sure that the currently closest node to s has reached its final distance. The problem cannot be overcome by determining the most negative weight and adding its absolute value to all edges: this penalizes paths with more edges and may give a different solution.

Negative Edge Weights

Fortunately, there are several simple solutions. Unfortunately, these algorithms are considerably slower. One is known under the name Floyd-Warshall Algorithm and has running time O(n^3). Because of its simplicity and the very small constants hidden in the O(), this algorithm is nevertheless rather practical. It actually solves the all-pairs-shortest-path problem. If this is what one needs, it is a good option. Another algorithm, much closer to Dijkstra's is called the Bellman-Ford algorithm. The only data structure it needs is a queue (not a priority queue). It solves the SSSP problem in O(n * m). For sparse graphs this is quite ok, for dense graphs Floyd-Warshall may be better.

In general, the notion of shortest paths is not well-defined when there are so-called negative-cost cycles, cycles on which the sum of all costs is negative (or zero). One must either assume that no such cycles exist (that is what we will do in the following), or one must have a mechanism for detecting them.

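  // Assumes that n, INFINITY, the adjacency lists neighbors[][], the weights
  // weight[][] and a Queue class with isInQueue() are provided elsewhere.
  void negativelyWeightedSSSP(int s, int[] dist)
  {
    for (int v = 0; v < n; v++)
      dist[v] = INFINITY;
    Queue q = new Queue(n);
    dist[s] = 0;
    q.enqueue(s);
    while (q.notEmpty())
    {
      int v = q.dequeue();
      for (int w : neighbors[v]) // each neighbor w of v
        if (dist[w] > dist[v] + weight[v][w]) // shorter path
        {
          dist[w] = dist[v] + weight[v][w];
          if (! q.isInQueue(w)) // never hold a node twice
            q.enqueue(w);
        }
    }
  }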

It is essential to test that an element is not on the queue before enqueuing it, otherwise the capacity of the queue may be exceeded. This algorithm is really very simple, but also quite dumb: all nodes on the queue are dequeued, processed and possibly enqueued again and again. Nevertheless, there are no substantially better algorithms.

Does the algorithm terminate at all? Yes, if there are no negative-cost cycles. How long does it take to terminate? For the analysis of this we need to introduce the concept of a round:

The same notion of round is useful in the unweighted case: in that case round r consists of processing all nodes at distance r from s. In the case of weighted graphs processed with the Bellman-Ford algorithm, the time analysis is based on the following result:

Lemma: For any node v whose shortest path from s goes over r edges, dist[v] has reached its final value at the end of round r.

Proof: For any node v which lies on a negative-cost cycle, there is no shortest path consisting of a finite number of edges, so for these v the claim is void. So, in the following we only consider the subgraph consisting of all nodes not lying on a negative-cost cycle. If this subgraph is empty, there is nothing to prove. Otherwise, we proceed by induction. For r = 0, it trivially holds: s has reached its final distance. Now assume the claim holds for r - 1. Consider a node w whose shortest path from s goes over r edges. Let v be the last node on this path before w. v has reached its final distance in some round r' <= r - 1. When v was getting its final distance, it was enqueued a last time, and in round r' + 1 <= r the algorithm sets dist[w] to dist[v] plus the weight of the edge (v, w), which is the correct value. End.

Corollary: If there are no negative-cost cycles, then at most n - r + 1 nodes are processed in round r.

Proof: If there are no negative-cost cycles, then there is at least one node v whose shortest path from s consists of r' edges for any r' <= r - 2. These at least r - 1 nodes are reaching their final dist-value no later than round r - 2, and thus are not inserted in the queue anymore in round r - 1 or later. At worst the other n - r + 1 nodes may arise in the queue processed in round r. End.

Theorem: The Bellman-Ford algorithm has running time O(n * m).

Proof: The corollary implies that there are at most n rounds in each of which at most m edges must be processed. End.

The theorem is sharp in the sense that there are inputs for which the time consumption indeed is Omega(n * m); this can even happen for dense graphs. So, in theory the Bellman-Ford algorithm may take up to Omega(n^3) time and is therefore in the worst case not better than the Floyd-Warshall algorithm. However, this view is too pessimistic: in practice there will often be fewer rounds, and for those graphs that require many rounds, typically one does not have to process all edges every time.

With the BFS-based algorithm and Dijkstra's algorithm, it is always explicitly known which nodes have reached their final distance values. The Bellman-Ford algorithm is different: at the end the distances are correct but during the algorithm it is not yet known which nodes have reached their final values: a node that does not appear in the queue may reappear later on.

Click here to see the Bellman-Ford algorithm integrated in a working Java program. In this implementation there is no test for negative-cost cycles. So, it may happen that the program does not terminate!
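One way to add such a test is sketched below. This is not the program referred to above; the class name, the edge-array representation and the counter enqueued[] are assumptions made here. The test relies on the round structure: without negative-cost cycles a node is enqueued at most once per round and no distance changes after round n - 1, so a node that is enqueued more than n times must lie behind a negative-cost cycle.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

class BellmanFordSketch {
    static final long INFINITY = Long.MAX_VALUE / 4;

    // edges[i] = {v, w, weight} is a directed edge from v to w.
    // Returns the distances from s, or null if a negative-cost cycle
    // can be reached from s.
    static long[] sssp(int n, int[][] edges, int s) {
        // Build for every node the list of indices of its outgoing edges.
        int[] deg = new int[n];
        for (int[] e : edges) deg[e[0]]++;
        int[][] out = new int[n][];
        for (int v = 0; v < n; v++) out[v] = new int[deg[v]];
        int[] filled = new int[n];
        for (int i = 0; i < edges.length; i++)
            out[edges[i][0]][filled[edges[i][0]]++] = i;

        long[] dist = new long[n];
        Arrays.fill(dist, INFINITY);
        boolean[] inQueue = new boolean[n];
        int[] enqueued = new int[n];       // how often each node entered the queue
        ArrayDeque<Integer> q = new ArrayDeque<>();
        dist[s] = 0;
        q.add(s); inQueue[s] = true; enqueued[s] = 1;
        while (!q.isEmpty()) {
            int v = q.poll();
            inQueue[v] = false;
            for (int i : out[v]) {
                int w = edges[i][1];
                if (dist[w] > dist[v] + edges[i][2]) {     // shorter path
                    dist[w] = dist[v] + edges[i][2];
                    if (!inQueue[w]) {
                        if (++enqueued[w] > n) return null; // negative-cost cycle
                        q.add(w); inQueue[w] = true;
                    }
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1, 4}, {0, 2, 5}, {2, 1, -3}, {1, 3, 2}};
        System.out.println(Arrays.toString(sssp(4, edges, 0)));
    }
}
```

Nodes that cannot be reached from s keep the value INFINITY. The counter costs only constant extra time per enqueue, so the bound O(n * m) is not affected.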

In the following picture the operation of the Bellman-Ford algorithm is illustrated. Because the order in which the neighbors of a node are processed is not part of the specification of the algorithm, there are various correct possibilities; the picture shows one of them. The marked nodes are those that are not in the queue at the beginning of this stage and will not appear in the queue anymore.

Bellman-Ford Algorithm

Floyd-Warshall Algorithm

Text to appear in due time.

Recursive APSP Algorithm

Text to appear in due time.

Euler Tours

An Euler tour on a graph G = (V, E) is a cycle (a path starting and ending in the same node) visiting all edges in E exactly once. An Euler path is a path visiting all edges in E exactly once. The following fact is well-known:

Fact: A connected graph has an Euler tour if and only if all nodes have even degree. A connected graph has an Euler path if exactly two nodes have odd degree.

The positive side of this claim will follow from the construction hereafter. The negative side is not hard to prove. First we reduce the Euler path problem to the Euler tour problem: if v_1 and v_2 are the two nodes with odd degree, then we can add a single extra edge (v_1, v_2). Now all nodes have even degree. Construct a tour for this graph and let it start in v_1 by first traversing the added edge (v_1, v_2) (if the tour runs in the other direction, one can reverse it). This tour finishes in v_1 again. Omitting the first edge of the cycle gives a path running from v_2 to v_1 that visits every edge of the original graph exactly once.

How to construct a tour? The algorithm is simple:

The algorithm is correct, because of the following observations:

The whole algorithm takes O(n + m) time, which is surprisingly little for an apparently complex problem. There is an alternative algorithm, a modified DFS search, which constructs the tour without first explicitly constructing cycles. Looking into the algorithm it becomes clear that it is essentially doing the same. The efficiency of both algorithms is comparable. The given construction has the advantage of being easier to understand. Furthermore, in applications it is often not required to construct a single tour. In that case the simple algorithm based on "walk along the edges until you get stuck" is good enough and simpler.

Constructing an Euler Tour
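The alternative mentioned above, a modified DFS which constructs the tour without first explicitly constructing cycles, can be sketched as follows. This is a minimal sketch and not the cycle-merging construction itself; it assumes an undirected graph with at least one edge in which all nodes have even degree and all edges lie in one connected component. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

class EulerTourSketch {
    // edges[i] = {u, v}. Returns the nodes of a closed tour that starts and
    // ends in edges[0][0] and traverses every edge exactly once.
    static List<Integer> tour(int n, int[][] edges) {
        List<List<Integer>> inc = new ArrayList<>();  // incident edge indices per node
        for (int u = 0; u < n; u++) inc.add(new ArrayList<>());
        for (int i = 0; i < edges.length; i++) {
            inc.get(edges[i][0]).add(i);
            inc.get(edges[i][1]).add(i);
        }
        boolean[] used = new boolean[edges.length];
        int[] next = new int[n];            // first unexamined entry of inc.get(u)
        List<Integer> result = new ArrayList<>();
        ArrayDeque<Integer> stack = new ArrayDeque<>();
        stack.push(edges[0][0]);
        while (!stack.isEmpty()) {
            int u = stack.peek();
            List<Integer> list = inc.get(u);
            while (next[u] < list.size() && used[list.get(next[u])]) next[u]++;
            if (next[u] == list.size()) {
                result.add(stack.pop());    // stuck in u: u is appended to the tour
            } else {
                int i = list.get(next[u]);  // walk along an unused edge
                used[i] = true;
                stack.push(edges[i][0] == u ? edges[i][1] : edges[i][0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {2, 0}, {0, 3}, {3, 4}, {4, 0}};
        System.out.println(tour(5, edges));
    }
}
```

Every edge is marked once and every entry of an incidence list is examined once, so the sketch runs in O(n + m) time.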

Edge Coloring Bipartite Graphs

What is finding Euler tours good for, except for drawing nice pictures? In the first place, an Euler tour models a kind of road cleaning problem: starting from their main station, a squad of road cleaners must traverse all streets of a village. Driving over already cleaned streets means a waste of time. An Euler tour gives a possible optimal route for the cleaning squad.

Another important application is as a subroutine in a slightly more advanced problem. Consider a group of friends who have rented a fitness center for two hours. There are n_1 friends and n_2 training machines. Each of them wants to train for 30 minutes on 4 (not necessarily different) machines. Two questions arise:

There is a big difference between these questions. The first is a question about existence. So far we were not confronted with existence questions. Clearly the smallest element in a set can be selected, clearly a sorted order can be constructed, clearly membership can be tested, and elementary algorithms are obvious. For our problem it is not a priori clear that there exists any solution. Maybe we are just asking too much.

The second question is of a different nature. Here we ask about computability. Proving existence might for example go by contradiction and does not need to be constructive. Clearly any construction implies existence, but the other implication does not hold, and there are many problems for which we know that a solution exists, but for which so far no one could give an algorithm with acceptable (= less than exponential) running time.

Fortunately, in the case of the fitness center, there is a surprisingly simple algorithm. It is based on our accumulated knowledge. If there are more than four persons wanting to use the same machine, then clearly they cannot be scheduled in 4 time slots. So, in that case the answer to the first question is negative. Therefore, in the remainder of the discussion, we assume that there are at most four persons wanting to use any of the machines.

We first abstract the problem, reformulating it as a problem for graphs. There is a node for each of the friends and a node for each of the machines. There is an edge for each of the wishes. That is, if person i wants to use machine j, there is an edge (i, j). This graph is bipartite: all edges run between the subset V_1 of n_1 nodes corresponding to the friends and the subset V_2 of n_2 nodes corresponding to the machines.

This graph has degree 4 (each person wants to train on four machines, each machine is selected at most four times). If we succeed to allocate a number from {0, 1, 2, 3} to all edges so that for each node of V_1 and V_2 no two edges have the same number we are done. An assignment of a value x to edge (i, j) can namely be viewed as assigning person i to machine j in time slot x. The condition that all nodes in V_1 and V_2 get each number at most once means that a person is not scheduled to more than one machine at a time and that a machine is not allocated to more than one person at a time. Assigning values from {0, 1, 2, 3} is equivalent to coloring the edges with four different colors.

An edge coloring of a graph is an assignment of numbers to the edges so that no two edges incident upon a node have the same number. In general it is very hard to compute a coloring using a minimum number of colors, but for bipartite graphs this is much easier. Particularly it is generally true that it is possible to construct a coloring with d colors when d is the maximum degree of any node. This is a consequence of Hall's (also called König's) theorem which states that on bipartite graphs there is a matching which matches all nodes of maximum degree. This provides the step in an inductive proof: a coloring can certainly be found by repeatedly constructing such a maximum matching. Surprisingly coloring bipartite graphs can be performed much more efficiently than this. A single matching in a bipartite graph with n nodes and m edges takes O(sqrt(n) * m) time. For a regular bipartite graph of degree g, m = g * n, so repeatedly matching takes O(sqrt(n) * g * m) time. A coloring can be constructed in just O(log g * m) time.

Consider a bipartite graph with node sets V_1 and V_2, each with n nodes. Assume that the graph is regular of degree g. A first idea for constructing a coloring is to start allocating the colors to the first node of V_1, then the second and so on. When assigning the colors of node i, we should respect the conditions imposed by earlier assigned colors. If one is lucky this may work, but in general this approach will get stuck when we must assign a color c to an edge (i, j) while node j has an edge (i', j) for some i' < i, which was already assigned color c while coloring node i'. Another "greedy" strategy may also work, but not always. The idea is that one tries to assign the colors one by one. The algorithm gets stuck when, while assigning color c, one comes to a node i for which all the uncolored edges lead to nodes which already have an edge that was given color c before.

A working and efficient strategy is based on Euler splittings. For a g-regular graph G with g even, the graph can easily be split in two edge-disjoint subgraphs G_0 and G_1 each of degree g / 2. This is done by constructing an Euler tour of the graph, numbering the edges on the tour alternatingly 0 and 1. All edges which have been numbered 0 belong to G_0, all other edges to G_1. Because an Euler tour of a graph with n nodes and m edges can be constructed in O(n + m) time, this splitting costs time O(g * n).

In general, for a graph of degree g = 2^k, the algorithm consists of k rounds. In round i, 0 <= i < k, the algorithm is working on 2^i subgraphs in which all nodes have degree 2^{k - i}. Each of the 2^i operations in round i takes O(2^{k - i} * n) time, so the total amount of work in any of the rounds is O(2^k * n) = O(g * n) = O(m). Thus, in total over all rounds the algorithm takes O(k * m) = O(log g * m) time. For g which are not powers of 2, the construction of such colorings is considerably harder. After much research it has been established (by Cole, Ost and Schirra in Combinatorica, Vol. 21, 2001) that even the general case can be solved in O(log g * m) time.

Coloring Regular Bipartite Graphs
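The recursive Euler splitting for g = 2^k can be sketched as follows. The sketch assumes that the graph is regular of degree g with g a power of 2 and that both sides have n nodes. Instead of one Euler tour per component it uses closed walks obtained by walking along unused edges until getting stuck; in a bipartite graph each such walk has even length, so numbering its edges alternatingly 0 and 1 gives every node equally many edges of either kind. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class EulerSplitColoring {
    // edges[i] = {u, v} joins node u of V_1 with node v of V_2, both
    // numbered 0, ..., n - 1. Returns for every edge a color in 0, ..., g - 1.
    static int[] color(int n, int[][] edges, int g) {
        int[] color = new int[edges.length];
        int[] all = new int[edges.length];
        for (int i = 0; i < all.length; i++) all[i] = i;
        split(n, edges, all, g, 0, color, new boolean[edges.length]);
        return color;
    }

    // Colors the g-regular subgraph given by the edge indices in sub
    // with the colors base, ..., base + g - 1.
    static void split(int n, int[][] edges, int[] sub, int g, int base,
                      int[] color, boolean[] used) {
        if (g == 1) {                       // a perfect matching: one color suffices
            for (int i : sub) color[i] = base;
            return;
        }
        List<List<Integer>> inc = new ArrayList<>();  // node v of V_2 is stored as n + v
        for (int u = 0; u < 2 * n; u++) inc.add(new ArrayList<>());
        for (int i : sub) {
            inc.get(edges[i][0]).add(i);
            inc.get(n + edges[i][1]).add(i);
        }
        int[] next = new int[2 * n];
        int[] half0 = new int[sub.length / 2], half1 = new int[sub.length / 2];
        int c0 = 0, c1 = 0;
        for (int start = 0; start < n; start++) {
            int u = start, label = 0;       // walk until stuck, which happens in start
            while (true) {
                List<Integer> list = inc.get(u);
                while (next[u] < list.size() && used[list.get(next[u])]) next[u]++;
                if (next[u] == list.size()) break;
                int i = list.get(next[u]);
                used[i] = true;
                if (label == 0) half0[c0++] = i; else half1[c1++] = i;
                label = 1 - label;          // number the edges alternatingly 0 and 1
                u = (u < n) ? n + edges[i][1] : edges[i][0];
            }
        }
        for (int i : sub) used[i] = false;  // reset the marks for the recursive calls
        split(n, edges, half0, g / 2, base, color, used);
        split(n, edges, half1, g / 2, base + g / 2, color, used);
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        System.out.println(java.util.Arrays.toString(color(2, edges, 2)));
    }
}
```

A call on a subgraph with m' edges takes O(n + m') time, and every subgraph has at least n edges, so each round costs O(m) in total, in agreement with the bound O(log g * m).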

Unweighted Bipartite Matching

Definitions and Basic Approaches

We are considering an undirected unweighted graph G = (V, E) with n nodes and m edges. A matching M of G is a subset of the edges with the property that no two edges of M share an endpoint. Edges in M are said to be matched, the other edges are said to be unmatched or free. Nodes that are the endpoint of a matched edge are said to be matched, the other nodes are said to be exposed. In the unweighted matching problem, the goal is to find for a given graph a matching which maximizes the number of matched nodes.

The greedy algorithm, picking the edges in any order and adding each edge that does not share an endpoint with an earlier picked edge, is not optimal. For example, for a graph with four nodes connected by three edges in a linear way, the greedy algorithm may start by picking the middle edge, blocking any further edge.
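The example can be checked with a few lines of code. The following is a small sketch with hypothetical names; the order of the array edges[] is the order in which the greedy algorithm considers the edges.

```java
class GreedyMatchingDemo {
    // edges[i] = {u, v}. Returns the size of the matching found by
    // considering the edges in the given order.
    static int greedy(int n, int[][] edges) {
        boolean[] matched = new boolean[n];
        int size = 0;
        for (int[] e : edges)
            if (!matched[e[0]] && !matched[e[1]]) {   // both endpoints still exposed
                matched[e[0]] = true;
                matched[e[1]] = true;
                size++;
            }
        return size;
    }

    public static void main(String[] args) {
        // The path 0 - 1 - 2 - 3, once with the middle edge first.
        System.out.println(greedy(4, new int[][] {{1, 2}, {0, 1}, {2, 3}}));
        System.out.println(greedy(4, new int[][] {{0, 1}, {1, 2}, {2, 3}}));
    }
}
```

With the middle edge first only one edge is matched, while the other order finds the maximum matching with two edges.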

If for an edge e, u(e) denotes one endpoint of the edge and v(e) the other endpoint, then the matching problem can also be formulated as an ILP. The problem is to define a function f mapping the edges to numbers so that sum_e f(e) is maximized under the following conditions:

sum_{e | u(e) = w or v(e) = w} f(e) <= 1, for all w in V
f(e) = 0 or 1, for all e in E

First constructing an optimum solution to the corresponding LP and then rounding to integral values is much better than the greedy algorithm, but it does not necessarily give an optimum solution as must be shown in one of the exercises.

Augmenting Paths

A path p = [u_1, u_2, ..., u_k] is alternating when [u_1, u_2], [u_3, u_4], ... are free, while [u_2, u_3], [u_4, u_5] ... are matched. p is called augmenting when both u_1 and u_k are free.

Lemma: An augmenting path with respect to a matching M is simple.

Proof: Assume that the path is not simple. Then either the path is traversing some node i more than once, or it is finishing in a node i that was traversed before. The first would imply that i is incident on two or more matched edges, which is in contradiction with M being a matching. The second implies that the endpoint is incident on at least one matched edge, which is in contradiction with the path being augmenting, which means among other things that the last node is free. End.

Lemma: Let P be the set of edges on an augmenting path p = [u_1, ..., u_{2 * k}] with respect to a matching M, then M' = M ^ P is a matching of cardinality |M| + 1. Here A ^ B denotes the symmetric difference of A and B: A ^ B = (A - B) + (B - A).

Proof: Check that it is a matching by distinguishing three cases after assuming that two edges e and e' in M' = M ^ P are incident upon the same nodes:

For the number of edges, we notice that P has 2 * k - 1 edges. k - 1 of these are in M, the other k are free. So, |M'| = |M ^ P| = |M| + k - (k - 1) = |M| + 1. End.
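The augmentation M' = M ^ P maps directly onto set operations. The following sketch assumes that undirected edges are represented as frozensets of their two endpoints; the function name augment and the example path are hypothetical.

```python
def augment(matching, path):
    """Return M ^ P, where P is the set of edges on the given node path."""
    path_edges = {frozenset(edge) for edge in zip(path, path[1:])}
    return matching ^ path_edges            # symmetric difference of two sets

# Path 0 - 1 - 2 - 3 with only the middle edge matched: [0, 1, 2, 3] is augmenting.
M = {frozenset((1, 2))}
M_new = augment(M, [0, 1, 2, 3])
```

Here the matched edge [1, 2] is dropped and the two free edges [0, 1] and [2, 3] are added, so the cardinality grows from 1 to 2.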

Theorem: A matching M in a graph G is maximum if and only if there is no augmenting path in G with respect to M.

Proof: If there is an augmenting path, then the matching is not maximum because of the lemma. Now assume that M is not maximum. Let M' be maximum instead. Consider the edges in M ^ M'. Because this is the union of two subsets of matchings, all nodes in the subgraph G' = (V, M ^ M') have degree 2 or less. If the degree is 2, one of the incident edges is from M, the other from M'. A graph of degree two is a composition of cycles and paths. In the cycles the edges of M and M' alternate, and therefore they all contain the same number of edges from M and M'. Because |M'| > |M|, there must be at least one path with more edges from M' than from M. Even here the edges alternate, and therefore this path begins and ends with an edge from M'. So, this is an augmenting path. End.

This theorem gives a perfect characterization of the maximality of matchings in terms of augmenting paths. In the remainder we will consider how to find such paths: all presented matching algorithms are based on repeatedly finding augmenting paths.

Unweighted Bipartite Matching Algorithm

On bipartite graphs everything is easy: test all nodes on one side in order. For those that are not yet matched by the time of testing, perform a BFS of alternating paths and test whether this brings us to any other exposed node. A node that is matched once will always remain matched, and therefore n of these BFS operations are sufficient. Shortest paths are always simple, so the paths found can indeed be used for increasing the number of matched nodes. The algorithm takes O(n * (n + m)) time for a graph with n nodes and m edges.
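A possible realization of this algorithm is sketched below. The representation is an assumption made for the example: the nodes on each side are numbered from 0, adj[u] lists the right-side neighbors of left node u, and -1 marks an exposed node. The name bipartite_matching is hypothetical.

```python
from collections import deque

def bipartite_matching(n_left, n_right, adj):
    """Maximum matching by one BFS for an augmenting path per left node."""
    match_left = [-1] * n_left              # partner of each left node, -1 if exposed
    match_right = [-1] * n_right            # partner of each right node, -1 if exposed
    for root in range(n_left):              # root is still exposed at this point
        reached_from = [-1] * n_right       # left node from which a right node was reached
        queue = deque([root])
        end = -1                            # exposed right node found by the search
        while queue and end == -1:
            u = queue.popleft()
            for v in adj[u]:
                if reached_from[v] != -1:
                    continue                # v was reached earlier in this search
                reached_from[v] = u
                if match_right[v] == -1:
                    end = v
                    break
                queue.append(match_right[v])    # follow the matched edge back to the left
        v = end
        while v != -1:                      # flip the edges along the path found
            u = reached_from[v]
            next_v = match_left[u]
            match_left[u] = v
            match_right[v] = u
            v = next_v
    return match_left
```

For adj = [[0, 1], [0], [1, 2]] the second search rematches left node 0 to right node 1, so that left node 1 can take right node 0.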

This algorithm sounds extremely natural and one might believe that it holds without limitations. However, this is not true. On a general graph (that is, one that is not bipartite), the above theorem holds, but it is not true that all augmenting paths are found by performing a BFS. Consider the following graph:

Hard to Find Augmentation

So, why is the algorithm correct for bipartite graphs? Let the set V of nodes be composed of V_1 and V_2, so that all edges are running between V_1 and V_2. Assume that we are starting the BFS searches from the nodes in V_1. Any pair of consecutive edges, an unmatched edge followed by a uniquely defined matched edge can be replaced by a single edge in a reduced graph consisting only of the nodes in the subset V_1. A target node is now any node of V_1 from which there is an edge to an unmatched node in V_2. A path in this reduced graph corresponds to an alternating path, and a path to a target node can be extended to an augmenting path. Most important, and it is here that we use that the graph is bipartite, is that any augmenting path starting from a node in V_1 is of this form.

The algorithm for the non-bipartite case is considerably more complex. The easiest implementation takes O(n^4) time.

Maximum Flow

Problem Definition

We consider a network N = (s, t, V, E, b), where s is the source, t is the sink and b gives the capacities of the edges. A flow f is an assignment of values f(i, j) to the edges so that
sum_j f(j, i) = sum_j f(i, j), for every i in V other than s and t: flow conservation
0 <= f(i, j) <= b(i, j), for every i, j: capacity constraints
The goal is to find a maximum flow, that is, to find an f for which the flow out of s or into t is maximized:
sum_i f(s, i) = sum_i f(i, t)
If an edge (i, j) with capacity b(i, j) is carrying a flow f(i, j), then the amount b(i, j) - f(i, j) is called the residual capacity. An edge with residual capacity zero is said to be saturated.

A cut through a network is a division of the set of nodes into two subsets, V_s containing s and V_t containing t, which are mutually disjoint and together contain all nodes. The capacity of a cut is the sum of the capacities of all edges leading from V_s to V_t. Notice that edges running in the other direction are not counted, neither positively nor negatively. One of the most famous theorems in this domain, which we give here without proof, is the so-called maxflow-mincut theorem:

Theorem: The value of the maximum flow equals the capacity of the minimum cut.

This theorem does not by itself give an algorithm: a network has exponentially many cuts, so simply trying them all is not feasible. However, the theorem provides us with a means to argue. We will use it in our analysis of networks in which all edges have capacity one. An example of a flow network together with a maximum flow and the mincut is given in the following picture:

Maxflow = Mincut

Trivial Algorithms

The simplest idea is to repeatedly look for a path from s to t, like the augmenting paths. Then the flows along this path are increased by the minimum residual capacity along the path. What does this give us? In general this does not give a maximum flow, it only gives a maximal flow.

Consider applying this idea to the following example, in which all edges have capacity one. The algorithm may find the direct path from s to t, transporting one flow unit. After that no further augmenting paths can be found. So, we get stuck with a flow of one unit, whereas one can clearly transport two units.

Hard to Find Maxflow

Correct Algorithms

The correct idea is to always look for paths in the associated network N(f) = (s, t, V, E(f), a), where for each edge (i, j) in E, E(f) has edges (i, j) and (j, i), with capacities b(i, j) - f(i, j) for the first and f(i, j) for the second. If in the original graph there were already two edges (i, j) and (j, i), then the edges in N(f) should be taken together, summing their capacities, so that there are no multiple edges.
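The construction of N(f) might look as follows. The sketch assumes that capacities and flows are stored in dictionaries indexed by edges (i, j); the name residual_network is hypothetical. As an additional assumption, edges whose capacity in N(f) would be zero are left out, so that a path in the result can always carry some flow.

```python
def residual_network(capacity, flow):
    """Build the edge capacities of N(f) from the capacities b and the flow f."""
    residual = {}
    for (i, j), b in capacity.items():
        f = flow.get((i, j), 0)
        if b - f > 0:                       # room left on the edge itself
            residual[(i, j)] = residual.get((i, j), 0) + (b - f)
        if f > 0:                           # flow that can be pushed back
            residual[(j, i)] = residual.get((j, i), 0) + f
    return residual
```

Because contributions are accumulated per pair (i, j), a pair of opposite edges in the original network is merged automatically, as required above.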

Without proof we state the following result (it is directly related to the maxflow-mincut theorem):

Theorem: A flow f is maximum, if and only if there is no path from s to t in N(f).

Even though the approach of repeatedly constructing N(f) and searching paths from s to t in it eventually gives a maxflow, this is not a particularly efficient way to compute a maximum flow. In the following figure the edge (u, v) has capacity 1, while all other edges have capacity x, for some large number x. In this network we may find as paths an alternation of (s, u, v, t) and (s, ..., v, u, ..., t), increasing the flow by one unit after every operation. So, it takes 2 * x operations to find the maximum flow of 2 * x units. Because the size of the input is bounded by O(log x), the running time is exponential in the size of the input.

Maxflow May Take Long

Polynomial-Time Algorithm

One may feel that the above exponential time only arose because we were taking the wrong paths. This is true. Augmenting only along shortest paths from s to t in N(f) guarantees polynomial time, but the resulting bound is still not particularly good: O(n * m^2). We want better: O(n^3) is the goal. The presented algorithm achieving this is due to Karzanov (1974). It is an improvement of several earlier polynomial-time algorithms. Since then work went on, but no breath-taking new results were found, and all of these algorithms are considerably more complex.
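The shortest-path strategy mentioned above (not yet Karzanov's algorithm) can be sketched as follows. For brevity the sketch stores the capacities of N(f) in an n x n matrix, so each BFS costs O(n^2) rather than O(n + m); this representation, the function name and the node numbering are assumptions made for the example.

```python
from collections import deque

def max_flow_shortest_paths(n, capacity, s, t):
    """Repeatedly augment along a shortest s-t path in N(f); returns the flow value."""
    residual = [[0] * n for _ in range(n)]
    for (i, j), b in capacity.items():
        residual[i][j] += b
    total = 0
    while True:
        parent = [-1] * n                   # BFS tree; -1 means not reached yet
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            i = queue.popleft()
            for j in range(n):
                if residual[i][j] > 0 and parent[j] == -1:
                    parent[j] = i
                    queue.append(j)
        if parent[t] == -1:
            return total                    # no path from s to t in N(f): f is maximum
        delta = float('inf')                # minimum residual capacity on the path
        j = t
        while j != s:
            delta = min(delta, residual[parent[j]][j])
            j = parent[j]
        j = t
        while j != s:                       # update N(f) along the path
            i = parent[j]
            residual[i][j] -= delta
            residual[j][i] += delta
            j = i
        total += delta
```

On a four-node network of the kind discussed above, with s = 0, u = 1, v = 2, t = 3, capacity 1 on (u, v) and capacity x on the other four edges, the shortest paths avoid the edge (u, v) and two augmentations suffice.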

AN(f) is the so-called auxiliary network with respect to f. It is a subnetwork of N(f), containing precisely all those edges of N(f) that lie on a shortest path from s to t. Because in our algorithm we are only looking for shortest paths, this will still allow us to find all interesting paths. At the same time this network is much simpler because it is layered. A network is layered if there are subsets V_0 = {s}, V_1, ..., V_{d-1}, V_d = {t}, so that all nodes of V lie in one of the V_i, and so that all edges run between two consecutive subsets.

Notice that N(f) can be constructed from N and f in O(n + m) time in a trivial way, and that AN(f) can be found from N(f) by BFS (one from s and one from t, an edge (i, j) lies on a shortest path from s to t if d_s(i) + 1 + d_t(j) = d_s(t)).
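The two BFS passes and the edge test can be sketched as follows, assuming that N(f) is given as a dictionary mapping edges (i, j) to positive capacities. The names bfs_dist and auxiliary_network are hypothetical.

```python
from collections import deque

def bfs_dist(adj, source):
    """Distances from source in the graph given by the adjacency lists adj."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in adj.get(i, ()):
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def auxiliary_network(residual, s, t):
    """Keep exactly the edges of N(f) that lie on a shortest path from s to t."""
    forward, backward = {}, {}
    for (i, j) in residual:
        forward.setdefault(i, []).append(j)
        backward.setdefault(j, []).append(i)
    d_s = bfs_dist(forward, s)              # distances from s
    d_t = bfs_dist(backward, t)             # distances to t, found on reversed edges
    if t not in d_s:
        return {}                           # s and t are disconnected
    return {(i, j): c for (i, j), c in residual.items()
            if i in d_s and j in d_t and d_s[i] + 1 + d_t[j] == d_s[t]}
```

The distance d_s(i) then gives the layer V_i to which node i belongs.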

It is very wasteful to throw away AN(f) after finding a single path. It is a much better idea to first find all possible paths in AN(f), and then add this maximal flow f' in AN(f) to f. It can be shown that when doing this, the distance from s to t in AN(f + f') is at least one larger than it was in AN(f). This immediately implies that we need at most n of these stages before we come to a network in which s and t are disconnected: the search is over. So, the total time of an algorithm working along these lines will be O(n * time per stage).

Karzanov's Algorithm

Constructing a Maximal Flow in a Layered Network

In the following we will show how a maximal flow in AN(f) can be constructed in O(n^2), thus giving the desired O(n^3) time bound for the complete algorithm. The idea is not to find individual paths, but rather to push/pull a certain amount of flow from s to t along several paths at a time. The underlying reason is that finding an s-t path is (at least in theory) not cheaper than O(n + m), so, we can just as well do something more than just finding one s-t path.

Define the throughput of a node as the minimum of the sum of the capacities of the ingoing edges and the sum of the capacities of the outgoing edges. In general it is not true that we can construct a flow so that throughput(i) is flowing through node i. But, if we start at the node i with the smallest throughput x of all nodes, then it is certainly possible to push x units of flow from i to t and to pull x units of flow from s to i: we will not get stuck on the way. Afterwards the throughput of node i is reduced to 0, and we might remove i together with all its incident edges from AN(f). This already shows that we must perform at most n of these push-and-pull operations.

How to organize the push-and-pull operations? In a systematic way of course. We apply two rules:

After each push-and-pull operation the capacities of the edges in AN(f) over which flow is pushed or pulled are reduced and the throughputs of the involved nodes are updated. Then the minimum throughput is found again and the next push-and-pull operation is performed. Notice that we can afford O(n) overhead per operation, so we do not have to optimize the selection of the minimum and the like.

How expensive is the above algorithm? During a stage every edge gets saturated at most once. So, this gives O(m) in total for all operations of a stage. During every push-and-pull operation, each node is visited only once (because we are working level by level!) and except for possibly saturating many edges, the work performed for a node is constant. So, this gives O(n * n) for the at most n operations performed during a stage.

Theorem: Using the following ideas, maximum flows can be computed in O(n^3) time:

Cases that can be Solved Faster

If we consider networks in which all edges have capacity one, so-called unit-capacity networks, then we can apply the same algorithm, but it will run faster. This we will not show. If, moreover, every node of the network has indegree 1 or outdegree 1 (such networks are called simple in this context), then the algorithm runs even faster. This we will show.

Lemma: On a unit-capacity network a stage takes O(n + m) time.

Proof: Every edge is handled at most once, particularly there is no partial filling, so this takes O(m) time. When visiting a node, this will either be to remove the node from the network or at least one of the incident edges will be saturated. The first cost is bounded by O(n), the latter costs can be attributed to the first edge saturated and is thus bounded by O(m). End.

Lemma: On a simple unit-capacity network the distance d from s to t satisfies d <= n / maxflow + 1.

Proof: Divide the node set V in subsets V_i, 0 <= i <= d, constituted of all nodes at distance i from s. Notice, that we work with distances in the original network N, not those in an associated network N(f). The whole flow of x units is running through each of the V_i (because there are no bypassing edges, as the definition of the V_i is based on the distance of the nodes from s). So, because each node can transfer only one flow unit, each of the V_i must have size at least x. That is, there can be at most n / x of these subsets. End.

Theorem: On a simple unit-capacity network the maxflow algorithm requires at most O(n^{1/2} * m) time.

Proof: The time per stage is bounded by O(m) because of the lemma above. So, it suffices to show that the number of stages is bounded by O(n^{1/2}). We distinguish two cases:

In the first case, the number of stages can be bounded by the value of the maxflow, because in every stage the flow is increased by at least one unit. In the second case, we are using the last lemma, but not so easily as one might believe. Let x be the value of the maximum flow. Consider the first stage of the algorithm in which the value of the flow exceeds x - n^{1/2}. Let this be stage y. Hereafter at most n^{1/2} further stages will increase the flow to x. Before stage y, the value of the maximum flow through the network N(f) was at least n^{1/2}. One can show that even N(f) has unit-capacities and is simple, when N is so. Thus, in these networks the distance from s to t cannot exceed n^{1/2}, and because this distance increases for every performed stage, we must have y <= n^{1/2} + 1. End.

Faster Bipartite Matching

Now that we know how to find maximum flows, we can use this knowledge to solve other problems. There are many problems that somehow can be formulated as a maximum flow problem, by choosing the right capacities, adding some edges, etc. This is also the case for the bipartite matching problem. And, surprisingly, this leads to a more efficient solution of the problem than the special purpose algorithm we developed before.

Consider a bipartite graph, with node sets V_1 and V_2. This is turned into a network as follows: we add a source s, from which there is an edge to all nodes in V_1 and a sink t, to which there is an edge from all nodes in V_2. The edges in the graph are directed from the nodes in V_1 to those in V_2.
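The construction of the network can be written down in a few lines. The sketch assumes that every edge gets capacity one and that the labels 's' and 't' do not occur among the nodes of the graph; the name matching_network is hypothetical.

```python
def matching_network(left, right, edges):
    """Capacities of the flow network for a bipartite graph with sides left and right."""
    capacity = {}
    for u in left:
        capacity[('s', u)] = 1              # source to every node in V_1
    for v in right:
        capacity[(v, 't')] = 1              # every node in V_2 to the sink
    for u, v in edges:
        capacity[(u, v)] = 1                # original edges, directed from V_1 to V_2
    return capacity

network = matching_network(['a', 'b'], [1, 2], [('a', 1), ('b', 1), ('b', 2)])
```

In the resulting network every node of V_1 has indegree one and every node of V_2 has outdegree one.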

An integer flow from s to t in this network corresponds to a matching: every node in V_1 is transferring at most one unit of flow, which means that it has at most one outgoing edge carrying flow (that is why we imposed that the flow should be integral). The same argument shows that also on the other side the matching property is satisfied. A maximum flow corresponds to a maximum matching: a larger matching would show how to pump more flow, a larger flow would show how to construct a larger matching.

In general, as shown above, maximum flows can be computed in O(n^3) time, but in this case our network is simple (in the above defined sense) and has unit capacities, thus the maximum flow can be found in O(n^{1/2} * m) time, which is much faster than O(n^3), particularly for somewhat sparser graphs. The algorithm never splits flow into fractions unnecessarily, and therefore the resulting flow will be integral. More formally this can be shown by induction: initially the flow is integral (namely equal to 0) and all capacities are integral. Thus, all throughputs are integral, and thus all edges get integral flows (because their residual capacities are integral). So, the next flow is integral again.

Using Maxflow for Bipartite Matching

Weighted Bipartite Matching

Problem and Context

Matching for non-bipartite graphs is substantially harder than bipartite matching. Another complication is to consider matching for weighted graphs. The hardest combination, but still solvable in polynomial time is non-bipartite weighted matching. We will limit ourselves to bipartite graphs, of which we will consider the weighted case now.

Consider a weighted graph, G = (V, E), with weight w_{i, j} associated to the edge running from node i to node j. The goal is to find a matching so that the sum of the weights of the matched edges is maximized. Without loss of generality (though this might have negative influence on the running time), we may assume that the graph is complete: non-existing edges can be replaced by edges with weight zero, matching these has no impact on the sum of the weights, so any maximum matching in the original graph is also a maximum matching in the complete graph. For bipartite graphs, we can even assume, if this is needed in some argument, that the two subsets have the same cardinality: we can always add nodes to the smaller subset with only zero-weight edges to the other subset. The maximization problem we are considering can be replaced by a minimization problem: an edge with weight w_{i, j} is replaced by one with "cost" c_{i, j} = W - w_{i, j}, where W is the maximum over all w_{i, j}. In this form the problem becomes the same as the earlier considered assignment problem.

The problem now has a simple formulation as a linear program. We must have

sum_{j = 1}^n x_{i, j} = 1, for all i,
sum_{i = 1}^n x_{i, j} = 1, for all j,
x_{i, j} >= 0, for all i and j.
and the goal is to minimize
C = sum_{i = 1}^n sum_{j = 1}^n c_{i, j} * x_{i, j}.

Notice that a linear program offers no way to impose that the x_{i, j} must be integral. Of course we can impose this, but then the problem is no longer a linear program. And non-integral optimal solutions to this problem may exist. Examples are easy to construct: 2 nodes on each side, c_a1 = 3, c_b2 = 5, c_a2 = c_b1 = 4. However, fortunately, these non-integral solutions never lead to a value of C that is smaller than what can be reached by integral solutions (non-integral solutions are always convex combinations of integral solutions). A chic way of saying this is to say "all basic feasible solutions are integral".

Bipartite Graph with Non-Integral Maximum Matching

Hungarian Algorithm

The following algorithm is known as the "Hungarian" method, and goes back to Kuhn (1955). Later this algorithm was generalized to become the "primal-dual" method, which is a powerful method used, among other things, to turn solving weighted problems into repeatedly solving non-weighted problems.

Again we are performing searches for augmenting paths, but this time we do not search through all edges from the start. We start with a small graph, having only few admissible edges. When we get stuck on this graph, some further edges (at least one) are made admissible, and the process is repeated until all nodes are matched. We will certainly find a perfect matching this way. The main question is how to add the edges in such a way that the result will be a cost minimizing matching.

The method will be presented in a way that does not immediately give rise to an efficient algorithm. With some modifications it can be turned into an algorithm running in O(n^3). The central lemma on which the whole method is based is

Lemma: The optimal solution remains the same if a constant (possibly negative) is added to any row or column of the cost matrix.

Proof: Instead of the former values c_{i, j}, we now have cost values c'_{i, j} = c_{i, j} + p_i + q_j, where p_i gives the values added to all entries in row i, and q_j gives the values added to the entries in column j. The objective function (that is, the function to minimize) is now

C' = sum_{i = 1}^n sum_{j = 1}^n c'_{i, j} * x_{i, j}
= sum_{i = 1}^n sum_{j = 1}^n c_{i, j} * x_{i, j} + sum_{i = 1}^n sum_{j = 1}^n p_i * x_{i, j} + sum_{i = 1}^n sum_{j = 1}^n q_j * x_{i, j}
= sum_{i = 1}^n sum_{j = 1}^n c_{i, j} * x_{i, j} + sum_{i = 1}^n p_i + sum_{j = 1}^n q_j
So, this is larger than before by sum_{i = 1}^n p_i + sum_{j = 1}^n q_j, a value which does not depend on the choice of the x_{i, j}. End.
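The lemma can be checked numerically on a small instance by trying all assignments. The cost matrix and the offsets p and q below are made-up example values, and the helper names are hypothetical.

```python
from itertools import permutations

def assignment_cost(cost, perm):
    """Cost of assigning row i to column perm[i]."""
    return sum(cost[i][perm[i]] for i in range(len(cost)))

def best_assignment(cost):
    """Brute-force optimum over all n! assignments (only sensible for tiny n)."""
    return min(permutations(range(len(cost))),
               key=lambda perm: assignment_cost(cost, perm))

cost = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]
p = [5, -2, 0]                              # constants added to the rows
q = [1, 0, -3]                              # constants added to the columns
shifted = [[cost[i][j] + p[i] + q[j] for j in range(3)] for i in range(3)]

best = best_assignment(cost)
best_shifted = best_assignment(shifted)
difference = assignment_cost(shifted, best) - assignment_cost(cost, best)
```

Both matrices have the same optimal assignment, and the two optimal costs differ by sum(p) + sum(q), as the proof predicts.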

The method is explained using the following bipartite graph with five nodes on each side. This is the same graph as considered in the chapter on branch-and-bound algorithms.

Weighted Bipartite Graph

In a preparatory step, first in each row we subtract from all entries the minimum value in the row, and thereafter in each column from all entries the minimum value in the column. For the example graph the cost matrix is modified as follows:

Preparatory Steps

Notice that these manipulations do not lead to any negative values, because the column minima are determined after subtracting the row minima. The preparatory step is so that afterwards there is at least one zero in each row and column. This means that in the zero-cost graph, the bipartite graph which has an edge corresponding to each zero in the cost matrix, all nodes have degree at least one. If we are lucky the zero-cost graph has a perfect matching. A matching is said to be perfect if all nodes are matched. On a bipartite graph with n nodes on each side, this means that in total there are n matched edges. Because this matching corresponds to a zero-cost matching of the graph corresponding to the corrected cost matrix, it is a minimum matching of this weighted graph (here we use that all values are still non-negative!). The above lemma states that it even gives a minimum matching of the original graph.
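The preparatory step amounts to two passes over the matrix. The following sketch uses a small made-up 3 x 3 matrix, not the matrix of the example graph; the name reduce_matrix is hypothetical.

```python
def reduce_matrix(cost):
    """Subtract the row minima first, then the column minima of the result."""
    rows = [[c - min(row) for c in row] for row in cost]
    col_min = [min(col) for col in zip(*rows)]
    return [[c - col_min[j] for j, c in enumerate(row)] for row in rows]

reduced = reduce_matrix([[4, 1, 3], [2, 0, 5], [3, 2, 2]])
```

In the reduced matrix every row and every column contains a zero and no entry is negative.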

If the zero-cost graph has no perfect matching we continue to modify the cost matrix until the resulting zero-cost graph has a perfect matching. These corrections are performed in a systematic way, based on the crossing-out lemma:

Lemma: The minimum cardinality of a set of lines (horizontal and vertical) covering all zeroes in an n x n matrix equals the size of a maximum cardinality matching in the zero-cost graph.

Proof: A matching in the zero-cost graph cannot have more matched edges than needed to cover all zeroes: such a matching would use two edges that lie either in one row or one column (covered by one line). This shows that the cardinality of a matching is at most as large as the cardinality of a set of lines covering all zeroes. It is not easy to see that the maximum of the first equals the minimum of the second. This proof is not given. End.

This lemma is one of many "min-max" lemmas, which are typical for primal-dual algorithms (the minimum solution of the dual algorithm corresponds to a maximum solution of the primal, and the values are equal in this optimum).

The following steps are repeated until the zero-cost graph has a perfect matching:

  1. Construct the zero-cost graph and compute a maximum cardinality matching.
  2. Construct the corresponding set of lines (horizontal and vertical) covering all zeroes in the current cost matrix.
  3. Determine the minimum non-zero value x among all uncovered positions in the current cost matrix. Subtract x from all uncovered positions and add x to all positions covered by both a horizontal and a vertical line.

The operations in the third step are the same as subtracting x from all uncovered rows and adding x to all covered columns. The fun of this is that the position (i, j) which had value x before becomes a 0 now, while most existing zeroes remain zero and no negative values arise. Only zeroes that are on the intersection of a covered row and column may get a positive value. But these were apparently not very interesting. The zero-cost graph Z' is a subgraph of the weighted graph G' corresponding to the corrected cost matrix M'. If Z' has a perfect matching, this means that the weighted matching problem for G' has a zero-cost solution. This is clearly optimal, because there are no negative edge weights. Thus, this gives an optimum solution of the assignment problem of M'. Because M' is obtained from the original cost matrix M by modifying the entries in the rows and columns by the same amount, the above lemma gives that this optimum solution for M' is also an optimum solution for M.
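The correction of the cost matrix, in the formulation with uncovered rows and covered columns, can be sketched as follows. The covering lines are taken as given, since computing them is a separate task; the name adjust and the example matrix are hypothetical.

```python
def adjust(cost, covered_rows, covered_cols):
    """Subtract x from all uncovered rows and add x to all covered columns."""
    n = len(cost)
    x = min(cost[i][j] for i in range(n) for j in range(n)
            if i not in covered_rows and j not in covered_cols)
    return [[cost[i][j]
             - (x if i not in covered_rows else 0)
             + (x if j in covered_cols else 0)
             for j in range(n)] for i in range(n)]

# All zeroes of this matrix are covered by row 2 and column 1.
adjusted = adjust([[2, 0, 2], [1, 0, 5], [0, 0, 0]], {2}, {1})
```

Here x = 1 is found at position (1, 0), which becomes a new zero, and the zero at the intersection of the two lines, position (2, 1), becomes 1. The new zero-cost graph has the perfect matching (0, 1), (1, 0), (2, 2).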

We continue the example: a maximum matching might be (a, 1), (b, 5), (d, 5) and (e, 2). Because the matching is not perfect we still have to continue. The minimum covering of the zeroes corresponding to the matching is given by row b, row d, row e and column 1. The smallest value x which is not covered is the 2 at position (a, 5). The values are changed according to the given rules by x. Due to the modifications a new zero is created. There is always at least one new zero. This additional zero is precisely what we need: the new zero-cost graph has a perfect matching: (a, 5), (b, 3), (c, 1), (d, 2) and (e, 4). In the original graph the cost of this matching is 26 + 25 + 21 + 27 + 50 = 149, the same value we found with the branch-and-bound algorithm.

Crossing Out

There is an important, but without further explanation somewhat obscure rule which must be applied to assure that substantial progress is made. The crossing-out sometimes leaves freedom: some lines can just as well be drawn horizontally as vertically. Making the wrong choices, it may happen that the size of the matching of the zero-cost graph remains the same during many rounds. The result may be that the running time becomes exponential in the size of the input. The theory underlying the algorithm assures that when the crossing-out is performed so that vertical lines are used only when required to cover all zeroes with the minimum number of lines, the size of the matching increases by at least one after each round until reaching size n.

In the following example, we start with a cost matrix for which the maximum matching of the corresponding zero-cost graph has cardinality 4, matching the edges (a, 3), (c, 1), (d, 5) and (e, 4). The lines that cover all zeroes in the cost matrix must be drawn through these positions, but there are three possible ways to do so: the lines through (a, 3) and (c, 1) can be drawn almost any way. Drawing the line through (a, 3) horizontally and the line through (c, 1) vertically, the minimum uncovered value is the 1 at position (c, 3). Modifying the matrix and proceeding similarly in the next round, gives back essentially the same matrix as the one with which we started: it takes x / 2 rounds before finding a larger matching. Drawing the lines through (a, 3) and (c, 1) both horizontally, such an increase is obtained after one round.

Importance of Crossing Out Systematically

Exercises

  1. Consider the heap-based program for computing minimum spanning trees. We know that in general the running time is bounded by O((n + m) * log n) due to the operations performed on the heap. Experimentally determine the running time as a function of n and m: test all values of n = 2^k, for k = 12, 14, ..., 20 and m = 4 * n, 16 * n, ..., 256 * n, as far as these problems can be solved in reasonable time on the available computer. Try to fit a function with constant, linear and linear times logarithmic terms through these results. How important is the term n * log m? Now replace the binary heap by a pairing heap. Repeat the above experiments and compare the performance.

  2. Using a binary heap in Prim's algorithm has a worst-case running time of O((n + m) * log n). Construct a class of inputs for which the worst case running time Omega(n^2 * log n) is needed. This shows that for dense graphs, in particular for complete graphs, a trivial array-based implementation of a priority queue with deletemins in linear and decreasekeys in constant time may be substantially better, because then the algorithm is guaranteed to run in O(m + n^2), which is O(n^2) for all graphs without multiple edges.

  3. The given implementation of the Bellman-Ford algorithm is not terminating if a negative-cost cycle arises. This is undesirable, even though in this case it makes no sense to compute the shortest paths, the procedure should terminate with a decent output. Ideally, the correct distances should be computed for all good nodes, nodes which are not reachable from s by a path leading through a negative-cost cycle, while the distance is set to some special value for the other nodes, which we here call bad nodes.
    1. How many times can a good node be enqueued at most?
    2. How frequently is a bad node enqueued? How many rounds may pass at most before a bad node is enqueued again?
    3. How many rounds must be performed before counting the number of enqueue operations allows us to distinguish good and bad nodes?
    4. After how many rounds has dist[v] certainly reached its final value?
    5. Suggest a minor modification to the program realizing the same worst-case running time correctly handling negative-cost cycles.
    6. A tremendous disadvantage of this idea is that now many rounds are performed, even though the shortest paths to all good nodes may contain only few edges. At the expense of O(n) extra storage and slightly higher costs per round, the algorithm can be modified such that even in the presence of negative cycles the algorithm is guaranteed to perform hardly more than the minimum number of rounds. Describe how.

  4. Consider the following strategy for finding a matching on an unweighted graph: first an optimum solution is constructed to the LP, then the maximum value f(e) is determined. f(e) is fixed at 1. For all other edges e' leading to u(e) and v(e), f(e') is fixed at 0. All edges for which the value of f has been fixed are removed from the graph, and the algorithm is repeated until there are no edges left. Give an example of a graph for which this algorithm does not lead to an optimum matching.

  5. Consider the simple unit-capacity network corresponding to the bipartite graph in the text above. Perform the computation of maxflow showing the steps of Karzanov's algorithm, that is, by drawing the sequence of arising N(f) and AN(f) networks. The nodes on the left side are considered before the nodes on the right side. Nodes are considered from the highest to the lowest and an edge (u, v) is considered before an edge (u, v') when v stands higher than v'.

  6. Consider the network in the following picture.

    Flow Network with 11 Nodes

    Indicate the mincut and determine the value of the maxflow. Repeatedly compute N(f) and AN(f) as was done above to find the maxflow. Draw the whole sequence of pictures.

  7. On a unit-capacity network a stage of Karzanov's algorithm can be implemented to run in O(n + m) time. For arbitrary capacities, the cost of a stage is bounded by O(n^2 + m). This time consumption has five contributions: computing all throughputs; at most n times selecting the node with minimum throughput; in each push-and-pull operation addressing each node at most once; in each push-and-pull operation partially saturating one edge for each node; in total over all operations saturating each edge at most once.
    1. Among the listed contributions there are three that, with the suggested implementation take O(n^2) time. Which three?
    2. For two of these contributions it is proven in the lemma that they are actually bounded by O(n + m) for a unit-capacity network. Notice that this is only a matter of analysis, it is not necessary to modify the algorithm. Which O(n^2) contribution does require a modified algorithm?
    3. Give a detailed description how even this problematic contribution can be reduced to O(n + m).
    4. Does the given approach even work for networks with arbitrary capacities? If not, describe an improved implementation for the general case and specify its time consumption.

  8. An independent set I in an undirected graph G = (V, E) is a subset of V so that no two nodes of I are connected by an edge in E. Many graph algorithms are based on repeatedly removing an independent set. The performance of such algorithms depends on the size of the removed independent sets. Therefore, it is natural to consider the problem of finding maximum independent sets, the maximum-cardinality independent-set problem. In general this problem is NP-hard, implying that we have little hope of solving it in polynomial time. For such problems one may try heuristics and special cases, the topic of this exercise.

    For a graph G, an independent set can be constructed as follows: consider the nodes in order, add a node u to I if none of the neighbors of u was added to I before. Give an example of a graph that shows that this algorithm can be arbitrarily bad in the sense that (asymptotically) the ratio between the size of the maximum independent set and the number of selected nodes can be arbitrarily large.

    A better idea is to slightly modify the above algorithm: repeatedly select a node u with currently minimal degree and add u to I, then remove u and all neighbors of u together with their edges. Continue with these operations until no nodes are left. Work this idea out into an efficient algorithm. What is the time complexity of your algorithm?

    Give an example of a simple graph (a graph without self-loops and multiple edges) for which this algorithm does not find the maximum independent set. Hint: there is a graph with 20 nodes for which the algorithm selects 7 instead of 8 nodes. Possibly there are even smaller examples.

    Generalize the given example to show that even for the improved algorithm the ratio of the maximum possible number of nodes and the selected number is not bounded by a constant.

    Show that including a node u with degree one does not compromise the possibility to find a maximum independent set. Hint: consider the optimum solution and distinguish two cases.

    Using the above observation give an algorithm to construct a maximum independent set in trees running in O(n) time.

  9. A generalization of the above considered maximum-cardinality independent-set problem is the maximum-weight independent-set problem: every node u has a weight w(u), and the task is to construct an independent set I for which sum_{u in I} w(u) is maximized. Clearly this problem is even harder to approximate. Therefore, we only consider it for trees.

    Denote the maximum weight of an independent set in a tree T by W(T). Consider a specific node s in T. s does not need to be the root of T. Express W(T) inductively, distinguishing the cases that s is in I or that s is not in I.

    The inductive expression immediately leads to a recursive algorithm for computing W(T). Formulate the algorithm, leaving the choice of s unspecified. This is an example of a divide-and-conquer algorithm; the underlying idea is similar to that of quicksort.

    Let Z(n) be the time for running your algorithm on a tree with n nodes. Give a recurrence relation specifying the time complexity in terms of the sizes of the subproblems to solve. Which nodes are the worst choices for s? Show that for any tree T the time Z(n) may be exponential.

    Which nodes are the ideal choices for s? How much time does it take to find such a node? Now refine the algorithm always choosing these ideal nodes.

    Formulate the recurrence relation for the time consumption Z'(n) of the improved algorithm and estimate it as well as you can, in any case proving a polynomial time bound.

  10. Consider again the Hungarian algorithm for computing a maximum matching of a weighted bipartite graph. The crossing-out lemma is of great importance, because it allows us to efficiently find a minimum-cardinality set of lines covering all zeroes by first solving a maximum-cardinality matching problem on an unweighted bipartite graph. Here we encounter the central idea of primal-dual constructions: a weighted problem is reduced to repeatedly solving an unweighted problem. Nevertheless, the matching of the graph with edges corresponding to the zeroes in the cost matrix does not immediately give the set of lines we are looking for.
    1. What problem remains to be solved?
    2. Reformulate this problem as a well-known decision problem.
    3. The crossing-out lemma guarantees that this decision problem has a solution. How much time does it take to find it?
    4. It was mentioned that in order to assure polynomial running time the crossing-out must be performed so that columns are used only if necessary. This turns the decision problem into an optimization problem. Formulate the optimization problem and consider the time needed for solving it.

  11. Consider the time complexity of the Hungarian algorithm for computing a minimum-weight matching of a complete bipartite graph with n nodes on each side.
    1. How much time would a basic implementation take for each complete iteration of the main loop?
    2. Mention some ideas that may help to reduce the time consumption per iteration.
    3. Mention some ideas that may help to reduce the time consumption of an iteration when using the computations performed in the previous iteration.

  12. Consider the size s_t of the maximum cardinality matching of the zero-cost graph after t iterations of the main loop of the Hungarian algorithm for computing a minimum-weight matching of a complete bipartite graph with n nodes on each side.
    1. Give a cost matrix for which s_0 is particularly small.
    2. Estimate the expected value of s_0 for the case that the values in the cost matrix are chosen independently and uniformly from the range {0, ..., M - 1}.





Lower Bounds

So far we have been talking about the design and analysis of algorithms. We have become quite good at that, and for all the algorithms considered the proven (or just claimed) time bounds are sharp in the sense that there are inputs for which this particular algorithm requires this much time. So, the stated time bounds are tight for the algorithm. However, this does not say anything about the quality of our algorithm. Who tells us that we need sqrt(n) * m time for unweighted bipartite matching? Actually, this appears not to be a very natural bound. Therefore, we would also like to prove lower bounds on problems, not only on algorithms. Clearly this is a rather hard task, because we must then argue about all possible algorithms. Not surprisingly, there are hardly any non-trivial lower bounds of this kind known. It is easy to show that problems like matching require at least Omega(n + m) time, but anything beyond this is terribly hard. In the following we present two basic strategies that allow us to prove lower bounds. Other methods mainly serve to prove that problems have comparable hardness, without being able to prove how hard they are.

Information-Theoretic Arguments

Information-theoretic arguments are most useful for problems that involve some kind of comparison of elements. A decision tree is used to represent the operations of an algorithm. Each node represents the set of still possible solutions. The leafs correspond to situations with only one possible solution left. This solution can be output, which is also called a verdict. Every internal node also carries a question, and its outgoing edges correspond to the possible answers to this question. In many cases the questions are of yes/no type, but this is not necessarily so, and the degree can also be larger than 2. The longest path from the root to a leaf corresponds to the longest sequence of questions that the algorithm may have to ask before coming to a conclusion. Thus, if each question can be answered in constant time, then the worst-case time is proportional to the depth of the tree. Decision trees are also suitable for analyzing the average-case behavior of algorithms: if one can argue that the algorithm terminates in each of the leafs with equal probability, then the average depth of the leafs gives the average number of questions to answer, and is thus proportional to the average running time of the algorithm.

In order to talk about the worst-case time for a problem, we must show that the decision tree of any algorithm has a leaf at a certain depth. Similarly, we can prove facts about the average-case time for the problem by showing that in any decision tree the average depth of the leafs exceeds the bound to prove. Useful facts are:

Lemma: Any binary tree with n leafs has depth at least log_2 n, rounded up. In any binary tree with n leafs the average depth of the leafs is at least log_2 n.

Proof: The first proof goes by induction: a tree of depth 0 has no more than 1 = 2^0 leafs. A tree of depth k + 1 has no more leafs than 2 trees of depth k together. This shows that a tree of depth k has at most 2^k leafs.

The second proof is similar but harder. It is not so easy to say something about the average value directly. It is much easier to put a lower bound on the sum of the depths of the leafs. Let h(k) be the minimum possible sum for a binary tree with k leafs. h(1) = 0. And generally, h(k) = k + min_i{h(i) + h(k - i)}, because the k leafs are divided over the two subtrees of the root and each of them lies one level deeper than in its subtree. Inductively we prove that h(k) >= k * log_2 k. For k == 1, this claim is satisfied. For larger k we get h(k) >= k + min_i{i * log i + (k - i) * log (k - i)}. Differentiation shows that the minimum is obtained for i = k / 2. Thus, h(k) >= k + 2 * (k / 2) * log (k / 2) = k * log k. End.

Is this not strange, that the average depth is the same as the maximum depth? No, this is a lower bound, and the best tree is in both cases the same: a perfect binary tree with all leafs at the same depth.
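The recurrence for h(k) can be checked numerically for small k. The following is only a sketch: the function names min_depth_sum and check_lemma are made up for this illustration, and the recurrence is evaluated exactly as stated in the proof (every internal node is assumed to have two children, which is the case in a tree with minimal depth sum).

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def min_depth_sum(k: int) -> int:
    """h(k): minimum possible sum of the leaf depths of a binary tree with k leafs."""
    if k == 1:
        return 0
    # The root splits the k leafs into i and k - i; each leaf lies one level deeper.
    return k + min(min_depth_sum(i) + min_depth_sum(k - i)
                   for i in range(1, k // 2 + 1))


def check_lemma(max_leafs: int) -> bool:
    """Check h(k) >= k * log_2 k for all k up to max_leafs."""
    for k in range(1, max_leafs + 1):
        # Small tolerance because math.log2 works with floating-point numbers.
        if min_depth_sum(k) < k * math.log2(k) - 1e-9:
            return False
    return True


if __name__ == "__main__":
    print([min_depth_sum(k) for k in range(1, 9)])
    print(check_lemma(200))
```

For powers of two the bound is attained exactly, in agreement with the remark that the perfect binary tree is the best tree.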

Sorting

The most famous problem that can be analyzed with this method is sorting. The sorting problem is equivalent to figuring out according to which of the n! permutations the numbers are arranged. So, at the root of the decision tree there are n! possibilities, which must all be split out into leafs: a tree with n! leafs. Using that n! > (n / e)^n and that log e = 1.44.., we obtain lower bounds of n * (log n - 1.44..) for the worst case and for the average case of comparison-based sorting.

The nice thing about these lower bounds is that they can even be applied to small values of n. Because 3! == 6, it follows that we need at least 3 comparisons for sorting 3 numbers, and because log_2 4! = 5 (rounded up), we need at least 5 comparisons for sorting 4 numbers. Clearly these are lower bounds, and a priori it is not clear whether there are schedules actually achieving these bounds. The answer is: not always. The book mentions that log_2 12! = 29 (rounded up), but that it has been shown that one needs at least 30 comparisons (which can be achieved).
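The small values quoted above are easy to recompute. The sketch below evaluates log_2 n!, rounded up, with exact integer arithmetic; the function name sorting_lower_bound is chosen for this illustration only.

```python
import math


def sorting_lower_bound(n: int) -> int:
    """Smallest d with 2^d >= n!, that is, log_2 n! rounded up."""
    # For an integer x >= 1, (x - 1).bit_length() equals ceil(log_2 x).
    return (math.factorial(n) - 1).bit_length()


if __name__ == "__main__":
    for n in (3, 4, 12):
        print(n, sorting_lower_bound(n))
```

Integer arithmetic is used instead of math.log2 to avoid rounding problems when n! is close to a power of two.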

Merging

Can we also find a lower bound on the time for the simpler problem of merging two sorted sequences, S_1 and S_2, of length n / 2 each? In the final sorted sequence of length n, the elements from S_1 can occupy an arbitrary subset of n / 2 positions (within these positions the order of the elements is given). That is, there are (n over n / 2) possibilities at the start of the merging algorithm. Because (n over n / 2) >= 2^n / sqrt(2 * n) >= 2^n / n, merging takes at least n - log n comparisons, in the worst case and on average.
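The number of possibilities for merging can be evaluated exactly in the same way as for sorting. This is a sketch with a hypothetical function name, merging_lower_bound; it assumes that n is even, so that both sequences have length n / 2.

```python
import math


def merging_lower_bound(n: int) -> int:
    """log_2 (n over n / 2), rounded up, for even n."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be even and at least 2")
    # math.comb(n, k) is the binomial coefficient "n over k".
    return (math.comb(n, n // 2) - 1).bit_length()


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 64):
        print(n, merging_lower_bound(n), n - math.log2(n))
```

The printed values show how close the exact information-theoretic bound lies to n - log n.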

List-Ranking on a Processor Network

In the case of processor networks, the information-theoretic argument can also be applied in another way. We consider a very simple system, which is a quite general model of parallel computers that exchange information by communicating over an interconnection network of finite capacity. Our system consists of two computers, P_1 and P_2, connected by a bus that can be used for bidirectional traffic: b bits can be sent in a single time unit. The processors have infinite computational power, that is, we only consider the communication complexity of problems. This is clearly quite a generous assumption. Communication lower bounds derived on this system give a lower bound on the number of bits that must cross a "bisection" of a processor network with more than 2 processors.

On our 2-processor system we consider the number of bits that must be transferred from P_1 to P_2 when solving the list-ranking problem for a list of length n. In the list-ranking problem the task is to determine for each element of a linked list the distance to the last or first element of the list. This distance will be called the rank of the element. Sequentially the problem can simply be solved by following the links (once the first node of the list has been determined) and numbering the nodes consecutively. In parallel or in other computational models (external computation) the problem is much harder but still optimally solvable.

Here we deal with lower bounds. How many bits must be communicated from P_1 to P_2? It is not a priori clear that the problem cannot be solved in some magic way, requiring very little communication. For comparison, consider sorting: there we require that afterwards the n / 2 smaller numbers stand in P_1 and the n / 2 larger numbers stand in P_2. If initially the numbers are arranged the wrong way around, then all numbers in P_1 must be sent to P_2. This argument is simple, and the obtained lower bound is sharp to within lower-order terms.

For the list-ranking problem we make what is, at first glance, a strong restriction of the problem: we only consider inputs in which all the nodes with even rank are stored in P_1 and all those with odd rank in P_2. This means that all elements in P_1 have a successor in P_2 and vice versa. The list-ranking problem for the elements in P_1 is now equivalent to determining according to which of the (n / 2)! permutations the elements are arranged. This information is only available in the form of the successors of the elements in P_2. Telling which of the (n / 2)! permutations it is requires log_2 (n / 2)! bits. The Stirling bound gives that this is more than log_2 (n / (2 * e))^{n / 2} = n / 2 * (log_2 n - 2.44..) = n / 2 * log_2 n - O(n). Assuming that b = log_2 n, the number of bits in a number in the range up to n, this means that we need n / 2 - o(n) time units for the communication, the same number (except for the lower-order term) as for sorting.
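To get a feeling for the size of the lower-order term, the bound can be evaluated for concrete n. The following sketch uses two hypothetical function names, list_ranking_bits and communication_steps, and takes the bus width to be exactly b = log_2 n bits per time unit, as assumed in the text.

```python
import math


def list_ranking_bits(n: int) -> int:
    """log_2 (n / 2)!, rounded up: bits needed to tell which permutation it is."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be even and at least 2")
    return (math.factorial(n // 2) - 1).bit_length()


def communication_steps(n: int) -> float:
    """Lower bound on the number of time units for a bus of width b = log_2 n."""
    return list_ranking_bits(n) / math.log2(n)


if __name__ == "__main__":
    for n in (16, 1024, 65536):
        print(n, list_ranking_bits(n), communication_steps(n), n / 2)
```

For moderate n the bound is still noticeably below n / 2, because the term that is subtracted only vanishes relative to n / 2 at the rate 1 / log n.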

Adversary Argument

Adversary arguments are mainly used to show that at least the whole input must be considered. The idea is to argue as follows: assume the algorithm does not consider some little piece of the input; then, if this piece is relevant at all, an adversary may choose it so that the constructed answer is wrong. This is the easiest way of thinking of an adversary argument. The more "adversary-like" way of viewing the approach is that the input is not specified at the start except for its size, but is only specified as far as it has been inspected. For every probe of the input that the program performs, the adversary decides which value to specify, consistent with all earlier specified values, so that the algorithm obtains the least possible information. If at any given point the algorithm decides to stop while the adversary knows that more than one solution is still possible, he specifies the rest of the input so that the answer of the program is wrong.

Minimum

In some cases the adversary argument is nothing else than an informal way of handling a decision tree, but not always. Consider the problem of finding the minimum element of an array of length n. At first there are n possibilities, but it is not clear how to formulate the problem in terms of a decision tree: how does the tree branch upon inspection of an element? Clearly not in a binary way. An adversary argument easily gives the right result: assume an algorithm for determining the minimum inspects fewer than n positions of the array. Let k be a position that is not inspected. As the algorithm did not inspect position k, its outcome cannot depend on the value stored there. But clearly it should: if the value at position k is smaller than the minimum of the other values, the algorithm should output this value, otherwise it should output one of the other values.
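The adversary for the minimum problem can be played out in a few lines. The sketch below is an illustration only: ProbeArray, adversary_input and the deliberately flawed lazy_minimum are names invented here, and the algorithm under test is assumed to be deterministic and to access the array only through a[i] and len(a).

```python
class ProbeArray:
    """Array wrapper that records which positions have been inspected."""

    def __init__(self, values):
        self._values = list(values)
        self.probed = set()

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        self.probed.add(i)
        return self._values[i]


def adversary_input(algorithm, n):
    """Return an input on which algorithm answers wrongly, or None if it inspects everything."""
    probe = ProbeArray([1] * n)          # the adversary answers every probe with 1
    answer = algorithm(probe)
    unprobed = [i for i in range(n) if i not in probe.probed]
    if not unprobed:
        return None                      # all positions inspected: no freedom left
    bad = [1] * n
    if answer == 1:
        bad[unprobed[0]] = 0             # hide a smaller value where nobody looked
    # If answer != 1, the all-ones input itself already exposes the error.
    return bad


def lazy_minimum(a):
    """Flawed example algorithm: never inspects the last position."""
    return min(a[i] for i in range(len(a) - 1))


def full_minimum(a):
    """Correct algorithm: inspects all n positions."""
    return min(a[i] for i in range(len(a)))


if __name__ == "__main__":
    bad = adversary_input(lazy_minimum, 5)
    print(bad, lazy_minimum(bad), min(bad))
    print(adversary_input(full_minimum, 5))
```

Because the algorithm never reads the modified position, it behaves on the constructed input exactly as it did during the probing run, which is the core of the argument.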

Graph Problems

Assume a DFS algorithm is not inspecting all edges. Then let e be a non-inspected edge. The program's output is independent of the edge e, but clearly the DFS tree may depend on it.

Sorting on Mesh Computers

Not all adversary arguments are this trivial. The book (12.3.3) offers a nice proof that finding the median requires at least 3/2 * n comparisons. Here we consider a very different example. Consider a processor network consisting of n^2 processing units (PUs), arranged in a two-dimensional n x n grid. Each PU holds one number, and PUs can swap their number with a neighbor. The task is to rearrange the numbers so that they appear in sorted order in the PUs, in row-major order. That is, within every row they are sorted from left to right, and the numbers increase from row to row. An example for n = 4 is given:
  20  1 17 12       1  4  7 10
  55  7 21 34  ->  11 12 16 17
  35 16 44 10      18 20 21 28
  18 28  4 11      34 35 44 55

Because a number can travel only one position in every time step, and because the number in one corner may have to travel all the way to the opposite corner, sorting clearly requires at least 2 * n - 2 steps. For the following stronger lower bound it is essential that numbers can only be swapped: at all times there is exactly one number in every position. After step 2 * n - 2 * sqrt(n), no information from the upper-right sqrt(n) x sqrt(n) corner, the joker zone, has reached the lower-left corner yet. At that moment some number z is standing there, and which number this is does not depend on the values in the joker zone. The values of the n numbers starting in the joker zone can therefore be chosen so that z has its final position in the rightmost PU of a row. That is, it takes at least another n - 1 steps to move z to its final position. Hence, we have found a lower bound of 3 * n - o(n) for sorting in this computational model.

The 3 * n - o(n) lower bound is almost tight: it has been shown that 3 * n + o(n) steps are sufficient. In a slightly more realistic model, where PUs can hold more than one number, and numbers are not swapped, but sent and maybe even copied, sorting can be performed in 2 * n + o(n) steps.
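The two lower bounds for sorting on the n x n grid can be compared numerically. The sketch below simply evaluates the expressions from the argument, (2 * n - 2 * sqrt(n)) + (n - 1) against 2 * n - 2; the function names are hypothetical, n is assumed to be a perfect square so that the joker zone contains exactly n numbers, and off-by-one details of the argument are ignored.

```python
import math


def distance_bound(n: int) -> int:
    """Trivial bound: a number may have to travel from corner to corner."""
    return 2 * n - 2


def joker_zone_bound(n: int) -> int:
    """Bound from the joker-zone argument for an n x n grid."""
    s = math.isqrt(n)                 # side length of the joker zone
    if s * s != n:
        raise ValueError("n is assumed to be a perfect square")
    # Steps before joker information can reach the lower-left corner,
    # plus the steps to move z to the rightmost column.
    return (2 * n - 2 * s) + (n - 1)


if __name__ == "__main__":
    for n in (16, 10**4, 10**6):
        print(n, distance_bound(n), joker_zone_bound(n), joker_zone_bound(n) / n)
```

The last column approaches 3 as n grows, which is what the notation 3 * n - o(n) expresses.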

Exercises


This page was created by Jop Sibeyn.
Last update Sunday, 05 December 04 - 12:45.