Preface and Notation

This document contains the text of a course in randomized algorithms.

As html is not suited for mathematical formulas, some additional notation is used (as used in the typographical package Latex). a_i denotes a with subscript i. a^i denotes a to the power i. <= and >= are used as in most computer languages. Curly brackets, "{" and "}", are use to group things. sum_{i = 0}^j denotes the sum for i running from 0 to j. The same may also be written sum_{0 <= i <= j}. ~= stands for "approximately equal" and ~ for "proportional to". Greek letters are written out. For logical expressions we either use the notation from C or the operators are written out in text. So, "a && !b" is the same as "a and not b". sqrt stands for the square root function and log without specifying the base number for the logarithm to basis 2. There are a few more notations, but they should be understood easily. New notions are printed bold there where they are defined. Inside the chapters particularly important ideas and short key notes are highlighted without introductory text. There are many pictures. These are intended to be self-explanatory. Typically they are placed just after the text fragment to which they belong, generally there is no direct reference to them in the text.

There are very few references to the literature. Clearly most of the presented material is not new. A part is even common knowledge. More directly this text is based on material found in the following book:

These lecture notes more or less follows this book, but the presentation is different. It has been attempted to make the material more easily accessible. Several students have contributed by pointing out errors and spots which required better explanation.

Notice: the following text may be overcomplete. At the examination it is expected that the students only know all that has been presented during the lectures.

Table of Contents





Introduction

Basic Probability Theory

Mathematically probability is a mapping from some kind of events to the real numbers so that the sum over all events of their probabilities equals 1. An event is an element of a set which is called the probability space. A random variable is a function mapping events to values. The above assumes discrete events, for a continues probability space one has to formulate slightly differently, but for our purposes this will do. The expected value Exp[X] of a random variable X is given by
Exp[X] = sum_i (i * Prob[X = i]).
This is also called the first moment of X. The second moment is obtained by replacing i * Prob[X = i] by i^2 * Prob[X = i], this equals the expected value of X^2. Higher moments are defined analogously. The variance Var[X] of X is defined as
Var[X] = Exp[X^2] - (Exp[X])^2
The variance does what one expects: it tells something about the way the values are spreading around the expected value.

An example of a probability space are all possible outcomes of throwing two indistinguishable dices in one stroke. In this case the probability space is the set {(1, 1), ..., (1, 6), (2, 2), ..., (2, 6), ..., (5, 5), (5, 6), (6, 6)}. The cardinality of this set is 21. Probabilities may be (but must no be) assigned so as to correspond to real probabilities: Prob[(i, j)] = 1 / 36 if i = j and Prob[(i, j)] = 1 / 18 if i != j. Many random variables can be defined, for example, the random variable X giving the sum of both values. This X assumes values 2, 3, ..., 12. With the defined probabilities Prob[X = 2] = Prob[X = 12] = 1 / 36, Prob[X = 3] = Prob[X = 11] = 2 / 36, Prob[X = 4] = Prob[X = 10] = 3 / 36, ..., Prob[X = 7] = 6 / 36. The expected value of X, which is 7, can be computed easily. The variance is given by 1974 / 36 - 7 * 7 = 5 5/6.

A basic but useful relation is Markov's inequality which applies to random variables with positive values, that is Prob[X < 0] = 0. For them we have

Prob[X >= k * Exp[X]] <= 1 / k.
The proof can be given by contradiction: if this probability would be larger, then the expected values would be larger. A stronger estimate, based on the variance is called Chebycheff's inequality, but we will not need it.

Two random variables X and Y are called independent if

Prob[X = x and Y = y] = Prob[X = x] * Prob[Y = y], for all x and y.
An example of independent random variables: throwing two coins, one after the other, X and Y being the number of heads (0 or 1) appearing up on the first and second coin respectively. If the coins are glued together, then X and Y are clearly dependent. Either X = Y, or X = 1 - Y depending on how the coins are glued. So, in that case Prob[X = 0 and Y = 0] is either 0 or 1, different from Prob[X = 0] * Prob[Y = 0] = 1/2 * 1/2 = 1/4.

The conditional probability of an event X = x under condition Y = y is denoted as Prob[X = x| Y = y]. It is defined as

Prob[X = x| Y = y] = Prob[X = x and Y = y] / Prob[Y = y].
For example, if X gives the outcome of throwing a dice, and Y the outcome of throwing another dice and Z = X + Y, then Prob[Z >= 10] = 6 / 36 = 1 / 6, but Prob[Z = 10| X >= 5] = Prob[Z >= 10 and X >= 5] / Prob[X >= 5] = (4 / 36) / (2 / 6) = 1 / 3. Of course we can also write
Prob[X = x and Y = y] = Prob[X = x] * Prob[Y = y| X = x] ,
expressing the cumulative effect of the operations: first we check the value of X in full independence, then we add Y, where the choice may be impacted by the earlier choice of X. Of course this can be extended to
Prob[and_{i = 1}^k X_i = x_i] = Prob[X_1 = x_1] * Prob[X_2 = x_2| X_1 = x_1] * Prob[X_3 = x_3| X_1 = x_1 and X_2 = x_2] * ... * Prob[X_k = x_k| and_{i = 1}^{k - 1} X_i = x_i].
The proof can easily be given by induction. Notice that if X and Y are independent, Prob[X = x| Y = y] = Prob[X = x and Y = y] / Prob[Y = y] = Prob[X = x] * Prob[Y = y] / Prob[Y = y] = Prob[X = x]. Thus, independence and conditional probability behave in a natural way.

Prob[X = x or Y = y] = Prob[X + x] + Prob[Y = y] - Prob[X = x and Y = y] <= Prob[X = x] + Prob[Y = y]. Using induction this can be used to prove that for a set of n random variables X_i, 1 <= i <= n, Prob[or_{i = 1}^n X_i = x_i] <= sum_{i = 1}^n Prob[X_i = x_i]. The most important special case is that all X_i are equally distributed and that x_i = x for all i. In that case Prob[or_{i = 1}^n X_i = x] <= n * Prob[X_1 = x]. This also implies that Prob[or_{i = 1}^n X_i > x] <= n * Prob[X_1 > x], a relation which is frequently used in estimates.

Expected values are linear:

Lemma: For two random variables X and Y, Exp[X + Y] = Exp[X] + Exp[Y].

Proof: The most important step is an alternative way of writing the expected value of Exp[X]. Exp[X] = sum_x x * Prob[X = x] = sum_x x * sum_y Prob[X = x and Y = y] = sum_x sum_y x * Prob[X = x and Y = y]. Here we use that Prob[X = x] = sum_y Prob[X = x and Y = y], which follows from the interpretation of Prob[X = x and Y = y] as being an entry in a two-dimensional table of probabilities. Using this relation for Exp[X] and the analogous one for Exp[Y], we get

Exp[X] + Exp[Y] = sum_x sum_y x * Prob[X = x and Y = y] + sum_x sum_y y * Prob[X = x and Y = y] = sum_x sum_y (x + y) * Prob[X = x and Y = y] = Exp[X + Y].
End.

The essential point is that it is not required that X and Y are independent. Using induction this lemma can be used to prove that this even holds for any countable number of random variables:

Corollary: For n random variables X_i, 1 <= i <= n, Exp[sum_{i = 1}^n X_i] = sum_{i = 1}^n Exp[X_i].

An important special case is that the X_i are equally distributed. In that case Exp[sum_{i = 1}^n X_i] = n * Exp[X_1]. That this is something special becomes clear when one considers Exp[A * B]. In general this cannot be written as Exp[A] * Exp[B]. For example, if Prob[A = 0] = Prob[A = 1] = 1/2 and B = 1 - A, then Exp[A] = Exp[B] = 1/2, so Exp[A] * Exp[B] = 1 / 4. On the other hand, Exp[A * B] = 0 * Prob[A * B = 0] + 1 * Prob[A * B = 1] = 0 * 1 + 1 * 0 = 0. However, if A and B are independent, then Exp[A * B] = Exp[A] * Exp[B]. This can be shown analogously to the proof of the linearity of the expected value.

Quick Sort

The following gives a (of several possible alternatives) version of quicksort for sorting a set S of n numbers:
  1. Choose an element s uniformly at random from S.
  2. Split the set S in S_<, S_= and S_> consisting of the elements that are smaller, equal and larger than s, respectively.
  3. Recursively sort S_< and S_>.
  4. Output the elements of S_<, S_= and S_> in this order.

Expected Time

Here we are not interested in the details. We only would like to obtain a somewhat accurate estimate for the expected number of performed comparisons. This is not so easy, because of the random choices and the recursion which are hard to control. Let a_i denote the number from S with rank i. That is, the number that finally appears at position i of the sorted sequence (in the following we will assume that all elements are different, which clearly only increases the number of required comparisons). The random variable X_{i, j} gives the number of times a_i and a_j are compared. Because an element that is used as a splitter is singled out once and for all, X_{i, j} in {0, 1}. Using the linearity of expected values, it follows that the expected number of comparisons is given by
Exp[sum_{i = 1}^n sum_{j = i + 1}^n X_{i, j}] = sum_{i = 1}^n sum_{j = i + 1}^n Exp[X_{i, j}].

Now we have reduced the problem of analyzing quicksort to that of analyzing the X_{i, j}. If p_{i, j} is the probability that X_{i, j} = 1, then Exp[X_{i, j}] = 0 * (1 - p_{i, j}) + 1 * p_{i, j} = p_{i, j}. This is a trivial but general fact: the expected value of a 0-1 random variable X equals Prob[X = 1]. The execution is viewed as a binary tree, in each node of the tree we write the splitter element y, all elements in S_< are arranged to the left and all elements in S_> to the right. a_i and a_j are compared iff one is an ancestor of the other. Now define a permutation pi of the elements by traversing the tree in level-order, that is, level by level, starting with the root. There are two further observations that allow us to compute p_{i, j}:

This gives
      sum_{i = 1}^n sum_{j = i + 1}^n       p_{i, j} = 
      sum_{i = 1}^n sum_{j = i + 1}^n       2 / (j - i + 1) =
      sum_{i = 1}^n sum_{k = 2}^{n - i + 1} 2 / k <=
  2 * sum_{i = 1}^n sum_{k = 1}^n           1 / k = 
  2 * sum_{k = 1}^n H_k <= 2 * n * H_n.
where H_k gives the k-th harmonic number. Because H_n = sum_{i = 1}^n 1 / i ~= integral_1^n 1 / x dx = ln n, this shows that the expected number of comparisons for this variant of quicksort is bounded by O(n * log n). This estimate is rather rough. More careful analysis allows to put bounds on the constant. Running a small program gives the exact values for small n:
  n =     2, sum =      1, sum / n =  0.500, sum / (n * log n) = 0.500
  n =     4, sum =      4, sum / n =  1.208, sum / (n * log n) = 0.604
  n =     8, sum =     16, sum / n =  2.115, sum / (n * log n) = 0.705
  n =    16, sum =     50, sum / n =  3.184, sum / (n * log n) = 0.796
  n =    32, sum =    139, sum / n =  4.371, sum / (n * log n) = 0.874
  n =    64, sum =    360, sum / n =  5.636, sum / (n * log n) = 0.939
  n =   128, sum =    889, sum / n =  6.951, sum / (n * log n) = 0.993
  n =   256, sum =   2123, sum / n =  8.297, sum / (n * log n) = 1.037
  n =   512, sum =   4945, sum / n =  9.660, sum / (n * log n) = 1.073
  n =  1024, sum =  11297, sum / n = 11.033, sum / (n * log n) = 1.103
  n =  2048, sum =  25420, sum / n = 12.412, sum / (n * log n) = 1.128
  n =  4096, sum =  56502, sum / n = 13.795, sum / (n * log n) = 1.150
  n =  8192, sum = 124344, sum / n = 15.179, sum / (n * log n) = 1.168
  n = 16384, sum = 271382, sum / n = 16.564, sum / (n * log n) = 1.183
  n = 32768, sum = 588170, sum / n = 17.950, sum / (n * log n) = 1.197
  n = 65536, sum =1267172, sum / n = 19.336, sum / (n * log n) = 1.208
  n =131072, sum =2716025, sum / n = 20.722, sum / (n * log n) = 1.219
The values in the last column appear to converge to some value around 1.3. This is also suggested by experiments though in actual applications the deviations from the expected value are so larger that experiments must be repeated many times in order to get a somewhat reliable estimate of the expected number of comparisons performed. An exercise asks to compute the limiting value in an analytical way. Here it is not so much the (known) result as the derivation which is interesting.

Stronger Bounding

In some cases it might be important to know a stronger kind of bounding. Expected time is fine, but if we have a critical application (waiting customers, airplanes without fuel, power plant control), we want to have guaranteed performance, or at least performance that can be guaranteed with acceptably high probability. The expected value tells very little about the distribution. For a positive random variable X, for example giving the running time of an algorithm, an expected value Exp[X] = x does not exclude that Prob[X = k * x] = 1 / k (and that Prob[X = 0] = 1 - 1 / k). This is the counter part of Markov's inequality.

Let us assume that the element s is chosen uniformly at random and that all elements are different. How long does the algorithm take? Here we have the problem of analyzing a randomized algorithm. There are many proofs that this time is O(n * log n) in some sense. The following statement also contains some nice ideas itself. We are using the notion of with high probability, which means that the probability that the claim is not satisfied is less than n^{- eps}, for some eps > 0. We might use the so-called Chernoff bounds, which are the most important tool in estimating the tail probabilities of the distribution of the sum of independent Bernoulli trials (a Bernoulli trial is a 0-1 experiment with a given probability p to get 1 and probability 1 - p to get 0: throwing a biased coin).

Lemma (Chernoff Bound): let X be the random variable corresponding to throwing a biased coin: with probability p the result is 1, with probability 1 - p the result is 0. Let S be the random variable giving the number of 1 outcomes in n independent experiments of the above type. Then

Prob[S > p * n + h] <= e^{- h^2 / (3 * p * n)}
Prob[S < p * n - h] <= e^{- h^2 / (3 * p * n)}

We do not proof this lemma (it is a consequence of a clever application of the Markov inequality). We want to stress the importance of the independence of the experiments. This is often a condition that is not satisfied. If the impact of the dependencies can somehow be bounded, then one can use the Azuma inequalities, which are similar but slightly more complicated. We will introduce them further down. The lemma implies that the deviation from the expected value (which is p * n) is not larger than O(sqrt(p * n * log n)) with high probability (as long as this is at least log n). Notice that by taking h slightly larger, the non-success probability becomes much smaller. In the analysis of the time of quicksort we do not need such a strong result (though it does not harm). The following much weaker result is also effective for our purposes:

Lemma: let X be the random variable corresponding to throwing an unbiased coin: with probability 1/2 the result is 1, with probability 1/2 the result is 0. Let S be the random variable giving the number of 1 outcomes in n independent experiments of the above type. Then

Prob[S < n / 4] <= n / 4 * 0.91^n.

Proof: The probability P_k that among the n trials there are exactly k successes is given by

P_k = (n over k) * (1/ 2)^n.
Away from the expected value n / 2 these probabilities are monotonically decreasing. Thus, sum_{h < k} P_h <= k * P_k. Using Stirling's formula, we find that for all n and k:
(n / k)^k <= (n over k) <= (n * e / k)^k.
So, P_k <= 2^{-n} * (n * e / k)^k. This gives
Prob[S < n / 4] <= n / 4 * ((4 * e)^{1/4} / 2)^n.
End.

In words this lemma states: "when expecting n / 2 successes, we may be almost sure that there are actually at least n / 4 successes".

Theorem: Quicksort sorts n numbers in O(n * log n) time, with high probability.

Proof: We are considering the depth of the recursion tree. Because at every level at most O(n) work is performed, bounding this depth to O(log n) would give the result.

Consider the depth of the branch of the recursion tree in which a specific element x lies. At every level the size of the subset to which x belongs is reduced, and the recursion ends when the size of this subset has been reduced to 1, consisting of x itself only.

The central idea of this proof (and of variants of it) is that of a successful split: a split is successful if the size of the subset is of x reduced to 3/4 of its original size or less. This happens with probability at least 1/2.

Claim: the probability on a successful split is at least 1/2. The set in which x comes is not larger than the largest of S_< and S_>. In general, Prob[max{X, Y} >= c] = Prob[X >= c or Y >= c] <= Prob[X >= c] + Prob[Y >= c]. Using this, the claim follows from the fact that Prob[|S_<| >= 3/4 * |S|] = Prob[|S_> >= 3/4 * S] < 1/4.

A sequence of at most log_{4/3} n successful splits (intermixed with any number of non-successful splits) results in a set of size one. The chance that a split is successful is independent of any of the previous actions. Thus, we can use the Chernoff bounds to state that we need only marginally more than 2 * log_{4/3} n = O(log n) splits to guarantee that with high probability log_{4/3} n of them are successful. Applying our weaker lemma we conclude that from among 4 * log_{4/3} n splits at least log_{4/3} n are successful with high probability.

This argument is correct but not complete. One still should address the following point: we have now shown that the branch with element x as a leaf does not lie deeper than O(log n) with high probability. However, the recursion tree has n leafs and we must prove that all of them lie at depth bounded by O(log n). Thus, we have n experiments (which are not independent). Here we use that

Prob[a or b] = Prob[a] + Prob[b] - Prob[a and b] <= Prob[a] + Prob[b].
This holds even when a and b are dependent! Using this observation, we obtain for a set of n equally likely events a_1, ..., a_n:
Prob[a_1 and .. and a_n] = 1 - Prob[not (a_1 and ... and a_n)] = 1 - Prob[(not a_1) or ... or (not a_n)] <= 1 - n * Prob[not a_1].
Thus, if we can bound the non-success probability of a single experiment to n^{- (1 + eps)}, for some eps > 0, then we can guarantee that with high probability even n of these experiments end successfully. This last condition is satisfied in our case (end hardly ever a problem, it should just be remarked) because 0.91^{4 * log_{4/3} n} ~= 0.91 * {9.63 * log_2 n} ~= 0.40^{log_2 n} ~= n^{-1.31}. End

Independent-Set Selection

We consider another problem. In a graph a subset of the nodes S is called independent if no two elements in S are connected by an edge. The construction of large independent sets is an important subroutine in many graph algorithms. Such independent sets can often be eliminated and if one repeatedly succeeds in reducing the size of the graph by a constant factor, then the whole algorithm will be efficient (typically linear in the size of the graph). An important special case of this problem is that of computing independent sets on lists: an acyclic directed connected graph in which every node has in- and outdegree 0 or 1.

Maximal Independent Set in a List

Let n be the number of nodes of the list. Of course a maximum independent set (of cardinality n / 2) can be constructed in linear time by starting at the first node and then alternatingly select and unselect the nodes. If the first node is not specified from the beginning, it can be found in linear time by determining the indegree of the nodes: the node with indegree 0 is the first node of the list. However, in special contexts (parallel and external computing) we would like to make decisions locally. There is a deterministic algorithm for this, but the following randomized algorithm is much simpler:
  /* for node i, succ[i] denotes the next element in the list */
  For (each node i)
    randomly and uniformly pick x[i] from {0, 1};
  For (each node i) 
    if (x[i] == 0 and x[succ[i]] == 1)
      y[i] = 1;
    else 
      y[i] = 0;
We claim that the set S of nodes with y[i] == 1 is an independent set. The correctness of this claim is obvious: if y[i] == 1, then x[i] == 0, and thus, for the node i' with succ[i'] = i, x[succ[i']] = 0, which implies that y[i'] == 0. Also, if y[i] == 1, then for the node i'' = succ[i], we must have x[i''] == 1, which implies that y[i''] = 0. So, if i in S, then neither its predecessor nor its successor belong to S.

How large is the selected independent set S? The expected value is trivial to compute: for any element, the probability of being selected depends only on the choice of x[i] and x[succ[i]]. In one out of four cases these values are such that y[i] == 1. Each case is equally likely, so Prob[y[i] == 1] = 1/4. Using linearity of expected values, we get:

Exp[sum_i y[i]] = sum_i Exp[y_i] = n * (0 * 3/4 + 1 * 1/4) = n / 4.
In an implementation we should treat the final node in a special way: the final node i sets y[i] = 1 iff x[i] = 0. So, for this node Prob[y[i] = 1] = 1/2. In the above and following analysis we ignore this small difference.

What can we prove with high probability? This is not so easy. The problem is that the x[i] are dependent (no two consecutive ...). Thus, we cannot immediately use the Chernoff bounds as we would like to do. There are two solutions.

The simplest, but nice, idea, is to concentrate on a smaller set of results that are independent. Let Z_j be the random variable giving the value x[i], where i is the index of the j-th element in the list. Z_j depends on Z_{j + 1}, but not on Z_{j + 2}. More generally, the set of values Z_{2 * k}, 0 <= z < n / 2, are independent. This is true, because Z_{2 * k} only depends on the choices x[i] and x[i'] where i is the node with rank 2 * k and i' is the node with rank 2 * k + 1. Thus, now Chernoff bounds immediately give that sum_j Z_j > n / 8 - O(sqrt{n * log n}), with high probability.

This is quite nice, but we have lost a factor 2 here. Instead of n / 4 we are proving that it is actually not much less than n / 8! Not very satisfactory, because experiments show that the actual value lies close to n / 4. To prove this, we need a more powerful canon:

Lemma (Azuma Inequalities): Let X_1, ..., X_n be independent random variables. For each i, X_i takes values in a set A_i. Let f: prod_i A_i -> real_numbers be a measurable function satisfying |f(x) - f(y)| <= c, when x and y differ only in a single coordinate. Let Z be the random variable f(X_1, ..., X_n). Then for any t > 0,

Prob[|Z - Exp[Z]| > t] <= 2 * e^{-2 * t^2 / (c^2 * n)}.

In our case the X_i are the random variables giving the values x[i], and f gives the size of the resulting independent set. If one decision is changed, then this increases/decreases f by at most 1. So, c = 1. Once we have shown this, all is as easy as before: take t = sqrt{n * log n} to show that the deviation from the expected value of Z, which is n / 4, is very small.

Min-Cut

A cut through a graph G = (V, E) with n nodes and m edges is a set of edges which when removed lead to the graph being disconnected. There is here no requirement of the resulting components having the same size, so a correct cut is for example given by removing all edges incident to any particular node. An interesting question (in a restricted form this has to do with network flow) is the size (that is the number of edges) of the minimum cut. This does not need to be the set of edges incident on a node: all nodes may have high degree, but nevertheless the graph may be hardly connected. In our case we consider undirected multigraphs (that is, allowing more than one edge to run between a pair of nodes). This is not an easy problem even though it can be solved in polynomial time (a trivial but inefficient method is to compute maxflows between any pair of nodes).

Mincuts in graph with 12 nodes

There is a very simple randomized algorithm. The idea is to repeatedly pick an edge and perform a contraction along this edge. Contraction means that the two nodes u and v at the endpoints are replaced by one new node, which has edges to all nodes to which u and v had edges before, except for the edges running between u and v themselves. This process is repeated until there are only two nodes left. A cut is given by all edges (in their original form) that are running between these two nodes. Is this a minimum cut? This depends on the choices of the edges made. It is easy to find examples for which this is not a minimum cut. On the other hand, is the algorithm not too bad: there are relatively few edges in the minimum cut, and therefore it is relatively likely that none of these will be picked until the end of the procedure. This we will quantify.

In the following we assume that the edge to contract along is picked uniformly at random, that is, if the graph currently has m' edges, then any of them is picked with probability 1 / m'. Let k be the size of the mincut. Then we must have m >= k * n / 2, because all nodes must have degree at least k in this case. The mincut is denoted by C. Let E_i designate the event of not picking any of the edges in step i, 1 <= i <= n - 2. Prob[E_1] >= 1 - k / (k * n / 2) = 1 - 2 / n. Assuming E_1, we have n - 1 nodes with degree at least k when making the second choice. This gives Prob[E_2| E_1] >= 1 - k / (k * (n - 1) / 2) 1 - 2 / (n - 1). More generally, Prob[E_i|and_{j = 1}^{i - 1} E_j] = 1 - k / (k * (n - i + 1) / 2) = 1 - 2 / (n - i + 1). Using the relation on rewriting a multiple-and as a product of conditional probabilities, we get

  Prob[and_{i = 1}^{n - 2} E_i] = 
  prod_{i = 1}^{n - 2} Prob[E_i|and_{j = 1}^{i - 1} E_j] =
  prod_{i = 1}^{n - 2} (1 - 2 / (n - i + 1)) = 
  2 / (n * (n - 1)) > 2 / n^2.

So, we have some (unfortunately rather small) probability of indeed finding the mincut in this way. Of course a probability of 1 - 2 / n^2 of finding a wrong answer is not very satisfying. However, by repeating the experiment and finally outputting the smallest cut we were finding, the success probability can be made arbitrarily close to 1. The idea is that if we make the random choices in each attempt independently of any previous choices, the error probabilities are independent and thus, when F_j denotes the event of erring in experiment j, we have

Prob[and_{j = 1}^t F_j] = Prod_{j = 1}^t Prob[F_j] = (1 - 2 / n^2)^j <= e^{-2 * t / n^2}.
For t = n^2 * log n success is given with high probability. This is not yet very efficient, but it clearly shows how unexpectedly simple randomized algorithms can solve hard problems. It also shows the idea of boosting the success probability by repeating the experiment.

Types of Results

Binary Planar Partitions

The problem in this section is considered in order to show the power of the "linearity of expected values", and also provides a first example of the randomized method. The problem comes from the area of geometric algorithms.

A binary planar partition consists of a special kind of binary tree: each internal node corresponds with a line and a polygonal region of the plane. The region of the root is the whole plane. The regions of the children u_1 and u_2 of an internal node u are the parts of the region of u that lie at either side of the line of u. In the regions of the leafs we find at most a part of a single object for a set of n graphical objects in total. In our case we assume that the objects are line segments (any more complex shape can be approximated by a set of line segments connected in their endpoints).

Binary planar partition for 9 segments with 10 lines

The fun with having a binary planar partition is that it allows to efficiently render a scene. The task is to draw a scene from a given, but non-fixed, viewpoint. Objects that lie behind others should not be drawn. A correct method to do this is to first draw the objects that lie furthest away, and then repaint the parts of these that are overlapped by closer-by objects. The binary planar partition can help us to do this in an efficient way: starting at the root, we look which of the two regions determined by the line of the root is closer by. Then we first process the further region recursively, before recursively processing the closer region. The recursion terminates at the leafs: the single object can be drawn. This method requires time proportional to the size of the tree. The depth is irrelevant. The size may be larger or smaller depending on how many segments are cut: for every cut there is one more leaf and one more internal node.

Clearly the problem of finding optimum binary planar partitions is very hard, it is not even easy to determine how many possibilities one might have for drawing the first line. Nevertheless there is a simple randomized algorithm, with good expected performance. In the following we only consider auto partitions: partitions defined by lines running through the segment. The line itself is assigned to the region containing fewest segments (if both have the same number any assignment is ok).

The segments are indexed from 1 to n: u_1, ..., u_n. The line through segment u is denoted l_u. The algorithm is almost trivial:

  1. Construct a random permutation pi of {1, 2, ..., n}.
  2. While a region contains more than one segment, cut it 
     with l_{u_i}, where i is the smallest index in the 
     ordering of pi so that l_{u_i} cuts the region.
It is not obvious how to implement this method efficiently, but in any case it can be done in O(n^2), and implementation is not the main concern now. It is more interesting to study the size of the partition.

Theorem The expected size of the constructed autopartition is bounded by O(n * log n).

Proof: Let index(u, v) = number of segments the line l_u cuts before hitting segment v (segment v counted as well). If l_u does not cut segment v, then index(u, v) = infinity. u # v is the event that l_u cuts segment v in the constructed partition. This only happens if u comes before all the other i = index(u, v) segments v_1, ..., v_i in the permutation pi and before v. Thus, Prob[u # v] = 1 / (1 + index(u, v)). The number of internal nodes in the tree equals n + number_segments_cut. The expected value of number_segments_cut equals the sum of expected values of the C_{u, v}, where C_{u, v} is the indicator variable of the event u # v. An indicator variable of an event is a random variable that has value 1 if the event is true and value 0 otherwise. Exp[C_{u, v}] = Prob[u # v] = 1 / (1 + index(u, v)). Notice that for given u, in the multiset of values {index(u, v)| v != u} each value occurs at most twice. Thus, Exp[number_segments_cut] = sum_u sum_{v != u} Exp[C_{u, v}] = sum_u sum_{v != u} Prob[u # v] = sum_u sum_{v != u} 1 / (1 + index(u, v)) <= 2 * n * H_n. End.

Because this holds for all inputs, it shows that for every set of segments there must be at least one partition in which no more than 2 * n * H_n segments are cut: if a function f, based on k random choices has expected value Exp[f], then there is a set of choices x_1, ..., x_k, so that f(x_1, ..., x_k) <= Exp[f] (and also a set of choices y_1, ..., y_k, so that f(y_1, ..., y_k) >= Exp[f]).

A priori, it is not obvious at all that such a choice exists: if there is a whole bunch of segments, one might just as well believe that it is inevitable that one cuts many of them, maybe constructing a partition that cuts Omega(n^2) segments. Here we have used a fact about an expected value to derive an existence proof. This is what is called the randomized method. This proof is completely non-constructive. We can only hope to rapidly find a good example, but there is absolutely no guarantee: it may be so that most constructed partitions are slightly larger than the expected value while a few are much smaller.

Exercises

  1. Consider the experiment of throwing two dices. Let X be the random variable giving the maximum of the two outcomes. Each dice independently assumes any of the values 1, ..., 6 with probability 1 / 6.
    1. Give all values of X, the probability on each of these values, compute the expected value and the variance of X.
    2. Perform the same computations for the random variable X' giving the maximum of the outcomes on three dices.
    3. Now consider dices with n faces, having values 1, ..., n. Each face comes up with equal probability 1 / n. Compute the expected value of the maximum when throwing 2 and 3 dices, respectively. The values do not have to be exact, we are mainly interested in the leading term, including constant, for n going to infinity.

  2. For three or more random variables the independence notion can be generalized in several ways. Here we assume there are three random variables X, Y and Z. These are said to be independent if Prob[X = x and Y = y and Z = z] = Prob[X = x] * Prob[Y = y] * Prob[Z = z], for all x, y and z. They are said to be pairwise independent, if each pair of two of these random variables is independent.
    1. Prove that if X, Y and Z are independent, that they are also pairwise independent.
    2. Let X give the result, 0 or 1, of a fair coin toss and Y of another such coin toss. Let Z = 1 if X = 0 and Y = 0 or X = 1 and Y = 1. Otherwise Z = 0. Prove that these X, Y and Z are pairwise independent.
    3. Show that the above defined X, Y and Z are not independent.

  3. Prove that for n random variables X_i, 1 <= i <= n, Exp[sum_{i = 1}^n X_i] = sum_{i = 1}^n Exp[X_i], using that for two random variables X and Y, Exp[X + Y] = Exp[X] + Exp[Y].

  4. Prove that Exp[X * Y] = Exp[X] * Exp[Y] for any two independent random variables X and Y.

  5. Prove the following two results for two random variables X and Y. The results are most useful if X and Y are dependent. In that case they may be used to solve a general problem by analyzing special cases.
    1. Prob[X = x] = sum_{all possible values y} Prob[X = i| Y = y] * Prob[Y = y].
    2. Exp[X] = sum_{all possible values y} Exp[X| Y = y] * Prob[Y = y].

  6. Analyze the expected number of comparisons performed by the quicksort algorithm more carefully. The goal is to accurately estimate the constant factor c in the expression c * n * log_2 n.

  7. Implement the randomized algorithm for determining a minimum cut in a graph. The graph to study is generated randomly. Study the size of the minimum cut as a function of the number of nodes n and the number of edges m. The edges are created by selecting two node indices independently. Multiple edges and self loops do not have to be removed. Give a theoretical analysis of the expected size of the minimum cut as a function of n and m. For large n, this is simpler than you may expect.

  8. Consider autopartitions of a set of n line segments.
    1. Apply the algorithm to the example with 9 segments shown in the picture above. Construct the corresponding tree and use this tree to systematically compute the visible subsegments for an observer positioned at the last letter of the title of the picture.
    2. Prove an upper bound on the size of the corresponding tree.
    3. Suggest a simple algorithm and prove an upper bound on its time consumption.
    4. Present an improved algorithm which has better expected running time.
    5. Analyze the expected running time of this algorithm.





Game-Theoretic Techniques

Game Tree Evaluation

Problem and Algorithm

A game tree is a tree with alternating levels of maximization and minimization. We only consider binary trees with values 0 and 1. That is, they may be viewed as booleans with levels of "and" and "or" operations. Our cost measure is the number of leafs for which the value is tested. By an adversary argument one can easily see that any deterministic algorithm needs to evaluate all 2^{2 * k} = 4^k for certain trees of depth 2 * k: on an and-level, the first evaluated value always gives 1, on an or-level, the first evaluated value always gives 0: never a truncation can be made. This problem is an amazingly simple example of a problem that can be solved provably better by a randomized algorithm: we show that 3^k is an upper-bound for the expected number of leafs for which the value must be checked. The reason that it is possible at all that we can achieve such an improvement is that the adversary argument does not work any longer: knowing a deterministic algorithm, one can construct a bad input; but knowing a randomized algorithm still does not allow to construct an input with large expected running time. The algorithm is trivial: to evaluate a node, pick randomly and uniformly one of its two children and evaluate this one. If the node is an or-node and this child returns a 1, then return 1, else evaluate the other child as well and return the returned value. If the node is an and-node, then return 0 if the first child returns 0, else evaluate the second child and return its return value. The power lies in the picking the child to evaluate at random. How good is this strategy?

Game Tree Transformation

Basic Result

If an or-node has return value 1, then at least 1 of its children returns 1, and the probability that it is evaluated first is 50%. Thus, in this case the expected number of children for which the evaluation is performed is 3/2 or less (if both children have return value 1, the number of children to evaluate equals 1). If the return value of an or node is 1, then both children have to be evaluated. For an and-node the situation is reversed: if the return value is 0, then the expected number of children to evaluate is 3/2, if it is 1, both children must be evaluated. This does not yet give us the saving we are looking for. The trick is to look two levels deep: An or-node with return value 0 has two and-nodes returning 0 as children. Thus, the expected number of grand-children to evaluate is at most 3. An or-node with return value 1 has to evaluate 1 or 3/2 children. In the first case the expected number of grand children equals 2, in the second case at most 3 1/2. The expected number of grand children to evaluate is thus at most 11 / 4. For and-nodes we get a similar situation. Thus, any 2 levels of the tree contribute a factor 3. The formal proof of this goes by induction: assume that the expected number of nodes to evaluate for a tree of depth 2 * k, denoted T_k is at most 3^k (which has more or less been proven for the base case k = 1 and which is clear for the case k = 0 anyway), then T_k <= 3 * T_{k - 1} = 3^k, because we have argued that the expected number of grandchildren to evaluate is at most 3. Everywhere we are using linearity of expectation.

Better Result

If we set n = 4^k, then 3^k = n^0.793, which is indeed considerably less. This is a Las Vegas algorithm: always correct, the only uncertainty lies in the running time. An interesting question is whether it pays to look deeper than two levels. Maybe we only have to evaluate ast most an expected 8 of the 16 children four levels down. Is this possible? To study this problem, we define four recurrency relations: or0, or1, and0 and and1, giving the maximum expected number of leafs to evaluate when performing the randomized choice of the branches. The following rules apply:
or0[k] = 2.0 * and0[k - 1],
or1[k] = 0.5 * and0[k - 1] + 1.0 * and1[k - 1],
and0[k] = 1.0 * or0[k - 1] + 0.5 * or1[k - 1],
and1[k] = 2.0 * or1[k - 1].
Substitution gives
or0[k] = 2.00 * or0[k - 2] + 1.00 * or1[k - 2],
or1[k] = 0.50 * or0[k - 2] + 2.25 * or1[k - 2].
In addition we have or0[0] = or1[0] = 1. Some values: or0[2] = 3, or1[2] = 2.75, or0[4] = 8.75, or1[4] = 7.1875, or0[6] = 24.6825, or1[6] =

For normal single recurrence relations there is the method using the characteristic polynomial to find a solution. In our case we have two interdependent relations, which makes things harder. The easiest is to guess a general form of a solution and then to proof the correctness of this by induction. Here we guess that or0[k] ~= c0 * a^k, while we guess that or1[k] ~= c1 * a^k. This is motivated by the observation that or0 and or1 essentially must grow equally fast, because or0[k] > or1[k - 2] and or1[k] > 0.50 * or0[k - 2]. We also know that such recurrence relations generally grow exponentially (by induction it is easy to prove that both are at least 2^{k / 2}). Now we want to find the constant a. Substituting and dividing by a^{k - 2} gives

c0 * a^2 = 2.00 * c_0 + 1.00 * c_1
c1 * a^2 = 0.50 * c_0 + 2.25 * c_1.
We have three unknowns and only two relations. However, we mainly want to know a, we are not really interested in the constants c0 and c1. We will therefore determine their quotient b = c1 / c0. The relations become
a = 2.00 + 1.00 * b
a = 0.50 / b + 2.25.
Subtraction gives
4 * b^2 - b - 2 = 0.
This gives b = (1 + sqrt{33}) / 8 ~= 0.843. Substitution gives a = 2.843, instead of the estimate 3 we had before. The algorithm has not improved, but we know better how good it is: prove that
or0[2 * l] = c * (1.000 + o(1)) * 2.843^l,
or1[2 * l] = c * (0.843 + o(1)) * 2.843^l.
Here c is a constant which can be approximated by considering the value of or0[2 * l] for sufficiently large l. Except for lower-order terms this result is sharp, so the number of evaluations for any of the four cases for a tree of depth 2 * l is given by Theta(2.843^l).

Exact Analysis

The above method leads to a solution in an easy way but we have not formally proven that it is correct. By induction it is possible to prove relations of the form or0[2 * l] <= c_0 * 2.843^l - c_1 * 2^l, for suitable c_0 and c_1. However, there is a more fundamental way to proceed, leading to an exact expression.

Let x_l = or0[2 * l] and y_l = or1[2 * l], then we have the following intertwinned recurrence relation:

  x_l = 2   * x_{l - 1} + 1     * y_{l - 1}, for all l > 0,
  x_0 = 1,

  y_l = 1/2 * x_{l - 1} + 2 1/4 * y_{l - 1}, for all l > 0,
  y_0 = 1.

This recurrence relation can also be written in matrix form. Let A = (a_ij) be given by a_00 = 2, a_01 = 1, a_10 = 1/2 and a_11 = 2 1/4, and let v^(l) be the vector with entries x_l and y_l, then v^(l) = A * v^(l - 1) = A^l * v^(0). In general a 2 x 2 matrix has two eigenvalues c_0 and c_1 and two eigenvectors e_0 and e_1. If we are not too unlucky, e_0 and e_1 are independent and thus can they be used as a basis. That is, v^(0) can be written as a linear combination of e_0 and e_1: v^(0) = d_0 * e_0 + d_1 * e_1 for some constants d_0 and d_1. Now, using induction, it follows immediately that v^(l) = A * A^{l - 1} * v^(0) = A * (d_0 * c_0^{l - 1} * e_0 + d_1 * c_1^{l - 1} * e_1) = d_0 * c_0^l * e_0 + d_1 * c_1^l * e_1.

The sketched solution method can be used for any system of intertwinned recurrence relations as long as this system is linear. Of course, if there are n recurrence relations, the solution involves finding the eigenvalues of an n x n matrix which involves solving a polynomial of degree n. For n > 2 this is not an easy task. It also requires a lot of work.

In our concrete case, we look for solutions to the following system of equations:

  2 + y = c,
  1/2 + 2 1/4 * y = c * y
Substituting y = c - 2 in the second equation gives c^2 - 4 1/4 * c + 4 = 0. The solutions are c_0 = (17 + sqrt(33)) / 8 ~= 2.843 and c_1 = (17 - sqrt(33)) / 8 ~= 1.407.

In this very simple case, there is no need to compute the eigenvectors explicitly. Knowing the form of the solution, the coefficients can also be computed directly, using that if x_l = alpha * c_0^l + beta * c_1^l and y_l = gamma * c_0^l + delta * c_1^l, for all l >= 0, in particular we must have

  alpha + beta  = 1,  
  gamma + delta = 1, 
  alpha * c_0 + beta  * c_1 = 3,
  gamma * c_0 + delta * c_1 = 2.75.
From these it immediately follows that
  alpha = (3    -  c_1) / (c_0 - c1) = 1/2 + 7/66 * sqrt(33) ~=  1.109,
  beta  = (c_0  -    3) / (c_0 - c1) = 1/2 - 7/66 * sqrt(33) ~= -0.109,
  gamma = (2.75 -  c_1) / (c_0 - c1) = 1/2 + 5/66 * sqrt(33) ~=  0.935,
  delta = (c_0  - 2.75) / (c_0 - c1) = 1/2 - 5/66 * sqrt(33) ~=  0.065.
So, after substitution, we conclude that for all l,
    or0[2 * l]  = (1/2 + 7/66 * sqrt(33)) * ((17 + sqrt(33)) / 8)^l
                + (1/2 - 7/66 * sqrt(33)) * ((17 - sqrt(33)) / 8)^l,
               ~= 1.109 * 2.843^l - 0.109 * 1.407^l,
    or1[2 * l]  = (1/2 + 5/66 * sqrt(33)) * ((17 + sqrt(33)) / 8)^l
                + (1/2 - 5/66 * sqrt(33)) * ((17 - sqrt(33)) / 8)^l
               ~= 0.935 * 2.843^l + 0.065 * 1.407^l.
Just as for the Fibonacci numbers, we have an expression involving square roots which nevertheless for all integral values of l gives integral values.

The Minimax Principle

Yao's Minimax Principle

We consider a problem Pi, a finite set of inputs II and a finite set of algorithms AA. For input I in II and A in AA let C(I, A) denote the cost for algorithm A on input I. Let p give a probability distribution on II and q on AA. Let I_p be an input chosen according to p and A_q an algorithm chosen according to q, then
max_p min_q Exp[C(I_p, A_q)] = min_q max_p Exp[C(I_p, A_q)]
max_p min_{A in AA} Exp[C(I_p, A)] = min_q max_{I in II} Exp[C(I, A_q)]
These are reformulations of von Neumann's and Loomi's theorems which will be presented hereafter. There it will become clear that actually all four expressions give the same value. The equality min_q max_p Exp[C(I_p, A_q)] = min_q max_{I in II} Exp[C(I, A_q)] can be understood as "there is no need to randomize both algorithm and input".

Leaving away the maximum on the left-hand side and the minimum on the right-hand side, leads to the following inequality, which is known as Yao's minimax principle. For any distribution p on the inputs and q on the algorithms,

min_{A in AA} Exp[C(I_p, A)] <= max_{I in II} Exp[C(I, A_q)].
In words this states that for any distribution on the inputs the expected cost for the best algorithm from AA provides a lower bound on the worst-case expected cost when the algorithms are sampled according to q. Because randomly sampling algorithms from a class of deterministic algorithms corresponds to a randomized algorithm that develops according to the random choices that are made in the course of the program, this provides a lower-bound for randomized algorithms, because we can take AA to be the set of all algorithms. The issue of AA being finite is not really a problem: we can concentrate on algorithms that run for at most O(f(n)) time for an input of size n: there are only finitely many of them.

When applying this lower bound, the freedom lies in the choice of p: if we find a class of inputs for which there is no good deterministic algorithm, then we have found a strong lower bound on the performance of randomized algorithms. Of course, if we choose II and p in a non-clever way, then the algorithm A might be tuned to perform well on this class.

Lower Bound for Randomized Game-Tree Evaluation

In the case of the evaluation of game trees, we were considering before, the class is simple: all terminal values are chosen randomly and independently 0 or 1, only the probability with which they are chosen to become 1 is somewhat special.

To make the following description easier, we first reduce the tree with and and or nodes to a tree with nor nodes only. In this replacement, the values originally computed in the or levels are getting inverted value. The values originally computed at the and levels are the same. If the bottom level consists of ands, all values in the terminals must be inverted. By induction, using that a or b = !(a nor b); a and b = (!a) nor (!b), one can show that this replacement computes the same value if the root node was an and, and the opposite value if the root node was an or.

Game Tree Transformation

For a network of depth 2 * l, the input class II consists now of all 2^{4^l} possible inputs. Each value is 1 with probability x = (3 - sqrt{5}) / 2. x was just chosen so that if the two entries of a nor port are 1 with probability x independently of each other, that then the probability that the exit has value 1 equals (1 - x)^2 = x. So, the probability that any node has value 1 equals x. Now one can show that it is correct to concentrate on algorithms that evaluate the tree depth-first. This is correct in the sense that if W(T) denotes the minimum expected time to evaluate a tree T in which all leafs are set to 0 or 1 independently with fixed probability, that then there is also a depth-first algorithm achieving an expected value W(T).

For depth-first algorithms, the time is simply given by the formula we had before for evaluating the upper bound:

W(k) = W(k - 1) + (1 - x) * W(k - 1).
That is, W(k) = (2 - x) * W(k - 1) ~= 1.618 * W(k - 1). This gives W(2 * l) ~= 2.618^l. This does not match our computed upper bound! The reason is that our probability distribution over the class of inputs was not chosen optimally: it is a waste to have any nodes in which both inputs have value 1. So, the values at the terminals should be chosen 0 and 1, but not independently. This analysis shows that the algorithm is optimal and is therefore important, but it is even more important that our simple analysis already could show that one does not need to look for randomized algorithms evaluating only 2^l leafs or less.

Understanding Yao's Principle

We first look at two-person zero-sum games. The standard example is the scissors/paper/stone game, in which two players, R and C, each make a choice and then get points according to a table which has one -1, one 0 and one 1 in each row and column, giving the pay-off for player R. The table is called the pay-off matrix. If R gains x units, -1 <= x <= 1, C gains -x units. This game is called a zero-sum game, because the sum of all gains and losses over all rounds and all players is zero. R aims at maximizing its score. C aims at minimizing the score of R. The score that R certainly can achieve is V_R = max_i min_j M{i j}, where M_{i, j} gives the score when R chooses i and C chooses j. The score that C at most looses V_C = min_j max_i M_{i j}. In general V_R <= V_C. This holds for any matrix: the maximum of the row minima is not larger than the minimum of the column maxima.

Zero-Sum Game

When V_R = V_C, then it is said that the game has a solution. This solution holds in the sense that a player who deviates from this choice looses when the other player is consequent.

Game with Solution

Deterministic strategies are called pure. If the game has no solution, then knowledge about the strategy of the opponent allows to improve the payoff. Randomized strategies are called mixed. For deterministic strategies, there is in general no solution. However, if the choices are made randomly, each choice being made with a certain probability, then we get a vector p and a vector q for the other player, and the expected payoff equals p * M * q. Now

V_R = max_p min_q p * M * q,
V_C = min_q max_p p * M * q.

Theorem: V_R = V_C.

This is known as von Neumann's minimax theorem. It is a more general formulation of Loomi's theorem, which is equivalent:

Theorem: max_p min_j p * M * e_j = min_q max_i e_i * M * q.

Proof: Once p has been fixed, p * M * q is a linear function of q. A linear function with a vector of unit length assumes its maximum value along one of the coordinate axis. So, for all p, min_q p * M * q = min_j p * M * e_j. Analogously, for all q, max_p p * M * q = max_i e_i * M * q. Thus, max_p min_j p * M * e_j = max_p min_q p * M * q = V_R = V_C = min_q max_p p * M * q = in_q max_i e_i * M * q. End.

Corollary: min_j p * M * e_j <= max_i e_i * M * q, for all vectors p and q of unit length.

Now the column player is viewed as the algorithm designer and the row player as the adversary choosing the input. The pay-off is given by the cost for a specific algorithm on a specific input. The algorithms considered must be correct and terminating, to make everything finite. V_C, using the first definition, gives the best achievable worst-case running time of a deterministic algorithm. A mixed strategy for the column player C corresponds to an always correct randomized algorithm. Notice that hence the obtained lower-bounds do not apply to Monte-Carlo algorithms, they only concern Las Vegas algorithms!

Derandomization

Adleman's Theorem

A central question in the theory on randomized algorithms is whether the randomization is essential or not. Sometimes one can prove that randomization contributes something that cannot be matched by a deterministic algorithm: in the case of game tree evaluation, the expected time for a randomized algorithm was less than the worst-case running time of a deterministic algorithm. One might even show that the randomized algorithm is better with high probability. Of course, if a randomized algorithm has better worst-case time, then a new better deterministic algorithm is obtained by just taking any choice.

In the following we show that in a very weak sense most randomized algorithms can be made deterministic, but only in an existential way: there is no way to actually perform this derandomization. The argument, which is due to Adleman, is nevertheless very beautiful. We consider (randomized) Boolean circuits. A Boolean circuit can model the operations of an algorithm.

A boolean circuit is a directed acyclic graph in which the nodes have indegree 0, 1 or 2. The nodes are labeled INPUT, NOT, AND, OR and OUTPUT. The n input nodes have indegree 0. The boolean nodes have indegree 1 or 2 according to their type. The input nodes get assigned boolean values x_1, ..., x_n (in the following 0 will be used as a short notation for false and 1 for true). There is a single output node, which has outdegree 0. The value resulting at the output node is a function of the x_i. We say that the circuit computes this function. The size of the circuit is the number of nodes in it. In addition to all this, a randomized circuit has random inputs r_i, which get assigned a boolean value independently and equiprobably. A randomized circuit is said to compute a function f if the following two conditions are satisfied:

An m-tuple (r_1, ..., r_m), for which f_n(x_1, ..., x_n) = 1 is called a witness for the input (x_1, ..., x_n): it testifies the correct value of f_n(x_1, ..., x_n) when it is 1. By the above definition there are only positive witnesses. The second point implies that for inputs (x_1, ..., x_n) for which f_n(x_1, ..., x_n) = 1 at least half of all 2^m choices of the (r_1, ..., r_m) are witnesses.

Let f = f_1, f_2, ... be a (family of) function(s), with f_n: {0, 1}^n -> {0, 1}. A family of circuits C = C_1, C_2, ... is a circuit family for f if C_n is a circuit for f_n, for all i. This circuit family is said to be polynomially sized, if |C(n)| = O(n^k), for some constant k. C is a randomized circuit family for f, if the C_n are randomized circuits and compute the functions f_n in the above defined sense.

Theorem (Adleman's Theorem): If a family of boolean functions {f_1, f_2, ...}, has a randomized polynomial-sized circuit family {C_1, C_2, ...}, then it has a polynomial-sized circuit family.

Proof: For each n we will construct a deterministic circuit D_n which is larger than C_n by a factor which is polynomial in n. Form a matrix with 2^n rows and 2^m columns. The rows correspond to the choices of the inputs, the columns to the possible random choices. The entries give the values of f_n for the inputs x_1, ..., x_n, in combination with the values of r_1, ..., r_m. All rows for which f_n evaluates to 0 are deleted. The remaining rows have 1 in at least half the positions. So, at least half of all entries in the remaining matrix is 1. Thus, there must be a column in which at least half of all entries is 1. Construct a new circuit with these m choices fixed. Eliminate all rows for which this function computes 1. The remaining rows again have at least half of their entries equal to 1. Thus, again there must be column in which at least half of the remaining rows have a 1. Construct the corresponding circuit as well. Continue until all rows are eliminated. The final circuit D_n computes the or of all newly constructed circuits. Because each step reduces the number of rows by a factor 2, we need at most n circuits and therefore is the size of this circuit at most n times larger than that of C_n (plus a few gates for computing the or function of at most n values). End.

The problem with this method is that the constructed circuit works for a specific n, but it is not necessarily so that for n + 1 the circuit looks similar. More formally: the constructed deterministic algorithm is non-uniform in the following sense. a() is a function from the positive integers to the strings. An algorithm is said to use a() as advice, if for an input of size n it uses a(n) as advice. The important point is that it uses the same advice for all inputs. This advice is used for fine tuning the algorithm for this specific n. A uniform algorithm uses no advice. By a correspondence between polynomial-time algorithms and polynomial-sized circuit families, Adleman's theorem implies that the class of randomized polynomial-time algorithms is a subclass of the deterministic non-uniform polynomial-time algorithms. This is interesting but not practical.

Primality Testing

As an illustration of the above technique we consider primality testing. The treatment, however, goes beyond being merely an illustration: once we start considering this highly interesting and extremely important topic, we treat it completely.

One of the most famous theorems from number theory is Fermat's little theorem (formulated 1640):

a^{n - 1} mod n = 1, for any prime n and any number a, 1 <= a < n.
For example, 1^6 = 1 = 1 + 0 * 7, 2^6 = 64 = 1 + 9 * 7, 3^6 = 729 = 1 + 104 * 7, 4^6 = 4096 = 1 + 585 * 7, 5^6 = 15625 = 1 + 2232 * 7, 6^6 = 46656 = 1 + 6665 * 7. Furthermore, for any composite number n, there are numbers a, 1 <= a < n, so that a^(n - 1) mod n > 1. The smallest composite number n for which there is an a with a^{n - 1} mod n = 1, is 15: 4^14 mod 15 = (4^7 mod 15)^2 mod 15 = 4^2 mod 15 = 1. Also 11^14 mod 15 = (11^7 mod 15)^2 mod 15 = 11^2 mod 15 = 1. For all other a, 2 <= a <= 13, a^14 mod 15 > 1. More generally, for most composite numbers n there are many a so that a^(n - 1) mod n > 1. In any case this holds for all a with ggd(a, n) > 1, so for all a which have a common factor with n.

Fermat's theorem and the partial reversal of it lead to a prime test which tests for primality without factorization. A simple randomized algorithm tests for primality as follows:

  int expoMod(int a, int m, int n)
  {
    int c, z;
    for (c = a, z = 1; m != 0; m = m >> 1)
    {
      if (m & 1) /* m is odd */
      {
        z *= c;
        z %= n;
      }
      c *= c;
      c %= n;
    }
    return z;
  }
  
  boolean isWitness(int n, int a)
  {
    if ((n & 1) == 0) // n is even
      return n == 2;
    return expoMod(a, n - 1, n) == 1;
  }

  boolean isPrime(int n) 
  {
    final int ITS = 20;
    int its = 0; 
    while (its < ITS && isWitness(n, random.nextInt(2, n - 2)) 
      its++;
    return its == ITS;
  }
As long as all operations can be performed in constant time, this algorithm runs in O(ITS * log n). More generally, expoMod performs O(log m) products of numbers which are of the length of n. So, the cost is O(ITS * log m) * T_prod(log_b n), where b is the basis with respect to which the number n is written (on a system with 32-bit arithmetic, b = 2^16 is a handy choice). If n is a prime number, then isPrime(n) will return true. For most non-primes it will return false. Considering all primes and all possible choices of witnesses for them, it turns out that there is only a very small fraction of false-witnesses. So, for a uniformly chosen number n and uniformly chosen numbers a, the test is correct with high probability.

Unfortunately there are composite numbers n, for which a^(n - 1) mod n = 1 for most a, 1 <= a < n. 561 is such a number: of the 559 numbers a, 2 <= a < 561, there are 318 false-witnesses. Only 241, that is less than half of them, can be used as witness of the non-primality of 561. There are large composite numbers for which the fraction of witnesses is arbitrarily small. More precisely, for any eps > 0, there are infinitely many composite numbers for which the fraction of witnesses is smaller than eps. A Monte Carlo algorithm which is correct with probability p is said to be p-correct. If an algorithm is p-correct for some p > 0, then this p can mostly be increased by repeating independent trials. However, the above fact implies that applying Fermat's test to a specified number n is not p-correct for any p > 0, and therefore repeating it a constant number of times does not help.

However, a refined test does work in the sense that the fraction of false-witnesses is bounded away from zero by a constant. For an odd number n, let k and l be the unique numbers so that 2^k * l = n - 1 where l is odd. An extension of Fermat's theorem shows that

Lemma: For any prime number n one of the following conditions is satisfied for all numbers a, 2 <= a < n - 1:

The proof is omited, but we will show that this lemma is indeed an extension of Fermat's theorem. For even numbers n, k = 0 and l = n - 1. Therefore for even n the lemma is equivalent to Fermat's theorem. For odd n, k > 0, the lemma implies Fermat's theorem: If a^l = 1 mod n, then a^{n - 1} = a^{2^k * l} = (a^l)^{2^k} = 1^{2^k} mod n = 1 mod n. If a^{2^i * l} = n - 1 mod n, then we use that n - 1 = -1 mod n. So, in this case a^{n - 1} = a^{2^k * l} = (a^{2^i * l})^{k - i} = (-1)^{k - i} mod n = 1 mod n. Here we essentially need that i < k.

The suggested better primality test, known as Rabin's test, is to test whether for n the above condition is satisfied and to return true when it holds and false otherwise. That is, in the above code fragment, the subroutine isWitness() should be replaced by the following:

  boolean isWitnessRef(int n, int a) 
  {
    int e, l, x;
    if ((n & 1) == 0) // n is even
      return n == 2;
    e = l = (n - 1) >> 1;
    while ((l & 1) == 0) // l is even
      l >>= 1;
    x = expoMod(a, l, n);
    if (x == 1)
      return true;
    while (l != e && x != n - 1)
    {
      x  *= x;
      x  %= n;
      l <<= 1;
    }
    return x == n - 1; 
  }
False-witnesses with respect to Rabin's test will be called strong false-witnesses. These are much rarer than the earlier false-witnesses. As an example we consider again the values n = 15 and a = 4 or 11. For n = 15, l = 7 and k = 1. If k = 1, it is only tested whether a^l mod n = 1 or n - 1, which is not the case: 4^7 mod 15 = 4, 11^7 mod 15 = 11.

Theorem:

This shows that the test is p-correct for some p >= 3 / 4. So, all conditions for the application of Adleman's theorem are satisfied: for all n, for testing primality of all numbers up to n, the randomized algorithm can be turned into a deterministic algorithm with at most O(log n) slow down. This is done by determining O(log n) numbers a_i which together are a complete set of witnesses for all odd composite numbers up to n.

To obtain an error probability of isPrime() of less than eps, ITS must be chosen so that 1 / 4^ITS < eps, that is, we must take ITS = log_4 (1 / eps) = log_2 (1 / eps) / 2. The suggested value ITS = 20, gives an erring probability of at most 4^{-20} ~= 10^{-12}. This is a very pessimistic estimate: for most values of n there are much fewer strong false-witnesses: 72% of the odd composite numbers up to 1000 does not have a single false witness.

How about the time consumption? As before, a single test takes a time which is dominated by computing O(log n) products of numbers of size at most n. So, in total the primality testing costs O(ITS * log n) * T_prod(log_b n). If we want an algorithm which is correct with high probability, ITS must be chosen omega(1). The time for computing the products depends on the applied algorithm. Applying the simplest algorithm, T_prod(m) = O(m^2), with Karatsuba's method, T_prod(m) = O(m^{log_2 3}), and using an algorithm based on FFT this can even be reduced to T_prod(m) = O(m * log m). Whatever method is applied, we conclude that:

Theorem: There is a Monte Carlo algorithm for testing primality in polylogarithmic time, which is correct with high probability.

Using Adleman's construction for any n, all prime numbers up to n can be determined with a deterministic algorithm: it suffices to specify a list of at most log n witnesses which together cover the whole set of non-primes. In principle a good choice of witnesses could be hard to find, but it turns out that for moderate n one can simply take the smallest prime numbers. The following table illustrates this:

n Fermat Rabin
10^5 13 (6) 5 (3)
10^6 41 (13) 5 (3)
10^7 109 (29) 5 (3)
10^8 281 (60) 7 (4)
10^9 617 (113) 7 (4)
The columns give the value of n and the corresponding largest prime number with which must be tested to assure that any number up to n is a prime number. In brackets the rank of this prime number is indicated.

So, for testing all numbers up to 10^9 on primality, it suffices to perform at most 113 Fermat tests or at most 4 Rabin tests. The above values have been computed with a program, which can be downloaded here (for values of n larger than 2^16, this program should be run on a computer with 64-bit arithmetic). It has been shown that actually for all odd composite numbers up to 10^10 either 2, 3, 5 or 7 is a witness. Adding 61 to the set of witnesses we come even much further: for all odd composite numbers up to 10^13, either 2, 3, 5, 7 or 61 is a witness. So, for these larger numbers we find a non-trivial testing set.

Considering that for most non-primes 2 is a witness, the average number of tests for these numbers is only a little more than 1. For prime numbers some more tests must be performed, but because there are only about n / (log n - 1) prime numbers up to n, this has no big impact on the average number of tests to perform. When picking the values a at random, for any composite number, the expected number of Rabin tests to perform is bounded by 1 + 1/4 + ... < 4/3. When trying to find a large prime number with value between n and 2 * n, such as needed in cryptographic applications, at most one of the tested numbers (namely the last) is a prime number. Because a fraction Theta(1 / log n) of the numbers in this range is prime, and because the number of tests for the last number is bounded by O(log n), the expected total number of tests to perform is only O(log n).

Public-Key Cryptography

Cryptography offers a domain in which large prime numbers are of central importance. Here you can find an elementary text in German on this topic.

Exercises

  1. The randomized algorithm for the game tree evaluation was only considered for the case of 0-1 values. In general the values are positive integers, and the levels of the trees are alternatingly maximizing and minimizing. The most common technique to save unnecessary work is called alpha-beta pruning, this means that in any node a child is evaluated only if this still may have an impact on the computed value. With any deterministic strategy the values at the leafs may be so that all of them have to be evaluated. Choosing the child to evaluate independently at random, the expected number of evaluated leafs in a binary tree of depth 2 * k is strictly smaller than 4^k, independently of the assignment of the values to the leafs. Compute an upper bound on the expected number of leafs to evaluate and give a value distribution for which this bound is tight.

  2. Consider algorithms for game-tree evaluation for the case of a game tree with alternating levels of and and or gates, with an and gate in the root. Show how to create a randomized circuit computing the function defined by this game tree. For a tree of depth 2, which has four leafs, show in detail how Adleman's idea can be used to obtain a deterministic algorithm. Generalize for larger depths.

  3. Apply Fermat's and Rabin's test to all odd composite numbers up to some value n using this program. Let C(n) be the number of tested numbers and let S_F(n) and S_R(n) be the sum of the number of a-values for which calls to the two witness procedures are made. Compute C(n), S_F(n), S_R(n), S_F(n) / C(n) and S_R(n) / C(n) for increasing values of n, going as far as time and hardware allows. Do these results allow to conclude that in practice both tests are almost equally good?

  4. Fermat's test should not be applied for testing primality of a given n because there are composite numbers n for which the fraction of witnesses of non-primality is arbitrarily small. However, the number of such n is very small. In cryptographic applications a number n is picked uniformly at random from a large interval and then tested on primality. Assume that in the interval from which n is picked the probability to hit a number with more than n / 2 false witnesses is less than f(n), for some function f(n) = o(1), for example f(n) = 1 / sqrt(n). Assume a_i^(n - 1) = 1 mod n for k independently selected numbers a_i, 2 <= a_i < n, 0 <= i < k. How large is the probability that n is composite? Give an upper bound in terms of f(n) and k. What are the advantages and disadvantages of this test in comparison with Rabin's?





Moments and Deviations

Basic Lemmas and Definitions

The variance Var[X] = sigma^2(X) of a random variable X is defined as Var[X] = Exp[(X - mu)^2], where mu = Exp[X]. Two random variables X and Y are independent if Prob[X = x| Y = y] = Prob[X = x], for any pair of values x and y. A set of random variables X_1, ..., X_n, is called pairwise independent if for any 0 <= i, j < n, i != j, X_i and X_j are independent. AS remarked earlier, for independent random variables X and Y, Exp[X * Y] = Exp[X] * Exp[Y], and particularly this holds for pairwise independent random variables. A well known lemma (physics students get this in their first year) is that the variance of the sum of a set of (pairwise) independent random variables equals the sum of their variances. This can be shown by induction. We only consider the case of two independent random variables X and Y. Then Exp[(X + Y - mu_X - mu_Y)^2] = Exp[X^2 + Y^2 + mu_X^2 + mu_Y^2 + 2 * X * Y - 2 X * mu_X - 2 * X * mu_Y - 2 * Y * mu_X - 2 * Y * mu_y + 2 * mu_X * mu_Y]. Using linearity of expectation, using that mu_X and mu_Y are just numbers and using that Exp[X * Y] = Exp[X] * Exp[Y], we get that this equals Exp[X^2] - mu_X^2 + Exp[Y^2] - mu_Y^2 = Exp[(X - mu_X)^2] + Exp[(Y - mu_Y)^2] = Var[X] + Var[Y].

The following two results are known as Markov's and Chebyshev's inequality, respectively:

Theorem (Markov): For a random variable X which only assumes positive values and a positive number h, Prob[X >= h * Exp[X]] <= 1 / h.

Proof: Because X is positive for all x, Exp[X] >= x * Prob[X >= x]. For x = h * Exp[X], this shows that we must have the stated condition. End.

Theorem (Chebyshev): Prob[|X - mu_X| >= h * sqrt(Var[X])] <= 1 / h^2.

Proof:

Prob[|X - mu_X| >= h * sqrt(Var[X])] = Prob[(X - mu_X)^2 >= h^2 * Var[X]]
Now apply Markov's inequality to the positive random variable Y = (X - mu_X)^2 which has expectation Var[X]. End.

In the literature sqrt(Var[X]) is known as the standard deviation of X and often written sigma_X.

Randomized Selection

Selection is the problem of picking an element with specified rank from a set of elements that can be compared by a strong order "<". This problem can be solved by sorting and then picking the element in constant time. This takes O(n * log n) time in total for a set of n elements. Deterministically, there is the famous linear-time algorithm by Blum e.a. based on the idea of recursively selecting the median of the medians. Randomizedly linear time is much easier to achieve. These algorithm are really practical in the sense that they easily outperform the sorting-based selection algorithm.

The algorithm is simple and elegant. There are many variants of the same idea. The set of elements is denoted S, the element x to select has rank k.

  1. Pick uniformly and independently at random with replacement n^{3/4} elements. This multiset is called R.
  2. Sort R.
  3. Let l = k / n^{1/4}. Let a be the element in R with rank l - sqrt{n} and let b be the element in R with rank l + sqrt{n} (special cases apply when l < sqrt{n} or l > n - sqrt{n}). Let r_a, r_b be the rank of a, b in S.
  4. If r_a <= k <= r_b and r_b - r_a <= 4 * n^{3/4} then sort the elements between a and b and return the element with rank k - r_a from this set, else repeat from the beginning again.
Except for details the algorithm is correct. The question is how efficient it is. Clearly every pass can be implemented in O(n + n^{3/4} * log n) time, so we must show that the number of passes is likely to be constant.

In the proof we will bound the probability that more than k / n^{1/4} + sqrt{n} of the selected numbers are smaller than x and the probability that a < S_{k - 2 * n^{3/4}}, where S_i denotes the element with rank i. The other two probabilities can be bounded in the same way.

The probability that in any of the n^{3/4} independent selection operations a number smaller than x is selected equals k / n. Let X be the random variable giving the sum of these outcomes. It has expected value l and Var[X] <= n^{3/4} / 4. The probability on a large deviation of X from its expected value can be bounded with the Chebyshev inequality: Prob[|X - l| >= sqrt(n)] = Prob[|X - l| >= 2 * n^{1/8} * sigma(X)] = O(n^{-1/4}). The other bound can be derived similarly. As a result, we can estimate that the number of performed comparisons is bounded by 2 * n + O(n^{3/4} * log n).

Probability Amplification

We consider what we can do with the Chebyshev inequality to reduce the number of required random bits for obtaining a certain level of certainty in the context of RP algorithms. RP (which stands for randomized polynomial) is the class of randomized algorithms with one-sided errors and error probability less than 1/2. The class RP is formulated in terms of deciding whether a string belongs to language L. The random numbers are picked from the set 0, 1, ..., n - 1. Later we will see why it is nice when n is a prime. The algorithm A now computes for any input x and random number r a value A(x, r) with properties: A number r with A(x, r) = 1 is called a witness for x. Different x may have different sets of witnesses, but each of these sets is large. The conventional method of increasing our certainty that x not in L is picking many choices of r, and if we have tested log t of them, then the probability that x in L, while A(x, r_i) = 0, for all i, is at most 2^{- log t} = 1 / t. This is fine, but costs log t * log n random bits.

A large saving can be obtained using the theory above. However, we must first give an explicit example of sets of pairwise independent random variables. Let n be a prime number. Then F_n = Z/nZ is a field, the finite field with n elements in which the usual addition and multiplication operations are performed modulo n. Define random variable Y_i(a, b) = (a * i + b) mod n. Of course, for given i, a and b, this is just a single number, but as a function of the choices of a and b it is a real random variable. It is easy to show that Prob[Y_i = x] = 1 / n and that thus the Y_i are uniformly distributed: consider all possible n^2 choices of a and b. Make a matrix, with the choices of a vertically and the choices of b horizontally. Write 1 if Y_i(a, b) = x and 0 otherwise. Because F_n is a field, there is exactly 1 one in every row: in row a there is a one at position b = (x - a * i) mod n. Thus, in total there are n out of the n^2 choices of (a, b) which result in a one. Now, for i != j, consider Prob[Y_i = x | Y_j = y] = Prob[Y_i = x and Y_j = y] / Prob[Y_j = y]. The condition Y_i = x and Y_j = y gives two equations with two unknowns in a field:

a * i + b = x
a * j + b = y
The solution is a = (x - y) * 1 / (i - j) and b = x - a * i. So, for any two different values i and j, there is a unique a and b. That is, Prob[Y_i = x and Y_j = y] = 1 / n^2. As we already have seen that Prob[Y_j = y] = 1 / n, it follows that for all i != j, x and y, Prob[Y_i = x| Y_j = y] = 1 / n = Prob[Y_i = x]. Thus all these random variables are pairwise independent. Clearly any three of them are dependent: if Y_i = x and Y_j = y, then the value of any other Y_k is known.

Pairwise Independent Random Variables

Now, for enhancing the probability, we first pick the two numbers a and b independently at random and then take r_i = (a * i + b) mod n. We do the same as before: if A(x, r_i) = 0 for all i, then we guess that x not in L, otherwise we are sure that x in L. This procedure is called two-point sampling, because it is based on a sample of two independently selected values. We only consider the case that x in L and want to bound the probability that A(x, r_i) = 0, for all i. The A(x, r_i) are random variables depending on a and b. These are pairwise independent just as the numbers r_i. Let Z_t = sum_{i = 1}^t A(x, r_i). Exp[Z_t] = Exp[sum_{i = 1}^t A(x, r_i)] = sum_{i = 1}^t Exp[A(x, r_i)] = sum_{i = 1}^t Prob[A(x, r_i) = 1] >= t / 2. Here we used that A(x, r) = 1 for at least half of the r values and that the r_i are distributed uniformly. Because the random variables are pairwise independent, the variance of Var[Z_t] = Var[sum_{i = 1}^t A(x, r_i)] = sum_{i = 1}^t Var[A(x, r_i)] <= t / 4, because the A(x, r_i) are Bernoulli trials. So, sqrt(Var[A(x, r_i)]) <= sqrt(t) / 2. Now we apply Chebyshev's inequality:

  Prob[Z_t = 0] <= Prob[|Z_t - Exp[Z_t]| >= t / 2] 
                <= Prob[|Z_t - Exp[Z_t]| >= sqrt(t) * sqrt(Var[A(x, r_i)])
                <= 1 / t.

This procedure is sometimes called probability amplification. For any t <= n, with t experiments the probability that x in L, while A(x, r_i) = 0, for all i, is at most 1 / t. Picking independent numbers, this is achieved with log t numbers. So, we are saving a lot on the number of random bits needed, going from log t * log n to 2 * log n, For t = n, this gives a reduction from log^2 n to log n. At the same time it costs much extra time to perform t instead of log t tests. What is better in practice is hard to say: pseudo-random bits are rarely so expensive that it is worth to repeat an experiment many times just to save random bits. On the other hand, for many problems, taking the numbers r_i = a * i + b mod n will work just as well as selecting pseudo-random numbers again and again. Any way, the real importance of the above is that it shows that full independence is not always needed. Furthermore, using 4-way, 6-way, ... independence, stronger results can be obtained. A central question in the domain of randomized algorithms, addressed here, and already seen in relation with Adleman's result, is: how much independence do we need, can we derandomize the process.

Principle of Deferred Decisions

Stable Marriage Problem

The input is a complete bipartite graph of n "men" and n "women". The men will be denoted by small letters, the women by capitals. The edges are weighted on both sides. For each node, the weights of the incident edges constitute a permutation of the set {0, 1, ..., n - 1}. For a man x and woman X, w(x,X) denotes the weight given by x to the edge (x, X), while w(X, x) denotes the weight given by X. The task is to find a stable matching, a perfect matching, leaving no node unmatched, with the special property, that there is no pair of matched edges ((x, X), (y, Y)) with w(x, Y) > w(x, X) and w(Y, x) > w(Y, y). If the weights are interpreted as preferences, this means that there should not be two couples in which the man of the first couple likes the women of the second couple better than his current wife, while the women of the second couple prefers the men of the first couple over her current husband. A stable matching is neither optimal in some global sense, nor fair, but at least it does not offer possibilities for opportunistic partner changes which may cause an ungoing chain of further changes.

Stable Matching

At a first glance it is not clear that for any set of weightings there exists a stable matching. How to construct one? An obvious way of trying to achieve a stable matching is to create an initial matching, for example pairing man i with woman i, for all 0 <= i < n, and then to gradually improve it. Alternatively, we can start with a greedy matching, processing the men in order and matching man i with the unmarried woman that stand highest on his list. Unfortunately, this improvement phase does not need to terminate. The reason is, that what means an improvement for x and Y may mean an even bigger deterioration for y and X.

The following method works and thereby in particular shows that stable matchings always exist. First all nodes are free. Then any free man is selected. This man proposes to the most attractive women on his list who has not yet turned down his proposal before. A woman, free or matched, accepts a proposal if and only if the proposing man is more attractive to her than her current mate. If she was matched before, her current mate becomes free again. The algorithm terminates as soon as all men and women are matched (because they are both n, this happens at the same time). All women will get married this way, because each woman appears on all lists, and therefore eventually they will get a proposal and once a woman got matched she will always remain matched. Likewise, it will never happen that a man runs out of candidates: if a man is making his k-th proposal, then all women higher on his list must be matched. So, his n-th proposal (if none of the earlier ones) is accepted, because all other men are matched to the other women, and this last one therefore must still be free. The algorithm indeed leads to a stable marriage. This can be proven by contradiction. Assume that in the constructed matching we have a pair of matched edges ((x, X), (y, Y)), while x and Y prefer each other over their current partners. x has proposed to Y before it proposed to x. The fact that it is not matched with Y at the end means that either Y has turned x down at the time of the proposal or later on has rejected him in favor of another candidate. Because Y only gets better candidates over time, both imply that y must be more attractive than x, which gives a contradiction. The total number of proposals made is at most sum_{i = 1}^n i = 1/2 * n * (n + 1) and thus the running time is O(n^2).

Constructing a Stable Matching

All this is deterministic. The question that interests us is the expected running time of the algorithm in case the preferences of the men are selected randomly while the preferences of the women are arbitrary but fixed from the beginning. This analysis is not easy. There is a terrible lot of dependency and in this case not even the expected number of proposals that is made by each man is easy to analyze. The crucial idea is to apply the principle of deferred decisions. This means that we realize that it is not important that the men choose their complete lists of preferences at the beginning. The next preference on the list can also be chosen just in time, that is, when the man is going to make his next proposal.

So, women are picked randomly and uniformly from among all those that were not selected before. This is still exactly the same game. To make the game even easier to analyze, we study the approach in which the men just randomly propose to any of the n women, including those they were proposing to before. The women are more consequent and turn down any man they were turning down before. Denote the time for the original algorithm T_org and the time for the modified one T_mod. T_mod stochastically dominates T_org in the sense that Prob[T_mod > m] >= Prob[T_org > m], for all m.

Lemma: If, for random variables X and Y, X stochastically dominates Y, then Exp[X] >= Exp[Y].

Proof Of course it is not true that Prob[X = k] > Prob[Y = k] for all k, because sum_k Prob[X = k] = 1 = sum_k Prob[Y = k]. So, we can not argue this way. The trick is to reformulate the expression of the expected value. The following argument is given for positive random variables, but it can be generalized.

    Exp[X] =  sum_{k >= 0} k * Prob[X = k]
           =  sum_{k >= 0} 1 * Prob[X >= k]
           >= sum_{k >= 0} 1 * Prob[Y >= k]
           =  sum_{k >= 0} k * Prob[Y = k] = Exp[Y].
  
End.

Thus, in order to put a bound on Exp[T_org] it suffices to bound Exp[T_mod]. Notice that, because only free men are making proposals, the game is over as soon as all women got a proposal, because then all women and men are matched. Thus, the analysis of the game has now been reduced to studying how many randomly uniformly distributed balls have to be thrown before each of n holes has been hit. This problem is called the coupon collectors problem. It is easy to see that Theta(n * log n) balls are required: when throwing k balls, the probability that a hole has not been hit equals (1 - 1 / n)^k ~= e^{- k / n}. So, for k = c * n, this probability is about 1 / e^c. For c = x * log_e n, the probability is about 1 / n^x, and thus we can even conclude that with high probability none of the holes will remain empty (when computing the probability on a multiple or-clause, the individual probabilities can be added). The other direction is not so easy to estimate, but one will believe that if the probability that any single hole remains empty is considerably larger than 1 / n, that then it is quite likely that one of the n holes will remain empty. For example, for k = n * log_e n / 4, the expected number of empty holes is about n^{3/4}. The result is actually more precise than this. The literature gives the following:

Theorem: For any constant c in R and m = n * log_e n + c * n, lim_{n -> infinity} Prob[T_mod > m] = 1 - e^{-e^{-c}}.

This result states that for m = n * log_e n in the limit there is a probability of exactly 1 - 1 / e that we need more than m trials, whereas for m = n * log_e n + c * n, this probability has been reduced to about 1 / e^c. Asymptotically, this is a very sharp transition, one more or less knows exactly how many trials must be made. Such a result is called a sharp threshold result: the answer is known (with high probability) to within lower-order terms. This situation is actually rather common. For example, in the theory of random graphs there are plenty of sharp threshold results, most noticeably the result on how many edges a random graph must have to be connected.

Click here to download a program which simulates the coupon collector problem: for an array of length n, the value of n is to be supplied by the user, random numbers in the range 0, ..., n - 1 are generated until each number has been generated at least once. Then it is counted how many numbers were needed. In addition the maximum number of hits and the sum of the reciprocals of the number of hits are computed. Finally a statistic is printed, showing the complete frequence distribution.

Clock Solitaire

The principle of deferred decisions is used in an even more explicit way in the analysis of a game which is called clock solitaire. The game goes as follows: 52 playing cards are put on 13 piles with 4 cards each. The piles are indexed with the labels on the cards: 2, 3, ..., 10, J, D, K, A. We start by drawing a card from pile A. When a card is drawn with label x, the next card is drawn from pile x. The game is won if all 52 cards can be taken and otherwise it is lost. What is the probability of winning?

The first observation is that a loss always happens by an attempt to draw a card from pile A: at any time any other pile x holds at least as many cards as the number of remaining cards with label x. Thus, we are winning only if the last card is an A. One now might analyze the problem by considering all possible ways to hand out the cards and then study for which of these we finish with an A. This is hard. One might also let the set of random choices unfold while doing. This is the same: from the point of view of the probability there is no difference between an initially chosen distribution of the cards (which is still unknown) and a distribution that is (respecting the data already revealed before) chosen during the process. So, we may assume that the next card is always chosen independently and uniformly from the set of remaining cards. But now at once we see how likely it is that this process terminates with an A: 1 / 13.

Exercises

  1. Let X_i, 0 <= i < n, be a set of n independent random variables.
    1. Prove that for all j, 1 <= j < n, Y = sum_{i = 0}^{j - 1} X_i and X_j are independent.
    2. Prove that for any two disjoint subsets S_0 and S_1 of {0, 1, ..., n - 1}, Y_0 = sum_{i in S_0} X_i and Y_1 = sum_{i in S_1} X_i are independent.
    3. Prove that Var[sum_{i = 0}^{n - 1} X_i] = sum_{i = 0}^{n - 1} Var[X_i].
    4. Let Y = sum_{i = 0}^{n - 1} X_i. Express the standard deviation sigma_Y of Y in terms of the standard deviation sigma_X of X for the case that all X_i are distributed as X.
    5. So far the X_i were assumed to be independent. Check for each of the above proven results whether pairwise independence is also enough.

  2. In the randomized selection algorithm, n^{3/4} numbers where selected. In the light of Chebyshev this is a good choice. However, we can also apply Chernoff bounds. Give a slightly modified algorithm which has smaller lower-order terms and analyze its performance with the Chernoff bounds.

  3. The expected number of performed comparisons of the given selection algorithm is 2 * n + o(n). Suggest a minor modification which reduces this to n + min{k, n - k} + o(n). for selecting the number with rank k out off a set with n elements. The given function is maximized for k = n / 2, showing that in general the randomized algorithm can be used to perform selection with 3/2 * n + o(n) comparisons. It has been proven that any deterministic algorithm requires at least 2 * n + o(n) comparisons. Thus, selection is a problem for which a randomized algorithm is provably better than any deterministic algorithm.

  4. A set of n random variables X_i, 0 <= i < n, is said to be k-way independent, 2 <= k <= n, if for any j, 0 <= j < n, and any set S of indices not containing j consisting of m <= k - 1 elements i_l, 1 <= l <= m, and any set of values x_j and x_{i_l}, Prob[X_l = x_l| X_{i_1} = x_{i_1} and ... and X_{i_m} = x_{i_m}] = Prob[X_l = x_l]. Prove that this definition of k-way independent is equivalent with the property that for any set S' of different indices consisting of m <= k elements i_l, 1 <= l <= m, and any set of values x_{i_l}, Prob[X_{i_1} = x_{i_1} and ... and X_{i_m} = x_{i_m}] = Prob[X_{i_1} = x_{i_1}) * ... * Prob[X_{i_m} = x_{i_m}].

  5. So far, for a random variable X we have encountered the expected value mu_x = Exp[X] and its variance Var[X] = Exp[(X - mu_x)^2]. The variance is a special case of the more general notion of a moments of X. The k-th central moment mu_X^(k) is defined by mu_X^(k) = Exp[(X - mu_X)^k]. Notice that mu_X^(2) = Var[X].
    1. Prove that mu_X^(1) = 0 for any X.
    2. Let the random variable X be the outcome of a Bernoulli trial. For all even k, compute mu_X^(k). Give a general estimate of mu_X^(k) which is independent of p = Prob[X_i = 1].
    3. Formulate a general expression for the k-th moment of the sum of n independent random variables X_i, 0 <= i < n. Do you need full independence of the X_i, or is some l-wise independence sufficient?
    4. Let the random variables X_i, 0 <= i < n, be the outcome of independent Bernoulli trials. Let Y = sum_i X_i be their sum. For all even k, compute an estimate of mu_X^(k). This estimate should be independent of p = Prob[X_i = 1].
    5. Prove the following generalized Chebyshev inequality
      Prob[|X - mu_X| >= t * sigma_X^(k)] <= 1 / t^k.
      Here sigma_X^(k) denotes the k-th root of mu_X^(k). Notice that sigma_X^(2) = sigma_X.
    6. Consider again the probability amplification which was based on a two-point sampling. Describe how to extend the given idea to a four-point sampling. Prove that the constructed random variables are four-way independent.
    7. Use the above estimate on the moments of the sum of Bernoulli trials and the generalized Chebyshev inequality to bound the probability that when performing t tests with values selected according to a four-point sampling scheme all fail for an element x in L.
    8. Describe how to perform (2 * k)-point sampling and give the resulting bound on the error probability when performing t tests.

  6. Consider the problem of computing a stable matching. For some suitable n, give a labeling of the edges of the bipartite graph with n nodes on each side, an initial matching and an infinite sequence of "improvements".

  7. Consider the problem of computing a stable matching. We study a bipartite graph with n nodes on each side. The "men" are indicated by small letters, the "women" by capitals. The edges are weighted on both sides, the set of weights constituting a permutation of {0, 1, ..., n - 1}, for each node. The question is how much it is worth to take initiative. That is, for the presented algorithm, we want to estimate the values of men_sum = sum_{0 <= x < n} w(x, sel(x)) and women_sum = sum_{0 <= X < n} w(X, sel(X)). Here sel(x) and sel(X) denote the indices of the nodes so that (x, sel(x)) and (X, sel(X)) are matched edges.
    1. Prove that the expected value of men_sum lies close to n^2, and that thus the men can be satisfied on average.
    2. Give estimates for women_sum. Hint: The number of proposals can be bounded to some value close to n * ln n. In order to get some feeling for the practical behavior, you can try the program simulating the coupon-collector problem.
    3. The conclusion of the previous question is that even the women can be quite happy, especially for large n, but not nearly as happy as the men. Suggest an alternative algorithm which is guaranteed to find a stable matching, which has the same expected running time and which is fairer to both parties. The goal is to minimize the expected value of 2 * n^2 - men_sum - women_sum.





Tail Inequalities

In this chapter it is considered how the Chernoff and similar bounds can be used to derive strong bounds on the performance of randomized algorithms and even can be used to base algorithms upon.

Fast Rounding

Rounding problems are common. Here we consider simple variants, but the presented ideas can be adapted and extended.

Two-Coloring a Family of Vectors

The first problem is that of rounding the values in a matrix. The input is a matrix with real values and the task is to replace these by integer values so that the sum of the rows and columns does not change much, ideally the sums should be the rounded values of their original values.

This problem is simplified. We consider an n x n matrix A with zeroes and ones, and want to find a vector b with -1 and +1 that minimizes ||A . b||, where, for a vector c, ||c|| equals the absolute value of the largest entry in c. This problem is known as set-balancing or two-coloring a family of vectors.

To find the optimum choice for b is NP-hard. Good choices for b can be found by heuristics. But, even a trivial randomized algorithm is already quite good. The idea is incredible: just pick a random vector in which each position is -1 or +1 with probability 1/2. The algorithm completely ignores the values in A.

How good is this algorithm? Consider the product of any given row of A with b. The positions in which A has a zero certainly contribute 0. The others contribute -1 or +1 with equal probability. Let n' be the number of ones in the row of A. The zeroes certainly contribute 0, so the situation is as if we are selecting n' balls each of which has value -1 or +1 with probability 1/2. We would like to use Chernoff bounds, but here we do not have basic Bernoulli trials. Fortunately, any bivalued random variable X assuming values c_1 and c_2, c_1 < c_2, can be turned into a Bernoulli trial Y = (X - c_1) / (c_2 - c_1). Let mu_X = Exp[X] and mu_Y = Exp[Y]. mu_Y = (mu_X - c_1) / (c_2 - c_1). Prob[X - mu_x >= h * (c_2 - c_1)] = Prob[Y - mu_Y >= h]. Analogous relations hold for Prob[X - mu_x <= -h] and Prob[|X - mu_x| >= h]. Let Z_X = sum_{0 <= i < n} X_i and Z_Y = sum_{0 <= i < n} Y_i. Prob[Z_X - n * mu_X >= h * (c_2 - c_1)] = Prob[Z_Y - n * mu_Y >= h]. This latter probability can be bounded using Chernoff bounds. If there are actually n' < n experiments, then we can nevertheless use the same bounding: Prob[Z_Y - n' * mu_Y >= h] <= e^{-h^2 / (3 * n' * mu_Y)} <= e^{-h^2 / (3 * n * mu_Y)}. So, a h value which gives sufficiently small probability for n experiments can also be used for n' experiments.

In our case c_1 = -1 and c_2 = +1. mu_X = 0 and mu_Y = 0.5. So, c_2 - c_1 = 2.

Prob((A . b)_i > 2 * h) < e^{-h^2 / 3 * n * 0.5} <= e^{-2 * h^2 / 3 * n}.
Taking h = sqrt(3 * n * log n) gives a probability of less than 1 / n^2. This was for any particular row. The values in the other rows are not independent. As before we handle this by reduction to a set of or clauses: Let I_i be the indicator variable giving 1 when a bad event occurs for row i. A bad event is when (A.b)_i > sqrt(12 * n * log n). Then Prob(I_i = 1) < 1 / n^2 for all i, and thus Prob(Or_i I_i = 1) <= sum_i Prob(I_i = 1) = n * Prob(I_i = 1) <= 1 / n. Deviations in the other direction can be bounded analogously. Thus, ||A . b|| = O(sqrt(n * log n)), with high probability.

Fast Parallel Perfect Rounding

Now we consider a similar problem. We have n reals a_i and want to round these, up or down, to integers b_i so that |sum_i a_i - sum b_i| < 1. With a single processor this problem is easy to solve in linear time: we maintain the current sums of the a_i and the b_i and the next rounding is performed so that the difference is kept as small as possible. In this way it can be assured that at for all j, |sum_{0 <= i < j} a_i - sum_{0 <= i < j} b_i| < 1. This takes O(n) time.

Suppose now that we have a parallel computer with n processors. How fast can the rounding be performed? As before we simplify the problem. So, assume we have an array b with values -1, 0 and +1 (0 gives the values that lie close to an integer form the beginning, +1 those that are about integer + 1/2 and are rounded up, and -1 those that are rounded down). The task is to change the -1 and +1 values so that the sum over all values is -1, 0 or +1. A sequential algorithm goes through the array and performs this operation in O(n) time by keeping track of the current value of the sum, assuring that the required property holds for all prefix sums. We wonder how good one can do this in parallel.

The problem can be solved deterministically in O(log n) time with n / log n PUs by singling out all -1 and +1 indices and then flipping the required number. This is ok for the given problem, but exploits in an essential way that the original problem was reduced to a problem with only finitely many different values. We would prefer an algorithm that works more generally.

A simple parallel algorithm first sets all b_i which are not 0 to an arbitrary (not necessarily random) value. Then we do kind of a merging algorithm: inductively we assume that in round k all groups of 2^k adjacent elements have a sum of -1, 0 or +1. Then two adjacent groups are combined by looking at their sums: if both have +1 or both -1, then the choices in one group are flipped and the sum becomes 0. In this way with O(n * log n) work the rounding can be performed. If n is even, then the final sum will be 0, otherwise it will be -1 or +1. The time for this operation is only O(log n) which is ok, but the work, the sum of the number of operations performed in each round is O(n * log n), which is more than for the sequential algorithm.

Now we consider a randomized algorithm which appears to work in full generality. All those values which are not 0 are chosen randomly -1 or +1 with probability 1/2. This takes O(1) parallel time. Then we can compute the sum h_0, |h_0| = O(sqrt{n * log n}), in O(log n) time and finally, depending on the sign of h_0 we can correct h_0 / 2 choices sequentially. This takes O(sqrt{n * log n}) parallel time. The work is O(n).

Of course we are not satisfied with this. Assume that after the first round the sum is h_0, a positive value. Then we can look at the first n_1 = h_0 positions and turn all +1 values into -1. The hope is that this brings us close to 0. However the fact that the sum was +x means that there are (n + h_0) /2 with +1 and (n - h_0) / 2 with -1. Even though h_0 is mostly small, it is better to compensate for this. We should take n_1 = h_0 * n / (n + h_0). Then, the expected value h_1 of the deviation after the first correction round is 0, independently of h_0. What bound can we put on h_1? We assume that a good case has occurred. Namely, that h_0 <= sqrt{c * n * log n}, for some small constant c. Then, we are asking about the deviation from the expected number of +1 values from the expected. These can be viewed as being chosen independently and uniformly with probability p_1 = (n + h_0) / (2 * n). So, Chernoff bounds apply. This gives h_1 <= sqrt{c * h_0 * log n} <= c * n^{1/4} * log n. Going on this way, in the third round we take the following "unseen" rounded values, we get that after round k the deviation is still h_k <= c * n^{1 / 2^k} * log n. After loglog n rounds this becomes O(log n). Then a single processor can finish the job.

The good point is that the total amount of work is only O(n). This means that we can reduce the number of processors by a factor log n without substantial increase of the time. Thus we obtain an "optimal" algorithm: the processor-time product is O(n), while the time is low. Whether this is optimal or not cannot be seen here, but much simpler problems as parity of the sum of n bits have a proven lower bound of Omega(log n / loglog n).

Routing on Interconnection Networks

Permutation Routing on Mesh Networks

A two-dimensional n x n mesh is a computer network consisting of n^2 processing units (PUs) connected in a two-dimensional n x n grid. A PU can only communicate with its (at most) four neighbors. In one step one packet of unit size can be send to each of the neighbors. In addition a PU can receive a packet from each of the neighbors.

Mesh

The problem considered in this section is permutation routing: each PU has a single packet that must be routed to a destination processor; the global pattern of all sources and destinations constitutes a permutation of the n^2 PU indices. This problem has been studied intensively. The task is to minimize the number of steps, T, while also achieving a small upper-bound on the number of packets that may have to be stored in a single PU during any given step of the algorithm. The maximum number of packets that may have to be stored in a PU is called the maximum queue size, Q.

The trivial greedy algorithm, which simply routes all packets first along the x-axis, then along the y-axis. Whenever more than one packet wants to use the same connection at the same time, priority is given to the packet that still has to travel furthest. This is called the farthest-first priority rule. The greedy algorithm achieves the first goal but not the second: during the first phase packets are not delayed at all, because they move in so-called lock-step mode: all packets traveling in one direction move like the wagons in a train, keeping their distances constant. During the second phase, a packet which has to travel n - 1 - d steps is delayed at worst by the at most d packets that have to travel further, so all packets finish even the second phase in n - 1 steps. Thus, T = 2 * n - 2, which is just the diameter of the network and therefore optimal.

The problem with the greedy algorithm is that if all packets in a single row have destination in the same column, then all these packets are wiped together in the first phase, and n packets stand in a PU at the same time. This gives Q = n, which is bad. An improvement is achieved when the two routing phases are coallesced. This means that the two routing phases are not separated, but that the second phase immediately starts for packets that have finished their first phase. Unfortunately, this does not really solve the problem: if all n packets with destination in column n / 3 - 1 initially reside in position 0, ..., n / 3 - 1 of row 0 and position 0, 2 / 3 * n - 1 of row 1, then three of them stream into the PU at position n / 3 - 1 of row 1 for almost n / 3 steps, while only one packet leaves this PU in each step. So, even when coallescing we get Q = 2 / 3 * n - O(1).

A simple randomized algorithm is much better. We show that one can perform permutation routing with T = 2 * n + o(n) and Q = O(log n). This can be further improved, but one of the main ideas is present already in this basic algorithm: reducing bad routing situations by (partially) randomizing the initial packet distribution.

The idea is to perform a three-phase algorithm. For some number k, which is supposed to divide n, the network is (only conceptually) divided into n / k row bundles of k rows each. Then the following steps are performed:

In the first routing phase, each PU expects to receive 1 packet. At most a PU receives O(log n) packets with high probability. In phase 1 there are no delays. In phase 2 there may be delays, however these are small for the packets moving far: a rightwards moving packet can only be delayed by the packets moving further. For a more precise analysis we cite a lemma. First, for a routing problem on a one-dimensional processor array with n PUs we define r(i, j) to be the number of packets originally standing in the PUs 0, ..., i - 1 having destination in the PUs j, ..., n - 1. Then

Lemma: On a one-dimensional array the rightward routing is finished after exactly max_{i <= j} r(i, j) + j - i.

The values of r(i, j) can easily be bounded: packets may have almost arbitrary columns as destinations, but we know that the expected number of packets in any i PUs equals i, and is bounded by i + O(log n + sqrt{i * log n}). So, it immediately follows that r(i, j) + j - i <= n + O(sqrt{n * log n}). Thus, the horizontal routing phase takes at most this much time. The vertical routing phase takes at most n - 1 steps just as before.

How about Q? Clearly, with this very simple approach which can easily be improved, we cannot bound Q to less than O(log n). If we are unlucky, and all packets in a row bundle of k rows have to move to a column bundle of k columns, then at the end of phase 2 the expected queue sizes are n / k. Thus, with high probability, Q < n / k + O(log n). A good choice for k is k = n / log n, achieving Q = O(log n).

Theorem The randomized three-phase algorithm routes permutation on a two-dimensional n x n mesh in 2 * n + O(sqrt(n * log n)) steps, with maximum queue size O(log n), with high probability.

Permutation Routing on a Hypercube Network

An n-dimensional hypercube is a processor network consisting of N = 2^n PUs. The PU with index (b_{n - 1}, ..., b_0), where all b_i in {0, 1}, is connected to all n PUs who have indices that differ in exactly 1 position.

Hypercube

We consider the permutation routing problem on hypercubes. A very natural algorithm, which does not impose any extra conditions on the routing hardware, is to perform the routing in n phases: in phase i, 0 <= i < n, bit i is corrected. These bit-correcting algorithms are simple, but bad. For the "transpose" (the permutation under which a packet from (a, b) has destination in (b, a), where a and b are bitvectors of length n / 2 each) permutation, all packets which originally reside in (0, a) are routed through the PU with index (0, 0). These are sqrt(N) packets. As there are at most n connections, we immediately get a lower bound of sqrt{N} / n. Looking more carefully the bound can be sharpened further. For example, one can consider the number of packets moving over the connection between (0, ..., 0, 1, 0, ..., 0) and (0, ..., 0, 0, 0, ..., 0). All sqrt{N} / 2 packets starting in (a, 1, 0), where a is a bit vector of length n / 2 - 1 and 0 is a vector of n / 2 zeroes are moving over this connection. Thus, the trivial algorithm takes ate least sqrt{N} / 2 steps.

A very simple and versatile idea remedies the problem. The algorithm goes as follows:

  1. For each packet p standing in processor s(p) and going to processor d(p), choose independently and uniformly at random an intermediate destination i(p).
  2. Route p applying the bit-correction algorithm and the FIFO conflict-resolution strategy from s(p) to i(p).
  3. Route p applying the bit-correction algorithm and the FIFO conflict-resolution strategy from i(p) to d(p).
The algorithm tries to remedy the bottle-neck problem by first completely randomizing the input and then routing as usual. As a result the distinction between hard and easy permutations disappears: routing a transpose is just as hard or easy as routing the identity permutation in which packets do not have to move at all. The great point is that we now have a performance guarantee: with high probability any permutation is routed in O(n) steps. In order to prove this, it satisfies to analyze the first phase: the time for the second phase is distributed in the same way: in phase 1 packets are routed from well-distributed positions to random positions, in the second phase we have the opposite. The congestion remains the same. Congestion is indeed the only issue because all paths have length at most log n in each phase. A packet is delayed iff two or more packets want to use the same connection out of a PU at the same time.

Lemma: Viewing routes in phase 1 as directed paths from the source to the intermediate destination, it is true that once two routes separate they do not rejoin.

Proof: Paths go out of each other if they differ in a bit. But then one of them will have this bit 0 and the other will have a 1 during the rest of the routing. End.

Of course this does not mean that packets can meet only once: they may follow the same track for a while and a packet may be delayed several times by the same packet (if this delaying packet is delayed by a third packet). The following lemma gives a (rather pessimistic) upper bound on the delay a packet may incur.

Lemma: Let rho be the path of a packet v. Let S be the set of all packets whose routes pass through at least one of the edges on rho. Then the delay incurred by v is at most |S|.

Proof: Let rho = e_1, e_2, ..., e_k. A packet leaves rho in that time step in which it traverses an edge of rho for the last time. If a packet traverses edge e_j of rho at time t, then its lag equals t - j. The lag is defined only for packets moving along rho. The delay of v is the lag when it traverses its last edge e_k. The increases of the lag of v will be attributed to different elements from S. If the lag of v increases from l to l + 1 then there must be at least one packet in S that wants to traverse the same edge at the same time. Thus, S contains at least one packet whose lag reaches l. We prove that there is even a packet in S which has lag l when leaving rho. The increase of the lag of v will be attributed to this packet. Let t' be the last time step in which any packet in S has lag l. Thus, some packet w is moving over edge e_j' for j' = t' - l. This packet leaves rho in this step, because otherwise w or another packet in S would move over edge e_{j' + 1} in step t' + 1, having lag l in step t' + 1. End.

So far everything was deterministic. Let rho_i, rho_j be the paths of packets v_i and v_j, and let H_{ij} be the indicator random variable for the event that rho_i and rho_j share an edge. The lemma gives that the delay of v_i is bounded by sum_j H_{ij}. All routes are chosen independently, thus the H_{ij} are Bernoulli trials for i != j. Thus, the deviation from the expected value can be bounded using Chernoff bounds. The expected value will be bounded first.
sum_{0 <= j < N} H_{ij} <= sum_{1 <= l <= k} T(e_l),
where T(e) is the number of routes passing through an edge e, and the e_l, 1 <= l <= k, are the edges of rho_i. The right-hand side may be larger because the same route may be counted several times. Applying expected values to both sides respects the inequality. Because the hypercube is isotropic, the expected number of paths through any edge is the same. Because every PU has n outgoing edges and because each packet traverses on average n / 2 edges, Exp[T(e)] = 1 / 2 for all e. Thus,
Exp[sum_{j = 0}^{N - 1} H_{ij}] <= Exp[sum_{1 <= l <= k} T(e_l)] = sum_{1 <= l <= k} Exp[T(e_l)] = k / 2 <= n / 2.
Even though we do not know the value of Exp[sum_{j = 0}^{N - 1} H_{ij}] but only an upper bound, we can nevertheless estimate the deviation: if an expected value of a sum of Bernoulli trials has value X <= Y, then we know that with high probability the true value will be not larger than X + Chernoff(X) <= Y + Chernoff(Y), even if Y is not a random variable for which the Chernoff conditions are satisfied. Here this gives a bound of O(n).

Paths in Hypercube

The disadvantage of this algorithm is that we have doubled the length of the paths: it is easy to estimate that with high probability there are packets that have to move almost distance n: the expected number of packets for which there are n - k bits different equals (n over k). For small k, (n over k) ~= (e * n / k)^k. Thus we will almost certainly find a packet traveling distance n - O(1) in each of the phases. One might think this does not matter as we have O(n) anyway. However, it turns out that in practice delays are mostly small (it does not harm to apply farthest-first here as well!), and thus the time is mainly given by the longest path.

If we are coallescing the routing phases (starting phase 2 of the routing for packets that have finished phase 1 without waiting for all packets to finish phase 1, giving priority to packets still performing phase 1) then the longest total path length is relevant. Even though for most packets the total path length lies around n, mostly there are even some packets with path length 2 * n - o(n). So, even coallescing does not solve the problem.

A small twist overcomes the problem. For every packet p moving from s(p) to d(p) we choose an intermediate destination i(p) as before and define i'(p) = bit_complement(i(p)). Let l_1(p) and l_2(p) bet the lengths of the path from s(p) to i(p) and the path from i(p) to d(p), respectively. l(p) = l_1(p) + l_2(p). l'_1(p) and l'_2(p) give the corresponding lengths of the paths through i'(p). l'(p) = l'_1(p) + l'_2(p) = (n - l_1(p)) + (n - l_2(p)) = 2 * n - l(p). So, min{l(p), l'(p)} <= n. This suggests the following approach:

Select i(p). Compute l(p). If l(p) <= n, use i(p) as intermediate destination, else compute i'(p) and use it as intermediate destination.

This idea bounds the total path length of any packet to at most n. Therefore we may hope that an algorithm with coallesced routing phases runs in n + o(n) routing steps. Vöcking has shown that this indeed happens with high probability. The analysis is quite hard, because of the limited randomization and because of the coallescing which makes that two streams of packets are blending.

k-k Routing on Circular Arrays

A circular array of length n is a processor network consisting of n PUs. The PU with index i, 0 <= i < n, is connected to the two PUs with indices (i - 1) mod n and (i + 1) mod n, the whole structure constituting a ring.

Circular Array

We consider k-k routing on circular arrays. In this problem, each PU is the source and destination of exactly k >= 1 packets. The communication model is as before and the task is to minimize the number of routing steps. The diameter of the network is n / 2. Consider any distribution in which all packets in the left half of the network (the PUs with index i, 0 <= i < n / 2) have their destinations in the right half (the PUs with index i, n / 2 <= i < n). These k * n / 2 packets have to leave their half through two connections, so routing this distribution takes at least k * n / 4 steps.

Lemma: max{n / 2, k * n / 4} gives a lower bound on the for k-k routing on a circular array of length n.

The question is whether this bound can be matched. For k = 1, this is easy: all packets can walk without any delay along the shortest paths to their destinations. For k = 2, it can easily be shown that the routing time is lower bounded by 2 / 3 * n. It is not clear whether this bound can be achieved. For k = 3, the situation is similar. In the following we therefore concentrate on the case k >= 4.

The easiest idea is to route all packets along the shortest paths. This does not work well: if all packets in PU_i, the PU with index i, have destination in PU_{(i + n / 2 - 1) mod n}, then all packets will move rightwards. Thus, k * (n / 2 - 1) packets will have to move over any rightwards directed connection, and thus the routing takes at least k * n / 2 - k steps, almost twice as much as given by the lower bound.

Using randomization, it is easy to match the lower bound up to lower-order terms. The idea is the same as before:

  1. For each packet p standing in processor s(p) and going to processor d(p), choose independently and uniformly at random an intermediate destination i(p).
  2. Route p along the shortest path to i(p).
  3. Route p along the shortest path to d(p).
It is very surprising that this algorithm may give an improvement! For the hypercubes we did not care about the constant factor, but here the goal is to gain a factor two in comparison with a deterministic algorithm. Can this possibly be achieved by doubling the number of routing phases? The answer is yes, and the analysis is easy, when assuming the correctness of a variant of the above given lemma giving a tight bound on the time for routing a given source-destination distribution on circular arrays.

Let r(i, j) be defined as before. Because the situation is the same everywhere, we can fix j = n / 2, and consider all i, 1 <= i <= n / 2. For a packet p in PU_l, 0 <= l < n / 2, there are l out of n destinations (namely the PUs with indices n / 2, n / 2 + 1, n / 2 + l - 1) for which the shortest path from PU_l leads over the connection e from PU_{n / 2 - 1} to PU_{n / 2}. Thus, the probability that p crosses e is l / n. Let S be the random variable giving the number of packets crossing e. Exp[S] = k * sum_{0 <= l < n / 2} l / n = k / n * n / 2 * (n / 2 - 1) / 2 ~= k * n / 8.

Even in this case Chernoff bounds can be used, to obtain a high probability bounding that holds for all connections at the same time. The Chernoff bounds were formulated for sums of equally distributed Bernoulli trials, but this is not necessary. For a sum S = sum_{0 <= i < n} X_i, where the X_i are Bernoulli trials with Prob[X_i = 1] = p_i, in the formulation of the Chernoff bounds p * n can be replaced by sum_{0 <= i < n} p_i. In our case this means that the deviation from the expected value can be bounded to O(sqrt{k * n * log n}).

The above analysis holds for both routing phases, because these phases are essentially the same, because even the second routing phase is a randomization: if the original permutation is given by Pi and the chosen randomization by Rho, then the first phase routes Rho, and the second phase Pi * Rho^{-1}. Even under Pi * Rho^{-1}, for any packet in PU_i the probability to move to PU_j equals 1 / n, for all 0 <= i, j < n.

Theorem The randomized two-phase algorithm performs k-k routing on a circular array of length n in k * n / 4 + O(sqrt{k * n * log n}) steps with maximum queue size k + O(log n + sqrt{k * log n}), with high probability.

Of course for many source-destination distributions this routing strategy means a doubling of the routing time, but for some bad instances it gives a reduction. In other words: the given algorithm is optimal for the problem as a whole, but not for every single instance of the problem.

Exercises

  1. Above we have seen how the parallel perfect rounding can be solved in O(n) work and O(O(log n * loglog n) time. Redesign the algorithm so that it runs in O(log n) time. Hint: notice that the time for the rounds is determined by the time to perform the prefix computation.

  2. Consider the randomized two-phase algorithm for permutation routing on hypercubes. For a packet p, let l_1(p) and l_2(p) denote the length of the path of p during phase 1 and phase 2, respectively. Let l(p) = l_1(p) + l_2(p). Estimate for which value x, with high probability there is a packet p with l(p) >= 2 * n - x. Distinguish three kinds of permutations: the identity permutation; the bit complement permutation, routing from PU_i to PU_j with j = bit_complement(i); a uniformly selected random permutation.

  3. The analysis of the randomized algorithm for routing on a hypercube the estimate sum_{0 <= j < N} H_{ij} <= sum_{1 <= l <= k} T(e_l), was used to derive a bound on Exp[S_i] for S_i = sum_{0 <= j < N} H_{ij}. Compute Exp[S_i] directly, and use the computed value to estimate the average number of edges that two intersecting paths have in common.

  4. Consider again the two variants of the randomized algorithm for routing permutations on n-dimensional hypercubes. The first simply chooses the intermediate i(p) destination of a packet p uniformly the second takes either i(p) or i'(p) = bit_complement(p), whichever gives the shorter total path length for p. The routing phases are coallesced. Write a program simulating these two approaches and measure the resulting routing times for n = 4, 5, ... , going as far as you can within the limits of the available memory and time. Test with the identity, transpose and random permutations. You may use a pure FIFO priority strategy, treating the packets performing phase 1 and 2 in the same way.

  5. The given algorithm for k-k routing on a circular array (and likewise the algorithm for routing permutations on a hypercube) is not entirely satisfactory, because even for very simple distributions (such as shifting rightwards all packets one position) it needs k * n / 4 routing steps. Consider a modified algorithm: for each packet the routing direction is selected randomly, but not uniformly: a packet which has to travel a distance d along rightwards connections to reach its destination has a probability of (n - d) / n of being routed rightwards. After selecting the directions, all packets move without detour along the selected direction towards their destinations, applying the farthest-first conflict-resolution strategy.
    1. Give high-probability bounds on the maximum required number of steps.
    2. Give high-probability bounds on the competitive ratio. The competitive ratio equals the ratio of the actually used number of steps, and the minimum number of steps needed. If it is convenient to assume that k is large, you may do so.




The Probabilistic Method

Introduction

Two trivial observations allow to conclude on the existence of combinatorial objects: Some examples: Think what this means. The first two results do not appear so impressive because one might lack a natural expectation of what is reachable. However, the last two results are not obvious at all. It would not be easy at all to construct a vector b such as in the third example. For constructing a large cut, there is no obvious greedy strategy that immediately gives the result. All four results are applications of the first observation, but in this chapter we will also encounter examples of applications of the second.

The randomized method consists of two parts:

There is a difference between constructing randomized algorithms and applying the probabilistic method. An algorithm should have a reasonable probability of being successful and it should be able to perform it efficiently. For the randomized method, we only require that the success probability is positive.

Clearly, once we know that an object / solution exists we would also like to construct / find it. However, for some problems the proof of existence does not give a key for the effective computation. For example, applying Fermat's or Rabin's test, we can prove that a number is composite and thus has non-trivial factors, but factorization is an unsolved problem.

Approximation and Linear Programming

As a preparation for the next section, in which we consider approximation algorithms for the MAXSAT problem, we briefly discuss here some of the most important topics from the theory of approximation algorithms.

An approximation algorithm for an optimization problem is a polynomial-time algorithm for a hard, mostly NP-hard, problem, which finds a feasible solution for which it can be proven that the value of the objective function lies within a factor from the optimum value for all inputs. In this latter sense approximation algorithms are more than just heuristics, which may come without any guarantee. For an approximation algorithm A, let m_A(I) denote the value achieved on an instance I, while f_*(I) denotes the optimum value. The performance ratio of A is defined to be the minimum over all I of f_A(I) / f_*(I). For a randomized algorithm A, f_A(I) is replaced by the expected achieved value, Exp[f_A(I)]. An algorithm with performance ratio c is also said to be a c-approximation.

There is a general method for obtaining approximation algorithms for an optimization problem P. It works as follows:

Many famous optimization problems can be tackled this way, even though it is easy to invent optimization problems which cannot be formulated as an ILP. By estimating the error that is made when rounding the variables in S' the deviation from the optimum may be bounded. The rounding can be performed in several ways, which may be problem specific.

We consider a simple example, which can be interpreted geometrically. Consider the following system of linear inequalities:

  x >= 0,
  y >= 0,
  x + 4 * y <= 16,
  6 * x + 4 * y <= 30,
  2 * x - 5 * y <= 6.
The objective function is f(x, y) = 6 * x + 5 * y. The goal is to find the feasible solution (x'_0, y'_0) for which f is maximal.

The five inequalities define a closed two-dimensional subset I of R^2. Because it is the intersection of convex subsets, it is convex itself. Moving along any line segment within I, the value is either constant, or increases in one direction and decreases in the other. This implies that the maximum of f over I is not only assumed in the interior of I. Arguing on, it follows that there must even be a vertex of the polytope on which f assumes its maximum value. For the given example, this reduces the problem to checking the value of f for the five vertices which is not much work: (x'_0, y'_0) = (2.8, 3.3), with f(x'_0, y'_0) = 33.3.

If in addition there are conditions that x and y should be integral, then any of the high-lighted points is a candidate for giving the maximum value of f. These points have the special property that starting from the optimum (2.8, 3.3) of the LP they are non-dominated in the sense that there are no points in I, which lie closer to (2.8, 3.3) with both x- and y-value. Actually, we do not even have to consider the point (3, 2), because f can only assume its maximum value in this point if it here has the same value as in (2, 3) and (4, 1). These non-dominated points in I are also called Pareto optima. The notion of Pareto optimality is named after the Italian economist Vilfredo Pareto (1848-1923). It is widely used in game theory and economics. In the latter context, an allocation of resources is Pareto optimal if there is no way that some individual could be made better off without making any other individual worse off. Checking the Pareto optima, we find that f is maximal for (x_0, y_0) = (4, 1). f(4, 1) = 29, 4.3 less that the value achieved by the LP. Notice that (4, 1) does not lie close to (2.8, 3.3) in a geometric sense. The most natural rounding idea in this case would have been to take the closest integral solution in I, which is (2, 3), giving f(2, 3) = 27, 2 less than the optimum.

Linear Programming

In its extreme simplicity, the given example expresses many important ideas: for any LP, the maximum, when finite, is (also) assumed on a vertex of the polytope bordering the set of feasible solutions. The algorithms for efficiently solving LP consist of methods to rapidly converge to this vertex by performing a coordinated walk from vertex to vertex. The classical simplex method rapidly finds a solution for most problem instances, but one may construct pathological inputs for which the it visits an exponential number of vertices before reaching the optimum. The more recent ellipsoid method has polynomial running time on all inputs, but is more elaborate and not necessarily better in practice.

The solution of the LP limits the search for the optimum of the ILP to the Pareto optima. For Pareto optima which lie on a line segment only the endpoints of the line segment need to be tested. For some problems this approach allows to find optimum solutions in polynomial time:

However, in general there remain exponentially many points to check. Therefore, the search is limited to a polynomial subset, for example by simply picking the feasible solution which lies closest to the optimum solution of the LP in some sense. If x = (x_0, ..., x_{n - 1}) is the optimum of the LP and x' = (x'_0, ..., x'_{n - 1}) is any feasible solution with |x_i - x'_i| < 1 for all i, 0 <= i < n, then |f(x) - f(x')| < sum_{0 <= i < n} |a_i|, where a_i is the coefficient of the i-th coordinate in the objective function. So, the absolute difference from f(x) can be bounded easily, but the relative difference may be arbitrarily large.

MAXSAT

Consider a system of m clauses C_i, 0 <= i < m, given in conjunctive normal form (CNF), over n variables, for example, the following system of clauses:
  C_0 =  x_1,
  C_1 =  x_1       x_3,
  C_2 =           !x_3,
  C_3 =       x_2,
  C_4 = !x_1 !x_2  x_3
  C_5 =      !x_2 !x_3,
  C_6 = !x_1       x_3,
  C_7 =       x_2 !x_3.
Here m = 8 and n = 3. The conjunctive operator "or" is understood between any pair of juxtaposed variables. The symbol "!" indicates that the literal following it is negated. The question is how many of the m clauses can be satisfied. That is, we want to compute the maximum over all 2^n possible truth assignments to the literals of the number of satisfied clauses. This problem is called MAXSAT. For the given example it can be verified (testing all 8 assignments or using a case distinction) that at most 6 clauses can be satisfied and that this is achievable.

SAT is the related decision problem dealing with the question whether all clauses can be satisfied or not. k-SAT is the variant of SAT in which any clause consists of at most k literals. 1-SAT is trivial, 2-SAT can be solved in polynomial time and 3-SAT is one of the most famous NP-complete problems. Clearly MAXSAT is more general than SAT because the answer to SAT is yes iff the answer to MAXSAT is m. Therefore MAXSAT is NP-hard if the length of the clauses is not bounded to 1 or 2.

In this section we consider approximation algorithms for MAXSAT. We first present a 1/2- and a (1 - 1/e)-approximation, which then will be combined to a 3/4-approximation. Notice that a c-approximation for MAXSAT does not need to satisfy at least c * m clauses for any system of m clauses: the performance ratio is given with respect to the best achievable result, which can be less than m.

The simplest randomized algorithm one can imagine is to set any literal true or false independently with probability 1/2. This algorithm, which will be referred to as A_1, is less bad than it appears:

Lemma: Applying A_1, the expected number of satisfied clauses equals sum_{0 <= i < m} (1 - 1 / 2^{l_i}) >= m / 2, where l_i denotes the number of literals in clause i, 0 <= i < m.

Proof: For a clause of length l, the probability that it is satisfied is 1 - 1 / 2^l. Let Z_i, 0 <= i < m, be the indicator random variable with value 1 when clause i is satisfied and 0 otherwise. Let S = sum_{0 <= i < m} Z_i. Exp[S] = Exp[sum_{0 <= i < m} Z_i] sum_{0 <= i < m} Exp[Z_i] = sum_{0 <= i < m} Prob[Z_i = 1] = sum_{0 <= i < m} (1 - 1 / 2^{l_i}). The fact that the Z_i are dependent is of no importance because of the linearity of expectation, which even holds for dependent random variables. Because we may assume that there are no empty clauses, l_i >= 1, for all i, and thus sum_{0 <= i < m} (1 - 1 / 2^{l_i}) >= sum_{0 <= i < m} 1 / 2 = m / 2. End.

Applying the randomized method the above results can be used in an existential way:

Corollary: For any set of m clauses there is a truth assignment satisfying at least round_up(sum_{0 <= i < m} (1 - 1 / 2^{l_i})) >= round_up(m / 2) clauses.

Proof: The only point which remains to be verified is that the expected value may be rounded up. This is so, because if the expected value Exp[X] of a random variable X which only assumes integral values is not integral itself, X must assume at least one value which is larger than Exp[X]. The smallest such value is round_up(Exp[X]). End.

This easily obtained result leads to far-reaching conclusions. For the above example with (l_0, ..., l_7) = (1, 2, 1, 1, 2, 2, 2, 2), it follows that there must be an assignment satisfying at least round_up(3 * 1/2 + 4 * 3/4 + 1 * 7/8) = round_up(5 3/8) = 6 clauses. This we already knew, but here the result is obtained by considering only the lengths of the clauses. More generally, for any system of 7 clauses with each 3 literals (without repetitions of the same literal within a single clause), the corollary gives that at least round_up(7 * 7/8) = round_up(6 1/8) = 7 literals can be satisfied. That is, any system with at most 7 clauses with 3 literals each is satisfiable.

Corollary: A_1 gives a 1/2-approximation.

Proof: Because the optimum value f_* of MAXSAT satisfies f_*(I) <= m / 2 for any input I with m clauses, it follows that f_{A_1}(I) / f_*(I) >= m / 2 / m = 1 / 2. End.

A_1 is good for long clauses and even for clauses of length 2 it works satisfactory. Only if there are many singletons, the performance is less good than we would hope to achieve. Therefore, we now present an alternative approximation algorithm A_2 which performs best on short clauses. We will show that the approximation algorithm AA which consists of trying both A_1 and A_2 and taking the best result gives a 3/4-approximation.

MAXSAT can be written as an ILP using variables z_j, 0 <= j < m, and y_i, 0 <= i < n. z_j indicates the degree of "satisfiedness" of clause C_j, while y_i indicates the degree of "trueness" of the literal x_i. Denoting for each clause C_j, by C_j^+ the subset of literals appearing in non-negated form, while C_j^- gives the negated literals, we get

maximize s = sum_j z_j under the following conditions
z_j <= sum_{i in C_j^+} y_i + sum_{i in C_j^-} (1 - y_i), for all j, 0 <= j < m,
y_i in {0, 1}, for all i, 0 <= i < n,
z_j in {0, 1}, for all j, 0 <= j < m.
The maximum value of s is denoted s_max. The corresponding LP, formulated in terms of variables y'_i and z'_j, obtained by dropping the integrality conditions, is given by
maximize s' = sum_j z'_j under the following conditions
z'_j <= sum_{i in C_j^+} y'_i + sum_{i in C_j^-} (1 - y'_i), for all j, 0 <= j < m,
0 <= y'_i <= 1, for all i, 0 <= i < n,
0 <= z'_j <= 1, for all j, 0 <= j < m.
The maximum value of s' is denoted s'_max. Because s'_max is the maximum of the same function computed with less restrictions, s'_max >= s_max. One may hope that s_max lies close to s'_max, but this does not need to be so. Having the clauses C_1 = (x_1 or x_2), C_2 = (x_1 or !x_2), C_3 = (!x_1 or x_2), C_4 = (!x_1 or !x_2), s_max = 3 while s'_max = 4. This immediately shows that for MAXSAT the integrality gap, the maximum ratio of the value of an optimum solution of an LP and that of an ILP, is at least 4 / 3.

The most common way to perform randomized rounding, is to replace a variable z = x + y, with x an integer and 0 <= y < 1, by x + 1 with probability y and by z with probability 1 - y, independently of any other rounding. Define b_k = 1 - (1 - 1 / k)^k. b_1 = 1, b_2 = 3/4 and, for all k, b_k >= 1 - 1 / e.

Lemma: Let C_j be a clause with k literals. The probability that it is satisfied is at least b_k * z'_j.

Proof: Assume wlog that the clause is x_1 or x_2 or ... or x_k. We know that y'_1 + y'_2 + ... + y'_k is x_1 or x_2 or ... or x_k. We know that y'_1 + y'_2 + ... + y'_k >= z'_j. All literals are rounded independently, so the probability that none of the literals is true in the rounded clause is Prod_{i = 1}^k (1 - y'_i). We show that 1 - Prod_{i = 1}^k (1 - y'_i) >= b_k * z'_j. Using Lagrange and concave function properties + the inequality this follows. Prod_{i = 1}^k (1 - y'_i) is maximized when all factors are equal, namely y'_i = z'_j / k. Thus, Prod_{i = 1}^k (1 - y'_i) <= (1 - z'_j / k)^k. We want to show that for all x, 0 <= x <= 1, the function f(x) = 1 - (1 - x / k)^k >= b_k * x. Because f(x) is concave, it suffices to test this for the endpoints of the interval. f(0) = 0 >= b_k * 0. f(1) = b_k >= b_k * 1. End.

This way of computing an approximation for MAXSAT will be referred to as A_2.

Corollary: A_2 gives a (1 - 1/e)-approximation.

Proof: Exp[sum_{0 <= j < m} z_j] = sum_{0 <= j < m} Exp[z_j] >= sum_{0 <= j < m} b_{k_j} * z'_j >= (1 - 1 / e) * sum_{0 <= j < m} z'_j >= (1 - 1 / e) * s_max. End.

The suggested algorithm AA is to take the maximum of the results given by A_1 and A_2.

Theorem: AA gives a 3/4-approximation.

Proof: For some set of m clauses, let n_1 and n_2 denote the expected number of clauses satisfied by A_1 and A_2, respectively. Let S_k be the set of clauses with k literals. n_1 + n_2 >= sum_k sum_{C_j in S_k} (1 - 1 / 2^k) + sum_k sum_{C_j in S_k} b_k * z'_j >= sum_k sum_{C_j in S_k} (1 - 1 / 2^k + b_k) * z'_j >= 3/2 * sum_j z'_j. The last estimate is correct because for all k, 1 - 1 / 2^k + b_k = 2 - 1 / 2^k - (1 - 1 / k)^k >= 3/2. For k = 1, 2 and 3, this can be tested explicitly, for larger k, we estimate 1 / 2^k + (1 - 1 / k)^k <= 1 / 16 + 1 / e <= 0.44. End.

Lovasz Local Lemma

An event E_i is mutually independent of a set of other events S = {E_j}, if for any subset T of S, Prob[E_i| and_{j in T} E_j] = Prob[E_i]. The dependency graph of a set of events is a directed graph, with a node for each event. The edges are so that E_i is mutually independent of the set of events S_i = {j| there is no edge (i, j)}.

Maybe one might believe that if the events are pairwise independent, that then the graph has no edges. However, this is not correct. If we have random variables X and Y chosen uniformly and independently from the set {0, 1}, and a random variable Z = (X + Y) mod 2, then all three are pairwise independent, but clearly any single one is specified by the other two. Let E_X be the event that E_X = 0. Define E_Y and E_Z analogously. Prob[E_X| E_Y] = 1/2 = Prob[E_X], but Prob[E_X| E_Y and E_Z] = 1 != 1/2 = Prob[E_X]. Thus, there must be an edge from E_X to at least one of the two other events. The same holds true for E_Y and E_Z. As a result, the dependency graph for these three events has at least two different edges. We see that the dependency graph does not even has to be unique!

The famous Lovasz local lemma tells us that, provided a condition holds, Prob[and_i (! E_i)] > 0, even when these events are weakly dependent on each other. Of course we know that generally

Prob[and_i (! E_i)] = Prob[! (or_i E_i)] = 1 - Prob[or_i E_i] >= 1 - sum_i Prob[E_i].
This we have used many times before, but this works only if we can put a sufficiently strong bound on Prob[E_i]. Otherwise this whole condition becomes void. The other extreme case is that all the E_i (and thus the ! E_i) are independent. Then we have
Prob[and_i (! E_i)] = Prod_i Prob[! E_i] = Prod_i (1 - Prob[E_i]).
As long as Prob[E_i] < 1, for all i, this gives Prob[and_i (! E_i)] > 0, even though it might be very small. In the context of the randomized method it would nevertheless allow to prove the existence of some object.

Lovasz Local Lemma: Let G = (V, E) be the dependency graph of a set of events E_i, 0 <= i < n. Let x_i, 0 <= x_i <= 1, for all 0 <= i < n, be so that

Prob[E_i] <= x_i * Prod_{j| (i, j) in E} (1 - x_j),
then
Prob[and_{i = 0}^{n - 1} (! E_i)] >= Prod_{i = 0}^{n - 1} (1 - x_i).

The proof is relatively short but omited here. The conclusion is that if we can define x_i satisfying the condition, that then we can forget about the dependency when working with the x_i instead of the probabilities.

As an example we consider a situation in which we have eight clauses: x_0 or x_1 or x_2, x_2 or x_3 or x_4, x_4 or x_5 or x_6, ..., x_14 or x_15 or x_0. Event E_i corresponds to clause_i has value false. Prob[E_i] = 1/8, for all i. Clearly if the clauses would be independent, Prob[and_i (! E_i)] = (1 - 1/8)^8 = (7/8)^8 ~= 0.344. If we would use the correct formula using 1 - or_i E_i, we would find Prob[and_i (! E_i)] >= 0, which is not very interesting. The Lovasz local lemma helps in this case: E_i is dependent only on E_{i - 1} and E_{i + 1} (indices computed cyclically). Taking x_i = 1/5 for all i, we get Prob[E_i] = 1/8 <= 1/5 * (1 - 1/5)^2, and thus we conclude that Prob[and_i (! E_i)] >= (1 - 1/5)^8 ~= 0.168. Trying all possible assignments using the program which can be downloaded here shows that 25889 of the 65536 assignments are satisfying, a fraction of 0.395.

The above example, which by its simplicity can certainly be solved analytically as well, is a special case of the situation described by the following corollary. It is much less general than the lemma, but more easy to use:

Corollary: Let E_i, 0 <= i < n, be events with Prob[E_1] <= p, for all i. If each event is mutually independent of all but d of the other events and e * p * (d + 1) <= 1, then Prob[and_{i = 0}^{n - 1} (! E_i)] > (1 - 1 / (d + 1))^n > 0.

Proof: Take x_i = x = 1 / (d + 1), for all i, 0 <= i < n. This gives Prob[E_i] <= p <= 1 / ((d + 1) * e) <= 1 / (d + 1) * (1 - 1 / (d + 1))^d = x * (1 - x)^d <= x * Prod_{j| (i, j) in E} (1 - x) = x_i * Prod_{j| (i, j) in E} (1 - x_i). Here we used that (1 - 1 / (d + 1))^d > 1 / e, for all d > 0. End.

This lemma can be used to derive results on restricted versions of SAT. We consider the version of k-SAT in which all clauses contain exactly k literals without repetitions. A further restriction is that we assume that each of the n literals occurs (positively or negatively) in at most 2^{(1 - eps) * k} of the in total m clauses. The argument is simple, once one knows the above corollary and thinks of the randomized method:

Theorem: Any instance of k-SAT with n literals and m clauses for which each literal occurs in at 2^{(1 - eps) * k}, eps > 0, clauses is satisfiable for all k >= k_min, where k_min only depends on eps.

Proof: First we consider the special case eps = 1/2. All variables are set to true or false with probability 1 / 2, independently of each other. E_i, 0 <= i < m, denotes the event that clause i does not get satisfied by this assignment. Clearly Prob[E_i] = 2^{-k}, for all i. The event E_i is independent of E_j if clause i and clause j do not share at least one variable. However, there are at most k * 2^{k / 2} clauses having a literal in common with clause i (namely, at most 2^{k / 2} for each of the k literals). Because e * (k * 2^{k / 2} + 1) * 2^{-k} <= 1 for all k >= 10, we can conclude that for these k, Prob[and_{i = 0}^{m - 1} (! E_i)] > (1 - 1 / (k * 2^{k / 2} + 1))^m > 0 for all m. This means that just by throwing randomly, we have a strictly positive probability of finding a truth assignment satisfying all clauses, which in particular implies that such a truth assignment exists. Assuming that each literal occurs in at most 2^{(1 - eps) * k} clauses, for some eps > 0, the condition becomes (k * 2^{(1 - eps) * k} + 1) * 2^{-k} <= 1. End.

Looking into the details, we find k_min(0.5) = 10. For eps = 0.5 and k = 10, any literal can occur at most in 3 clauses. k_min(0.1) = 59. For eps = 0.1 and k = 59, any literal can occur in at most 39 clauses. Notice that the condition on the number of occurrencies imposes an upper bound on m: because we must have n * 2^{k / 2} >= k * m we get m <= 2^{k / 2} / k * n. But, this is not that serious.

The main weakness of the above is that the probability that a random assignment to the literals satisfies all clauses may be very small. Thus, simply repeating random assignments until a satisfying assignment is hit, may take too long. To obtain an algorithm that finds a solution in expected polynomial time we sharpen the conditions. Assume any literal occurs in at most 2^{k / 50} clauses. k is considered to be a constant. If k is constant, even 2^k is constant. This is important because the expected running time of the presented solution is polynomial in m but exponential in k. The algorithm consists of two phases. In the first phase part of the literals are fixed uniformly and independently to true or false. The remaining literals are left unspecified until the second phase where they are fixed in a more elaborate way.

The first phase goes as follows:

  for (int i = 0; i < n; i++)
    dangerous[i] = false;
  for (int i = 0; i < n; i++)
    if (!dangerous[i])
    {
      x[i] = random.nextBoolean();
      for (int j = i + 1; j < n; j++)
        testDangerous(j, x, dangerous, clauses);
    }
Here x[i] denotes the value of literal i. A literal becomes dangerous, meaning that it should not be fixed in a careless way, when it occurs in a dangerous clause. A clause with k literals is dangerous, if k / 2 of its literals have been fixed and it has not been satisfied by these. These dangerous literals are skipped during phase 1.

A clause which is not satisfied during phase 1 is said to have survived. A clause may survive because it became dangerous itself, or because its remaining unspecified literals became dangerous because other clauses in which these occur became dangerous. Any surviving clause still has at least k / 2 unspecified literals. It is important to notice that the partial solution after phase 1 is extendible in the sense that there is an assignment to the dangerous literals so that all surviving clauses are satisfied (proving this is left as an exercise). Thus, there is no need for backtracking. A clause C becomes dangerous when k / 2 of its literals have been fixed in an unlucky way. The probability that this happens is at most 2^{- k / 2}. Thus, the probability that a clause C survives is at most (d + 1) * 2^{- k / 2}, because either C or at least one of the at most d = k * 2^{k / 50} other clauses which has a literal in common with C must have become dangerous.

For the considered problem, the dependency graph G has one vertex for each event E_j, 0 <= j < m, the event that clause C_j is not satisfied. E_j depends on E_{j'} for each j' so that C_j and C_{j'} share a literal. Because there are at most d such clauses, the dependency graph has degree at most d. In the following we will identify clauses and vertices. The possibility to fix the dangerous literals in polynomial time depends on the structure of the subgraph G' of G induced by the the dangerous clauses. Further down we prove that with high probability all connected components of G' are small. Any dangerous literal occurs only in clauses of a single connected component of G'. Thus, the remaining problem can be solved by separately treating the subsets of literals belonging to the clauses occurring in the different connected components of G'. Consider a connected component of size s. These clauses contain k * s literals. These are not all different, and part of them have been fixed in phase 1, so the number of literals to fix for this component is less than k * s. It follows that there are less than 2^{k * s} = 2^k * 2^s assignments to test for this component. Each assignment can be tested in O(k * s) time. That is, an assignment satisfying all clauses in the component can be found in O(k * 2^k * s * 2^s) time. Let p be the number of components, and let s_l, 0 <= l < p, denote the size of component l, the the total time is given by

sum_{0 <= l < p} c * k * 2^k * s_l * 2^{s_l} <=
sum_{0 <= l < p} c * k * 2^k * s_l * 2^{s_max} <=
c * k * 2^k * m * 2^{s_max}.
Here c is a constant and s_max = max{s_l| 0 <= l < p}. The inequality shows that the running time is polynomial in m if s_max = O(log m).

Lemma: With probability 1 - o(1), the size of the largest connected components of G' is bounded by O(log m).

Proof: The probability that any clause survives is at most (d + 1) * 2^{- k / 2}. If we consider clauses C_1 and C_2 which are at distance 4 or more in G, then the event E_1 that C_1 survives is independent of the event E_2 that C_2 survives, because in G no neighbor of C_1 is adjacent to any neighbor of C_2. Thus, for any subset {C_0, ..., C_{r - 1}} of clauses of size r for which the distance of any pair of clauses is at least 4, Prob[and_{0 <= i < r} E_i] = prod_{0 <= i < r} E_i <= ((d + 1) * 2^{- k / 2})^r.

An s-tree of a graph G is a subset T of the nodes satisfying the following two properties:

The name s-tree is not a very lucky choice, because an s-tree together with the edges between the nodes at distance r, does not need to be a tree. For example, when G is an n x n grid, an s-tree T is given by the set of points {(r * i, r * j)| 0 <= i, j < n / r}, where (x, y) gives the point at position (x, y) of the grid. Thus, in this case T has a grid structure. Any connected graph of degree at most d and size d^s * r contains an s-tree of size r. This is namely the size that is certainly achieved by the greedy algorithm for constructing maximal s-trees:
     while (there are still nodes left)
       Pick an arbitrary node u which has distance s to any of the 
       earlier selected nodes (the first node can be picked freely).
       Remove all nodes with distance less than s from u. 
  
Because for each selected node less than 1 + d + d^2 + ... + d^{s - 1} = (d^s - 1) / (d - 1) < d^s are removed, this procedure will select at least n / d^s nodes in a graph with n nodes. In the following we will consider 4-trees of the dependency matrix G. We also consider the graph G_4 derived from G. This graph has the same set of m nodes as G. There is an edge (u, v) in G_4 if and only if u and v have distance equal to 4 in G.

The definitions imply that the number of 4-trees of size r is bounded by the number of connected subgraphs of G_4 of size r. Because the degree of G_4 is bounded by d^4, this number of subgraphs is bounded by m * d^{8 * r}. This result follows from the general fact that a graph with n nodes and degree d has at most n * d^{2 * r} connected subgraphs of size r. Thus, the probability that there is a 4-tree in G of size r for which all clauses survive is bounded by m * d^{8 * r} * ((d + 1) * 2^{- k / 2})^r = m * (d^8 * (d + 1) * 2^{- k / 2})^r. For r = x * log m, for some suitable constant x, this is o(1). Because, due to the argument given above, any connected component of G of size d^4 * r contains a 4-tree of size r, we conclude that the probability that all clauses of a connected subset of size more than x * d^4 * log m survive is o(n). Because d is considered to be constant, this gives the claimed result. End.

The complete algorithm now proceeds as follows:

  repeat
  {
    perform phase 1;
    construct G';
    compute the connected components of G';
    let s_max be the size of the largest component;
  }
  until (s_max <= x * log m);
  let p be the number of connected components;
  for (int l = 0; l < p; l++)
    find a satisfying assignment for the clauses in component l;

Theorem: The algorithm finds a satisfying assignments to the literals of any instance of k-SAT in which each literal occurs in at most 2^{k / 50} clauses. The expected running time is polynomial in m.

Proof: The expected number of iterations through the loop is less than 2 because the probability to succeed in each pass is larger than 1/2. Therefore, the total time spent in the loop is polynomial in m. Above it was shown that the second phase can be performed in O(k * 2^k * m * 2^{s_max}) time. If the condition is satisfied, we have O(k * 2^k * m * 2^{s_max}) = O(k * 2^k * m^{x + 1}) = O(m^{x + 1}), because k is considered to be a constant. End.

Exercises

  1. Give polynomial-time solutions for 1-MAXSAT and 2-MAXSAT, the variants of MAXSAT in which the length of all clauses is bounded to 1 and 2, respectively. Specify the running time of your algorithm in terms of the number of clauses m.

  2. Above it was shown that any system of 7 clauses with each 3 literals is satisfiable. Show that this bound is sharp. That is, give a system of 8 clauses with each 3 literals which is not satisfiable. How many clauses can certainly be satisfied if all clauses have length l?

  3. For arbitrary k, give a set of k-wise independent events for which the dependency graph is not empty. Construct a dependency graph for this set of events which has a minimal number of edges.

  4. The minimum value k_min of k for which k-SAT is satisfiable if each literal occurs in at most 2^{(1 - eps) * k} literals depends only on eps. Investigate how k_min develops as a function of eps when eps goes to 0.

  5. Prove that after phase 1 of the k-SAT algorithm there is an assignment to the dangerous literals so that all surviving clauses are satisfied.

  6. In the k-SAT algorithm, the number of occurrencies of any literal was bounded to 2^{k / 50}. Analyze how much larger this number of occurrencies can be taken without compromising the polynomial running time of the algorithm.


This page was created by Jop Sibeyn.
Last update Monday, 02 February 04 - 18:31.
For any comments: send an email.