As html is not suited for mathematical formulas, some additional notation is used (as used in the typographical package Latex). a_i denotes a with subscript i. a^i denotes a to the power i. <= and >= are used as in most computer languages. Curly brackets, "{" and "}", are use to group things. sum_{i = 0}^j denotes the sum for i running from 0 to j. The same may also be written sum_{0 <= i <= j}. ~= stands for "approximately equal" and ~ for "proportional to". Greek letters are written out. For logical expressions we either use the notation from C or the operators are written out in text. So, "a && !b" is the same as "a and not b". sqrt stands for the square root function and log without specifying the base number for the logarithm to basis 2. There are a few more notations, but they should be understood easily. New notions are printed bold there where they are defined. Inside the chapters particularly important ideas and short key notes are highlighted without introductory text. There are many pictures. These are intended to be self-explanatory. Typically they are placed just after the text fragment to which they belong, generally there is no direct reference to them in the text.
There are very few references to the literature. Clearly most of the presented material is not new. A part is even common knowledge. More directly this text is based on material found in the following book:
Notice: the following text may be overcomplete. At the examination it is expected that the students only know all that has been presented during the lectures.
Exp[X] = sum_i (i * Prob[X = i]).This is also called the first moment of X. The second moment is obtained by replacing i * Prob[X = i] by i^2 * Prob[X = i], this equals the expected value of X^2. Higher moments are defined analogously. The variance Var[X] of X is defined as
Var[X] = Exp[X^2] - (Exp[X])^2The variance does what one expects: it tells something about the way the values are spreading around the expected value.
An example of a probability space are all possible outcomes of throwing two indistinguishable dices in one stroke. In this case the probability space is the set {(1, 1), ..., (1, 6), (2, 2), ..., (2, 6), ..., (5, 5), (5, 6), (6, 6)}. The cardinality of this set is 21. Probabilities may be (but must no be) assigned so as to correspond to real probabilities: Prob[(i, j)] = 1 / 36 if i = j and Prob[(i, j)] = 1 / 18 if i != j. Many random variables can be defined, for example, the random variable X giving the sum of both values. This X assumes values 2, 3, ..., 12. With the defined probabilities Prob[X = 2] = Prob[X = 12] = 1 / 36, Prob[X = 3] = Prob[X = 11] = 2 / 36, Prob[X = 4] = Prob[X = 10] = 3 / 36, ..., Prob[X = 7] = 6 / 36. The expected value of X, which is 7, can be computed easily. The variance is given by 1974 / 36 - 7 * 7 = 5 5/6.
A basic but useful relation is Markov's inequality which applies to random variables with positive values, that is Prob[X < 0] = 0. For them we have
Prob[X >= k * Exp[X]] <= 1 / k.The proof can be given by contradiction: if this probability would be larger, then the expected values would be larger. A stronger estimate, based on the variance is called Chebycheff's inequality, but we will not need it.
Two random variables X and Y are called independent if
Prob[X = x and Y = y] = Prob[X = x] * Prob[Y = y], for all x and y.An example of independent random variables: throwing two coins, one after the other, X and Y being the number of heads (0 or 1) appearing up on the first and second coin respectively. If the coins are glued together, then X and Y are clearly dependent. Either X = Y, or X = 1 - Y depending on how the coins are glued. So, in that case Prob[X = 0 and Y = 0] is either 0 or 1, different from Prob[X = 0] * Prob[Y = 0] = 1/2 * 1/2 = 1/4.
The conditional probability of an event X = x under condition Y = y is denoted as Prob[X = x| Y = y]. It is defined as
Prob[X = x| Y = y] = Prob[X = x and Y = y] / Prob[Y = y].For example, if X gives the outcome of throwing a dice, and Y the outcome of throwing another dice and Z = X + Y, then Prob[Z >= 10] = 6 / 36 = 1 / 6, but Prob[Z = 10| X >= 5] = Prob[Z >= 10 and X >= 5] / Prob[X >= 5] = (4 / 36) / (2 / 6) = 1 / 3. Of course we can also write
Prob[X = x and Y = y] = Prob[X = x] * Prob[Y = y| X = x] ,expressing the cumulative effect of the operations: first we check the value of X in full independence, then we add Y, where the choice may be impacted by the earlier choice of X. Of course this can be extended to
Prob[and_{i = 1}^k X_i = x_i] = Prob[X_1 = x_1] * Prob[X_2 = x_2| X_1 = x_1] * Prob[X_3 = x_3| X_1 = x_1 and X_2 = x_2] * ... * Prob[X_k = x_k| and_{i = 1}^{k - 1} X_i = x_i].The proof can easily be given by induction. Notice that if X and Y are independent, Prob[X = x| Y = y] = Prob[X = x and Y = y] / Prob[Y = y] = Prob[X = x] * Prob[Y = y] / Prob[Y = y] = Prob[X = x]. Thus, independence and conditional probability behave in a natural way.
Prob[X = x or Y = y] = Prob[X + x] + Prob[Y = y] - Prob[X = x and Y = y] <= Prob[X = x] + Prob[Y = y]. Using induction this can be used to prove that for a set of n random variables X_i, 1 <= i <= n, Prob[or_{i = 1}^n X_i = x_i] <= sum_{i = 1}^n Prob[X_i = x_i]. The most important special case is that all X_i are equally distributed and that x_i = x for all i. In that case Prob[or_{i = 1}^n X_i = x] <= n * Prob[X_1 = x]. This also implies that Prob[or_{i = 1}^n X_i > x] <= n * Prob[X_1 > x], a relation which is frequently used in estimates.
Expected values are linear:
Lemma: For two random variables X and Y, Exp[X + Y] = Exp[X] + Exp[Y].
Proof: The most important step is an alternative way of writing the expected value of Exp[X]. Exp[X] = sum_x x * Prob[X = x] = sum_x x * sum_y Prob[X = x and Y = y] = sum_x sum_y x * Prob[X = x and Y = y]. Here we use that Prob[X = x] = sum_y Prob[X = x and Y = y], which follows from the interpretation of Prob[X = x and Y = y] as being an entry in a two-dimensional table of probabilities. Using this relation for Exp[X] and the analogous one for Exp[Y], we get
Exp[X] + Exp[Y] = sum_x sum_y x * Prob[X = x and Y = y] + sum_x sum_y y * Prob[X = x and Y = y] = sum_x sum_y (x + y) * Prob[X = x and Y = y] = Exp[X + Y].End. The essential point is that it is not required that X and Y are independent. Using induction this lemma can be used to prove that this even holds for any countable number of random variables:
Corollary: For n random variables X_i, 1 <= i <= n, Exp[sum_{i = 1}^n X_i] = sum_{i = 1}^n Exp[X_i].
An important special case is that the X_i are equally distributed. In that case Exp[sum_{i = 1}^n X_i] = n * Exp[X_1]. That this is something special becomes clear when one considers Exp[A * B]. In general this cannot be written as Exp[A] * Exp[B]. For example, if Prob[A = 0] = Prob[A = 1] = 1/2 and B = 1 - A, then Exp[A] = Exp[B] = 1/2, so Exp[A] * Exp[B] = 1 / 4. On the other hand, Exp[A * B] = 0 * Prob[A * B = 0] + 1 * Prob[A * B = 1] = 0 * 1 + 1 * 0 = 0. However, if A and B are independent, then Exp[A * B] = Exp[A] * Exp[B]. This can be shown analogously to the proof of the linearity of the expected value.Exp[sum_{i = 1}^n sum_{j = i + 1}^n X_{i, j}] = sum_{i = 1}^n sum_{j = i + 1}^n Exp[X_{i, j}].
Now we have reduced the problem of analyzing quicksort to that of analyzing the X_{i, j}. If p_{i, j} is the probability that X_{i, j} = 1, then Exp[X_{i, j}] = 0 * (1 - p_{i, j}) + 1 * p_{i, j} = p_{i, j}. This is a trivial but general fact: the expected value of a 0-1 random variable X equals Prob[X = 1]. The execution is viewed as a binary tree, in each node of the tree we write the splitter element y, all elements in S_< are arranged to the left and all elements in S_> to the right. a_i and a_j are compared iff one is an ancestor of the other. Now define a permutation pi of the elements by traversing the tree in level-order, that is, level by level, starting with the root. There are two further observations that allow us to compute p_{i, j}:
sum_{i = 1}^n sum_{j = i + 1}^n p_{i, j} =
sum_{i = 1}^n sum_{j = i + 1}^n 2 / (j - i + 1) =
sum_{i = 1}^n sum_{k = 2}^{n - i + 1} 2 / k <=
2 * sum_{i = 1}^n sum_{k = 1}^n 1 / k =
2 * sum_{k = 1}^n H_k <= 2 * n * H_n.
where H_k gives the k-th harmonic number. Because H_n = sum_{i
= 1}^n 1 / i ~= integral_1^n 1 / x dx = ln n, this shows that the
expected number of comparisons for this variant of quicksort is bounded
by O(n * log n). This estimate is rather rough. More careful analysis
allows to put bounds on the constant. Running a small program gives the
exact values for small n:
n = 2, sum = 1, sum / n = 0.500, sum / (n * log n) = 0.500 n = 4, sum = 4, sum / n = 1.208, sum / (n * log n) = 0.604 n = 8, sum = 16, sum / n = 2.115, sum / (n * log n) = 0.705 n = 16, sum = 50, sum / n = 3.184, sum / (n * log n) = 0.796 n = 32, sum = 139, sum / n = 4.371, sum / (n * log n) = 0.874 n = 64, sum = 360, sum / n = 5.636, sum / (n * log n) = 0.939 n = 128, sum = 889, sum / n = 6.951, sum / (n * log n) = 0.993 n = 256, sum = 2123, sum / n = 8.297, sum / (n * log n) = 1.037 n = 512, sum = 4945, sum / n = 9.660, sum / (n * log n) = 1.073 n = 1024, sum = 11297, sum / n = 11.033, sum / (n * log n) = 1.103 n = 2048, sum = 25420, sum / n = 12.412, sum / (n * log n) = 1.128 n = 4096, sum = 56502, sum / n = 13.795, sum / (n * log n) = 1.150 n = 8192, sum = 124344, sum / n = 15.179, sum / (n * log n) = 1.168 n = 16384, sum = 271382, sum / n = 16.564, sum / (n * log n) = 1.183 n = 32768, sum = 588170, sum / n = 17.950, sum / (n * log n) = 1.197 n = 65536, sum =1267172, sum / n = 19.336, sum / (n * log n) = 1.208 n =131072, sum =2716025, sum / n = 20.722, sum / (n * log n) = 1.219The values in the last column appear to converge to some value around 1.3. This is also suggested by experiments though in actual applications the deviations from the expected value are so larger that experiments must be repeated many times in order to get a somewhat reliable estimate of the expected number of comparisons performed. An exercise asks to compute the limiting value in an analytical way. Here it is not so much the (known) result as the derivation which is interesting.
Let us assume that the element s is chosen uniformly at random and that all elements are different. How long does the algorithm take? Here we have the problem of analyzing a randomized algorithm. There are many proofs that this time is O(n * log n) in some sense. The following statement also contains some nice ideas itself. We are using the notion of with high probability, which means that the probability that the claim is not satisfied is less than n^{- eps}, for some eps > 0. We might use the so-called Chernoff bounds, which are the most important tool in estimating the tail probabilities of the distribution of the sum of independent Bernoulli trials (a Bernoulli trial is a 0-1 experiment with a given probability p to get 1 and probability 1 - p to get 0: throwing a biased coin).
Lemma (Chernoff Bound):
let X be the random variable corresponding to throwing a biased
coin: with probability p the result is 1, with probability 1 - p
the result is 0. Let S be the random variable giving the number of
1 outcomes in n independent experiments of the above type. Then
Prob[S > p * n + h] <= e^{- h^2 / (3 * p * n)}
Prob[S < p * n - h] <= e^{- h^2 / (3 * p * n)}
Lemma:
let X be the random variable corresponding to throwing an unbiased
coin: with probability 1/2 the result is 1, with probability 1/2
the result is 0. Let S be the random variable giving the number of
1 outcomes in n independent experiments of the above type. Then
Prob[S < n / 4] <= n / 4 * 0.91^n.
Proof: The probability P_k that among the n trials there are exactly k successes is given by
P_k = (n over k) * (1/ 2)^n.Away from the expected value n / 2 these probabilities are monotonically decreasing. Thus, sum_{h < k} P_h <= k * P_k. Using Stirling's formula, we find that for all n and k:
(n / k)^k <= (n over k) <= (n * e / k)^k.So, P_k <= 2^{-n} * (n * e / k)^k. This gives
Prob[S < n / 4] <= n / 4 * ((4 * e)^{1/4} / 2)^n.End. In words this lemma states: "when expecting n / 2 successes, we may be almost sure that there are actually at least n / 4 successes".
Theorem: Quicksort sorts n numbers in O(n * log n) time, with high probability.
Proof: We are considering the depth of the recursion tree. Because at every level at most O(n) work is performed, bounding this depth to O(log n) would give the result.
Consider the depth of the branch of the recursion tree in which a specific element x lies. At every level the size of the subset to which x belongs is reduced, and the recursion ends when the size of this subset has been reduced to 1, consisting of x itself only.
The central idea of this proof (and of variants of it) is that of a successful split: a split is successful if the size of the subset is of x reduced to 3/4 of its original size or less. This happens with probability at least 1/2.
Claim: the probability on a successful split is at least 1/2. The set in which x comes is not larger than the largest of S_< and S_>. In general, Prob[max{X, Y} >= c] = Prob[X >= c or Y >= c] <= Prob[X >= c] + Prob[Y >= c]. Using this, the claim follows from the fact that Prob[|S_<| >= 3/4 * |S|] = Prob[|S_> >= 3/4 * S] < 1/4.
A sequence of at most log_{4/3} n successful splits (intermixed with any number of non-successful splits) results in a set of size one. The chance that a split is successful is independent of any of the previous actions. Thus, we can use the Chernoff bounds to state that we need only marginally more than 2 * log_{4/3} n = O(log n) splits to guarantee that with high probability log_{4/3} n of them are successful. Applying our weaker lemma we conclude that from among 4 * log_{4/3} n splits at least log_{4/3} n are successful with high probability.
This argument is correct but not complete. One still should address the following point: we have now shown that the branch with element x as a leaf does not lie deeper than O(log n) with high probability. However, the recursion tree has n leafs and we must prove that all of them lie at depth bounded by O(log n). Thus, we have n experiments (which are not independent). Here we use that
Prob[a or b] = Prob[a] + Prob[b] - Prob[a and b] <= Prob[a] + Prob[b].This holds even when a and b are dependent! Using this observation, we obtain for a set of n equally likely events a_1, ..., a_n:
Prob[a_1 and .. and a_n] = 1 - Prob[not (a_1 and ... and a_n)] = 1 - Prob[(not a_1) or ... or (not a_n)] <= 1 - n * Prob[not a_1].Thus, if we can bound the non-success probability of a single experiment to n^{- (1 + eps)}, for some eps > 0, then we can guarantee that with high probability even n of these experiments end successfully. This last condition is satisfied in our case (end hardly ever a problem, it should just be remarked) because 0.91^{4 * log_{4/3} n} ~= 0.91 * {9.63 * log_2 n} ~= 0.40^{log_2 n} ~= n^{-1.31}. End
/* for node i, succ[i] denotes the next element in the list */
For (each node i)
randomly and uniformly pick x[i] from {0, 1};
For (each node i)
if (x[i] == 0 and x[succ[i]] == 1)
y[i] = 1;
else
y[i] = 0;
We claim that the set S of nodes with y[i] == 1 is an independent set.
The correctness of this claim is obvious: if y[i] == 1, then
x[i] == 0, and thus, for the node i' with succ[i'] = i, x[succ[i']]
= 0, which implies that y[i'] == 0. Also, if y[i] == 1, then for the
node i'' = succ[i], we must have x[i''] == 1, which implies that y[i'']
= 0. So, if i in S, then neither its predecessor nor its successor
belong to S.
How large is the selected independent set S? The expected value is trivial to compute: for any element, the probability of being selected depends only on the choice of x[i] and x[succ[i]]. In one out of four cases these values are such that y[i] == 1. Each case is equally likely, so Prob[y[i] == 1] = 1/4. Using linearity of expected values, we get:
Exp[sum_i y[i]] = sum_i Exp[y_i] = n * (0 * 3/4 + 1 * 1/4) = n / 4.In an implementation we should treat the final node in a special way: the final node i sets y[i] = 1 iff x[i] = 0. So, for this node Prob[y[i] = 1] = 1/2. In the above and following analysis we ignore this small difference.
What can we prove with high probability? This is not so easy. The problem is that the x[i] are dependent (no two consecutive ...). Thus, we cannot immediately use the Chernoff bounds as we would like to do. There are two solutions.
The simplest, but nice, idea, is to concentrate on a smaller set of results that are independent. Let Z_j be the random variable giving the value x[i], where i is the index of the j-th element in the list. Z_j depends on Z_{j + 1}, but not on Z_{j + 2}. More generally, the set of values Z_{2 * k}, 0 <= z < n / 2, are independent. This is true, because Z_{2 * k} only depends on the choices x[i] and x[i'] where i is the node with rank 2 * k and i' is the node with rank 2 * k + 1. Thus, now Chernoff bounds immediately give that sum_j Z_j > n / 8 - O(sqrt{n * log n}), with high probability.
This is quite nice, but we have lost a factor 2 here. Instead of n / 4 we are proving that it is actually not much less than n / 8! Not very satisfactory, because experiments show that the actual value lies close to n / 4. To prove this, we need a more powerful canon:
Lemma (Azuma Inequalities):
Let X_1, ..., X_n be independent random variables. For each i,
X_i takes values in a set A_i. Let f: prod_i A_i -> real_numbers
be a measurable function satisfying |f(x) - f(y)| <= c,
when x and y differ only in a single coordinate. Let Z be
the random variable f(X_1, ..., X_n). Then for any t > 0,
Prob[|Z - Exp[Z]| > t] <= 2 * e^{-2 * t^2 / (c^2 * n)}.
In the following we assume that the edge to contract along is picked uniformly at random, that is, if the graph currently has m' edges, then any of them is picked with probability 1 / m'. Let k be the size of the mincut. Then we must have m >= k * n / 2, because all nodes must have degree at least k in this case. The mincut is denoted by C. Let E_i designate the event of not picking any of the edges in step i, 1 <= i <= n - 2. Prob[E_1] >= 1 - k / (k * n / 2) = 1 - 2 / n. Assuming E_1, we have n - 1 nodes with degree at least k when making the second choice. This gives Prob[E_2| E_1] >= 1 - k / (k * (n - 1) / 2) 1 - 2 / (n - 1). More generally, Prob[E_i|and_{j = 1}^{i - 1} E_j] = 1 - k / (k * (n - i + 1) / 2) = 1 - 2 / (n - i + 1). Using the relation on rewriting a multiple-and as a product of conditional probabilities, we get
Prob[and_{i = 1}^{n - 2} E_i] =
prod_{i = 1}^{n - 2} Prob[E_i|and_{j = 1}^{i - 1} E_j] =
prod_{i = 1}^{n - 2} (1 - 2 / (n - i + 1)) =
2 / (n * (n - 1)) > 2 / n^2.
So, we have some (unfortunately rather small) probability of indeed finding the mincut in this way. Of course a probability of 1 - 2 / n^2 of finding a wrong answer is not very satisfying. However, by repeating the experiment and finally outputting the smallest cut we were finding, the success probability can be made arbitrarily close to 1. The idea is that if we make the random choices in each attempt independently of any previous choices, the error probabilities are independent and thus, when F_j denotes the event of erring in experiment j, we have
Prob[and_{j = 1}^t F_j] = Prod_{j = 1}^t Prob[F_j] = (1 - 2 / n^2)^j <= e^{-2 * t / n^2}.For t = n^2 * log n success is given with high probability. This is not yet very efficient, but it clearly shows how unexpectedly simple randomized algorithms can solve hard problems. It also shows the idea of boosting the success probability by repeating the experiment.
A binary planar partition consists of a special kind of binary tree: each internal node corresponds with a line and a polygonal region of the plane. The region of the root is the whole plane. The regions of the children u_1 and u_2 of an internal node u are the parts of the region of u that lie at either side of the line of u. In the regions of the leafs we find at most a part of a single object for a set of n graphical objects in total. In our case we assume that the objects are line segments (any more complex shape can be approximated by a set of line segments connected in their endpoints).
Clearly the problem of finding optimum binary planar partitions is very hard, it is not even easy to determine how many possibilities one might have for drawing the first line. Nevertheless there is a simple randomized algorithm, with good expected performance. In the following we only consider auto partitions: partitions defined by lines running through the segment. The line itself is assigned to the region containing fewest segments (if both have the same number any assignment is ok).
The segments are indexed from 1 to n: u_1, ..., u_n. The line through segment u is denoted l_u. The algorithm is almost trivial:
1. Construct a random permutation pi of {1, 2, ..., n}.
2. While a region contains more than one segment, cut it
with l_{u_i}, where i is the smallest index in the
ordering of pi so that l_{u_i} cuts the region.
It is not obvious how to implement this method efficiently, but
in any case it can be done in O(n^2), and implementation is not the
main concern now. It is more interesting to study the size of the
partition.
Theorem The expected size of the constructed autopartition is bounded by O(n * log n).
Proof: Let index(u, v) = number of segments the line l_u cuts before hitting segment v (segment v counted as well). If l_u does not cut segment v, then index(u, v) = infinity. u # v is the event that l_u cuts segment v in the constructed partition. This only happens if u comes before all the other i = index(u, v) segments v_1, ..., v_i in the permutation pi and before v. Thus, Prob[u # v] = 1 / (1 + index(u, v)). The number of internal nodes in the tree equals n + number_segments_cut. The expected value of number_segments_cut equals the sum of expected values of the C_{u, v}, where C_{u, v} is the indicator variable of the event u # v. An indicator variable of an event is a random variable that has value 1 if the event is true and value 0 otherwise. Exp[C_{u, v}] = Prob[u # v] = 1 / (1 + index(u, v)). Notice that for given u, in the multiset of values {index(u, v)| v != u} each value occurs at most twice. Thus, Exp[number_segments_cut] = sum_u sum_{v != u} Exp[C_{u, v}] = sum_u sum_{v != u} Prob[u # v] = sum_u sum_{v != u} 1 / (1 + index(u, v)) <= 2 * n * H_n. End. Because this holds for all inputs, it shows that for every set of segments there must be at least one partition in which no more than 2 * n * H_n segments are cut: if a function f, based on k random choices has expected value Exp[f], then there is a set of choices x_1, ..., x_k, so that f(x_1, ..., x_k) <= Exp[f] (and also a set of choices y_1, ..., y_k, so that f(y_1, ..., y_k) >= Exp[f]).A priori, it is not obvious at all that such a choice exists: if there is a whole bunch of segments, one might just as well believe that it is inevitable that one cuts many of them, maybe constructing a partition that cuts Omega(n^2) segments. Here we have used a fact about an expected value to derive an existence proof. This is what is called the randomized method. This proof is completely non-constructive. We can only hope to rapidly find a good example, but there is absolutely no guarantee: it may be so that most constructed partitions are slightly larger than the expected value while a few are much smaller.
or0[k] = 2.0 * and0[k - 1],Substitution gives
or1[k] = 0.5 * and0[k - 1] + 1.0 * and1[k - 1],
and0[k] = 1.0 * or0[k - 1] + 0.5 * or1[k - 1],
and1[k] = 2.0 * or1[k - 1].
or0[k] = 2.00 * or0[k - 2] + 1.00 * or1[k - 2],In addition we have or0[0] = or1[0] = 1. Some values: or0[2] = 3, or1[2] = 2.75, or0[4] = 8.75, or1[4] = 7.1875, or0[6] = 24.6825, or1[6] =
or1[k] = 0.50 * or0[k - 2] + 2.25 * or1[k - 2].
For normal single recurrence relations there is the method using the characteristic polynomial to find a solution. In our case we have two interdependent relations, which makes things harder. The easiest is to guess a general form of a solution and then to proof the correctness of this by induction. Here we guess that or0[k] ~= c0 * a^k, while we guess that or1[k] ~= c1 * a^k. This is motivated by the observation that or0 and or1 essentially must grow equally fast, because or0[k] > or1[k - 2] and or1[k] > 0.50 * or0[k - 2]. We also know that such recurrence relations generally grow exponentially (by induction it is easy to prove that both are at least 2^{k / 2}). Now we want to find the constant a. Substituting and dividing by a^{k - 2} gives
c0 * a^2 = 2.00 * c_0 + 1.00 * c_1We have three unknowns and only two relations. However, we mainly want to know a, we are not really interested in the constants c0 and c1. We will therefore determine their quotient b = c1 / c0. The relations become
c1 * a^2 = 0.50 * c_0 + 2.25 * c_1.
a = 2.00 + 1.00 * bSubtraction gives
a = 0.50 / b + 2.25.
4 * b^2 - b - 2 = 0.This gives b = (1 + sqrt{33}) / 8 ~= 0.843. Substitution gives a = 2.843, instead of the estimate 3 we had before. The algorithm has not improved, but we know better how good it is: prove that
or0[2 * l] = c * (1.000 + o(1)) * 2.843^l,Here c is a constant which can be approximated by considering the value of or0[2 * l] for sufficiently large l. Except for lower-order terms this result is sharp, so the number of evaluations for any of the four cases for a tree of depth 2 * l is given by Theta(2.843^l).
or1[2 * l] = c * (0.843 + o(1)) * 2.843^l.
Let x_l = or0[2 * l] and y_l = or1[2 * l], then we have the following intertwinned recurrence relation:
x_l = 2 * x_{l - 1} + 1 * y_{l - 1}, for all l > 0,
x_0 = 1,
y_l = 1/2 * x_{l - 1} + 2 1/4 * y_{l - 1}, for all l > 0,
y_0 = 1.
This recurrence relation can also be written in matrix form. Let A = (a_ij) be given by a_00 = 2, a_01 = 1, a_10 = 1/2 and a_11 = 2 1/4, and let v^(l) be the vector with entries x_l and y_l, then v^(l) = A * v^(l - 1) = A^l * v^(0). In general a 2 x 2 matrix has two eigenvalues c_0 and c_1 and two eigenvectors e_0 and e_1. If we are not too unlucky, e_0 and e_1 are independent and thus can they be used as a basis. That is, v^(0) can be written as a linear combination of e_0 and e_1: v^(0) = d_0 * e_0 + d_1 * e_1 for some constants d_0 and d_1. Now, using induction, it follows immediately that v^(l) = A * A^{l - 1} * v^(0) = A * (d_0 * c_0^{l - 1} * e_0 + d_1 * c_1^{l - 1} * e_1) = d_0 * c_0^l * e_0 + d_1 * c_1^l * e_1.
The sketched solution method can be used for any system of intertwinned recurrence relations as long as this system is linear. Of course, if there are n recurrence relations, the solution involves finding the eigenvalues of an n x n matrix which involves solving a polynomial of degree n. For n > 2 this is not an easy task. It also requires a lot of work.
In our concrete case, we look for solutions to the following system of equations:
2 + y = c, 1/2 + 2 1/4 * y = c * ySubstituting y = c - 2 in the second equation gives c^2 - 4 1/4 * c + 4 = 0. The solutions are c_0 = (17 + sqrt(33)) / 8 ~= 2.843 and c_1 = (17 - sqrt(33)) / 8 ~= 1.407.
In this very simple case, there is no need to compute the eigenvectors explicitly. Knowing the form of the solution, the coefficients can also be computed directly, using that if x_l = alpha * c_0^l + beta * c_1^l and y_l = gamma * c_0^l + delta * c_1^l, for all l >= 0, in particular we must have
alpha + beta = 1, gamma + delta = 1, alpha * c_0 + beta * c_1 = 3, gamma * c_0 + delta * c_1 = 2.75.From these it immediately follows that
alpha = (3 - c_1) / (c_0 - c1) = 1/2 + 7/66 * sqrt(33) ~= 1.109, beta = (c_0 - 3) / (c_0 - c1) = 1/2 - 7/66 * sqrt(33) ~= -0.109, gamma = (2.75 - c_1) / (c_0 - c1) = 1/2 + 5/66 * sqrt(33) ~= 0.935, delta = (c_0 - 2.75) / (c_0 - c1) = 1/2 - 5/66 * sqrt(33) ~= 0.065.So, after substitution, we conclude that for all l,
or0[2 * l] = (1/2 + 7/66 * sqrt(33)) * ((17 + sqrt(33)) / 8)^l
+ (1/2 - 7/66 * sqrt(33)) * ((17 - sqrt(33)) / 8)^l,
~= 1.109 * 2.843^l - 0.109 * 1.407^l,
or1[2 * l] = (1/2 + 5/66 * sqrt(33)) * ((17 + sqrt(33)) / 8)^l
+ (1/2 - 5/66 * sqrt(33)) * ((17 - sqrt(33)) / 8)^l
~= 0.935 * 2.843^l + 0.065 * 1.407^l.
Just as for the Fibonacci numbers, we have an expression involving
square roots which nevertheless for all integral values of l gives
integral values.
max_p min_q Exp[C(I_p, A_q)] = min_q max_p Exp[C(I_p, A_q)]These are reformulations of von Neumann's and Loomi's theorems which will be presented hereafter. There it will become clear that actually all four expressions give the same value. The equality min_q max_p Exp[C(I_p, A_q)] = min_q max_{I in II} Exp[C(I, A_q)] can be understood as "there is no need to randomize both algorithm and input".
max_p min_{A in AA} Exp[C(I_p, A)] = min_q max_{I in II} Exp[C(I, A_q)]
Leaving away the maximum on the left-hand side and the minimum on the right-hand side, leads to the following inequality, which is known as Yao's minimax principle. For any distribution p on the inputs and q on the algorithms,
min_{A in AA} Exp[C(I_p, A)] <= max_{I in II} Exp[C(I, A_q)].In words this states that for any distribution on the inputs the expected cost for the best algorithm from AA provides a lower bound on the worst-case expected cost when the algorithms are sampled according to q. Because randomly sampling algorithms from a class of deterministic algorithms corresponds to a randomized algorithm that develops according to the random choices that are made in the course of the program, this provides a lower-bound for randomized algorithms, because we can take AA to be the set of all algorithms. The issue of AA being finite is not really a problem: we can concentrate on algorithms that run for at most O(f(n)) time for an input of size n: there are only finitely many of them.
When applying this lower bound, the freedom lies in the choice of p: if we find a class of inputs for which there is no good deterministic algorithm, then we have found a strong lower bound on the performance of randomized algorithms. Of course, if we choose II and p in a non-clever way, then the algorithm A might be tuned to perform well on this class.
To make the following description easier, we first reduce the tree with and and or nodes to a tree with nor nodes only. In this replacement, the values originally computed in the or levels are getting inverted value. The values originally computed at the and levels are the same. If the bottom level consists of ands, all values in the terminals must be inverted. By induction, using that a or b = !(a nor b); a and b = (!a) nor (!b), one can show that this replacement computes the same value if the root node was an and, and the opposite value if the root node was an or.
For a network of depth 2 * l, the input class II consists now of all 2^{4^l} possible inputs. Each value is 1 with probability x = (3 - sqrt{5}) / 2. x was just chosen so that if the two entries of a nor port are 1 with probability x independently of each other, that then the probability that the exit has value 1 equals (1 - x)^2 = x. So, the probability that any node has value 1 equals x. Now one can show that it is correct to concentrate on algorithms that evaluate the tree depth-first. This is correct in the sense that if W(T) denotes the minimum expected time to evaluate a tree T in which all leafs are set to 0 or 1 independently with fixed probability, that then there is also a depth-first algorithm achieving an expected value W(T).
For depth-first algorithms, the time is simply given by the formula we had before for evaluating the upper bound:
W(k) = W(k - 1) + (1 - x) * W(k - 1).That is, W(k) = (2 - x) * W(k - 1) ~= 1.618 * W(k - 1). This gives W(2 * l) ~= 2.618^l. This does not match our computed upper bound! The reason is that our probability distribution over the class of inputs was not chosen optimally: it is a waste to have any nodes in which both inputs have value 1. So, the values at the terminals should be chosen 0 and 1, but not independently. This analysis shows that the algorithm is optimal and is therefore important, but it is even more important that our simple analysis already could show that one does not need to look for randomized algorithms evaluating only 2^l leafs or less.
Deterministic strategies are called pure. If the game has no solution, then knowledge about the strategy of the opponent allows to improve the payoff. Randomized strategies are called mixed. For deterministic strategies, there is in general no solution. However, if the choices are made randomly, each choice being made with a certain probability, then we get a vector p and a vector q for the other player, and the expected payoff equals p * M * q. Now
V_R = max_p min_q p * M * q,
V_C = min_q max_p p * M * q.
Theorem: V_R = V_C.
This is known as von Neumann's minimax theorem. It is a more general formulation of Loomi's theorem, which is equivalent:Theorem: max_p min_j p * M * e_j = min_q max_i e_i * M * q.
Proof: Once p has been fixed, p * M * q is a linear function of q. A linear function with a vector of unit length assumes its maximum value along one of the coordinate axis. So, for all p, min_q p * M * q = min_j p * M * e_j. Analogously, for all q, max_p p * M * q = max_i e_i * M * q. Thus, max_p min_j p * M * e_j = max_p min_q p * M * q = V_R = V_C = min_q max_p p * M * q = in_q max_i e_i * M * q. End.
Corollary: min_j p * M * e_j <= max_i e_i * M * q, for all vectors p and q of unit length.
Now the column player is viewed as the algorithm designer and the row player as the adversary choosing the input. The pay-off is given by the cost for a specific algorithm on a specific input. The algorithms considered must be correct and terminating, to make everything finite. V_C, using the first definition, gives the best achievable worst-case running time of a deterministic algorithm. A mixed strategy for the column player C corresponds to an always correct randomized algorithm. Notice that hence the obtained lower-bounds do not apply to Monte-Carlo algorithms, they only concern Las Vegas algorithms!In the following we show that in a very weak sense most randomized algorithms can be made deterministic, but only in an existential way: there is no way to actually perform this derandomization. The argument, which is due to Adleman, is nevertheless very beautiful. We consider (randomized) Boolean circuits. A Boolean circuit can model the operations of an algorithm.
A boolean circuit is a directed acyclic graph in which the nodes have indegree 0, 1 or 2. The nodes are labeled INPUT, NOT, AND, OR and OUTPUT. The n input nodes have indegree 0. The boolean nodes have indegree 1 or 2 according to their type. The input nodes get assigned boolean values x_1, ..., x_n (in the following 0 will be used as a short notation for false and 1 for true). There is a single output node, which has outdegree 0. The value resulting at the output node is a function of the x_i. We say that the circuit computes this function. The size of the circuit is the number of nodes in it. In addition to all this, a randomized circuit has random inputs r_i, which get assigned a boolean value independently and equiprobably. A randomized circuit is said to compute a function f if the following two conditions are satisfied:
Let f = f_1, f_2, ... be a (family of) function(s), with f_n: {0, 1}^n -> {0, 1}. A family of circuits C = C_1, C_2, ... is a circuit family for f if C_n is a circuit for f_n, for all i. This circuit family is said to be polynomially sized, if |C(n)| = O(n^k), for some constant k. C is a randomized circuit family for f, if the C_n are randomized circuits and compute the functions f_n in the above defined sense.
Theorem (Adleman's Theorem): If a family of boolean functions {f_1, f_2, ...}, has a randomized polynomial-sized circuit family {C_1, C_2, ...}, then it has a polynomial-sized circuit family.
Proof: For each n we will construct a deterministic circuit D_n which is larger than C_n by a factor which is polynomial in n. Form a matrix with 2^n rows and 2^m columns. The rows correspond to the choices of the inputs, the columns to the possible random choices. The entries give the values of f_n for the inputs x_1, ..., x_n, in combination with the values of r_1, ..., r_m. All rows for which f_n evaluates to 0 are deleted. The remaining rows have 1 in at least half the positions. So, at least half of all entries in the remaining matrix is 1. Thus, there must be a column in which at least half of all entries is 1. Construct a new circuit with these m choices fixed. Eliminate all rows for which this function computes 1. The remaining rows again have at least half of their entries equal to 1. Thus, again there must be column in which at least half of the remaining rows have a 1. Construct the corresponding circuit as well. Continue until all rows are eliminated. The final circuit D_n computes the or of all newly constructed circuits. Because each step reduces the number of rows by a factor 2, we need at most n circuits and therefore is the size of this circuit at most n times larger than that of C_n (plus a few gates for computing the or function of at most n values). End.
The problem with this method is that the constructed circuit works for a specific n, but it is not necessarily so that for n + 1 the circuit looks similar. More formally: the constructed deterministic algorithm is non-uniform in the following sense. a() is a function from the positive integers to the strings. An algorithm is said to use a() as advice, if for an input of size n it uses a(n) as advice. The important point is that it uses the same advice for all inputs. This advice is used for fine tuning the algorithm for this specific n. A uniform algorithm uses no advice. By a correspondence between polynomial-time algorithms and polynomial-sized circuit families, Adleman's theorem implies that the class of randomized polynomial-time algorithms is a subclass of the deterministic non-uniform polynomial-time algorithms. This is interesting but not practical.One of the most famous theorems from number theory is Fermat's little theorem (formulated 1640):
a^{n - 1} mod n = 1, for any prime n and any number a, 1 <= a < n.For example, 1^6 = 1 = 1 + 0 * 7, 2^6 = 64 = 1 + 9 * 7, 3^6 = 729 = 1 + 104 * 7, 4^6 = 4096 = 1 + 585 * 7, 5^6 = 15625 = 1 + 2232 * 7, 6^6 = 46656 = 1 + 6665 * 7. Furthermore, for any composite number n, there are numbers a, 1 <= a < n, so that a^(n - 1) mod n > 1. The smallest composite number n for which there is an a with a^{n - 1} mod n = 1, is 15: 4^14 mod 15 = (4^7 mod 15)^2 mod 15 = 4^2 mod 15 = 1. Also 11^14 mod 15 = (11^7 mod 15)^2 mod 15 = 11^2 mod 15 = 1. For all other a, 2 <= a <= 13, a^14 mod 15 > 1. More generally, for most composite numbers n there are many a so that a^(n - 1) mod n > 1. In any case this holds for all a with ggd(a, n) > 1, so for all a which have a common factor with n.
Fermat's theorem and the partial reversal of it lead to a prime test which tests for primality without factorization. A simple randomized algorithm tests for primality as follows:
int expoMod(int a, int m, int n)
{
int c, z;
for (c = a, z = 1; m != 0; m = m >> 1)
{
if (m & 1) /* m is odd */
{
z *= c;
z %= n;
}
c *= c;
c %= n;
}
return z;
}
boolean isWitness(int n, int a)
{
if ((n & 1) == 0) // n is even
return n == 2;
return expoMod(a, n - 1, n) == 1;
}
boolean isPrime(int n)
{
final int ITS = 20;
int its = 0;
while (its < ITS && isWitness(n, random.nextInt(2, n - 2))
its++;
return its == ITS;
}
As long as all operations can be performed in constant time, this
algorithm runs in O(ITS * log n). More generally, expoMod performs
O(log m) products of numbers which are of the length of n. So, the
cost is O(ITS * log m) * T_prod(log_b n), where b is the basis with
respect to which the number n is written (on a system with 32-bit
arithmetic, b = 2^16 is a handy choice). If n is a prime number, then
isPrime(n) will return true. For most non-primes it will return false.
Considering all primes and all possible choices of witnesses for them,
it turns out that there is only a very small fraction of
false-witnesses. So, for a uniformly chosen number n and uniformly
chosen numbers a, the test is correct with high probability.
Unfortunately there are composite numbers n, for which a^(n - 1) mod n = 1 for most a, 1 <= a < n. 561 is such a number: of the 559 numbers a, 2 <= a < 561, there are 318 false-witnesses. Only 241, that is less than half of them, can be used as witness of the non-primality of 561. There are large composite numbers for which the fraction of witnesses is arbitrarily small. More precisely, for any eps > 0, there are infinitely many composite numbers for which the fraction of witnesses is smaller than eps. A Monte Carlo algorithm which is correct with probability p is said to be p-correct. If an algorithm is p-correct for some p > 0, then this p can mostly be increased by repeating independent trials. However, the above fact implies that applying Fermat's test to a specified number n is not p-correct for any p > 0, and therefore repeating it a constant number of times does not help.
However, a refined test does work in the sense that the fraction of false-witnesses is bounded away from zero by a constant. For an odd number n, let k and l be the unique numbers so that 2^k * l = n - 1 where l is odd. An extension of Fermat's theorem shows that
Lemma:
For any prime number n one of the following conditions is
satisfied for all numbers a, 2 <= a < n - 1:
The suggested better primality test, known as Rabin's test, is to test whether for n the above condition is satisfied and to return true when it holds and false otherwise. That is, in the above code fragment, the subroutine isWitness() should be replaced by the following:
boolean isWitnessRef(int n, int a)
{
int e, l, x;
if ((n & 1) == 0) // n is even
return n == 2;
e = l = (n - 1) >> 1;
while ((l & 1) == 0) // l is even
l >>= 1;
x = expoMod(a, l, n);
if (x == 1)
return true;
while (l != e && x != n - 1)
{
x *= x;
x %= n;
l <<= 1;
}
return x == n - 1;
}
False-witnesses with respect to Rabin's test will be called
strong false-witnesses. These are much rarer than the earlier
false-witnesses. As an example we consider again the values n = 15 and
a = 4 or 11. For n = 15, l = 7 and k = 1. If k = 1, it is only tested
whether a^l mod n = 1 or n - 1, which is not the case: 4^7 mod 15 = 4,
11^7 mod 15 = 11.
Theorem:
To obtain an error probability of isPrime() of less than eps, ITS must be chosen so that 1 / 4^ITS < eps, that is, we must take ITS = log_4 (1 / eps) = log_2 (1 / eps) / 2. The suggested value ITS = 20, gives an erring probability of at most 4^{-20} ~= 10^{-12}. This is a very pessimistic estimate: for most values of n there are much fewer strong false-witnesses: 72% of the odd composite numbers up to 1000 does not have a single false witness.
How about the time consumption? As before, a single test takes a time which is dominated by computing O(log n) products of numbers of size at most n. So, in total the primality testing costs O(ITS * log n) * T_prod(log_b n). If we want an algorithm which is correct with high probability, ITS must be chosen omega(1). The time for computing the products depends on the applied algorithm. Applying the simplest algorithm, T_prod(m) = O(m^2), with Karatsuba's method, T_prod(m) = O(m^{log_2 3}), and using an algorithm based on FFT this can even be reduced to T_prod(m) = O(m * log m). Whatever method is applied, we conclude that:
Theorem: There is a Monte Carlo algorithm for testing primality in polylogarithmic time, which is correct with high probability.
Using Adleman's construction for any n, all prime numbers up to n can be determined with a deterministic algorithm: it suffices to specify a list of at most log n witnesses which together cover the whole set of non-primes. In principle a good choice of witnesses could be hard to find, but it turns out that for moderate n one can simply take the smallest prime numbers. The following table illustrates this:
The columns give the value of n and the corresponding largest prime number with which must be tested to assure that any number up to n is a prime number. In brackets the rank of this prime number is indicated.
n Fermat Rabin 10^5 13 (6) 5 (3) 10^6 41 (13) 5 (3) 10^7 109 (29) 5 (3) 10^8 281 (60) 7 (4) 10^9 617 (113) 7 (4)
So, for testing all numbers up to 10^9 on primality, it suffices to perform at most 113 Fermat tests or at most 4 Rabin tests. The above values have been computed with a program, which can be downloaded here (for values of n larger than 2^16, this program should be run on a computer with 64-bit arithmetic). It has been shown that actually for all odd composite numbers up to 10^10 either 2, 3, 5 or 7 is a witness. Adding 61 to the set of witnesses we come even much further: for all odd composite numbers up to 10^13, either 2, 3, 5, 7 or 61 is a witness. So, for these larger numbers we find a non-trivial testing set.
Considering that for most non-primes 2 is a witness, the average number of tests for these numbers is only a little more than 1. For prime numbers some more tests must be performed, but because there are only about n / (log n - 1) prime numbers up to n, this has no big impact on the average number of tests to perform. When picking the values a at random, for any composite number, the expected number of Rabin tests to perform is bounded by 1 + 1/4 + ... < 4/3. When trying to find a large prime number with value between n and 2 * n, such as needed in cryptographic applications, at most one of the tested numbers (namely the last) is a prime number. Because a fraction Theta(1 / log n) of the numbers in this range is prime, and because the number of tests for the last number is bounded by O(log n), the expected total number of tests to perform is only O(log n).
The following two results are known as Markov's and Chebyshev's inequality, respectively:
Theorem (Markov): For a random variable X which only assumes positive values and a positive number h, Prob[X >= h * Exp[X]] <= 1 / h.
Proof: Because X is positive for all x, Exp[X] >= x * Prob[X >= x]. For x = h * Exp[X], this shows that we must have the stated condition. End.
Theorem (Chebyshev): Prob[|X - mu_X| >= h * sqrt(Var[X])] <= 1 / h^2.
Proof:
Prob[|X - mu_X| >= h * sqrt(Var[X])] = Prob[(X - mu_X)^2 >= h^2 * Var[X]]Now apply Markov's inequality to the positive random variable Y = (X - mu_X)^2 which has expectation Var[X]. End. In the literature sqrt(Var[X]) is known as the standard deviation of X and often written sigma_X.
The algorithm is simple and elegant. There are many variants of the same idea. The set of elements is denoted S, the element x to select has rank k.
In the proof we will bound the probability that more than k / n^{1/4} + sqrt{n} of the selected numbers are smaller than x and the probability that a < S_{k - 2 * n^{3/4}}, where S_i denotes the element with rank i. The other two probabilities can be bounded in the same way.
The probability that in any of the n^{3/4} independent selection operations a number smaller than x is selected equals k / n. Let X be the random variable giving the sum of these outcomes. It has expected value l and Var[X] <= n^{3/4} / 4. The probability on a large deviation of X from its expected value can be bounded with the Chebyshev inequality: Prob[|X - l| >= sqrt(n)] = Prob[|X - l| >= 2 * n^{1/8} * sigma(X)] = O(n^{-1/4}). The other bound can be derived similarly. As a result, we can estimate that the number of performed comparisons is bounded by 2 * n + O(n^{3/4} * log n).
A large saving can be obtained using the theory above. However, we must first give an explicit example of sets of pairwise independent random variables. Let n be a prime number. Then F_n = Z/nZ is a field, the finite field with n elements in which the usual addition and multiplication operations are performed modulo n. Define random variable Y_i(a, b) = (a * i + b) mod n. Of course, for given i, a and b, this is just a single number, but as a function of the choices of a and b it is a real random variable. It is easy to show that Prob[Y_i = x] = 1 / n and that thus the Y_i are uniformly distributed: consider all possible n^2 choices of a and b. Make a matrix, with the choices of a vertically and the choices of b horizontally. Write 1 if Y_i(a, b) = x and 0 otherwise. Because F_n is a field, there is exactly 1 one in every row: in row a there is a one at position b = (x - a * i) mod n. Thus, in total there are n out of the n^2 choices of (a, b) which result in a one. Now, for i != j, consider Prob[Y_i = x | Y_j = y] = Prob[Y_i = x and Y_j = y] / Prob[Y_j = y]. The condition Y_i = x and Y_j = y gives two equations with two unknowns in a field:
a * i + b = xThe solution is a = (x - y) * 1 / (i - j) and b = x - a * i. So, for any two different values i and j, there is a unique a and b. That is, Prob[Y_i = x and Y_j = y] = 1 / n^2. As we already have seen that Prob[Y_j = y] = 1 / n, it follows that for all i != j, x and y, Prob[Y_i = x| Y_j = y] = 1 / n = Prob[Y_i = x]. Thus all these random variables are pairwise independent. Clearly any three of them are dependent: if Y_i = x and Y_j = y, then the value of any other Y_k is known.
a * j + b = y
Now, for enhancing the probability, we first pick the two numbers a and b independently at random and then take r_i = (a * i + b) mod n. We do the same as before: if A(x, r_i) = 0 for all i, then we guess that x not in L, otherwise we are sure that x in L. This procedure is called two-point sampling, because it is based on a sample of two independently selected values. We only consider the case that x in L and want to bound the probability that A(x, r_i) = 0, for all i. The A(x, r_i) are random variables depending on a and b. These are pairwise independent just as the numbers r_i. Let Z_t = sum_{i = 1}^t A(x, r_i). Exp[Z_t] = Exp[sum_{i = 1}^t A(x, r_i)] = sum_{i = 1}^t Exp[A(x, r_i)] = sum_{i = 1}^t Prob[A(x, r_i) = 1] >= t / 2. Here we used that A(x, r) = 1 for at least half of the r values and that the r_i are distributed uniformly. Because the random variables are pairwise independent, the variance of Var[Z_t] = Var[sum_{i = 1}^t A(x, r_i)] = sum_{i = 1}^t Var[A(x, r_i)] <= t / 4, because the A(x, r_i) are Bernoulli trials. So, sqrt(Var[A(x, r_i)]) <= sqrt(t) / 2. Now we apply Chebyshev's inequality:
Prob[Z_t = 0] <= Prob[|Z_t - Exp[Z_t]| >= t / 2]
<= Prob[|Z_t - Exp[Z_t]| >= sqrt(t) * sqrt(Var[A(x, r_i)])
<= 1 / t.
This procedure is sometimes called probability amplification. For any t <= n, with t experiments the probability that x in L, while A(x, r_i) = 0, for all i, is at most 1 / t. Picking independent numbers, this is achieved with log t numbers. So, we are saving a lot on the number of random bits needed, going from log t * log n to 2 * log n, For t = n, this gives a reduction from log^2 n to log n. At the same time it costs much extra time to perform t instead of log t tests. What is better in practice is hard to say: pseudo-random bits are rarely so expensive that it is worth to repeat an experiment many times just to save random bits. On the other hand, for many problems, taking the numbers r_i = a * i + b mod n will work just as well as selecting pseudo-random numbers again and again. Any way, the real importance of the above is that it shows that full independence is not always needed. Furthermore, using 4-way, 6-way, ... independence, stronger results can be obtained. A central question in the domain of randomized algorithms, addressed here, and already seen in relation with Adleman's result, is: how much independence do we need, can we derandomize the process.
At a first glance it is not clear that for any set of weightings there exists a stable matching. How to construct one? An obvious way of trying to achieve a stable matching is to create an initial matching, for example pairing man i with woman i, for all 0 <= i < n, and then to gradually improve it. Alternatively, we can start with a greedy matching, processing the men in order and matching man i with the unmarried woman that stand highest on his list. Unfortunately, this improvement phase does not need to terminate. The reason is, that what means an improvement for x and Y may mean an even bigger deterioration for y and X.
The following method works and thereby in particular shows that stable matchings always exist. First all nodes are free. Then any free man is selected. This man proposes to the most attractive women on his list who has not yet turned down his proposal before. A woman, free or matched, accepts a proposal if and only if the proposing man is more attractive to her than her current mate. If she was matched before, her current mate becomes free again. The algorithm terminates as soon as all men and women are matched (because they are both n, this happens at the same time). All women will get married this way, because each woman appears on all lists, and therefore eventually they will get a proposal and once a woman got matched she will always remain matched. Likewise, it will never happen that a man runs out of candidates: if a man is making his k-th proposal, then all women higher on his list must be matched. So, his n-th proposal (if none of the earlier ones) is accepted, because all other men are matched to the other women, and this last one therefore must still be free. The algorithm indeed leads to a stable marriage. This can be proven by contradiction. Assume that in the constructed matching we have a pair of matched edges ((x, X), (y, Y)), while x and Y prefer each other over their current partners. x has proposed to Y before it proposed to x. The fact that it is not matched with Y at the end means that either Y has turned x down at the time of the proposal or later on has rejected him in favor of another candidate. Because Y only gets better candidates over time, both imply that y must be more attractive than x, which gives a contradiction. The total number of proposals made is at most sum_{i = 1}^n i = 1/2 * n * (n + 1) and thus the running time is O(n^2).
All this is deterministic. The question that interests us is the expected running time of the algorithm in case the preferences of the men are selected randomly while the preferences of the women are arbitrary but fixed from the beginning. This analysis is not easy. There is a terrible lot of dependency and in this case not even the expected number of proposals that is made by each man is easy to analyze. The crucial idea is to apply the principle of deferred decisions. This means that we realize that it is not important that the men choose their complete lists of preferences at the beginning. The next preference on the list can also be chosen just in time, that is, when the man is going to make his next proposal.
So, women are picked randomly and uniformly from among all those that were not selected before. This is still exactly the same game. To make the game even easier to analyze, we study the approach in which the men just randomly propose to any of the n women, including those they were proposing to before. The women are more consequent and turn down any man they were turning down before. Denote the time for the original algorithm T_org and the time for the modified one T_mod. T_mod stochastically dominates T_org in the sense that Prob[T_mod > m] >= Prob[T_org > m], for all m.
Lemma: If, for random variables X and Y, X stochastically dominates Y, then Exp[X] >= Exp[Y].
Proof Of course it is not true that Prob[X = k] > Prob[Y = k] for all k, because sum_k Prob[X = k] = 1 = sum_k Prob[Y = k]. So, we can not argue this way. The trick is to reformulate the expression of the expected value. The following argument is given for positive random variables, but it can be generalized.
Exp[X] = sum_{k >= 0} k * Prob[X = k]
= sum_{k >= 0} 1 * Prob[X >= k]
>= sum_{k >= 0} 1 * Prob[Y >= k]
= sum_{k >= 0} k * Prob[Y = k] = Exp[Y].
End.
Thus, in order to put a bound on Exp[T_org] it suffices to bound
Exp[T_mod]. Notice that, because only free men are making proposals,
the game is over as soon as all women got a proposal, because then all
women and men are matched. Thus, the analysis of the game has now been
reduced to studying how many randomly uniformly distributed balls have
to be thrown before each of n holes has been hit. This problem is
called the coupon collectors problem. It is easy to see that
Theta(n * log n) balls are required: when throwing k balls, the
probability that a hole has not been hit equals (1 - 1 / n)^k ~= e^{- k
/ n}. So, for k = c * n, this probability is about 1 / e^c. For c = x
* log_e n, the probability is about 1 / n^x, and thus we can even
conclude that with high probability none of the holes will remain empty
(when computing the probability on a multiple or-clause, the individual
probabilities can be added). The other direction is not so easy to
estimate, but one will believe that if the probability that any single
hole remains empty is considerably larger than 1 / n, that then it is
quite likely that one of the n holes will remain empty. For example,
for k = n * log_e n / 4, the expected number of empty holes is about
n^{3/4}. The result is actually more precise than this. The literature
gives the following:
Theorem: For any constant c in R and m = n * log_e n + c * n, lim_{n -> infinity} Prob[T_mod > m] = 1 - e^{-e^{-c}}.
This result states that for m = n * log_e n in the limit there is a probability of exactly 1 - 1 / e that we need more than m trials, whereas for m = n * log_e n + c * n, this probability has been reduced to about 1 / e^c. Asymptotically, this is a very sharp transition, one more or less knows exactly how many trials must be made. Such a result is called a sharp threshold result: the answer is known (with high probability) to within lower-order terms. This situation is actually rather common. For example, in the theory of random graphs there are plenty of sharp threshold results, most noticeably the result on how many edges a random graph must have to be connected.Click here to download a program which simulates the coupon collector problem: for an array of length n, the value of n is to be supplied by the user, random numbers in the range 0, ..., n - 1 are generated until each number has been generated at least once. Then it is counted how many numbers were needed. In addition the maximum number of hits and the sum of the reciprocals of the number of hits are computed. Finally a statistic is printed, showing the complete frequence distribution.
The first observation is that a loss always happens by an attempt to draw a card from pile A: at any time any other pile x holds at least as many cards as the number of remaining cards with label x. Thus, we are winning only if the last card is an A. One now might analyze the problem by considering all possible ways to hand out the cards and then study for which of these we finish with an A. This is hard. One might also let the set of random choices unfold while doing. This is the same: from the point of view of the probability there is no difference between an initially chosen distribution of the cards (which is still unknown) and a distribution that is (respecting the data already revealed before) chosen during the process. So, we may assume that the next card is always chosen independently and uniformly from the set of remaining cards. But now at once we see how likely it is that this process terminates with an A: 1 / 13.
Prob[|X - mu_X| >= t * sigma_X^(k)] <= 1 / t^k.Here sigma_X^(k) denotes the k-th root of mu_X^(k). Notice that sigma_X^(2) = sigma_X.
This problem is simplified. We consider an n x n matrix A with zeroes and ones, and want to find a vector b with -1 and +1 that minimizes ||A . b||, where, for a vector c, ||c|| equals the absolute value of the largest entry in c. This problem is known as set-balancing or two-coloring a family of vectors.
To find the optimum choice for b is NP-hard. Good choices for b can be found by heuristics. But, even a trivial randomized algorithm is already quite good. The idea is incredible: just pick a random vector in which each position is -1 or +1 with probability 1/2. The algorithm completely ignores the values in A.
How good is this algorithm? Consider the product of any given row of A with b. The positions in which A has a zero certainly contribute 0. The others contribute -1 or +1 with equal probability. Let n' be the number of ones in the row of A. The zeroes certainly contribute 0, so the situation is as if we are selecting n' balls each of which has value -1 or +1 with probability 1/2. We would like to use Chernoff bounds, but here we do not have basic Bernoulli trials. Fortunately, any bivalued random variable X assuming values c_1 and c_2, c_1 < c_2, can be turned into a Bernoulli trial Y = (X - c_1) / (c_2 - c_1). Let mu_X = Exp[X] and mu_Y = Exp[Y]. mu_Y = (mu_X - c_1) / (c_2 - c_1). Prob[X - mu_x >= h * (c_2 - c_1)] = Prob[Y - mu_Y >= h]. Analogous relations hold for Prob[X - mu_x <= -h] and Prob[|X - mu_x| >= h]. Let Z_X = sum_{0 <= i < n} X_i and Z_Y = sum_{0 <= i < n} Y_i. Prob[Z_X - n * mu_X >= h * (c_2 - c_1)] = Prob[Z_Y - n * mu_Y >= h]. This latter probability can be bounded using Chernoff bounds. If there are actually n' < n experiments, then we can nevertheless use the same bounding: Prob[Z_Y - n' * mu_Y >= h] <= e^{-h^2 / (3 * n' * mu_Y)} <= e^{-h^2 / (3 * n * mu_Y)}. So, a h value which gives sufficiently small probability for n experiments can also be used for n' experiments.
In our case c_1 = -1 and c_2 = +1. mu_X = 0 and mu_Y = 0.5. So, c_2 - c_1 = 2.
Prob((A . b)_i > 2 * h) < e^{-h^2 / 3 * n * 0.5} <= e^{-2 * h^2 / 3 * n}.Taking h = sqrt(3 * n * log n) gives a probability of less than 1 / n^2. This was for any particular row. The values in the other rows are not independent. As before we handle this by reduction to a set of or clauses: Let I_i be the indicator variable giving 1 when a bad event occurs for row i. A bad event is when (A.b)_i > sqrt(12 * n * log n). Then Prob(I_i = 1) < 1 / n^2 for all i, and thus Prob(Or_i I_i = 1) <= sum_i Prob(I_i = 1) = n * Prob(I_i = 1) <= 1 / n. Deviations in the other direction can be bounded analogously. Thus, ||A . b|| = O(sqrt(n * log n)), with high probability.
Suppose now that we have a parallel computer with n processors. How fast can the rounding be performed? As before we simplify the problem. So, assume we have an array b with values -1, 0 and +1 (0 gives the values that lie close to an integer form the beginning, +1 those that are about integer + 1/2 and are rounded up, and -1 those that are rounded down). The task is to change the -1 and +1 values so that the sum over all values is -1, 0 or +1. A sequential algorithm goes through the array and performs this operation in O(n) time by keeping track of the current value of the sum, assuring that the required property holds for all prefix sums. We wonder how good one can do this in parallel.
The problem can be solved deterministically in O(log n) time with n / log n PUs by singling out all -1 and +1 indices and then flipping the required number. This is ok for the given problem, but exploits in an essential way that the original problem was reduced to a problem with only finitely many different values. We would prefer an algorithm that works more generally.
A simple parallel algorithm first sets all b_i which are not 0 to an arbitrary (not necessarily random) value. Then we do kind of a merging algorithm: inductively we assume that in round k all groups of 2^k adjacent elements have a sum of -1, 0 or +1. Then two adjacent groups are combined by looking at their sums: if both have +1 or both -1, then the choices in one group are flipped and the sum becomes 0. In this way with O(n * log n) work the rounding can be performed. If n is even, then the final sum will be 0, otherwise it will be -1 or +1. The time for this operation is only O(log n) which is ok, but the work, the sum of the number of operations performed in each round is O(n * log n), which is more than for the sequential algorithm.
Now we consider a randomized algorithm which appears to work in full generality. All those values which are not 0 are chosen randomly -1 or +1 with probability 1/2. This takes O(1) parallel time. Then we can compute the sum h_0, |h_0| = O(sqrt{n * log n}), in O(log n) time and finally, depending on the sign of h_0 we can correct h_0 / 2 choices sequentially. This takes O(sqrt{n * log n}) parallel time. The work is O(n).
Of course we are not satisfied with this. Assume that after the first round the sum is h_0, a positive value. Then we can look at the first n_1 = h_0 positions and turn all +1 values into -1. The hope is that this brings us close to 0. However the fact that the sum was +x means that there are (n + h_0) /2 with +1 and (n - h_0) / 2 with -1. Even though h_0 is mostly small, it is better to compensate for this. We should take n_1 = h_0 * n / (n + h_0). Then, the expected value h_1 of the deviation after the first correction round is 0, independently of h_0. What bound can we put on h_1? We assume that a good case has occurred. Namely, that h_0 <= sqrt{c * n * log n}, for some small constant c. Then, we are asking about the deviation from the expected number of +1 values from the expected. These can be viewed as being chosen independently and uniformly with probability p_1 = (n + h_0) / (2 * n). So, Chernoff bounds apply. This gives h_1 <= sqrt{c * h_0 * log n} <= c * n^{1/4} * log n. Going on this way, in the third round we take the following "unseen" rounded values, we get that after round k the deviation is still h_k <= c * n^{1 / 2^k} * log n. After loglog n rounds this becomes O(log n). Then a single processor can finish the job.
The good point is that the total amount of work is only O(n). This means that we can reduce the number of processors by a factor log n without substantial increase of the time. Thus we obtain an "optimal" algorithm: the processor-time product is O(n), while the time is low. Whether this is optimal or not cannot be seen here, but much simpler problems as parity of the sum of n bits have a proven lower bound of Omega(log n / loglog n).
The problem considered in this section is permutation routing: each PU has a single packet that must be routed to a destination processor; the global pattern of all sources and destinations constitutes a permutation of the n^2 PU indices. This problem has been studied intensively. The task is to minimize the number of steps, T, while also achieving a small upper-bound on the number of packets that may have to be stored in a single PU during any given step of the algorithm. The maximum number of packets that may have to be stored in a PU is called the maximum queue size, Q.
The trivial greedy algorithm, which simply routes all packets first along the x-axis, then along the y-axis. Whenever more than one packet wants to use the same connection at the same time, priority is given to the packet that still has to travel furthest. This is called the farthest-first priority rule. The greedy algorithm achieves the first goal but not the second: during the first phase packets are not delayed at all, because they move in so-called lock-step mode: all packets traveling in one direction move like the wagons in a train, keeping their distances constant. During the second phase, a packet which has to travel n - 1 - d steps is delayed at worst by the at most d packets that have to travel further, so all packets finish even the second phase in n - 1 steps. Thus, T = 2 * n - 2, which is just the diameter of the network and therefore optimal.
The problem with the greedy algorithm is that if all packets in a single row have destination in the same column, then all these packets are wiped together in the first phase, and n packets stand in a PU at the same time. This gives Q = n, which is bad. An improvement is achieved when the two routing phases are coallesced. This means that the two routing phases are not separated, but that the second phase immediately starts for packets that have finished their first phase. Unfortunately, this does not really solve the problem: if all n packets with destination in column n / 3 - 1 initially reside in position 0, ..., n / 3 - 1 of row 0 and position 0, 2 / 3 * n - 1 of row 1, then three of them stream into the PU at position n / 3 - 1 of row 1 for almost n / 3 steps, while only one packet leaves this PU in each step. So, even when coallescing we get Q = 2 / 3 * n - O(1).
A simple randomized algorithm is much better. We show that one can perform permutation routing with T = 2 * n + o(n) and Q = O(log n). This can be further improved, but one of the main ideas is present already in this basic algorithm: reducing bad routing situations by (partially) randomizing the initial packet distribution.
The idea is to perform a three-phase algorithm. For some number k, which is supposed to divide n, the network is (only conceptually) divided into n / k row bundles of k rows each. Then the following steps are performed:
In the first routing phase, each PU expects to receive 1 packet. At most a PU receives O(log n) packets with high probability. In phase 1 there are no delays. In phase 2 there may be delays, however these are small for the packets moving far: a rightwards moving packet can only be delayed by the packets moving further. For a more precise analysis we cite a lemma. First, for a routing problem on a one-dimensional processor array with n PUs we define r(i, j) to be the number of packets originally standing in the PUs 0, ..., i - 1 having destination in the PUs j, ..., n - 1. Then
Lemma: On a one-dimensional array the rightward routing is finished after exactly max_{i <= j} r(i, j) + j - i.
The values of r(i, j) can easily be bounded: packets may have almost arbitrary columns as destinations, but we know that the expected number of packets in any i PUs equals i, and is bounded by i + O(log n + sqrt{i * log n}). So, it immediately follows that r(i, j) + j - i <= n + O(sqrt{n * log n}). Thus, the horizontal routing phase takes at most this much time. The vertical routing phase takes at most n - 1 steps just as before.How about Q? Clearly, with this very simple approach which can easily be improved, we cannot bound Q to less than O(log n). If we are unlucky, and all packets in a row bundle of k rows have to move to a column bundle of k columns, then at the end of phase 2 the expected queue sizes are n / k. Thus, with high probability, Q < n / k + O(log n). A good choice for k is k = n / log n, achieving Q = O(log n).
Theorem The randomized three-phase algorithm routes permutation on a two-dimensional n x n mesh in 2 * n + O(sqrt(n * log n)) steps, with maximum queue size O(log n), with high probability.
We consider the permutation routing problem on hypercubes. A very natural algorithm, which does not impose any extra conditions on the routing hardware, is to perform the routing in n phases: in phase i, 0 <= i < n, bit i is corrected. These bit-correcting algorithms are simple, but bad. For the "transpose" (the permutation under which a packet from (a, b) has destination in (b, a), where a and b are bitvectors of length n / 2 each) permutation, all packets which originally reside in (0, a) are routed through the PU with index (0, 0). These are sqrt(N) packets. As there are at most n connections, we immediately get a lower bound of sqrt{N} / n. Looking more carefully the bound can be sharpened further. For example, one can consider the number of packets moving over the connection between (0, ..., 0, 1, 0, ..., 0) and (0, ..., 0, 0, 0, ..., 0). All sqrt{N} / 2 packets starting in (a, 1, 0), where a is a bit vector of length n / 2 - 1 and 0 is a vector of n / 2 zeroes are moving over this connection. Thus, the trivial algorithm takes ate least sqrt{N} / 2 steps.
A very simple and versatile idea remedies the problem. The algorithm goes as follows:
Lemma: Viewing routes in phase 1 as directed paths from the source to the intermediate destination, it is true that once two routes separate they do not rejoin.
Proof: Paths go out of each other if they differ in a bit. But then one of them will have this bit 0 and the other will have a 1 during the rest of the routing. End. Of course this does not mean that packets can meet only once: they may follow the same track for a while and a packet may be delayed several times by the same packet (if this delaying packet is delayed by a third packet). The following lemma gives a (rather pessimistic) upper bound on the delay a packet may incur.Lemma: Let rho be the path of a packet v. Let S be the set of all packets whose routes pass through at least one of the edges on rho. Then the delay incurred by v is at most |S|.
Proof: Let rho = e_1, e_2, ..., e_k. A packet leaves rho in that time step in which it traverses an edge of rho for the last time. If a packet traverses edge e_j of rho at time t, then its lag equals t - j. The lag is defined only for packets moving along rho. The delay of v is the lag when it traverses its last edge e_k. The increases of the lag of v will be attributed to different elements from S. If the lag of v increases from l to l + 1 then there must be at least one packet in S that wants to traverse the same edge at the same time. Thus, S contains at least one packet whose lag reaches l. We prove that there is even a packet in S which has lag l when leaving rho. The increase of the lag of v will be attributed to this packet. Let t' be the last time step in which any packet in S has lag l. Thus, some packet w is moving over edge e_j' for j' = t' - l. This packet leaves rho in this step, because otherwise w or another packet in S would move over edge e_{j' + 1} in step t' + 1, having lag l in step t' + 1. End.
So far everything was deterministic. Let rho_i, rho_j be the paths of packets v_i and v_j, and let H_{ij} be the indicator random variable for the event that rho_i and rho_j share an edge. The lemma gives that the delay of v_i is bounded by sum_j H_{ij}. All routes are chosen independently, thus the H_{ij} are Bernoulli trials for i != j. Thus, the deviation from the expected value can be bounded using Chernoff bounds. The expected value will be bounded first.sum_{0 <= j < N} H_{ij} <= sum_{1 <= l <= k} T(e_l),where T(e) is the number of routes passing through an edge e, and the e_l, 1 <= l <= k, are the edges of rho_i. The right-hand side may be larger because the same route may be counted several times. Applying expected values to both sides respects the inequality. Because the hypercube is isotropic, the expected number of paths through any edge is the same. Because every PU has n outgoing edges and because each packet traverses on average n / 2 edges, Exp[T(e)] = 1 / 2 for all e. Thus,
Exp[sum_{j = 0}^{N - 1} H_{ij}] <= Exp[sum_{1 <= l <= k} T(e_l)] = sum_{1 <= l <= k} Exp[T(e_l)] = k / 2 <= n / 2.Even though we do not know the value of Exp[sum_{j = 0}^{N - 1} H_{ij}] but only an upper bound, we can nevertheless estimate the deviation: if an expected value of a sum of Bernoulli trials has value X <= Y, then we know that with high probability the true value will be not larger than X + Chernoff(X) <= Y + Chernoff(Y), even if Y is not a random variable for which the Chernoff conditions are satisfied. Here this gives a bound of O(n).
The disadvantage of this algorithm is that we have doubled the length of the paths: it is easy to estimate that with high probability there are packets that have to move almost distance n: the expected number of packets for which there are n - k bits different equals (n over k). For small k, (n over k) ~= (e * n / k)^k. Thus we will almost certainly find a packet traveling distance n - O(1) in each of the phases. One might think this does not matter as we have O(n) anyway. However, it turns out that in practice delays are mostly small (it does not harm to apply farthest-first here as well!), and thus the time is mainly given by the longest path.
If we are coallescing the routing phases (starting phase 2 of the routing for packets that have finished phase 1 without waiting for all packets to finish phase 1, giving priority to packets still performing phase 1) then the longest total path length is relevant. Even though for most packets the total path length lies around n, mostly there are even some packets with path length 2 * n - o(n). So, even coallescing does not solve the problem.
A small twist overcomes the problem. For every packet p moving from s(p) to d(p) we choose an intermediate destination i(p) as before and define i'(p) = bit_complement(i(p)). Let l_1(p) and l_2(p) bet the lengths of the path from s(p) to i(p) and the path from i(p) to d(p), respectively. l(p) = l_1(p) + l_2(p). l'_1(p) and l'_2(p) give the corresponding lengths of the paths through i'(p). l'(p) = l'_1(p) + l'_2(p) = (n - l_1(p)) + (n - l_2(p)) = 2 * n - l(p). So, min{l(p), l'(p)} <= n. This suggests the following approach:
Select i(p). Compute l(p). If l(p) <= n, use i(p) as intermediate destination, else compute i'(p) and use it as intermediate destination.
This idea bounds the total path length of any packet to at most n. Therefore we may hope that an algorithm with coallesced routing phases runs in n + o(n) routing steps. Vöcking has shown that this indeed happens with high probability. The analysis is quite hard, because of the limited randomization and because of the coallescing which makes that two streams of packets are blending.
Lemma: max{n / 2, k * n / 4} gives a lower bound on the for k-k routing on a circular array of length n.
The question is whether this bound can be matched. For k = 1, this is easy: all packets can walk without any delay along the shortest paths to their destinations. For k = 2, it can easily be shown that the routing time is lower bounded by 2 / 3 * n. It is not clear whether this bound can be achieved. For k = 3, the situation is similar. In the following we therefore concentrate on the case k >= 4.The easiest idea is to route all packets along the shortest paths. This does not work well: if all packets in PU_i, the PU with index i, have destination in PU_{(i + n / 2 - 1) mod n}, then all packets will move rightwards. Thus, k * (n / 2 - 1) packets will have to move over any rightwards directed connection, and thus the routing takes at least k * n / 2 - k steps, almost twice as much as given by the lower bound.
Using randomization, it is easy to match the lower bound up to lower-order terms. The idea is the same as before:
Let r(i, j) be defined as before. Because the situation is the same everywhere, we can fix j = n / 2, and consider all i, 1 <= i <= n / 2. For a packet p in PU_l, 0 <= l < n / 2, there are l out of n destinations (namely the PUs with indices n / 2, n / 2 + 1, n / 2 + l - 1) for which the shortest path from PU_l leads over the connection e from PU_{n / 2 - 1} to PU_{n / 2}. Thus, the probability that p crosses e is l / n. Let S be the random variable giving the number of packets crossing e. Exp[S] = k * sum_{0 <= l < n / 2} l / n = k / n * n / 2 * (n / 2 - 1) / 2 ~= k * n / 8.
Even in this case Chernoff bounds can be used, to obtain a high probability bounding that holds for all connections at the same time. The Chernoff bounds were formulated for sums of equally distributed Bernoulli trials, but this is not necessary. For a sum S = sum_{0 <= i < n} X_i, where the X_i are Bernoulli trials with Prob[X_i = 1] = p_i, in the formulation of the Chernoff bounds p * n can be replaced by sum_{0 <= i < n} p_i. In our case this means that the deviation from the expected value can be bounded to O(sqrt{k * n * log n}).
The above analysis holds for both routing phases, because these phases are essentially the same, because even the second routing phase is a randomization: if the original permutation is given by Pi and the chosen randomization by Rho, then the first phase routes Rho, and the second phase Pi * Rho^{-1}. Even under Pi * Rho^{-1}, for any packet in PU_i the probability to move to PU_j equals 1 / n, for all 0 <= i, j < n.
Theorem The randomized two-phase algorithm performs k-k routing on a circular array of length n in k * n / 4 + O(sqrt{k * n * log n}) steps with maximum queue size k + O(log n + sqrt{k * log n}), with high probability.
Of course for many source-destination distributions this routing strategy means a doubling of the routing time, but for some bad instances it gives a reduction. In other words: the given algorithm is optimal for the problem as a whole, but not for every single instance of the problem.
The randomized method consists of two parts:
Clearly, once we know that an object / solution exists we would also like to construct / find it. However, for some problems the proof of existence does not give a key for the effective computation. For example, applying Fermat's or Rabin's test, we can prove that a number is composite and thus has non-trivial factors, but factorization is an unsolved problem.
An approximation algorithm for an optimization problem is a polynomial-time algorithm for a hard, mostly NP-hard, problem, which finds a feasible solution for which it can be proven that the value of the objective function lies within a factor from the optimum value for all inputs. In this latter sense approximation algorithms are more than just heuristics, which may come without any guarantee. For an approximation algorithm A, let m_A(I) denote the value achieved on an instance I, while f_*(I) denotes the optimum value. The performance ratio of A is defined to be the minimum over all I of f_A(I) / f_*(I). For a randomized algorithm A, f_A(I) is replaced by the expected achieved value, Exp[f_A(I)]. An algorithm with performance ratio c is also said to be a c-approximation.
There is a general method for obtaining approximation algorithms for an optimization problem P. It works as follows:
We consider a simple example, which can be interpreted geometrically. Consider the following system of linear inequalities:
x >= 0, y >= 0, x + 4 * y <= 16, 6 * x + 4 * y <= 30, 2 * x - 5 * y <= 6.The objective function is f(x, y) = 6 * x + 5 * y. The goal is to find the feasible solution (x'_0, y'_0) for which f is maximal.
The five inequalities define a closed two-dimensional subset I of R^2. Because it is the intersection of convex subsets, it is convex itself. Moving along any line segment within I, the value is either constant, or increases in one direction and decreases in the other. This implies that the maximum of f over I is not only assumed in the interior of I. Arguing on, it follows that there must even be a vertex of the polytope on which f assumes its maximum value. For the given example, this reduces the problem to checking the value of f for the five vertices which is not much work: (x'_0, y'_0) = (2.8, 3.3), with f(x'_0, y'_0) = 33.3.
If in addition there are conditions that x and y should be integral, then any of the high-lighted points is a candidate for giving the maximum value of f. These points have the special property that starting from the optimum (2.8, 3.3) of the LP they are non-dominated in the sense that there are no points in I, which lie closer to (2.8, 3.3) with both x- and y-value. Actually, we do not even have to consider the point (3, 2), because f can only assume its maximum value in this point if it here has the same value as in (2, 3) and (4, 1). These non-dominated points in I are also called Pareto optima. The notion of Pareto optimality is named after the Italian economist Vilfredo Pareto (1848-1923). It is widely used in game theory and economics. In the latter context, an allocation of resources is Pareto optimal if there is no way that some individual could be made better off without making any other individual worse off. Checking the Pareto optima, we find that f is maximal for (x_0, y_0) = (4, 1). f(4, 1) = 29, 4.3 less that the value achieved by the LP. Notice that (4, 1) does not lie close to (2.8, 3.3) in a geometric sense. The most natural rounding idea in this case would have been to take the closest integral solution in I, which is (2, 3), giving f(2, 3) = 27, 2 less than the optimum.
In its extreme simplicity, the given example expresses many important ideas: for any LP, the maximum, when finite, is (also) assumed on a vertex of the polytope bordering the set of feasible solutions. The algorithms for efficiently solving LP consist of methods to rapidly converge to this vertex by performing a coordinated walk from vertex to vertex. The classical simplex method rapidly finds a solution for most problem instances, but one may construct pathological inputs for which the it visits an exponential number of vertices before reaching the optimum. The more recent ellipsoid method has polynomial running time on all inputs, but is more elaborate and not necessarily better in practice.
The solution of the LP limits the search for the optimum of the ILP to the Pareto optima. For Pareto optima which lie on a line segment only the endpoints of the line segment need to be tested. For some problems this approach allows to find optimum solutions in polynomial time:
However, in general there remain exponentially many points to check. Therefore, the search is limited to a polynomial subset, for example by simply picking the feasible solution which lies closest to the optimum solution of the LP in some sense. If x = (x_0, ..., x_{n - 1}) is the optimum of the LP and x' = (x'_0, ..., x'_{n - 1}) is any feasible solution with |x_i - x'_i| < 1 for all i, 0 <= i < n, then |f(x) - f(x')| < sum_{0 <= i < n} |a_i|, where a_i is the coefficient of the i-th coordinate in the objective function. So, the absolute difference from f(x) can be bounded easily, but the relative difference may be arbitrarily large.
C_0 = x_1, C_1 = x_1 x_3, C_2 = !x_3, C_3 = x_2, C_4 = !x_1 !x_2 x_3 C_5 = !x_2 !x_3, C_6 = !x_1 x_3, C_7 = x_2 !x_3.Here m = 8 and n = 3. The conjunctive operator "or" is understood between any pair of juxtaposed variables. The symbol "!" indicates that the literal following it is negated. The question is how many of the m clauses can be satisfied. That is, we want to compute the maximum over all 2^n possible truth assignments to the literals of the number of satisfied clauses. This problem is called MAXSAT. For the given example it can be verified (testing all 8 assignments or using a case distinction) that at most 6 clauses can be satisfied and that this is achievable.
SAT is the related decision problem dealing with the question whether all clauses can be satisfied or not. k-SAT is the variant of SAT in which any clause consists of at most k literals. 1-SAT is trivial, 2-SAT can be solved in polynomial time and 3-SAT is one of the most famous NP-complete problems. Clearly MAXSAT is more general than SAT because the answer to SAT is yes iff the answer to MAXSAT is m. Therefore MAXSAT is NP-hard if the length of the clauses is not bounded to 1 or 2.
In this section we consider approximation algorithms for MAXSAT. We first present a 1/2- and a (1 - 1/e)-approximation, which then will be combined to a 3/4-approximation. Notice that a c-approximation for MAXSAT does not need to satisfy at least c * m clauses for any system of m clauses: the performance ratio is given with respect to the best achievable result, which can be less than m.
The simplest randomized algorithm one can imagine is to set any literal true or false independently with probability 1/2. This algorithm, which will be referred to as A_1, is less bad than it appears:
Lemma: Applying A_1, the expected number of satisfied clauses equals sum_{0 <= i < m} (1 - 1 / 2^{l_i}) >= m / 2, where l_i denotes the number of literals in clause i, 0 <= i < m.
Proof: For a clause of length l, the probability that it is satisfied is 1 - 1 / 2^l. Let Z_i, 0 <= i < m, be the indicator random variable with value 1 when clause i is satisfied and 0 otherwise. Let S = sum_{0 <= i < m} Z_i. Exp[S] = Exp[sum_{0 <= i < m} Z_i] sum_{0 <= i < m} Exp[Z_i] = sum_{0 <= i < m} Prob[Z_i = 1] = sum_{0 <= i < m} (1 - 1 / 2^{l_i}). The fact that the Z_i are dependent is of no importance because of the linearity of expectation, which even holds for dependent random variables. Because we may assume that there are no empty clauses, l_i >= 1, for all i, and thus sum_{0 <= i < m} (1 - 1 / 2^{l_i}) >= sum_{0 <= i < m} 1 / 2 = m / 2. End.
Applying the randomized method the above results can be used in an existential way:Corollary: For any set of m clauses there is a truth assignment satisfying at least round_up(sum_{0 <= i < m} (1 - 1 / 2^{l_i})) >= round_up(m / 2) clauses.
Proof: The only point which remains to be verified is that the expected value may be rounded up. This is so, because if the expected value Exp[X] of a random variable X which only assumes integral values is not integral itself, X must assume at least one value which is larger than Exp[X]. The smallest such value is round_up(Exp[X]). End.
This easily obtained result leads to far-reaching conclusions. For the above example with (l_0, ..., l_7) = (1, 2, 1, 1, 2, 2, 2, 2), it follows that there must be an assignment satisfying at least round_up(3 * 1/2 + 4 * 3/4 + 1 * 7/8) = round_up(5 3/8) = 6 clauses. This we already knew, but here the result is obtained by considering only the lengths of the clauses. More generally, for any system of 7 clauses with each 3 literals (without repetitions of the same literal within a single clause), the corollary gives that at least round_up(7 * 7/8) = round_up(6 1/8) = 7 literals can be satisfied. That is, any system with at most 7 clauses with 3 literals each is satisfiable.Corollary: A_1 gives a 1/2-approximation.
Proof: Because the optimum value f_* of MAXSAT satisfies f_*(I) <= m / 2 for any input I with m clauses, it follows that f_{A_1}(I) / f_*(I) >= m / 2 / m = 1 / 2. End.
A_1 is good for long clauses and even for clauses of length 2 it works satisfactory. Only if there are many singletons, the performance is less good than we would hope to achieve. Therefore, we now present an alternative approximation algorithm A_2 which performs best on short clauses. We will show that the approximation algorithm AA which consists of trying both A_1 and A_2 and taking the best result gives a 3/4-approximation.
MAXSAT can be written as an ILP using variables z_j, 0 <= j < m, and y_i, 0 <= i < n. z_j indicates the degree of "satisfiedness" of clause C_j, while y_i indicates the degree of "trueness" of the literal x_i. Denoting for each clause C_j, by C_j^+ the subset of literals appearing in non-negated form, while C_j^- gives the negated literals, we get
maximize s = sum_j z_j under the following conditionsThe maximum value of s is denoted s_max. The corresponding LP, formulated in terms of variables y'_i and z'_j, obtained by dropping the integrality conditions, is given by
z_j <= sum_{i in C_j^+} y_i + sum_{i in C_j^-} (1 - y_i), for all j, 0 <= j < m,
y_i in {0, 1}, for all i, 0 <= i < n,
z_j in {0, 1}, for all j, 0 <= j < m.
maximize s' = sum_j z'_j under the following conditionsThe maximum value of s' is denoted s'_max. Because s'_max is the maximum of the same function computed with less restrictions, s'_max >= s_max. One may hope that s_max lies close to s'_max, but this does not need to be so. Having the clauses C_1 = (x_1 or x_2), C_2 = (x_1 or !x_2), C_3 = (!x_1 or x_2), C_4 = (!x_1 or !x_2), s_max = 3 while s'_max = 4. This immediately shows that for MAXSAT the integrality gap, the maximum ratio of the value of an optimum solution of an LP and that of an ILP, is at least 4 / 3.
z'_j <= sum_{i in C_j^+} y'_i + sum_{i in C_j^-} (1 - y'_i), for all j, 0 <= j < m,
0 <= y'_i <= 1, for all i, 0 <= i < n,
0 <= z'_j <= 1, for all j, 0 <= j < m.
The most common way to perform randomized rounding, is to replace a variable z = x + y, with x an integer and 0 <= y < 1, by x + 1 with probability y and by z with probability 1 - y, independently of any other rounding. Define b_k = 1 - (1 - 1 / k)^k. b_1 = 1, b_2 = 3/4 and, for all k, b_k >= 1 - 1 / e.
Lemma: Let C_j be a clause with k literals. The probability that it is satisfied is at least b_k * z'_j.
Proof: Assume wlog that the clause is x_1 or x_2 or ... or x_k. We know that y'_1 + y'_2 + ... + y'_k is x_1 or x_2 or ... or x_k. We know that y'_1 + y'_2 + ... + y'_k >= z'_j. All literals are rounded independently, so the probability that none of the literals is true in the rounded clause is Prod_{i = 1}^k (1 - y'_i). We show that 1 - Prod_{i = 1}^k (1 - y'_i) >= b_k * z'_j. Using Lagrange and concave function properties + the inequality this follows. Prod_{i = 1}^k (1 - y'_i) is maximized when all factors are equal, namely y'_i = z'_j / k. Thus, Prod_{i = 1}^k (1 - y'_i) <= (1 - z'_j / k)^k. We want to show that for all x, 0 <= x <= 1, the function f(x) = 1 - (1 - x / k)^k >= b_k * x. Because f(x) is concave, it suffices to test this for the endpoints of the interval. f(0) = 0 >= b_k * 0. f(1) = b_k >= b_k * 1. End.
This way of computing an approximation for MAXSAT will be referred to as A_2.Corollary: A_2 gives a (1 - 1/e)-approximation.
Proof: Exp[sum_{0 <= j < m} z_j] = sum_{0 <= j < m} Exp[z_j] >= sum_{0 <= j < m} b_{k_j} * z'_j >= (1 - 1 / e) * sum_{0 <= j < m} z'_j >= (1 - 1 / e) * s_max. End.
The suggested algorithm AA is to take the maximum of the results given by A_1 and A_2.
Theorem: AA gives a 3/4-approximation.
Proof: For some set of m clauses, let n_1 and n_2 denote the expected number of clauses satisfied by A_1 and A_2, respectively. Let S_k be the set of clauses with k literals. n_1 + n_2 >= sum_k sum_{C_j in S_k} (1 - 1 / 2^k) + sum_k sum_{C_j in S_k} b_k * z'_j >= sum_k sum_{C_j in S_k} (1 - 1 / 2^k + b_k) * z'_j >= 3/2 * sum_j z'_j. The last estimate is correct because for all k, 1 - 1 / 2^k + b_k = 2 - 1 / 2^k - (1 - 1 / k)^k >= 3/2. For k = 1, 2 and 3, this can be tested explicitly, for larger k, we estimate 1 / 2^k + (1 - 1 / k)^k <= 1 / 16 + 1 / e <= 0.44. End.
Maybe one might believe that if the events are pairwise independent, that then the graph has no edges. However, this is not correct. If we have random variables X and Y chosen uniformly and independently from the set {0, 1}, and a random variable Z = (X + Y) mod 2, then all three are pairwise independent, but clearly any single one is specified by the other two. Let E_X be the event that E_X = 0. Define E_Y and E_Z analogously. Prob[E_X| E_Y] = 1/2 = Prob[E_X], but Prob[E_X| E_Y and E_Z] = 1 != 1/2 = Prob[E_X]. Thus, there must be an edge from E_X to at least one of the two other events. The same holds true for E_Y and E_Z. As a result, the dependency graph for these three events has at least two different edges. We see that the dependency graph does not even has to be unique!
The famous Lovasz local lemma tells us that, provided a condition holds, Prob[and_i (! E_i)] > 0, even when these events are weakly dependent on each other. Of course we know that generally
Prob[and_i (! E_i)] = Prob[! (or_i E_i)] = 1 - Prob[or_i E_i] >= 1 - sum_i Prob[E_i].This we have used many times before, but this works only if we can put a sufficiently strong bound on Prob[E_i]. Otherwise this whole condition becomes void. The other extreme case is that all the E_i (and thus the ! E_i) are independent. Then we have
Prob[and_i (! E_i)] = Prod_i Prob[! E_i] = Prod_i (1 - Prob[E_i]).As long as Prob[E_i] < 1, for all i, this gives Prob[and_i (! E_i)] > 0, even though it might be very small. In the context of the randomized method it would nevertheless allow to prove the existence of some object.
Lovasz Local Lemma:
Let G = (V, E) be the dependency graph of a set of events E_i,
0 <= i < n. Let x_i, 0 <= x_i <= 1, for all 0 <= i < n, be so that
Prob[E_i] <= x_i * Prod_{j| (i, j) in E} (1 - x_j),
then
Prob[and_{i = 0}^{n - 1} (! E_i)] >= Prod_{i = 0}^{n - 1} (1 - x_i).
The proof is relatively short but omited here. The conclusion is that if we can define x_i satisfying the condition, that then we can forget about the dependency when working with the x_i instead of the probabilities.
As an example we consider a situation in which we have eight clauses: x_0 or x_1 or x_2, x_2 or x_3 or x_4, x_4 or x_5 or x_6, ..., x_14 or x_15 or x_0. Event E_i corresponds to clause_i has value false. Prob[E_i] = 1/8, for all i. Clearly if the clauses would be independent, Prob[and_i (! E_i)] = (1 - 1/8)^8 = (7/8)^8 ~= 0.344. If we would use the correct formula using 1 - or_i E_i, we would find Prob[and_i (! E_i)] >= 0, which is not very interesting. The Lovasz local lemma helps in this case: E_i is dependent only on E_{i - 1} and E_{i + 1} (indices computed cyclically). Taking x_i = 1/5 for all i, we get Prob[E_i] = 1/8 <= 1/5 * (1 - 1/5)^2, and thus we conclude that Prob[and_i (! E_i)] >= (1 - 1/5)^8 ~= 0.168. Trying all possible assignments using the program which can be downloaded here shows that 25889 of the 65536 assignments are satisfying, a fraction of 0.395.
The above example, which by its simplicity can certainly be solved analytically as well, is a special case of the situation described by the following corollary. It is much less general than the lemma, but more easy to use:
Corollary: Let E_i, 0 <= i < n, be events with Prob[E_1] <= p, for all i. If each event is mutually independent of all but d of the other events and e * p * (d + 1) <= 1, then Prob[and_{i = 0}^{n - 1} (! E_i)] > (1 - 1 / (d + 1))^n > 0.
Proof: Take x_i = x = 1 / (d + 1), for all i, 0 <= i < n. This gives Prob[E_i] <= p <= 1 / ((d + 1) * e) <= 1 / (d + 1) * (1 - 1 / (d + 1))^d = x * (1 - x)^d <= x * Prod_{j| (i, j) in E} (1 - x) = x_i * Prod_{j| (i, j) in E} (1 - x_i). Here we used that (1 - 1 / (d + 1))^d > 1 / e, for all d > 0. End.
This lemma can be used to derive results on restricted versions of SAT. We consider the version of k-SAT in which all clauses contain exactly k literals without repetitions. A further restriction is that we assume that each of the n literals occurs (positively or negatively) in at most 2^{(1 - eps) * k} of the in total m clauses. The argument is simple, once one knows the above corollary and thinks of the randomized method:Theorem: Any instance of k-SAT with n literals and m clauses for which each literal occurs in at 2^{(1 - eps) * k}, eps > 0, clauses is satisfiable for all k >= k_min, where k_min only depends on eps.
Proof: First we consider the special case eps = 1/2. All variables are set to true or false with probability 1 / 2, independently of each other. E_i, 0 <= i < m, denotes the event that clause i does not get satisfied by this assignment. Clearly Prob[E_i] = 2^{-k}, for all i. The event E_i is independent of E_j if clause i and clause j do not share at least one variable. However, there are at most k * 2^{k / 2} clauses having a literal in common with clause i (namely, at most 2^{k / 2} for each of the k literals). Because e * (k * 2^{k / 2} + 1) * 2^{-k} <= 1 for all k >= 10, we can conclude that for these k, Prob[and_{i = 0}^{m - 1} (! E_i)] > (1 - 1 / (k * 2^{k / 2} + 1))^m > 0 for all m. This means that just by throwing randomly, we have a strictly positive probability of finding a truth assignment satisfying all clauses, which in particular implies that such a truth assignment exists. Assuming that each literal occurs in at most 2^{(1 - eps) * k} clauses, for some eps > 0, the condition becomes (k * 2^{(1 - eps) * k} + 1) * 2^{-k} <= 1. End.
Looking into the details, we find k_min(0.5) = 10. For eps = 0.5 and k = 10, any literal can occur at most in 3 clauses. k_min(0.1) = 59. For eps = 0.1 and k = 59, any literal can occur in at most 39 clauses. Notice that the condition on the number of occurrencies imposes an upper bound on m: because we must have n * 2^{k / 2} >= k * m we get m <= 2^{k / 2} / k * n. But, this is not that serious.The main weakness of the above is that the probability that a random assignment to the literals satisfies all clauses may be very small. Thus, simply repeating random assignments until a satisfying assignment is hit, may take too long. To obtain an algorithm that finds a solution in expected polynomial time we sharpen the conditions. Assume any literal occurs in at most 2^{k / 50} clauses. k is considered to be a constant. If k is constant, even 2^k is constant. This is important because the expected running time of the presented solution is polynomial in m but exponential in k. The algorithm consists of two phases. In the first phase part of the literals are fixed uniformly and independently to true or false. The remaining literals are left unspecified until the second phase where they are fixed in a more elaborate way.
The first phase goes as follows:
for (int i = 0; i < n; i++)
dangerous[i] = false;
for (int i = 0; i < n; i++)
if (!dangerous[i])
{
x[i] = random.nextBoolean();
for (int j = i + 1; j < n; j++)
testDangerous(j, x, dangerous, clauses);
}
Here x[i] denotes the value of literal i. A literal becomes
dangerous, meaning that it should not be fixed in a careless
way, when it occurs in a dangerous clause. A clause with k literals
is dangerous, if k / 2 of its literals have been fixed and it has not
been satisfied by these. These dangerous literals are skipped during
phase 1.
A clause which is not satisfied during phase 1 is said to have survived. A clause may survive because it became dangerous itself, or because its remaining unspecified literals became dangerous because other clauses in which these occur became dangerous. Any surviving clause still has at least k / 2 unspecified literals. It is important to notice that the partial solution after phase 1 is extendible in the sense that there is an assignment to the dangerous literals so that all surviving clauses are satisfied (proving this is left as an exercise). Thus, there is no need for backtracking. A clause C becomes dangerous when k / 2 of its literals have been fixed in an unlucky way. The probability that this happens is at most 2^{- k / 2}. Thus, the probability that a clause C survives is at most (d + 1) * 2^{- k / 2}, because either C or at least one of the at most d = k * 2^{k / 50} other clauses which has a literal in common with C must have become dangerous.
For the considered problem, the dependency graph G has one vertex for each event E_j, 0 <= j < m, the event that clause C_j is not satisfied. E_j depends on E_{j'} for each j' so that C_j and C_{j'} share a literal. Because there are at most d such clauses, the dependency graph has degree at most d. In the following we will identify clauses and vertices. The possibility to fix the dangerous literals in polynomial time depends on the structure of the subgraph G' of G induced by the the dangerous clauses. Further down we prove that with high probability all connected components of G' are small. Any dangerous literal occurs only in clauses of a single connected component of G'. Thus, the remaining problem can be solved by separately treating the subsets of literals belonging to the clauses occurring in the different connected components of G'. Consider a connected component of size s. These clauses contain k * s literals. These are not all different, and part of them have been fixed in phase 1, so the number of literals to fix for this component is less than k * s. It follows that there are less than 2^{k * s} = 2^k * 2^s assignments to test for this component. Each assignment can be tested in O(k * s) time. That is, an assignment satisfying all clauses in the component can be found in O(k * 2^k * s * 2^s) time. Let p be the number of components, and let s_l, 0 <= l < p, denote the size of component l, the the total time is given by
sum_{0 <= l < p} c * k * 2^k * s_l * 2^{s_l} <=Here c is a constant and s_max = max{s_l| 0 <= l < p}. The inequality shows that the running time is polynomial in m if s_max = O(log m).
sum_{0 <= l < p} c * k * 2^k * s_l * 2^{s_max} <=
c * k * 2^k * m * 2^{s_max}.
Lemma: With probability 1 - o(1), the size of the largest connected components of G' is bounded by O(log m).
Proof: The probability that any clause survives is at most (d + 1) * 2^{- k / 2}. If we consider clauses C_1 and C_2 which are at distance 4 or more in G, then the event E_1 that C_1 survives is independent of the event E_2 that C_2 survives, because in G no neighbor of C_1 is adjacent to any neighbor of C_2. Thus, for any subset {C_0, ..., C_{r - 1}} of clauses of size r for which the distance of any pair of clauses is at least 4, Prob[and_{0 <= i < r} E_i] = prod_{0 <= i < r} E_i <= ((d + 1) * 2^{- k / 2})^r.
An s-tree of a graph G is a subset T of the nodes satisfying the following two properties:
while (there are still nodes left)
Pick an arbitrary node u which has distance s to any of the
earlier selected nodes (the first node can be picked freely).
Remove all nodes with distance less than s from u.
Because for each selected node less than 1 + d + d^2 + ... +
d^{s - 1} = (d^s - 1) / (d - 1) < d^s are removed, this procedure
will select at least n / d^s nodes in a graph with n nodes. In the
following we will consider 4-trees of the dependency matrix G. We
also consider the graph G_4 derived from G. This graph has the same
set of m nodes as G. There is an edge (u, v) in G_4 if and only if u
and v have distance equal to 4 in G.
The definitions imply that the number of 4-trees of size r is bounded by the number of connected subgraphs of G_4 of size r. Because the degree of G_4 is bounded by d^4, this number of subgraphs is bounded by m * d^{8 * r}. This result follows from the general fact that a graph with n nodes and degree d has at most n * d^{2 * r} connected subgraphs of size r. Thus, the probability that there is a 4-tree in G of size r for which all clauses survive is bounded by m * d^{8 * r} * ((d + 1) * 2^{- k / 2})^r = m * (d^8 * (d + 1) * 2^{- k / 2})^r. For r = x * log m, for some suitable constant x, this is o(1). Because, due to the argument given above, any connected component of G of size d^4 * r contains a 4-tree of size r, we conclude that the probability that all clauses of a connected subset of size more than x * d^4 * log m survive is o(n). Because d is considered to be constant, this gives the claimed result. End.
The complete algorithm now proceeds as follows:
repeat
{
perform phase 1;
construct G';
compute the connected components of G';
let s_max be the size of the largest component;
}
until (s_max <= x * log m);
let p be the number of connected components;
for (int l = 0; l < p; l++)
find a satisfying assignment for the clauses in component l;
Theorem: The algorithm finds a satisfying assignments to the literals of any instance of k-SAT in which each literal occurs in at most 2^{k / 50} clauses. The expected running time is polynomial in m.
Proof: The expected number of iterations through the loop is less than 2 because the probability to succeed in each pass is larger than 1/2. Therefore, the total time spent in the loop is polynomial in m. Above it was shown that the second phase can be performed in O(k * 2^k * m * 2^{s_max}) time. If the condition is satisfied, we have O(k * 2^k * m * 2^{s_max}) = O(k * 2^k * m^{x + 1}) = O(m^{x + 1}), because k is considered to be a constant. End.