Trees

Euler-Tour Technique

In this section we present a simple technique with a whole bunch of basic but crucial applications solving problems on trees such as rooting the tree and computing the depth of all nodes.

Construction

The idea of the Euler-tour technique is to reduce a problem on trees to a problem on lists. This is done by first replacing each undirected tree edge (u, v) by two directed edges (u, v) and (v, u). Each node u numbers its neighbors in arbitrary order. So, these can be denoted v_i, 0 <= i < deg_u, where deg_u denotes the degree of node u. This ordering gives rise to a successor relation between the directed edges: the successor suc(v_i, u) of edge (v_i, u) is defined by
suc(v_i, u) = (v, u_{(i + 1) mod deg_u}).

If the edges are numbered with different numbers, which can be performed by computing the prefix sums of the numbers of edges each PU is managing, then the defined successor relation simply gives a linked circular list. This can be turned into a non-circular list, by cutting it at a single place. For example, node 0 can cut the list between its first and last neighbor by setting suc(v_{deg_0 - 1}, 0) = null.

For a tree with n nodes and n - 1 edges, on a PRAM this construction takes O(log n) time with O(n) work for the prefix computation and O(1) time with O(n) work for the linking. On a DMM with P PUs, the prefix takes O(n / P) * c_i + log P * (c_v + c_a). If we assume that the information concerning edge (v_i, u) is stored in the PU responsible for node v_i, then the linking requires that the PU which is responsible for node u sends a packet to this other PU. Thus on a DMM the cost is dominated by that of an all-to-all routing in which each PU sends and receives one information for each edge of a node it is managing. If there are no extreme node degrees and the nodes are distributed evenly, this amounts to a k-k routing with k = 2 * n / P. More generally, the edges, after numbering them, should be distributed evenly over the PUs.

Euler-Tour Construction

Basic Applications

The Euler-tour technique can be used for several problems by weighting the edges of the list appropriately and then solving a weighted variant of the list-ranking problem. In a normal list-ranking problem the task is to compute for each node u the number of links on the path from the initial node or on the path to the final node, whatever is easier. A weighted list-ranking problem is very similar: the links have weights which are not necessarily 1 for all edges, and the task is to compute the sum of all weights of the links on the path from the initial node or on the path to the final node. Said otherwise, this is the problem of computing a prefix or suffix sum on a list. None of the presented list-ranking algorithms essentially depends on the links being unweighted, though in an optimized version some of them must send slightly smaller information units for the unweighted version. For example, in pointer jumping the distance regularly doubles until reaching the final node. This observation makes it unnecessary to send distance information all the time.

Tree Rooting

In many applications, for example as a result of a parallel algorithm for constructing a spanning tree, the tree is just an acyclic connected graph without further structure: there is no root and the edges are not directed. Choosing a root is not hard, the node with the smallest index is a good candidate. But in a forest this presupposes that each node knows to which tree it belongs and for further operations it is also handy if for each node u the node v = parent[u] is known, being the neighbor of u lying on the path from u to the root of its tree.

Preorder and Postorder

The preorder and postorder numbers are important for determining the structure of the tree. For example, a DFS spanning tree of a graph is characterized by the fact that there are no "forward cross edges", these are edges running from a node u to a node v, with preorder(v) > preorder(u) and postorder(v) > postorder(u). These numberings can also be used as just a numbering.

Assume we have a rooted tree as input. Then to compute the preorder numbers, all edges pointing away from the root are given weight +1 the other edges are given weight 0. After running a weighted list-ranking algorithm, a node u can obtain its number from the rank of the edge coming from its parent. Switching the weights, that is giving weight +1 to the edges leading towards the root and 0 to the others computes the postorder numbers.

Depth

Computing the depth of all nodes provides important information about the structure of the tree. It also may help to compute distances in the tree. The distances are very easy to compute. Just weight the edges of the Euler tour appropriately: all edges pointing away from the root get weight +1, all other edges get weight -1. After solving the weighted list-ranking problem, each node u can obtain its depth from the rank of the edge coming from its parent.

Size of Subtree

The size of the subtree of a node can be determined as a by-product of the computation of the preorder numbering: if the edges pointing away from the root are given weight +1 and all other edges weight 0, then for any node u the difference of the rank of the edge leading from parent[u] to u and that of the edge leading from u to parent[u] gives the number of downward edges in the subtree of u. Each such an edge reaches a new node, so this equals the size of the subtree of u without counting u.

Inorder

For a binary tree without nodes of degree 1 it is also handy to have an inorder numbering. This is slightly harder than the above. In a binary tree there are four types of consecutive edges in the list of edge. The default is that an edge has weight 0. Some edges get weight 1 according to the following rules:

Performing a weighted list ranking on the thus weighted list, a node u can find its inorder number from the rank of the first edge of the pair of edges that changed direction in u.

Coloring Regular Bipartite Graphs

Sequential Algorithm

An edge coloring of a graph is an assignment of numbers to the edges so that no two edges incident upon a node have the same number. In general it is very hard to compute a coloring using a minimum number of colors, but for bipartite graphs this is much easier. Particularly it is generally true that it is possible to construct a coloring with d colors when d is the maximum degree of any node. This is a consequence of Hall's (also called König's) theorem which states that on bipartite graphs there is a matching which matches all nodes of maximum degree. This provides the step in an inductive proof: a coloring can certainly be found by repeatedly constructing such a maximum matching. Surprisingly coloring bipartite graphs can be performed much more efficiently than this. A single matching in a bipartite graph with n nodes and m edges takes O(sqrt(n) * m) time. For a regular bipartite graph of degree g, m = g * n, so repeatedly matching takes O(sqrt(n) * g * m) time. A coloring can be constructed in just O(log g * m) time.

Consider a bipartite graph with node sets V_1 and V_2, each with n nodes. Assume that the graph is regular of degree g. A first idea for constructing a coloring is to start allocating the colors to the first node of V_1, then the second and so on. When assigning the colors of node i, we should respect the conditions imposed by earlier assigned colors. If one is lucky this may work, but in general this approach will get stuck when we must assign a color c to an edge (i, j) while node j has an edge (i', j) for some i' < i, which was already assigned color c while coloring node i'. Another "greedy" strategy may also work, but not always. The idea is that one tries to assign the colors one by one. The algorithm gets stuck when, while assigning color c, one comes to a node i for which all the uncolored edges lead to nodes which already have an edge that was given color c before.

A working and efficient strategy is based on Euler splittings. For a g-regular graph G with g even, the graph can easily be split in two edge-disjoint subgraphs G_0 and G_1 each of degree g / 2. This is done by constructing an Euler tour of the graph, numbering the edges on the tour alternatingly 0 and 1. All edges which have been numbered 0 belong to G_0, all other edges to G_1. Because sequentially an Euler tour of a graph with n nodes and m edges can be constructed in O(n + m) time, this splitting costs time O(g * n).

In general, for a graph of degree g = 2^k, the algorithm consists of k rounds. In round i, 0 <= i < k, the algorithm is working on 2^i subgraphs in which all nodes have degree 2^{k - i}. Each of the 2^i operations in round i takes O(2^{k - i} * n) time, so the total amount of work in any of the rounds is O(2^k * n) = O(g * n) = O(m). Thus, in total over all rounds the algorithm takes O(k * m) = O(log g * m) time. For g which are not powers of 2, the construction of such colorings is considerably harder. After much research it has been established (by Cole, Ost and Schirra in Combinatorica, Vol. 21, 2001) that even the general case can be solved in O(log g * m) time.

Coloring Regular Bipartite Graphs

Parallel Algorithm

How can we turn this idea into a parallel algorithm? In the following we assume that the degree g of the graph is a power of two, the general case being much harder and unsolved as far as we know.

The basic idea is to turn the problem of finding an Euler tour into a list-ranking problem. For every edge there is a node. These nodes have no predecessor and successor links, but for every edge e, there is a data element left[e] and rght[e]. left[e] gives the edge e' to which e is connected in its left endpoint i in V_1. This decision is made locally, without global coordination in node i. If left[e] = e', then left[e'] = e. rght[e] gives the edge e'' to which e is linked in its right endpoint j in V_2. A node of the list (corresponding to an edge of the graph) which is entered by a left[] link is left through a rght[] link and vice versa.

The list-ranking algorithms can, with minor modifications also be applied to figure out the minimum on a circular list and the distance thereof. This is now done for both lists (not knowing the global structure, it must be done for both). Of course we do not have a conventional list structure: each node has two "successors": one labeled "left" and one labeled "right". So, what is the appropriate successor? The answer is simple: the path starting over a "left" link must at even distances from the original node traverse "left" links and at odd distances "right" links. Further, notice that, because any cycle on a bipartite graph has even length, for any node e, if the distance to the node with the smallest index is even along one direction then it is also even along the other direction. This observation makes that we now can split the set of nodes (corresponding to the edges of the graph) in two: all those at even distances go into one subset, all those at odd distances into another.

Summarizing: first each node of the graph arbitrarily links up the edges connected to it in connected pairs; then a set of bidirectionally linked circular lists of a slightly modified nature with g * n nodes is constructed; then two list ranking problems are constructed. Clearly this whole problem can be solved in O(T_listrank(g * n)). Hereafter we have to solve two instances with each half as many edges. In a straightforward implementation this gives

T_color(P, n, g) = O(T_listrank(P, g * n) + 2 * T_color(P, n, g/2).
The solution depends on the list-ranking time. For a DMM, T_listrank(P, g * n) = O(g * n / P * (c_i + c_v) + c_a * P). Substituting and solving gives
T_color(P, n, g) = O(log g * g * n / P * (c_i + c_v) + c_a * g * P).

This is a good result except for the number of startups. Organizing things slightly differently the number of startups can be minimized: after each splitting of the problem, each of the two subproblems should be concentrated in half of the available PUs. This gives a different recurrence relation:

T_color(P, n, g) = O(T_listrank(P, g * n) + T_color(P/2, n, g/2).
Solving this recurrence gives
T_color(P, n, g) = O(log g * g * n / P * (c_i + c_v) + c_a * P).
So, we saved a factor g in the number of startups. This improved algorithm will work well if log g * g * n >> P^2, the first version required log g * n >> P^2.
If there are two independent problems which must be solved on a DMM or an interconnection network with P Pus, then it is mostly better to solve these problems in parallel each on P / 2 PUs than to solve them after each other each on P PUs.

On a two-dimensional m x m mesh the situation is even much better. Assuming that g * n > m^2, T_listrank(m^2, g * n) = O(T_route(m^2, g * n)) = O(g * n / m). So, we get

T_color(m^2, n, g) = O(g * n / m) + T_color(m^2 / 2, n, g / 2).
Because (g / 2) * n / sqrt(m^2 / 2) = (g * n / m) / sqrt(2), after each splitting the subproblems can be solved faster by a constant factor (more accurately: after splitting twice there are four subproblems which can be solved twice as fast on m / 2 x m / 2 meshes). Thus, the sum of the times of all splitting phases is proportional to the time for the first one:
T_color(m^2, n, g) = O(g * n / m).

Of course the speed-up of this mesh algorithm is only m = sqrt(P), but this is the normal situation on meshes. There are very few problems like matrix multiplication and all-pairs shortest paths for which full speed-up can be achieved. All but very trivial problems require that at least all data are communicated once over the bisection of the network and therefore full speed-up can be achieved only if the sequential complexity of the problem is sufficiently high. A result like the obtained one is optimal up to constant factors in a two-dimensional world. The basic assumption of the DMM model that locality plays no role is definitely not valid without limitation on P.

Unbalanced All-to-All Routing

Balanced all-to-all routing is mostly simple. We have seen how to perform it on a DMM. If every PU is sending and receiving the same number of packets in total, but not the same number to each other PU, the problem is harder. This is not an uncommon situation. For example in sorting algorithms each PU holds initially and finally the same number of packets, but it is not necessarily true that initially each PU holds a uniform sample of the set of numbers. If on a DMM with P PUs the largest numbers all happen to stand in PU_0 all numbers from PU_0 must be routed to PU_{P - 1}.

We have seen how to solve the unbalanced routing problem using randomization: sending all packets first to a random intermediate destination (or by shuffling the packets in a deterministic way), any routing pattern may be replaced by two almost balanced routing patterns. The main disadvantage of this approach is that it doubles the routing volume. A coloring approach offers an alternative solution which are mostly satisfied in practical contexts.

We consider a DMM with P PUs. Each PU has to send and receive in total k packets. Let a_{i, j} denote the number of packets PU_i has to send to PU_j. The conditions are

0 <= a_{i, j} <= k,
sum_{i = 0}^{P - 1} a_{i, j} = sum_{j = 0}^{P - 1} a_{i, j} = k.
The entire routing pattern constitutes a bipartite k-regular graph with P nodes on either side. Constructing a k coloring decomposes the routing into k permutation routing problems, which can be routed in (c_v + c_a) * k time. Even when ignoring the excessive time for the coloring, this is not good, because in general this will be much more than c_v * 2 * k + c_a * 2 * P, the time for the randomized coloring.

Some form of sparsification is the key to the solution. Possibly we might group PUs together, but this inevitably leads to routing packets several times, and then we could just as well apply the simpler randomized routing. So, bundling the packets into superpackets is the only cost reduction we can try to apply. Let s be the size of the superpackets. That is, in PU_i the a_{i, j} packets which have to be routed to PU_j are packet into b_{i, j} = round_up(a_{i, j} / s) superpackets of size s, creating one partially filled superpacket. In total PU_i creates sum_{j = 0}^{P - 1} round_up(a_{i, j} / s) <= sum_{j = 0}^{P - 1} (a_{i, j} / s + 1) = k / s + P. So, this gives a bipartite graph with maximum degree k / s + P. The nodes do not need to have the same degree, but after computing prefix sums dummy packets can be added to make the graph regular. Because within certain limits s can be chosen freely, we may assume that k / s + P is a power of two.

On this reduced (k / s + P)-regular bipartite graph with P nodes on either side the presented Euler-tour-based coloring algorithm can be applied. This takes O(log(k / s + P) * (k / s + P) * (c_i + c_v) + c_a * P) time. This is a loss term. The second loss term comes from the fact that we have more than k / s superpackets: the subsequent routing takes (k / s + P) * (c_v * s + c_a). We want to determine a good choice for s and a condition on k and P so that all loss terms can be made negligible in comparison with the useful time c_v * k. Clearly P should be negligible in comparison with k / s. This allows to simplify the time for the coloring. Equating the loss terms gives the following equation:

log k * k / s = P * s.
Solving gives a close-to-optimal choice for s, namely s = sqrt(k * log k / P).

Theorem: On a DMM with P PUs any k-k routing can be performed in (1 + o(1)) * k * c_v time if k = omega(P * log P).

Proof: Taking s = sqrt(k * log k / P), the loss terms are bounded by O(sqrt(k * log k * P) * c_v). If k = x * P * log P, sqrt(k * log k * P) = k * sqrt((1 + log x / log P + loglog P / log P) / x). Clearly for x = omega(1), this is o(k). End.

So, for unbalanced k-k routing optimality can be achieved for values of k which are only marginally larger than the k = omega(P) required for balanced k-k routing. Practically the only disadvantage is the rather complicated construction.

Tree Contraction

Problem Context

The Euler-tour technique is powerful, but cannot be used to solve all problems on trees. Such a problem is the expression-evaluation problem. In this problem it is assumed that a number is associated with each leaf of the tree, while some operators are associated with any of the internal nodes. The task is to compute for any node u the value of the subexpression given by the subtree rooted at u. Expression evaluation can be solved by performing a tree contraction, the process of gradually reducing the tree until only the root node with the correct value remains.

Evaluating arithmetic expressions in parallel does not appear to make sense because it is hard to imagine in which context sufficiently large expressions occur, but the expression-evaluation problem models other tree problems which are important and for which even have to be solved for large trees. An example of such a problem is that of computing the height of each node u in a tree, that is, the distance to the deepest lying leaf in the subtree of u. For this problem the Euler-tour technique does not appear to be suitable. As an expression-evaluation problem it can be solved by putting 0 in all leafs and using as operator max + 1, taking the maximum value of the children and adding 1. Another example is that of computing for each node u of a tree the minimum key value in the subtree of u. This problem can be solved as an expression-evaluation problem in which each node computes the minimum of its own value and that of its children.

For expression trees of moderate depth the evaluation problem can be solved efficiently in parallel by allocating a processor to each node connected to two leafs. This initial processor allocation can be performed using the Euler-tour technique. For each such a node its value can be computed in constant time. When as a result a new node connected to two leafs is generated, the allocated PU is taking over this job, otherwise it becomes passive. In each step the depth of the tree is reduced by one. Thus, for a tree of depth d with n nodes, the time is O(d + log n) and the work is O(n). If the tree is deep and skinny, for example a path, this way of evaluating gives very limited parallelization.

Raking

Consider a rooted binary tree with root r. A tree of higher degree should first be replaced by a binary tree. This reconstruction can easily be performed in parallel by allocating to each node computational resources which are proportional to its degree. For each node u != r, parent[u] gives the parent of u and sibling[u] gives the sibling of u. Possibly sibling[u] = null. The rake operation is the basic operation we will use to perform tree contraction. It can be applied to any leaf u of the tree as far as u != r and parent[u] != r. It removes u and parent[u] from the tree and hooks sibling[u] to parent[parent[u]]. Of course it also performs some arithmetic to assure that the value expressed by this tree does not change as a result of the raking. Repeating rake operations, we ultimately get a tree consisting of r and at most two leafs for which the value can be computed in constant time.

Many rakes can be performed in parallel, but it should be avoided to rake two leafs with the same grandparent. This is not hard to realize. The algorithm starts by consecutively numbering the leafs using the Euler-tour technique. Then the rake operation is first performed on all leafs with numbers 4 * k + 0, for some k, and then on all leafs with numbers 4 * k + 2, for some k. Hereafter each leaf with an odd numbers i is renumbered as (i - 1) / 2, and the process is repeated until there are only two leafs left. Because the raked leafs are always four numbers apart, they can never have the same grandparent. At this point we essentially use that the tree is binary. So, provided we know how to perform a rake operation, this correctly contracts the tree.

The above tree contraction can be performed very efficiently in parallel: the Euler-tour technique was analyzed before. On an EREW PRAM it can be performed in O(log n) time with O(n) work. Assuming that raking a leaf takes constant time, each contraction phase, consisting of two parallel rakings and a trivial renumbering, takes O(1) time. In each contraction phase the number of leafs is halved, so the whole contraction can be performed in O(log n) time. For some constant c, the work during the contraction is given by sum_t c * n / 2^t = O(n). In the time estimate it is assumed that there are no internal nodes of degree one, otherwise for a tree which has very few leafs only a few nodes disappear in every contraction phase until there is one leaf left. The easiest is to add a dummy leaf to any internal node with degree one. This leaf should get a neutral value so that the evaluated value remains the same for all real tree nodes.

Expression Evaluation

Now that we know how to perform raking in parallel expression evaluation can be presented as a special case. Initially any node u gets a pair of parameters (a_u, b_u). If u is a leaf, a_u = 0 and b_u = c_u, where c_u is the constant of this leaf. If u is an internal node a_u = 1 and b_u = 0. Let val(u) denote the value to be computed for node u. For any internal node u, the operator of u, + or * in our case, is denoted by op(u).

Invariant: At all times for any leaf u, a_u = 0 and b_u = val(u). For any internal node u, with children v and w, val(u) = (a_v * val(v) + b_v) op_u (a_w * val(w) + b_w).

Initially the invariant condition is satisfied, while raking the values of the parameters must be modified appropriately. Assume we are raking a node v with parent u and sibling w. The nodes v and u are eliminated, so w should take over the values of these three. v is a leaf, thus, according to the invariant, val(u) = b_v op_u (a_w * val(w) + b_w). We distinguish two cases: Hereafter in order to establish the invariant, w plays the role of u.

The way the algorithm is specified, the correct value will be computed in the root of the tree, but not for all other nodes. In the following examples it will be discussed how to correct this. The idea is analogous to that for computing a prefix sum: the whole algorithm consists of two phases, one running upwards, one running downwards. The difference is now that operations may take place at different levels of the tree in the same parallel operation.

Applications

Size of Subtree

Above we have seen how the problem of computing for each node u of a tree the size of the subtree at u can be solved quite easily using the Euler-tour technique. To illustrate the method we here consider how this problem can also be solved by tree contraction. Practically this does not make sense because the Euler-tour technique is used as a subroutine in the tree contraction.

This is an easy case. All nodes are given initial weight 1. In addition there are dummy leafs added to all internal nodes of degree one, these get weight 0. The only operation is the addition. It adds the value of both children to the weight. More precisely, when raking u with parent v and grandparent w which have weights a_u, a_v and a_w, respectively, then u and v are removed and a_w is replaced by a_w = a_w + a_v + a_u.

Notice that the above construction does not compute the size of the subtree for the node v which is skipped over. This problem can be solved easily. After the tree contraction there is a tree expansion, reversing the operations. So, all nodes which are excluded in phase i, i = 1, 2, ..., t, are reinserted in phase i = t, t - 1, ..., 1. After reinsertion phase i we may assume that all nodes which have been reinserted already know their correct values. So, by the time v is reinserted, u and its sibling u' at the time of the exclusion of v have been reinserted and obtained their final values a_u and a_{u'}. So, v can compute its value by setting a_v = a_v + a_u + a_{u'}.

Height

The heights can be computed analogously to the size of the subtrees. The differences lie in the details. Initially all nodes are given weight 0. When raking u with parent v and grandparent w, we set a_w = max{a_w, a_u + 2}. When reinserting we should set a_v = max{a_v, a_u + 1, a_{u'} + 1}.

Lowest Common Ancestor

Definition and Applications

Consider a rooted tree T with root r. For any node u, an ancestor is a node w on the path from u to r, u and r included. For a pair of nodes u and v a common ancestor is a node w which is both an ancestor of u and of v. The lowest common ancestor w of u and v is the common ancestor of u and v which lies furthest from r. Thus, this is the node w in which the path from r to u and v forks. w is called the LCA of u and v, also denoted LCA(u, v). In many contexts it is required to compute lowest common ancestors. In the following we first give some examples.

If the tree represents some kind of evolutionistic development, then the LCA of u and v gives the historical event where two "species" went their own independent way. If the tree T is a spanning tree of some undirected graph and we want to make it more like a DFS tree, then when discovering a cross edge (u, v) it may be clever to replace one of the edges from LCA(u, v) leading to u or v by the edge (u, v).

The distance from a node u to a node v in a tree is given by the sum of the distance from u to LCA(u, v) and that from v to LCA(u, v). The distance of u to LCA(u, v) is given by the difference of their depths in the tree. So, after computing the depths of all nodes, which can be performed in linear time, the distances between any pair of nodes can be computed in constant time plus the time for for computing the LCA. Replacing the depths by the sum of all weights on the path from the root to a node, this idea can even be used for computing the distances in weighted trees.

An analogous problem is when for a fixed weighted tree we frequently have to determine the heaviest edge on the path from a node u to another node v in the tree. So, instead of using the sum operator we use the maximum operator here. The difference is that the maximum operator has no inverse, so we cannot simply subtract the contribution from the path between the root and LCA(u, v). In this case we first have to construct an associated tree T'. We describe how this can be done sequentially. We need a union-find data structure in which all nodes are inserted. If T has n nodes, then T' has n leafs in one-one correspondence with the nodes. The leafs all get key value 0. The edges of T are sorted according to their weight and processed in order of increasing weight. For an edge (u, v) of weight weight(u, v), a new node w of T' with two children is created. w has key value weight(u, v). w is connected to the node corresponding to the component of u and the component of v. Hereafter these components are unified and w becomes the node representing this component. Hereafter, for any pair of node u and v the weight of the heaviest edge on the path from u to v in T is given by the key of the LCA of u and v in T'.

Finding Heaviest Edge on Path in Tree

The above examples show that for several applications it is desirable to be able to answer the question "what is the LCA of u and v" for two nodes u and v of a tree. For any single such query the problem can be solved easily by walking towards the root from u and v marking the nodes on the path as visited. The first node that is reached for the second time is the LCA. If this walking is done by alternatingly making a step on the path from u and the path from v, the total number of visited nodes is proportional to the length of the path from u to v. The marks can be set back in the same time. This is clearly optimal and if there are several such queries, then these can be performed in parallel without problem (though concurrent reads may occur).

However, if there are very many such queries, it is better to preprocess the tree so that the queries can be faster than by walking along the tree. The problem of preprocessing the tree so that subsequent queries can be performed in constant time is called the LCA problem. So, the topic of this section is not how to parallelize the queries, which is trivial, but to show how this preprocessing, which sequentially can be performed in O(n) time for a tree with n nodes, can be performed in parallel.

Special Cases

There are two special cases for which the LCA is easy. If the tree is a path, then it suffices to compute for each node its depth. LCA(u, v) = u if depth(u) <= depth(v), otherwise LCA(u, v) = v. This is not a very interesting case.

More important is the case that the tree T is a perfect binary tree with n = 2^k - 1 nodes for some k > 0. The key idea is to perform an infix numbering of this tree, giving number 1 to the leftmost leaf. So, the node numbers are 1, ..., 2^k - 1. Let w = LCA(u, v) be the LCA of two nodes u and v. Let (u_{k - 1}, ..., u_0) and (v_{k - 1}, ..., v_0), be the binary expansions of the numbers of u and v respectively. Let x(u, v) be the index of the most-significant bit of u and v which are different. Using this notation, the binary expansion of the number of w can be expressed easily: w_i = u_i for all i > x(u, v); w_{x(u, v)} = 1; w_i = 0 for all i < x(u, v). Of course it is not necessary that the number of nodes is exactly a power of two minus one.

LCA on Perfect Binary Tree

Theorem: For a perfect binary tree T with n = 2^k - 1 nodes a numbering of the nodes can be computed in O(log n) parallel time and O(n) work so that based on this numbering for any pair of nodes u and v of T a query LCA(u, v) can be solved in O(1) sequential time.

The idea for the perfect binary trees can be extended to all trees which are sufficiently like perfect binary trees. The sequential linear time algorithm is based on such an extension. We do not further consider this construction but assume the following as a fact:

Theorem: For any tree T with n nodes a tree T' of size O(n) can be constructed in O(n) time so that based on the information contained in T' for any pair of nodes u and v of T a query LCA(u, v) can be solved in O(1) time.

Reduction to Range-Minima Problem

In this section the general LCA problem will be reduced to the problem of computing the minimum in subranges of an array. Interestingly, in the next section we will show how this latter problem can be solved by the special case of the LCA applied to perfect binary trees.

The construction is again based on the Euler-tour technique. First an Euler tour is constructed running along all edges in the tree, starting and ending in the root node r. This tour is used to construct arrays index[] and depth[]. For a tree with n nodes each has length 2 * n - 1. index[i] gives the index of the node that is visited after traversing i edges of the tour. index[0] = index[2 * n - 2] = r. depth[i] gives the depth of node index[i]. It can be computed by weighting all list edges leading away from the root with +1 and all other edges with -1. Running a list-ranking algorithm, each of these arrays can be computed in O(log n) time and O(n) work.

From index[] and depth[] we compute the arrays first[] and last[]. These arrays have length n. For all u, 0 <= u < n, first[u] = min_{0 <= i < 2 * n - 1} {index[i] = u}, that is, it gives the first occurrence of u on the tour. Analogously, last[u] gives the last occurrence of j on the tour. Because the tree is connected all these values are well defined. For a leaf u, first[u] = last[u]. For an internal node u first[u] < last[u]. On an EREW PRAM these arrays can be constructed in O(1) time and O(n) work: For all i, 0 <= i < 2 * n - 1, it is checked whether depth[i - 1] < depth[i] or not. If this is the case, we know that the tour came to node u = index[i] from above. So, this must be the first visit to node u. In that case the testing PU sets first[u] = i. If depth[i + 1] > depth[i], this is the last visit to node u = index[i] and the testing PU sets last[u] = i.

For any pair of nodes u and v, there are five mutually exclusive possibilities:

  1. u = v;
  2. first[u] < first[v], last[v] < last[u]: u is an ancestor of v, LCA(u, v) = u;
  3. first[v] < first[u], last[u] < last[v]: v is an ancestor of u, LCA(u, v) = v;
  4. last[u] < first[v], LCA(u, v) is neither u nor v.
  5. last[v] < first[u], LCA(u, v) is neither u nor v.
Only the last two cases need to be considered further. The last is analogous to the first, so it suffices to consider case 4: < first[u].

Lemma: If last[u] < first[v], LCA(u, v) = index[j], where j = min_{last[u] < i < first[v]}{depth[i] = d}, where d = min_{last[u] < i < first[v]}{depth[i]}.

Proof:

Let d and j be as defined and let w = index[j]. All nodes on the tour between last[u] and first[v] have depth larger than w. So, the tour has not left the subtree rooted at w. Thus, all these nodes, particularly u and v are descendants of w, that is, w is a common ancestor of u and v. Now consider another node w' which also lies on the subtour lying between last[u] and first[v]. Suppose w' is an ancestor of u. The tour is constructed so that once it leaves a node to a higher lying node it does not return. This holds in particular for w'. So, only after having visited all nodes in the subtree of w' the tour reaches w. These nodes in the subtree of w' will not be visited again. We know that hereafter the tour reaches v at least one more time because first[v] > j. Thus, v does not lie in the subtree of w' and thus w' is not a common ancestor. We conclude that on this subtour w is the only common ancestor of u and v. End.

Reduction of LCA to Range-Minima

The lemma reduces the problem of computing LCA(u, v) to testing several simple cases, and if necessary computing the minimum value in a specified interval of an array of length O(n). Finding the minimum in a subrange of an array can be solved easily in parallel: for a subrange of length n' it can be performed in O(log n') time and O(n') work. This shows that any single LCA query can be performed in O(log n) time and O(n) work. This is not a trivial result but in the above presented context we are striving for something else: we want to preprocess the tree so that LCA queries can be performed in O(1) time. So, after the reduction this means that we should preprocess an array so that queries for the minimum in a subrange of the array can be performed sequentially in O(1) time. The question how to achieve this is called the range-minima problem.

Range-Minima Problem

For an array of length n there are (n over 2) = Omega(n^2) subranges. That sounds like a too big number. A clever and rather general idea is to reduce any kind of range searches to prefix and suffix searches. In this case the range-minima problem will be solved by constructing a set of arrays containing prefix and suffix minima from which the desired value can be easily recovered.

Suboptimal Algorithm

For an array a[] of length n, the array p[] of prefix minima is defined by p[i] = min_{0 <= j <= i}{a[i]}, for all 0 <= i < n. The array s[] of suffix minima is defined by s[i] = min_{j <= i < n}{a[i]}. Suppose we have already computed the prefix minima p_1[] for the first half of an array a[] and the prefix minima p_2[] for the second half of a[]. Then the array p[] of prefix minima of the whole array can be computed in O(1) time and O(n) work setting
p[i] = p_1[i], for all 0 <= i < n / 2;
p[i] = min{p_1[n / 2 - 1], p_2[i - n / 2]}, for all n / 2 <= i < n.
In an analogous way the array of suffix minima s[] can be constructed from arrays s_1[] and s_2[] of suffix minima for each half of a[].

Assume that n = 2^k for some k >= 0. For other n, we can if necessary round up the length of the array to the next power of two. Let p_{l, m}[], 0 <= l <= k, 0 <= m < n / 2^l, be the array of prefix minima for the subarray of a[] of length 2^l starting at position m * 2^l. Define s_{l, m}[] to be the analogous array of suffix minima. By the above recursive construction all arrays p_{l, m} and s_{l, m} can be constructed in O(log n) time and O(n * log n) work:

T(n) = T(n / 2) + O(1);
W(n) = 2 * W(n / 2) + O(n).
Once all these arrays have been computed, a range-minimum query can be performed in O(1) sequential time.

We start with an example. Suppose we want to determine min_{393 <= i <= 438}{a[i]}. Then we first write these numbers u and v in binary: u = 393 = 00110001001, v = 434 = 00110110010. We have added a few leading zeroes to show the generality of the idea. Then we determine the most-significant different bit. In our example it occurs at position 5. The number w which is obtained by taking v and setting the 5 least-significant bits to zero is w = 0011010000 = 416 = 13 * 32. The number we are looking for is min{s_{5, 12}[9], p_{5, 13}[18]}. Here 23 = 393 - 12 * 32 and 18 = 434 - 13 * 32.

More generally, for numbers u and v, u <= v, the most-significant different bit x = x(u, v) is determined. Then the number w is constructed by taking v and setting the least-significant x bits to zero. Let d = 2^x and l = w / d. The definition of w and d is so that (l - 1) * d = w - d <= u and (l + 1) * d = w + d > v. The number we are looking for is min{s_{x, l - 1}[u - (l - 1) * d], p_{x, l}[v - l * d]}. This is so, because s_{x, l - 1}[u - (l - 1) * d] = min_{u <= j < w}{a[j]} and p_{x, l}[v - l * d] = min_{w <= j <= v}{a[i]}.

It is interesting to notice that the number w is just the index of the LCA of nodes u and v in a perfect binary tree with inorder indexing with smallest index 1, the special case of the LCA problem treated above. This shows the close relation between the LCA problem and the range-minima problem. In an implementation it is not necessary to make this tree structure explicit, the set of prefix and suffix arrays can be managed by a double indexed array of arrays of integers. The given computation shows which two values in this construct must be accessed to compute the range minimum from. On a RAM this takes constant time.

Lemma: For an array a[] of length n, in O(log n) time and O(n * log n) work a data structure requiring O(n * log n) storage can be created which allows it to solve range-minima queries on a[] in constant time.

Optimal Algorithm

The above construction is simple and once it is ready it allows to work with efficiently, but the work and the storage are too large. Using accelerated cascading the work and memory consumption can be made linear, which is optimal. Actually for the range-minima problem there is an even faster parallel algorithm, running in O(loglog n). This faster algorithm is not considered here because it does not make the complete algorithm for the LCA problem faster.

To reduce the work and memory consumption, the array a[] of length n is divided in n / log n subarrays of length log n each. For each of these subarrays an optimal sequential algorithm is applied. For each of the subarrays this takes O(log n) time and work. So, the parallel time is O(log n) and the work is O(n). The created structure allows to perform range-minima queries inside any of the subarrays in constant time. Along with this construction we can also construct an array b[] of length n / log n consisting of the minima in each of the subarrays. The above given suboptimal algorithm is now applied to b[] rather than a[]. This takes O(log n) time and O((n / log n) * log(n / log n)) = O(n) work. So, the whole construction takes O(log n) time and O(n) work.

It remains to specify how this two-stage structure can be used to perform range-minima queries in constant time. For numbers u and v, u <= v, belonging to subarray i_u and i_v, respectively, we distinguish two cases (with a subcase in the second):

Now the construction is perfect except that we have not told how to sequentially preprocess an array in linear time so that range-minima queries can be performed in constant time.

Theorem: For an array a[] of length n, in O(log n) time and O(n) work a data structure requiring O(n) storage can be created which allows it to solve range-minima queries on a[] in constant time.

Corollary: For a tree with n nodes, in O(log n) time and O(n) work a data structure requiring O(n) storage can be created which allows it to solve LCA queries for any pair of nodes in constant time.

Computing Most-Significant Different Bit

The above results are based on the assumption that the most significant different bit of two numbers u and v can be computed in constant time. Computing z = u ^ v, the bitwise exor of u and v, this problem is reduced to determining the leading non-zero bit of a number, said otherwise the problem of computing intlog(z) = round_down(log_2 z). Unfortunately, this integer logarithm is not a standard operation, even though it would be easy to realize in hardware and even though it is quite important.

For k-bit numbers, any single computation of intlog() can trivially be performed in O(k) time. Using a binary-search strategy, based on comparisons and shifts, this can easily be reduced to O(log k) time. Using bit parallelism further improvements are possible, but we do not obtain a constant-time operation this way. If we have to perform many computations of intlog(), it may be better to first compute an array intlog[] of length n = 2^k containing all values. A later call intlog(i) can then be answered in constant time by returning intlog[i].

Sequentially all n values of intlog[] can easily be computed in O(n) time:

  void computeIntlog(int[] intlog, int n)
  {
    intlog[0] = -1; // Artificial value
    intlog[1] =  0; 
    for (int i = 2; i < n; i++)
      if ((i & (1 << (intlog[i - 1] + 1) == 0)
        intlog[i] = intlog[i - 1];
      else
        intlog[i] = intlog[i - 1] + 1;
  }

The given subroutine cannot easily be parallelized. There are several alternative constructions. If optimal work would not be a major concern, then we could apply any of the sequential algorithms. For a sequential time T(n), this gives work T(n) * n. We know that T(n) = O(loglog n) or less. It is very simple to make the algorithm optimal: first compute intlog[i] for all values 0 <= i < sqrt(n). This takes T(n) time and O(sqrt(n) * T(n)) = O(n) work. Then for any number i, sqrt(n) <= i < n, set intlog[i] = intlog[i >> (k / 2)] + k / 2. This takes constant time and O(n) work.

Theorem: If T(n) = O(loglog n) denotes the time for sequentially computing the value of intlog(i) for any number 0 <= i < n, then the array intlog[] with intlog[i] = intlog(i) for all i, 0 <= i < n, can be computed in parallel in O(T(n)) time and O(n) work.

Exercises

  1. For a tree T the task is to number all leafs consecutively, that is, the numbering should be so that for any node u the set of all numbers of the leafs equals {l, l + 1, ..., h} for some l and h. Describe how the Euler-tour technique can be used to compute such a numbering.

  2. Sequentially, coloring a bipartite graph can also be solved by repeatedly determining a maximum matching. Matching is more expensive than coloring, but different from the Euler-tour-based splitting technique it can even be applied for graphs of odd degree.

  3. In the raking procedure it was used that the tree is binary. Trees of other degrees can be first made binary, but if the degree of any node is bounded by some constant d this is not necessary. Describe how to generalize the algorithm. Express the time consumption for a tree with n nodes in n and d.

  4. Suppose we have a tree in which all nodes have degree at most two. Each node holds a key and for each node u we would like to compute the minimum value of the keys in the subtree rooted at u, u included. In the text it was suggested to use the tree-contraction method for this. Work out the details, particularly describe how to handle the internal nodes of degree one. For a tree with n nodes the algorithm should run in O(log n) time and O(n) work on an EREW PRAM independently of the tree structure.

  5. Assume we have a black-box subroutine which allows to compute the minimum value m in a specified interval [l, h] of an array a[] of length n. Describe how this routine can be used to compute an index j, l <= j <= h so that a[j] = m. We assume that the range-minima subroutine only works on integer arrays, so that we cannot simply add the index as secondary key to the values of a[]. Distinguish the case that it is given that all values of a[] are different and the case that values may be the same.

  6. This question deals with computing the least-significant different bit. All considered numbers have at most k bits for some even k > 0. n = 2^k. It is supposed that there is an array intlog[] of length n containing for all i, 0 <= i < n, the value of round_down(log_2 i).

This page was created by Jop Sibeyn.
Last update Monday, 06 September 04 - 09:19.
For any comments: send an email.