suc(v_i, u) = (v, u_{(i + 1) mod deg_u}).
If the edges are numbered with different numbers, which can be performed by computing the prefix sums of the numbers of edges each PU is managing, then the defined successor relation simply gives a linked circular list. This can be turned into a non-circular list, by cutting it at a single place. For example, node 0 can cut the list between its first and last neighbor by setting suc(v_{deg_0 - 1}, 0) = null.
For a tree with n nodes and n - 1 edges, on a PRAM this construction takes O(log n) time with O(n) work for the prefix computation and O(1) time with O(n) work for the linking. On a DMM with P PUs, the prefix takes O(n / P) * c_i + log P * (c_v + c_a). If we assume that the information concerning edge (v_i, u) is stored in the PU responsible for node v_i, then the linking requires that the PU which is responsible for node u sends a packet to this other PU. Thus on a DMM the cost is dominated by that of an all-to-all routing in which each PU sends and receives one information for each edge of a node it is managing. If there are no extreme node degrees and the nodes are distributed evenly, this amounts to a k-k routing with k = 2 * n / P. More generally, the edges, after numbering them, should be distributed evenly over the PUs.
Assume we have a rooted tree as input. Then to compute the preorder numbers, all edges pointing away from the root are given weight +1 the other edges are given weight 0. After running a weighted list-ranking algorithm, a node u can obtain its number from the rank of the edge coming from its parent. Switching the weights, that is giving weight +1 to the edges leading towards the root and 0 to the others computes the postorder numbers.
Performing a weighted list ranking on the thus weighted list, a node u can find its inorder number from the rank of the first edge of the pair of edges that changed direction in u.
Consider a bipartite graph with node sets V_1 and V_2, each with n nodes. Assume that the graph is regular of degree g. A first idea for constructing a coloring is to start allocating the colors to the first node of V_1, then the second and so on. When assigning the colors of node i, we should respect the conditions imposed by earlier assigned colors. If one is lucky this may work, but in general this approach will get stuck when we must assign a color c to an edge (i, j) while node j has an edge (i', j) for some i' < i, which was already assigned color c while coloring node i'. Another "greedy" strategy may also work, but not always. The idea is that one tries to assign the colors one by one. The algorithm gets stuck when, while assigning color c, one comes to a node i for which all the uncolored edges lead to nodes which already have an edge that was given color c before.
A working and efficient strategy is based on Euler splittings. For a g-regular graph G with g even, the graph can easily be split in two edge-disjoint subgraphs G_0 and G_1 each of degree g / 2. This is done by constructing an Euler tour of the graph, numbering the edges on the tour alternatingly 0 and 1. All edges which have been numbered 0 belong to G_0, all other edges to G_1. Because sequentially an Euler tour of a graph with n nodes and m edges can be constructed in O(n + m) time, this splitting costs time O(g * n).
In general, for a graph of degree g = 2^k, the algorithm consists of k rounds. In round i, 0 <= i < k, the algorithm is working on 2^i subgraphs in which all nodes have degree 2^{k - i}. Each of the 2^i operations in round i takes O(2^{k - i} * n) time, so the total amount of work in any of the rounds is O(2^k * n) = O(g * n) = O(m). Thus, in total over all rounds the algorithm takes O(k * m) = O(log g * m) time. For g which are not powers of 2, the construction of such colorings is considerably harder. After much research it has been established (by Cole, Ost and Schirra in Combinatorica, Vol. 21, 2001) that even the general case can be solved in O(log g * m) time.
The basic idea is to turn the problem of finding an Euler tour into a list-ranking problem. For every edge there is a node. These nodes have no predecessor and successor links, but for every edge e, there is a data element left[e] and rght[e]. left[e] gives the edge e' to which e is connected in its left endpoint i in V_1. This decision is made locally, without global coordination in node i. If left[e] = e', then left[e'] = e. rght[e] gives the edge e'' to which e is linked in its right endpoint j in V_2. A node of the list (corresponding to an edge of the graph) which is entered by a left[] link is left through a rght[] link and vice versa.
The list-ranking algorithms can, with minor modifications also be applied to figure out the minimum on a circular list and the distance thereof. This is now done for both lists (not knowing the global structure, it must be done for both). Of course we do not have a conventional list structure: each node has two "successors": one labeled "left" and one labeled "right". So, what is the appropriate successor? The answer is simple: the path starting over a "left" link must at even distances from the original node traverse "left" links and at odd distances "right" links. Further, notice that, because any cycle on a bipartite graph has even length, for any node e, if the distance to the node with the smallest index is even along one direction then it is also even along the other direction. This observation makes that we now can split the set of nodes (corresponding to the edges of the graph) in two: all those at even distances go into one subset, all those at odd distances into another.
Summarizing: first each node of the graph arbitrarily links up the edges connected to it in connected pairs; then a set of bidirectionally linked circular lists of a slightly modified nature with g * n nodes is constructed; then two list ranking problems are constructed. Clearly this whole problem can be solved in O(T_listrank(g * n)). Hereafter we have to solve two instances with each half as many edges. In a straightforward implementation this gives
T_color(P, n, g) = O(T_listrank(P, g * n) + 2 * T_color(P, n, g/2).The solution depends on the list-ranking time. For a DMM, T_listrank(P, g * n) = O(g * n / P * (c_i + c_v) + c_a * P). Substituting and solving gives
T_color(P, n, g) = O(log g * g * n / P * (c_i + c_v) + c_a * g * P).
This is a good result except for the number of startups. Organizing things slightly differently the number of startups can be minimized: after each splitting of the problem, each of the two subproblems should be concentrated in half of the available PUs. This gives a different recurrence relation:
T_color(P, n, g) = O(T_listrank(P, g * n) + T_color(P/2, n, g/2).Solving this recurrence gives
T_color(P, n, g) = O(log g * g * n / P * (c_i + c_v) + c_a * P).So, we saved a factor g in the number of startups. This improved algorithm will work well if log g * g * n >> P^2, the first version required log g * n >> P^2.
If there are two independent problems which must be solved on a DMM or an interconnection network with P Pus, then it is mostly better to solve these problems in parallel each on P / 2 PUs than to solve them after each other each on P PUs.
On a two-dimensional m x m mesh the situation is even much better. Assuming that g * n > m^2, T_listrank(m^2, g * n) = O(T_route(m^2, g * n)) = O(g * n / m). So, we get
T_color(m^2, n, g) = O(g * n / m) + T_color(m^2 / 2, n, g / 2).Because (g / 2) * n / sqrt(m^2 / 2) = (g * n / m) / sqrt(2), after each splitting the subproblems can be solved faster by a constant factor (more accurately: after splitting twice there are four subproblems which can be solved twice as fast on m / 2 x m / 2 meshes). Thus, the sum of the times of all splitting phases is proportional to the time for the first one:
T_color(m^2, n, g) = O(g * n / m).
Of course the speed-up of this mesh algorithm is only m = sqrt(P), but this is the normal situation on meshes. There are very few problems like matrix multiplication and all-pairs shortest paths for which full speed-up can be achieved. All but very trivial problems require that at least all data are communicated once over the bisection of the network and therefore full speed-up can be achieved only if the sequential complexity of the problem is sufficiently high. A result like the obtained one is optimal up to constant factors in a two-dimensional world. The basic assumption of the DMM model that locality plays no role is definitely not valid without limitation on P.
We have seen how to solve the unbalanced routing problem using randomization: sending all packets first to a random intermediate destination (or by shuffling the packets in a deterministic way), any routing pattern may be replaced by two almost balanced routing patterns. The main disadvantage of this approach is that it doubles the routing volume. A coloring approach offers an alternative solution which are mostly satisfied in practical contexts.
We consider a DMM with P PUs. Each PU has to send and receive in total k packets. Let a_{i, j} denote the number of packets PU_i has to send to PU_j. The conditions are
0 <= a_{i, j} <= k,The entire routing pattern constitutes a bipartite k-regular graph with P nodes on either side. Constructing a k coloring decomposes the routing into k permutation routing problems, which can be routed in (c_v + c_a) * k time. Even when ignoring the excessive time for the coloring, this is not good, because in general this will be much more than c_v * 2 * k + c_a * 2 * P, the time for the randomized coloring.
sum_{i = 0}^{P - 1} a_{i, j} = sum_{j = 0}^{P - 1} a_{i, j} = k.
Some form of sparsification is the key to the solution. Possibly we might group PUs together, but this inevitably leads to routing packets several times, and then we could just as well apply the simpler randomized routing. So, bundling the packets into superpackets is the only cost reduction we can try to apply. Let s be the size of the superpackets. That is, in PU_i the a_{i, j} packets which have to be routed to PU_j are packet into b_{i, j} = round_up(a_{i, j} / s) superpackets of size s, creating one partially filled superpacket. In total PU_i creates sum_{j = 0}^{P - 1} round_up(a_{i, j} / s) <= sum_{j = 0}^{P - 1} (a_{i, j} / s + 1) = k / s + P. So, this gives a bipartite graph with maximum degree k / s + P. The nodes do not need to have the same degree, but after computing prefix sums dummy packets can be added to make the graph regular. Because within certain limits s can be chosen freely, we may assume that k / s + P is a power of two.
On this reduced (k / s + P)-regular bipartite graph with P nodes on either side the presented Euler-tour-based coloring algorithm can be applied. This takes O(log(k / s + P) * (k / s + P) * (c_i + c_v) + c_a * P) time. This is a loss term. The second loss term comes from the fact that we have more than k / s superpackets: the subsequent routing takes (k / s + P) * (c_v * s + c_a). We want to determine a good choice for s and a condition on k and P so that all loss terms can be made negligible in comparison with the useful time c_v * k. Clearly P should be negligible in comparison with k / s. This allows to simplify the time for the coloring. Equating the loss terms gives the following equation:
log k * k / s = P * s.Solving gives a close-to-optimal choice for s, namely s = sqrt(k * log k / P).
Theorem: On a DMM with P PUs any k-k routing can be performed in (1 + o(1)) * k * c_v time if k = omega(P * log P).
Proof: Taking s = sqrt(k * log k / P), the loss terms are bounded by O(sqrt(k * log k * P) * c_v). If k = x * P * log P, sqrt(k * log k * P) = k * sqrt((1 + log x / log P + loglog P / log P) / x). Clearly for x = omega(1), this is o(k). End.
So, for unbalanced k-k routing optimality can be achieved for values of k which are only marginally larger than the k = omega(P) required for balanced k-k routing. Practically the only disadvantage is the rather complicated construction.Evaluating arithmetic expressions in parallel does not appear to make sense because it is hard to imagine in which context sufficiently large expressions occur, but the expression-evaluation problem models other tree problems which are important and for which even have to be solved for large trees. An example of such a problem is that of computing the height of each node u in a tree, that is, the distance to the deepest lying leaf in the subtree of u. For this problem the Euler-tour technique does not appear to be suitable. As an expression-evaluation problem it can be solved by putting 0 in all leafs and using as operator max + 1, taking the maximum value of the children and adding 1. Another example is that of computing for each node u of a tree the minimum key value in the subtree of u. This problem can be solved as an expression-evaluation problem in which each node computes the minimum of its own value and that of its children.
For expression trees of moderate depth the evaluation problem can be solved efficiently in parallel by allocating a processor to each node connected to two leafs. This initial processor allocation can be performed using the Euler-tour technique. For each such a node its value can be computed in constant time. When as a result a new node connected to two leafs is generated, the allocated PU is taking over this job, otherwise it becomes passive. In each step the depth of the tree is reduced by one. Thus, for a tree of depth d with n nodes, the time is O(d + log n) and the work is O(n). If the tree is deep and skinny, for example a path, this way of evaluating gives very limited parallelization.
Many rakes can be performed in parallel, but it should be avoided to rake two leafs with the same grandparent. This is not hard to realize. The algorithm starts by consecutively numbering the leafs using the Euler-tour technique. Then the rake operation is first performed on all leafs with numbers 4 * k + 0, for some k, and then on all leafs with numbers 4 * k + 2, for some k. Hereafter each leaf with an odd numbers i is renumbered as (i - 1) / 2, and the process is repeated until there are only two leafs left. Because the raked leafs are always four numbers apart, they can never have the same grandparent. At this point we essentially use that the tree is binary. So, provided we know how to perform a rake operation, this correctly contracts the tree.
The above tree contraction can be performed very efficiently in parallel: the Euler-tour technique was analyzed before. On an EREW PRAM it can be performed in O(log n) time with O(n) work. Assuming that raking a leaf takes constant time, each contraction phase, consisting of two parallel rakings and a trivial renumbering, takes O(1) time. In each contraction phase the number of leafs is halved, so the whole contraction can be performed in O(log n) time. For some constant c, the work during the contraction is given by sum_t c * n / 2^t = O(n). In the time estimate it is assumed that there are no internal nodes of degree one, otherwise for a tree which has very few leafs only a few nodes disappear in every contraction phase until there is one leaf left. The easiest is to add a dummy leaf to any internal node with degree one. This leaf should get a neutral value so that the evaluated value remains the same for all real tree nodes.
Invariant: At all times for any leaf u, a_u = 0 and b_u = val(u). For any internal node u, with children v and w, val(u) = (a_v * val(v) + b_v) op_u (a_w * val(w) + b_w).
Initially the invariant condition is satisfied, while raking the values of the parameters must be modified appropriately. Assume we are raking a node v with parent u and sibling w. The nodes v and u are eliminated, so w should take over the values of these three. v is a leaf, thus, according to the invariant, val(u) = b_v op_u (a_w * val(w) + b_w). We distinguish two cases:The way the algorithm is specified, the correct value will be computed in the root of the tree, but not for all other nodes. In the following examples it will be discussed how to correct this. The idea is analogous to that for computing a prefix sum: the whole algorithm consists of two phases, one running upwards, one running downwards. The difference is now that operations may take place at different levels of the tree in the same parallel operation.
This is an easy case. All nodes are given initial weight 1. In addition there are dummy leafs added to all internal nodes of degree one, these get weight 0. The only operation is the addition. It adds the value of both children to the weight. More precisely, when raking u with parent v and grandparent w which have weights a_u, a_v and a_w, respectively, then u and v are removed and a_w is replaced by a_w = a_w + a_v + a_u.
Notice that the above construction does not compute the size of the subtree for the node v which is skipped over. This problem can be solved easily. After the tree contraction there is a tree expansion, reversing the operations. So, all nodes which are excluded in phase i, i = 1, 2, ..., t, are reinserted in phase i = t, t - 1, ..., 1. After reinsertion phase i we may assume that all nodes which have been reinserted already know their correct values. So, by the time v is reinserted, u and its sibling u' at the time of the exclusion of v have been reinserted and obtained their final values a_u and a_{u'}. So, v can compute its value by setting a_v = a_v + a_u + a_{u'}.
If the tree represents some kind of evolutionistic development, then the LCA of u and v gives the historical event where two "species" went their own independent way. If the tree T is a spanning tree of some undirected graph and we want to make it more like a DFS tree, then when discovering a cross edge (u, v) it may be clever to replace one of the edges from LCA(u, v) leading to u or v by the edge (u, v).
The distance from a node u to a node v in a tree is given by the sum of the distance from u to LCA(u, v) and that from v to LCA(u, v). The distance of u to LCA(u, v) is given by the difference of their depths in the tree. So, after computing the depths of all nodes, which can be performed in linear time, the distances between any pair of nodes can be computed in constant time plus the time for for computing the LCA. Replacing the depths by the sum of all weights on the path from the root to a node, this idea can even be used for computing the distances in weighted trees.
An analogous problem is when for a fixed weighted tree we frequently have to determine the heaviest edge on the path from a node u to another node v in the tree. So, instead of using the sum operator we use the maximum operator here. The difference is that the maximum operator has no inverse, so we cannot simply subtract the contribution from the path between the root and LCA(u, v). In this case we first have to construct an associated tree T'. We describe how this can be done sequentially. We need a union-find data structure in which all nodes are inserted. If T has n nodes, then T' has n leafs in one-one correspondence with the nodes. The leafs all get key value 0. The edges of T are sorted according to their weight and processed in order of increasing weight. For an edge (u, v) of weight weight(u, v), a new node w of T' with two children is created. w has key value weight(u, v). w is connected to the node corresponding to the component of u and the component of v. Hereafter these components are unified and w becomes the node representing this component. Hereafter, for any pair of node u and v the weight of the heaviest edge on the path from u to v in T is given by the key of the LCA of u and v in T'.
The above examples show that for several applications it is desirable to be able to answer the question "what is the LCA of u and v" for two nodes u and v of a tree. For any single such query the problem can be solved easily by walking towards the root from u and v marking the nodes on the path as visited. The first node that is reached for the second time is the LCA. If this walking is done by alternatingly making a step on the path from u and the path from v, the total number of visited nodes is proportional to the length of the path from u to v. The marks can be set back in the same time. This is clearly optimal and if there are several such queries, then these can be performed in parallel without problem (though concurrent reads may occur).
However, if there are very many such queries, it is better to preprocess the tree so that the queries can be faster than by walking along the tree. The problem of preprocessing the tree so that subsequent queries can be performed in constant time is called the LCA problem. So, the topic of this section is not how to parallelize the queries, which is trivial, but to show how this preprocessing, which sequentially can be performed in O(n) time for a tree with n nodes, can be performed in parallel.
More important is the case that the tree T is a perfect binary tree with n = 2^k - 1 nodes for some k > 0. The key idea is to perform an infix numbering of this tree, giving number 1 to the leftmost leaf. So, the node numbers are 1, ..., 2^k - 1. Let w = LCA(u, v) be the LCA of two nodes u and v. Let (u_{k - 1}, ..., u_0) and (v_{k - 1}, ..., v_0), be the binary expansions of the numbers of u and v respectively. Let x(u, v) be the index of the most-significant bit of u and v which are different. Using this notation, the binary expansion of the number of w can be expressed easily: w_i = u_i for all i > x(u, v); w_{x(u, v)} = 1; w_i = 0 for all i < x(u, v). Of course it is not necessary that the number of nodes is exactly a power of two minus one.
Theorem: For a perfect binary tree T with n = 2^k - 1 nodes a numbering of the nodes can be computed in O(log n) parallel time and O(n) work so that based on this numbering for any pair of nodes u and v of T a query LCA(u, v) can be solved in O(1) sequential time.
The idea for the perfect binary trees can be extended to all trees which are sufficiently like perfect binary trees. The sequential linear time algorithm is based on such an extension. We do not further consider this construction but assume the following as a fact:
Theorem: For any tree T with n nodes a tree T' of size O(n) can be constructed in O(n) time so that based on the information contained in T' for any pair of nodes u and v of T a query LCA(u, v) can be solved in O(1) time.
The construction is again based on the Euler-tour technique. First an Euler tour is constructed running along all edges in the tree, starting and ending in the root node r. This tour is used to construct arrays index[] and depth[]. For a tree with n nodes each has length 2 * n - 1. index[i] gives the index of the node that is visited after traversing i edges of the tour. index[0] = index[2 * n - 2] = r. depth[i] gives the depth of node index[i]. It can be computed by weighting all list edges leading away from the root with +1 and all other edges with -1. Running a list-ranking algorithm, each of these arrays can be computed in O(log n) time and O(n) work.
From index[] and depth[] we compute the arrays first[] and last[]. These arrays have length n. For all u, 0 <= u < n, first[u] = min_{0 <= i < 2 * n - 1} {index[i] = u}, that is, it gives the first occurrence of u on the tour. Analogously, last[u] gives the last occurrence of j on the tour. Because the tree is connected all these values are well defined. For a leaf u, first[u] = last[u]. For an internal node u first[u] < last[u]. On an EREW PRAM these arrays can be constructed in O(1) time and O(n) work: For all i, 0 <= i < 2 * n - 1, it is checked whether depth[i - 1] < depth[i] or not. If this is the case, we know that the tour came to node u = index[i] from above. So, this must be the first visit to node u. In that case the testing PU sets first[u] = i. If depth[i + 1] > depth[i], this is the last visit to node u = index[i] and the testing PU sets last[u] = i.
For any pair of nodes u and v, there are five mutually exclusive possibilities:
Lemma: If last[u] < first[v], LCA(u, v) = index[j], where j = min_{last[u] < i < first[v]}{depth[i] = d}, where d = min_{last[u] < i < first[v]}{depth[i]}.
Proof:
Let d and j be as defined and let w = index[j]. All nodes on the tour between last[u] and first[v] have depth larger than w. So, the tour has not left the subtree rooted at w. Thus, all these nodes, particularly u and v are descendants of w, that is, w is a common ancestor of u and v. Now consider another node w' which also lies on the subtour lying between last[u] and first[v]. Suppose w' is an ancestor of u. The tour is constructed so that once it leaves a node to a higher lying node it does not return. This holds in particular for w'. So, only after having visited all nodes in the subtree of w' the tour reaches w. These nodes in the subtree of w' will not be visited again. We know that hereafter the tour reaches v at least one more time because first[v] > j. Thus, v does not lie in the subtree of w' and thus w' is not a common ancestor. We conclude that on this subtour w is the only common ancestor of u and v. End.
p[i] = p_1[i], for all 0 <= i < n / 2;In an analogous way the array of suffix minima s[] can be constructed from arrays s_1[] and s_2[] of suffix minima for each half of a[].
p[i] = min{p_1[n / 2 - 1], p_2[i - n / 2]}, for all n / 2 <= i < n.
Assume that n = 2^k for some k >= 0. For other n, we can if necessary round up the length of the array to the next power of two. Let p_{l, m}[], 0 <= l <= k, 0 <= m < n / 2^l, be the array of prefix minima for the subarray of a[] of length 2^l starting at position m * 2^l. Define s_{l, m}[] to be the analogous array of suffix minima. By the above recursive construction all arrays p_{l, m} and s_{l, m} can be constructed in O(log n) time and O(n * log n) work:
T(n) = T(n / 2) + O(1);Once all these arrays have been computed, a range-minimum query can be performed in O(1) sequential time.
W(n) = 2 * W(n / 2) + O(n).
We start with an example. Suppose we want to determine min_{393 <= i <= 438}{a[i]}. Then we first write these numbers u and v in binary: u = 393 = 00110001001, v = 434 = 00110110010. We have added a few leading zeroes to show the generality of the idea. Then we determine the most-significant different bit. In our example it occurs at position 5. The number w which is obtained by taking v and setting the 5 least-significant bits to zero is w = 0011010000 = 416 = 13 * 32. The number we are looking for is min{s_{5, 12}[9], p_{5, 13}[18]}. Here 23 = 393 - 12 * 32 and 18 = 434 - 13 * 32.
More generally, for numbers u and v, u <= v, the most-significant different bit x = x(u, v) is determined. Then the number w is constructed by taking v and setting the least-significant x bits to zero. Let d = 2^x and l = w / d. The definition of w and d is so that (l - 1) * d = w - d <= u and (l + 1) * d = w + d > v. The number we are looking for is min{s_{x, l - 1}[u - (l - 1) * d], p_{x, l}[v - l * d]}. This is so, because s_{x, l - 1}[u - (l - 1) * d] = min_{u <= j < w}{a[j]} and p_{x, l}[v - l * d] = min_{w <= j <= v}{a[i]}.
It is interesting to notice that the number w is just the index of the LCA of nodes u and v in a perfect binary tree with inorder indexing with smallest index 1, the special case of the LCA problem treated above. This shows the close relation between the LCA problem and the range-minima problem. In an implementation it is not necessary to make this tree structure explicit, the set of prefix and suffix arrays can be managed by a double indexed array of arrays of integers. The given computation shows which two values in this construct must be accessed to compute the range minimum from. On a RAM this takes constant time.
Lemma: For an array a[] of length n, in O(log n) time and O(n * log n) work a data structure requiring O(n * log n) storage can be created which allows it to solve range-minima queries on a[] in constant time.
To reduce the work and memory consumption, the array a[] of length n is divided in n / log n subarrays of length log n each. For each of these subarrays an optimal sequential algorithm is applied. For each of the subarrays this takes O(log n) time and work. So, the parallel time is O(log n) and the work is O(n). The created structure allows to perform range-minima queries inside any of the subarrays in constant time. Along with this construction we can also construct an array b[] of length n / log n consisting of the minima in each of the subarrays. The above given suboptimal algorithm is now applied to b[] rather than a[]. This takes O(log n) time and O((n / log n) * log(n / log n)) = O(n) work. So, the whole construction takes O(log n) time and O(n) work.
It remains to specify how this two-stage structure can be used to perform range-minima queries in constant time. For numbers u and v, u <= v, belonging to subarray i_u and i_v, respectively, we distinguish two cases (with a subcase in the second):
Now the construction is perfect except that we have not told how to sequentially preprocess an array in linear time so that range-minima queries can be performed in constant time.
Theorem: For an array a[] of length n, in O(log n) time and O(n) work a data structure requiring O(n) storage can be created which allows it to solve range-minima queries on a[] in constant time.
Corollary: For a tree with n nodes, in O(log n) time and O(n) work a data structure requiring O(n) storage can be created which allows it to solve LCA queries for any pair of nodes in constant time.
For k-bit numbers, any single computation of intlog() can trivially be performed in O(k) time. Using a binary-search strategy, based on comparisons and shifts, this can easily be reduced to O(log k) time. Using bit parallelism further improvements are possible, but we do not obtain a constant-time operation this way. If we have to perform many computations of intlog(), it may be better to first compute an array intlog[] of length n = 2^k containing all values. A later call intlog(i) can then be answered in constant time by returning intlog[i].
Sequentially all n values of intlog[] can easily be computed in O(n) time:
void computeIntlog(int[] intlog, int n)
{
intlog[0] = -1; // Artificial value
intlog[1] = 0;
for (int i = 2; i < n; i++)
if ((i & (1 << (intlog[i - 1] + 1) == 0)
intlog[i] = intlog[i - 1];
else
intlog[i] = intlog[i - 1] + 1;
}
The given subroutine cannot easily be parallelized. There are several alternative constructions. If optimal work would not be a major concern, then we could apply any of the sequential algorithms. For a sequential time T(n), this gives work T(n) * n. We know that T(n) = O(loglog n) or less. It is very simple to make the algorithm optimal: first compute intlog[i] for all values 0 <= i < sqrt(n). This takes T(n) time and O(sqrt(n) * T(n)) = O(n) work. Then for any number i, sqrt(n) <= i < n, set intlog[i] = intlog[i >> (k / 2)] + k / 2. This takes constant time and O(n) work.
Theorem: If T(n) = O(loglog n) denotes the time for sequentially computing the value of intlog(i) for any number 0 <= i < n, then the array intlog[] with intlog[i] = intlog(i) for all i, 0 <= i < n, can be computed in parallel in O(T(n)) time and O(n) work.