Graph Algorithms

Definitions

Graphs are a mathematical concept that can be used to model many concepts from the real world. A graph consists of two sets: one set contains the nodes (also called vertices), the other the edges (sometimes also called arcs). The edges connect the nodes. A road map (without the background coloring) is an example of a graph: the cities and villages are the nodes, the roads are the edges. This is an example of an undirected graph: typically a road on a map can be used in both directions. But we can also make a sociogram: for each considered person there is a node, and there is an edge from person A to person B if A likes B. These edges are directed: liking someone is not always mutual. More mathematically, if there are n nodes, then we number them from 0 to n - 1. These numbers are called indices. An edge can then be given as a pair of nodes: (u, v) denotes the edge from node u to node v. For any graph it should be specified whether it is directed or not. In the latter case an edge (u, v) can be used both for going from u to v and for going from v to u. Any undirected graph can be turned into a directed graph by replacing each edge by a pair of oppositely directed edges.

In connection with graphs there are many notions. Some of them are important already now. A neighbor of a node u is any node v for which there is an edge (u, v). In this case one also says that v is adjacent to u. The nodes u and v are called the endpoints of the edge. A path from u to w is a sequence of edges (u, v_1), (v_1, v_2), ..., (v_{k - 1}, w) connecting u with w. A path is called simple if all nodes on the path are different. A cycle is a path whose first and last node coincide (in the example this means that u == w) and whose nodes are otherwise all different. A graph without cycles is called acyclic. A tree is an acyclic and connected graph. By extension, a directed graph is called a tree when it is a tree considered as an undirected graph and all edges are directed towards or away from a specific node called the root. A forest is a set of trees. The length of a path is the number of edges used; in the example the path has length k. The distance from u to v is the length of a shortest path from u to v.

A graph is called connected if for any pair of nodes u, v there is a path running from u to v. For directed graphs one mostly speaks of strongly connected if we take the direction of the edges into account for these paths; otherwise one speaks of weakly connected. A connected component is a maximal subset of the nodes that is connected. For directed graphs one speaks of strongly (weakly) connected components; often these are also called strong (weak) components. A spanning tree is a tree that reaches all nodes of a connected graph. A spanning forest is a set of trees, one for each connected component of a graph.

The degree of a node u is the number of edges incident on u. For directed graphs it is customary to distinguish the indegree, the number of edges leading to u, and the outdegree, the number of edges starting in u. The degree of the graph is the maximum of the degrees of all nodes. If an edge (u, v) occurs more than once (that is, the "set" of edges is actually a multiset), then we say that there are multiple edges. A self-loop is an edge (u, u). A graph without multiple edges and self-loops is said to be simple. It is common to assume that graphs are simple. The number of nodes of a graph is often denoted by n, the number of edges by m. Simple graphs have at most n * (n - 1) edges if they are directed and at most n * (n - 1) / 2 edges if they are undirected. If m = O(n) (as in road graphs), then the graph is said to be sparse. If m is much larger than linear, the graph is called dense. This notion is not precisely defined: sometimes it means m = omega(n), sometimes it means m = Omega(n^2).

Directed Graph with 12 Nodes and 16 Edges

Representations

The simplest representation of a graph is as a list of edges. In that case the input has size 2 * m. This representation is unsuitable for solving most common graph problems, because there is no efficient way to access all edges incident on a given node. But one of the presented algorithms for computing the connected components of a graph can work with a list of edges. Furthermore, this may be the form in which the graph is initially specified. If this is the case and the format is unsuitable for the problem at hand, a more suitable representation must be constructed first.

A common way to represent graphs in the computer is to create for each node a set of all its neighbors. This is a particularly good idea for sparse graphs. In general one might use linear lists for these sets. This is called the adjacency list representation of the graph. Such an implementation requires O(n + m) memory. The strong feature of adjacency lists is that it becomes very easy to determine all neighbors of a node. Using linked lists, it is also trivial to add or delete a particular edge. If all nodes have about the same degree, with a maximum g, then it is much more convenient to represent the graph as an array of arrays of length g, storing for each node its degree as well. Even for lists of variable length an array may be used, marking for each node where in the array its neighbors begin. This implementation requires n + m + O(1) memory. The major disadvantage of adjacency lists is that testing whether an edge (u, v) exists takes time proportional to the number of neighbors of u.
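As an illustration, here is a minimal sketch of the array-based variant (with arrays named f[] and a[] as in the exercises below): it builds the structure from a directed edge list by counting the outdegrees.

  // Sketch: build the array-based adjacency representation from a directed edge list.
  // Afterwards the neighbors of u stand in a[] at positions f[u] .. f[u + 1] - 1.
  static int[][] buildAdjacencyArrays(int n, int[][] edges)
  {
    int[] f = new int[n + 1];
    int[] a = new int[edges.length];
    for (int[] e : edges) // Count the outdegrees
      f[e[0] + 1]++;
    for (int u = 0; u < n; u++) // Prefix sums: f[u] = start of u's block
      f[u + 1] += f[u];
    int[] next = f.clone(); // Next free position per node
    for (int[] e : edges)
      a[next[e[0]]++] = e[1];
    return new int[][] { f, a };
  }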

For dense graphs the most appropriate representation is with an adjacency matrix: an n x n boolean matrix, where a 1 at position (u, v) indicates that there is an edge (u, v). If the graph is undirected, the adjacency matrix is symmetric. This representation requires n^2 bits of memory, independently of the number of edges m. Thus, for really large graphs this representation cannot be used. Furthermore, any operation that requires the access of all neighbors of a node takes O(n) time. On the other hand, this is the representation to use if frequently the existence of single edges must be tested. Furthermore, for rather dense graphs of moderate sizes storing n^2 bits may require less memory than storing n + m ints.
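A minimal sketch of this representation, using a plain boolean matrix (a genuinely bit-packed variant, which really achieves n^2 bits, is the subject of one of the exercises):

  // Minimal adjacency-matrix sketch for a directed graph; all operations are O(1).
  class AdjMatrix
  {
    private final boolean[][] edge;

    AdjMatrix(int n) { edge = new boolean[n][n]; }

    void addEdge(int u, int v) { edge[u][v] = true; }

    void deleteEdge(int u, int v) { edge[u][v] = false; }

    boolean isEdge(int u, int v) { return edge[u][v]; }
  }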

Undirected Graph with 13 Nodes and 20 Edges

Graph Traversal

An elementary but important problem on graphs, directed or undirected, occurring as a subproblem when solving the connected-components or spanning-tree problem, is graph traversal: some systematic way to visit all n nodes of the graph. In this section the nodes are merely numbered while traversing; later the same pattern is used to perform other operations. Sometimes a specific processing order is required (when computing shortest paths or when determining a topological sorting), but quite often it is not.

Basic Graph Traversal

We have encountered several ADTs. An important ADT which was not discussed explicitly before is the bag. A bag supports three operations: adding an element, removing an arbitrary element, and testing whether it is empty. Stacks and queues are bags. Leaving the removal order unspecified allows potentially even more efficient implementations (though it is unlikely that there exists anything better and simpler than a stack).
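For illustration, here is a minimal array-based Bag with the interface used in the pseudocode below. This sketch happens to remove in LIFO order, but users of a bag must not rely on any particular removal order.

  // A minimal fixed-capacity bag: add, remove an arbitrary element, emptiness test.
  class Bag
  {
    private final int[] data;
    private int size = 0;

    Bag(int capacity) { data = new int[capacity]; }

    void add(int x) { data[size++] = x; }

    int remove() { return data[--size]; }

    boolean notEmpty() { return size > 0; }
  }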

In case no specific processing order of the nodes of a graph is required, the following non-recursive procedure might be the simplest and most efficient:

  void traversal(int[] number) 
  {
    for (v = 0; v < n; v++)
      number[v] = -1; // Flag value
    counter = -1;
    Bag b = new Bag(n);
    for (r = 0; r < n; r++)
      if (number[r] == -1) // r is the root of a new component
      {
        counter++;
        number[r] = counter;
        b.add(r);
        while (b.notEmpty()) 
        {
          v = b.remove();
          for (each neighbor w of v)
            if (number[w] == -1) // w has not been visited yet
            {
              counter++;
              number[w] = counter;
              b.add(w);
            } 
        } 
      } 
  }

Clearly every node is numbered only once, because only nodes that were not visited before are assigned a value. Because counter is increased just before numbering a node and never decreased, all numbers are different. All nodes are added to the bag exactly once, and upon removal all their outgoing edges are considered. This means that the algorithm has running time O(n + m).

Traversal Numbers

Spanning Forests

The highlighted edges in the above picture constitute a spanning tree. This is not a lucky coincidence: the red edges are the ones along which the traversal algorithm was discovering the new nodes to add to the bag. Because all nodes are reached this is a spanning tree. Thus, with a tiny modification the above algorithm can be used to compute a spanning forest of an undirected graph.

The produced tree will be directed towards the root: for each node the next node on the path to the root will be computed. The roots are characterized by the fact that they are pointing to themselves. Because for each node only one value has to be stored, independently of the degree of the tree, the whole tree can simply be maintained in an array:

  void spanningForest(int[] parent) 
  {
    for (v = 0; v < n; v++)
      parent[v] = -1; // Flag value
    Bag b = new Bag(n);
    for (r = 0; r < n; r++)
      if (parent[r] == -1) // r is the root of a new component
      {
        parent[r] = r; // r points to itself
        b.add(r);
        while (b.notEmpty()) 
        {
          v = b.remove();
          for (each neighbor w of v)
            if (parent[w] == -1) // w has not been visited yet
            {
              parent[w] = v; // w is reached from v
              b.add(w);
            } 
        } 
      } 
  }

The time consumption is clearly the same as before: O(n + m). However, for a formal correctness proof, the above argument should be refined. It should be shown that all nodes in a connected component belong to the same tree. So, assume there is a node w belonging to the same component as r, which is not reached during the traversal starting from r. Consider a path from r to w. Such a path exists because r and w belong to the same component. Let u be the last node on the path for which the parent-value has been set during the traversal starting from r, and let v be the next node on the path. u has been added to the bag upon setting its parent-value. Because the traversal only stops once the bag is empty, u also has been removed from it. Upon that occasion all neighbors of u, including v, are inspected and when they were not yet reached at that point of time, their parent-value is set. This contradicts the assumption that there is a node on the path from r to w which has not been reached during the traversal starting from r.

Spanning Forest

Breadth-First Search

Sometimes a processing order is required in which the nodes are processed in the order in which they were discovered. The corresponding graph traversal is called breadth-first search and the produced numbering is called a BFS numbering of the graph. This type of graph traversal has many important applications of which we will encounter a few.
  void bfsTraversal(int[] number) 
  {
    for (v = 0; v < n; v++)
      number[v] = -1; // Flag value
    counter = -1;
    Queue q = new Queue(n);
    for (r = 0; r < n; r++)
      if (number[r] == -1) // r is the root of a new component
      {
        counter++;
        number[r] = counter;
        q.enqueue(r);
        while (q.notEmpty()) 
        {
          v = q.dequeue();
          for (each neighbor w of v)
            if (number[w] == -1) // w has not been visited yet
            {
              counter++;
              number[w] = counter;
              q.enqueue(w);
            } 
        } 
      } 
  }

The algorithm is identical to the previous one, except that we use a queue instead of a bag. How the bag is implemented may have a strong impact on the produced numbering, but not on the complexity: all nodes are enqueued and dequeued exactly once, and upon dequeuing all their outgoing edges are considered. Thus, the complexity is O(n + m).

BFS Numbers

Depth-First Search

There is an alternative way of traversing a graph called depth-first search; the produced numbering is called a DFS numbering. This method, too, has many important applications. DFS uses (implicitly or explicitly) a stack instead of an unspecified bag or a queue. DFS can be implemented recursively and non-recursively. As usual the recursive algorithm is shorter, though most likely the non-recursive variant will be more efficient.

Non-Recursive Algorithm

First we consider a non-recursive variant of the DFS algorithm.
  void dfsTraversal(int[] number) 
  {
    for (v = 0; v < n; v++)
      number[v] = -1; // Flag value
    counter = -1;
    Stack s = new Stack(m);
    for (r = 0; r < n; r++)
      if (number[r] == -1) // r is the root of a new component
      {
        s.push(r);
        while (s.notEmpty()) 
        {
          v = s.pop();
          if (number[v] == -1) // v has not been visited yet
          {
            counter++;
            number[v] = counter;
            for (each neighbor w of v)
              s.push(w); 
          } 
        }
      }
  }

As before this algorithm numbers each node only once, with different numbers from 0 to n - 1. Here the nodes are marked as visited only after they are popped from the stack. This implies that nodes may be pushed on the stack many times, and that the size of the stack may become as large as m (even 2 * m for undirected graphs). This is not such a large disadvantage: the graph itself also requires that much storage. If one nevertheless wants to prevent this, one should either apply the recursive algorithm, where the command "s.push(w)" is replaced by a conditional recursive call to the DFS procedure, or one should push, instead of just w, the pair consisting of v and the index of w within the adjacency list of v. When popping such an entry, the entry for the next neighbor of v is pushed in its place.

Recursive Algorithm

DFS can more easily be formulated with a recursive algorithm.
  void dfs(int v, int* preCounter, int* postCounter, 
         int[] preNumber, int[] postNumber) 
  {
    (*preCounter)++;
    preNumber[v] = *preCounter; // preorder number
    for (each neighbor w of v)
      if (preNumber[w] == -1)
        dfs(w, preCounter, postCounter, preNumber, postNumber);
    (*postCounter)++;
    postNumber[v] = *postCounter; // postorder number 
  }

  void dfsTraversal(int[] preNumber, int[] postNumber) 
  {
    for (v = 0; v < n; v++)
      preNumber[v] = -1; // Flag value
    preCounter = postCounter = -1;
    for (r = 0; r < n; r++)
      if (preNumber[r] == -1)
        dfs(r, &preCounter, &postCounter, preNumber, postNumber); 
  }

Here we computed two types of numbers: preorder DFS numbers and postorder DFS numbers, depending on whether they were assigned before or after the recursion. The preorder numbers are the same as those that were simply called "DFS numbers" before. The importance of computing both types of numbers will become clear soon.

The last code fragment is written in a C-like style. In Java, where integer parameters cannot be passed by reference, one should either make the counters global (ugly but efficient) or pass them in some kind of object. Note that the standard class Integer is immutable and therefore unsuitable for this purpose; a one-element int array will do. Click here to see all presented traversal algorithms integrated in a working Java program.
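A sketch of this work-around (assuming adjacency lists adj[v]; both counters are initialized to { -1 } as before):

  // Counters passed as one-element arrays, playing the role of the C pointers.
  static void dfs(int v, int[][] adj, int[] preCounter, int[] postCounter,
                  int[] preNumber, int[] postNumber)
  {
    preNumber[v] = ++preCounter[0]; // preorder number
    for (int w : adj[v])
      if (preNumber[w] == -1)
        dfs(w, adj, preCounter, postCounter, preNumber, postNumber);
    postNumber[v] = ++postCounter[0]; // postorder number
  }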

DFS Numbers

Edge Classification

Classification

The pre- and postorder numbers are not just any arbitrary numbers. They are useful for classifying the edges. Generally, with respect to a spanning tree of a graph, the edges of the graph may be classified as tree edges (the edges of the spanning tree itself), forward edges (leading from a node to one of its descendants in the tree), backward edges (leading from a node to one of its ancestors), and cross edges (all remaining edges). On undirected graphs there is no distinction between forward and backward edges. If the spanning tree is a DFS tree of the graph, then for an undirected graph there are no cross edges, because of the way the DFS search is performed. For directed graphs there are no forward cross edges with respect to a DFS tree, but there may be backward cross edges.

Classification of Edges of Directed Graph

With respect to a given spanning tree, for which we have computed pre- and postorder numbers, edges can be classified in constant time per edge as follows: an edge (u, v) is a tree or forward edge if preNumber[u] < preNumber[v] and postNumber[u] > postNumber[v] (u is an ancestor of v); it is a backward edge if preNumber[u] > preNumber[v] and postNumber[u] < postNumber[v] (v is an ancestor of u); otherwise it is a cross edge.

This classification does not distinguish tree edges from forward edges. If this matters, they can be distinguished by testing whether u == parent[v] or not, where parent[v] gives the node from which v was reached. The classification allows testing in O(n + m) time whether a given spanning tree of a graph G is a DFS tree of G: compute the pre- and postorder numbers and test whether there are forbidden cross edges.
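In code, the constant-time test might look as follows (a sketch, assuming the arrays pre[], post[] and parent[] have been filled in by a traversal as above):

  // Classify a directed edge (u, v) with respect to a DFS tree in O(1) time.
  static String classify(int u, int v, int[] pre, int[] post, int[] parent)
  {
    if (pre[u] < pre[v] && post[u] > post[v]) // u is an ancestor of v
      return (u == parent[v]) ? "tree" : "forward";
    if (pre[u] > pre[v] && post[u] < post[v]) // v is an ancestor of u
      return "backward";
    return "cross"; // The only remaining possibility for a simple graph
  }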

DFS Tree

Finding Cycles

Cycles are often problematic. For example, if the edges indicate some kind of order in which tasks are to be performed (precedence relations), then a cycle implies a deadlock situation. Fortunately it is easy to test for the existence of cycles: a graph is acyclic if and only if a DFS yields no backward edges. In principle this solves the problem, but there are simpler special-purpose algorithms, which will be presented in the following.

For undirected graphs this gives a particularly easy test for cycles:

  boolean isAcyclic() 
  {
    int[] parent = new int[n];
    for (v = 0; v < n; v++)
      parent[v] = -1; // Flag value
    Bag b = new Bag(n);
    for (r = 0; r < n; r++)
      if (parent[r] == -1) // r is the root of a new component
      {
        parent[r] = r; // r points to itself
        b.add(r);
        while (b.notEmpty()) 
        {
          v = b.remove();
          for (each neighbor w of v)
            if (parent[w] == -1) // w has not been visited yet
            {
              parent[w] = v;
              b.add(w);
            }
            else if (w != parent[v])
              return false;
        } 
      } 
    return true;
  }

The above algorithm is a minor modification of the spanning-forest algorithm and has running time O(n + m). On undirected graphs one must be careful not to detect "cycles of length two": an edge (u, v) followed by the same edge in the other direction (v, u). In the algorithm these cases are singled out by the test "w != parent[v]". The applied method is simple and efficient, but one may nevertheless wonder whether it is necessary to use an additional array of n ints. One of the exercises deals with this, and the conclusion is that an array of n booleans is sufficient as well.

For directed graphs, the above algorithm might detect false cycles: reaching a previously reached node need not imply that a directed cycle has been closed. On undirected graphs any graph-traversal algorithm can be used, but on directed graphs it appears to be essential to process the nodes in DFS order. Doing this, cycles are closed only by backward edges, edges leading from a descendant v to an ancestor w. The ancestors of a node are easy to recognize: these are exactly those nodes which have already been reached by the traversal, but for which the search is not yet completed. This can be characterized with two booleans (reached and finished) per node. More formally, during the search any node may be in one of three possible states:

  0. unreached
  1. reached but not finished
  2. finished
So, the idea can also be implemented conveniently with a three-valued array:
  boolean hasCycle(int v, byte[] status)
  {
    status[v] = 1;
    for (each neighbor w of v)
      if (status[w] == 0) // Unvisited node
      {
        if (hasCycle(w, status)) // Cycle in subtree
          return true;
      }
      else 
        if (status[w] == 1) // w is an ancestor of v
          return true;
    status[v] = 2;
    return false; // No cycles in any of the subtrees
  }

  boolean isAcyclic() 
  {
    byte[] status = new byte[n];
    for (v = 0; v < n; v++)
      status[v] = 0;
    for (r = 0; r < n; r++)
      if (status[r] == 0) // Unvisited node
        if (hasCycle(r, status)) // Cycle in subtree
          return false;
    return true; // No cycles in any of the subtrees
  }

Connected Components

Undirected Graphs

One of the simplest applications of graph-traversal algorithms is to determine the connected components of an undirected graph. This requires only a minimal modification of the traversal algorithm. The algorithm is very similar to the spanning-forest algorithm, which should come as no surprise: the trees of the spanning forest are in one-to-one correspondence with the connected components.
  void connectedComponents(int[] component) 
  {
    for (v = 0; v < n; v++)
      component[v] = -1; // Flag value
    Bag b = new Bag(n);
    for (r = 0; r < n; r++)
      if (component[r] == -1) // r is the root of a new component
      {
        component[r] = r;
        b.add(r);
        while (b.notEmpty()) 
        {
          v = b.remove();
          for (each neighbor w of v)
            if (component[w] == -1) // w has not been visited yet
            {
              component[w] = r;
              b.add(w);
            } 
        } 
      } 
  }

Hereafter, for all nodes v, component[v] gives the index of the "root" of the component. In this case, this index equals the smallest index of the nodes belonging to the component. The graph is connected if and only if component[v] == 0 for all v. Using a counter which is increased only when finding the root of a new component, the components can be numbered consecutively, which saves memory if the sizes of the connected components are to be stored in an int array size[]. Click here to see this algorithm integrated in a working Java program.
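A sketch of the variant with consecutive numbering (assuming adjacency lists adj[v]; the bag is realized as an array used as a stack):

  // Variant numbering the components consecutively 0, 1, 2, ...
  // Returns the number of components.
  static int connectedComponents(int[][] adj, int[] component)
  {
    int n = adj.length, count = 0;
    java.util.Arrays.fill(component, -1);
    int[] bag = new int[n]; // Array used as a bag
    for (int r = 0; r < n; r++)
      if (component[r] == -1) // r is the root of a new component
      {
        int size = 0;
        component[r] = count;
        bag[size++] = r;
        while (size > 0)
        {
          int v = bag[--size];
          for (int w : adj[v])
            if (component[w] == -1) // w has not been visited yet
            {
              component[w] = count;
              bag[size++] = w;
            }
        }
        count++; // One more component completed
      }
    return count;
  }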

Directed Graphs

Determining the weak components of a directed graph is trivial: replace each directed edge by an undirected edge and run the above algorithm for undirected graphs.

Much more interesting and relevant is to determine the strong components. This problem is non-trivial: just running a graph traversal achieves nothing, as reaching a node v from a node r does not mean that there is also a way back from v to r. At first it may even appear that finding the strong components is of an essentially harder nature than the problems discussed so far. This is not true: O(n + m) time is enough.

The algorithm consists of two rounds of DFS. For a graph G it proceeds as follows:

  1. Perform DFS on G computing the postorder numbers;
  2. Construct the inverted graph G', with an edge (v, u) if and only if G has an edge (u, v);
  3. Perform DFS on G', addressing the nodes in the order given by their postorder numbers: u is inspected before v if postNumber[u] > postNumber[v].

Lemma: The nodes reached by a search starting from a node r during the second DFS performed on G' precisely give the strong component to which r belongs.

Proof: Denote the strong component to which r belongs by S_r. By definition of strong components, for any node u in S_r there is a path from u to r. Thus, there is also a path from r to u in G'. This implies that all of S_r is reached during the search starting from r. Now consider a node v which is reached during the search starting from r. Apparently there is a path in G' from r to v, and thus there is a path from v to r in G. It remains to prove that there is also a path from r to v in G. r is considered before v, which implies that r has a larger postorder number than v. If we assume that during the first DFS v was reached before r, then we get a contradiction: we know that there is a path from v to r, and thus the postorder number of v would be larger than that of r. So we may assume that during the first DFS r was reached before v. In that case r only gets a larger postorder number than v when v is a descendant of r in the DFS tree. That is, there must be a path from r to v.
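A compact sketch of the two-round procedure (assuming adjacency lists adj[] for G and radj[] for G'; instead of explicit postorder numbers, the nodes are stored in postorder in an array):

  // Strong components by two rounds of DFS (a sketch).
  static int[] strongComponents(int[][] adj, int[][] radj)
  {
    int n = adj.length;
    boolean[] seen = new boolean[n];
    int[] order = new int[n]; // The nodes in order of increasing postorder number
    int[] top = { 0 };
    for (int r = 0; r < n; r++)
      if (!seen[r])
        postorder(r, adj, seen, order, top);
    int[] comp = new int[n];
    java.util.Arrays.fill(comp, -1);
    for (int i = n - 1; i >= 0; i--) // Decreasing postorder numbers
      if (comp[order[i]] == -1)
        mark(order[i], radj, comp, order[i]);
    return comp; // comp[v] identifies the strong component of v
  }

  static void postorder(int v, int[][] adj, boolean[] seen, int[] order, int[] top)
  {
    seen[v] = true;
    for (int w : adj[v])
      if (!seen[w])
        postorder(w, adj, seen, order, top);
    order[top[0]++] = v; // v is finished
  }

  static void mark(int v, int[][] radj, int[] comp, int root)
  {
    comp[v] = root;
    for (int w : radj[v])
      if (comp[w] == -1)
        mark(w, radj, comp, root);
  }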

Topological Sorting

Suppose we have to perform a certain number of tasks and that we can perform one task at a time. There are certain dependencies between the tasks: it may be the case that some task T_1 should definitely be performed before another task T_2. The order of other tasks may be irrelevant. The whole set of dependencies can be drawn as a graph: a node for every task and an edge leading from T_1 to T_2 if T_2 cannot be performed before T_1. We may hope the graph is acyclic; otherwise there is no feasible schedule. If the graph is acyclic we have a directed acyclic graph, abbreviated DAG. The result of this section is that for any DAG G it is possible to compute a numbering so that for any edge (u, v) of G the number of u is smaller than the number of v. The process of computing this numbering, and even the numbering itself, is called topological sorting. A topological sorting of G provides a correct execution order for the tasks. Thus, a directed graph can be topologically sorted if and only if it is acyclic.

DFS-Based Algorithm

A first method for computing a topological sorting is based on DFS. The idea is as follows:
  1. Construct the graph with inverted edges G'
  2. Perform a DFS traversal of G' computing the postorder numbers

Lemma: For a graph G, the postorder numbers of the graph with inverted edges G' constitute a topological sorting of G.

Proof: Assume there is an edge (u, v) in G. This implies that in G' there is an edge (v, u). So, if during the DFS on G' v is reached before u, then u will be reached as a descendant, directly or indirectly, of v, and therefore u will get a smaller postorder number than v, as it should be. If u is reached before v, then u gets a smaller postorder number than v unless v is reached as a descendant of u. However, this would mean that there is a path in G' from u to v. Together with the edge (v, u), this means that u and v lie on a common cycle. That is, G', and hence G, is not acyclic, a contradiction.

This immediately leads to an O(n + m) algorithm, even though inverting the graph is slightly technical if we are working with an array-based implementation of the adjacency lists. Furthermore, the inversion doubles the memory requirement.

Indegree-0 Algorithm

Which nodes can be safely numbered? A node with indegree equal to zero can be given number 0 without any risk. Furthermore, in a finite acyclic graph there is always at least one node with indegree zero. This can be seen as follows: consider the graph with inverted edges G'. Start at any node r and walk until reaching a node u without outgoing edges. Ultimately we reach such a node: due to the acyclicity we cannot walk in circles, so after each step we reach a hitherto unreached node, and because the graph is finite, we cannot keep reaching unvisited nodes forever. This u is a node with outdegree zero in G', which means that u has indegree zero in G.

The above idea immediately leads to an algorithm:

  1. Find a node u with indegree zero, give u the next number.
  2. Remove u and all its outgoing edges from the graph and continue with the reduced graph until there are no nodes left.

The correctness of this algorithm follows from the above observation and the fact that removing a node and its edges does not create cycles: the reduced graph is acyclic whenever the original graph was.

It remains to find an efficient implementation of the above. The idea is to maintain for each node its current indegree in a separate array. The original values in this array can be computed in O(m) time. Then in one pass through the array all nodes with indegree zero are detected and entered into a bag. The nodes are taken out of the bag in arbitrary order. When removing an edge (u, v), the indegree of v is reduced by 1. In this way the whole algorithm has running time O(n + m). A nice feature of the algorithm is that it is not necessary to test beforehand whether the graph is acyclic: if the bag is empty before all nodes are numbered, then we know there is a cycle; if this does not happen, the graph must have been acyclic.

  boolean topologicalSort(int[] number) 
  {
    int[] degree = new int[n];
    for (v = 0; v < n; v++)
      degree[v] = 0;
    for (v = 0; v < n; v++)
      for (each neighbor w of v)
        degree[w]++;
    Bag b = new Bag(n);
    for (v = 0; v < n; v++)
      if (degree[v] == 0)
        b.add(v);
    counter = 0;
    while (b.notEmpty()) 
    {
      v = b.remove();
      number[v] = counter;
      counter++;
      for (each neighbor w of v)
      {
        degree[w]--;
        if (degree[w] == 0)
          b.add(w);
      }
    } 
    return counter == n; // Graph is acyclic
  }

Click here to see this algorithm integrated in a working Java program. The algorithm is efficient with respect to both time and memory consumption: it runs in O(n + m) time, and beyond the graph itself it only needs the indegree array and a bag holding at most n nodes.

Shortest Paths

One of the most important problems on graphs is computing distances. Distances are not only distances in meters, but may also be time, cost, ... . The problem has many variants, the most important being the single-pair problem (the distance from a node s to a node t), the single-source problem (the distances from a node s to all other nodes), and the all-pairs problem (the distances between all pairs of nodes). Surprisingly, in the worst case the first problem is only marginally easier than the second, though on many graphs it can be solved much faster. For the third problem the best algorithms do not perform substantially better than simply solving the second problem for all s (this is not true if the edges may have negative weights). Thus, we can focus on the single-source-shortest-path, SSSP, problem.

BFS-Based Algorithm

For unweighted graphs the SSSP problem is easy: it can be solved by BFS, setting the distance from s of a newly reached node to one more than the distance of the current node. In code this looks as follows:
  void unweightedSSSP(int s, int[] dist) 
  {
    for (v = 0; v < n; v++)
      dist[v] = INFINITY; 
    Queue q = new Queue(n);
    dist[s] = 0;
    q.enqueue(s);
    while (q.notEmpty()) 
    {
      v = q.dequeue();
      for (each neighbor w of v)
        if (dist[w] == INFINITY) // w has not been visited yet
        {
          dist[w] = dist[v] + 1;
          q.enqueue(w);
        } 
    } 
  } 

The algorithm is such that nodes v that are unreachable from s end with dist[v] == INFINITY, which appears to be reasonable. Click here to see it integrated in a working Java program.

BFS Tree

Here and in the following dist[u] denotes the distance values in the algorithm, while distance(u, v) denotes the real distance in the graph from node u to v. Because the length of the concatenation of two paths is the sum of the lengths of each path, the triangle inequality holds for all triples of nodes u, v and w:

distance(u, w) <= distance(u, v) + distance(v, w).

Using induction over the number of performed enqueue operations, it is easy to prove that at all times dist[v] >= distance(s, v) for all nodes. Initially this is true. So, assume it is true after t enqueue operations. If in step t + 1 we set dist[w] = dist[v] + 1, then we may assume that dist[v] >= distance(s, v), and thus, using the triangle inequality and distance(v, w) <= 1, dist[w] = dist[v] + 1 >= distance(s, v) + distance(v, w) >= distance(s, w). Thus, the computed values are not too small. It remains to prove that they are not too large. This is not as easy as one might think.

Lemma: The values dist[v] of the enqueued (dequeued) nodes are monotonically increasing.

Proof: We use induction over the number of performed enqueue (dequeue) operations. Initially the claim holds, because any sequence of length at most 1 is monotonic. So, assume the claim holds for the first t operations. Then in operation t + 1 we are enqueuing a node w with value dist[w] = dist[v] + 1, where v is the latest dequeued node. By induction dist[v] >= dist[u] for any earlier dequeued node u, and every node still on the queue received its value from some such u. Hence dist[w] = dist[v] + 1 >= dist[u] + 1, the value of an arbitrary node on the queue. Here we essentially use that in a queue nodes that were enqueued earlier are also dequeued earlier (the FIFO order).

Theorem: At the end of the algorithm dist[v] = distance(s, v) for all v.

Proof: It remained to show that the values are not too large. The proof goes by contradiction. Assume that dist[w] > distance(s, w) for some w, and assume that w is the node lying closest to s which gets assigned too large a value dist[w]. Let u be the last node before w on a shortest path from s to w. So, distance(s, w) = distance(s, u) + 1. Because u lies closer to s than w, we may assume that u gets assigned the correct value: dist[u] = distance(s, u). Let v be the node which was responsible for enqueuing w. This implies dist[w] = dist[v] + 1. So, dist[v] = dist[w] - 1 > distance(s, w) - 1 = distance(s, u) = dist[u]. Thus, according to the previous lemma, node u will be dequeued before node v. Thus, w should have been enqueued by u instead of v, and we should have dist[w] = dist[u] + 1. This is a contradiction, which can be traced back to our assumption that there is a node w with dist[w] > distance(s, w).

In this kind of proofs it is very common to argue by contradiction, focusing on a supposed first occasion for which the algorithm makes a mistake. If there is no first mistake, then there is no mistake at all!

Dijkstra's Algorithm

If there are weights, then the simple queue-order processing of the elements is no longer correct. In particular, it is not necessarily true that the first time one discovers a node this happens along a shortest path: a path with 10 short edges may be shorter than a path with 5 long edges. However, a simple modification works. The algorithm is known under the name Dijkstra's algorithm. It is assumed that all edge weights are positive, or at least non-negative! If there are negative weights, the algorithm runs in the same time, but then the computed values only give an upper bound on the distances.
  void weightedSSSP(int s, int[] dist) 
  {
    for (v = 0; v < n; v++)
      dist[v] = INFINITY; 
    PriorityQueue pq = new PriorityQueue(n, INFINITY);
    dist[s] = 0;
    pq.decreaseKey(s, dist[s]);
    while (pq.notEmpty()) 
    {
      v = pq.deleteMin();
      for (each neighbor w of v)
        if (dist[w] > dist[v] + weight[v, w]) // shorter path
        {
          dist[w] = dist[v] + weight[v, w];
          pq.decreaseKey(w, dist[w]);
        } 
    } 
  } 

In order to get a formulation without case distinctions, initially all nodes are inserted into the priority queue with infinite key value. This works correctly even if some nodes are not reachable from s: for these nodes the key value remains unchanged throughout the algorithm. Once all reachable nodes have been deleted from the priority queue, the unreachable nodes are deleted as well and processed just like reachable nodes, but certainly this does not lead to improved distance values. So, the existence of unreachable nodes only causes some useless work.

At most n elements are inserted into and deleted from the priority queue, and at most m decreaseKey operations are performed. How much time this takes depends on the priority queue used. Using a binary heap, the construction can be performed in O(n) time and each decreaseKey and deleteMin in O(log n) time. So, with an ordinary heap Dijkstra's algorithm has running time O((n + m) * log n). Better priority queues (Fibonacci heaps) allow the decreaseKey operations to be performed in O(1) amortized time, reducing the running time to O(m + n * log n). Thus, for all m = Omega(n * log n), this is O(m), which is clearly optimal. Refined algorithms perform better for sparser graphs.
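As an aside, the algorithm can also be run on a heap without a decreaseKey operation by lazy insertion: outdated entries are simply inserted anew and skipped upon deletion, which preserves the O((n + m) * log n) bound. A runnable sketch using java.util.PriorityQueue (assuming adjacency lists of {neighbor, weight} pairs as input format):

  // Dijkstra with lazy deletions instead of decreaseKey (a sketch).
  static long[] dijkstra(int[][][] adj, int s)
  {
    int n = adj.length;
    long[] dist = new long[n];
    java.util.Arrays.fill(dist, Long.MAX_VALUE);
    dist[s] = 0;
    java.util.PriorityQueue<long[]> pq =
        new java.util.PriorityQueue<>((x, y) -> Long.compare(x[1], y[1]));
    pq.add(new long[] { s, 0 });
    while (!pq.isEmpty())
    {
      long[] entry = pq.poll();
      int v = (int) entry[0];
      if (entry[1] > dist[v])
        continue; // Outdated entry: v was already deleted with a smaller key
      for (int[] e : adj[v])
        if (dist[v] + e[1] < dist[e[0]]) // Shorter path to neighbor e[0] found
        {
          dist[e[0]] = dist[v] + e[1];
          pq.add(new long[] { e[0], dist[e[0]] });
        }
    }
    return dist; // Unreachable nodes keep Long.MAX_VALUE as "infinity"
  }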

Click here to see the algorithm integrated in a working Java program. In this implementation the priority queue is realized in a primitive way using an array. This minimizes the memory consumption and allows a decreaseKey operation to be performed in constant time, but makes deleteMins expensive: they take O(n) time. So, the running time of the whole algorithm is O(n^2 + m), which for all simple graphs is O(n^2). For dense graphs this is fine, but for all other graphs it is much better to use a priority queue with faster deleteMins. In general, implementing a decreaseKey operation may require an additional search tree to find the nodes, but because in this particular case all node indices lie between 0 and n - 1, it is sufficient to maintain an extra array pos[] of length n, pos[u] giving the position of node u in the priority queue (one more application of direct addressing).
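A sketch of such a primitive priority queue (an illustration, not necessarily the implementation used in the linked program):

  // A primitive priority queue over the node indices 0 .. n - 1.
  // decreaseKey takes O(1) time; deleteMin scans the whole array in O(n) time.
  class ArrayPQ
  {
    private final int[] key;
    private final boolean[] removed;
    private int remaining;

    ArrayPQ(int n, int infinity)
    {
      key = new int[n];
      removed = new boolean[n];
      java.util.Arrays.fill(key, infinity);
      remaining = n;
    }

    boolean notEmpty() { return remaining > 0; }

    void decreaseKey(int v, int k) { key[v] = k; }

    int deleteMin()
    {
      int best = -1;
      for (int v = 0; v < key.length; v++)
        if (!removed[v] && (best == -1 || key[v] < key[best]))
          best = v;
      removed[best] = true;
      remaining--;
      return best;
    }
  }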

The proof of correctness of the algorithm is similar to the proof of correctness for the unweighted case. First one shows that the nodes that are dequeued have increasing distances.

Lemma: The values dist[v] of the dequeued nodes are monotonically increasing.

Proof: We use induction over the number of performed deleteMin operations. Initially the claim holds, because any sequence of length at most 1 is monotonic. So, assume the claim holds for the first t operations. Let w be the node we are deleting in step t + 1, and let u be any node deleted before. At the time u was deleted using deleteMin we must have had dist[u] <= dist[w]. So, consider possible updates to dist[w] after u was deleted. Let v be the node which caused the latest update of dist[w]. In that case dist[w] = dist[v] + weight[v, w]. From the induction assumption it follows that dist[v] >= dist[u]. But then dist[w] = dist[v] + weight[v, w] >= dist[v] >= dist[u]. It is at this point that we essentially use that weight[v, w] >= 0; otherwise the lemma cannot be proven!

Theorem: At the end of the algorithm dist[v] = distance(s, v) for all v.

Proof: As before it can be shown that the values of dist[] at all times give an upper bound on the distances: as long as they are infinite this is clear, and once they have a finite value, the value corresponds to the length of an actual path: there may be shorter paths, but this path certainly exists. So, we may assume that dist[v] >= distance(s, v) for all nodes at all times.

Consider the node w lying closest to s (having the smallest value of distance(s, w)) which upon deletion from the priority queue has dist[w] > distance(s, w). If weight[v, w] = 0 and the shortest path from s to w runs through v, then we will nevertheless say that v lies closer to s than w. Let v be the last node before w on a shortest path from s to w. Thus, distance(s, w) = distance(s, v) + weight[v, w]. Because weight[v, w] >= 0, we have distance(s, v) <= distance(s, w), and thus v gets a correct value, that is, dist[v] = distance(s, v). So, dist[w] > distance(s, w) = distance(s, v) + weight[v, w] = dist[v] + weight[v, w] >= dist[v], and therefore, because of the previous lemma, v will be deleted before w. But at that occasion the algorithm would have set dist[w] = dist[v] + weight[v, w] = distance(s, v) + weight[v, w] = distance(s, w), in contradiction with the assumption that dist[w] > distance(s, w).

In the following picture the action of Dijkstra's algorithm is illustrated. Edge weights are indicated along the edges; the current values of dist[] are indicated in the nodes. 99 stands for infinity. Nodes that have been removed from pq have their final distance value. The (connected) subgraph spanned by these nodes is marked.

Dijkstra's Algorithm

Finding Edges on Paths

One can easily keep track of the edges lying on the shortest paths during the algorithm, but it is just as easy to determine them afterwards, as follows:
  void findShortestPathEdges(int[] dist, int[] parent) 
  {
    for (all nodes w)
      parent[w] = -1;
    for (all edges (v, w))
      if (dist[w] == dist[v] + weight[v, w])
        parent[w] = v; 
  }

The routine is given for directed graphs. For undirected graphs (depending on the input format) it may be necessary to consider an edge (v, w) both for w and for v. In the current version parent[w] may be set several times if there are several paths of the same length from s to w. This may be prevented by performing the assignment only when parent[w] == -1, but this does not make the routine faster.

This routine takes O(n + m) time independently of the type of graph, so these costs are negligible except for unweighted graphs. For weighted graphs it will certainly be more efficient to determine the edges by this separate procedure. But even for unweighted graphs it may be profitable to perform a second pass over all edges: this reduces the amount of data handled in a single subroutine, and may therefore allow a larger fraction of the graph to be held in cache.

Afterwards every node reachable from s has a unique predecessor, and the graph defined by parent[] is acyclic provided that there are no zero-cost cycles. So, the whole graph defined by parent[] constitutes a tree directed towards s, spanning all nodes reachable from s. In particular: independently of m, the set of all shortest paths can be represented in O(n) space. Once parent[] has been computed, queries of the form "give a shortest path from s to v" can be answered using O(n) time and memory: start at v, push all edges on the path from v to s on a stack, and print them while popping them.
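In code, answering such a query might look as follows (a sketch; it assumes v is reachable from s, so that the parent chain ends at s):

  // Print the nodes on a shortest path from s to v by backtracking through parent[].
  static void printPath(int s, int v, int[] parent)
  {
    java.util.ArrayDeque<Integer> stack = new java.util.ArrayDeque<>();
    for (int u = v; u != s; u = parent[u]) // Walk from v back towards s
      stack.push(u);
    stack.push(s);
    while (!stack.isEmpty()) // Pop to obtain the nodes in order from s to v
      System.out.print(stack.pop() + " ");
    System.out.println();
  }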

It is not nice that the above algorithm does not handle zero-cost cycles correctly. This problem can most easily be overcome by an alternative algorithm which is only marginally slower than the given one: run a spanning-tree algorithm on the graph G' = (V, E'), where E' = {(v, w) in E| dist[w] == dist[v] + weight[v, w]}.

Exercises

  1. Describe how to convert a list of undirected edges into an adjacency-list representation.

  2. Write a class AdjacencyMatrix. It should have a constructor AdjacencyMatrix(int n) and possess methods void addEdge(int u, int v), void deleteEdge(int u, int v), int numberOfEdges() and boolean isEdge(int u, int v), which can be used to add an edge, delete an edge, ask for the number of edges and test whether an edge exists, respectively. All operations should run in O(1) time, independently of the number of nodes or edges. For a graph with n nodes, the whole data structure should require n^2 + O(n) bits. It is not correct to assume that booleans are realized as bits.

  3. Draw a directed graph with 14 nodes and 20 edges. Give an adjacency matrix and an adjacency list of the graph. Select an arbitrary start node and compute preorder and postorder numbers for all nodes. The edges of a node are considered in the order in which they appear in the adjacency list.

  4. A common operation on graphs is edge contraction. This means that two adjacent nodes are fused. The edge between them is eliminated, and all edges that formerly were running towards / away from either of them are now running towards / away from the new node. The new node can be given the index of one of the two previous nodes. The other node may be left without any edges. More precisely, if the edge to contract is (u, v), then the tasks are to
    1. Find all edges (v, w) and to replace them by edges (u, w);
    2. Find all edges (w, v) and to replace them by edges (w, u);
    3. To not create self-loops;
    4. To not create multiple edges.

    Give pseudocode realizing this operation. Distinguish four cases: an adjacency-matrix representation and a representation with adjacency lists, both for directed and undirected graphs. Indicate the time consumption for each of them. The time bounds should be given in terms of n, the number of nodes; m, the number of edges; d_u, the degree of u; d_v, the degree of v; and d_G, the current degree of the graph G.

    For directed graphs it is hard to find the edges (w, v). But, choosing an appropriate graph representation requiring O(n + m) memory, a slightly more complicated algorithm can perform edge contraction efficiently: Points 1, 2 and 3 can be realized in O(d_v) time. Describe the necessary graph representation and the algorithm in full detail.

    Point 4 is somewhat harder. Give a trivial realization requiring O(d_v * d_G) time. Organizing the adjacency structures in a suitable way, this can be improved to O(d_v * log(d_G)). Describe how this can be achieved. Instead of changing the organization of the adjacency structures, we can also use some additional data structure, so that we reasonably can expect to perform this operation in O(d_u + d_v) time. Describe how this can be achieved.

  5. For unweighted graphs the connected-components and spanning-tree problems can be solved in O(n + m) time by modifications of the graph-traversal algorithm. In a theoretical sense these algorithms are optimal, but practically this may not be true: if we assume that n = 100,000 and m = 10 * n, then the graph does not fit into the cache and therefore a traversal will produce on the order of n cache faults. On the other hand, we can store one integer per node in the cache. Therefore it is attractive in cases like this to perform a simplified version of Kruskal's algorithm: using a union-find data structure, all edges are considered in the order they are stored, and two sets are unified whenever they are connected by an edge. Work this idea out into an algorithm in pseudocode. What is the complexity of this algorithm using the best implementation of the union-find ADT? How do you expect its performance in practice to compare with that of a traversal-based algorithm?

  6. The topic of this exercise are the properties of acyclic undirected graphs and an algorithm to test acyclicity.
    1. What kind of graph structure does an acyclic connected undirected graph have? For a graph with n nodes, how many edges does such a graph have?
    2. What structure does an arbitrary acyclic undirected graph have? Give a concise and general expression for the number of edges m in terms of the number of nodes n and a third parameter. Prove the general correctness of your expression.
    3. Develop an alternative algorithm for testing whether an undirected graph is acyclic. The algorithm should use, in addition to the memory required to store the adjacency lists and a bag for n nodes, memory for at most n booleans and a constant number of ints. The algorithm should have O(n) running time, independently of the number of edges. Hints: use that a graph is acyclic if and only if each connected component is acyclic, and the above results.

  7. Write a non-recursive version of the DFS algorithm requiring a stack whose size is bounded by O(n). Compare the performance with that of the presented non-recursive algorithm which can be downloaded here. Test graphs with n = 100,000 and m = k * n, for k = 1, 2, 4, 8, 16.

  8. For a directed graph G, several of the presented algorithms required the construction of the graph with inverted edges G'. Assume that for G we use an array-based implementation of the adjacency lists: there is one array f[] of length n + 1 and an array a[] of length m so that the neighbors of node u stand in a[] between (inclusively) f[u] and (exclusively) f[u + 1]. Construct from this the corresponding arrays b[] and g[] giving the adjacency lists of G'. The algorithm should run in O(n + m) time and it should only use O(1) memory beyond the four mentioned arrays. Now try to improve the algorithm: rewrite it so that it becomes semi-in-situ. Here an algorithm solving a graph problem is called semi-in-situ if it uses m + O(n) storage.

  9. Consider a directed acyclic graph of a special form: there is only one node s, called the source, without ingoing edges and only one node t, called the sink, without outgoing edges. For an edge (u, v), let N(u, v) denote the number of different paths from s to t containing (u, v). The task is to present an efficient algorithm for computing N(u, v). Notice that N(u, v) = N^-(u) * N^+(v), where N^-(u) denotes the number of different paths from s to u and N^+(v) denotes the number of different paths from v to t. Define Pred(u) = {x | (x, u) is an edge} and Suc(v) = {x | (v, x) is an edge}.
    1. Give recursive expressions for N^-(u) and N^+(v) in terms of Pred(u) and Suc(v). Do not forget to treat the special cases u == s and v == t.
    2. Using these expressions, give a recursive algorithm for computing N(u, v).
    3. Analyze the time consumption of your algorithm.
    4. Processing the nodes in the right order, the value of N^-(u) can also be computed by a non-recursive algorithm. In which order should the nodes be processed?
    5. Outline a complete non-recursive algorithm for computing the values N(u, v) for all edges (u, v) in linear time.

  10. Consider the single-source-shortest-path problem on weighted graphs. The standard algorithm uses a priority queue, which implies some extra overhead (unstructured memory access, extra information for finding the nodes). One might consider alternatives. How about the following (Kruskal's minimum-spanning-tree algorithm):
    1. Sort all edges according to edge weight;
    2. Consider the edges in the order of increasing edge weight, an edge is added to the set of edges if it does not create a cycle.
    Is this algorithm correct? That is, does the shortest path from a node s to a node t always run over the edges of the constructed tree?

  11. Consider an arbitrary tree T with n nodes. Assume that there is a specified root node r and that for any node u in the tree the roots of the subtrees of u can somehow be accessed. Outline an algorithm for finding a separator of the tree. By separator we mean a node s so that removing s from T decomposes it into 2 or more components, all of them with size at most n / 2. Your algorithm must have O(n) running time.

  12. Consider the program implementing Dijkstra's algorithm. Design a method testDistances(int[] dist) to be added to the class Graph, which for a given graph tests in O(m) time that the computed distances are correct. In general for any non-trivial programming task it is a very good idea to add independent test routines whenever possible. In this case the task consists of two parts: testing that the computed distances are not too large and testing that they are not too small. Hint: for the second test it is handy to first determine the tree consisting of all edges lying on a shortest path.

  13. Test the program implementing Dijkstra's algorithm. Run experiments for n = 2^k for k = 12, 13, 14, 15 and m = l * n, for l = 10, 100, 1000. Fit a function involving four parameters through the measured values so that all time measurements are reasonably predicted. Do the experiments conform with the theoretical analysis?

    Now take the given heap implementation and complement it with a method decreaseKey(int i, int x) based on percolating up along the lines sketched in the text above. Replace the priority queue in the program testing Dijkstra's algorithm by this heap implementation.

    Repeat the experiments for the same pairs of n and m values. Compare the times. Fit a suitable function through your results. Do the experiments conform with the theoretical analysis?

  14. It was suggested how to solve the problem of finding the tree T giving all shortest paths in a weighted graph G = (V, E) by running a spanning tree algorithm on a reduced unweighted graph G' = (V, E'), where E' = {(v, w) in E| dist[w] == dist[v] + weight[v, w]}. Prove the correctness of this algorithm. That is, prove that the unique path in T from s to a node v indeed gives a shortest path from s to v.

    Actually the edges of T can be computed without explicitly constructing G'. Give an algorithm which, in addition to the memory required for storing G, essentially only requires memory for the array parent[].

