October 14, 2024
Path expressions are a compact way of representing the set of paths between two nodes in a graph. They are regular expressions over an alphabet of edges. Unlike a regular expression that you might use for string matching, which determines the possible sequence of characters that match a given pattern, a path expression determines the possible sequence of edges that create a path from one node to another.
There are a variety of applications of path expressions, including algebraic program analysis, a method of program analysis that serves as an alternative to the usual iterative methods that you might be familiar with. The idea is to compute path expressions for nodes in a control flow graph and then interpret these expressions in some algebra that can represent useful information about the program.
I’ll cover algebraic program analysis in another post, but in this one I will focus on how to actually compute path expressions. In particular, I will focus on computing path expressions for paths between the entry node and other nodes of a control flow graph.
Let’s take the following CFG as an example, which represents a program with a while loop.
Say we are interested in computing for the path expression for node 4. The set of paths from node 1, the entry node, to node 4 is infinite: one path traverses edges \(a, d\); another traverses edges \(a, b, c, d\); another traverses edges \(a, b, c, b, c, d\); and so on. We can’t actually enumerate the set of course, but the path expression for node 4 succinctly represents the whole set: \(a (bc)^{*} d\). The path expressions for the other nodes are what you might expect:
How might we compute these path expressions? Recall that path expressions are regular expressions, and, thanks to Kleene, we know that regular expressions are equivalent to finite automata. In turn, a finite automaton at its essence is a set of states with transitions between them, with distinguished starting states and accepting states. Sounds awfully close to a control flow graph! For a given node for which we want to compute a path expression, we can view a CFG as a finite automaton where:
The path expression for the node will then recognize the same regular language that this automaton recognizes.
So to compute path expressions, we just need an algorithm to convert finite automata into regular expressions. Such an algorithm already exists: Kleene’s algorithm. Not only does it do the job, it is also pretty simple. However, Kleene’s algorithm is inefficient. There are two main reasons:
The algorithm does too much: it computes a regular expression that represents all possible paths between all pairs of the control flow graph. We only care about the path from the entry block to other blocks in the program. It would be akin to using the Floyd-Warshall algorithm to compute the shortest path from one point to all other points, when you could have just used Dijkstra’s algorithm.
The algorithm has cubic asymptotic complexity. To put it in perspective, to compute path expressions for a control flow graph with 100 nodes (not unrealistic for large functions), Kleene’s algorithm must compute a million regular expressions!
Robert Tarjan developed a more efficient algorithm for computing path expressions in his paper “Fast Algorithms for Solving Path Problems”. It’s not a direct algorithm for computing path expressions, but rather an algorithm that can speed up existing algorithms for computing path expressions by breaking the problem down into smaller pieces. Unfortunately, I found the paper hard to understand so it took a while to see how it works. Here’s my attempt at giving an easier explanation.
A heads up if you do try to read the paper: I don’t try that hard to stick close to Tarjan’s exposition, and I take liberties with using some new terminology that was not in the paper whenever I think it clarifies the presentation. Also, Tarjan actually gives several algorithms for computing path expressions in the paper. This post is covers the last and most general algorithm, “decompose and sequence,” which works for an arbitrary control flow graphs. The other algorithms in the paper are basically variations of this one.
The key idea in Tarjan’s algorithm is to decompose path expressions using dominators. If you’ve taken a compilers class, you have probably encountered the concept—specifically, the standard algorithm for transforming into minimal SSA form uses the iterated dominance frontier—but we’ll do a quick recap here.
Node \(N_1\) dominates node \(N_2\) if every path from the entry of the CFG to \(N_2\) contains \(N_1\). Note that the domination relation is reflexive: every node dominates itself. A node \(N_1\) strictly dominates (or properly dominates) node \(N_2\) if \(N_1\) dominates \(N_2\) and \(N_1 \neq N_2\). We say that \(N_1\) is the immediate dominator of \(N_2\) if \(N_1\) strictly dominates \(N_2\) and does not strictly dominate a strict dominator of \(N_2\). The immediate dominator for any node is unique, so we can define a function \(\textup{idom}(N)\) that returns the immediate dominator for a node \(N\). Using the immediate domination relation, we can define a dominator tree for a control flow graph, where a node’s parent is its immediate dominator. It should be obvious that the root of the dominator tree is always the entry node.
Given a path to some node \(N\), the following lemma gives an important property about the path in relation to the proper dominators of \(N\).
Lemma 1: Let \(N_1, ..., N_m\) be the sequence of nodes of a path \(P\) that ends in node \(N_m\). If \(D\) is some proper dominator of \(N_m\), we know \(D\) must occur along \(P\); let \(N_k\) be the last occurrence of \(D\) for some \(1 \leq k < m\). Then for all \(N_i\) where \(k < i \leq m\), \(N_i\) is properly dominated by \(D\).
Proof. Assume this is false. Then there is some \(N_i\) where \(k < i < m\) that is not properly dominated by \(D\). (We know that \(i \neq m\) since \(N_k = \textup{idom}(N_m)\)). Thus there must be some path \(P^\prime\) from the entry to \(N_i\) that does not go through \(D\). But then we can extend \(P^\prime\) to \(N_{i+1}\) since we know in the path \(P\) there is an edge from \(N_i\) to \(N_{i+1}\). Now there is a path from the entry to \(N_{i+1}\) that does not go through \(D\), and thus we can conclude \(D\) does not properly dominate \(N_{i+1}\). We can use similar reasoning to conclude that \(D\) does not properly dominate \(N_{i+2}\) either, and so on until we conclude that \(D\) does not properly dominate \(N_m\), which is impossible. □
Lemma 1 says that for any path to some node \(N\), every node in the path after the last occurrence of some proper dominator \(D\) of \(N\) must be properly dominated by \(D\) as well. Intuitively, this means that after the last occurrence of \(D\) in the path, the rest of the nodes in the path must reside in the dominator subtree rooted at \(D\). Later in the path, after the last occurrence of another proper dominator \(D^\prime\) of \(N\), the rest of the nodes in the path must reside in the dominator subtree rooted at \(D^\prime\). Since \(D^\prime\) must be properly dominated by \(D\), this subtree must be contained in the subtree rooted at \(D\). So any path to \(N\) goes through transitions where it moves down some subtree of the dominator tree (i.e., a subtree that contains \(N\)) and never visits nodes outside of that subtree again.
We can use this observation that any path to \(N\) “goes down” the dominator tree towards \(N\) to give a decomposition of the path expression for \(N\) in terms of its proper dominators. Recall that we are interested in computing a map \(\textup{path}(N)\), which returns a path expression that represents the set of all paths from the entry node to node N. Given Lemma 1, we know that for any path from the entry node to \(N\), we can break the path up into two pieces: a path from the entry node to \(\textup{idom}(N)\), and a path from \(\textup{idom}(N)\) to \(N\) that only visits nodes that are properly dominated by \(\textup{idom}(N)\). From this we can conclude that \(\textup{path}(N)\) can be decomposed into \(\textup{path}(idom(N)) \times \textup{dpath}(N)\), where \(\textup{dpath}(N)\) maps a node \(N\) to a path expression that represents the set of all paths from \(\textup{idom}(N\)) to \(N\) that only goes through nodes that are all properly dominated by \(\textup{idom}(N)\). By similar reasoning, we can decompose \(\textup{path}(\textup{idom}(N))\) into \(\textup{path}(\textup{idom}(\textup{idom}(N))) \times \textup{dpath}(\textup{idom}(N))\).
You can see where this is going. Let \(D_1, ..., D_k\) be a path in the dominator tree, where \(D_1\) is the entry node and \(D_k\) is \(N\). We know that for any \(1 \leq i < k\), \(\textup{idom}(D_{i+1}) = D_i\). Then we know \[ \textup{path}(N) = \textup{path}(D_1) \times \textup{dpath}(D_2) \times \ldots \times \textup{dpath}(D_k). \] So we can compute \(\textup{path}\) in terms of \(\textup{dpath}\). Note, however, that we cannot decompose \(\textup{path}(D_1)\) in terms of \(\textup{dpath}(D_1)\), because the entry node has no immediate dominator. It is a special case, and we will leave discussion of it later.
Tarjan’s algorithm computes values of \(\textup{dpath}\) by performing a post-order traversal of the dominator tree. When a node \(N\) is processed, the algorithm computes \(\textup{dpath}(C)\) for all of the children \(C\) of \(N\). The traversal is post-order because computing \(\textup{dpath}\) for a node requires the values of \(\textup{dpath}\) for the nodes that it properly dominates.
To compute \(\textup{dpath}(C)\), the algorithm splits the incoming edges of \(C\) into two sets: \(\textup{tree}(C)\), which are incoming edges from \(N\) (i.e., the immediate dominator of \(C\)), and \(\textup{nontree}(C)\), which are the rest of the incoming edges of \(C\) that are not in \(\textup{tree}(C)\). We first compute a path expression from \(\textup{tree}(C)\), which we call \(\textup{dpathTree}(C)\), and then we process \(\textup{nontree}(C)\) to compute \(\textup{dpath}\) in terms of \(\textup{dpathTree}\).
Computing \(\textup{dpathTree}(C)\) is the easy “base” case. Each tree edge is already a path from \(N\) to \(C\), so we just need to add all of them up. Thus, given \(\textup{tree}(C) = \{ e_1, \ldots, e_n \}\), we know \(\textup{dpathTree}(C) = e_1 + \ldots + e_n\).
Processing edges in \(\textup{nontree}(C)\) is not as straightforward. While a non-tree edge \(e\) does not form a direct path from \(N\) to \(C\), we can use \(e\) to form a path expression from some other child of \(N\) to \(C\), which we can easily extend to a path expression from \(N\) to \(C\). To see how this can be the case, we need the following lemma:
Lemma 2: The node \(\textup{source}(e)\) must be dominated by node \(\textup{idom}(\textup{target}(e))\).
Proof. Assume the statement is false. Then since \(\textup{source}(e)\) is not dominated by \(\textup{idom}(\textup{target}(e))\), there is a path from the entry to \(\textup{source}(e)\) that does not pass through \(\textup{idom}(\textup{target}(e))\). But adding the \(e\) to this path creates a path from the entry to \(\textup{target}(e)\) without passing through \(\textup{idom}(\textup{target}(e))\), which is impossible. □
Applied to this case, Lemma 2 lets us conclude that the source node of \(e\) is dominated by \(N\). But since we already know that the source node of \(e\) is not \(N\), it must be the case that \(\textup{source}(e)\) must be dominated by one of the children of \(N\) in the dominator tree; let this node be \(C^\prime\). Then we know that there is a set of direct paths from \(N\) to \(C^\prime\), as given by \(\textup{dpathTree}(C^\prime)\), and there is a set of paths from \(C^\prime\) to \(\textup{source}(e)\), and then finally a path from \(\textup{source}(e)\) to \(C\) by way of \(e\). We can compose all of these together to form a set of paths from \(N\) to \(C\).
However, this is not the full story, because paths between the children of \(N\) can compose together. For example, say there is a set of paths from child node \(C_1\) to child node \(C_2\) ending in edge \(e_1\), and another set of path from \(C_2\) to a child node \(C_3\) ending in edge \(e_2\). Then we can compose the paths from the two sets to create new paths from \(C_1\) to \(C_3\). We can then compose paths from \(\textup{dpathTree}(C_1)\) with these new paths to create paths from \(N\) to \(C_3\).
So from this we can see that to calculate \(\textup{dpath}\), we need to consider not just the incoming edges of a child node, but the also the incoming edges of its siblings, because paths between child nodes can be composed together. How do we compute all of these possible compositions?
Tarjan defines a derived graph to help reason about these possible path compositions in a compact way. This derived graph has the same nodes as the CFG; we compute its edges as follows. For each edge \(e\) that is in \(\textup{nontree}(C)\) for some node \(C\) in the dominator tree, find the sibling node \(C^\prime\) of \(C\) that dominates \(\textup{source}(e)\)—the sibling has to exist by Lemma 2—and then add a new edge \(e^\prime\) in the derived graph where \(\textup{source}(e^\prime) = S\) and \(\textup{target}(e^\prime) = C\). Note that \(C^\prime\) can possibly be \(C\) itself.1
Intuitively, an edge in the derived graph represents the set of paths from \(C^\prime\) to \(C\) in the control flow graph that end in edge \(e\). We can compute a path expression—let’s call it the image of \(e^\prime\)—that represents these paths: given a path \(D_1, ..., D_k, \textup{source}(e)\) in the dominator tree, where \(C^\prime = D_1\), then the set of paths is given by \[ \textup{dpath}(D_2) \times \ldots \times \textup{dpath}(D_k) \times \textup{dpath}(\textup{source}(e)) \times e. \] If an edge in the derived graph represents a set of paths in the CFG, a path in the derived graph, then, represents one possible composition of a set of paths between siblings. Thus to compute all the different ways paths between siblings can be composed, we can just compute the set of paths in the derived graph from one sibling to another. Happily, we already have a way of computing a set of paths from one node to another: all we need to do is to compute path expressions between siblings in the derived graph. The algorithm dispatches the computation of these path expressions to some other path expression algorithm; in the paper, Tarjan uses an “eliminate” algorithm that is similar in spirit to Kleene’s algorithm.
Once we have computed path expressions in the derived graph between sibling nodes, we can then map these path expressions into path expressions in the control flow graph by replacing instances of edges in the derived graph with their images, at which point we will have path expressions that represent sets of paths from one sibling node to another.
We are now in a position to compute \(\textup{dpath}\). Let \(\textup{derivedPath}(C_1,C_2)\) be a function that returns a path expression representing paths from node \(C_1\) to \(C_2\) in the derived graph, and let \(\textup{img}(e)\) be a function that maps a path expression in the derived graph to a path expression in the control flow graph by replacing edges in the derived graph with their images. Additionally, let \(\textup{siblings}(C)\) define the set of siblings of node \(C\) in the dominator tree. Then we know
\[ \textup{dpath}(C) = \sum_{C^\prime \in \textup{siblings}(C)} \textup{dpathTree}(C^\prime) \times \textup{img}(\textup{derivedPath}(C^\prime, C)). \] You might be worried that the images of path expressions in \(\textup{derivedPath}\) are not computable: the images of edges in the derived graph are given using values of \(\textup{dpath}\), and we’re computing \(\textup{dpath}\) partly in terms of the images of edges in the derived graph, so it might be possible that the definitions are circular. But consider an edge \(e^\prime\) from sibling nodes \(C^\prime\) to \(C\) in the derived graph. We compute its image above using the \(\textup{dpath}\) values of nodes in the path from \(C^\prime\) to \(\textup{source}(e)\) in the dominator tree, excluding \(C^\prime\) itself. All of these nodes are contained in the dominator subtree rooted at \(C^\prime\), and since the algorithm processes nodes in post-traversal order according to the dominator tree, these \(\textup{dpath}\) values have already been computed by the time we compute \(\textup{dpath}(C)\). So even though \(\textup{dpath}\) and the images of edges in the derived graph are defined in terms of each other, the order in which their values are computed makes their definitions non-circular.
Finally, now that we know how to compute \(\textup{dpath}\) there is one last order of business: computing the value of \(\textup{path}\) for the entry node. We know that \(\textup{tree}(\textup{entry})\) is empty since the entry node has no immediate dominator, so we only need to worry about \(\textup{nontree}(\textup{entry})\). For every incoming edge \(e\) in \(\textup{nontree}(\textup{entry})\), we know that the entry node dominates \(\textup{source}(e)\). Thus there is a path \(D_1, ..., D_k, \textup{source}(e)\) in the dominator tree where \(D_1 = \textup{entry}\), and the following path expression defines the set of paths from the entry node to itself that (1) end in \(e\) and (2) don’t have the entry as an intermediate node: \[ \textup{dpath}(D_2) \times \ldots \times \textup{dpath}(D_k) \times e. \] We can compute such a path expression for every edge \(e\) in \(\textup{nontree}(\textup{entry})\); call it \(P_e\). Because we are computing paths from the entry to itself, we can stay at the entry and not traverse any edges, or go from the entry to itself once by way of a path described by one of these \(P_e\) expressions, and then repeat this process over and over. The expression for \(\textup{path}(\textup{entry})\) captures this: \[ \textup{path}(\textup{entry}) = \left ( \sum_{e \in \textup{nontree}(\textup{entry})} P_e \right )^{*}. \]
We can now put it all these pieces together to arrive at an implementation of Tarjan’s algorithm. The code assumes the following functions are implemented:
computeDerivedGraph(cfg, domtree)
- returns a tuple
containing: (1) edges of the derived graph and (2) a map from edges in
cfg
to edges in derived graphcomputePathExpressions(nodes, edges)
- returns path
expressions in terms of edges
between pairs in
nodes
edgePaths(edge, start, dpath, domtree)
- compute path
expression using dpath
that represents paths from
start
to edge.dst
that end in
edge
pathImg(imgMap, pathExpr)
- map path expression
pathExpr
in derived graph to path expression in CFG using
imgMap
, which maps edges in derived graph to their
imagesNote that RegexZero
and RegexOne
represent
regular expressions 0 and 1 respectively. You can interpret 0 as the
empty set, and 1 as the singleton set containing the empty sequence.
def tarjan(cfg, domtree):
# derivedEdges are the edges of the derived graph
# edgeMap maps edges in CFG to edges in derived graph
= computeDerivedGraph(cfg, domtree)
derivedEdges, edgeMap
= dict()
path = dict()
dpath = dict()
dpathTree
# compute dpath in post-order traversal of domtree
= traversePostorder(domtree)
postorderNodes for node in postorderNodes:
= domtree.children(node)
children = dict()
imageMap
# compute dpathTree and image of edges in derived graph
for child in children:
= RegexZero()
dpathTree[child] for edge in tree(child):
= dpathTree[child] + edge
dpathTree[child]
for edge in nontree(child):
= edgeMap[edge].src
siblingDom = edgePaths(edge, siblingDom, dpath, domtree)
imageMap[edgeMap[edge]]
# compute path expressions in derived graph between sibling nodes
= computePathExpressions(children, derivedEdges)
derivedPath
# compute dpath
for child in children:
= RegexZero()
dpath[child] for sibling in children:
# compute image of path expr in derived graph
= pathImg(imageMap, derivedPath[(sibling,child)])
pathImage = dpathTree[sibling] * pathImage
siblingPaths = dpath[child] + siblingPaths
dpath[child]
# compute path in reverse post-order traversal of domtree
for node in reversed(postorderNodes):
# special case for computing path[entry]
if node == cfg.entry:
= RegexZero()
path[cfg.entry] for edge in nontree(cfg.entry):
= edgePaths(edge, cfg.entry, dpath, domtree)
edgePaths = path[cfg.entry] + edgePaths
path[cfg.entry]
= RegexStar(path[cfg.entry])
path[cfg.entry]
# for non-entry nodes, compute path by appending dpath[node] to path[idom]
else:
= domtree.immediateDominator(node)
idom = path[idom] * dpath[node]
path[node]
return path
You can see a complete working implementation of the algorithm in Python here.
Let’s go back to the control flow graph at the beginning of this post, and let’s use Tarjan’s algorithm to compute the path expressions for its nodes. The derived graph for the CFG is below; it contains only an edge \(c^\prime\) from node 2 to itself, which represents the back edge \(c\) from node 3 to node 2 in the CFG.
The post-order traversal of the dominator tree starts with nodes 3 and 4. Neither of these nodes have children, so we don’t compute anything.
Next, we process node 2, which has nodes 3 and 4 as children in the dominator tree. Neither child node have non-tree edges, so we only have to compute \(\textup{dpathTree}\) for both. We arrive at \(\textup{dpath}(3) = b\) and \(\textup{dpath}(4) = d\) .
Next, we process node 1, which has node 2 as a child in the dominator tree. We know \(\textup{dpathTree}(2) = a\). For \(\textup{dpath}(2)\), we compute \(\left (c^\prime \right)^*\) as a path expression from node 2 to itself in the derived graph. The image of edge \(c^\prime\) is given by \(\textup{dpath}(3) \times c\), so the image of the path expression is \(\left (bc \right )^*\). Thus, \(\textup{dpath}(2) = a \left (bc \right)^*\).
Finally, we compute \(\textup{path}(1)\). There are no non-tree edges to node 1, so \(\textup{path}(1) = 1\).
Thus the path expressions for the CFG are:
which is exactly what we expected.
In the paper, Tarjan defines the derived graph so that it also includes edges in \(\textup{tree}(C)\). However, the algorithm only really uses the derived graph to process edges from \(\textup{nontree}(C)\), so we don’t bother to include edges from \(\textup{tree}(C)\) here.↩︎