Find the path to better answers

Downloads |
---|

46117.zip |

*Note: *The solutions presented in this article require SQL Server 2005 Beta 1 or higher. They were developed and tested on a pre-release version of SQL Server 2005 Beta 2 by Lubor Kollar and Itzik Ben-Gan.

Imagine that you need to know whether one person is an ancestor of another person. A data structure called transitive closure can help. For example, Fosco Baggins is the father of Drogo, and Drogo is the father of Frodo; therefore, Fosco is an ancestor of Frodo by transitivity. Simply put, transitive closure is a data structure, constructed in this case from family trees, that can help you quickly answer such questions. More formally, transitive closure is an extension or superset of a *binary relation *such that whenever (a,b) and (b,c) are in the extension, (a,c) is also in the extension. In essence, this data structure lets you efficiently answer "reachability" questions such as, Does a path from point a to point b exist? The input to the process (which in this case is T-SQL code) that generates the transitive closure is a *directed graph *(pairs of nodes with direction that can be followed in only one way).

Many databases store such graphs—for example, a bill of materials (BOM) with assembly/item pairs, Web links from source page to target page, and employee hierarchies. The output is pairs of nodes that have a path between them. For example, the transitive closure of a BOM input is all pairs of containing/contained items, be they contained directly—e.g., if a cake contains cream, return (cake, cream)—or indirectly—e.g., if a cake contains cream and cream contains sugar, a cake transitively contains sugar; return (cake, cream), (cream, sugar), (cake, sugar).

The transitive closure of a Web links graph is all pairs of source/target pages—again, with direct or indirect links between them. Once the transitive closure of the input graph is established, you can simply determine whether a certain pair exists to answer questions such as, Does item a contain item c? and, Can I get from page x to page y by following Web links? You can find the formal definition of transitive closure and related terms and algorithms at the National Institute of Standards and Technology (NIST) Web site (http://www.nist.gov/dads/HTML/transitiveClosure.html).

You might want to generate transitive closure in a database when your application needs to answer, as quickly as possible, frequent reachability requests against a graph stored in your database. For example, suppose you get many queries against the BOM in your database asking whether one item, such as a chocolate cake, contains another item, sugar. You can use T-SQL to implement an iterative or recursive algorithm that, for each request, traverses the graph starting with the containing item and looks for the contained item.

However, if you're familiar with such iterative solutions, you know that they're usually slow and costly in overhead. Alternatively, you can write T-SQL code that generates a transitive closure of your BOM and materializes it as a table. The table will contain the transitive closure of the BOM—namely, all pairs of items that have a path between them. As soon as your code materializes the transitive closure of your BOM, you can use it to answer all queries about whether one item contains another with a simple query that performs one seek operation within the index you created on the pair of attributes.

A related problem is finding the shortest path between two nodes in a graph. For example, if you have Web links information stored in your SQL Server database, your application might need to answer frequent requests for the minimum number of Web links that would take you from one Web page to another. Robert Floyd's and Stephen Warshall's transitive closure algorithms are commonly used in other computing environments. (For details, search the keywords Floyd and Warshall at http://www.nist.gov/dads.) However, although these algorithms might offer the most convenient programming solutions for the aforementioned problems, they don't work for relational databases because they can't be expressed in SQL queries.

With the introduction of recursive common table expressions (CTEs) in SQL Server 2005, you can provide pure T-SQL solutions to the transitive-closure and shortest-path problems. Let's look at some of these now. We first present solutions for acyclic graphs, in which no path exists from a node to itself; these are fairly simple. Next, we examine the more complex solutions for cyclic graphs, in which a path can exist from a node to itself. We assume here that you're familiar with recursive CTEs. For information about recursive CTEs, see "Get in the Loop with CTEs," May 2004 (InstantDoc ID 42072), and "Cycling with CTEs," June 2004 (InstantDoc ID 42452).

### Acyclic Graphs

Our examples use a Web links scenario. Before we begin, run the code that Listing 1 shows to create the Links table and populate it with sample data. The Links table contains a row for each Web link, including two attributes: source page ID (src_pid) and target page ID (tgt_pid). The sample data represents an acyclic graph, meaning that no path you can follow leads from a page to itself. Figure 1 depicts the acyclic directed graph represented by the Links table.

The first situation we'll look at is a request to generate the transitive closure of the Links table. Table 1 shows the desired output, which contains all pairs of source/target page IDs that have a path between them. Listing 2 shows the solution to the request.

At callout A, the anchor member of the CTE called TC returns all (source, target) Web-page pairs from the Links table, representing target pages that can be accessed directly from the source pages (following a single link). The recursive member at callout B joins the result of the previous step (TC aliased as P for Parent) to the Links table (aliased as C for Child) in order to return target pages that a user can access indirectly from the source pages (by following more than one link). The JOIN condition looks for rows in Links where the source page ID is equal to the previous step's target page ID. The SELECT list returns the source page ID from P and the target page ID from C.

The first invocation of the recursive member returns target pages that are two links away from the source. The second invocation returns target pages that are three links away from the source, and so on, until the recursive member finds no more linked pages. The outer query then selects the distinct list of source/target pairs from the result of the CTE. We applied DISTINCT because without it, you get duplicate result rows when multiple paths lead from a source page to a target. If you remove the DISTINCT, you'll see two additional rows in your result set. Both the (1,7) and (1,8) pairs have duplicates because paths exist from 1 to 4 to 7 and 1 directly to 7 for the pair (1,7) and 1×4×7×8 and 1×7×8 for the pair (1,8) in the input graph.

With a minor addition to the CTE, as Listing 3 shows, you can also calculate the minimum distance (shortest path) between the nodes of this acyclic directed graph. In other words, the minimum number of links that you need to follow to get from one page to another. In the anchor member, add the constant 1 as the distance because the anchor returns direct links. In the recursive member, the recursive member calculates the distance as the parent's distance plus 1. The outer query groups the result by source and target page IDs, and returns the minimum distance for each pair. Running the code in Listing 3 gives the desired results that Table 2 shows.

Note that the Web links graph is *equidistant*—that is, the distance between any source and target pair is considered the same (one click away), as opposed to a non-equidistant graph, which has an attribute containing the distance between nodes in cases where there might be a different distance for different pairs. So, we use the constant 1 as the distance between source and target Web pages, both in the anchor member and as the increment in the recursive member. Non-equidistant graphs (e.g., a map of roads and highways with different distances between various pairs of source and target locations) require a slight revision to calculate the cumulative distance. Simply replace the constant 1 with the attribute holding the actual distance value between the nodes.

### Cyclic Graphs

Cyclic graphs, in which a path might exist between a node and itself, are more complex to deal with than acyclic graphs, mainly because you could end up traversing paths infinitely. An example of a cyclic graph is a Web site, where you might end up in the same page you started browsing after following a series of links. The trick in dealing with cyclic graphs is to identify the cycle so that your code will know when to stop traversing.

Run the code in Listing 4 to repopulate the Links table with a graph that contains cycles. Figure 2 illustrates the relationships between the Web pages in the repopulated Links table. The Links table now represents a cyclic directed graph.

Listing 5 shows a solution that returns all possible paths and their lengths in a cyclic directed graph. In a moment, we show two modifications of this query that produce transitive closure and the shortest path between any two nodes of a cyclic directed graph. The query in Listing 5 navigates paths starting from each page as anchor, since each page might be the beginning of some navigation. While navigating, the query calculates the distance from the anchor page and constructs an enumerated path of page IDs leading from source to target. The recursive member filters out nodes for which it detects a cycle; a cycle is detected if a target page ID already appears in the source page's enumerated path. Your CTE identifies a cycle by using the LIKE predicate in the recursive member's filter. (For more information about the enumerated path construction and cycle detection, see "Cycling with CTEs.")

Running the code in Listing 5 produces the desired output that Table 3 shows. To return the transitive closure of links (cyclic directed graph) that Table 4 shows, you can revise the outer query in Listing 5 so that it returns only the distinct list of source and target page IDs. The code in Listing 6 has two revisions to the code in Listing 5. It returns only the pairs of source and target page IDs without returning the distance and the path attributes, and only the distinct list of pairs, generating as a result the actual transitive closure. Remember that transitive closure contains the distinct list of source and target items that have a path between them.

Finally, the code in Listing 7 shows how to return the shortest paths (as opposed to all possible paths) between pages in the Links relation representing a cyclic directed graph. In Listing 7, the CTE's body is the same as the CTE in Listing 5. The outer query calculates the minimum distance for each pair of pages, then generates the derived table MD from this data. Then the outer query joins all paths (an instance of TC called AP) to MD. The JOIN condition is based on source- and target-page ID matches, and the distance from AP must be equal to the minimum distance from MD. The join between AP and MD is the main addition to Listing 5's code, which returns all paths. The significance of the join is that you get back only pairs of pages that have the minimum distance between them, plus the actual paths. Run the code in Listing 7 to get the desired result, which Table 5 shows.

### Transitive Closure and Recursive CTEs

Now you can utilize the new recursive-querying capabilities of SQL Server 2005 to provide short, elegant, ANSI-compliant, set-based solutions to transitive closure and shortest path problems. You can implement the techniques we've described in applications that handle requests for hierarchical information such as you'd find in BOMs and applications that deal with reachability questions, such as for road systems. The solutions for acyclic graphs are fairly simple and are based on traversal of one level at a time. The solutions for cyclic graphs use similar traversal of levels but also utilize cycle-detection techniques to stop traversing when a cycle is detected. Using the techniques we demonstrated in this article, you can quickly respond to requests for finding a path between two items and finding the shortest path between two items.