Common table expressions can help you manipulate hierarchies with ease
Common table expressions (CTEs), a new T-SQL feature in SQL Server 2005 (formerly code-named Yukon), give you expressive powers for writing T-SQL queries to manipulate hierarchies easily and efficiently. Some major advantages of CTEs are that they're ANSI SQL-99 compliant, they let you write shorter code than ever to manipulate hierarchies, and their internal implementation uses set-based queries rather than cursors and temporary tables. Last month, in "Get in the Loop with CTEs" (InstantDoc ID 42072), I introduced CTEs and showed how to use them in single-parent hierarchies—hierarchies in which a child can't have more than one parent. This month, I look at how to manipulate multiparent hierarchies such as a bill of materials (BOM) as well as how to detect cycles and sort siblings.
Manipulating Multiparent Hierarchies
One significant advantage of CTEs over other solutions for manipulating hierarchies is that they let you generate short, efficient code for graph manipulation. For example, let's look at how to manipulate BOM data such as that in the BillOfMaterials table in the SQL Server 2005 AdventureWorks sample database. The BillOfMaterials table contains two columns that express hierarchical relationships: ComponentID contains the ID of a component, and ProductAssemblyID contains the ID of that component's parent. By following all containment relationships—those that exist between assemblies and the components they contain—and recursively following containment relationships for the contained components, you get all components and subcomponents that make up an assembly. Other relevant columns for the BOM manipulations are PerAssemblyQty, which contains the number of components in an assembly, and EndDate, which contains NULL if the containment relationship is in effect or a specific date if it's obsolete. This article's examples demonstrate the manipulation of only active containment relationships, so I'll always filter for components for which EndDate is NULL, denoting an active relationship.
A common request involving a BOM is called parts explosion. Given a certain component ID (say, 749), your task is to return the quantities of each directly and indirectly contained component. The solution is similar to returning a subtree of a given employee, as Listing 1 (which I discussed in detail last month) shows. However, the BOM parts-explosion solution also calculates the quantities as you progress down the tree levels and eventually aggregates their values. The trick here is to multiply component quantities as you move down the levels. For example, if item A contains 5 units of item B, and item B contains 3 units of item C, you know that for each unit of item A, you have 15 units of item C. In the AdventureWorks database, run the code that Listing 2 shows to return the total quantity of each component that contains component 749.
The CTE's Anchor Member (the non-recursive query) simply returns the active row from BillOfMaterials of component 749. The CTE's Recursive Member (the query that is invoked recursively) returns all components contained directly in the previous step's components. The Recursive Member multiplies the contained item's quantity by the quantity of the containing item from the previous step. Each component can appear multiple times in the result because multiple components can contain it, so the outer query groups the result by product ID and name and calculates the total.
Now, try to fulfill another typical request to test your knowledge in manipulating graphs: Return all components that contain component 996 either directly or indirectly. The solution, which Listing 3 shows, is similar to finding the management chain leading to a given employee in a single-parent hierarchy. The CTE's Anchor Member returns the row that contains the given component. The Recursive Member returns the components that directly contain the component from the previous step. The outer query joins the CTE's result to the Products table to return the product names for each component.
In "Get in the Loop with CTEs," I discussed ways CTEs can help you avoid infinite loops caused by cyclic relationships (a node becoming its own ancestor). But as I mentioned, Yukon has no built-in way to detect cyclic relationships. To detect cyclic relationships, which usually result from invalid data in your database, you need to develop a custom solution. To create a cyclic relationship for demonstration purposes, copy the Employee table in the AdventureWorks database to a temporary table called #EmployeeCycle, then change the ManagerID of Employee 12 to 267 by running the following code:
SELECT EmployeeID, FirstName, ManagerID INTO #EmployeeCycle FROM HumanResources.Employee AS E JOIN Person.Contact AS C ON C.ContactID = E.ContactID; UPDATE #EmployeeCycle SET ManagerID = 267 WHERE EmployeeID = 12;
Employee 12 is Employee 3's manager, and Employee 3 manages Employee 267. By setting Employee 12's manager to 267, you get a cycle of 12→3→267→12, where an arrow means "manages." If you use the subtree solution that Listing 1 shows to return Employee 12 and all subordinate employees, you enter an infinite cycle that causes your code to abort after the default 100 recursive calls. Of course, the query also returns incorrect results because the source data is incorrect.
You can start your cycle-detection solution by generating a result column called Path. This column contains an enumerated path of employee IDs leading to each employee, with each ID separated by a dot and with a dot at the beginning and at the end of the path. This path lets you easily find out whether an employee ID already appears in the manager's path. For example, .12.3. would be the path to Employee 3, and .12.3.267. would be Employee 267's path. You also generate another result column called Cycle that contains 1 if the current employee ID already appears in the employee's manager path—meaning that a cycle exists—and 0 otherwise. You add a logical expression in the Recursive Member's filter to return only employees whose manager's Cycle value is 0. Such a filter tells the Recursive Member to ignore subordinates of managers for whom you detected a cycle, thus preventing infinite recursive calls.
Now, add a filter to the outer query to return employees only if the Cycle value is 1, thus returning only employees for whom you detected a cycle. After detecting the cycle, you can fix the data by correcting the employees' manager IDs. The code in Listing 4 implements this algorithm. Note that you must run this listing's code from the same connection as the one where you created the #EmployeeCycle temporary table. Figure 1 shows the result of running Listing 4's code. SQL Server didn't abort the query, which returned Employee 12, because it detected a cycle. The Path column helps you identify the full cycle so that you can fix the data in your database.
A request I frequently get from readers and customers about hierarchical data manipulation is, How can I sort siblings under each parent according to a specified sort order? Say you need to return the subtree of Employee 12, sorting siblings by employee ID or by first name. SQL Server 2000 and earlier releases have no simple way to handle such a request, and unfortunately, SQL Server 2005 doesn't bring a built-in solution to this problem. However, with a bit of creativity, you can use CTEs to craft a custom solution. In much the same way I constructed an enumerated path of employee IDs in Listing 4's solution, you can construct a binary path of employee IDs leading to each employee. You convert the current employee ID from an int data type to a binary(4), then concatenate it to the path of the employee's manager. Call the concatenated binary path SortCol, and use it in the ORDER BY clause of the outer query, and you get siblings sorted by their employee IDs. Running the code that Listing 5 shows returns Employee 12 and all levels of subordinates, with siblings sorted by their employee IDs.
The code in Listing 5 also generates a pseudo column, Level, using the technique for calculating the employee level below the root of the subtree that I demonstrated in "Get in the Loop with CTEs." The outer query uses the Level column's value to indent each employee a certain number of spaces and pipes ( | ) according to his or her level in the hierarchy. The outer query returns the employee ID in parentheses followed by the employee name. Figure 2 shows the result of running Listing 5's code, listing employees who have the same manager, sorted by employee ID.
A more complex problem involves sorting by a column such as FirstName, which doesn't lend itself to constructing short binary paths out of small, fixed-size values because first names have varying lengths. To concatenate name values of fixed sizes so that you can sort correctly, first use the following query to return, for each employee, the employee's sort position among siblings under the same manager:
SELECT EmployeeID, FirstName, ManagerID, ROW_NUMBER() OVER(PARTITION BY ManagerID ORDER BY FirstName) FROM HumanResources.Employee AS E1 JOIN Person.Contact AS C ON C.ContactID = E1.ContactID
This query returns all rows from the Employee table and the corresponding first names from the Contact table. The query uses the new SQL Server 2005 ROW_NUMBER() function to return row numbers of employees separated by manager ID and sorted by first name. In other words, the function calculates the sort position of the employee among the employees who have the same manager (siblings) according to the employees' first names. Now, use Listing 5's solution, but instead of referring to the Employee table and constructing a binary path out of employee IDs, refer to a CTE that contains the query that calculates the sort positions by first name (call it EmpPos). Then, construct a binary path from the positions, as Listing 6 shows, instead of constructing the path from the employee IDs. Notice in the result, which Figure 3 shows, that employees who have the same manager are sorted by first name.
"It's the Question That Drives Us"
CTEs let you easily and efficiently manipulate multiparent hierarchies, avoiding lengthy iterative code. However, CTEs don't provide complete functionality for detecting cycles and sorting siblings. Sometimes, not having the "perfect" tools can be motivating because it gives you room for thought and creativity. One of my favorite quotes from The Matrix film series applies to puzzle enthusiasts like me. Neo, a computer programmer, is puzzling over what the Matrix is. Trinity, knowing the answer and its complexity, tells Neo, "It's the question that drives us—the answer is out there." The tough question or puzzle drives you to look for the answer. With the techniques I've shown, you can create your own cycle-detection and sorting solutions for working with hierarchies.