How to return data as hierarchical XML
Relational databases often store data in a tree or hierarchy structure, in which each item is related to another in a parent-child relationship. An example is the classic company management structure such as the one that Figure 1 shows. The hierarchy usually has an arbitrary depth, and each item contains information only about its relative position in the tree. Judging from the many questions that appear on newsgroups, using SQL Server to return the entire tree in XML format is a common requirement. For example, many people need to display a hierarchy on a Web page or report. Figure 2 shows a tree displayed as XML.
Although SQL Server 2000's XML support is flexible, it has no direct support for returning hierarchical XML to any arbitrary depth. However, you can use the FOR XML EXPLICIT T-SQL command to achieve this result. (For more details about using this technique, see Bob Pfeiff, "In Control with XML EXPLICIT," March 2001, InstantDoc ID 19810.)
Some techniques for accessing hierarchical data through T-SQL can become quite complex; this article explains the techniques and builds up the T-SQL in small steps to make it easier to understand. On the way, I cover methods such as using dynamic SQL and reveal a few tricks and tips, including how to use temporary tables instead of UNIONs in your FOR XML EXPLICIT statements to reduce the size and complexity of these statements. The article assumes an understanding of SQL Server 2000's XML capabilities and the use of the universal table. ("In Control with XML Explicit" also covers the universal table.)
The usual way of representing a hierarchy in a database is to use a table that has a self join, as Figure 3 shows. In this table, which contains information about each employee, the ReportsTo column is a foreign key that points to another row in the same table, forming the self join. A refinement of this approach is to split the information between two tables, one containing the information about employees and the other containing information about relationships, as Figure 4 shows. Listing 1, page 26, shows a script to create the tables and populate them with data (for simplicity, I didn't populate all the columns).
To return the data in XML format, you use the Explicit mode, writing queries to produce a result set that has a specific format. By adding the FOR XML EXPLICIT clause to the end of your query, you can return the results as XML rather than as a recordset. To create the XML that Figure 2 shows, you first need to create a query that produces the result set that Figure 5, page 26, shows. Note that the column names, the data order, and the data itself are each important for establishing the hierarchy.
The structure in Figure 5 is known as the universal table. This term is confusing because there is no actual table—you just need to write a query that returns data in this format. The universal table format is a means of visualizing the output of the required query. Adding FOR XML EXPLICIT to that query produces the XML in Figure 2. SQL Server is fussy about the universal table's structure. The column names and the row order must all be correct, or an error will occur—or worse, the query will produce XML with an incorrect structure. By comparing the data in Figure 5 to the XML in Figure 2, you can start to work out how SQL Server maps the data and column names to XML.
The special Tag and Parent columns contain metadata that defines the XML hierarchy. If Parent is 0 or null, this row becomes the root node. For all other rows, the Parent column stores the parent's tag. The row order is important here—the children must immediately follow the parent.
The names of the other columns also contain metadata that SQL Server uses to define the rest of the XML structure. In the example in Figure 5, Employee!1!EmployeeID adds the attribute EmployeeID to an element named Employee. The 1 denotes the tag number and ensures that the element is correctly placed in the hierarchy. The value of the attribute is the value of the column. The column Employee!1!FirstName adds a second attribute, FirstName, to the Employee element. Now you know enough about the universal table to produce hierarchical tree structures.
Now you need to write a SQL query to produce the required result set. This process isn't always straightforward because the data probably isn't stored in the native format the universal table requires, so you'll need to convert some column names and data structures. A common way to perform this conversion is by using UNIONs to join result sets to produce the required universal table format. Although it's not elegant, this approach is easy to understand and reasonably efficient. An alternative is to create a temporary table that has the rows and column names required for the universal table, then select from that table. For more complex hierarchies, this approach has some benefits, as I show later.
Let's see how to use UNIONs by writing the SQL that produces the XML in Figure 2. To make the code easier to understand, let's start by writing the SQL to build the top two reporting levels. In Query Analyzer, run the query that Listing 2 shows to produce the XML that Figure 6 shows.
Note that you need to define the column aliases only with the first SELECT of the UNION statement; SQL Server uses these column names for the resulting recordset. I enclosed the column name aliases in square brackets because the exclamation mark isn't a legal character for a column alias.
As you can see, you need a lot of SQL to produce a fairly small amount of XML, although the SQL code can handle any number of employees. The SQL code becomes more complicated if you add an extra level—in this case, all the employees who report to Bert, Jack, or Donna. To add that level, you need to add an extra UNION, as Listing 3 shows. This addition produces the XML that Figure 7 shows.
In this approach, you need to use an extra UNION for each level of elements in the XML, making the SQL code unmanageable. This example has the additional difficulty of requiring an increasing number of self joins at each level. It's apparent that using UNIONs isn't feasible when you have a large number of levels. When each level has a dependency on previous levels, a better solution is to use a temporary table.
Using a Temporary Table
To avoid UNIONs, you can create a temporary table containing the columns that FOR XML EXPLICIT requires, then populate it with the appropriate rows. In other words, you create a temporary table that has the structure of the universal table that Figure 5 shows. By using the following SQL code to create a temporary table called #TreeXML, you can return the data in XML format:
SELECT * FROM #TreeXML FOR XML EXPLICIT
For more complex queries, this technique avoids the need for one large SQL statement containing several UNION statements. As I demonstrate later, it's also useful because you can extend the code to handle a variable number of columns, as in the example tree structure. The down side is that creating temporary tables is typically slower than the relatively efficient UNION statement. Therefore, the technique is generally useful only for more complex requirements.
Listing 4, page 28, shows the complete code to generate a stored procedure that will create the complete XML hierarchy that Figure 2 shows. The procedure has two steps. The first step creates a temporary table that has the same column names as the universal table. The code then populates the table a level at a time. The top level item (Fred) is added first; note that you need a join between the Relationship and Employee tables to retrieve all the data items. The code then adds all employees who report to Fred. The code performs a self join back to the temporary table to get details of the Employee!1!EmployeeID and Employee!1!FirstName columns, which you just calculated. You can do a join back to the previous level each time you insert a new level, so you don't need an ever-increasing number of joins as with the UNIONs solution in Listings 2 and 3. Running this stored procedure returns the full required XML that Figure 2 shows.
Coping with an Arbitrary Number of Levels
The preceding solutions assume a fixed number of levels. In practice, that's unlikely to be the case for this type of business problem, so you have to cope with an arbitrary number of levels. You've probably already noticed in Figure 5 that the number of columns in the universal table depends on the number of levels present in the hierarchy, so the column count varies according to the data. This fact presents a problem because you can't vary the number of columns in a SQL statement on the fly. Column names are intended to be fixed labels to define the data content in the rows. However, for the universal table, Microsoft decided to use column names to store metadata instead.
Luckily, help is at hand. You can use dynamic SQL, which lets you construct any SQL statement within a stored procedure and then run it by using the EXEC command. As a simple example of using dynamic SQL, let's run a straightforward query on the Pubs database:
SELECT title FROM titles
You can run the same query by using dynamic SQL as follows:
DECLARE @SQL varchar(100) SELECT @SQL= 'SELECT title FROM titles' EXEC (@SQL)
Although both code examples produce the same result set, using dynamic SQL within a stored procedure carries a price. The first time you execute a stored procedure, SQL Server performs several checks before compiling and running it. Among other things, it checks for correct syntax and valid column and table references. The query optimizer then produces a query plan that determines which indexes to use. SQL Server holds this query plan ready for use each time the stored procedure runs. None of this occurs with dynamic SQL because the SQL Server engine doesn't know what SQL it will receive until you run the query. The result is that with dynamic SQL, you have an increased risk of a runtime error and performance suffers. Fortunately, for this type of business problem (e.g., a management hierarchy), data volumes probably won't be large, and if you're careful, syntax errors shouldn't occur.
Listing 5 shows the entire stored procedure that lets you use dynamic SQL to handle multiple levels, breaking it into chunks to explain how it works. If you load this stored procedure into SQL Server and run it in Query Analyzer, it returns the familiar XML from Figure 2.
At callout A in Listing 5, the stored procedure builds a temporary table; the code at callout B populates the table with one row for each employee. The table has an extra column for storing the depth of each node. At callout C, the procedure adds the depth of a particular node to each employee record by stepping through a loop that updates the appropriate employee records for each level. At first, this might appear inefficient, but the code traverses the loop once for each level, not once for each employee. In practice, most trees have many more nodes than levels; for example, a management structure for a company of 100 employees probably has only about 5 levels, so the loop will be traversed 5 times rather than 100.
Note that the code hasn't resorted to dynamic SQL yet; it's best to do as much as possible before that becomes necessary. Once you start using dynamic SQL to create tables, you can't go back to ordinary SQL because the compiler won't know anything about the new tables and will produce error messages about invalid references. Consequently, putting as much data as possible into the #Level temporary table will make things easier later. After populating the table, the code at callout D simply uses a SELECT max() statement to calculate the greatest depth, which in turn determines the number of columns required for the temporary table to return the XML.
The code at callout E starts constructing the string that will create the temporary table I described in the Universal Structure section. You've now reached the point of no return; from now on, you construct dynamic SQL strings and run them with EXEC(). The code at callout F adds all the top-level nodes (which, in this case, is just Fred). After that, the code at callout G loops around each level and adds the nodes from that level until it gets to the deepest one (the code at callout C already calculated the value of that level).
Here, the process gets a bit more difficult to understand. Unfortunately, Query Analyzer isn't much help because it doesn't color-highlight keywords for stored procedures as it does for normal T-SQL. One way around that and a useful technique for debugging is to use a PRINT @SQL statement in place of the EXEC (@SQL) statement. You can paste the SQL code into another Query Analyzer window and run it to see the syntax highlighting. This code has fewer joins because you loaded much of the data into the #Level table. The final step, which callout H shows, is to return the XML data.
Another Solution: XSLT
One of the main objectives behind the development of XML was to allow data exchange between different types of data stores. If you're writing stored procedures to export data from SQL Server, perhaps using Data Transformation Services (DTS), you'll want to convert the XML into the required target format by using SQL Server's XML support. The way Microsoft chose to implement XML in SQL Server doesn't make this conversion easy for some types of data structures, although it's achievable.
Many project architectures require a SQL Server—only solution for the production of XML. However, it's always important to look at the bigger picture. If, for instance, you have a Visual Basic (VB) front end or use Extensible Style Language Transformations (XSLT) to transform the data, it might be easier to return the XML data as "flat" (non-hierarchical) XML, then convert it to hierarchical XML elsewhere. To illustrate this technique, let's briefly look at a solution that uses XSLT.
To generate flat XML data, you can use a simple FOR XML RAW T-SQL statement, as Listing 6, page 29, shows. If you run this code in Query Analyzer, you get the XML that Figure 8, page 29, shows. Although the result isn't hierarchical XML, it contains the extra attribute ReportsToEmployeeID, which provides hierarchical information. XML parsers can't interpret this information directly—they treat it as just another attribute—so you need to convert the flat XML to hierarchical XML by using the ReportsToEmployeeID attribute.
XSLT's use of recursive techniques makes the conversion relatively easy. XSLT offers many ways, of varying efficiency, to convert the data. The code that Listing 7 shows is one possible, very short, way. If you use MSXML 3.0 to transform the XML in Figure 8 and the XSLT from Listing 7, you produce the XML structure that Figure 9 shows (tidied up a little for clarity).
The XSL has two templates. The first one matches all top-level nodes (i.e., all nodes in which ReportsToEmployeeID equals EmployeeID). The second template then recursively evaluates all other nodes. Note the addition of the <root> and </root> tags, which ensures that the XML is well formed. For more information about XSLT, see the book XSLT Quickly by Bob Ducharme (Manning, 2001).
Most developers will need to store hierarchical data in SQL Server at some time. I used an employee reporting structure here as an example, but you could use the techniques equally well for a bill of parts for a car or a dynamic menu system for a Windows application. You can use data stored in SQL Server to create hierarchical XML, either directly in SQL Server by using the built-in XML features or externally by using XSLT.