Search XML open content with an edge table
Over the past 2 months, my columns have covered methods for storing open content in and retrieving it from your SQL Server database. Those articles showed how to use OpenXML to store open content as XML fragments in a column of your database (an XML fragment is a well-formed subset of an XML document). This method works well unless you ever need to query data stored as open content. For example, after you deploy the application, you might decide to extract one or more data items and store them in separate columns in your database. Or maybe you just need to search for some data in the open content. Both operations can prove difficult because OpenXML stores the open content as text. But a little-known facility that's built into OpenXML can help you solve this problem.
OpenXML lets you define a rowset or relational view over your XML document. You define the structure, or shape, of the rowset by using the WITH clause as Listing 1 shows. Two variants of the WITH clause let you use the structure of an existing table or view in your database or define your own structure. But you can omit the WITH clause altogether to produce a different kind of structure called an edge table.
An edge table is a flattened view of the XML document in which each XML entity corresponds to a row in the edge table. If you're familiar with the XML Document Object Model (DOM), you can think of each row in the edge table as a node in the logical tree representation of the XML document. Each row of the edge table includes a set of columns that describe the entity, including the entity type (e.g., element, attribute, comment, processing instruction, text), name, namespace, data type, a unique ID, and a reference to the entity's parent. Figure 1 shows an example of an edge table obtained from OpenXML for the XML document in Figure 2. Like the DOM, the edge table provides a logical tree representation of the XML that you can generate for any XML document or well-formed fragment.
Using an Edge Table
Before I explain how to use the edge table, let me first describe an example scenario. Suppose you're a candy distributor. From your candy manufacturers, you receive product lists in XML format that you store in your database's Products table. Suppose further that each vendor includes for each product a complete ingredient list that you store as open content in the table's Overflow column. Now, let's say that you want to give your customers a search function that can retrieve from your database a list of all products that contain a particular ingredient.
Assuming you stored the open content by using OpenXML's @mp:xmltext meta-property, you now have a column in your Products table containing the XML representation of a list of ingredients as well as any other open content the vendor has sent. Let's say you want to retrieve a list of candy that contains chocolate as an ingredient. The problem becomes, How do you query the data that's stored in the column? You might consider using the CONTAINS keyword in your T-SQL query, but that could give you false positive results because it doesn't distinguish chocolate as an ingredient from the word chocolate used in other contexts (e.g., the Category element in Figure 2) or as the name of elements or attributes in the open content. In this situation, an edge table can help.
Instead of querying the Overflow column directly, you can retrieve its contents, parse them by using sp_xml_preparedocument, then query the data by using OpenXML in edge-table format. Before we dive into the example, you need to create in your database a table that contains a list of products. Listing 2, page 38, shows the T-SQL code you can run in Query Analyzer to create the Products table and the OpenXML code you use to store the ProductList from Figure 2.
But before you execute the code, you need to insert the entire XML document from Figure 2 as a text-string parameter of sp_xml_preparedocument. The second comment in the code shows where to place the XML document. After you edit the code, execute the query to create and populate the Products table. Doing this fills the Overflow column with the weight, category, and ingredient list.
Listing 3, page 38, shows an example query that uses an edge table to query the Overflow column in the example scenario's Products table. It begins by querying the Products table and retrieving the Overflow column. The T-SQL code parses the XML it retrieved from the column and assigns to the T-SQL variable @odoc a handle to the in-memory representation of the parsed XML document. The subsequent SQL query searches the edge table obtained from the call to OpenXML to determine whether the candy's ingredients include chocolate. The query finds all rows with nodetype equal to 3 (text nodes) and a text column that contains the word chocolate. Selecting text nodes avoids returning any element or attribute names that contain the word chocolate. To ensure that the query selects chocolate from the ingredients and not the category, the EXISTS clause checks that the text node's parent belongs to the urn:mm-ingred namespace.
This example is highly simplified and would be inefficient to use under high transaction loads because you'd need to process the XML for each product in the Products table separately. To increase efficiency, you can precreate edge tables that contain the open content for each product. You can then store these pre-created edge tables in a separate table in your database, indexed by ProductID. Prestoring the edge tables avoids the need to retrieve and parse the XML in the Overflow column and lets you retrieve a list of products with a single query instead of processing each row of the Products table separately. This approach could dramatically increase your query's efficiency, albeit at the expense of the time and storage required for the precreated edge tables.
Edge tables provide an alternative view of an XML document that you can process with the SQL language. Additionally, when querying an edge table, you can use the full SQL language, including aggregate functions such as COUNT(), MIN(), and MAX(), or you can perform context-sensitive searches by querying for specific tags. If you face a scenario similar to the one I described here, edge tables might give you a more efficient way to query XML.