Many types of problems require you to analyze samples of data in sequences, looking for valuable information in data organized according to some order. For example, you might want to analyze sales figures over a time series, stock rates against a series of prices, or test results against a series of ratings. Most SQL Server 2000 procedures that analyze samples of data over a sequence can be very slow because they use cursors and set-based techniques. Now, SQL Server 2005 provides recursive common table expressions (CTEs) that greatly speed up analysis. (For an overview of recursive CTEs, see "Get in the Loop with CTEs,"April 2004, InstantDoc ID 42072.)
I'll show you some fast and powerful analysis techniques that use recursive CTEs. DNA sequencing, which involves analyzing patterns in sequences, provides the scenario for our examples. But first, I want to thank Rona Kisilevitz from Intel at Kiryat Gat, Israel, who helped me with the biological background information needed for these scenarios.
Don't forget to check out the sidebar "The Logical Puzzle," page 22, for the solution to last month's logical puzzle.This month's puzzle presents a game-show scenario in which you have to guess which curtain hides the prize.
DNA sequencing is the process of determining the order of the pairs of chemical building blocks that make up a particular chromosome. DNA consists of sequences of four building blocks, or bases: adenosine (A), cytosine (C), guanine (G), and thymine (T). In their analyses, scientists look for certain patterns within mapped regions of a DNA string. A DNA string typically consists of mostly meaningless stretches of DNA known as junk DNA. For example, human DNA, made up of some 3 billion base pairs, is about 97 percent junk DNA.
That's quite a long sequence to search through. By locating certain patterns, scientists can identify the existence and placement of genes—small portions of a DNA string that control and organize an organism's vital functions. One way to identify genes within a DNA string is to look for known patterns or regions called promoters, which can indicate a gene's starting point. For example, a promoter called TATA Box has the sequence pattern TATAAA, the promoter GC Box has the sequence GGGCGG, and the promoter CCAAT Box has the sequence CCAAT. Finding the patterns that identify these promoters is a key technique in DNA sequencing.
Finding Patterns in a Sequence
To demonstrate how to use T-SQL to find patterns in a sequence, we'll create a solution that locates promoters within a given DNA sequence. First, run the code in Web Listing 1 (http://www.sqlmag.com, InstantDoc ID 48470) to create the tables called DNA, Promoters, and Patterns and populate them with some sample data.
The DNA table contains a small, imaginary DNA sequence; the Promoters table contains the IDs and names of promoters; and the Patterns table contains the patterns, or sequences of bases, that make up each promoter.We want to retrieve every promoter's starting point within the sample DNA sequence.
Typically, scientists analyze fairly small, isolated regions of about 1000 base pairs in the full DNA string. In some cases, though, there might be a need to analyze a larger string (up to the full length of the DNA string). If you want to test your solutions against large input, run the code in Web Listing 2 to populate the DNA table with 1 million rows containing random codes. Note, though, that I use the smaller set of sample data from Web Listing 1 for my examples.
Solution 1: A Normalized Representation
Listing 1 shows the first solution, which uses a normalized representation of the data created by running the code in Web Listing 1. The recursive CTE DNASeq isolates all promoter sequences within the DNA sequence. Callout A in Listing 1 shows the anchor member that joins the DNA and Patterns tables and matches all rows from the DNA table that contain a base that matches the first base of one of the promoters contained in the Patterns table. For each promoter, the query returns all bases that might be the promoter's first element within the DNA sequence. For each such possible starting point, the query returns the promoter's ID (pid), the DNA's subsequence starting point (DNAstart), the position within the subsequence (n), and the letter of the base (A, C, G, or T).
When the query finds a base that might be a promoter's starting point, the recursive member in callout B checks the next position in the DNA sequence for a base that matches the second base in the promoter. Recursion continues until the promoter's entire sequence is found or a mismatch occurs.The recursive member joins the previous result set (DNASeq) with the Patterns table (PNxt), then isolates the next row in the promoter's sequence:
Cur.n + 1
The query then joins the next row in the promoter's sequence to the next row in the DNA sequence (DNxt) and checks whether the bases match:
AND DNxt.base = PNxt.base
The recursive CTE DNASeq returns all DNA subsequences that have the same pattern of bases as a promoter. Of course, you need to keep only those subsequences that fully match a promoter's sequence.To eliminate partial matches, we use another CTE, PromoLen, which callout C shows. PromoLen returns the length of each promoter (ln). In the outer query, we join the result of the recursive CTE DNASeq with PromoLen, isolating the end points of the DNA subsequences that have the same base pattern and length as a promoter. Then, to return the promoter names, we add a join to the Promoters table, as callout D shows. By default, recursive calls are limited to 100 repetitions. The MAXRECURSION option 0 removes that limitation. Figure 1 shows the promoters' starting points that Solution 1 returns.
For short DNA sequences of about 1000 base pairs, this solution runs in a fraction of a second. For larger sequences with 1 million base pairs and more than 1000 occurrences of promoters, the solution runs in less than 30 seconds.
Solution 2: A Denormalized Representation
The second solution is based on a denormalized representation of both the promoters and the DNA sequence. We store each promoter as a single string (e.g.,TATAAA) and store the entire DNA sequence as a string in a VARCHAR(MAX) data type that can hold up to 2 billion base pairs.
First, run the code shown in Web Listing 3 to create the PromotersD and DNAD tables and populate them with the sample data in denormalized format. Web Listing 3 populates the DNA string in the DNAD table by querying the normalized form of the data stored in the DNA table created by Web Listing 1. An INSERT statement invokes a query with the FOR XML PATH option, which concatenates the letters into a single string. (If you want to instead populate the DNA table with the large DNA string, run the code in Web Listing 2 before you run the code in Web Listing 3.)
Listing 2 shows Solution 2, which is simpler and faster than Solution 1's normalized representation. In Solution 2, the anchor member joins the PromotersD and DNAD tables with a join condition that uses the CHARINDEX function at callout A to look for the first occurrence of a promoter within the DNA string. When a promoter is found, the recursive member (PromCTE) looks for the next occurrence of the same promoter, until it finds no matches.That's all there is to it. Running Solution 2 against the large DNA string generated by Web Listing 2 returns the result in 2 seconds, about 15 times faster than Solution 1.
If you originally stored your data in denormalized form and later have a solution that needs normalized data, you can easily convert denormalized data into normalized form. Split each string into the characters of the individual base names and run the code in Web Listing 4 to create and populate the Nums auxiliary table. Then join the denormalized table with Nums. Listing 3 shows the code snippet that normalizes the denormalized promoters (PromotersD table). As you can see, the join condition uses n
Fast and Flexible, or Faster and Specialized
I've shown you two ways to identify patterns in sequenced elements. Both solutions demonstrate the power of recursive CTEs. You can use Solution 1, which relies on normalized data, in a similar way with other types of sequenced elements, such as time series. Solution 2, although dramatically simpler and much faster, is also more specialized and limited. Solution 2 assumes that samples are fixed in length, that you can represent the samples in a denormalized form as a string, and that you can use string functions such as CHARINDEX to look for occurrences of a substring. In cases where you're working with variable-length samples, Solution 1's generic normalized representation will do the job.
Itzik Ben-Gan (firstname.lastname@example.org), a mentor at Solid Quality Learning, teaches, lectures, and consults internationally. He manages the Israeli SQL Server Users Group, is a SQL Server MVP, and is a coauthor of Advanced Transact-SQL for SQL Server 2000 (Apress).