Knowing when to choose the CLR—and when not to—unlocks its potential
The integration of the .NET Common Language Runtime (CLR) in SQL Server 2005 lets you develop .NET objects such as user-defined types (UDTs), user-defined aggregates (UDAs), user-defined functions (UDFs), stored procedures, and triggers in any .NET language, including C# and Visual Basic. Programmers who are experienced with .NET development probably find CLR integration appealing. But programmers and DBAs who are mainly experienced withdevelopment sometimes find the idea appalling—like giving matches and knives to kids. However, in the hands of mature, responsible people, knives and matches can be useful tools. Likewise, the CLR is a powerful tool when you use it appropriately. To help you learn when it's best to use the CLR and when to choose T-SQL, I've created a couple of simple examples that demonstrate the differences you'll see in T-SQL and CLR performance and functionality in different scenarios.
For these examples, I'm assuming you have basic experience in developing and deploying CLR objects in SQL Server. For your convenience, I provide .NET code in both C# and VB, and you can download all the code examples at http://www.sqlmag.com, InstantDoc ID 49429.
Setting Up the Sample Database, Objects, and Routines
First, to use CLR objects in SQL Server, you need to turn on the CLR enable server configuration option, which is disabled by default, by using the following statement:
EXEC sp_configure 'CLR enable', 1; RECONFIGURE;
Next, you need to set up the sample database, TSQLCLR, which contains the Customers and OrderDetails tables that I refer to in this article's examples. Run the code in Listing 1, which should take a few minutes to finish.
The Customers table in the TSQLCLR database is populated with 100,000 rows from the Customers table in the Northwind database. Similarly, the OrderDetails table in TSQLCLR is populated with 1 million rows from the Order Details table in Northwind. Typically, it's a bad idea to just duplicate rows from a source table to generate sample data for performance tests; doing so can lead to skewed performance results because the plans that the optimizer generates can vary based on data density. However, for our purposes, the data-duplication approach works fine.
Finally, you need to create the T-SQL and CLR-based routines that I discuss in this article. To create the fn_RemoveChars T-SQL UDF, run the code that Listing 2 shows. Then, deploy the C# code that Listing 3 shows or the VB version in Listing 4. If you're not familiar with deploying CLR objects in SQL Server, you'll find step-by-step instructions in the sidebar "5 Steps for Developing and Deploying CLR Code in SQL Server," page 18.
When CLR and When T-SQL?
To figure out which scenarios require TSQL code and which will benefit from CLR code, consider which activities each is best at. T-SQL is more powerful and efficient than .NET code when the bulk of the activity involves data access and manipulation. But T-SQL is a declarative language, not a procedural one, and isn't designed for computation-intensive activities such as procedural or iterative logic, string manipulation, or data formatting in general. For these activities, .NET is more efficient. Let's take a look at a couple of examples.
String manipulation. The first example involves string manipulation, which is a weak point forT-SQL and a strong point for .NET. T-SQL supports basic string-manipulation capabilities through built-in functions and predicates.When the built-in T-SQL functions meet your needs, you should use them because you'll find it hard to beat the performance of built-in functions with your own code, regardless of the development environment you use. Moreover, by using built-in T-SQL predicates (e.g., LIKE), you let the optimizer work its magic by optimizing the execution of your code. But if you need the ability to perform more complex string-manipulation tasks than T-SQL supports, it's a different story. Attempting to use T-SQL code will typically require you to implement awkward, iterative logic. T-SQL simply doesn't have the tools for such purposes.
However, .NET provides a superior environment for string manipulation. For example, it supports regular expressions, which are powerful tools for pattern matching and replacement. Even if you don't rely on .NET for string manipulation and decide instead to implement a similar iterative or procedural algorithm that uses both T-SQL and .NET, you'll see that the .NET code is simply much faster in such activities. Note that ANSI SQL supports regular expressions,but that support isn't implemented yet in T-SQL.
Suppose you need to be able to perform a complex string replacement based on patterns. T-SQL's REPLACE function accepts three character-string arguments (let's call them str1, str2, and str3).The REPLACE function replaces all occurrences of str2 in str1 with str3. However, both str2 and str3 are constant strings, not patterns. T-SQL has no built-in function that lets you replace all occurrences of a pattern with another pattern.For example, consider the Phone column in the Customers table that you created earlier. Suppose you want to remove all special characters from the Phone column—that is, all characters that aren't a digit (0-9) or an alpha character (a-z or A-Z).You might need to do this type of replacement to identify "logical" duplicates of phone numbers, which users might enter with slight variations. For example,the phone numbers (503) 555-7555, 503-555-7555, and 503.555.7555 should all be treated as identical. Similarly, the phone numbers 555-FILM and 555 FILM should also be treated as identical, whereas 555-FILM and 555-FILK should be treated as different.When you strip the special characters from the phone numbers, you'll get strings such as 5035557555 and 555FILM, which you can compare to determine logical uniqueness.
Let's start with the T-SQL implementation of the fn_RemoveChars UDF, which Listing 2 shows. The UDF accepts the inputs @string and @pattern and removes all occurrences of characters in @string that match the pattern you provide for @ pattern.The UDF mainly relies on the PATINDEX built-in function, which returns the first position of a pattern within a string. To iterate through the occurrences of the pattern, the UDF invokes PATINDEX in a loop and removes the character where it finds the pattern. It's difficult to avoid iterative logic to implement this task. Furthermore, the UDF's functionality is limited—it lets you remove characters only where it finds the pattern. You'll also find it hard to replace patterns with a different pattern rather than with a constant string.
Before you test the UDF, run the following code to enter the context of the TSQLCLR database, and to turn on the STATISTICS TIME session option:
SET STATISTICS TIME ON;
Next, run the following query, which strips all special characters from the phone numbers in the 100,000-row Customers table:
On my test machine, this query ran for 53 seconds; two of those seconds were for generating the output.To measure the runtime without generating the output, turn on the Discard results after execution option in SQL Server Management Studio (SSMS) under Tools, Options, Query Results, SQL Server, Results to Grid.
Next, examine the definition of the fn_RegexReplace CLR UDF in the C# code at callout A in Listing 3 (or the code at callout A in Listing 4 for the VB implementation). The UDF's simple definition accepts three inputs: input (input string), pattern (regular expression pattern to find), and replacement (replacement string pattern). The UDF first determines whether one of the inputs is NULL, in which case it returns a NULL. If none of the inputs is NULL, the UDF simply invokes the Regex.Replace function, which internally handles the string-pattern replacement based on regular expressions. For simplicity, I didn't include exception-handling code in any of the routines in this article, although you would in production code.
One reason this UDF is dramatically more powerful than the T-SQL one is that it supports regular expressions, which have versatile capabilities. Unfortunately, covering regular expressions is outside the scope of this article. But suffice it to say that you can do much more with this UDF than you can with the T-SQL one. Now, test the function by running the following query:
N'\[^0-9a-zA-Z\]', N') AS
This query ran in under 3 seconds on my test machine, with two seconds for generating the output. In other words, it ran about 50 times faster than the T-SQL UDF.
Stored procedures. Just as I was about to develop my own CLR stored procedure to demonstrate what you shouldn't do with CLR code, I stumbled on a perfect example in SQL Server 2005 Books Online (BOL). The stored procedure queries a table, uses a loop to aggregate all values from one of the table's columns, and returns the result value in an output parameter. I made some revisions to the original stored procedure, including directing the query to our 1-million-row OrderDetails table and aggregating two columns, Quantity and UnitPrice, instead of one to return both result values as output parameters.You can find the definition of the usp_OrderDetailsTotals CLR stored procedure in the C# code at callout B in Listing 3 (or the VB code at callout B in Listing 4).
The procedure's logic is simple: It creates a connection to the database in which the stored procedure resides; loads the result set of the query to a SqlDataReader object called reader; and iterates through the records in reader, adding in each iteration the current record's values to the output parameters—simple iterative aggregation logic.
The author of the original procedure probably intended to demonstrate the syntax for creating a CLR stored procedure and the use of output parameters. However, such logic is a classic example of what you should never do in .NET without a compelling reason—for example, if you need to perform complex computations (or some manipulation that makes more sense with CLR code) for every record in the data reader object. I have to apologize for publicly pointing out bad code in BOL, but it's vital to understand this is exactly the type of activity that requires T-SQL, both for performance and simplicity.
To test the CLR procedure, run the following code:
@sumprice AS MONEY;
@sumqty = @sumqty OUTPUT,
@sumprice = @sumprice OUTPUT;
SELECT @sumqty AS sumqty,
@sumprice AS sumprice;
On my system, this code ran for just over 4 seconds. In T-SQL, you don't need a stored procedure for this task. Instead, use the following simple query:
SUM(UnitPrice) AS sumprice
This query ran for about half a second on my system; that's about eight times faster than the CLR code. By using a set-based query, you let the optimizer do what it does best: choosing an efficient execution plan for your query. If you use the interative code in the CLR implementation, you force a specific execution plan and incur the overhead of a record-by-record manipulation.
The Real Question
By now, you've realized that the real question isn't "CLR or not CLR?" T-SQL is more powerful and efficient in data manipulation scenarios, whereas the CLR is richer and more efficient in scenarios that are computation intensive or involve procedural or iterative logic or string manipulation.When you take the time to learn how to identify whether T-SQL or the CLR is the best choice for a given situation, you'll be able to get the best performance out of your queries.
Author's Note: I'd like to express my deepest gratitude to my good friend Dejan Sarka, a fellow Solid Quality Learning mentor who taught me how elements such as XML and the CLR work with relational database models.