VarBinaryComp compresses and decompresses binary data up to 2GB in size
|Executive Summary: The Common Language Runtime (CLR) user-defined type (UDT) VarBinaryComp takes advantage of SQL Server 2008's support for large types to store, compress, and decompress binary data that's up to 2GB in size.|
New in SQL Server 2008 is support for large user-defined types (UDTs). We’re going to build a Common Language Runtime (CLR) UDT that allows us to store, compress, and decompress binary data (up to 2GB in size) inside SQL Server 2008. This technique works only in SQL Server 2008 because only SQL Server 2008 allows CLR UDTs larger than 8KB. The UDT we’ll build, called VarBinaryComp, accepts both uncompressed data (which it will compress before storing) and compressed data as input. It also has properties to determine the length of the compressed and uncompressed data. To implement this UDT, we’ll leverage the compression/decompression logic covered in “Zip Your Data,” InstantDoc ID 49065. For background on compression fundamentals, you might want to review this article and its accompanying sample code before proceeding.
A little refresher course on CLR UDTs might be helpful as well. A CLR UDT is a way of extending SQL Server’s existing type system by using a Microsoft .NET Framework language such as C# or Visual Basic .NET, and it enables you to store CLR objects inside a SQL Server database. Don’t confuse a CLR UDT with a user-defined data type, which simply acts as an alias over a built-in data type. In addition to storage, a CLR UDT can provide properties and/or methods that allow for instantiation/manipulation of the underlying data.
CLR UDTs aren’t intended to turn SQL Server into a full-blown object-oriented database. See the Web-exclusive sidebar “CLR UDT Caveats” (www.sqlmag.com, InstantDoc ID 98304) for more information about the proper place of CLR UDTs and methods for overcoming common objections to using them.
An easy way to understand the potential power of a CLR UDT is to examine how to use the built-in XML data type. Introduced in SQL Server 2005, the XML data type lets you store, retrieve, and manipulate data in XML form. Let’s take a look at the code that I wrote in Web Listing 1 as an example. After connecting to the AdventureWorks sample database, the code declares a variable (@XML) of type XML. The code then issues a relational query (by using the FOR XML AUTO clause) and stores the results in the XML variable. Next, the code uses the .value method of the XML data type to retrieve a single SQL type value from the XML data. Finally, the code demonstrates how the XML data type can be used as a column definition in a table (i.e., the HumanResources.JobCandidate table). The XML data type is pretty powerful stuff—and a great example of how the SQL Server Product Group has extended SQL Server’s relational model. We have the capability to do the same thing!
Developing a CLR UDT
The process of developing a CLR UDT in SQL Server consists of the following three steps:
- Code and build the assembly that defines the CLR UDT. To do so, you can use any language supported by the .NET Framework CLR that produces verifiable code, including C# or Visual Basic .NET. The data is exposed as fields and properties of a .NET Framework class or structure, and behaviors are defined by methods of the class or structure.
- Register the assembly. You can deploy CLR UDTs through the Microsoft Visual Studio user interface in a database project or by using the T-SQL CREATE ASSEMBLY statement, which copies the assembly containing the class or structure into a database. Note that I developed the VarBinaryComp CLR UDT by using Visual Studio 2005 and the July CTP build of SQL Server 2008. At the time, deployment to SQL Server 2008 through Visual Studio was not yet supported. Therefore, we’ll be using the T-SQL deployment method (along with an alternative debugging strategy), which I explain later. Also, don’t forget that you need to first enable CLR integration in the SQL Server instance you’re working with before you can use the CLR UDT. You can enable CLR integration by using the SQL Server Surface Area Configuration Tool.
- Create the CLR UDT in SQL Server. After you’ve loaded an assembly into a host database, you use the T-SQL CREATE TYPE statement to create a CLR UDT and expose the members of the class/structure as members of the CLR UDT. CLR UDTs exist only in the context of one database and, once registered, have no dependencies on the external files from which they were created. (To learn how to use a CLR UDT across databases, refer to the SQL Server Books Online topic “Using User-defined Types Across Databases” at msdn2.microsoft.com/en-us/library/ms178069.aspx.)
The VarBinaryComp UDT
Let’s now look at the VarBinaryComp UDT inside Visual Studio 2005. After downloading the sample code by going to www.sqlmag.com, entering InstantDoc ID 98305, and clicking the .zip file under Download the Code, open the SqlSvr_CompUDT.sln file. The UDT, VarBinaryComp, is implemented as a structure in the excerpt of the VarBinaryComp.vb file that Listing 1 shows. (Web Listing 2 shows the complete VarBinaryComp.vb file.) At callout A in Listing 1, the declaration uses the Microsoft.SqlServer.Server.SqlUserDefinedType attribute to identify this structure as a CLR UDT. Within the attribute, two property values have been set. The MaxByteSize property has been set to -1, which lets the UDT store more than the prior 8KB limit—up to 2GB.
The Format property has been set to UserDefined. Format determines how the UDT is serialized, as Native or as UserDefined. Format.Native, as the name implies, uses native SQL Server binary serialization. While very fast, this format is restricted to fixed-length, value-type data types (e.g., int, datetime). Because VarBinaryComp deals with variable-length binary data, I chose Format.UserDefined. This format gives us complete flexibility over the serialization process, but it also requires that we implement the IBinarySerialize interface. The IBinarySerialize interface consists of two methods—Write and Read—which appear at the end of Web Listing 2. Note these methods will be called each time you write or read any UDT property.
A CLR UDT must also implement the INullable interface (which consists of the read-only property IsNull) and a shared read-only method named Null (as callout C in Listing 1 shows). This Null method needs to instantiate and return a new instance of a UDT with its IsNull property set to true.
Finally, a UDT CLR must also implement the ToString and Parse methods to convert data to and from a string value. As the name implies, the ToString method (at callout B in Listing 1) converts the UDT to a string representation. For the VarBinaryComp UDT, this method isn’t useful because viewing binary data as a string provides little practical value. In fact, during my testing, I discovered that this method raises an error if the underlying data exceeds 8,000 bytes. Therefore, if you need to view the binary data as a string (or convert it to a string), you might want to consider adding a custom method (e.g., ToNvarchmax) to overcome this size limitation.
The Parse shared method (at callout G in Listing 1) allows a string to be converted into a new instance of a UDT. Although the Parse method isn’t the focus of this article, I did use it (which in T-SQL code involves setting a UDT variable equal to a String value/variable) to compress varchar, nvarchar(), and nvarchar(max) data types.
Optionally, a CLR UDT can implement additional methods and properties. VarBinaryComp implements several methods/properties specific to compressing/uncompressing data. Let’s review a couple of these. ParseVarBinaryU (at callout E in Listing 1) is a shared function that creates a new instance of VarBinaryComp—using uncompressed binary data as input. Notice that within ParseVarBinaryU, the code calls an “overloaded constructor” (i.e., the New function) to take care of the actual instantiation. It’s here that the code invokes the algorithm that compresses the binary data.
ParseVarBinaryC (at callout F in Listing 1) is another shared function—but one that creates a new instance of VarBinaryComp with data that has already been compressed. (This could be useful when an application compresses the data before sending it to SQL Server.) VarBinaryComp also implements a read-only property, UnCompressedLength, that represents the uncompressed length of the binary data (at callout D in Listing 1).
Compile and Deploy VarBinaryComp
After opening the SqlSrv_CompUDT\SqlSrv_CompUDT.sln file in Visual Studio, compile the project by selecting Build SqlSvr_CompUDT from the Build menu. Then, open the Deployment.sql file (located in the SqlSrv_CompUDT\Test Scripts folder) inside SQL Server Management Studio. After changing the Directory Path variable to match your machine settings, run Deployment.sql to deploy the assembly to the AdventureWorks database. Note that in this script I have deployed both the .dll and .pdb files to allow debugging of this UDT.
To verify your deployment, open the Test.sql file (also located in the SqlSrv_CompUDT\Test Scripts folder) inside Management Studio. Run this script to see how to use the VarBinaryComp UDT inside of T-SQL Code. Web Listing 3 shows the complete Test.sql script, but let me draw your attention to a few points of interest. First, in Listing 2, look at how a variable of type VarBinaryComp is declared and then instantiated.
As discussed previously, the shared method Parse-VarBinaryU lets us instantiate the UDT by using uncompressed binary data (i.e., a varbinary(max) data type) as input. Notice the double-colon notation—this is how you call a UDT shared method from T-SQL code. Just to be clear, shared methods can be called without first instantiating the UDT. For example, the Test.sql code excerpt in Listing 3 calls the shared method Compress to compress binary data and store the results in a variable of the standard type varbinary(max). In this manner, a UDT can also be used as a “collection” of useful, related functions.
Debugging a CLR UDT
Typically, debugging a CLR UDT is a fairly simple process that you can initiate within Visual Studio 2005 by selecting Start Debugging from the Debug menu. But as I mentioned, this capability isn’t supported with the July CTP of SQL Server 2008. Here’s an alternative approach you can use:
- Manually deploy your assembly to SQL Server 2008 by using T-SQL (as we have already done).
- With the SqlSvr_CompUDT.sln solution open in Visual Studio 2005, select Attach to Process from the Debug menu. Select the Show processes from all users check box, click sqlservr.exe in the Available Processes list box, and click Attach.
- Set breakpoints as desired in your UDT code, then run T-SQL code, a client application, or some other program that will make use of the UDT. In this case, you can run the SqlSrv_CompUDT\Test.sql script from within Management Studio.
- When you’re finished debugging, select Stop Debugging from the Debug menu.
The Integrated Sample
I’ve also built a sample client that demonstrates VarBinaryComp being used as part of an application. Open the solution file (WinApp_TestUDT\WinApp_TestUDT.sln) in Visual Studio, and you’ll see a simple Windows form application (which Figure 1 shows). The application lets you insert and select records that contain binary data representing a file. You can let the UDT handle all the compression/decompression logic, or you can optionally compress a file before inserting it.
Before running this application, open the CreateTableAndSps.sql file (located in the WinApp_TestUDT\Test Scripts folder) inside Management Studio; run this script to create a table which uses the Var-BinaryComp UDT as one of its columns. The script will also create two stored procedures that the application uses to insert and select data. The stored procedure used to insert data takes as input a standard varbinary(max) datatype; likewise, the stored procedure used to select data returns a standard varbinary(max) data type. By default, the compression/decompression logic happens inside these stored procedures—thereby making the UDT “transparent” to the application.
Where to Go from Here
The VarBinaryComp CLR UDT is fully functional and ready to help you reduce the size of binary data stored inside SQL Server 2008. However, you might want to consider implementing a number of potential enhancements:
- Alter the CLR UDT to be compliant with the popular Zip file format (described at www.winzip.com/aes_info.htm#zip-format)
- Allow for different compression algorithms. In “Zip Your Data,” I demonstrated how you could use either of the two compression algorithms that ship with version 2.0 of .NET Framework. You might want to re-enable this capability and/or add support for third-party algorithms.
- Add encryption. By using capabilities built into .NET Framework, you can easily encrypt binary data by making a “second pass” over the data after it’s been compressed (and then decrypt the data before it’s decompressed).
Support for large CLR UDTs really opens the door for developers to extend the built-in type system in SQL Server 2008. Although the potential for misuse exists, I believe this new capability—when used properly—provides a great degree of flexibility for storing rich data types (and accompanying logic) inside the relational database.