Save space with these SQLCLR compression functions
Relational databases provide an organized repository for data and store most data types efficiently, but there are exceptions. For example, you can often greatly reduce the size of binary-type data by compressing it. And if you have large volumes of binary-type data, compression can substantially reduce your storage requirements.
SQL Server doesn't provide built-in compression facilities, but you can add compression capabilities to SQL Server 2005 by leveraging the Common Language Runtime (CLR). Let's look at how to compress data in SQL Server 2005 and how to take advantage of the new varbinary(max) data type, which replaces the often inflexible—and now deprecated—image data type. After working through this article's examples, you'll understand how to compress SQL Server data and how to leverage Windows .NET Framework code to extend the power of SQL Server 2005.
For years, developers have had to create custom solutions to compress data before storing it in the database and decompress data after retrieving it. (For an overview of compression, see the sidebar "Compression Basics." The .NET Framework greatly simplifies these tasks. .NET Framework 1.1 provides the java.util.zip namespace, which offers methods for zipping and unzipping data.With this version of the Framework, for example, you can create wrapper classes to indirectly call java.util.zip from .NET languages such asVB.NET and C#. However, .NET Framework 2.0, which is part of SQL Server 2005 and Visual Studio (VS) 2005, features the System.IO.Compression namespace. By using the methods in this namespace, you can conveniently compress and decompress data.
The System.IO.Compression namespace, which is available to all .NET languages, offers two classes for compressing and decompressing data:DeflateStream and GZipStream. DeflateStream implements the DEFLATE algorithm (as defined in Request for Comment—RFC—1951). GZipStream implements the gzip format (as defined in RFC 1952),which compresses a single file and is also based on the DEFLATE algorithm. As the class names suggest, DeflateStream and GZip Stream work with streams of data. Streams aren't just sequences of bytes; they're objects, complete with methods for manipulating the objects.
Before looking at our sample compression programs, let's quickly walk through the key steps for creating a compression application. First, you need to add Visual Basic (VB) Imports or C# code, using statements for the System.IO and System.IO. Compression namespaces:
' VB Imports statements Imports System.IO _ ' for Stream object Imports System.IO.Compression _ ' for DeflateStream and _ GZipStream // C# code using System.IO; // for Stream object using System.IO.Compression; // for DeflateStream and GZipStream
(Note: Some code lines in text wrap to several lines because of space constraints.)
You also need to use the SqlBytes data type, a native SQL Server data type, as both a parameter and a function return value. SqlBytes is part of the new System.Data.SqlType name space, which provides classes for native data types within SQL Server 2005.According to the .NET documentation, these classes " provide a safer, faster alternative to the data types provided by the .NET \[CLR\]."As an added benefit, SqlType namespaces implement the INullable interface, which lets them contain an actual null value. Using native SQL Server data types also avoids type-conversion problems. Although you can use .NET common data types such as a Byte array within SQL CLR functions, the runtime will implicitly convert the .NET data types to a corresponding SqlType.Type conversions incur a slight performance penalty and open the door to possible conversion errors.For example,when we first wrote the CompressBytes function, we used Byte arrays. After we deployed the code to SQL Server 2005, we realized that the runtime was implicitly converting the Byte arrays to varbinary(8000).This unfortunate data-type conversion precluded working with documents larger than 8K.
To compress data, you need to create a managed-code user-defined function (UDF) of type SqlBytes:
Public Shared Function _ CompressBytes(ByVal _ UncompressedBytes As _ SqlBytes) As SqlBytes
You use both a MemoryStream object and a GZipStream object for the compression process, like this:
Dim outputStream As New _ MemoryStream ' Contains the _ compressed data Dim zipStream As Stream _ ' The zip stream used for _ compression zipStream = New GZipStream_ (outputStream, _ CompressionMode.Compress) _ ' instantiate
The function performs the compression by invoking the GZipStream object's Write method, which writes the compressed bytes into the MemoryStream object (output Stream). As you see in the following code snippet, the Write method requires the number of uncompressed bytes (we'll explain the significance of this in a moment):
zipStream.Write _ (UncompressedBytes.Value, _ 0, CInt(UncompressedBytes. _ Length))
With the compressed data now in output Stream, you just need to return the data from the function:
Return New SqlBytes _ (outputStream.ToArray)
Compressing data is that easy. And decompressing compressed data is almost as simple. You use the GZipStream's Read method to decompress data. As with the Write method, the Read method requires the number of compressed bytes that it will process. We found that keeping track of the number of compressed and uncompressed bytes was inconvenient when developing a generic compression and decompression wrapper class. We decided that adopting a chunk-oriented approach, in which we compress and uncompress a block of bytes at a time, was better because it abstracts the details of keeping track of the number of bytes.
Sample Compression Projects
You can download three sample .NET projects from the SQL Server Magazine Web site (http://www.sqlmag.com, InstantDoc ID 49065) to see how to use the System.IO .Compression namespace to compress and decompress data from within SQL Server 2005 or outside of it in a standard Windows Forms application. These sample projects use the AdventureWorks database because it contains data well suited for a compression test.The Document table in the Production schema has uncompressed Word documents stored in the Document column, which has a data type of varbinary(max).
To keep things simple in the first sample project,we didn't perform any database access. The program is a simple Windows Forms application that lets you pick a file, specify a compression method, and see the percentage of compression.The application also displays the elapsed time to compress the file and the time to decompress the file, as Figure 1 shows.
Because the process of compressing and decompressing data by using DeflateStream and GZip Stream are the same, you can encapsulate the logic for both compression algorithms into the same subroutine or function. As Listing 1 shows, the Compress Wrapper class contains the Compress and Decompress methods, both of which have a parameter to specify your choice of compression algorithm.The compression code needs to copy an uncompressed stream to a compressed stream.The decompression code needs to copy a compressed stream to a decompressed stream.You can encapsulate these two stream-copy operations into one generic stream-copy function. (For a C# version of CompressWrapper, see Web Listing 1.)
Notice at callout A in Listing 1 that the CopyStream method accepts Stream objects as parameters. Because both the Deflate Stream and GZipStream classes inherit from the Stream class, they are of type Stream.When Deflate Stream and GZipStream stream objects are instantiated, the second parameter of the constructor indicates whether the stream will contain compressed or decompressed data.When your application passes Stream objects as parameters to a method, you can call the Stream objects' Read and Write methods from within the method, which is what CopyStream does. If, for example, an uncompressed stream object is in the input parameter and a GZipStream object is the output parameter, CopyStream uses the uncompressed stream's Read method to get a chunk of uncompressed data from the input stream.The application then uses the compressed stream's Write method to compress the chunk and transfer the compressed data into the GZipStream output stream. Similarly, if a compressed GZipStream object is the input parameter and an uncompressed stream object is the output parameter, CopyStream uses the compressed stream's Read method to get a chunk of compressed data and uncompress it. The application then uses the compressed stream's Write method to transfer the uncompressed chunk into the uncompressed output stream. CopyStream uses a loop to call the Read and Write methods in chunks so that you don't have to specify the total number of bytes to compress or decompress.
Now that you've seen the basic approach to compression and decompression, the next step is to create a SQL Server project and run the CompressWrapper class in the SQLCLR, which is what our second sample project shows. (For more information about how to write and deploy CLR code in SQL Server 2005, see "Making the Most of the CLR," May 2005, InstantDoc ID 45673.) In the Solution Explorer, right-click the SQL Server project and select Add, New Item from the context menu. You'll see several available database object templates; click User-Defined Function.VS will create a class with the necessary Imports statements along with a sample scalar function. Before you modify this class, add our compression wrapper class (CompressWrapper.vb) to the project by clicking Add, Existing Item. As mentioned earlier, this class encapsulates compression/decompression logic for reusability. You might be thinking that for true reusability, we should compile this class into a separate assembly—and you'd be right.With that said, using custom assemblies in a SQL Server project requires a bit more setup. (See the Web-exclusive sidebar "Can I Have a Reference?" http://www.sqlmag.com, InstantDoc ID 49067, for tips about referencing custom .NET assemblies). So, we chose to stick with reuse at the class level. Import CompressWrapper.vb into your function class file, and add the two SQL function calls that Web Listing 2 shows. (For a C# version of these UDFs, see Web Listing 3.) Deploy the project by using the VS 2005 Deploy option.
Varbinary(max) to the Rescue
By looking at the SqlType class name, you can usually easily map the class to the corresponding SQL Server data type (e.g., Sql Money, SqlInt32). Because the compressed data is binary data, it's stored in the new varbinary(max) data type, which maps to the SqlBytes data type.
As we noted earlier, Microsoft has deprecated the image data type.Varbinary(max) replaces the image data type and is critical for our functions. If you've tried to work with an image data type inside a SQL Server 2000 stored procedure, you know that image data types are invalid for local variables. Moreover, although you can pass an image variable into a stored procedure, you can't change or select the value of the variable into other variables within the procedure itself. Varbinary(max) makes working with large binary data as easy as working with numeric data.
Putting It All Together
Now that we've walked through the basics of creating .NET UDFs, let's look at the third integrated sample we developed for you. From the project files available on the Web, open the WindowsAppTestCompression.sln file, where you'll find Windows application projects that can insert and retrieve documents from the Adventure-Works database.To make things a little easier to visualize (and to avoid modifying the original AdventureWorks schema), the sample uses a copy of the Production. Document table.The table is called Document-CompressEg, and it has three additional columns to track document size (before and after compression) along with the method of compression used.
To create this table (and the stored procedures that the Windows application uses), run the CreateTableAndSps.sql script file located in the project's Test Scripts folder. You can then insert and retrieve both compressed and uncompressed documents and see the amount of compression achieved for each document, as Figure 2 shows.
Where Do You Go from Here?
SQL Server 2005 provides greater flexibility and choice in deployment, storage, and programming options. For the project we've described, SQL Server 2005's CLR integration gave us the power to deploy complex compression logic to our database server, using a sophisticated programming language. In addition,the new varbinary(max) data type helped us easily manipulate, store, and retrieve large binary objects. Of course, you don't want to run out and start moving all complex business and technical logic to the database server. But with these enhancements, you now have more choices when designing and architecting your systems.