When discussing table compression, I mentioned that the storage of string values was improved in only two basic ways:

  • When ROW compression was applied, char and nchar data was stored in a variable-length format. For example, if the value “hello” was stored in a char(40) column, SQL Server did not use 40 bytes to store it; when the data was retrieved, though, 40 bytes were still returned, as expected. (Row compression is enabled per table or index, as sketched just after this list.)
  • SQL Server 2008 R2 added Unicode compression into the mix, and many double-byte character values ended up occupying only half the space.
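
A minimal sketch of enabling ROW compression, against a hypothetical table name:

-- Rebuild a table with ROW compression (hypothetical table name);
-- char and nchar columns are then stored in variable-length format.
ALTER TABLE dbo.SomeLogTable
REBUILD WITH (DATA_COMPRESSION = ROW);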

The problem is that systems often store large amounts of highly compressible character data, and even with these options it is still held in an essentially uncompressed format.

For the customer system that we are discussing, I found that a large volume of XML data (around 700GB) was being stored in columns of varchar data type. The data could have been stored as the xml data type, but because it was never queried with XQuery functions inside the database, the customer had decided there was no need to take on that overhead and simply stored it all as varchar. On the rare occasions when they did need XQuery, they were happy to CAST the data to xml first.
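
That on-demand cast looks like this; the table, column, and element names here are hypothetical, purely for illustration:

-- Cast the varchar column to xml on demand, then query it with XQuery.
SELECT e.EventID,
       CAST(e.EventData AS xml).value('(/Event/Severity)[1]', 'int') AS Severity
FROM dbo.Events AS e;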

If you have ever used ZIP-style compression on this type of data, you’ll realize how well it compresses, so I decided to pursue ZIP-style compression for the relevant columns. Applying it is a CPU-intensive activity though, and generally I would not want SQL Server to be the application performing the compression.

Compressing the existing data would be a fairly intensive process, so I wanted it to be carried out over a period of time rather than converted all at once. To support this, each affected table needed to hold data in either compressed or uncompressed form, so I added extra columns. For example, look at the following table design:

CREATE TABLE dbo.SystemEvents
(
    SystemEventID bigint IDENTITY(1,1)
        CONSTRAINT PK_dbo_SystemEvents PRIMARY KEY,
    LoggedWhen datetime NOT NULL,
    LogData varchar(max) NOT NULL
);
GO

I would change this table as follows:

CREATE TABLE dbo.SystemEvents
(
    SystemEventID bigint IDENTITY(1,1)
        CONSTRAINT PK_dbo_SystemEvents PRIMARY KEY,
    LoggedWhen datetime NOT NULL,
    LogData varchar(max) NULL,
    IsCompressed bit NOT NULL
        CONSTRAINT DF_dbo_SystemEvents_IsCompressed DEFAULT(0),
    LogDataCompressed varbinary(max) NULL,
    CONSTRAINT CK_dbo_SystemEvents_Either_LogData_Or_LogDataCompressed_Is_Required
        CHECK (LogData IS NOT NULL OR LogDataCompressed IS NOT NULL)
);
GO

I then had the application changed so that it could work with either the LogData or the LogDataCompressed column being returned to it. It was also changed so that, for all new rows, the data was GZip-compressed before being sent to SQL Server.
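
The application performed the GZip work itself, but to make the moving parts concrete, here is a minimal T-SQL sketch of the write and read paths. It assumes SQL Server 2016 or later, whose built-in COMPRESS and DECOMPRESS functions use the same GZip format; the payload value is invented for illustration:

-- Write path: store a GZip-compressed payload and flag the row.
INSERT dbo.SystemEvents (LoggedWhen, LogData, IsCompressed, LogDataCompressed)
VALUES (SYSDATETIME(), NULL, 1,
        COMPRESS('<Event><Severity>2</Severity></Event>'));

-- Read path: return a single logical LogData value, regardless of
-- which column is populated for any given row.
SELECT SystemEventID,
       LoggedWhen,
       COALESCE(LogData,
                CAST(DECOMPRESS(LogDataCompressed) AS varchar(max))) AS LogData
FROM dbo.SystemEvents;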

Finally, we created a small utility application that would progressively compress all existing data in the table during periods of low system activity.
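
The real utility compressed the data in application code, but the batching pattern it followed looks roughly like this; the batch size and delay are invented values, and COMPRESS again stands in for the application-side GZip work:

-- Convert existing rows in small batches to limit lock and log impact.
WHILE 1 = 1
BEGIN
    UPDATE TOP (500) dbo.SystemEvents
    SET LogDataCompressed = COMPRESS(LogData),
        LogData = NULL,
        IsCompressed = 1
    WHERE IsCompressed = 0
      AND LogData IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to convert

    WAITFOR DELAY '00:00:05';  -- pause between batches
END;
GO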

The outcome was that the 700GB of data became 180GB, and the total database size dropped to 880GB.