This is the 3rd article in a series on a database that needed to go on a diet for the new year.

In the first article, we saw why size really does matter, both for the DB itself and all the other copies of it that exist in most organizations. We then saw how to estimate the improvement in size with ROW compression. Our customer’s database that started at 3.8TB was reduced to 2.6TB by applying ROW compression without any code changes. Better still the performance of the I/O bound application improved significantly by compressing the tables and indexes.

In the second article, we saw how both ROW and PAGE compression change the internal structure of the database pages, and saw why PAGE compression could achieve such great size reductions. The question that we posed at the end of that article was that if ROW compression made everything better, wouldn’t PAGE compression make it all better again.

ROW vs PAGE compression

I’ve not seen a system where enabling ROW compression has caused any issue at all. Storing more rows per page, performing less I/O, and having more rows fit into memory buffers just seems to be a good story.

This is also particularly important if you are using the Unicode data types nchar and nvarchar. In the last article we saw that when the page was uncompressed, that nvarchar characters are stored with two bytes per character, a scheme known as UCS-2 encoding. Ever since SQL Server 2008 R2, when you compress the page, SQL Server uses the Standard Compression Scheme for Unicode (SCSU), which can drastically cut the size of the data. You can find more information on SCSU here: https://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_Unicode. For English and many other languages, text stored as nvarchar or nchar, can be further reduced in size by 50%. This effectively removes the penalty for using these wider data types. (Some languages do not benefit as well. Japanese is typically only reduced by 15% but this is still a worthwhile benefit).

PAGE compression, however, is a different story. Rather than assessing how to compress across a single row, decisions are being made across an entire page of data. That’s fine for reading, and for inserts to the end of the table, but not if you are constantly changing the rows on those pages.

While enabling ROW compression across the board seems to be a good thing, enabling PAGE compression across all tables will likely lead you into problems with the performance of your data modifications.

So the normal optimal outcome for most systems is to use a combination or ROW and PAGE compression, not just a single choice.

The same applies to indexes. While it might make sense for a table to have PAGE compression, the optimal setting for the non-clustered indexes on the same table might be ROW. These combinations of choices are all supported.

Partition Impacts

Yet another core concept with table compression is that while we have been looking at examples of it being applied to entire tables or indexes, it is able to be applied at the partition level, not just at the table or index level.

Different parts of each table might be used in different ways. For example, a table that holds transaction history might have a higher level of data modifications occurring on the most recent data, yet the older data in the same table might not be changing at all.

This means that a sensible strategy for a table like this might look like this:

In this case, the older data for the table (shown in green) is PAGE compressed, the current data (shown in red) is ROW compressed, and the non-clustered indexes are both ROW compressed. If there are any partition-aligned indexes, they might well use the same strategy as the base table.

Deciding when to use PAGE compression

Fortunately for us, Microsoft started adding dynamic management views to SQL Server back in SQL Server 2005 and has improved them ever since. We can use a DMV to determine how a table or index is used. The DMV sys.dm_db_index_operational_stats is very useful here. It includes columns that show how often scans and updates of various types occur. I’ve intentionally skipped any Microsoft-shipped tables.

                              
SELECT t.name AS TableName,

       i.name AS IndexName,

       i.index_id AS IndexID,

       ios.partition_number AS PartitionNumber,

       FLOOR(ios.leaf_update_count * 100.0 /

             ( ios.range_scan_count + ios.leaf_insert_count

               + ios.leaf_delete_count + ios.leaf_update_count

               + ios.leaf_page_merge_count + ios.singleton_lookup_count

             )) AS UpdatePercentage,

       FLOOR(ios.range_scan_count * 100.0 /

             ( ios.range_scan_count + ios.leaf_insert_count

               + ios.leaf_delete_count + ios.leaf_update_count

               + ios.leaf_page_merge_count + ios.singleton_lookup_count

             )) AS ScanPercentage

FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios

INNER JOIN sys.objects AS o

ON o.object_id = ios.object_id

INNER JOIN sys.tables AS t

ON t.object_id = o.object_id

INNER JOIN sys.indexes AS i

ON i.object_id = o.object_id

AND i.index_id = ios.index_id

WHERE ( ios.range_scan_count + ios.leaf_insert_count

        + ios.leaf_delete_count + leaf_update_count

        + ios.leaf_page_merge_count + ios.singleton_lookup_count) <> 0

AND t.is_ms_shipped = 0

ORDER BY TableName, IndexName, PartitionNumber;

This shows the percentage of time that the partition of the index or table has been scanned, and updated. The code is based on and updated from code in the following whitepaper: Data Compression: Strategy, Capacity Planning and Best Practices. In that whitepaper, Sanjay Mishra provides a good set of guidelines. My experience has been fairly much in line with what I’ve seen in that whitepaper.

Generally I recommend ROW compression across the board, but PAGE compression on partitions that are scanned more than 70% of the time, and updated less than 15% of the time. (These figures are not too far different from the whitepaper figures but are based on what I’ve found works well).

In that case, we could rearrange the code to have it provide recommendations:

                              
DECLARE @ScanCutoff int = 70;

DECLARE @UpdateCutoff int = 15;



WITH PartitionStatistics

AS

(

    SELECT t.name AS TableName,

           i.name AS IndexName,

           i.index_id AS IndexID,

           ios.partition_number AS PartitionNumber,

           FLOOR(ios.leaf_update_count * 100.0 /

                 ( ios.range_scan_count + ios.leaf_insert_count

                   + ios.leaf_delete_count + ios.leaf_update_count

                   + ios.leaf_page_merge_count + ios.singleton_lookup_count

                 )) AS UpdatePercentage,

           FLOOR(ios.range_scan_count * 100.0 /

                 ( ios.range_scan_count + ios.leaf_insert_count

                   + ios.leaf_delete_count + ios.leaf_update_count

                   + ios.leaf_page_merge_count + ios.singleton_lookup_count

                 )) AS ScanPercentage

    FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios

    INNER JOIN sys.objects AS o

    ON o.object_id = ios.object_id

    INNER JOIN sys.tables AS t

    ON t.object_id = o.object_id

    INNER JOIN sys.indexes AS i

    ON i.object_id = o.object_id

    AND i.index_id = ios.index_id

    WHERE ( ios.range_scan_count + ios.leaf_insert_count

            + ios.leaf_delete_count + leaf_update_count

            + ios.leaf_page_merge_count + ios.singleton_lookup_count) <> 0

    AND t.is_ms_shipped = 0

)

SELECT TableName, IndexName, IndexID, PartitionNumber,

       UpdatePercentage, ScanPercentage,

       CASE WHEN UpdatePercentage <= @UpdateCutoff

            AND ScanPercentage >= @ScanCutoff

            THEN 'PAGE'

            ELSE 'ROW'

       END AS Recommendation

FROM PartitionStatistics

ORDER BY TableName, IndexName, PartitionNumber;

 

Keep in mind that this code works on the index statistics that are currently in memory. These get cleared when the server restarts. Generally I’d recommend not using these recommendations unless the SQL Server service has been running non-stop for at least three months.

Do INSERTs count as UPDATEs ?

While this code generally does a good job of providing recommendations, there is another scenario where I need to take a second look. In the code provided, we are treating INSERT operations as a type of update. While this is somewhat true, INSERT operations do not always have the same impact on PAGE compression as other data modifications.

I find that if all the inserts on a table are being made at the end of the clustered index for a table, that I can ignore the inserts for the purpose of these recommendations. Unfortunately, there is no easy programmatic way to know whether or not that is happening.

One simple indication would be if there is an IDENTITY column on the table, and that column is the clustering key (often also the PRIMARY KEY but not necessarily). In that situation, the page rearrangements that we’re concerned with in PAGE compression don’t seem to be an issue.

But unfortunately, that’s not the only way of knowing what’s going on, and in many cases, you need specific knowledge of the application to know the order that the INSERT operations are occurring in.

Applying ROW or PAGE compression

Compression occurs when data is being inserted into pages, so we need to re-write all the pages to get the effect of compression. We can do this when we use ALTER INDEX to rebuild an index, or ALTER TABLE to rebuild the HoBT (heap or binary tree) for the table.

As an example of applying ROW compression to all indexes on the Production.Product table in AdventureWorks, we could execute the following:

ALTER INDEX ALL

ON Production.Product

REBUILD WITH (DATA_COMPRESSION = ROW);

Summary and Outcomes

So how did this apply to our customer database that needed to go on a diet? As I mentioned earlier, applying ROW compression across the board dropped the 3.8TB database to 2.6TB. Applying selective PAGE compression based on the recommendations in this article dropped the database to 1.4TB and still had improved performance. While the CPU per page is higher, it’s worth noting that a huge reduction in the number of pages, offsets quite a deal of this load.

So was the diet finished yet? Not by a long shot. There are many more things that I did to this database, and we’ll see more in the next article. The large string and XML values haven’t been reduced yet. We’ll deal with them next.