Improve SQL Server partitioning query performance
The idea behind data partitioning is really pretty simple: Rather than storing data in one really large table, you spread the data out among many smaller tables. When used properly, this technique can aid in the maintenance and scalability of storing data, and a good partitioning strategy lets you increase the efficiency of operations against that data. Partitioning lets you spread your data among filegroups on different drives as well and makes it easier to re-index smaller segments of the overall data set. However, although the idea is relatively simple, historically, the implementation has not been. Before SQL Server 2005, partitioning was accomplished by spreading data among separate physical tables and joining the tables as needed through the use of views. This method lent itself to a slew of challenges. First, anytime a new table was added or an old one removed, definitions would need to be changed to accommodate the underlying table changes. Special logic would have to be added to the system to indicate into which tables new records should be inserted. Also, queries against partitioned views were typically slow, especially when data was distributed among different servers.
The Developer and Enterprise editions of SQL Server 2005 andbrought changes that made partitioning much easier. Instead of defining different tables and hashing them together, as in previous versions, SQL Server would let you define a single table as well as define how your data would be logically and physically distributed through the use of schemes and functions. In addition, SQL Server would manage the distribution in the background so you didn’t have to. Your data would still be in smaller tables, as it had been in previous versions, but SQL Server would handle those details in the background. In this article, my goal is to give you an overall high-level view of some of the challenges I’ve faced and the differences I’ve noticed between SQL Server 2005 and SQL Server 2008 performance when I run the same queries against nonpartitioned and partitioned tables.
Design and Implementation
In this article, I won’t cover partitioning design and implementation in extensive detail. Just know that you need to define a partition function to specify how you want your data logically distributed. With that definition, you specify your value barriers. You must then create a partition scheme to physically distribute your partitions among filegroups based on the logical barriers you defined in your partition function. When you create a clustered index on your table, you specify the partition scheme that will be used to partition the table (alternatively, you could use a unique constraint). You achieve many administrative and maintenance-related benefits by partitioning your tables, such as the ability to move data in and out of tables quickly by “switching” partitions. Partitioning also allows for rebuilding indexes on one or more partitions individually. You can find a highly useful white paper about partitioning at http://msdn.microsoft.com/en-us/library/ms190787.aspx.
When it comes to querying partitioned data, you face performance challenges, although current challenges pale in comparison to those encountered with pre–SQL Server 2005 versions. At first glance from a high level, it might seem that query performance for nonpartitioned and partitioned tables should be the same. But because partitioned data is spread out among partitions rather than being in the same spot, it becomes trickier for the database engine to easily locate data. However, if you design your partitioning strategy so that your queries always point to a particular partition, you can see query performance benefits similar to those for a nonpartitioned table. These performance benefits can grow dramatically if you’re dealing with huge amounts of partitioned data. If a majority of your queries will span partitions (e.g., data from different time ranges, different customers), then achieving good performance is trickier still. Note: If you find that your queries continually span partitions, you may need to reevaluate your partitioning strategy and consider whether this feature is the right choice for your environment.
As I compared SQL Server 2005 and SQL Server 2008 performance running the same queries against nonpartitioned and partitioned tables, I intentionally kept the tests as straightforward as possible. This approach lets you easily see (and lets me easily describe) the differences between the different methods. Results may—and probably will—vary when higher loads of data are examined and when partitioning strategies and query criteria change.
The code snippet presented in the first script, shown in Listing 1, sets up the tables and indicates how they’re partitioned. I’ve designed my partitioning function to distribute data across 10 different partitions, all of which are in my primary filegroup. In a perfect world, I would have these partitions on different disks or sets of disks, but for this scenario, placing all of the partitions in the primary filegroup should be fine.
In this setup, I load 1 million records into a table aptly named NonPartitionedTable. Each column in this table will play a different role in the examples presented. The ValueColumn contains a globally unique identifier (GUID), which I’ll later use to return a single record lookup. The PartitionID is a hashed value that will be part of my clustered index for my partitioned table and will serve as the partitioning key for data distribution. If you partition a table, the partition key must be included in the clustered index definition because partitioning physically distributes a table, and a clustered index is physically a part of the table. Note that I’m not designing this partition strategy on a date-based element. Instead, I mimic an environment where data is segmented based on customer-related data.
The GroupedValue is another hashed column that I’ll use to query values across partitions and return an aggregated set of data. This type of query is of particular importance because many queries run against partitioned tables can’t specify the partitioning key as part of the criteria because records from multiple partitions must be returned. The performance of such queries is critical and should play a significant role in your design strategy if you determine that you need to partition your database tables.