I work in a SQL Server production environment that consists of multiple customer databases on multiple SQL Server 2008 64-bit instances. One database is maintained for each customer. Every week each customer provides a file that contains its sales data for the previous week. An automated process loads this data directly into a staging table with few validations. One task that I need to perform every week is validating and loading this sales data from the staging table to a single Sales table in the same customer database. For the large customer databases, a week's worth of data ranges between 20 million and 40 million records.
Another task I need to perform every week is purging one or more weeks of old data from the Sales table because my company's data retention policy is to keep only 105 or 156 weeks (about 2 or 3 years) of sales data, depending on the customer. I purge the old sales data the same day I load the new sales data.
Loading and purging the data took a lot of time each week because the Sales table wasn't partitioned. So, I decided to create an elegant partition solution that would load the new sales data and purge the old sales data as quickly as possible. I came up with four-stage plan:
1. Identify the best partition solution.
2. Implement the partition solution.
3. Migrate the data.
4. Set up the purge process.
Stage 1: Identify the Best Partition Solution
My goal for stage 1 was to determine the best partition solution. After some contemplation, I decided on the following:
- Each partition in the new partitioned table would hold one week's worth of sales data, making it easy to add new data and purge old data.
- The ideal column to define the partitions would be the SalesDate column in the old nonpartitioned table, because this column contained the date the records were inserted.
- I would dynamically create the initial partition and transfer data from the nonpartitioned table to the partitioned table. I would also dynamically create subsequent partitions.
- Because only 105 or 156 weeks of sales data can be kept in the partitioned table, I would dynamically delete partitions containing data older than that. The partition numbers would always be 1 to 105 or 1 to 156, with partition 1 containing the oldest data.
Stage 2: Implement the Partition Solution
My goal for stage 2 was to create a new table that retained the definition of the nonpartitioned Sales table but used dynamically generated partitions. Here's how I created that table in each customer database:
1. I dropped all the primary-key, unique-key, and foreign-key constraints and all indexes in the nonpartitioned Sales table.
2. I renamed the nonpartitioned Sales table to SalesNormalTable_B4_Partition.
3. I declared a local variable (@tmpSalesWeekDate), obtained the oldest sales date from the SalesDate column in the SalesNormalTable_B4_Partition table, and stored it in that variable, using code like that in Listing 1.
SELECT @tmpSalesWeekDate = ISNULL(MIN(SalesDate), '01/01/2010')
FROM dbo.SalesNormalTable_B4_Partition WITH (NOLOCK)
4. To dynamically generate the partitions, I created a partition function named PF_dbo_SalesPartitionTable_SalesDate_ByWeek. As Listing 2 shows, I used the sales date in the local variable as the first value so that the first partition would hold the oldest sales data from the SalesNormalTable_B4_Partition table.
AS RANGE LEFT FOR VALUES (@tmpSalesWeekDate)
5. To dynamically generate the partition schemes in the new partitions, I created a new partition scheme named PS_dbo_SalesPartitionTable_SalesDate_ByWeek for the PF_dbo_SalesPartitionTable_SalesDate_ByWeek function, as Listing 3 shows.
AS PARTITION PF_dbo_SalesPartitionTable_SalesDate_ByWeek
ALL TO ([PRIMARY])
6. Using the newly created partition scheme, I created a new partitioned Sales table named SalesPartitionTable and a temporary placeholder table, both of which had the same definition as the SalesNormalTable_B4_Partition table. The placeholder table would be used during the initial data transfer and during the subsequent weekly data transfers.
7. I created the necessary primary key and indexes in the SalesPartitionTable and placeholder tables.
Stage 3: Migrate the Data
My goal for stage 3 was to migrate all the data from the SalesNormalTable_B4_Partition table to the SalesPartitionTable table, dynamically creating partitions when needed. Here's how I accomplished this goal in each customer database.
I first obtained all the distinct sales dates from the SalesDate column in the SalesNormalTable_B4_Partition table and stored them in a temporary table named DistinctSalesDate. Then, a row at a time, I retrieved each sales date from DistinctSalesDate, storing it in a local variable.
Next, I passed the sales date in the local variable to a scalar-valued user-defined function (UDF) named dbo.fn_GetPartitionNumberForPartitionFunctionAndValue, as Listing 4 shows. I designed this UDF to return the partition number for a specified partition function and sales date.
' PF_dbo_SalesPartitionTable_SalesDate_ByWeek ',
When the UDF returned NULL, a new partition for the sales date was dynamically created using the ALTER PARTITION FUNCTION statement with the SPLIT RANGE argument, as shown in Listing 5. In addition, the ALTER PARTITION SCHEME statement with the NEXT USED argument was used to designate the next file group to be used in the partition scheme.
SPLIT RANGE (@tmpSalesWeekDate)
ALTER PARTITION SCHEME PS_dbo_SalesPartitionTable_SalesDate_ByWeek
NEXT USED [PRIMARY]
When the UDF returned a partition number that was greater than zero, the following steps were taken:
1. The partition was switched from the SalesPartitionTable table to the placeholder table.
2. The primary-key constraint and indexes were dropped from the SalesPartitionTable and placeholder tables.
3. The row matching the sales date in the local variable was copied from the SalesNormalTable_B4_Partition table to the placeholder table.
4. The partition containing the transferred data was switched from the placeholder table to the SalesPartitionTable table.
5. The row matching the sales date in the local variable was deleted from the placeholder table.
6. Steps 2 through 5 were repeated until all the distinct sales dates in the DistinctSalesDate table were processed.
7. The number of rows in the SalesPartitionTable and SalesNormalTable_B4_Partition tables were compared. If they matched, the necessary primary key and indexes were created in the SalesPartitionTable and placeholder tables. The SalesNormalTable_B4_Partition table was then dropped.
Note that by dropping the primary-key constraint and indexes before any data was transferred (step 2) and creating the primary key and indexes after the transfer was completed (step 7), I avoided having to perform these tasks in each partition.
To add the new sales data each week, the same basic process would be followed: A new partition would be dynamically created for the new sales date, and the new partition would be switched to the placeholder table, where the primary-key constraint and indexes would be dropped. The new sales data would then be inserted from the staging table into the placeholder table. After creating the necessary primary key and indexes in the placeholder table, the new partition would be switched to the SalesPartitionTable table.
Stage 4: Set Up the Purge Process
At this point, I had successfully implemented and populated the partitions in SalesPartitionTable in each customer database. My next goal was to set up a weekly process that would purge the old data when a new week's worth of sales data is added. Because purging old sales data would leave empty partitions at the beginning of the table, I also wanted to recycle the empty partitions.
To purge the old sales data and recycle the empty partitions, I created a dynamic purge process that runs while the new sales data is added each week. Here's how purge process works in each customer database. To begin, it identifies the sales data that's older than the retention period (either 105 or 156 weeks) by comparing the current date to the sales dates in the SalesDate column in SalesPartitionTable. For each sales date identified as older, the following activities occur:
- The partition matching the sales date is switched from the SalesPartitionTable table to the placeholder table.
- The placeholder table is truncated to delete the old data almost instantaneously.
- The partition is switched from the placeholder table to the SalesPartitionTable table.
The truncation of the sales data results in an empty partition, so the purge process dynamically recycles it using code like that in Listing 6. This code first obtains the oldest sales date in the SalesDate column in SalesPartitionTable and stores it in a local variable named @DeleteDateDT. This local variable is then used in a call to dbo.fn_GetPartitionNumberForPartitionFunctionAndValue. This function returns the associated partition number, which is stored in another local variable named @PartitionNumber.
DECLARE @PartitionNumber INT
SELECT @DeleteDateDT = MIN(SalesDate)
FROM dbo.SalesPartitionTable WITH (NOLOCK)
IF @DeleteDateDT IS NOT NULL
SELECT @PartitionNumber =
( 'PF_dbo_SalesPartitionTable_SalesDate_ByWeek', @DeleteDateDT )
-- BEGIN CALLOUT A
WHILE (@PartitionNumber > 1)
SET @PartitionNumber = @PartitionNumber - 1
SELECT @DeleteDateDT =
@PartitionNumber ) AS DATE)
IF (SELECT COUNT(*) FROM dbo.SalesPartitionTable WITH (NOLOCK)
WHERE SalesDate = @DeleteDateDT) = 0
ALTER PARTITION FUNCTION PF_dbo_SalesPartitionTable_SalesDate_ByWeek()
MERGE RANGE (@DeleteDateDT)
END /* IF (SELECT COUNT(*)... */
END /* WHILE (@PartitionNumber > 1) */
-- END CALLOUT A
END /* IF @DeleteDateDT IS NOT NULL */
The @PartitionNumber variable plays an integral role in the WHILE loop shown at callout A in Listing 6 (below). Provided that the number in @PartitionNumber is greater than 1 (at this point in the purge process, the partition with the oldest sales data is going to be partition 2 because partition 1 no longer contains any data), several actions are taken during each iteration. First, 1 is subtracted from the number in @PartitionNumber (e.g., 2 - 1 = 1). The @PartitionNumber variable is then used as an argument in a call to a scalar-valued UDF named dbo.fn_GetPartitionRangeValueForPartitionFunctionAndNumber, which returns the sales date for that partition. The other argument that this UDF requires is the name of the partition function used to create the partition (in this case, PF_dbo_SalesPartitionTable_SalesDate_ByWeek).
Each sales date returned by dbo.fn_GetPartitionRangeValueForPartitionFunctionAndNumber is passed to the ALTER PARTITION FUNCTION command. When used with the MERGE RANGE argument, this command merges two partitions into one and drops the empty partition. After all the sales dates in the WHILE loop are processed through this command, the empty partition will have been removed and the non-empty partitions will have moved down one number so that partition 2 becomes partition 1, partition 3 becomes partition 2, and so on until partition 106 becomes partition 105 (or partition 157 becomes partition 156).
Demonstration of the Partition Solution
I created 13 T-SQL files to demonstrate the concepts used in my partition solution. You can download them by clicking the hotlink at the top of this page. Each file is prefixed with number, starting with 1. If you run each T-SQL file in sequence, you'll see the partition solution in action. Some of the files create database objects (e.g., UDFs, stored procedures), whereas others are T-SQL scripts. Here's a description of what each one does:
1_dbo.fn_GetPartitionRangeValueForPartitionFunctionAndNumber.sql. This file creates a scalar-valued UDF in a user database. The UDF returns the sales date (i.e., the range value) for the specified partition function and partition number.
2_dbo.fn_GetPartitionNumberForPartitionFunctionAndValue.sql. This file creates a scalar-valued UDF in a user database. The UDF returns the partition number for the specified partition function and sales date.
3_dbo.Proc_CreateORGetPartitionNumber.sql. This file creates a stored procedure in a user database. The stored procedure either creates a new partition or returns an existing partition number for the specified partition function and sales date.
4_Script_CreateTable_dbo.SalesNormalTable.sql. This script creates a nonpartitioned user table named SalesNormalTable in a user database. SalesNormalTable holds the records that will be transferred to a partitioned table.
5_Script_PopulateData_dbo.SalesNormalTable.sql. This script populates the SalesNormalTable table with 11 weeks of sales data. Thus, the table contains 11 rows.
6_Script_Implement_dbo.SalesPartitionTable.sql. This script renames the SalesNormalTable to SalesNormalTable_B4_Partition and drops the primary-key constraint. It also creates two partitioned user tables in a user database: SalesPartitionTable and TempSalesPartitionTable. SalesPartitionTable will hold the sales data from the nonpartitioned table. TempSalesPartitionTable is the temporary placeholder table that will be used in the partition switch operations. In addition, the script creates the partition function, the partition schema, and the first partition.
7_sp_GetTablePartitionRowsInfo.sql. This file creates a system stored procedure named sp_GetTablePartitionRowsInfo in the master database. The stored procedure provides information about the partitions in a partitioned table, such as the partition numbers and the number of rows in each partition.
Note that I created sp_GetTablePartitionRowsInfo as a system object in the master database because of the SQL Server environments in which I work. There are five different environments (development, quality assurance, staging, demo, and production) and a strict release process that governs movement between them. Because sp_GetTablePartitionRowsInfo is basically just a utility, I decided not to make it part of the release process. This allows me to make enhancements to the stored procedure and push it to the master databases in all five SQL Server environments at my own discretion.
If you don't want to create sp_GetTablePartitionRowsInfo as a system stored procedure in the master database, you can create it as a regular stored procedure in a user database. To do so, you need to make several changes to 7_sp_GetTablePartitionRowsInfo.sql before running it:
1. Find each instance of the line
and replace master with the name of your user database.
2. Find and remove the code
8_Script_Run_sp_GetTablePartitionRowsInfo.sql. This script runs the sp_GetTablePartitionRowsInfo stored procedure just created to show the initial partition information for the SalesPartitionTable and TempSalesPartitionTable tables. You need to run this script in the database in which you created sp_GetTablePartitionRowsInfo.
Running script 8 at this point produces results like that in Figure 1. The first result set is for SalesPartitionTable, and the second result set is for TempSalesPartitionTable. Note that the second result set shows an extra partition (partition 2) because SQL Server creates an extra partition by default, which appears when you don't include the range value (in this case, the sales date).
9_Script_Migrate_Data_From_dbo.SalesNormalTable_To_dbo.SalesPartitionTable.sql. This script uses TempSalesPartitionTable to transfer data from SalesNormalTable_B4_Partition to SalesPartitionTable. Because SalesPartitionTable and TempSalesPartitionTable are created with the first partition, the script just transfers the data for the first week. From the second week onward, it dynamically creates the partition and transfers the data.
If you run script 8 again after script 9 completes, you can see the updated partition information for SalesPartitionTable and TempSalesPartitionTable. As Figure 2 shows, the information reflects that all the data transferred successfully.
Note that in script 9, I use trace flags 610 and 1118. Trace flag 610 allows for minimal logging during the script's execution. (Click the link to get more information about trace flag 610. Trace flag 1118 forces uniform extent allocation instead of mixed extent allocation. Although SQL Server 2008 optimized mixed extent allocation, I decided to err on the side of caution. (To get more information about trace flag 1118, read the Microsoft article "Concurrency enhancements for the tempdb databasesupport.)
10_dbo.Proc_PurgeWeeklyPartitionedTables.sql. This file creates a stored procedure named dbo.Proc_PurgeWeeklyPartitionedTables in the user database. The stored procedure purges any data older than the specified retention period from SalesPartitionTable. For this demonstration, the retention period is 10 weeks. This means that any data older than 10 weeks compared to the current date will be truncated.
11_Script_Run_dbo_Proc_PurgeWeeklyPartitionedTable.sql. This script executes the dbo.Proc_PurgeWeeklyPartitionedTables stored procedure. If you run script 8 after script 11 completes, you'll see partition information like that in Figure 3. In the first result set, notice that the number of rows in partition 1 is zero. That partition's data was truncated because it was more than 10 weeks old.
12_dbo.Proc_PurgeTableParitionsWithNoRows.sql. This file creates a stored procedure named dbo.Proc_PurgeTableParitionsWithNoRows in a user database. The stored procedure removes partitions that have no rows in them so that SalesPartitionTable doesn't include empty partitions. It also recycles the partition numbers so that the table's partitions are always numbered from 1 to 10.
13_Script_Run_dbo_Proc_PurgeTableParitionsWithNoRows.sql. This script executes the dbo.Proc_PurgeTableParitionsWithNoRows stored procedure. If you run script 8 after script 13 completes, you'll see partition information like that in Figure 4. As you can see, the first result set contains 10 partitions -- the partition with the empty rows (sales date 2011-04-22) is gone.
A Time-Saving Tool
The partition solution I created has saved me a lot of time. Previously, it took more than 3 hours each week to add the new sales data and create any necessary indexes because of the amount of data added (up to 40 million rows). Now it takes less than 30 minutes to load 40 million rows and create 3 indexes. Similarly, purging the old data used to take several hours. Now, it's almost instantaneous because I just truncate the partition that contains the old data.
Queries also take a lot less time to run. For example, each week I have to run a query that returns aggregated values from all 105 or 156 weeks of sales data. Previously, this query would take hours and sometimes even timeout because it had to read through the entire nonpartitioned table. After the partitioned table was in place, I modified the query to take advantage of partition elimination (i.e., accessing the necessary data in the partition at any given time based on a partition column value). The modified query consistently runs in about 11 minutes. I hope that you, too, can save some time by applying the concepts and code I've presented.