Data deletion in the real world
| Downloads |
|---|
| 129024.zip |
If you've worked with databases of any size, chances are you've also had to wrestle with issues associated with deleting large volumes of non-partitioned data. By non-partitioned data, I mean data that isn't partitioned by something like a date field.
Partitioned data is stored as separate sets of data. A good example is monthly sales data. If the data is partitioned by month, you might have 13 separate partitions—one for each of the 12 prior calendar months and one for only current data. When you have partitioned data, getting rid of millions or even billions of rows is a matter of switching and dropping data partitions. No data is physically Ldeleted from the tables as it would be by DELETE statements, but all the space allocated to a table is deallocated and the data on those pages is physically deleted as if you had issued a DELETE FROM SomeTable command. In this case, partition management is a matter of updating the data dictionary tables. In terms of overall data deletion performance, managing partitioned data by dropping and switching partitions is about as fast as removing data can get.
Non-partitioned data lives in either a heap or clustered index. There are no boundaries defined for splitting the data as there are in partitioned tables. Because of this, deletions become significantly more complicated, especially when dealing with millions of rows. Unless you can physically truncate a non-partitioned table, which is a data dictionary modification, deleting rows from tables can potentially generate an extremely large log file. The deletion process might also adversely affect database performance at several different levels. Truncating a table in SQL Server is a matter of deallocating used pages. Because these page deallocations are a logged action, it's possible to roll back a TRUNCATE operation. In other databases, such as Oracle, this isn't the case.
In an ideal world, all data sets would be partitioned and deleting huge amounts of data would be little more than a data dictionary update. Because we don't live in an ideal world, I'll discuss some alternative strategies for purging large amounts of non-partitioned data, particularly as they apply to mission-critical 24 ´ 7 ´ 365 environments.
Goals of Purging
When an organization operates around the clock, you're left with a very small window for database maintenance. The challenge is that the database tables must still be maintained for peak database performance. There are a number of goals to balance when designing a database purge process:
Minimize system impact. The point of purging is to delete unwanted data from the database, but a paramount design consideration is doing this in such a way that it doesn't impair your system while the process is running. Building a purge process that "first does no harm" isn't as easy as it would seem.
It's important to define what "minimizing system impact" means in the context of your design. It might be that your purge process can't consume excessive amounts of CPU, or perhaps it can't cause unnecessary database blocking. Start by defining the things your purge process should not impact, then determine what it will impact.
Delete data as quickly as possible. The counterpart to minimizing system impact is deleting data as quickly as possible. These goals aren't at odds with each other, but you do need to balance them in the design process.
Avoid blowing out the database transaction logs. In mission-critical systems, you should always be running database backups using the full recovery model. This has ramifications that become exceedingly important when trying to purge large volumes of data. You probably won't be able to issue a DELETE FROM SomeTable command, which is a fully logged operation, because doing so can increase the size of the transaction log to the point where it might run out of space. Truncating tables isn't always an option either. There might be current data that you need to retain.
I've had extremely large delete operations blow out at very bad times, resulting in hours of rollback activity. Fortunately, none of those blowouts occurred in mission-critical environments, but it underscores an important issue that you need to keep in mind as you design your database purge process.
Approaches to Purging
Given the purging goals, how do you go about designing the purge process? When I need to purge billions of rows that contain terabytes of data, I often use batched DELETE operations (aka batched deletes). Although there are several different ways to approach batched deletes, I'll discuss and provide code samples for four methods:
1. Truncation followed by a batched delete
2. Batched delete using a sliding window
3. Row-at-a-time batched delete
4. Batched delete using a sliding window with an inner join
The code samples use two tables—a parent table named Table1 and a child table named Table2—that contain a decent amount of test data. I wrote a script to create these tables. It cross-joins sys.tables against itself to seed Table1. After adding an identity column to Table1 and defining that column as the primary key, the script creates Table2, which is the child table for these demonstrations. The script then adds the foreign key constraints from Table2 to Table1. It also adds two indexes to Table1.
If you'd like to test the five methods, you can use this script to generate Table1 and Table2. In my test database, it generated 1.9 million rows. Your execution might generate more or less rows, depending on the number of tables in your database. BatchedDeleteExamples.sql contains the script as well as the code samples for the four methods. (The script creates backup tables so that you can restore Table1 and Table2 between running each method's code.) You can download BatchedDeleteExamples.sql by clicking here.

