A Data Purge Process

What is in this article?:

  • A Data Purge Process
Downloads
129368.zip

To meet legal requirements, companies might need to retain SQL Server data for a certain length of time. Companies might also want to retain SQL Server data for a certain time period for their own business requirements. Companies can control when the data can be purged from the system as long as the boundaries enforced by the legal requirements aren't affected. By purging the data that's no longer needed to meet legal and business requirements, companies can improve the overall daily performance of their SQL Server systems and more precisely estimate future storage and server needs. Thus, ROI will improve for existing hardware, maintenance, and future hardware requisitions.

An application's software development life cycle (SDLC) documentation should detail how long its data should be available. However, you seldom find data retention guidelines included in SDLC documentation, unless a business function depends on that data. The picture is even bleaker for vendor-supplied software. Unless there's a business function dependency, most of the vendor software will facilitate storing data in history tables indefinitely.

DBAs mainly concentrate on daily server operations and upkeep, so they seldom have time to become familiar with every application and keep track of the data that's no longer needed to meet legal and business requirements. However, when you don't regularly purge this data, tables can snowball to a size that impacts performance and that can only be purged in a maintenance window. To avoid this, I created a purge process that automatically purges the data that's no longer needed to meet legal and business requirements. After explaining this process, I'll walk you through a sample implementation.

The Purge Process

The purge process moves data between and deletes data from three categories of data, or data sets:

  • The current data set, which contains data that needs to be available to users. Users can change or review the data.
  • The history data set, which contains data that can only be reviewed or aggregated by users. Users can't change the data.
  • The dormant data set, which contains data that falls out of the availability threshold but needs to be stored to meet legal or business requirements. This data is archived, so it can't be changed, reviewed, or aggregated by users.

The purge process is configurable and dynamic, so new categories or subcategories can be added to accommodate changes in legal and business requirements. For example, the current data set could be further classified into read-only data and updateable data. By hosting the read-only data on read-only file groups, you can improve SQL Server performance.

A purge database and a stored procedure named purgeManager are used to control the purge process. The purge database stores the configuration information that purgeManager needs to purge data from the data sets.

When purgeManager is invoked, up to three purge tasks (one purge task for each data set) can execute. Whether or not a purge task executes is determined by the schedule (aka frequency) you set and the last time the purge task ran (aka last run date). This design allows for a rerun if necessary and makes the purge process re-entrant. Repeated executions (intentional or otherwise) won't have any negative effects.

Dynamic queries are used to purge the data. Some of these queries use the OUTPUT clause in a DELETE statement to move the data that's being purged. The purge is performed in batches, whose size you can configure. That way, you can avoid problems associated with long-running transactions.

The purge process captures any errors. When one occurs, it makes a note in the SQL Server error log and gracefully exits.

 

An Example

Using the three data sets I just mentioned, let's walk through an example. Suppose an international bank defines the following categories for its SQL Server 2005 platform based on its legal and business requirements:

  • Current data set: This data set contains data from bank transactions that occurred in the last 45 days.
  • History data set: This data set contains data from bank transactions that occurred in the last 13 months, excluding the most recent 45 days.
  • Dormant data set: This data set contains data from bank transactions that occurred in the past eight years, excluding the most recent 13 months, to comply with IRS rules. Unlike the current and history data, the dormant data doesn't need to capture all the transaction details. However, only US currency is allowed in the dormant data set, so when appropriate, the transaction amount is multiplied by the exchange rate during the purge process.

The bank decides to use the following purge schedule:

  • Every day, purge data older than 45 days from the current data set, moving it to the history data set.
  • The first day of every month, purge data older than 13 months from the history data set, moving it to the dormant data set.
  • Every day, purge data older than eight years from the dormant data set.
 »

Please or Register to post comments.

IT/Dev Connections

Las Vegas
September 30th - October 4th

Paul ThurottOur Experts will show you:
• Common SQL Server
Problems
• Best Practices for T-SQL
• SQL Server Integration
Services
• Database Development

Come See Mike Otey & Tim Ford in Person!

Early Registration Now Open

From the Blogs
May 9, 2013
blog

My ISO 8601-Compliant Signature 2

My family recently just "officially" announced that we're in the process of adopting a child from South Africa. We're quite excited, of course, but there's a ton of paperwork to do—along with the need for gobs of signatures....More
May 8, 2013
blog

Use SSIS for ETL from Hadoop

In this blog post, Mark Kromer walks you through using SSIS as a way to use ETL techniques using Microsoft's Hadoop on Windows (HDInsight) as a source using Hive connectors...More
Vision road sign
May 6, 2013
blog

Cheaters Never Win, Even in TPC Benchmarks

In this portion of the series on database benchmarking, I want to tell you about one of my favorite aspects of the TPC benchmarks – CHEATING....More
SQL Server Pro Forums

Get answers to questions, share tips, and engage with the SQL Server community in our Forums.