To meet legal requirements, companies might need to retain SQL Server data for a certain length of time. Companies might also want to retain data beyond that period for their own business reasons. Within the boundaries those legal requirements enforce, companies control when data can be purged. By purging data that's no longer needed to meet legal and business requirements, companies can improve the day-to-day performance of their SQL Server systems and more precisely estimate future storage and server needs. Thus, ROI will improve for existing hardware, maintenance, and future hardware requisitions.
An application's software development life cycle (SDLC) documentation should detail how long its data must remain available. In practice, however, SDLC documentation seldom includes data retention guidelines unless a business function depends on that data. The picture is even bleaker for vendor-supplied software: unless a business function depends on the data, most vendor software simply accumulates it in history tables indefinitely.
DBAs mainly concentrate on daily server operations and upkeep, so they seldom have time to become familiar with every application and keep track of the data that's no longer needed to meet legal and business requirements. However, when you don't regularly purge this data, tables can snowball to a size that degrades performance and that can be purged only during a maintenance window. To avoid this, I created a purge process that automatically purges the data that's no longer needed to meet legal and business requirements. After explaining this process, I'll walk you through a sample implementation.
The Purge Process
The purge process moves data between, and deletes data from, three categories of data, or data sets:
- The current data set, which contains data that needs to be available to users. Users can change or review the data.
- The history data set, which contains data that can only be reviewed or aggregated by users. Users can't change the data.
- The dormant data set, which contains data that falls out of the availability threshold but needs to be stored to meet legal or business requirements. This data is archived, so it can't be changed, reviewed, or aggregated by users.
The purge process is configurable and dynamic, so new categories or subcategories can be added to accommodate changes in legal and business requirements. For example, the current data set could be further classified into read-only data and updateable data. By hosting the read-only data on read-only file groups, you can improve SQL Server performance.
A purge database and a stored procedure named purgeManager are used to control the purge process. The purge database stores the configuration information that purgeManager needs to purge data from the data sets.
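To make the discussion concrete, here's a minimal sketch of what such a configuration table might look like. The table and column names are my own invention, not the actual purge database schema:

```sql
-- Hypothetical configuration table for the purge database.
-- One row describes one purge task.
CREATE TABLE dbo.PurgeConfig (
  PurgeTaskID    int         IDENTITY(1,1) PRIMARY KEY,
  DataSetName    varchar(20) NOT NULL,   -- 'Current', 'History', or 'Dormant'
  SourceTable    sysname     NOT NULL,   -- table the task purges
  TargetTable    sysname     NULL,       -- destination table; NULL = delete only
  RetentionValue int         NOT NULL,   -- e.g., 45
  RetentionUnit  varchar(10) NOT NULL,   -- 'day', 'month', or 'year'
  FrequencyDays  int         NOT NULL,   -- minimum days between runs
  BatchSize      int         NOT NULL,   -- rows purged per DELETE batch
  LastRunDate    datetime    NULL        -- set after each successful run
);
```

The stored procedure builds its dynamic purge queries from rows like these, so adding a new data set is a matter of inserting a row rather than changing code.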
When purgeManager is invoked, up to three purge tasks (one purge task for each data set) can execute. Whether or not a purge task executes is determined by the schedule (aka frequency) you set and the last time the purge task ran (aka last run date). This design allows for a rerun if necessary and makes the purge process re-entrant. Repeated executions (intentional or otherwise) won't have any negative effects.
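The article doesn't show purgeManager's internals, but the schedule check that makes the process re-entrant might look something like this sketch (the PurgeConfig table, its columns, and the worker procedure are assumptions, not the author's actual schema):

```sql
-- A task runs only when the configured frequency has elapsed since its
-- last run, so repeated executions on the same day are harmless no-ops.
IF EXISTS (SELECT 1
           FROM dbo.PurgeConfig
           WHERE PurgeTaskID = @PurgeTaskID
             AND (LastRunDate IS NULL
                  OR DATEDIFF(day, LastRunDate, GETDATE()) >= FrequencyDays))
BEGIN
  EXEC dbo.ExecutePurgeTask @PurgeTaskID;  -- hypothetical worker procedure

  -- Record the run so an immediate re-execution does nothing.
  UPDATE dbo.PurgeConfig
  SET LastRunDate = GETDATE()
  WHERE PurgeTaskID = @PurgeTaskID;
END
```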
Dynamic queries are used to purge the data. Some of these queries use the OUTPUT clause in a DELETE statement to move the data that's being purged. The purge is performed in batches, whose size you can configure. That way, you can avoid problems associated with long-running transactions.
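A batched DELETE with an OUTPUT clause, of the kind the process generates dynamically, might look like the following sketch. The table names, columns, and 10,000-row batch size are illustrative; the 45-day threshold comes from the bank example below:

```sql
-- Move rows older than 45 days from the current table to the history
-- table, one batch at a time, so no single transaction runs too long.
DECLARE @BatchSize int;
SET @BatchSize = 10000;

WHILE 1 = 1
BEGIN
  DELETE TOP (@BatchSize)
  FROM dbo.CurrentTransactions
  OUTPUT deleted.TransactionID,
         deleted.TransactionDate,
         deleted.Amount
  INTO dbo.HistoryTransactions (TransactionID, TransactionDate, Amount)
  WHERE TransactionDate < DATEADD(day, -45, GETDATE());

  IF @@ROWCOUNT < @BatchSize BREAK;  -- last, partial batch purged
END
```

Because each batch deletes and inserts in one statement, the moved rows never exist in both tables outside a transaction, and the loop keeps each transaction's lock footprint small.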
The purge process captures any errors. When an error occurs, the process records a note in the SQL Server error log and exits gracefully.
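SQL Server 2005's TRY...CATCH blocks make that pattern straightforward. A minimal sketch (the worker procedure name is a placeholder; note that RAISERROR WITH LOG requires sysadmin membership or ALTER TRACE permission):

```sql
-- Log any purge failure to the SQL Server error log, then exit
-- without re-throwing, so the calling job ends cleanly.
BEGIN TRY
  EXEC dbo.ExecutePurgeTask @PurgeTaskID;  -- hypothetical worker procedure
END TRY
BEGIN CATCH
  DECLARE @msg nvarchar(2048);
  SELECT @msg = 'Purge task ' + CAST(@PurgeTaskID AS varchar(10))
              + ' failed: ' + ERROR_MESSAGE();
  RAISERROR (@msg, 10, 1) WITH LOG;  -- severity 10 = informational note
  RETURN;                            -- graceful exit
END CATCH
```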
Using the three data sets I just mentioned, let's walk through an example. Suppose an international bank defines the following categories for its SQL Server 2005 platform based on its legal and business requirements:
- Current data set: This data set contains data from bank transactions that occurred in the last 45 days.
- History data set: This data set contains data from bank transactions that occurred in the last 13 months, excluding the most recent 45 days.
- Dormant data set: This data set contains data from bank transactions that occurred in the past eight years, excluding the most recent 13 months, to comply with IRS rules. Unlike the current and history data, the dormant data doesn't need to capture all the transaction details. However, only US currency is allowed in the dormant data set, so when appropriate, the transaction amount is multiplied by the exchange rate during the purge process.
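The currency conversion can happen inside the same DELETE...OUTPUT statement that moves the rows, because the OUTPUT clause accepts expressions over the deleted columns. A sketch, with table and column names assumed (including a per-row ExchangeRateToUSD column):

```sql
-- Move 13-month-old history rows to the dormant table, converting
-- each amount to US dollars on the way.
DELETE h
OUTPUT deleted.TransactionID,
       deleted.TransactionDate,
       deleted.Amount * deleted.ExchangeRateToUSD  -- normalize to USD
INTO dbo.DormantTransactions (TransactionID, TransactionDate, AmountUSD)
FROM dbo.HistoryTransactions AS h
WHERE h.TransactionDate < DATEADD(month, -13, GETDATE());
```

In the real process this statement would also be batched, like the other purge queries, rather than moving 13 months of rows in one transaction.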
The bank decides to use the following purge schedule:
- Every day, purge data older than 45 days from the current data set, moving it to the history data set.
- The first day of every month, purge data older than 13 months from the history data set, moving it to the dormant data set.
- Every day, purge data older than eight years from the dormant data set.
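Expressed as rows in the kind of configuration table the purge database might hold (the table, its columns, and the layout are my own invention), the bank's schedule could look like this. SQL Server 2005 requires one INSERT per row:

```sql
-- Hypothetical configuration rows for the bank's three purge tasks.
INSERT INTO dbo.PurgeConfig
  (DataSetName, SourceTable, TargetTable,
   RetentionValue, RetentionUnit, FrequencyDays, BatchSize)
VALUES ('Current', 'CurrentTransactions', 'HistoryTransactions',
        45, 'day', 1, 10000);     -- daily: older than 45 days -> history

INSERT INTO dbo.PurgeConfig
  (DataSetName, SourceTable, TargetTable,
   RetentionValue, RetentionUnit, FrequencyDays, BatchSize)
VALUES ('History', 'HistoryTransactions', 'DormantTransactions',
        13, 'month', 28, 10000);  -- monthly; the first-of-month check
                                  -- would live in purgeManager itself

INSERT INTO dbo.PurgeConfig
  (DataSetName, SourceTable, TargetTable,
   RetentionValue, RetentionUnit, FrequencyDays, BatchSize)
VALUES ('Dormant', 'DormantTransactions', NULL,
        8, 'year', 1, 10000);     -- daily delete-only task; no target
```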