Combining CDC and SSIS for Incremental Data Loads

New SSIS functionality makes the process straightforward

What is in this article?:

  • Combining CDC and SSIS for Incremental Data Loads
Downloads
143311.zip

As the DBA at the XYZ Company, you've been assigned the task of creating a report that shows sales order data. The data can be obtained from your order tracking application, but this system needs to be highly available and shouldn't be hit with large queries during the day when orders are being processed. The solution that makes the most sense is to create a copy of the database that can be used for reporting, but there are millions of records on these tables and even a nightly load won't complete in the available maintenance window.

One challenge in feeding data from an OLTP database to an operational data store (ODS) is keeping the data in sync. The target tables in the ODS can be entirely reloaded from the source on a regular schedule, but this has performance implications on the source and target databases. Tracking incremental changes can be accomplished using triggers or marking records, but these solutions can also affect performance and are usually quite complex.

The change data capture (CDC) feature introduced in SQL Server 2008 provides an efficient framework for tracking inserts, updates, and deletes in tables in a SQL Server database. CDC has a performance cost, but it's less than other change-tracking solutions such as triggers. When CDC is enabled, the changes are written to tables in the database and can be accessed using CDC system stored procedures. Note that CDC isn't supported on system databases.

By using both SQL Server Integration Services (SSIS) and CDC, you can perform incremental loads to an ODS. This process has been available since SQL Server 2008, but with the release of SQL Server 2012, there are new SSIS tasks and components that make the process less complex. As in SQL Server 2008, data is retrieved from the CDC tables using log sequence numbers (LSNs), but the new CDC Control tasks in SSIS make it much easier to track the range of LSNs needed for processing the changes. Once configured, the CDC source transparently uses the CDC stored procedures to pull the changed data.

Given the complexity of what's being accomplished, using CDC and SSIS to migrate incremental changes is amazingly straightforward. You start by enabling CDC on the database and the tables you want to track. Next, you use an SSIS package to do a one-time load of existing data into your ODS table. Then, you use another SSIS package to bring over incremental changes using the new CDC task and CDC source. This package can be run at the desired interval to get changes into the ODS tables. Depending on your reporting needs, this could be every five minutes, once a day, or anything in between. Let's take a closer look at each of these steps.

Enabling CDC

Enabling CDC on a SQL Server database (Enterprise edition only) is done by executing sys.sp_cdc_enable_db. Running this system stored procedure requires membership in the sysadmin server role. For example, if you want to enable CDC on the AdventureWorks2012 database, you'd run the code

USE AdventureWorks2012

GO

EXEC sys.sp_cdc_enable_db;

GO

When CDC is enabled on a database, is_cdc_enabled is set to True for that database in the sys.databases table.

After enabling CDC, a new cdc system schema is created containing the tables and stored procedures used to track changes. You can view the cdc schema items in the System Tables and System Stored Procedures folders in SQL Server Management Studio (SSMS).

Any tables you want to track need to be enabled, which you accomplish with the sys.sp_cdc_enable_table stored procedure. For example, suppose you want to load changes for the SalesOrderHeader table in the AdventureWorks2012 database into an ODS table. To do so, you enable the SalesOrderHeader table with code such as

USE AdventureWorks2012

GO

EXEC sys.sp_cdc_enable_table

@source_schema = N'Sales',

@source_name = N'SalesOrderHeader',

@role_name = NULL,

@supports_net_changes = 1;

GO

The sp_cdc_enable_table stored procedure has many parameters you can use when enabling a table for CDC. For a basic setup, you only need to set three parameters: source_schema, source_name, and role_name. The source_name and source_schema parameters specify the name of the table to track and the schema to which that table belongs, respectively.

The role_name parameter assigns a database security role, which will be used to allow access to the change data. As this example shows, you can explicitly set this parameter to NULL when you don't want to use a role to limit access to the change data. You can also use the role_name parameter to create a role if it doesn't already exist. If you specify a database security role that currently exists, it's used. If the role doesn't exist, SQL Server will create a database role using the specified name. Best practice is to create a role and explicitly allow access to the service account running SSIS.

In the code to enable the SalesOrderHeader table, note the inclusion of the supports_net_changes parameter. It allows querying only net changes using the cdc.fn_cdc_get_net_changes_<capture_instance> function. Enabling this parameter adds an additional nonclustered index to the CDC table, so there might be performance considerations.

 ยป

Discuss this Article 3

SQLMD
on Aug 22, 2012
Thanks. You are correct that CDC would not automatically pick up any schema changes. And unfortunately the new components won't automatically pick up the new metadata. You should be able to handle the changes by updating the CDC tables (drop and recreate to grab all columns), updating the target tables, and updating the SSIS packages to include the schema changes. Then you can update the new columns in the source which will add to the CDC table and flow through to the target table. The data being moved is no less than a full refresh since every record is probably getting updated, but you have a better history of what happened. When I have done this in the past I would run the package as part of deployment of the code changes and do the deployment at off hours. Not a very automatic solution, but the changes needed are fewer than before these tasks were available.
KJW_SQLDBA
on Aug 24, 2012
Thank you for responding and detailing the steps to take!
KJW_SQLDBA
on Aug 21, 2012
Great article!! Can you explain how you would accommodate a schema change (example: adding a new column) at the source table that has CDC enabled? I understand that CDC won't propogate this change to the cdc tables in order to make sure that everything will continue to flow through. But if you actually need the new column to be included in your destination, would you have to do a full load again on that one table to incorporate the new column into the destination? If so, will the new CDC SSIS components be aware of the changes in the metadata automatically in the full and incremental packages? Thanks for any light you can shed!

Please or Register to post comments.

IT/Dev Connections

Las Vegas
September 30th - October 4th

Paul ThurottOur Experts will show you:
• Common SQL Server
Problems
• Best Practices for T-SQL
• SQL Server Integration
Services
• Database Development

Come See Mike Otey & Tim Ford in Person!

Early Registration Now Open

From the Blogs
May 21, 2013
blog

A Common Misconception about MAXDOP

Out of the box, SQL Server is (and has been) able to take advantage of multiple processors/cores without any effort on behalf of administrators....More
May 9, 2013
blog

My ISO 8601-Compliant Signature 2

My family recently just "officially" announced that we're in the process of adopting a child from South Africa. We're quite excited, of course, but there's a ton of paperwork to do—along with the need for gobs of signatures....More
May 8, 2013
blog

Use SSIS for ETL from Hadoop

In this blog post, Mark Kromer walks you through using SSIS as a way to use ETL techniques using Microsoft's Hadoop on Windows (HDInsight) as a source using Hive connectors...More
SQL Server Pro Forums

Get answers to questions, share tips, and engage with the SQL Server community in our Forums.