New SSIS functionality makes the process straightforward
| Downloads |
|---|
| 143311.zip |
As the DBA at the XYZ Company, you've been assigned the task of creating a report that shows sales order data. The data can be obtained from your order tracking application, but this system needs to be highly available and shouldn't be hit with large queries during the day when orders are being processed. The solution that makes the most sense is to create a copy of the database that can be used for reporting, but there are millions of records on these tables and even a nightly load won't complete in the available maintenance window.
One challenge in feeding data from an OLTP database to an operational data store (ODS) is keeping the data in sync. The target tables in the ODS can be entirely reloaded from the source on a regular schedule, but this has performance implications on the source and target databases. Tracking incremental changes can be accomplished using triggers or marking records, but these solutions can also affect performance and are usually quite complex.
The change data capture (CDC) feature introduced in SQL Server 2008 provides an efficient framework for tracking inserts, updates, and deletes in tables in a SQL Server database. CDC has a performance cost, but it's less than other change-tracking solutions such as triggers. When CDC is enabled, the changes are written to tables in the database and can be accessed using CDC system stored procedures. Note that CDC isn't supported on system databases.
By using both SQL Server Integration Services (SSIS) and CDC, you can perform incremental loads to an ODS. This process has been available since SQL Server 2008, but with the release of SQL Server 2012, there are new SSIS tasks and components that make the process less complex. As in SQL Server 2008, data is retrieved from the CDC tables using log sequence numbers (LSNs), but the new CDC Control tasks in SSIS make it much easier to track the range of LSNs needed for processing the changes. Once configured, the CDC source transparently uses the CDC stored procedures to pull the changed data.
Given the complexity of what's being accomplished, using CDC and SSIS to migrate incremental changes is amazingly straightforward. You start by enabling CDC on the database and the tables you want to track. Next, you use an SSIS package to do a one-time load of existing data into your ODS table. Then, you use another SSIS package to bring over incremental changes using the new CDC task and CDC source. This package can be run at the desired interval to get changes into the ODS tables. Depending on your reporting needs, this could be every five minutes, once a day, or anything in between. Let's take a closer look at each of these steps.
Enabling CDC
Enabling CDC on a SQL Server database (Enterprise edition only) is done by executing sys.sp_cdc_enable_db. Running this system stored procedure requires membership in the sysadmin server role. For example, if you want to enable CDC on the AdventureWorks2012 database, you'd run the code
GO
EXEC sys.sp_cdc_enable_db;
GO
When CDC is enabled on a database, is_cdc_enabled is set to True for that database in the sys.databases table.
After enabling CDC, a new cdc system schema is created containing the tables and stored procedures used to track changes. You can view the cdc schema items in the System Tables and System Stored Procedures folders in SQL Server Management Studio (SSMS).
Any tables you want to track need to be enabled, which you accomplish with the sys.sp_cdc_enable_table stored procedure. For example, suppose you want to load changes for the SalesOrderHeader table in the AdventureWorks2012 database into an ODS table. To do so, you enable the SalesOrderHeader table with code such as
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'Sales',
@source_name = N'SalesOrderHeader',
@role_name = NULL,
@supports_net_changes = 1;
GO
The sp_cdc_enable_table stored procedure has many parameters you can use when enabling a table for CDC. For a basic setup, you only need to set three parameters: source_schema, source_name, and role_name. The source_name and source_schema parameters specify the name of the table to track and the schema to which that table belongs, respectively.
The role_name parameter assigns a database security role, which will be used to allow access to the change data. As this example shows, you can explicitly set this parameter to NULL when you don't want to use a role to limit access to the change data. You can also use the role_name parameter to create a role if it doesn't already exist. If you specify a database security role that currently exists, it's used. If the role doesn't exist, SQL Server will create a database role using the specified name. Best practice is to create a role and explicitly allow access to the service account running SSIS.
In the code to enable the SalesOrderHeader table, note the inclusion of the supports_net_changes parameter. It allows querying only net changes using the cdc.fn_cdc_get_net_changes_<capture_instance> function. Enabling this parameter adds an additional nonclustered index to the CDC table, so there might be performance considerations.

