We all know that data warehouses aren’t "real time," right? I mean, a warehouse is there to store data for historical analysis, after all. But the struggle to separate operational reporting against an ODS or a reporting schema from BI reporting against a data warehouse continues to confound IT and business organizations. The "real time" data warehouse and analytics can be PowerPivot, after all, or smaller functionally-confined data marts.
But if your data warehouse has a business requirement to be more “real time” than, say, the classic nightly load or even hourly, then I typically recommend using the "trickle-feed" or incremental update approach. In SQL Server 2008 R2, there are 2 mechanisms built into the product that you can use to accommodate this, provided that your data sources are SQL Server databases. So, let’s take a brief look at Change Tracking and Change Data Capture or CDC:
This is a feature that is built into the query engine in SQL Server and is available in both the Standard Edition and Enterprise Editions of SQL Server. You do not have the kind of control over the tracking mechanisms that CDC provides and SQL Server performs some behind-the-scenes magic by adding flags to your rows once you’ve enabled tracking on. You can see the difference in the bytes per row on your tables when you enable change tracking on a table using sp_spaceused. Before you start building out your data warehouse ETL, you will enable change tracking on your database and then on each table to track with these commands:
- ALTER DATABASE AdventureWorks
- CHANGE_TRACKING = ON
- CHANGE_RETENTION = 2 DAYS,
- AUTO_CLEANUP = ON
- ALTER TABLE dbo.Products
- ENABLE CHANGE_TRACKING
As you can see, you can tell SQL Server how long to retain the changes, to automatically cleanup after itself (I recommend setting that) and then which tables to track changes on. Very simple, low maintenance.
What you would do in ETL is to have an SSIS workflow that will periodically check for the latest net changes in a table using the CHANGETABLE function in SQL Server 2008. You can get the latest version of a row using the VERSION table or the changes since that last synch’d version using CHANGES. In most cases, you use LEFT OUTER JOIN with this change table function and the application database table that you are tracking so that you can grab the field values and make sure that you do not ignore deleted rows. Here is an example:
- SELECT SYS_CHANGE_VERSION, SYS_CHANGE_OPERATION,c.product_id, p.name
- FROM CHANGETABLE (CHANGES Products, @last_sync_version) AS C
- LEFT OUTER JOIN Products p ON p.product_id = c.product_id
You should be sure to store the version # in a package variable or in a table. That is what you will use for the @last_sync_version variable so that next time you run the CHANGE function, you will only get what has changed since the last run. Then use the SYS_CHANGE_OPERATION value to determine what to do with the data. This would include marking a fact or dimension record as obsolete in the case of a delete or an update that could trigger a SCD workflow.
CDC is also only for SQL Server and only in the 2008 & 2008 R2 products. But as our reporter-on-the-spot Derek has already reported, the next version of SQL Server (Denali) may very likely include an OEM of the Attunity CDC product which would be able to do more than just read from the SQL Server log. The current CDC is an Enterprise Edition-only feature and uses the same Log Reader that transactional replication uses. It is a bit more complex than Change Tracking, but gives you much more flexibility. For example, when you enable CDC on your database, 2 new agent jobs will be created: one to kick-off the Log Reader process and one to do the clean-up. Instead of adding trackers on your schema table objects, CDC creates a new scheme in your database called CDC for storing objects like the functions used to get changes. You can modify the agent jobs, but just be sure to use the stored procedure that they give you to do this, sys.sp_cdc_change_job, to modify things like the polling interval and retention time.
Here is how you enable CDC on your database and tables to be tracked:
- USE AdventureWorks2008
- EXECUTE sys.sp_cdc_enable_db;
- EXEC sys.sp_cdc_enable_table
- @source_schema = N'dbo',
- @source_name = N'Products',
- @role_name = N'cdc_Admin';
Since the CDC function is outside of the SQL engine, you use SPs to activate and de-activate tracking. Unlike Change Tracking, there is no GUI support for this, it’s all T-SQL. The role name parameter above, gives you the option of having a role that can own that CDC schema that the feature creates. It will create the role for you if it is not already existing in your database, or you can just leave it NULL.
After you run this command, you’ll get new functions in your database with the names of the schema and tables appended as such: fn_cdc_get_all_changes_dbo_Products. You are now ready to start adding this to your ETL.
In SSIS, you’ll want to have your task poll using the get NET changes passing it the starting LSN and ending LSN from the log that you wish to capture. Since CDC is using Log Reader, it is all based on LSNs, not version #s like in Change Tracking. So you’ll need to store that LSN value in a variable or in a table (or file for that matter), the ending LSN from this iteration. That will become the starting LSN on the next run. A sample of what the T-SQL command will look like is:
- -- Get the last LSN in the log
- select sys.fn_cdc_map_time_to_lsn('largest less than or equal', getdate())
- -- Get net changes on a table with CDC
- SELECT *
- FROM cdc.fn_cdc_get_net_changes_dbo_Products
- --starting LSN
- --ending LSN
At first, the CDC method may sound daunting compared to the built-in change tracking. But my experience has been that it is more robust and provides you as a developer or DBA more control over the process. You should also note that CDC is asynchronous which provides the same benefits and traps that asynch in other transactional scenarios does: the transaction don’t have to wait for the tracking to complete, but your tracking can start to fall behind if the Log Reader is very busy. Also, since this is reading from the log, beware of a few things that you need to account for in terms of back ups and log shipping. Change Tracking is synchronous but makes your rows a little bit bigger.
Also, with CDC, you can always go to Attunity now and buy their SSIS component for CDC, which you can use to add to your SSIS workflow. And you don’t need to use SSIS at all with CDC (or with Change Tracking, either) for data integration. I have a quick article that I wrote when SQL Server 2008 was released that shows you most of the T-SQL that you need to wire it up here if you prefer to integrate with T-SQL procedures instead.
One more thing … If you are going to use SSIS with CDC for incremental data loads into your data warehouse to short load times and create more real-ish time reporting, then take a look at a very good free sample from the SSIS team that is posted on Codeplex. It is part of the many good samples in the SSIS samples package and uses AdventureWorks as the OLTP database and AdventureWorksDW as the target data warehouse. It has the prebuilt tasks and procedures to set-up CDC, log changes and add just the changed rows into your data warehouse. It even mimics live updates to show you how it’s done. I put a simple screen shot below of the main package to show you that it really isn’t so bad. Just follow the instructions on how to load the T-SQL scripts and then give a whirl. For me, it really helped me to better understand how to integrate CDC into my SSIS-based ETL processes.
Take Care! Mark