This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Managing changed data is a major challenge. Sometimes referred to as “incremental updates”, Changed Data Capture is essential for data warehousing operations, as it keeps data volume low and helps to manage the synchronization between the source and the warehouse.
Without a Changed Data Capture methodology, you would be forced to blow away and reload your data warehouse every time you needed it refreshed. If you only need your data warehouse once a month, this might be OK, but that would be rare.
Here’s the definition that I’ll be using:
- Changed Data Capture
- Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.
Accomplishing this is easier said than done. In some cases you will need to trust the source system to adequately identify its changes, in others you’ll need to figure it out on your own using a brute force row-by-row comparison. If you have access to, and can work with transaction logs of the source system, then this route may be best.
If you can trust the source system, you may be relying on a “last updated” datatime stamp stored in an audit table or in the row of the table that contains the change. You may also have some versioning information available as well. In this scenario, you rely on the source system to tell you when a change was made. Your ETL solution will look for these changes and only select that data for further processing.
The danger is that you are trusting a source system to report all changes faithfully. If you can access the logs, or if the changes are logged using a database trigger, than your trust is likely well placed. If the source relies on application logic to maintain this information, you should consider a different approach. It is quite possible that a database can change through a back-door, bypassing the application logic that would normally track the change. Steer clear.
If the source cannot tell you what has changed (or you simply don’t trust it), then you will need to do a comparison. Depending on the size of the project, it would seem to make most sense to do this type of comparison in the staging area — avoiding any contact with your dimensions in the dimensional model. Touching the Dimension for this purpose would result in a performance hit and could affect any analysis or queries currently being run against the data. In the “Hand Coding” section below, I outline a brute force method that relies on a staging scheme and some SQL to determine changed data for further processing.
In general, the CDC process involves the following steps:
- Identify changed rows if possible
- Do a lookup on the dimension or staging table for the natural key of the record in question
- For Dimensions, handle Type 1 and Type 2 changes using a Script component (see Dimensional Modeling: Loading The Slowly Changing Dimension for the logic behind this).
- Perform the updates and inserts
- Handle deletes through negation techniques
SQL Server 2005 Integration Services (SSIS)
SQL Server 2005 will allow you to do all CDC work within SSIS. For example, if the source system reports changes, you can write an extract (utilizing variables set during package execution) that will look for the specific change indicators whether it be a date range, version, or some status flag.
For brute force comparisons on Dimension tables, you can utilize the Slowly Changing Dimension task. I have already discussed the SCD at length in my post “Dimensional Modeling: Loading The Slowly Changing Dimension“, so please read that if you need more information about how a SCD is loaded and maintained. For the purposes of CDC, the only change I would make to that post is to use a staging area for the task instead of the production database. This will allow you to focus processing efforts on identifying changes, filtering out any unnecessary data in the process which will speed up processing down the line.
If the Slowly Changing Dimension task won’t work in your scenario (perhaps you don’t have dimensions available to check for new and changed rows or you have very, very large dimensions which the SCD task is not great at handling), then you can use SQL, checksums (see System.Security.Cryptography for more information), or a variety of other SSIS transforms to do the trick. In the next section, I show how SQL can be used.
Update: For SQL Server 2008 Enterprise Edition, there will be a CDC feature for tables (for DML only). I’m not too thrilled about this new feature, and in fact, I’m not sure if it will really work in a true ETL environment. But, I have not evaluated it, so I could be way off base. Check out these blog postings for more information:
Hand Coding with Visual FoxPro (VFP9)
In the following code example, I present a brute-force CDC solution in VFP. This solution, as discussed in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy” and at FoxForward this past September, relies on using VFP tables in a staging directory called sometable_curr (which represents the fresh data) and sometable_prev which represents the data from the previous run. Comparing these two data sets for changes is fairly straightforward.
The code for this is located in my 2007 Fox Forward Presentation Examples (zip file, prgs in file). Specifically, open program files “4 - extract.prg” and “5 - changed data capture.prg”. The examples use the Northwind Database.
Here are the bare-bones (refer to the programs above for more details)
*-- open the current and previous tables
USE sometable_curr IN 0
USE sometable_prev IN 0
*-- do the compare using UNION and GROUP BY.
SELECT MIN(source) as source, sometable_id,;
sometable_name, sometable_phone ;
FROM ( ;
SELECT "curr" as source, sc.sometable_id, ;
sc.sometable_name, sc.sometable_phone ;
FROM sometable_curr sc ;
UNION ;
SELECT "prev" as source, sp.sometable_id,;
sp.sometable_name, sp.sometable_phone ;
FROM sometable_prev sp ;
) as currprev ;
GROUP BY sometable_id, sometable_name, sometable_phone;
HAVING COUNT(*) = 1 ;
INTO TABLE sometable_cdc.dbf
As you can see, a combination of using the staging area (curr and prev) and some SQL can give you changed data with minimal effort. This method is fairly efficient because it is done in the staging area where you could isolate the process on a dedicated a machine.
To use a checksum to do a comparison, please take a look at my post “Sys(2017) and Comparing Data“. I go into detail on how this can be accomplished.
From here
There are several avenues to follow when designing a CDC solution. Moreover, each source will likely demand a different approach. Figuring this out ahead of time is essential for the success of the integration system.
The next ETL Subsystem is [Data] Extraction. Once the source system has been profiled and a mechanism for CDC has been developed, the ETL team can begin the process of extracting the changed data from the source.