Posts Tagged CDC

A Data Warehouser’s Vocabulary (Part 1)

Partly inspired by a post entitled “The most important thing I know about Analytics is that no-one agrees what it means” by James Taylor and partly inspired by the section “Slowly Changing Vocabulary” in the book “Data Warehouse Lifecycle Toolkit 2nd Edition“, I have decided to compile a glossary of terms and concepts that I feel have some relevance to the data warehousing and business intelligence world. I’ll break this list into several postings, and I reserve the right to refine, enhance, clarify, and augment a definition at any time! When finished, I’ll make them a permanent feature of TmF.

With this list, I am not attempting to resolve any debates, nor am I attempting to invalidate or discredit a definition you may be using. These are the definitions I use. Also be aware that certain terms might hold different meanings under different contexts. If I need to use one of those ambiguous terms, I try my best to put a good context around it. For example, when I refer to “Data Mart”, I specifically mean “Atomic Business Process Dimensional Model”. However, there are times when what I mean is to describe a separate (perhaps normalized) database for a specific user or department (i.e. a throw-away sandbox for the big kids).

Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! Also, when I finish this list, and put these all together on a single page, I’ll be sure to include a reference link section as well.

So, without further ado, I give you the first group of many to come (A-Z):

Business Intelligence (BI)
A generic term to describe leveraging the organizations’ internal and external information assets for making better business decisions.
Business Process
The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.
Changed Data Capture (CDC)
Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.
Data Mart
A business process dimensional model.
Data Profiling
Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.
Data Quality
Assurances that the integrated data is consistent, complete, and fit to publish to the business community.
Data Warehouse Database
The largest possible union of queryable presentation data in a DW/BI System.
ETL
A set of processes that prepare source data for a Data Warehouse, adding value and confidence along the way. These processes include extraction, transformations (cleans & conform), and load operations. Note that the order in which ETL processes occur can be varied based on the situation. Some sources refer to the ET or just the E broadly as “Data Acquisition”.
Master Data Management (MDM)
Centralized facilities designed to hold master copies of shared entities, such as Customer and Product.
Metadata
All the information that defines and describes the structures, operations, and contents of a BI/DW system.
Operational Data Store (ODS)
A physical set of tables sitting between the operational systems and the data warehouse, or a specially administered hot partition of the data warehouse itself. The main purpose of an ODS is to provide immediate reporting of operational results if neither the operational system or the data warehouse can provide satisfactory access.
Staging
Physical workspace for data during the ETL process. Some data is temporarily staged, while other data may persist.

Tags: , , , ,

1 Comment

ETL Subsystem 3: Data Extraction

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

One of the major pieces in the Data Integration puzzle is data extraction. Simply put, data extraction is the process of taking data from some source system to be used in some other process. Usually, in an ETL environment, extracted data is staged for further processing (staging is important for lineage and recovery purposes.).

Extracting comes in a few different flavors:

Push
The source system ‘pushes’ the data into the ETL system (through an export, direct write into a staging area, etc.)
Pull
The ETL process ‘pulls’ the data from the source system (usually using SQL to read the data)
Dribble
For real-time/right-time systems, this mechanism makes data available to the ETL system quite frequently throughout the day, possibly through database triggers or message queues (see MS Message Queuing for more details)

Depending on the source system, the extraction may be different. Also, depending on the purpose of the process (complete historical load verses incremental loads) the connection may be different. It is imperative that the decisions on how to extract data from source systems are well documented in the ETL technical documentation. This document will aid integration specialists, together with the data mappings, when developing the extract system and methodology.

Furthermore, you should strive to extract only relevant data. Extra, non-essential data will take up additional disk space and memory during processing. This is especially true in SSIS where memory is quite important.

In the perfect world

Ideally, the source system will push data to the ETL system on some pre-determined, business-relevant interval (hourly, daily, weekly, etc.). The source system will be responsible for determining the changed data; however, you should never fully trust source systems and always build a CDC system (see my previous post ETL Subsystem 2: Changed Data Capture for more information) to pick up any changed data missed by the source system.

The ideal extraction strategy for Pull systems relies on trigger-level insert and last update dates on each row in each table in the source. Pulling the data by date range — perhaps one subset for inserts and another for updates — is quite efficient and as long as the database is supplying the dates through triggers, you can use this method with confidence. If the source system uses application logic to populate these dates: steer clear. You’re asking for trouble. There is always at least one rogue update process that will fail to manually insert the correct date!

Realistically though

Things are usually less than ideal. Your CDC strategy will help determine how you acquire extract data and what to do with it when you do. Some additional ways to access extracted data including using vendor-supplied APIs, web services, direct data access, reading file-based reports, tapping into the transaction logs, or using some external 3rd Party tool. Be concerned with difficult-to-access legacy systems, especially if the experts are beginning to ask for retirement!

It is also possible to use database replication. A Publisher can distribute transactional data to any Subscribers such as an OLTP or ODS server and your ETL system. This scenario is best in a controlled environment where the IT department can manage the replication across systems.

Data Extracted. Now What?

I prefer to stage extracted data for later use (even if ‘later’ is only a few seconds away). You acquire the data from the source, and drop it off on disk in some native format (DBF files for FoxPro, RAW files for SSIS). This allows you to return to this step if problems were encountered downstream, if downstream processes need to wait for some other condition before continuing, for data lineage purposes, and for recovery in case of failure. You will not want to have to go back to the source to re-capture this data. It isn’t efficient and a lot can happen to the data while the ETL process fumbles around.

SQL Server 2005 Integration Services (SSIS)

Data extraction in Integration Services is rather easy. There are many different ways to connect to source systems. Dot NET Providers, for example, give you access to many types of systems including Informix Dynamic Server, SAP, and Sybase. Connection types included:

  • DataReader Source: Allows you to read data from a .NET Provider (see the ADO.NET team blog for more details) using the sqlcommand property. Tip: The DataReader usually performs better than OLE DB even for SQL Server. Also, be aware that you can sort data for further processing by specifying the IsSorted property = True on the Input and Output Properties tab. Remember to set the SortKeyPosition for each Output Column involved in the sort as well.
  • OLE DB Source: Use OLE DB to retrieve data from a variety of sources. You can use this adapter to request a result from a table, view, stored procedure, or SQL query. I like to develop my queries in SSMS or the source system’s editor (MySQL Query Browser or VFP IDE for example) if I have it. This helps to get any syntax quirks out of the way. SSIS is a bit lacking when it comes to writing queries using the provided edit box.
  • Excel Source: Simplifies access to MS Excel data. I prefer to stay away from using Excel as a data source for incremental loads because it is too easy for the user to break the process by changing the format of the file.
  • Flat File Source: A great adapter for reading flat files (fixed, delimited, ragged right). You will actually set up most of the information in the flat file connection and not in the source adapter. I use the flat file adapter more than any other particularly because it is easy to get data from source systems in a flat file. XML is slowly taking over, however.
  • XML Adapter: Used to read XML data and help convert it to a relational form. Use this adapter to talk to XML, either via file or direct data from a variable, shred the collections and hierarchies, and produce multiple outputs. I even use this adapter to shred XML for other purposes (non-data integration purposes) because it is so easy. Tip: Accept XML with XSD schema definitions only; this will save you a ton of time and effort profiling the XML for determining destination details.
  • Raw File Source: Internal SSIS format that can be used to stage data or exchange data among packages. This is a very fast mechanism to read from because the format of the raw file is the same format that SSIS uses in memory; SSIS simply loads the raw data and you’re off to the races.

I also make heavy use of the FTP Control Flow Task and have also used the Web Service Control Flow Task to retrieve data. Both of these tasks are relatively easy to set up and use.

Also, for real-time systems (using the dribble extraction approach), you will likely be looking into the Message Queue Task and WMI Event Watcher Task. I have not had the privilege of working in a dribble environment (yet), so I don’t have much more to offer at this time.

Hand Coding with Visual FoxPro (VFP9)

Of course, much of the above is easily accomplished in Visual Foxpro. You won’t have a colorful drag-and-drop editor to work with but developing your own classes to accomplish what SSIS can do is rather trivial (technically, you can then use the Toolbox Panel and get some drag and drop if you want). With the available adapters (ADO, ODBC, XML, Native), and the ability to read and append from files directly, FoxPro is well positioned as a data extraction tool. Not only that, but you can use the native DBF format as a staging format, making downstream operations that much faster (don’t stage to flat files when you can stage into native tables in a staging database).

I outlined a rather comprehensive data extraction and staging strategy in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy“. I also talked at length about certain techniques in my Fox Forward and AFUG presentations earlier this year (2007 Fox Forward Presentation Examples).

From here

Once data is extracted and sitting in a staging area (or in memory if you don’t use a physical stage), you can move on to cleaning it. The 4th subsystem is just that: The Data Cleansing Subsystem.

Tags: , , , , , , ,

3 Comments

ETL Subsystem 2: Changed Data Capture

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Managing changed data is a major challenge. Sometimes referred to as “incremental updates”, Changed Data Capture is essential for data warehousing operations, as it keeps data volume low and helps to manage the synchronization between the source and the warehouse.

Without a Changed Data Capture methodology, you would be forced to blow away and reload your data warehouse every time you needed it refreshed. If you only need your data warehouse once a month, this might be OK, but that would be rare.

Here’s the definition that I’ll be using:

Changed Data Capture
Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.

Accomplishing this is easier said than done. In some cases you will need to trust the source system to adequately identify its changes, in others you’ll need to figure it out on your own using a brute force row-by-row comparison. If you have access to, and can work with transaction logs of the source system, then this route may be best.

If you can trust the source system, you may be relying on a “last updated” datatime stamp stored in an audit table or in the row of the table that contains the change. You may also have some versioning information available as well. In this scenario, you rely on the source system to tell you when a change was made. Your ETL solution will look for these changes and only select that data for further processing.

The danger is that you are trusting a source system to report all changes faithfully. If you can access the logs, or if the changes are logged using a database trigger, than your trust is likely well placed. If the source relies on application logic to maintain this information, you should consider a different approach. It is quite possible that a database can change through a back-door, bypassing the application logic that would normally track the change. Steer clear.

If the source cannot tell you what has changed (or you simply don’t trust it), then you will need to do a comparison. Depending on the size of the project, it would seem to make most sense to do this type of comparison in the staging area — avoiding any contact with your dimensions in the dimensional model. Touching the Dimension for this purpose would result in a performance hit and could affect any analysis or queries currently being run against the data. In the “Hand Coding” section below, I outline a brute force method that relies on a staging scheme and some SQL to determine changed data for further processing.

In general, the CDC process involves the following steps:

  1. Identify changed rows if possible
  2. Do a lookup on the dimension or staging table for the natural key of the record in question
  3. For Dimensions, handle Type 1 and Type 2 changes using a Script component (see Dimensional Modeling: Loading The Slowly Changing Dimension for the logic behind this).
  4. Perform the updates and inserts
  5. Handle deletes through negation techniques


SQL Server 2005 Integration Services (SSIS)

SQL Server 2005 will allow you to do all CDC work within SSIS. For example, if the source system reports changes, you can write an extract (utilizing variables set during package execution) that will look for the specific change indicators whether it be a date range, version, or some status flag.

For brute force comparisons on Dimension tables, you can utilize the Slowly Changing Dimension task. I have already discussed the SCD at length in my post “Dimensional Modeling: Loading The Slowly Changing Dimension“, so please read that if you need more information about how a SCD is loaded and maintained. For the purposes of CDC, the only change I would make to that post is to use a staging area for the task instead of the production database. This will allow you to focus processing efforts on identifying changes, filtering out any unnecessary data in the process which will speed up processing down the line.

If the Slowly Changing Dimension task won’t work in your scenario (perhaps you don’t have dimensions available to check for new and changed rows or you have very, very large dimensions which the SCD task is not great at handling), then you can use SQL, checksums (see System.Security.Cryptography for more information), or a variety of other SSIS transforms to do the trick. In the next section, I show how SQL can be used.

Update: For SQL Server 2008 Enterprise Edition, there will be a CDC feature for tables (for DML only). I’m not too thrilled about this new feature, and in fact, I’m not sure if it will really work in a true ETL environment. But, I have not evaluated it, so I could be way off base. Check out these blog postings for more information:

Hand Coding with Visual FoxPro (VFP9)

In the following code example, I present a brute-force CDC solution in VFP. This solution, as discussed in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy” and at FoxForward this past September, relies on using VFP tables in a staging directory called sometable_curr (which represents the fresh data) and sometable_prev which represents the data from the previous run. Comparing these two data sets for changes is fairly straightforward.

The code for this is located in my 2007 Fox Forward Presentation Examples (zip file, prgs in file). Specifically, open program files “4 - extract.prg” and “5 - changed data capture.prg”. The examples use the Northwind Database.

Here are the bare-bones (refer to the programs above for more details)

 
*-- open the current and previous tables
USE sometable_curr IN 0 
USE sometable_prev IN 0 
 
*-- do the compare using UNION and GROUP BY. 
SELECT MIN(source) as source, sometable_id,;
    sometable_name, sometable_phone ;
FROM ( ;
  SELECT "curr" as source, sc.sometable_id, ;
      sc.sometable_name, sc.sometable_phone ;
    FROM sometable_curr sc ;
  UNION ;
  SELECT "prev" as source, sp.sometable_id,;
      sp.sometable_name, sp.sometable_phone ;
  FROM sometable_prev sp ;
) as currprev ;
GROUP BY sometable_id, sometable_name, sometable_phone;
HAVING COUNT(*) = 1 ;
INTO TABLE sometable_cdc.dbf

As you can see, a combination of using the staging area (curr and prev) and some SQL can give you changed data with minimal effort. This method is fairly efficient because it is done in the staging area where you could isolate the process on a dedicated a machine.

To use a checksum to do a comparison, please take a look at my post “Sys(2017) and Comparing Data“. I go into detail on how this can be accomplished.

From here

There are several avenues to follow when designing a CDC solution. Moreover, each source will likely demand a different approach. Figuring this out ahead of time is essential for the success of the integration system.

The next ETL Subsystem is [Data] Extraction. Once the source system has been profiled and a mechanism for CDC has been developed, the ETL team can begin the process of extracting the changed data from the source.

Tags: , , , , , , ,

2 Comments

Sys(2017) and Comparing Data

I’ve been showing how I use intersect and the symmetric difference to compare data in two datasets. What’s nice about these approaches is that they are extremly flexible. You can, for example, pick just a few columns to compare from each table. So if you had two similar tables, you could easily detect which primary key + select attributes are either in both sets or not in both sets, returning the result into a third table or cursor. I’ve used these techniques in the following scenarios:

  1. to build master lists from client tables
  2. test code changes (run a procedure once, save the results A, make a code change, save the results B, compare differences)
  3. for data integration projects, especially for Changed Data Capture (CDC or incremental update) operations
  4. to check the status of metadata
  5. to do some down-and-dirty comparisons between two or more tables

Sys(2017)

One limitation of the SQL approach that I’ve been demonstrating is that you need to use a GROUP BY clause, which insists that you identify each attribute used in the select list (true for VFP, T-SQL, and other ANSI compliant SQL-enabled languages). This can be a lot of work, especially if you have a handful of tables with dozens of attributes.

The solution? Use a CRC (Cyclical Redundancy Check):

SELECT p_key, crc ;
  FROM (;
    SELECT p_key, PADR(Sys(2017, '', 1, 1),10) as crc FROM CursorA;
    UNION ALL ;
    SELECT p_key, PADR(Sys(2017, '', 1, 1),10) as crc  FROM CursorB;
  ) AS tmp1 ;
  GROUP BY p_key, crc  ;
  HAVING COUNT(*) = 1

SYS(2017) will Return a checksum value based on each row in the table. The second parameter is an exclusion list of attributes that you do not want included in the algorithm. The third parameter is the seed value, ignored when using the CRC32 version of function. The last parameter allows you to set flags to include memos or to use the CRC32 algorithm. In my example above, I am using CRC32 without memo support.

The resulting table shows me the primary keys that intersect (when HAVING COUNT(*)=2) or do not intersect (when HAVING COUNT(*)=1).

I hope you find this useful!

PS: It’s worth noting that you should avoid SYS(2017) in FoxPro versions 8 and below. There is a memory leak (fixed in VFP9) that could crash your app!

Tags: , , ,

3 Comments