Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘SQL Server 2008’


Published April 28th, 2008

Life in Redmond

Imagine this:

“Everyone has an office (a few people share), doors are often closed, and people are heads down working. […] Sounds are muted, there’s no animation from the employees, it just seems like it could be an accounting firm as much as it could be a software developer. I’m not sure what I was expecting, maybe more people running around with Nerf dart guns, maybe more people excited and jazzed about what they’re working on. A few loud “Yes”s or “Aha”s, following by someone outside their cube, reveling passerbys with the amazing code they’re just written. ”

Source: Steve Jones of SQLServerCentral.com

That’s what things are like, according to Steve Jones of SQLServerCentral.com, for the SQL Server team at Microsoft as they busily prepare the next great version of SQL Server. Things have been delayed already. And as it goes, they may need to change the name to SQL Server 2009. “2008″ already sounds dated.

Published December 11th, 2007

ETL Subsystem 2: Changed Data Capture

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Managing changed data is a major challenge. Sometimes referred to as “incremental updates”, Changed Data Capture is essential for data warehousing operations, as it keeps data volume low and helps to manage the synchronization between the source and the warehouse.

Without a Changed Data Capture methodology, you would be forced to blow away and reload your data warehouse every time you needed it refreshed. If you only need your data warehouse once a month, this might be OK, but that would be rare.

Here’s the definition that I’ll be using:

Changed Data Capture
Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.

Accomplishing this is easier said than done. In some cases you will need to trust the source system to adequately identify its changes, in others you’ll need to figure it out on your own using a brute force row-by-row comparison. If you have access to, and can work with transaction logs of the source system, then this route may be best.

If you can trust the source system, you may be relying on a “last updated” datatime stamp stored in an audit table or in the row of the table that contains the change. You may also have some versioning information available as well. In this scenario, you rely on the source system to tell you when a change was made. Your ETL solution will look for these changes and only select that data for further processing.

The danger is that you are trusting a source system to report all changes faithfully. If you can access the logs, or if the changes are logged using a database trigger, than your trust is likely well placed. If the source relies on application logic to maintain this information, you should consider a different approach. It is quite possible that a database can change through a back-door, bypassing the application logic that would normally track the change. Steer clear.

If the source cannot tell you what has changed (or you simply don’t trust it), then you will need to do a comparison. Depending on the size of the project, it would seem to make most sense to do this type of comparison in the staging area — avoiding any contact with your dimensions in the dimensional model. Touching the Dimension for this purpose would result in a performance hit and could affect any analysis or queries currently being run against the data. In the “Hand Coding” section below, I outline a brute force method that relies on a staging scheme and some SQL to determine changed data for further processing.

In general, the CDC process involves the following steps:

  1. Identify changed rows if possible
  2. Do a lookup on the dimension or staging table for the natural key of the record in question
  3. For Dimensions, handle Type 1 and Type 2 changes using a Script component (see Dimensional Modeling: Loading The Slowly Changing Dimension for the logic behind this).
  4. Perform the updates and inserts
  5. Handle deletes through negation techniques


SQL Server 2005 Integration Services (SSIS)

SQL Server 2005 will allow you to do all CDC work within SSIS. For example, if the source system reports changes, you can write an extract (utilizing variables set during package execution) that will look for the specific change indicators whether it be a date range, version, or some status flag.

For brute force comparisons on Dimension tables, you can utilize the Slowly Changing Dimension task. I have already discussed the SCD at length in my post “Dimensional Modeling: Loading The Slowly Changing Dimension“, so please read that if you need more information about how a SCD is loaded and maintained. For the purposes of CDC, the only change I would make to that post is to use a staging area for the task instead of the production database. This will allow you to focus processing efforts on identifying changes, filtering out any unnecessary data in the process which will speed up processing down the line.

If the Slowly Changing Dimension task won’t work in your scenario (perhaps you don’t have dimensions available to check for new and changed rows or you have very, very large dimensions which the SCD task is not great at handling), then you can use SQL, checksums (see System.Security.Cryptography for more information), or a variety of other SSIS transforms to do the trick. In the next section, I show how SQL can be used.

Update: For SQL Server 2008 Enterprise Edition, there will be a CDC feature for tables (for DML only). I’m not too thrilled about this new feature, and in fact, I’m not sure if it will really work in a true ETL environment. But, I have not evaluated it, so I could be way off base. Check out these blog postings for more information:

Hand Coding with Visual FoxPro (VFP9)

In the following code example, I present a brute-force CDC solution in VFP. This solution, as discussed in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy” and at FoxForward this past September, relies on using VFP tables in a staging directory called sometable_curr (which represents the fresh data) and sometable_prev which represents the data from the previous run. Comparing these two data sets for changes is fairly straightforward.

The code for this is located in my 2007 Fox Forward Presentation Examples (zip file, prgs in file). Specifically, open program files “4 - extract.prg” and “5 - changed data capture.prg”. The examples use the Northwind Database.

Here are the bare-bones (refer to the programs above for more details)

 
*-- open the current and previous tables
USE sometable_curr IN 0 
USE sometable_prev IN 0 
 
*-- do the compare using UNION and GROUP BY. 
SELECT MIN(source) as source, sometable_id,;
    sometable_name, sometable_phone ;
FROM ( ;
  SELECT "curr" as source, sc.sometable_id, ;
      sc.sometable_name, sc.sometable_phone ;
    FROM sometable_curr sc ;
  UNION ;
  SELECT "prev" as source, sp.sometable_id,;
      sp.sometable_name, sp.sometable_phone ;
  FROM sometable_prev sp ;
) as currprev ;
GROUP BY sometable_id, sometable_name, sometable_phone;
HAVING COUNT(*) = 1 ;
INTO TABLE sometable_cdc.dbf

As you can see, a combination of using the staging area (curr and prev) and some SQL can give you changed data with minimal effort. This method is fairly efficient because it is done in the staging area where you could isolate the process on a dedicated a machine.

To use a checksum to do a comparison, please take a look at my post “Sys(2017) and Comparing Data“. I go into detail on how this can be accomplished.

From here

There are several avenues to follow when designing a CDC solution. Moreover, each source will likely demand a different approach. Figuring this out ahead of time is essential for the success of the integration system.

The next ETL Subsystem is [Data] Extraction. Once the source system has been profiled and a mechanism for CDC has been developed, the ETL team can begin the process of extracting the changed data from the source.

Published December 7th, 2007

ETL Subsystem 1: Data Profiling

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data profiling is the most important step in designing a data warehousing solution. Kimball identifies it as Subsystem 1 for good reason. A Data profile forms the foundation for all phases of data integration. As one author writes, without a data profile you end up in the cycle of code, load, and explode! Trust me (and him, and everyone else) on this: Invest in Data Profiling to ensure integration success.

I have already talked about the data profile in my posts “Data Profiling” and “Getting Started with Data Profiling“. I won’t beat a dead horse here, so please take a moment to review those two short posts for more details. Here is the definition that I have been using:

Data Profiling
Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.

What’s more is that data profiling is dynamic in that it must be repeated often during the data warehousing life cycle. Attribute domains and ranges can change, for example. But also consider a situation where the source system goes through a version upgrade. The developers of the system might have added some new data elements or cleaned up some quality issues. You will need to re-run your data profile and reconcile any differences with your current integration. Without this, you would be left with only the documentation provided which is often not enough.

SQL Server 2005 Integration Services (SSIS)

Informatica, Oracle, DataFlux (SAS), IBM and many others have dedicated profiling tools. Microsoft does not (do a Google search for “Data Profiling Tools” and you won’t see Microsoft anywhere on the map). So if it is so important, why doesn’t Microsoft provide a data profiling tool? Good question.

In SQL Server 2008, there promises to be a new SSIS task called the Data Profiling Task (and accompanying Viewer). This task will certainly help in designing profiling and quality solutions. Although I have not evaluated it yet, I suppose it will take the place of using the hodgepodge of other tasks to accomplish domain and range-related profiling functions. You can read more about it here at Stacia Misner’s Blog. The SQL Server 2008 November CTP, which contains the task is located here.

For the rest of us using SQL Server 2005 in a production environment, you can use SSIS as a data profiling tool with some effort. The process involves utilizing the Script, Row Count, Multicast, Aggregate, and Conditional Split tasks among others in a data flow. The Script Component is very useful in this regard, especially for being able to use regular expressions.

Using SSIS in this way also allows you to discover table structures in the source system if available through the connection type. For relationships, you would need to examine the database schema manually.

Through SSIS, you would run a profiling package on each table or common group of tables and log the results to a specialized database for analysis. Tons of very useful metadata will be produced that can form the basis of your data model and logical data map. None of this comes automatically. You will need to develop this system from the ground-up.

I get the feeling that Data Profiling was an afterthought for the design team at Microsoft. I also get the feeling that they’re finally waking up and will begin to provide important profiling tools in subsequent releases. I don’t think that their offering in 2008 will make much difference though for those serious about data profiling.

Hand Coding with Visual FoxPro (VFP9)

Personally, I feel that your best bet for data profiling is to purchase a data profiling tool from a vendor with a proven track record and with good support. Hand coding ETL is difficult enough as it is, but hand-coding a profiler is quite a challenge in its own right. Data profiling is much more than simply running some SQL queries on a table, and it goes well beyond determining candidate keys and identifying nullable columns. Hand coding this type of application requires research, development time, and plenty of quality controls.

With that said, I have a sample profiling application available in VFP9 that you can download here (click on “VFP Data Profiler”). It is very basic, but should give you a feel for what you will need to do to get started.

If you could choose a language to use, I would select FoxPro over anything else I’ve used (PHP, C++, C#, VB). It can shred XML, process strings blazingly fast, and its database container makes for a perfect metadata and staging repository. You can do more for less in VFP9 than you can do using many other languages or even SSIS.

From here

Data Profiling is an ongoing effort, but when the initial round is complete, the development of the datawarehouse shifts to designing the data model (for analytical data warehousing applications it would be a dimensional model). At the same time, quality issues are being addressed and the ETL team begins to prepare the groundwork for the next subsystem: Changed Data Capture (CDC).