Archive for December, 2007

ETL Subsystem 3: Data Extraction

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

One of the major pieces in the Data Integration puzzle is data extraction. Simply put, data extraction is the process of taking data from some source system so it can be used in some other process. Usually, in an ETL environment, extracted data is staged for further processing (staging is important for lineage and recovery purposes).

Extracting comes in a few different flavors:

Push
The source system ‘pushes’ the data into the ETL system (through an export, direct write into a staging area, etc.)
Pull
The ETL process ‘pulls’ the data from the source system (usually using SQL to read the data)
Dribble
For real-time/right-time systems, this mechanism makes data available to the ETL system quite frequently throughout the day, possibly through database triggers or message queues (see MS Message Queuing for more details)

Depending on the source system, the extraction method may differ. Likewise, depending on the purpose of the process (a complete historical load versus incremental loads), the connection may differ. It is imperative that the decisions on how to extract data from source systems are well documented in the ETL technical documentation. That documentation, together with the data mappings, will aid integration specialists when developing the extract system and methodology.

Furthermore, you should strive to extract only relevant data. Extra, non-essential data will take up additional disk space and memory during processing. This is especially true in SSIS where memory is quite important.

In the perfect world

Ideally, the source system will push data to the ETL system on some pre-determined, business-relevant interval (hourly, daily, weekly, etc.). The source system will be responsible for determining the changed data; however, you should never fully trust source systems and always build a CDC system (see my previous post ETL Subsystem 2: Changed Data Capture for more information) to pick up any changed data missed by the source system.

The ideal extraction strategy for Pull systems relies on trigger-maintained insert and last-update dates on each row in each table in the source. Pulling the data by date range — perhaps one subset for inserts and another for updates — is quite efficient, and as long as the database itself supplies the dates through triggers, you can use this method with confidence. If the source system uses application logic to populate these dates, steer clear; you’re asking for trouble. There is always at least one rogue update process that will fail to set the correct date!
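To make that concrete, here is a minimal VFP sketch of a date-range pull via ODBC SQL pass-through. The DSN name, the orders table, and the date columns are all hypothetical; substitute your own source schema and read the watermark from wherever you track your last successful run.

*-- Minimal sketch: incremental pull by date range via SQL pass-through.
*-- "SourceDSN", the orders table, and the trigger-maintained date columns
*-- are illustrative; adjust to your source schema.
LOCAL lnHandle, ltLastRun
ltLastRun = {^2007-12-01 00:00:00}   && normally read from an ETL control table

lnHandle = SQLCONNECT("SourceDSN")
IF lnHandle > 0
    *-- one subset for inserts...
    SQLEXEC(lnHandle, ;
        "SELECT * FROM orders WHERE date_inserted >= ?ltLastRun", "c_inserts")
    *-- ...and another for updates to pre-existing rows
    SQLEXEC(lnHandle, ;
        "SELECT * FROM orders WHERE date_updated >= ?ltLastRun " + ;
        "AND date_inserted < ?ltLastRun", "c_updates")
    SQLDISCONNECT(lnHandle)
ENDIF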

Realistically though

Things are usually less than ideal. Your CDC strategy will help determine how you acquire extract data and what to do with it once you have it. Additional ways to access extract data include vendor-supplied APIs, web services, direct data access, file-based reports, the transaction logs, or an external third-party tool. Be wary of difficult-to-access legacy systems, especially if the resident experts are beginning to ask about retirement!

It is also possible to use database replication. A Publisher can distribute transactional data to any Subscribers such as an OLTP or ODS server and your ETL system. This scenario is best in a controlled environment where the IT department can manage the replication across systems.

Data Extracted. Now What?

I prefer to stage extracted data for later use (even if ‘later’ is only a few seconds away). You acquire the data from the source and drop it off on disk in some native format (DBF files for FoxPro, RAW files for SSIS). This lets you return to this step if problems are encountered downstream, lets downstream processes wait for some other condition before continuing, preserves data lineage, and supports recovery in case of failure. You do not want to have to go back to the source to re-capture this data: it isn’t efficient, and a lot can happen to the source data while the ETL process fumbles around.
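In VFP terms, the drop-off step can be as simple as the following sketch: keep the previous run around for CDC comparisons, then persist the fresh pull as a native DBF. The folder name, file names, and the c_inserts cursor (from a pull like the one sketched earlier) are all illustrative.

*-- Minimal staging sketch: keep the previous run (curr -> prev) and persist
*-- the fresh pull as a native DBF. Folder and names are illustrative.
LOCAL lcStage
lcStage = "c:\etl\stage\"

IF FILE(lcStage + "orders_prev.dbf")
    DELETE FILE (lcStage + "orders_prev.dbf")
ENDIF
IF FILE(lcStage + "orders_curr.dbf")
    RENAME (lcStage + "orders_curr.dbf") TO (lcStage + "orders_prev.dbf")
ENDIF

SELECT c_inserts                     && cursor produced by the extract step
COPY TO (lcStage + "orders_curr")    && lands on disk as orders_curr.dbf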

SQL Server 2005 Integration Services (SSIS)

Data extraction in Integration Services is rather easy, and there are many different ways to connect to source systems. .NET Providers, for example, give you access to many types of systems including Informix Dynamic Server, SAP, and Sybase. The source components include:

  • DataReader Source: Allows you to read data from a .NET Provider (see the ADO.NET team blog for more details) using the SqlCommand property. Tip: the DataReader usually performs better than OLE DB, even for SQL Server. Also, be aware that you can mark data as sorted for further processing by setting the IsSorted property to True on the Input and Output Properties tab. Remember to set the SortKeyPosition for each Output Column involved in the sort as well.
  • OLE DB Source: Use OLE DB to retrieve data from a variety of sources. You can use this adapter to request a result from a table, view, stored procedure, or SQL query. I like to develop my queries in SSMS or the source system’s editor (MySQL Query Browser or VFP IDE for example) if I have it. This helps to get any syntax quirks out of the way. SSIS is a bit lacking when it comes to writing queries using the provided edit box.
  • Excel Source: Simplifies access to MS Excel data. I prefer to stay away from using Excel as a data source for incremental loads because it is too easy for the user to break the process by changing the format of the file.
  • Flat File Source: A great adapter for reading flat files (fixed, delimited, ragged right). You will actually set up most of the information in the flat file connection and not in the source adapter. I use the flat file adapter more than any other particularly because it is easy to get data from source systems in a flat file. XML is slowly taking over, however.
  • XML Adapter: Used to read XML data and help convert it to a relational form. Use this adapter to talk to XML, either via file or direct data from a variable, shred the collections and hierarchies, and produce multiple outputs. I even use this adapter to shred XML for other purposes (non-data integration purposes) because it is so easy. Tip: Accept XML with XSD schema definitions only; this will save you a ton of time and effort profiling the XML for determining destination details.
  • Raw File Source: Internal SSIS format that can be used to stage data or exchange data among packages. This is a very fast mechanism to read from because the format of the raw file is the same format that SSIS uses in memory; SSIS simply loads the raw data and you’re off to the races.

I also make heavy use of the FTP Control Flow Task and have also used the Web Service Control Flow Task to retrieve data. Both of these tasks are relatively easy to set up and use.

Also, for real-time systems (using the dribble extraction approach), you will likely be looking into the Message Queue Task and WMI Event Watcher Task. I have not had the privilege of working in a dribble environment (yet), so I don’t have much more to offer at this time.

Hand Coding with Visual FoxPro (VFP9)

Of course, much of the above is easily accomplished in Visual FoxPro. You won’t have a colorful drag-and-drop editor to work with, but developing your own classes to accomplish what SSIS can do is rather trivial (technically, you can then use the Toolbox panel and get some drag and drop if you want). With the available adapters (ADO, ODBC, XML, native), and the ability to read and append from files directly, FoxPro is well positioned as a data extraction tool. Not only that, but you can use the native DBF format as a staging format, making downstream operations that much faster (don’t stage to flat files when you can stage into native tables in a staging database).
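As a small illustration of those native options, here is a rough sketch that pulls a delimited extract and an XML extract into staging tables. The file names, field lists, and sizes are illustrative.

*-- Rough sketch of two native ingestion paths. File names, field lists, and
*-- sizes are illustrative.

*-- 1) Flat file: append a delimited extract straight into a staging table
CREATE TABLE stage_customers FREE (cust_id C(10), cust_name C(40), phone C(20))
APPEND FROM "c:\etl\inbox\customers.csv" TYPE CSV

*-- 2) XML: shred an XML extract into a cursor, then persist it as a DBF
XMLTOCURSOR("c:\etl\inbox\customers.xml", "c_customers", 512)   && 512 = source is a file
SELECT c_customers
COPY TO "c:\etl\stage\customers_curr"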

I outlined a rather comprehensive data extraction and staging strategy in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy“. I also talked at length about certain techniques in my Fox Forward and AFUG presentations earlier this year (2007 Fox Forward Presentation Examples).

From here

Once data is extracted and sitting in a staging area (or in memory if you don’t use a physical stage), you can move on to cleaning it. The 4th subsystem is just that: The Data Cleansing Subsystem.


ETL Subsystem 2: Changed Data Capture

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Managing changed data is a major challenge. Sometimes referred to as “incremental updates”, Changed Data Capture is essential for data warehousing operations, as it keeps data volume low and helps to manage the synchronization between the source and the warehouse.

Without a Changed Data Capture methodology, you would be forced to blow away and reload your data warehouse every time you needed it refreshed. If you only need your data warehouse refreshed once a month, this might be OK, but that would be rare.

Here’s the definition that I’ll be using:

Changed Data Capture
Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.

Accomplishing this is easier said than done. In some cases you will need to trust the source system to adequately identify its changes; in others you’ll need to figure it out on your own using a brute-force row-by-row comparison. If you have access to, and can work with, the transaction logs of the source system, then this route may be best.

If you can trust the source system, you may be relying on a “last updated” datetime stamp stored in an audit table or in the row of the table that contains the change. You may also have some versioning information available. In this scenario, you rely on the source system to tell you when a change was made. Your ETL solution will look for these changes and only select that data for further processing.
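A minimal sketch of that pattern in VFP follows, assuming a local etl_control table that remembers the high-water mark and a source customers table with a last_updated column (all names are illustrative).

*-- Minimal sketch: pull only rows stamped after the last successful run.
*-- etl_control, the customers table, and the column names are illustrative.
LOCAL lnHandle, ltHighWater

USE etl_control IN 0 SHARED
SELECT etl_control
LOCATE FOR ALLTRIM(table_name) == "customers"
ltHighWater = etl_control.last_extracted      && datetime of the last good run

lnHandle = SQLCONNECT("SourceDSN")
IF lnHandle > 0
    SQLEXEC(lnHandle, ;
        "SELECT * FROM customers WHERE last_updated > ?ltHighWater", "c_changed")
    SQLDISCONNECT(lnHandle)

    *-- advance the mark only after the changed rows are safely staged
    REPLACE last_extracted WITH DATETIME() IN etl_control
ENDIF
USE IN etl_control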

The danger is that you are trusting a source system to report all changes faithfully. If you can access the logs, or if the changes are logged using a database trigger, then your trust is likely well placed. If the source relies on application logic to maintain this information, you should consider a different approach. It is quite possible for a database to change through a back door, bypassing the application logic that would normally track the change. Steer clear.

If the source cannot tell you what has changed (or you simply don’t trust it), then you will need to do a comparison. Depending on the size of the project, it usually makes the most sense to do this type of comparison in the staging area — avoiding any contact with your dimensions in the dimensional model. Touching the dimension for this purpose would result in a performance hit and could affect any analysis or queries currently being run against the data. In the “Hand Coding” section below, I outline a brute-force method that relies on a staging scheme and some SQL to determine changed data for further processing.

In general, the CDC process involves the following steps:

  1. Identify changed rows if possible
  2. Do a lookup on the dimension or staging table for the natural key of the record in question
  3. For Dimensions, handle Type 1 and Type 2 changes using a Script component (see Dimensional Modeling: Loading The Slowly Changing Dimension for the logic behind this).
  4. Perform the updates and inserts
  5. Handle deletes through negation techniques


SQL Server 2005 Integration Services (SSIS)

SQL Server 2005 lets you do all CDC work within SSIS. For example, if the source system reports changes, you can write an extract (utilizing variables set during package execution) that looks for the specific change indicators, whether that is a date range, a version, or some status flag.

For brute force comparisons on Dimension tables, you can utilize the Slowly Changing Dimension task. I have already discussed the SCD at length in my post “Dimensional Modeling: Loading The Slowly Changing Dimension“, so please read that if you need more information about how a SCD is loaded and maintained. For the purposes of CDC, the only change I would make to that post is to use a staging area for the task instead of the production database. This will allow you to focus processing efforts on identifying changes, filtering out any unnecessary data in the process which will speed up processing down the line.

If the Slowly Changing Dimension task won’t work in your scenario (perhaps you don’t have dimensions available to check for new and changed rows or you have very, very large dimensions which the SCD task is not great at handling), then you can use SQL, checksums (see System.Security.Cryptography for more information), or a variety of other SSIS transforms to do the trick. In the next section, I show how SQL can be used.

Update: SQL Server 2008 Enterprise Edition will include a CDC feature for tables (for DML only). I’m not too thrilled about this new feature, and in fact, I’m not sure it will really work in a true ETL environment. But I have not evaluated it, so I could be way off base. There are several blog postings out there with more information.

Hand Coding with Visual FoxPro (VFP9)

In the following code example, I present a brute-force CDC solution in VFP. This solution, as discussed in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy” and at FoxForward this past September, relies on a pair of VFP tables in a staging directory: sometable_curr, which holds the fresh data, and sometable_prev, which holds the data from the previous run. Comparing these two data sets for changes is fairly straightforward.

The code for this is located in my 2007 Fox Forward Presentation Examples (a zip file containing the PRGs). Specifically, open the program files “4 – extract.prg” and “5 – changed data capture.prg”. The examples use the Northwind database.

Here are the bare bones (refer to the programs above for more details):


*-- open the current and previous tables
USE sometable_curr IN 0 
USE sometable_prev IN 0 
 
*-- do the compare using UNION and GROUP BY. 
SELECT MIN(source) as source, sometable_id,;
    sometable_name, sometable_phone ;
FROM ( ;
  SELECT "curr" as source, sc.sometable_id, ;
      sc.sometable_name, sc.sometable_phone ;
    FROM sometable_curr sc ;
  UNION ;
  SELECT "prev" as source, sp.sometable_id,;
      sp.sometable_name, sp.sometable_phone ;
  FROM sometable_prev sp ;
) as currprev ;
GROUP BY sometable_id, sometable_name, sometable_phone;
HAVING COUNT(*) = 1 ;
INTO TABLE sometable_cdc.dbf

As you can see, a combination of the staging area (curr and prev) and some SQL can give you changed data with minimal effort. This method is fairly efficient because it is done in the staging area, where you could isolate the process on a dedicated machine.
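As a rough follow-on (assuming the result table from the example above), the source column tells you what kind of change you are looking at: rows present only in the current extract are new rows or the new version of changed rows, while keys present only in the previous extract have been deleted at the source. The apply logic downstream is simplified away here.

*-- Rough follow-on sketch: split the CDC result into upserts and deletes.
*-- Assumes sometable_cdc from the example above is available.

*-- new rows plus the new version of changed rows
SELECT * FROM sometable_cdc ;
    WHERE source == "curr" ;
    INTO CURSOR c_upserts

*-- keys that disappeared from the source entirely
SELECT * FROM sometable_cdc ;
    WHERE source == "prev" ;
    AND sometable_id NOT IN ;
        (SELECT sometable_id FROM sometable_cdc WHERE source == "curr") ;
    INTO CURSOR c_deletes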

To use a checksum to do a comparison, please take a look at my post “Sys(2017) and Comparing Data“. I go into detail on how this can be accomplished.
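For a flavor of that approach, here is a minimal sketch: SYS(2017) returns a checksum for the current record, so matching natural keys with differing checksums identify changed rows. The index tag and field names are illustrative; the referenced post covers the details.

*-- Minimal checksum sketch: compare SYS(2017) record checksums between the
*-- current and previous extracts. Assumes an index tag on the natural key.
LOCAL lcCurrSum, lcPrevSum

USE sometable_curr IN 0
USE sometable_prev IN 0 ORDER sometable_id

SELECT sometable_curr
SCAN
    lcCurrSum = SYS(2017)                    && checksum of the current record
    IF SEEK(sometable_curr.sometable_id, "sometable_prev")
        SELECT sometable_prev
        lcPrevSum = SYS(2017)
        SELECT sometable_curr
        IF !(lcCurrSum == lcPrevSum)
            *-- changed row: hand off for update processing
        ENDIF
    ELSE
        *-- new row: hand off for insert processing
    ENDIF
ENDSCAN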

From here

There are several avenues to follow when designing a CDC solution. Moreover, each source will likely demand a different approach. Figuring this out ahead of time is essential for the success of the integration system.

The next ETL Subsystem is [Data] Extraction. Once the source system has been profiled and a mechanism for CDC has been developed, the ETL team can begin the process of extracting the changed data from the source.


Database Design: 4th and 5th Normal Forms

I had the pleasure of speaking with Andrew MacNeill of The FoxShow last week for an interview. We discussed database normalization and denormalization in relational databases. I’d like to thank Andrew for the opportunity to discuss this very important database design topic. He contacted me after a listener asked for a discussion on normalization, and having seen my recent posting “The Role of Good Database Design and Normalization for Data Integration Projects“, Andrew felt that I would be up to the challenge.

You can listen to the Interview here: FoxShow #47: Data Normalization with FoxPro and Tod McKenna.

I promised during the interview to present the 4th and 5th normal forms here on Tod means Fox, because trying to explain these forms without pictures and examples is very difficult. I would also like to confess (shame on me!) that I didn’t take the time beforehand to prepare solid examples for these forms, so my explanation of them in the interview was a bit lacking. I hope to clarify what I said here.

4th Normal Form

In my previous post, I defined the 4th Normal Form as a form that seeks to “remove all multi-valued dependencies”. So what exactly does this mean? A multi-valued attribute is one that can take several values for the same entity; to represent it in a single table, those values must be repeated across multiple rows. Consider this real-world scenario:

[Image: example table with Town, Region, and Congressional District columns]

The Town, Region, and (Congressional) District columns together form the candidate key. In this example, Providence has a split congressional district and exists in multiple regions. Lincoln also exists in multiple regions but has only a single district. The above table meets Boyce-Codd Normal Form because all three attributes are part of the candidate key (read more about BCNF here).

This type of table is called a ‘cross-product’, where data is repeated. It is not in 4th normal form because of this repeated data. To fix it, you would create two separate tables:

[Image: the decomposition into two tables, Town/Region and Town/District]
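If it helps, the decomposition can be sketched as two free tables in VFP DDL (names and field sizes are illustrative):

*-- 4NF sketch: one table per independent multi-valued fact.
CREATE TABLE town_region   FREE (town C(30), region   C(30))
CREATE TABLE town_district FREE (town C(30), district C(10))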

In 4th normal form, you reduce the need for redundant inserts, updates, and deletes, and therefore you eliminate the possibility of data anomalies. It is reasonable to strive for 4th normal form as a baseline in your designs. In some cases, your database will take up less space. The cost of normalizing this far is only in query performance, as you’ll need additional joins to report on the data. But in my opinion, data integrity (especially if you plan on re-purposing your data for data warehousing and data integration projects) should always take precedence in OLTP systems. With that said, the 5th Normal Form goes even further:

5th Normal Form

The 5th Normal Form is a “projection-based normalization approach where every nontrivial join dependency is implied by a candidate key”. In the previous example, I showed how one table became two, reducing the risk of data anomalies. 5th normal form could split a single table into three or more.

Let’s say you have a table with store, supplier, and product. The table contains a row for each store, repeated for each supplier the store uses, and again for each product that the supplier provides to that particular store (a store might not sell all the products a supplier provides). All fields in this table are valuable and necessary to represent the row, and the table is in 4th Normal Form. In fact, this is a common type of table that could represent a many-to-many-to-many relationship among stores, products, and suppliers. But notice that there is repeating data:

[Image: a single store/supplier/product table with repeating data]

In 5NF, this table would become three: a store_to_prod table, a supp_to_prod table, and finally a store_to_supp table.

[Image: the three decomposed tables: store_to_supp, supp_to_prod, and store_to_prod]

When a store picks up a new supplier, a new row is added to the store_to_supp table, eliminating the redundancy of extra rows for product. When a store sells a new product, the supp_to_prod table will let you know where to get it. A simple example, yes — but it demonstrates 5th normal form.
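Sketched as VFP DDL, the 5NF decomposition is simply the three pairwise relationship tables (names and field sizes are illustrative):

*-- 5NF sketch: three pairwise relationship tables in place of one three-way table.
CREATE TABLE store_to_supp FREE (store_id C(10), supp_id C(10))
CREATE TABLE supp_to_prod  FREE (supp_id C(10), prod_id C(10))
CREATE TABLE store_to_prod FREE (store_id C(10), prod_id C(10))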

As you can see, the number of relations and tables grows as you normalize deeper and deeper.

Denormalizing

As discussed in the interview, at a certain point it is logical to push back and denormalize. You would do this for performance and manageability purposes, or because the design of the database dictates it. In the previous example, it might not be practical from a development standpoint to bring all tables to 5th form. However, you should weigh the decision to denormalize carefully. Consider the purpose of the database. If it is a transaction-based database with many data access points, you will want to normalize as much as possible to ensure the integrity of the data. If the primary purpose of the database is for ad-hoc queries, then backing off might be well received.

As I said in my previous post, normalization is an art form. I find it difficult to explain at times, but relatively easy to implement. I hope that this post is helpful in some way, and if I have made any errors, or if you don’t agree with any part of this discussion please feel free to contact me!
