Posts Tagged Kimball

Data Warehouse Lifecycle in Depth

I’m starting to get excited about next week. I’m leaving Brussels for Amsterdam at about 5pm on Monday to attend a four day course at Kimball University. Subject: The Data Warehouse Lifecycle in Depth. Instructors: Margy Ross & Warren Thornthwaite. I’m looking forward to diving into the DW Lifecycle with industry experts: those who literally wrote the book. Ralph Kimball won’t be there but I trust that Margy and Warren will give us a great week of sessions.

What’s really, really cool is that I’ll be getting the new edition of “The Data Warehouse Lifecycle Toolkit“:

What better way to start 2008 than with a major new data warehousing book! Joy, Warren, Margy, Bob and I have been working on the second edition of The Data Warehouse Lifecycle Toolkit for most of the last year. Virtually every paragraph of the first edition was changed, and more than 60% of the book is new material. The happy looks on our faces are both pride in our new baby, as well as relief from a long writing and editing stint.

I plan on blogging from the event, so if anyone is interested in getting some insight on any of the sessions, please let me know. I’ll be sure to take especially good notes!

Tags: , , , ,

3 Comments

ETL Subsystem 4: Data Cleansing

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data Cleansing is a Data Quality process that is designed to catch and resolve data quality violations. Proper data cleansing will add tremendous value to the data, giving your business users plenty of reasons to trust it. Sometimes referred to as ‘scrubbing’, this subsystem is a must for all Data Warehousing projects.

Most quality issues will be discovered during data Profiling, which I discussed here. Other issues will be identified by business users who have had to deal with problem data in the past. Some cleansing operations will be performed because of Master Data initiatives (for example, converting customer codes to universal codes for comparisons).

Cleansing data can consist of tasks like the following:

  1. Parsing city, state, and zip from an address string
  2. Changing the case of a person’s last name to UPPER
  3. Finding the delivery point from a street address and zip code
  4. Stripping out unwanted characters from text
  5. Swapping NULLs for some other value suitable for the DWH
  6. Deriving a value from one or more fields to create a new field
  7. Concatenating two or more fields into one (for example, turning day, month, year fields into a date)

Obviously, the above list is not exhaustive. There are myriad ways to improve the quality of data during integration. The important thing to keep in mind is that this subsystem, more than any other, actually seeks to change the data. As quality is the driving factor, these changes should always be for the better and serve some business purpose. Any change to the data should add value and confidence.

Garbage in, Garbage out

Think of data cleansing as the process of cleaning up all that garbage that got in in the first place. OLTP systems (especially legacy systems that have been kicking around for a few decades) are notorious for doing some strange things with the data. I have encountered cases where fields were re-purposed (“hey we’re not using store_id2 anymore, so now we have some place to put the manager’s phone number!”), algorithms re-written (but historical data not recalculated or stamped with a version), and nulls represented as anything but NULL (99, -99, -1, “N/A”, etc…).

Not only that, but it’s likely that you have run into cases where users were able to free-form type into data entry fields. For example, a list of valid products might include “Widget A” and “Widget B”, but because a user can type what they want, you get “WedgetA” and “WDGT-B”. Of course, from an application architecture point of view, this is a terrible design problem. Anyway, as a data integrator, expect this and worse.

One man’s junk is another man’s treasure

But there is light. Many of these quality problems can be corrected outright through integration, while others can be addressed directly in the source OLTP system. Many home-grown or custom applications compensate for bad data in some way (lookup/mapping tables, loose database constraints, routines that go back and ‘fix’ the data, etc…). Through ETL, you will have to do similar work.

What might seem like worthless data is oftentimes salvageable. In fact, I have been involved in projects where large chunks of data were simply discarded because the business users felt it was unworkable. After some tinkering, it turned out to be far from the truth. Not only was some of the data useful, but we were able to take our ‘dirty data proof’ back to the application developers to get the application fixed (fixing DQ issues at the source is always preferable).

SQL Server 2005 Integration Services (SSIS)

SSIS offers a great set of tasks for cleansing data. If you add the OLEDB Command (for SQL code) and Script Component (for VB code) to the list below, then there really are no limits to what can be achieved. My only criticism is that using expressions in SSIS is quite difficult at first and takes some getting used to (see Integration Services Expression Language). The language and syntax is native to SSIS. On the plus side, the expression builder dialog box is helpful. Expressions are used throughout SSIS, but especially in the Derived Column task.

Typically, data cleansing will involve one or more of the following SSIS tasks:

  • Lookup Transform – performs a lookup using a join to another table, adding columns from the lookup to the data flow
  • Character Map – for string data, this transform applies a variety of string functions to the input column (changing case, for example)
  • Copy/Map – create a copy of a column
  • Data Conversion – convert data types from one to another
  • Derived Column – use this task to create or replace existing data by concatenation, exploding, applying calculations, and doing comparisons
  • Fuzzy Lookup – a component you can use for inexact matching
  • Fuzzy Grouping – uses algorithms to match or group disparate data to a configurable degree of accuracy

In future posts, I will demonstrate how many of these tasks work (likely through a SQL Server 2008 lens).

Hand Coding with Visual FoxPro (VFP9)

I’ve done a ton of VFP-related data cleansing over the past decade. Of course, VFP doesn’t have any fancy squares that turn green when a cleansing operation succeeds (or red when it fails!), but the language itself will give you everything you need to start cleansing data with little effort. I started trying to count the number of different commands you could use for cleaning in VFP. I stopped when I hit 50. There are likely more than a 100. Commands like SUBSTRING, ALLTRIM, CHRTRAN, LEFT, RIGHT, OCCURS, AT, LOWER, PROPER, SOUNDEX, LIKE, TRANSFORM, etc. Consult with the Visual FoxPro documentation or resources such as The Visual FoxPro Wiki for more details regarding VFP’s language features. I know this is a bit of a cop-out on my part but hey, I could spend days trying to summarize the capabilities.

What I can do is share some data cleansing / VFP implementation experience with you. The Logical Data Map (LDM) is a mapping document (sometimes done in Excel or relational database if you prefer) that tells you all of the source-to-target mappings for the integration. In VFP, you can create a VFP table (or tables) and embed cleansing code snippets into memo fields. This code can execute as part of a metadata-driven cleansing engine. I demonstrate how this can work in my articles for Advisor media. You can see some examples in my FoxForward material as well (see: 2007 Fox Forward Presentation Logical Data Map Database).

From here

Both SSIS and VFP have great data cleansing capabilities; in both systems, data cleansing is easy. In VFP, you can expect a mature language and blazing speeds (FoxPro is excellent when working with string data). SSIS makes it very easy to derive columns and change data types. SSIS can also make use of multiple processors and has some good built-in logging functionality. In the end, I prefer VFP because it is much easier to work with FoxPro’s languange (as oposed to VB and the SSIS expression language). Also, in VFP I find it much easier to manage metadata and the LDM.

In the next article, I’ll discuss the Error Event Management subsystem and how VFP and SSIS can deal with that!

Tags: , , , , , , , ,

6 Comments

ETL Subsystem 3: Data Extraction

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

One of the major pieces in the Data Integration puzzle is data extraction. Simply put, data extraction is the process of taking data from some source system to be used in some other process. Usually, in an ETL environment, extracted data is staged for further processing (staging is important for lineage and recovery purposes.).

Extracting comes in a few different flavors:

Push
The source system ‘pushes’ the data into the ETL system (through an export, direct write into a staging area, etc.)
Pull
The ETL process ‘pulls’ the data from the source system (usually using SQL to read the data)
Dribble
For real-time/right-time systems, this mechanism makes data available to the ETL system quite frequently throughout the day, possibly through database triggers or message queues (see MS Message Queuing for more details)

Depending on the source system, the extraction may be different. Also, depending on the purpose of the process (complete historical load verses incremental loads) the connection may be different. It is imperative that the decisions on how to extract data from source systems are well documented in the ETL technical documentation. This document will aid integration specialists, together with the data mappings, when developing the extract system and methodology.

Furthermore, you should strive to extract only relevant data. Extra, non-essential data will take up additional disk space and memory during processing. This is especially true in SSIS where memory is quite important.

In the perfect world

Ideally, the source system will push data to the ETL system on some pre-determined, business-relevant interval (hourly, daily, weekly, etc.). The source system will be responsible for determining the changed data; however, you should never fully trust source systems and always build a CDC system (see my previous post ETL Subsystem 2: Changed Data Capture for more information) to pick up any changed data missed by the source system.

The ideal extraction strategy for Pull systems relies on trigger-level insert and last update dates on each row in each table in the source. Pulling the data by date range — perhaps one subset for inserts and another for updates — is quite efficient and as long as the database is supplying the dates through triggers, you can use this method with confidence. If the source system uses application logic to populate these dates: steer clear. You’re asking for trouble. There is always at least one rogue update process that will fail to manually insert the correct date!

Realistically though

Things are usually less than ideal. Your CDC strategy will help determine how you acquire extract data and what to do with it when you do. Some additional ways to access extracted data including using vendor-supplied APIs, web services, direct data access, reading file-based reports, tapping into the transaction logs, or using some external 3rd Party tool. Be concerned with difficult-to-access legacy systems, especially if the experts are beginning to ask for retirement!

It is also possible to use database replication. A Publisher can distribute transactional data to any Subscribers such as an OLTP or ODS server and your ETL system. This scenario is best in a controlled environment where the IT department can manage the replication across systems.

Data Extracted. Now What?

I prefer to stage extracted data for later use (even if ‘later’ is only a few seconds away). You acquire the data from the source, and drop it off on disk in some native format (DBF files for FoxPro, RAW files for SSIS). This allows you to return to this step if problems were encountered downstream, if downstream processes need to wait for some other condition before continuing, for data lineage purposes, and for recovery in case of failure. You will not want to have to go back to the source to re-capture this data. It isn’t efficient and a lot can happen to the data while the ETL process fumbles around.

SQL Server 2005 Integration Services (SSIS)

Data extraction in Integration Services is rather easy. There are many different ways to connect to source systems. Dot NET Providers, for example, give you access to many types of systems including Informix Dynamic Server, SAP, and Sybase. Connection types included:

  • DataReader Source: Allows you to read data from a .NET Provider (see the ADO.NET team blog for more details) using the sqlcommand property. Tip: The DataReader usually performs better than OLE DB even for SQL Server. Also, be aware that you can sort data for further processing by specifying the IsSorted property = True on the Input and Output Properties tab. Remember to set the SortKeyPosition for each Output Column involved in the sort as well.
  • OLE DB Source: Use OLE DB to retrieve data from a variety of sources. You can use this adapter to request a result from a table, view, stored procedure, or SQL query. I like to develop my queries in SSMS or the source system’s editor (MySQL Query Browser or VFP IDE for example) if I have it. This helps to get any syntax quirks out of the way. SSIS is a bit lacking when it comes to writing queries using the provided edit box.
  • Excel Source: Simplifies access to MS Excel data. I prefer to stay away from using Excel as a data source for incremental loads because it is too easy for the user to break the process by changing the format of the file.
  • Flat File Source: A great adapter for reading flat files (fixed, delimited, ragged right). You will actually set up most of the information in the flat file connection and not in the source adapter. I use the flat file adapter more than any other particularly because it is easy to get data from source systems in a flat file. XML is slowly taking over, however.
  • XML Adapter: Used to read XML data and help convert it to a relational form. Use this adapter to talk to XML, either via file or direct data from a variable, shred the collections and hierarchies, and produce multiple outputs. I even use this adapter to shred XML for other purposes (non-data integration purposes) because it is so easy. Tip: Accept XML with XSD schema definitions only; this will save you a ton of time and effort profiling the XML for determining destination details.
  • Raw File Source: Internal SSIS format that can be used to stage data or exchange data among packages. This is a very fast mechanism to read from because the format of the raw file is the same format that SSIS uses in memory; SSIS simply loads the raw data and you’re off to the races.

I also make heavy use of the FTP Control Flow Task and have also used the Web Service Control Flow Task to retrieve data. Both of these tasks are relatively easy to set up and use.

Also, for real-time systems (using the dribble extraction approach), you will likely be looking into the Message Queue Task and WMI Event Watcher Task. I have not had the privilege of working in a dribble environment (yet), so I don’t have much more to offer at this time.

Hand Coding with Visual FoxPro (VFP9)

Of course, much of the above is easily accomplished in Visual Foxpro. You won’t have a colorful drag-and-drop editor to work with but developing your own classes to accomplish what SSIS can do is rather trivial (technically, you can then use the Toolbox Panel and get some drag and drop if you want). With the available adapters (ADO, ODBC, XML, Native), and the ability to read and append from files directly, FoxPro is well positioned as a data extraction tool. Not only that, but you can use the native DBF format as a staging format, making downstream operations that much faster (don’t stage to flat files when you can stage into native tables in a staging database).

I outlined a rather comprehensive data extraction and staging strategy in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy“. I also talked at length about certain techniques in my Fox Forward and AFUG presentations earlier this year (2007 Fox Forward Presentation Examples).

From here

Once data is extracted and sitting in a staging area (or in memory if you don’t use a physical stage), you can move on to cleaning it. The 4th subsystem is just that: The Data Cleansing Subsystem.

Tags: , , , , , , ,

3 Comments

ETL Subsystem 2: Changed Data Capture

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Managing changed data is a major challenge. Sometimes referred to as “incremental updates”, Changed Data Capture is essential for data warehousing operations, as it keeps data volume low and helps to manage the synchronization between the source and the warehouse.

Without a Changed Data Capture methodology, you would be forced to blow away and reload your data warehouse every time you needed it refreshed. If you only need your data warehouse once a month, this might be OK, but that would be rare.

Here’s the definition that I’ll be using:

Changed Data Capture
Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.

Accomplishing this is easier said than done. In some cases you will need to trust the source system to adequately identify its changes, in others you’ll need to figure it out on your own using a brute force row-by-row comparison. If you have access to, and can work with transaction logs of the source system, then this route may be best.

If you can trust the source system, you may be relying on a “last updated” datatime stamp stored in an audit table or in the row of the table that contains the change. You may also have some versioning information available as well. In this scenario, you rely on the source system to tell you when a change was made. Your ETL solution will look for these changes and only select that data for further processing.

The danger is that you are trusting a source system to report all changes faithfully. If you can access the logs, or if the changes are logged using a database trigger, than your trust is likely well placed. If the source relies on application logic to maintain this information, you should consider a different approach. It is quite possible that a database can change through a back-door, bypassing the application logic that would normally track the change. Steer clear.

If the source cannot tell you what has changed (or you simply don’t trust it), then you will need to do a comparison. Depending on the size of the project, it would seem to make most sense to do this type of comparison in the staging area — avoiding any contact with your dimensions in the dimensional model. Touching the Dimension for this purpose would result in a performance hit and could affect any analysis or queries currently being run against the data. In the “Hand Coding” section below, I outline a brute force method that relies on a staging scheme and some SQL to determine changed data for further processing.

In general, the CDC process involves the following steps:

  1. Identify changed rows if possible
  2. Do a lookup on the dimension or staging table for the natural key of the record in question
  3. For Dimensions, handle Type 1 and Type 2 changes using a Script component (see Dimensional Modeling: Loading The Slowly Changing Dimension for the logic behind this).
  4. Perform the updates and inserts
  5. Handle deletes through negation techniques


SQL Server 2005 Integration Services (SSIS)

SQL Server 2005 will allow you to do all CDC work within SSIS. For example, if the source system reports changes, you can write an extract (utilizing variables set during package execution) that will look for the specific change indicators whether it be a date range, version, or some status flag.

For brute force comparisons on Dimension tables, you can utilize the Slowly Changing Dimension task. I have already discussed the SCD at length in my post “Dimensional Modeling: Loading The Slowly Changing Dimension“, so please read that if you need more information about how a SCD is loaded and maintained. For the purposes of CDC, the only change I would make to that post is to use a staging area for the task instead of the production database. This will allow you to focus processing efforts on identifying changes, filtering out any unnecessary data in the process which will speed up processing down the line.

If the Slowly Changing Dimension task won’t work in your scenario (perhaps you don’t have dimensions available to check for new and changed rows or you have very, very large dimensions which the SCD task is not great at handling), then you can use SQL, checksums (see System.Security.Cryptography for more information), or a variety of other SSIS transforms to do the trick. In the next section, I show how SQL can be used.

Update: For SQL Server 2008 Enterprise Edition, there will be a CDC feature for tables (for DML only). I’m not too thrilled about this new feature, and in fact, I’m not sure if it will really work in a true ETL environment. But, I have not evaluated it, so I could be way off base. Check out these blog postings for more information:

Hand Coding with Visual FoxPro (VFP9)

In the following code example, I present a brute-force CDC solution in VFP. This solution, as discussed in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy” and at FoxForward this past September, relies on using VFP tables in a staging directory called sometable_curr (which represents the fresh data) and sometable_prev which represents the data from the previous run. Comparing these two data sets for changes is fairly straightforward.

The code for this is located in my 2007 Fox Forward Presentation Examples (zip file, prgs in file). Specifically, open program files “4 – extract.prg” and “5 – changed data capture.prg”. The examples use the Northwind Database.

Here are the bare-bones (refer to the programs above for more details)


*-- open the current and previous tables
USE sometable_curr IN 0 
USE sometable_prev IN 0 
 
*-- do the compare using UNION and GROUP BY. 
SELECT MIN(source) as source, sometable_id,;
    sometable_name, sometable_phone ;
FROM ( ;
  SELECT "curr" as source, sc.sometable_id, ;
      sc.sometable_name, sc.sometable_phone ;
    FROM sometable_curr sc ;
  UNION ;
  SELECT "prev" as source, sp.sometable_id,;
      sp.sometable_name, sp.sometable_phone ;
  FROM sometable_prev sp ;
) as currprev ;
GROUP BY sometable_id, sometable_name, sometable_phone;
HAVING COUNT(*) = 1 ;
INTO TABLE sometable_cdc.d
*-- open the current and previous tables
USE sometable_curr IN 0 
USE sometable_prev IN 0 

*-- do the compare using UNION and GROUP BY. 
SELECT MIN(source) as source, sometable_id,;
    sometable_name, sometable_phone ;
FROM ( ;
  SELECT "curr" as source, sc.sometable_id, ;
      sc.sometable_name, sc.sometable_phone ;
    FROM sometable_curr sc ;
  UNION ;
  SELECT "prev" as source, sp.sometable_id,;
      sp.sometable_name, sp.sometable_phone ;
  FROM sometable_prev sp ;
) as currprev ;
GROUP BY sometable_id, sometable_name, sometable_phone;
HAVING COUNT(*) = 1 ;
INTO TABLE sometable_cdc.d
*-- do the compare using UNION and GROUP BY. 
SELECT MIN(source) as source, sometable_id,;
    sometable_name, sometable_phone ;
FROM ( ;
  SELECT "curr" as source, sc.sometable_id, ;
      sc.sometable_name, sc.sometable_phone ;
    FROM sometable_curr sc ;
  UNION ;
  SELECT "prev" as source, sp.sometable_id,;
      sp.sometable_name, sp.sometable_phone ;
  FROM sometable_prev sp ;
) as currprev ;
GROUP BY sometable_id, sometable_name, sometable_phone;
HAVING COUNT(*) = 1 ;
INTO TABLE sometable_cdc.dbf

As you can see, a combination of using the staging area (curr and prev) and some SQL can give you changed data with minimal effort. This method is fairly efficient because it is done in the staging area where you could isolate the process on a dedicated a machine.

To use a checksum to do a comparison, please take a look at my post “Sys(2017) and Comparing Data“. I go into detail on how this can be accomplished.

From here

There are several avenues to follow when designing a CDC solution. Moreover, each source will likely demand a different approach. Figuring this out ahead of time is essential for the success of the integration system.

The next ETL Subsystem is [Data] Extraction. Once the source system has been profiled and a mechanism for CDC has been developed, the ETL team can begin the process of extracting the changed data from the source.

Tags: , , , , , , ,

2 Comments

34 Subsystems of ETL Data Integration

In this, and in the next series of posts, I will be exploring the 34 subsystems of ETL Data Integration (note: there used to be 38) as defined by the Kimball Group. I introduce the subsystems in this post, and then I will discuss how each fits (or does not fit) into both a SQL Server environment and in a hand-coded environment using VFP9.

The subsystem concept is a best-practice initiative formulated by The Kimball Group to help organizations design effective and efficient Data Integration environments for Data Warehousing using the Dimensional Model. As you’ll see, the subsystems are well-defined, battle-tested, and represent a relatively complete picture of what is needed for ETL Integration.

The Kimball Group categorizes the subsystems into 4 distinct groups: Data Extraction, Cleansing and Conforming Tasks, Data Delivery, and Management. Management is an often forgotten aspect of ETL (If the acronym was ETLM, we’d be OK!). You’ll notice that there are quite a few management-specific subsystems.

    Data Extraction

  1. Data Profiling
  2. Change Data Capture (CDC)
  3. Extraction
  4. Cleansing and Conforming Tasks

  5. Data Cleansing Subsystem
  6. Error Event Management
  7. Auditing
  8. Removing Duplicates
  9. Data Conformance
  10. Data Delivery

  11. Slowly Changing Dimensions (SCD)
  12. Surrogate Key Generator
  13. Hierarchy Manager
  14. Special Dimensions Manager
  15. Fact Table Builders
  16. Surrogate Key Management
  17. Bridge Table Builder
  18. Late Arriving Data Handler
  19. Dimension Manager
  20. Fact Table Provider
  21. Aggregate Generation
  22. OLAP Cube Builder
  23. Data Propagation Manager
  24. Management

  25. Scheduler
  26. Backup System
  27. Recovery and Restart
  28. Version Control
  29. Version Migration
  30. Work flow Monitor
  31. Sorting
  32. Data Lineage and Dependency
  33. Problem Escalation
  34. Paralleling and Pipelining
  35. Security
  36. Compliance Manager
  37. Metadata Repository

Data Integration architects have a lot to be thankful for the work and effort put forth by the Kimball Group. As you can probably guess, to do ETL right, you’ll need adequate time and resources. It is no wonder that data integration is by far the most time-consuming aspect of data warehousing.

In subsequent articles I will define each of these systems, provide examples, and discuss how they can be accomplished in SQL Server as well as VFP9. I have a lot of work to do, so stay tuned!

Tags: , , ,

43 Comments