This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Data Cleansing is a Data Quality process that is designed to catch and resolve data quality violations. Proper data cleansing will add tremendous value to the data, giving your business users plenty of reasons to trust it. Sometimes referred to as ‘scrubbing’, this subsystem is a must for all Data Warehousing projects.
Most quality issues will be discovered during data Profiling, which I discussed here. Other issues will be identified by business users who have had to deal with problem data in the past. Some cleansing operations will be performed because of Master Data initiatives (for example, converting customer codes to universal codes for comparisons).
Cleansing data can consist of tasks like the following:
- Parsing city, state, and zip from an address string
- Changing the case of a person’s last name to UPPER
- Finding the delivery point from a street address and zip code
- Stripping out unwanted characters from text
- Swapping NULLs for some other value suitable for the DWH
- Deriving a value from one or more fields to create a new field
- Concatenating two or more fields into one (for example, turning day, month, year fields into a date)
Obviously, the above list is not exhaustive. There are myriad ways to improve the quality of data during integration. The important thing to keep in mind is that this subsystem, more than any other, actually seeks to change the data. As quality is the driving factor, these changes should always be for the better and serve some business purpose. Any change to the data should add value and confidence.
Garbage in, Garbage out
Think of data cleansing as the process of cleaning up all that garbage that got in in the first place. OLTP systems (especially legacy systems that have been kicking around for a few decades) are notorious for doing some strange things with the data. I have encountered cases where fields were re-purposed (“hey we’re not using store_id2 anymore, so now we have some place to put the manager’s phone number!”), algorithms re-written (but historical data not recalculated or stamped with a version), and nulls represented as anything but NULL (99, -99, -1, “N/A”, etc…).
Not only that, but it’s likely that you have run into cases where users were able to free-form type into data entry fields. For example, a list of valid products might include “Widget A” and “Widget B”, but because a user can type what they want, you get “WedgetA” and “WDGT-B”. Of course, from an application architecture point of view, this is a terrible design problem. Anyway, as a data integrator, expect this and worse.
One man’s junk is another man’s treasure
But there is light. Many of these quality problems can be corrected outright through integration, while others can be addressed directly in the source OLTP system. Many home-grown or custom applications compensate for bad data in some way (lookup/mapping tables, loose database constraints, routines that go back and ‘fix’ the data, etc…). Through ETL, you will have to do similar work.
What might seem like worthless data is oftentimes salvageable. In fact, I have been involved in projects where large chunks of data were simply discarded because the business users felt it was unworkable. After some tinkering, it turned out to be far from the truth. Not only was some of the data useful, but we were able to take our ‘dirty data proof’ back to the application developers to get the application fixed (fixing DQ issues at the source is always preferable).
SQL Server 2005 Integration Services (SSIS)
SSIS offers a great set of tasks for cleansing data. If you add the OLEDB Command (for SQL code) and Script Component (for VB code) to the list below, then there really are no limits to what can be achieved. My only criticism is that using expressions in SSIS is quite difficult at first and takes some getting used to (see Integration Services Expression Language). The language and syntax is native to SSIS. On the plus side, the expression builder dialog box is helpful. Expressions are used throughout SSIS, but especially in the Derived Column task.
Typically, data cleansing will involve one or more of the following SSIS tasks:
- Lookup Transform – performs a lookup using a join to another table, adding columns from the lookup to the data flow
- Character Map – for string data, this transform applies a variety of string functions to the input column (changing case, for example)
- Copy/Map – create a copy of a column
- Data Conversion – convert data types from one to another
- Derived Column – use this task to create or replace existing data by concatenation, exploding, applying calculations, and doing comparisons
- Fuzzy Lookup – a component you can use for inexact matching
- Fuzzy Grouping – uses algorithms to match or group disparate data to a configurable degree of accuracy
In future posts, I will demonstrate how many of these tasks work (likely through a SQL Server 2008 lens).
Hand Coding with Visual FoxPro (VFP9)
I’ve done a ton of VFP-related data cleansing over the past decade. Of course, VFP doesn’t have any fancy squares that turn green when a cleansing operation succeeds (or red when it fails!), but the language itself will give you everything you need to start cleansing data with little effort. I started trying to count the number of different commands you could use for cleaning in VFP. I stopped when I hit 50. There are likely more than a 100. Commands like SUBSTRING, ALLTRIM, CHRTRAN, LEFT, RIGHT, OCCURS, AT, LOWER, PROPER, SOUNDEX, LIKE, TRANSFORM, etc. Consult with the Visual FoxPro documentation or resources such as The Visual FoxPro Wiki for more details regarding VFP’s language features. I know this is a bit of a cop-out on my part but hey, I could spend days trying to summarize the capabilities.
What I can do is share some data cleansing / VFP implementation experience with you. The Logical Data Map (LDM) is a mapping document (sometimes done in Excel or relational database if you prefer) that tells you all of the source-to-target mappings for the integration. In VFP, you can create a VFP table (or tables) and embed cleansing code snippets into memo fields. This code can execute as part of a metadata-driven cleansing engine. I demonstrate how this can work in my articles for Advisor media. You can see some examples in my FoxForward material as well (see: 2007 Fox Forward Presentation Logical Data Map Database).
From here
Both SSIS and VFP have great data cleansing capabilities; in both systems, data cleansing is easy. In VFP, you can expect a mature language and blazing speeds (FoxPro is excellent when working with string data). SSIS makes it very easy to derive columns and change data types. SSIS can also make use of multiple processors and has some good built-in logging functionality. In the end, I prefer VFP because it is much easier to work with FoxPro’s languange (as oposed to VB and the SSIS expression language). Also, in VFP I find it much easier to manage metadata and the LDM.
In the next article, I’ll discuss the Error Event Management subsystem and how VFP and SSIS can deal with that!
#1 by Santosh at May 7th, 2008
| Quote
Hi Tod,
Your articles are very useful. If you add few examples developed in SSIS 2005 for each subsystem, it would be a great help.
Thanks and Regards
Santosh
#2 by Tod McKenna at May 7th, 2008
| Quote
Hi Santosh,
I’ve posted a reply comment for you here: http://blog.todmeansfox.com/2008/02/08/etl-subsystem-5-error-event-management/
Regards,
Tod
#3 by Chris Johnson at May 13th, 2010
| Quote
great post. i read these every week. Keep up the good work.