Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for January, 2008


Published January 27th, 2008

Finally… Sedna

Some exciting news this weekend for the VFP community: Sedna has been released. Craig Berntson, Kevin Cully, and others have already blogged about it. Sedna contains some very exciting and interesting components:

  • VistaDialogs4COM - a collection of COM-visible classes that wrap the functionality provided by the Microsoft VistaBridgeLibrary which will provide access to the Windows Vista TaskDialog and Common Dialogs
  • Upsizing Wizard - excellent enhancements include bulk insert support, performance improvements, UI improvements, and support for table names with spaces
  • Data Explorer - lots of improvements including better drag and drop support and for showplans
  • NET4COM - a collection of COM classes that wrap a subset of the .NET Framework 2.0
  • MY for VFP - similar to NET4COM, MY is a native implementation which makes available many new functions for VFP developers
  • DDEX for VFP - The Data Designer EXtensions allow Visual Studio to work better with Visual FoxPro data sources

Now that the release is out, I plan to start using and experimenting with the various components (something I had decided to do ‘later’ rather than ’sooner’). It will be fun to see how others are making use of these as well!

Published January 17th, 2008

SSIS Expressions (by Scott Whigham)

In yesterday’s post, I made a comment that SSIS expressions were difficult to get used to. Earlier today, I ran across this video on YouTube. If you’re new to SSIS, or you simply have a problem with expressions, check it out. Here is a blurb:

Expressions are new in SSIS (meaning that they were not in DTS) but they allow you to create extremely flexible packages. Expressions can be used to assign values to variables, help you determine whether to execute a task, and to assign properties. If you ever found yourself writing ActiveX code (VBScript) in DTS to determine which task to execute based on a variable’s value, then you probably can forego the scripting all together in favor of expressions! They are very powerful in SSIS and you’ll want to have a solid understanding of how to write them and how to use them.

I love the comedy around online help. I totally feel for him…

Hope you found this useful!

Published January 16th, 2008

ETL Subsystem 4: Data Cleansing

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data Cleansing is a Data Quality process that is designed to catch and resolve data quality violations. Proper data cleansing will add tremendous value to the data, giving your business users plenty of reasons to trust it. Sometimes referred to as ’scrubbing’, this subsystem is a must for all Data Warehousing projects.

Most quality issues will be discovered during data Profiling, which I discussed here. Other issues will be identified by business users who have had to deal with problem data in the past. Some cleansing operations will be performed because of Master Data initiatives (for example, converting customer codes to universal codes for comparisons).

Cleansing data can consist of tasks like the following:

  1. Parsing city, state, and zip from an address string
  2. Changing the case of a person’s last name to UPPER
  3. Finding the delivery point from a street address and zip code
  4. Stripping out unwanted characters from text
  5. Swapping NULLs for some other value suitable for the DWH
  6. Deriving a value from one or more fields to create a new field
  7. Concatenating two or more fields into one (for example, turning day, month, year fields into a date)

Obviously, the above list is not exhaustive. There are myriad ways to improve the quality of data during integration. The important thing to keep in mind is that this subsystem, more than any other, actually seeks to change the data. As quality is the driving factor, these changes should always be for the better and serve some business purpose. Any change to the data should add value and confidence.

Garbage in, Garbage out

Think of data cleansing as the process of cleaning up all that garbage that got in in the first place. OLTP systems (especially legacy systems that have been kicking around for a few decades) are notorious for doing some strange things with the data. I have encountered cases where fields were re-purposed (”hey we’re not using store_id2 anymore, so now we have some place to put the manager’s phone number!”), algorithms re-written (but historical data not recalculated or stamped with a version), and nulls represented as anything but NULL (99, -99, -1, “N/A”, etc…).

Not only that, but it’s likely that you have run into cases where users were able to free-form type into data entry fields. For example, a list of valid products might include “Widget A” and “Widget B”, but because a user can type what they want, you get “WedgetA” and “WDGT-B”. Of course, from an application architecture point of view, this is a terrible design problem. Anyway, as a data integrator, expect this and worse.

One man’s junk is another man’s treasure

But there is light. Many of these quality problems can be corrected outright through integration, while others can be addressed directly in the source OLTP system. Many home-grown or custom applications compensate for bad data in some way (lookup/mapping tables, loose database constraints, routines that go back and ‘fix’ the data, etc…). Through ETL, you will have to do similar work.

What might seem like worthless data is oftentimes salvageable. In fact, I have been involved in projects where large chunks of data were simply discarded because the business users felt it was unworkable. After some tinkering, it turned out to be far from the truth. Not only was some of the data useful, but we were able to take our ‘dirty data proof’ back to the application developers to get the application fixed (fixing DQ issues at the source is always preferable).

SQL Server 2005 Integration Services (SSIS)

SSIS offers a great set of tasks for cleansing data. If you add the OLEDB Command (for SQL code) and Script Component (for VB code) to the list below, then there really are no limits to what can be achieved. My only criticism is that using expressions in SSIS is quite difficult at first and takes some getting used to (see Integration Services Expression Language). The language and syntax is native to SSIS. On the plus side, the expression builder dialog box is helpful. Expressions are used throughout SSIS, but especially in the Derived Column task.

Typically, data cleansing will involve one or more of the following SSIS tasks:

  • Lookup Transform - performs a lookup using a join to another table, adding columns from the lookup to the data flow
  • Character Map - for string data, this transform applies a variety of string functions to the input column (changing case, for example)
  • Copy/Map - create a copy of a column
  • Data Conversion - convert data types from one to another
  • Derived Column - use this task to create or replace existing data by concatenation, exploding, applying calculations, and doing comparisons
  • Fuzzy Lookup - a component you can use for inexact matching
  • Fuzzy Grouping - uses algorithms to match or group disparate data to a configurable degree of accuracy

In future posts, I will demonstrate how many of these tasks work (likely through a SQL Server 2008 lens).

Hand Coding with Visual FoxPro (VFP9)

I’ve done a ton of VFP-related data cleansing over the past decade. Of course, VFP doesn’t have any fancy squares that turn green when a cleansing operation succeeds (or red when it fails!), but the language itself will give you everything you need to start cleansing data with little effort. I started trying to count the number of different commands you could use for cleaning in VFP. I stopped when I hit 50. There are likely more than a 100. Commands like SUBSTRING, ALLTRIM, CHRTRAN, LEFT, RIGHT, OCCURS, AT, LOWER, PROPER, SOUNDEX, LIKE, TRANSFORM, etc. Consult with the Visual FoxPro documentation or resources such as The Visual FoxPro Wiki for more details regarding VFP’s language features. I know this is a bit of a cop-out on my part but hey, I could spend days trying to summarize the capabilities.

What I can do is share some data cleansing / VFP implementation experience with you. The Logical Data Map (LDM) is a mapping document (sometimes done in Excel or relational database if you prefer) that tells you all of the source-to-target mappings for the integration. In VFP, you can create a VFP table (or tables) and embed cleansing code snippets into memo fields. This code can execute as part of a metadata-driven cleansing engine. I demonstrate how this can work in my articles for Advisor media. You can see some examples in my FoxForward material as well (see: 2007 Fox Forward Presentation Logical Data Map Database).

From here

Both SSIS and VFP have great data cleansing capabilities; in both systems, data cleansing is easy. In VFP, you can expect a mature language and blazing speeds (FoxPro is excellent when working with string data). SSIS makes it very easy to derive columns and change data types. SSIS can also make use of multiple processors and has some good built-in logging functionality. In the end, I prefer VFP because it is much easier to work with FoxPro’s languange (as oposed to VB and the SSIS expression language). Also, in VFP I find it much easier to manage metadata and the LDM.

In the next article, I’ll discuss the Error Event Management subsystem and how VFP and SSIS can deal with that!