Archive for February, 2008

ETL Subsystem 7: Removing Duplicates

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the most appropriate version for the warehouse.

Generally, the practice involves an input record and a reference table, but could also span several databases and require access to web services and 3rd-party software. The reference table is usually a dimension (such as Product, Customer, Employee, Address, etc.) or a staging table that contains the relevant attributes to perform the lookup. If a staging table is used (a must for large reference sets), then it must be maintained as part of dimension processing. For more information about reference tables and master data, try searching for “Master Data Management” on Google. If I feel up to the challenge, I’ll post more about MDM in a future series of articles.

It is common to have duplicate data in an organization, especially if that organization has multiple systems for managing customers, products, orders, inventory, etc. Consider a small business with in-house software to track customers and a separate database Online. The same customer could be in both systems with different account numbers and a slightly different name (”Martha Jones” vs. “Martha A. Jones”). As part of data warehousing, we need to eliminate these cases as best we can. That is, of course if it is important for your data warehouse to do so.

If an organization does not need to eliminate these duplicates from their data warehouse, then they should proceed no further. However, in order to do any form of historical analysis, you need data that you can follow uninterrupted back in time and across multiple sources. This subsystem helps you do that.

Fuzzy Matching

Fuzzy matching is the process of determining if two strings are approximately equal (ideally returning some form of confidence score). Algorithms such as Levenshtein Distance and SoundEx have been developed to provide this type of functionality.

If the duplication of information is minimal, or if the volume of data is manageable, then it should be a rather straight-forward exercise to write your own fuzzy matching solution using any of the known algorithms. But you should plan on spending some time on this endeavor. Also, as fuzzy matching is not an exact science, plan on some ongoing human intervention.

One thing is certain: The more atomic your data is, the easier your job of de-duping will be. It is considerably easier to look for duplicate address information when the street number, name, unit number, city, state, and zip are broken out into atomic parts!

SSIS and VFP do not have any built-in, native ability to de-dupe data. Organizations with severe duplication problems will likely need to invest in a 3rd-party cleansing solution. Regardless, here is what you could do using SSIS and hand-coded in VFP:

SQL Server 2005 Integration Services (SSIS)

SSIS provides three very good Data Flow tasks for de-duplication:

Lookup Transformation:
Performs a lookup against a reference table. The lookup uses an equi-join on one or more columns (composite join) and is case sensitive. Use this task for exact matching to a reference set.
Fuzzy Lookup
This transformation uses fuzzy matching (pattern-based) against a reference table in a SQL Server 2005 database. You must supply the reference table, which is typically built from your existing data. You can customize the fuzzy lookup by specifying the maximum number of matches to return per input row, token delimiters, and similarity thresholds. Each match includes a similarity score (how similar the input and reference values are) and a confidence score (the likelihood that a match has been found).
Fuzzy Grouping
The Fuzzy Grouping identifies rows of data that are likely to be duplicates. Then it selects a row that best represents the group and marks that row with a similarity score of 1. The other rows in the group are given a score ranging from 0 to 1. The closer to 1, the better the match.

It is a best practice to first use a Lookup Transformation to find exact matches and then divert the unmatched data into a Fuzzy Lookup. The package will run faster (the Fuzzy Lookup is more expensive than the equi-join Lookup) and will be easier to maintain (you will clearly see what records are being diverted as non exact matches).

You should read more about how the Fuzzy Lookup and Fuzzy Groupings work in Books Online. They are both quite interesting. Also, they’re not too easy to get the hang of at first and have quite a few configuration options. Later, I will demonstrate these tasks in a future posting.

If you need a 3rd-Party tool, then integrating it into the Control Flow would be ideal. Otherwise, you will need to stage the dirty data and run the vendor’s product separately on that stage. You can then pick up the de-duped data in SSIS and continue processing. You may be able to use the Execute External Process task (runs a Win32 Executable or batch file) and put the entire operation into an SSIS Control Flow (which is the second-best solution).

Intelligent Search Technology has an integrated deduplication task for SSIS. I have not evaluated this task, but sounds pretty good. Read about it here.

Hand Coding with Visual FoxPro (VFP9)

FoxPro does not have any native or built-in solution for de-duping data. This isn’t odd, as most (if not all) programming and databases languages don’t either (I would love to know of a language that has some form of de-duping functionality). You can, however, build your own by using functions such as the following:

Levenshtein Algorithm
Check out the discussion on the Fox Wiki.
LIKE/ LIKEC
Can use wildcards to see if expression 1 is like expression 2. The LIKEC version is for double-byte characters. This function might be appropriate under certain circumstances for identifying data to de-dupe.
SOUNDEX
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. More about SoundEx.
Craig Boyd’s SoundsLike Function
Check out Craig Boyd’s improvement for the SOUNDEX function on the VFP Wiki.
MetaPhone and Double Metaphone
Here are a couple of gems translated for us by Craig. The code is posted on the FoxPro Wiki
The ‘Oliver’ Function
I, like Tyler Akins at rumkin.com, first heard of this using PHP. You can see Tyler Akins’ implementation using FoxPro code here. Pretty cool!

Note: The code (zip format) for Craig’s work can be found here at Replacement Software. You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well.

As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just make up a word?), you can handle this easily in FoxPro by splitting compound fields yourself. I wrote a couple posts last year on this very subject: “Parsing firstname / middlename / lastname” and “Parsing City/State/Zip“. You can then use the individual name and address parts to perform the search, increasing your likelihood of finding good matches. Also, it would be a good idea to convert all strings to upper or lower case (I prefer upper) as well.

From Here

Although SSIS and FoxPro do not offer built-in de-duplication algorithms for names, addresses, products, and the like, they both offer enough tools, tasks, and functions so that you can write your own. Plan on spending plenty of time on this endeavor, as de-duplication is not an exact science and will undoubtedly require some human intervention during the process.

Next post, I’ll discuss ETL Best Practice Subsystem 8: Data Conformance. Data conformity across dimensions and fact tables is one of the hallmarks of a fully integrated dimensional data warehouse.

Tags: , , , , , , , ,

8 Comments

Book Review: SQL Server Integration Services Using Visual Studio 2005

I was recently asked to take a look at and review a new book titled “SQL Server Integration Services Using Visual Studio 2005″ released in December of 2007 through PACKT Publishing. As I am always looking for good reference materials, and because I am currently in a position where I will be introducing Integration Services to folks untrained in the fine art of SSIS in the near future, I jumped at the chance.

The book was a fast read (about 4 days, a few hours a day) and contained many of the most common Control Flow and Data Flow Tasks. The author gives context for each task by providing some examples on when they should be used. The author also takes some time orienting the user to BIDS by providing definitions and several pages of menu and other UI feature descriptions.

But with that said:

  • The book was a fast read because more than half of its 296 pages were devoted to screenshots (not a scientific calculation, but about half is how it felt).
  • In many cases, the common tasks demonstrated where done in a vacuum, not allowing a beginner to get an idea of how multiple tasks can work together to provide a more complete solution. I know this is a beginner guide, but surely we could start small and build into something big?
  • The context provided for each task is in the form of an example situation which is nice, but does not give a beginner enough to go on in order to know that a particular task is right for the job they need done.
  • BIDS and VS share the same shell. Why not just say that right away and then call Visual Studio BIDS instead? All the other books and reference materials I have come across use “BIDS” and not “Visual Studio”. I suppose this is a bit of a pet peeve of mine!
  • Many definitions were inadequate or did not really define the subject. A great example is the OLEDB Command! I’m not even sure what the definition refers to. He writes: “Parameterized queries are a powerful means to extract just the information needed to satisfy a given criteria. You would use a parametric query when you try finding patient information, given the patient-ID from a patient database. This transformation requires a data flow task, and an OLEDB data source or a flat file source.” That is word-for-word. Of course, the OLEDB Command merely executes an SQL command for each row in a dataset!
  • The UI orientation was awkward at best. I don’t feel that the author did a good enough job of describing how the menus and toolbars change depending on the user’s view and context.

There are problems on multiple fronts: (a) grammar, (b) style, (c) consistency, (d) completeness, (e) conciseness, and most importantly (f) technical accuracy. See my review on Amazon for some details.

For another example, on page 124, the author writes about preparing a flat file for the Bulk Insert Task:

Normally, this file should be available, as this is the starting point. Since this is just a demo, we will be using a file with 10 rows, of which the 1st row is a column header, a ludicrously small file for this heavy weight tool. This file can be created with a text editor such as notepad, but usually, it is resident in legacy data stores.

What if the file is not available? Is it really ludicrous to import 10 rows into SQL Server using SSIS? Couldn’t modern applications produce delimited files?

Unfortunately for the reader, most paragraphs read this way. I know and understand what the author is trying to say, but for a beginner (which is who this book is designed for) a paragraph structured in this way will undoubtedly stand in the way of his or her learning. I found myself having to read paragraphs like this two or three times to understand them.

What I found odd was an entire chapter dedicated to the Active X Script Task. Even the author admits that the task is only included for backward compatibility. According to BOL:

This feature will be removed in the next version of Microsoft SQL Server. Do not use this feature in new development work, and modify applications that currently use this feature as soon as possible.

So I really, really have to wonder why on earth this would be given any attention at all in a beginner’s book!

Also, what about more information regarding precedent constraints, variables, logging, error handling, parent packages, configuration files, connection managers, data sources, data source views, the expression builder, and the like? I don’t consider any of these items beyond the scope of a beginner’s book, and in fact, if a beginner is not armed with this basic information, their packages are likely not to be very maintainable, scalable, or functional!

My Recommendation?

Although I found this book lacking in many areas, others who have reviewed it think otherwise. Please take a moment and read the reviews at Amazon for more details (mine included).

As data integration projects are very likely to fail or run way over budget and time, it is critical that beginners have a proper understanding of the tool they are about to use. If they are using SSIS as part of a data warehousing project to deliver Business Intelligence (as opposed to writing maintenance plans and MSMQ applications for example), then it is also critical that the beginner is armed with the proper theories and best practices to best position him or her for data integration success. This book does not provide this understanding and reads more like a help file or cookbook.

Closing thoughts

It was difficult for me to write this review. As an author, blogger, and musician though I feel that good constructive feedback/criticism is essential for growth and continuous improvement (sort of like how an ETL developer should approach his or her data integration project each day!). I’ve been the recipient of plenty of negative reviews, and with each, I’ve managed to get better. Lastly, to the great credit of Packt Publishing, whom I contacted before writing this entry here and on Amazon, they told me to go ahead and give my honest assessment.

Other reviews of this book include:

Tags: , , , ,

2 Comments

Some Time Off with Sara

On a personal note: My wife and I had a baby girl this past weekend. Her name is Sara, and she weighed in at 7 pounds, 11 ounces. Now that we’re finally in our new house here in Belgium, it feels good to have the family complete to enjoy it! I am fortunate enough to have 10 paid days off for paternity leave — and I’m taking advantage. I’m reading a few books, catching up on some unpacking, and of course, spending lots of time with Sara.

Sara about a two hours old


I’ll be back on track in a few days with my postings. Cheers!

Tags:

6 Comments

Cooling Down

It is (or should be) common knowledge that you should never send an email, write a blog or forum post, or make a phone call when you’re totally ticked off about something! You are likely to say something you don’t mean or perhaps you’ll be a little too honest.

First cool down, and then respond. Easy enough, but what if you can’t wait to cool down using traditional methods (you know like, take a long hot bath)?

The solution: Simply write your name a few times on a piece of paper using your non-dominant hand. Apparently, it will force the logical side of your brain to start working, giving your emotional side a few seconds to forget why it is so upset (or sad, or excited, etc.). For all the neurosurgeons out there who might want to debate brain lateralization, I’m not the guy for you! But this technique has worked many times for me (and it recently got my sister-in-law out of a funk).

Over the past several days, I’ve also been looking into other ways to train my brain to either help in logical tasks, management tasks, programming, motivation, etc. I stumbled upon a blog entry (from Gary’s Historical Art) that spoke of the book “Drawing on the right side of the brain“. I remember this book from my childhood and was thrilled to see it has a new addition. It contains some additional information on (a) the latest developments in brain research, and (b) information on using drawing skills for problem solving. I plan to get a copy soon.

Tags: ,

No Comments

FoxPro ActiveX (FPOLE.OCX) Security Bug Fix

In case you haven’t heard, Microsoft released several security updates yesterday — 11, to be exact. One of these patches updates a known issue with VFP regarding FPOLE.OCX. This is what was written in the Washington Post:

MS08-010 fixes a publicly disclosed ActiveX bug that affects Visual FoxPro users. Although hackers have already posted code showing how to exploit this vulnerability, the buggy ActiveX control is not included in Internet Explorer 7’s default list of controls, so the flaw should not affect most users.

You can read more information about this patch (and details about the vulnerability) here: ISS:

The Microsoft Visual FoxPro ActiveX control is vulnerable to a stack-based buffer overflow, caused by improper bounds checking by the FoxDoCmd function. By persuading a victim to visit a malicious Web page using Internet Explorer, a remote attacker could overflow a buffer and execute arbitrary code on the system with the privileges of the victim.

More info: http://www.microsoft.com/technet/security/bulletin/ms08-010.mspx

Of course, MS recommends to get patched right away…

Tags: , ,

No Comments