ETL Subsystem 7: Removing Duplicates
This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the most appropriate version for the warehouse.
Generally, the practice involves an input record and a reference table, but could also span several databases and require access to web services and 3rd-party software. The reference table is usually a dimension (such as Product, Customer, Employee, Address, etc.) or a staging table that contains the relevant attributes to perform the lookup. If a staging table is used (a must for large reference sets), then it must be maintained as part of dimension processing. For more information about reference tables and master data, try searching for “Master Data Management” on Google. If I feel up to the challenge, I’ll post more about MDM in a future series of articles.
It is common to have duplicate data in an organization, especially if that organization has multiple systems for managing customers, products, orders, inventory, etc. Consider a small business with in-house software to track customers and a separate database Online. The same customer could be in both systems with different account numbers and a slightly different name (”Martha Jones” vs. “Martha A. Jones”). As part of data warehousing, we need to eliminate these cases as best we can. That is, of course if it is important for your data warehouse to do so.
If an organization does not need to eliminate these duplicates from their data warehouse, then they should proceed no further. However, in order to do any form of historical analysis, you need data that you can follow uninterrupted back in time and across multiple sources. This subsystem helps you do that.
Fuzzy Matching
Fuzzy matching is the process of determining if two strings are approximately equal (ideally returning some form of confidence score). Algorithms such as Levenshtein Distance and SoundEx have been developed to provide this type of functionality.
If the duplication of information is minimal, or if the volume of data is manageable, then it should be a rather straight-forward exercise to write your own fuzzy matching solution using any of the known algorithms. But you should plan on spending some time on this endeavor. Also, as fuzzy matching is not an exact science, plan on some ongoing human intervention.
One thing is certain: The more atomic your data is, the easier your job of de-duping will be. It is considerably easier to look for duplicate address information when the street number, name, unit number, city, state, and zip are broken out into atomic parts!
SSIS and VFP do not have any built-in, native ability to de-dupe data. Organizations with severe duplication problems will likely need to invest in a 3rd-party cleansing solution. Regardless, here is what you could do using SSIS and hand-coded in VFP:
SQL Server 2005 Integration Services (SSIS)
SSIS provides three very good Data Flow tasks for de-duplication:
- Lookup Transformation:
- Performs a lookup against a reference table. The lookup uses an equi-join on one or more columns (composite join) and is case sensitive. Use this task for exact matching to a reference set.
- Fuzzy Lookup
- This transformation uses fuzzy matching (pattern-based) against a reference table in a SQL Server 2005 database. You must supply the reference table, which is typically built from your existing data. You can customize the fuzzy lookup by specifying the maximum number of matches to return per input row, token delimiters, and similarity thresholds. Each match includes a similarity score (how similar the input and reference values are) and a confidence score (the likelihood that a match has been found).
- Fuzzy Grouping
- The Fuzzy Grouping identifies rows of data that are likely to be duplicates. Then it selects a row that best represents the group and marks that row with a similarity score of 1. The other rows in the group are given a score ranging from 0 to 1. The closer to 1, the better the match.
It is a best practice to first use a Lookup Transformation to find exact matches and then divert the unmatched data into a Fuzzy Lookup. The package will run faster (the Fuzzy Lookup is more expensive than the equi-join Lookup) and will be easier to maintain (you will clearly see what records are being diverted as non exact matches).
You should read more about how the Fuzzy Lookup and Fuzzy Groupings work in Books Online. They are both quite interesting. Also, they’re not too easy to get the hang of at first and have quite a few configuration options. Later, I will demonstrate these tasks in a future posting.
If you need a 3rd-Party tool, then integrating it into the Control Flow would be ideal. Otherwise, you will need to stage the dirty data and run the vendor’s product separately on that stage. You can then pick up the de-duped data in SSIS and continue processing. You may be able to use the Execute External Process task (runs a Win32 Executable or batch file) and put the entire operation into an SSIS Control Flow (which is the second-best solution).
Intelligent Search Technology has an integrated deduplication task for SSIS. I have not evaluated this task, but sounds pretty good. Read about it here.
Hand Coding with Visual FoxPro (VFP9)
FoxPro does not have any native or built-in solution for de-duping data. This isn’t odd, as most (if not all) programming and databases languages don’t either (I would love to know of a language that has some form of de-duping functionality). You can, however, build your own by using functions such as the following:
- Levenshtein Algorithm
- Check out the discussion on the Fox Wiki.
- LIKE/ LIKEC
- Can use wildcards to see if expression 1 is like expression 2. The LIKEC version is for double-byte characters. This function might be appropriate under certain circumstances for identifying data to de-dupe.
- SOUNDEX
- Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. More about SoundEx.
- Craig Boyd’s SoundsLike Function
- Check out Craig Boyd’s improvement for the SOUNDEX function on the VFP Wiki.
- MetaPhone and Double Metaphone
- Here are a couple of gems translated for us by Craig. The code is posted on the FoxPro Wiki
- The ‘Oliver’ Function
- I, like Tyler Akins at rumkin.com, first heard of this using PHP. You can see Tyler Akins’ implementation using FoxPro code here. Pretty cool!
Note: The code (zip format) for Craig’s work can be found here at Replacement Software. You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well.
As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just make up a word?), you can handle this easily in FoxPro by splitting compound fields yourself. I wrote a couple posts last year on this very subject: “Parsing firstname / middlename / lastname” and “Parsing City/State/Zip“. You can then use the individual name and address parts to perform the search, increasing your likelihood of finding good matches. Also, it would be a good idea to convert all strings to upper or lower case (I prefer upper) as well.
From Here
Although SSIS and FoxPro do not offer built-in de-duplication algorithms for names, addresses, products, and the like, they both offer enough tools, tasks, and functions so that you can write your own. Plan on spending plenty of time on this endeavor, as de-duplication is not an exact science and will undoubtedly require some human intervention during the process.
Next post, I’ll discuss ETL Best Practice Subsystem 8: Data Conformance. Data conformity across dimensions and fact tables is one of the hallmarks of a fully integrated dimensional data warehouse.
I'm a Quant Technical Specialist (Data Warehousing and Business Intelligence), with expertise in business analysis, data modeling, and data integration. I have extensive experience developing vertical and integrated desktop, Internet, and BI applications spanning municipal, clinical, and financial industries.

February 29th, 2008 at 11:46 am
I found your site on technorati and read a few of your other posts. Keep up the good work. I just added your RSS feed to my Google News Reader. Looking forward to reading more from you.
Jason Rakowski
February 29th, 2008 at 2:46 pm
[…] http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well. As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just … […]
February 29th, 2008 at 3:27 pm
Years ago I worked for Social Services. Our biggest problem was reports of people that needed assistance (ok; not our biggest problem, but it was mine!) and spelling of names was questionable. Even clients (that’s code for welfare folks) sometimes didn’t know how to spell their name (either due to drugs/alcohol abuse or brain damage…) Anyway;
I attended a coference in New York — it’s outcome was a long document; which I filtered out to be a soundex replacement like function specifically for matching names.
For instance, my last name is Assing — but it’s pronounced Ausing the nyiis function calls that a match. That is just one example.
I used to sell the function’s source code; but am planning on making it available for free to anyone that wants it under GPL licensing.
-josh
February 29th, 2008 at 3:28 pm
[…] http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the … […]
March 1st, 2008 at 2:36 am
Hi Josh,
Please let me know when ready!
Deduplication of names is particularly tricky. At least with addresses, you have a finite number of known ‘Good ‘ values. Lists can be obtained from the USPS for example. Names, on the other hand are too variable.
March 9th, 2008 at 3:15 am
Tod, when I worked in the direct mail industry, we had a program from PostalSoft/FirstLogic/Business Objects that did Match/Consolidate, Merge/Purge, or whatever other term for deduping that you want to come up with.
It had some pretty elaborate settings, and small tweaks could make great differences in the outcome. After using that, I’m not terribly surprised that the functionality doesn’t come natively. 
March 11th, 2008 at 6:50 am
Nice to hear you Garrett. The de-duping process is a rather complex one in general with every solution having some significant downside. Soundex only works for names in the US for example (well, that is how I understand it anyway!).
I would guess that de-duping is one of the areas that takes ETL developers the longest to get right — or close to right.
March 18th, 2008 at 9:03 am
[…] Removing Duplicates […]