Posts Tagged Conformity

ETL Subsystem 7: Removing Duplicates

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the most appropriate version for the warehouse.

Generally, the practice involves an input record and a reference table, but could also span several databases and require access to web services and 3rd-party software. The reference table is usually a dimension (such as Product, Customer, Employee, Address, etc.) or a staging table that contains the relevant attributes to perform the lookup. If a staging table is used (a must for large reference sets), then it must be maintained as part of dimension processing. For more information about reference tables and master data, try searching for “Master Data Management” on Google. If I feel up to the challenge, I’ll post more about MDM in a future series of articles.

It is common to have duplicate data in an organization, especially if that organization has multiple systems for managing customers, products, orders, inventory, etc. Consider a small business with in-house software to track customers and a separate database Online. The same customer could be in both systems with different account numbers and a slightly different name (“Martha Jones” vs. “Martha A. Jones”). As part of data warehousing, we need to eliminate these cases as best we can. That is, of course if it is important for your data warehouse to do so.

If an organization does not need to eliminate these duplicates from their data warehouse, then they should proceed no further. However, in order to do any form of historical analysis, you need data that you can follow uninterrupted back in time and across multiple sources. This subsystem helps you do that.

Fuzzy Matching

Fuzzy matching is the process of determining if two strings are approximately equal (ideally returning some form of confidence score). Algorithms such as Levenshtein Distance and SoundEx have been developed to provide this type of functionality.

If the duplication of information is minimal, or if the volume of data is manageable, then it should be a rather straight-forward exercise to write your own fuzzy matching solution using any of the known algorithms. But you should plan on spending some time on this endeavor. Also, as fuzzy matching is not an exact science, plan on some ongoing human intervention.

One thing is certain: The more atomic your data is, the easier your job of de-duping will be. It is considerably easier to look for duplicate address information when the street number, name, unit number, city, state, and zip are broken out into atomic parts!

SSIS and VFP do not have any built-in, native ability to de-dupe data. Organizations with severe duplication problems will likely need to invest in a 3rd-party cleansing solution. Regardless, here is what you could do using SSIS and hand-coded in VFP:

SQL Server 2005 Integration Services (SSIS)

SSIS provides three very good Data Flow tasks for de-duplication:

Lookup Transformation:
Performs a lookup against a reference table. The lookup uses an equi-join on one or more columns (composite join) and is case sensitive. Use this task for exact matching to a reference set.
Fuzzy Lookup
This transformation uses fuzzy matching (pattern-based) against a reference table in a SQL Server 2005 database. You must supply the reference table, which is typically built from your existing data. You can customize the fuzzy lookup by specifying the maximum number of matches to return per input row, token delimiters, and similarity thresholds. Each match includes a similarity score (how similar the input and reference values are) and a confidence score (the likelihood that a match has been found).
Fuzzy Grouping
The Fuzzy Grouping identifies rows of data that are likely to be duplicates. Then it selects a row that best represents the group and marks that row with a similarity score of 1. The other rows in the group are given a score ranging from 0 to 1. The closer to 1, the better the match.

It is a best practice to first use a Lookup Transformation to find exact matches and then divert the unmatched data into a Fuzzy Lookup. The package will run faster (the Fuzzy Lookup is more expensive than the equi-join Lookup) and will be easier to maintain (you will clearly see what records are being diverted as non exact matches).

You should read more about how the Fuzzy Lookup and Fuzzy Groupings work in Books Online. They are both quite interesting. Also, they’re not too easy to get the hang of at first and have quite a few configuration options. Later, I will demonstrate these tasks in a future posting.

If you need a 3rd-Party tool, then integrating it into the Control Flow would be ideal. Otherwise, you will need to stage the dirty data and run the vendor’s product separately on that stage. You can then pick up the de-duped data in SSIS and continue processing. You may be able to use the Execute External Process task (runs a Win32 Executable or batch file) and put the entire operation into an SSIS Control Flow (which is the second-best solution).

Intelligent Search Technology has an integrated deduplication task for SSIS. I have not evaluated this task, but sounds pretty good. Read about it here.

Hand Coding with Visual FoxPro (VFP9)

FoxPro does not have any native or built-in solution for de-duping data. This isn’t odd, as most (if not all) programming and databases languages don’t either (I would love to know of a language that has some form of de-duping functionality). You can, however, build your own by using functions such as the following:

Levenshtein Algorithm
Check out the discussion on the Fox Wiki.
LIKE/ LIKEC
Can use wildcards to see if expression 1 is like expression 2. The LIKEC version is for double-byte characters. This function might be appropriate under certain circumstances for identifying data to de-dupe.
SOUNDEX
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. More about SoundEx.
Craig Boyd’s SoundsLike Function
Check out Craig Boyd’s improvement for the SOUNDEX function on the VFP Wiki.
MetaPhone and Double Metaphone
Here are a couple of gems translated for us by Craig. The code is posted on the FoxPro Wiki
The ‘Oliver’ Function
I, like Tyler Akins at rumkin.com, first heard of this using PHP. You can see Tyler Akins’ implementation using FoxPro code here. Pretty cool!

Note: The code (zip format) for Craig’s work can be found here at Replacement Software. You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well.

As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just make up a word?), you can handle this easily in FoxPro by splitting compound fields yourself. I wrote a couple posts last year on this very subject: “Parsing firstname / middlename / lastname” and “Parsing City/State/Zip“. You can then use the individual name and address parts to perform the search, increasing your likelihood of finding good matches. Also, it would be a good idea to convert all strings to upper or lower case (I prefer upper) as well.

From Here

Although SSIS and FoxPro do not offer built-in de-duplication algorithms for names, addresses, products, and the like, they both offer enough tools, tasks, and functions so that you can write your own. Plan on spending plenty of time on this endeavor, as de-duplication is not an exact science and will undoubtedly require some human intervention during the process.

Next post, I’ll discuss ETL Best Practice Subsystem 8: Data Conformance. Data conformity across dimensions and fact tables is one of the hallmarks of a fully integrated dimensional data warehouse.

Tags: , , , , , , , ,

8 Comments

Compare a String to a List of Formats

Conforming data and building master data are two very important aspects of data warehousing, migrating legacy systems, and federating reports. A common task is to compare a string to a known format and return the matching value. For example, you may need to take parcel identifier information from multiple towns and do some smart parsing to obtain atomic map, blocks, and lots. Town A might store their Parcel ID in the format MMM-BBB/LLL, while another in the format MM/LL.BBB. You could store all known formats in a table, do some magic to convert all parts to a format code, and finally find the match in the formats table.

In VFP, there are many ways to compare the contents of a string to a particular format. TRANSFORM() and LIKE() come to mind. Using TRANSFORM, you can do something like the following:

lcTestVal = "SOMETEST_VAL"
? TRANSFORM(lcTestVal,"XXXXXXXX_XXX") == lcTestVal     && returns .T.
? TRANSFORM(lcTestVal,"XXXXXXX_XX") == lcTestVal    && returns .F.

LIKEwise, you can do some comparisons with LIKE:

lcTestVal = "SOMETEST_VAL"
? LIKE("????????_???",lcTestVal)    && returns .T.
? LIKE("???????_??",lcTestVal)    && returns .F.

But in some cases, it might be necessary to compare a single string against multiple formats, returning the format that the string matches (a common task when conforming data for data warehousing). There are a lot of ways to handle this, but simply using TRANSFORM or LIKE won’t do it alone.

Take the following example, where we have three different account number formats that come from three disparate source systems. You could manage this easily in FoxPro:

lcFormat1 = "NCCCCCN"
lcFormat2 = "CCCNNNNNN"
lcFormat3 = "NNNNNNNCCC"
 
DIMENSION aTestValues [10]
aTestValues[1] = "3UUFGP7"
aTestValues[2] = "MCK000989"
aTestValues[3] = "0090892LLG"
aTestValues[4] = "9ABCD2"
aTestValues[5] = "JKL230945"
aTestValues[6] = "JKLM00989"
aTestValues[7] = "JJ2000989"
aTestValues[8] = "3U99G7"
aTestValues[9] = "3UUFGP"
aTestValues[10] = "7XYZA1"
 
FOR x = 1 TO 10
 
    lcTestValMasked = CHRTRAN(CHRTRAN(UPPER(aTestValues[x]),"ABCDEFGHIJKLMNOPQRSTUVWXYZ","CCCCCCCCCCCCCCCCCCCCCCCCCC"),"1234567890","NNNNNNNNNN")
 
    ? ICASE( lcTestValMasked == lcFormat1 , "String " + aTestValues[x] + " matches format 1",;
            lcTestValMasked == lcFormat2 , "String " + aTestValues[x] + " matches format 2",;
            lcTestValMasked == lcFormat3 , "String " + aTestValues[x] + " matches format 3",;
            "String " + aTestValues[x] + " does not match any format")
 
NEXT

In the above example, there are three acceptable format codes that we’re using to compare against 10 values. Using ICASE, we can easily determine which codes match the format string. The inner CHRTRAN converts all character values to the letter “C”, while the outer CHRTRAN converts numbers to “N”. If we did the numbers first, our “N”s would become “C”s! Of course, you can use “X” and 9, “A” and “#”, or whatever letters you want to use as your mask.

Ideally, the format codes would belong to a table which would contain additional information about the code, such as where it came from and its status (active or inactive). Then, a loop through the table, or a seek on the format column would point you to the corresponding record.

I would love to hear some additional solutions on this. If you have any, please comment!

Tags: , , , , , ,

No Comments

ETL Subsystem 4: Data Cleansing

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data Cleansing is a Data Quality process that is designed to catch and resolve data quality violations. Proper data cleansing will add tremendous value to the data, giving your business users plenty of reasons to trust it. Sometimes referred to as ‘scrubbing’, this subsystem is a must for all Data Warehousing projects.

Most quality issues will be discovered during data Profiling, which I discussed here. Other issues will be identified by business users who have had to deal with problem data in the past. Some cleansing operations will be performed because of Master Data initiatives (for example, converting customer codes to universal codes for comparisons).

Cleansing data can consist of tasks like the following:

  1. Parsing city, state, and zip from an address string
  2. Changing the case of a person’s last name to UPPER
  3. Finding the delivery point from a street address and zip code
  4. Stripping out unwanted characters from text
  5. Swapping NULLs for some other value suitable for the DWH
  6. Deriving a value from one or more fields to create a new field
  7. Concatenating two or more fields into one (for example, turning day, month, year fields into a date)

Obviously, the above list is not exhaustive. There are myriad ways to improve the quality of data during integration. The important thing to keep in mind is that this subsystem, more than any other, actually seeks to change the data. As quality is the driving factor, these changes should always be for the better and serve some business purpose. Any change to the data should add value and confidence.

Garbage in, Garbage out

Think of data cleansing as the process of cleaning up all that garbage that got in in the first place. OLTP systems (especially legacy systems that have been kicking around for a few decades) are notorious for doing some strange things with the data. I have encountered cases where fields were re-purposed (“hey we’re not using store_id2 anymore, so now we have some place to put the manager’s phone number!”), algorithms re-written (but historical data not recalculated or stamped with a version), and nulls represented as anything but NULL (99, -99, -1, “N/A”, etc…).

Not only that, but it’s likely that you have run into cases where users were able to free-form type into data entry fields. For example, a list of valid products might include “Widget A” and “Widget B”, but because a user can type what they want, you get “WedgetA” and “WDGT-B”. Of course, from an application architecture point of view, this is a terrible design problem. Anyway, as a data integrator, expect this and worse.

One man’s junk is another man’s treasure

But there is light. Many of these quality problems can be corrected outright through integration, while others can be addressed directly in the source OLTP system. Many home-grown or custom applications compensate for bad data in some way (lookup/mapping tables, loose database constraints, routines that go back and ‘fix’ the data, etc…). Through ETL, you will have to do similar work.

What might seem like worthless data is oftentimes salvageable. In fact, I have been involved in projects where large chunks of data were simply discarded because the business users felt it was unworkable. After some tinkering, it turned out to be far from the truth. Not only was some of the data useful, but we were able to take our ‘dirty data proof’ back to the application developers to get the application fixed (fixing DQ issues at the source is always preferable).

SQL Server 2005 Integration Services (SSIS)

SSIS offers a great set of tasks for cleansing data. If you add the OLEDB Command (for SQL code) and Script Component (for VB code) to the list below, then there really are no limits to what can be achieved. My only criticism is that using expressions in SSIS is quite difficult at first and takes some getting used to (see Integration Services Expression Language). The language and syntax is native to SSIS. On the plus side, the expression builder dialog box is helpful. Expressions are used throughout SSIS, but especially in the Derived Column task.

Typically, data cleansing will involve one or more of the following SSIS tasks:

  • Lookup Transform – performs a lookup using a join to another table, adding columns from the lookup to the data flow
  • Character Map – for string data, this transform applies a variety of string functions to the input column (changing case, for example)
  • Copy/Map – create a copy of a column
  • Data Conversion – convert data types from one to another
  • Derived Column – use this task to create or replace existing data by concatenation, exploding, applying calculations, and doing comparisons
  • Fuzzy Lookup – a component you can use for inexact matching
  • Fuzzy Grouping – uses algorithms to match or group disparate data to a configurable degree of accuracy

In future posts, I will demonstrate how many of these tasks work (likely through a SQL Server 2008 lens).

Hand Coding with Visual FoxPro (VFP9)

I’ve done a ton of VFP-related data cleansing over the past decade. Of course, VFP doesn’t have any fancy squares that turn green when a cleansing operation succeeds (or red when it fails!), but the language itself will give you everything you need to start cleansing data with little effort. I started trying to count the number of different commands you could use for cleaning in VFP. I stopped when I hit 50. There are likely more than a 100. Commands like SUBSTRING, ALLTRIM, CHRTRAN, LEFT, RIGHT, OCCURS, AT, LOWER, PROPER, SOUNDEX, LIKE, TRANSFORM, etc. Consult with the Visual FoxPro documentation or resources such as The Visual FoxPro Wiki for more details regarding VFP’s language features. I know this is a bit of a cop-out on my part but hey, I could spend days trying to summarize the capabilities.

What I can do is share some data cleansing / VFP implementation experience with you. The Logical Data Map (LDM) is a mapping document (sometimes done in Excel or relational database if you prefer) that tells you all of the source-to-target mappings for the integration. In VFP, you can create a VFP table (or tables) and embed cleansing code snippets into memo fields. This code can execute as part of a metadata-driven cleansing engine. I demonstrate how this can work in my articles for Advisor media. You can see some examples in my FoxForward material as well (see: 2007 Fox Forward Presentation Logical Data Map Database).

From here

Both SSIS and VFP have great data cleansing capabilities; in both systems, data cleansing is easy. In VFP, you can expect a mature language and blazing speeds (FoxPro is excellent when working with string data). SSIS makes it very easy to derive columns and change data types. SSIS can also make use of multiple processors and has some good built-in logging functionality. In the end, I prefer VFP because it is much easier to work with FoxPro’s languange (as oposed to VB and the SSIS expression language). Also, in VFP I find it much easier to manage metadata and the LDM.

In the next article, I’ll discuss the Error Event Management subsystem and how VFP and SSIS can deal with that!

Tags: , , , , , , , ,

6 Comments

Dimensional Modeling: Loading The Slowly Changing Dimension

In my entry “Dimension Tables”, I discussed some of the different types of dimension tables that could exist in a dimensional model. In this entry, I will discuss how they are loaded.

Dimensions change over time: Data entry errors are fixed, new customers arrive, employees get married and change their last names, products are re-categorized, and new store promotions are initiated. There are several challenges associated with these updates — challenges that do not exist in operational database applications. To face these challenges, the concept of the Slowly Changing Dimension has emerged.

There are 3 types of Slowly Changing Dimensions:

Type 1 SCD
When a type 1 attribute changes, it is Ok to overwrite the attribute on the current or all matching rows
Type 2 SCD
When a Type 2 attribute changes, make a copy of the row, mark the copy as ‘expired’ or ‘old’, and place the new Type 2 attribute change into the new row
Type 3 SCD
These represent alternate realities and keep the previous and current value of an attribute in a row

There is also a Type 1.5 SCD which is used to keep very large dimensions from expanding with too much unnecessary history. Consider a case where a Customer dimension is loaded from several sources. Several Type 2 attributes may not be available at the time the row is first loaded. When this data becomes available, it is updated as if it were a Type 1 change, rather than Type 2. If the customer then changes the attribute at a future date, the attribute would be treated as a Type 2. In this scenario, additional information must be made available to the ETL program so that it knows how to treat the attribute(s).

Inferred Members (which are treated somewhat like a Type 1.5 change) exists when a fact table references a dimension row that is not yet loaded. A new row is added to the dimension with a new surrogate key along with whatever additional data is available (at a minimum, the natural key). This almost-empty dimension row would be updated appropriately when the data becomes available. Just like the Type 1.5 change, additional information must be made available to the ETL program so that it knows how to treat the attribute(s) and not create unnecessary history.

To make Type 2 Slowly Changing Dimensions work, the dimension must have some attribute that distinguishes the active or current row from historical or expired rows. This can be done in one of three ways (my preferred method is to use dates):

  1. Dates: Create a from_date and a to_date in the dimension. The active row will have a null ‘to_date’.
  2. Flag: Create a status flag to distinguish active rows from inactive rows. Only one row should be active at a time for a give natural key.
  3. Version: Either a sequence number or some other number that will indicate which row is most recent in the dimension.

To simplify the discussion, when deciding on what SCD Type to tag an attribute, you need to ask yourself (well, ask your business users!): Do I need to keep a history for this attribute? Depending on your answer, consider the following:

Type 1

When an attribute marked as a “Type 1″ attribute is changed, the new value overwrites the existing value in the dimension without keeping a history. There are usually two flavors of this Type: Overwrite all that match the key (SCD subtype 1) or only overwrite the current row (SCD subtype 2). The “current” row is determined by the date, status, or version of the row.

Type 2

Type 2 changes always keep a history by marking the current or active row (determined by the date, status, or version attribute) as inactive and inserting a brand new row into the dimension with matching values from the previous row, plus any Type 2 changes.

For Hand-coders

If you know me, than you know I like to store this sort of information as metadata. For example, in my integration metadata (my Logical Data Map), I have a table that stores all the attributes for each dimension. I have a column called SCD_Type and SCD_SubType. Here is a view of this metadata taken from one of my hand-coded projects:

 Dimensional Modeling: Loading The Slowly Changing Dimension

In pseudo-code, the processing would look like the following:

For each row
  If any attribute marked as 'Type 2' changes Then
    update status of active/current row
    insert into dimension a new row
  End If
  For each attributes marked as a 'Type 1' change Then
    If attribute is subtype 1
        update all rows that match the key
    Else
        update only the current/active row that matches the key
    End if
  End For
End For

As you can see, I handle Type 2 changes first and then process Type 1 changes. This ensures that historical data is kept in tact and that the correct active row is updated if the subtype is marked as 2. Additional logic is needed to process inferred members and Type 1.5 as mentioned earlier.

When no SCD Types Are Assigned

Not all dimensions contain SCD attributes. A Temporal/Time dimension, for example, does not change (January 15th, 2008 is always January 15th, 2008). These and other dimensions are simply appended to, such as a weather or promotion dimension (i.e. Causal Dimensions). I have found that dimensions that operate over time work this way, although this is a generalization and not a rule.

In my metadata, I usually tag these types of non-SCD dimensions as type 0. My ETL code then treats these as a pure insert (or Upserts depending on the nature of the dimension) if all attributes for the dimension are marked as Type 0.

SSIS

The Slowly Changing Dimension Wizard in SSIS makes a lot of this work easy for you. In my opinion, it is one of the better transformations available through SQL Server. For very large dimensions though, I have found that the SCD transformation does not perform as well as using conditional transforms and lookups on filtered data. I will be sure to talk more about this in a future post. I should also mention that hand-coding SCD logic is tricky; my pseudo-code posted above is only a bird’s eye view of the logic involved.

Tags: , , , , , ,

8 Comments

Conforming Temperature in FoxPro (some conversions)

Conforming data is an essential part of data integration. Conforming data takes on many shapes and may include updating data types and lengths, attribute names and conventions, data atomicity, universal conversions, etc. The idea is that, in all circumstances, the attribute means (and therefore acts) the same across all entities in the database.

For example, let’s look at temperature. Business requirements (and perhaps your region of operation) will dictate the format temperature should be stored in. When integrating data, chances are great that you’ll have different temperature formats (Celsius and Fahrenheit) coming from different sources. In order to have a conformed database, you create a rule that states that all temperatures are to be stored in X format. I like Celsius, personally, so I convert all Fahrenheit temperatures to Celsius:

nCelsius = (nFahrenheit  - 32) * (5/9)

If you disagree, then go the other way:

nFahrenheit  = (9/5) * nCelsius + 32

Like Kelvin? (You must be a scientist!)

nKelvin = nCelsius + 273.16

There are virtually hundreds of these types of conforming decisions to be made in a typical large integration project. Build yourself a toolbox of common conversions, and apply them to your data as needed.

Tags: , ,

No Comments

Web Analysis with VFP: Download, Parse, and Stage

My Web Analysis application is beginning to take shape. At the moment, I have created a little road map document (which I’ll share shortly), defined all my base classes (the entire application will be made available in the coming months), and proved the general concept (see below).

To recap, this project is about building a tool to help me make better business decisions on and using the Internet. While there are a ton of canned products on the market, I feel that I can get what I need with VFP (and get it better). In general, I want to be able to download a variety of data from my server, analyze it, and use that data to make decisions (everything from what IPs to ban to how I might redesign a certain page). See my post “Business Intelligence Through Web Analysis” for more details.

The following zip file contains all the code necessary to prove the concept. It (a) establishes a database, (b) downloads the log files from the FTP server, and (c) parses the file, storing it in a stage directory, which I can then use to create a Dimensional Model for analysis purposes. Here’s the programs:

web_analytics.zip

In the zip, you will find:

  • wa_main.prg (the setup program, check the path and run this once)
  • functions.prg (a place to store my functions, this will become a class soon)
  • download.prg (set your server information up and run, it calls load_logfile(), which lives in functions.prg)
  • ftp/ftp.prg (FTP services class written by Robert Abram many moons ago — still works great, requires wininet.dll)

Now that the concept has been proved, and I can visually analyze my log files, I will start to build an application in VFP9 around it. Stay tuned!

Tags: , , , , , ,

No Comments