Posts Tagged MDM

ETL Subsystem 8: Data Conformance

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data conformity is one of the most important and fundamental aspects of data warehousing. Data conformity means that a data element’s label (column header) and content (which includes data types, lengths, and nullability) are consistent across multiple tables and across multiple business processes.

In dimensional modeling, a single dimension shared across business processes is a conformed dimension. Shared dimensions allow for drill-across queries where a user could conduct enterprise-level analysis across several business processes (inventory, sales, procurement, etc.). Conformed dimensions, as a data quality technique, also cut down the likelihood of data inconsistencies in an organization.

Margy Ross of the Kimball Group states that “Using conformed dimensions ensures that the data warehouse is delivering consistently defined attributes for labeling, grouping, filtering and integrating data from multiple business processes.”

Dimensional conformity also ensures reusability. Reusing dimensions across multiple business processes is what makes dimensional modeling so simple and scalable. The Product dimension, for example, would be the exact same physical table used in the Orders, Inventory, and Procurement dimensional models.

Obtaining Conformity

There are some challenges associated with obtaining conformity. Some of these challenges might be political, where business users refuse to agree on common business definitions. Most of them, however, are in the planning, data modeling, and system architecture, where individuals fail to identify or recognize where conformity exists. This is typically a direct result of failing to do adequate data profiling! Not having data stewards available to consult with on specific data issues could also lead to some slow times.

The first step to conforming dimensions is to define the business processes that will be used in the dimensional model. The next step is to conduct a data profile to identify and document the entities and relationships in the source system. Finally, identify the common dimensions across business processes.

The question then becomes: How do we determine which dimensions are conformed across business processes?

The answer is the Bus Matrix. The Kimball Group does an excellent job making the case for the Bus Matrix in their books and articles. In short, it acts as a roadmap for a dimensional modeling project. Check out “The Matrix” and “The Matrix: Revisited” on the Intelligent Enterprise Website. I won’t attempt to describe the Bus Matrix here for the sake of brevity, but may share some thoughts in a future posting (which I will likely title “The Matrix: Reloaded” just to be cheeky!).

The Bus Matrix will tell you exactly what your conformed dimensions are.

Conforming Attributes

By using conformed dimensions, it is possible to conduct drill-across queries. You might, for example, want to do some analysis on retail sales and inventory. Retail Sales and Inventory business processes exist in the data warehouse as two separate dimensional models. The cool thing is that they share dimensions (ex. Product). It is rather simple then to relate the fact tables together using the Product dimension based on some product characteristics such as “size”, “weight”, and “cost”. These drill-across queries can lend a great deal of insight into an organization’s various business entities.

Filtering and searching for data is also simplified if, for example, all names are in UPPER case and stored in atomic parts (last, first, middle, title).

You should strive to get as many attributes in various dimensions to conform. Attributes such as Name, Length, Weight, Size, and Color should mean exactly the same thing across all dimensions. If not, then they must be named differently. Conformity on this level lends to the simplicity of the overall design and makes operator queries simpler to write and understand.

Take for example Customer and Shipper dimensions, both with an attribute called “state”. In the Customer dimension, state is abbreviated to 2 characters, but in the Shipper dimension it is spelled out. You would not be able to conduct a query on all shippers and customers in the same state without the need for some lookup table. This could confuse and likely frustrate the operator. If you must spell out the state’s name, re-name the column to “state_name” (and add a column called ‘state’ with the abbreviated value)!

Allowing business users to relate multiple business processes together using commonly defined metrics (“facts”) also adds tremendous value to the data warehouse database. For example, a metric such as “discount” would not be a percentage in one fact table and an actual dollar amount in another. Instead, convert one to the other during data integration, or create two metrics: “discount_percent” and “discount_amount”. To take this a step further, in order for “discount_amount” to be conformed, it must be stored in some common currency (such as EUR or USD).

Not all attributes and metrics can be conformed, but the more you can, the more flexibility you will have with your drill-across queries. If you cannot conform a particular attribute or fact, then be sure to change its label so that end-users won’t make an assumption that it means something that it’s not!

I find it intuitive to identify attributes and metrics that are conformed across multiple source systems once the conformed dimensions have been identified and a data profile conducted. Where you need to be careful is when two column labels are named the same but mean (sometimes slightly) different things. Product color, for example, might be entered as a color’s name (red, black, blue) in one system, and a hex or RGB value in another. Multiple systems might have different product identifiers, all labeled “Product ID”. “Distance” could be in kilometers in once system, and in miles in another. Fahrenheit to Celsius conversions may be necessary. All of these issues need to be worked out during the data integration process so that “distance”, “color”, and “Product ID” mean exactly the same thing across Dimensions and Facts.

A 3lb red widget from NY is a 3lb red widget from NY is a 3lb red widget from NY!

SQL Server 2005 Integration Services (SSIS)

SSIS has many transformation tasks that you can use to enforce conformity in the dimensional model. In fact, there is not a single data conformity task that I have run across that SSIS does not provide a solution for out of the box (well, with some help from the Script Task, that is). Because many transformation tasks can be used to enforce conformity, I won’t enumerate them here. The challenges of conforming data truly lie in the planning and design process.

Hand Coding with Visual FoxPro (VFP9)

As with SSIS, there is nothing specific about hand-coding regarding conformed dimensions. So much rests on planning, mapping, data profiling, de-duping, and cleansing, that the process of conforming is just a matter of implementing the conversions and mappings necessary to build the dimensions so that they adhere to conformity. Check out my post on Loading Dimensions for some further details.

For some examples on some things you might do in FoxPro code to achieve conformity, take a look at my previous posts “Compare a String to a List of Formats” and “Conforming Temperature in FoxPro (some conversions)“. These are two examples of things you may need to do to enforce conformity.

From Here

In my next ETL Subsystem post, I will discuss Subsystem 9: Slowly Changing Dimensions (SCD). If you want to know more about Conformed Dimensions and the theory behind them, I recommend reading any of Kimball’s data warehousing books or articles that discuss the subject. He lays out the case quite nicely for conformed dimensions. I would like to stress that the success of the dimensional model rests in the ability of the data modelers and integration architects to design the data warehouse using as many conformed dimensions, attributes, and facts as possible.

There are not too many other blog postings (ex: 1 2 3) on the topic of data conformity that truly dive into the topic. I would love to see more, as this is such an important subsystem and best practice.

Tags: , , , , , , , , ,

5 Comments

ETL Subsystem 7: Removing Duplicates

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the most appropriate version for the warehouse.

Generally, the practice involves an input record and a reference table, but could also span several databases and require access to web services and 3rd-party software. The reference table is usually a dimension (such as Product, Customer, Employee, Address, etc.) or a staging table that contains the relevant attributes to perform the lookup. If a staging table is used (a must for large reference sets), then it must be maintained as part of dimension processing. For more information about reference tables and master data, try searching for “Master Data Management” on Google. If I feel up to the challenge, I’ll post more about MDM in a future series of articles.

It is common to have duplicate data in an organization, especially if that organization has multiple systems for managing customers, products, orders, inventory, etc. Consider a small business with in-house software to track customers and a separate database Online. The same customer could be in both systems with different account numbers and a slightly different name (“Martha Jones” vs. “Martha A. Jones”). As part of data warehousing, we need to eliminate these cases as best we can. That is, of course if it is important for your data warehouse to do so.

If an organization does not need to eliminate these duplicates from their data warehouse, then they should proceed no further. However, in order to do any form of historical analysis, you need data that you can follow uninterrupted back in time and across multiple sources. This subsystem helps you do that.

Fuzzy Matching

Fuzzy matching is the process of determining if two strings are approximately equal (ideally returning some form of confidence score). Algorithms such as Levenshtein Distance and SoundEx have been developed to provide this type of functionality.

If the duplication of information is minimal, or if the volume of data is manageable, then it should be a rather straight-forward exercise to write your own fuzzy matching solution using any of the known algorithms. But you should plan on spending some time on this endeavor. Also, as fuzzy matching is not an exact science, plan on some ongoing human intervention.

One thing is certain: The more atomic your data is, the easier your job of de-duping will be. It is considerably easier to look for duplicate address information when the street number, name, unit number, city, state, and zip are broken out into atomic parts!

SSIS and VFP do not have any built-in, native ability to de-dupe data. Organizations with severe duplication problems will likely need to invest in a 3rd-party cleansing solution. Regardless, here is what you could do using SSIS and hand-coded in VFP:

SQL Server 2005 Integration Services (SSIS)

SSIS provides three very good Data Flow tasks for de-duplication:

Lookup Transformation:
Performs a lookup against a reference table. The lookup uses an equi-join on one or more columns (composite join) and is case sensitive. Use this task for exact matching to a reference set.
Fuzzy Lookup
This transformation uses fuzzy matching (pattern-based) against a reference table in a SQL Server 2005 database. You must supply the reference table, which is typically built from your existing data. You can customize the fuzzy lookup by specifying the maximum number of matches to return per input row, token delimiters, and similarity thresholds. Each match includes a similarity score (how similar the input and reference values are) and a confidence score (the likelihood that a match has been found).
Fuzzy Grouping
The Fuzzy Grouping identifies rows of data that are likely to be duplicates. Then it selects a row that best represents the group and marks that row with a similarity score of 1. The other rows in the group are given a score ranging from 0 to 1. The closer to 1, the better the match.

It is a best practice to first use a Lookup Transformation to find exact matches and then divert the unmatched data into a Fuzzy Lookup. The package will run faster (the Fuzzy Lookup is more expensive than the equi-join Lookup) and will be easier to maintain (you will clearly see what records are being diverted as non exact matches).

You should read more about how the Fuzzy Lookup and Fuzzy Groupings work in Books Online. They are both quite interesting. Also, they’re not too easy to get the hang of at first and have quite a few configuration options. Later, I will demonstrate these tasks in a future posting.

If you need a 3rd-Party tool, then integrating it into the Control Flow would be ideal. Otherwise, you will need to stage the dirty data and run the vendor’s product separately on that stage. You can then pick up the de-duped data in SSIS and continue processing. You may be able to use the Execute External Process task (runs a Win32 Executable or batch file) and put the entire operation into an SSIS Control Flow (which is the second-best solution).

Intelligent Search Technology has an integrated deduplication task for SSIS. I have not evaluated this task, but sounds pretty good. Read about it here.

Hand Coding with Visual FoxPro (VFP9)

FoxPro does not have any native or built-in solution for de-duping data. This isn’t odd, as most (if not all) programming and databases languages don’t either (I would love to know of a language that has some form of de-duping functionality). You can, however, build your own by using functions such as the following:

Levenshtein Algorithm
Check out the discussion on the Fox Wiki.
LIKE/ LIKEC
Can use wildcards to see if expression 1 is like expression 2. The LIKEC version is for double-byte characters. This function might be appropriate under certain circumstances for identifying data to de-dupe.
SOUNDEX
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. More about SoundEx.
Craig Boyd’s SoundsLike Function
Check out Craig Boyd’s improvement for the SOUNDEX function on the VFP Wiki.
MetaPhone and Double Metaphone
Here are a couple of gems translated for us by Craig. The code is posted on the FoxPro Wiki
The ‘Oliver’ Function
I, like Tyler Akins at rumkin.com, first heard of this using PHP. You can see Tyler Akins’ implementation using FoxPro code here. Pretty cool!

Note: The code (zip format) for Craig’s work can be found here at Replacement Software. You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well.

As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just make up a word?), you can handle this easily in FoxPro by splitting compound fields yourself. I wrote a couple posts last year on this very subject: “Parsing firstname / middlename / lastname” and “Parsing City/State/Zip“. You can then use the individual name and address parts to perform the search, increasing your likelihood of finding good matches. Also, it would be a good idea to convert all strings to upper or lower case (I prefer upper) as well.

From Here

Although SSIS and FoxPro do not offer built-in de-duplication algorithms for names, addresses, products, and the like, they both offer enough tools, tasks, and functions so that you can write your own. Plan on spending plenty of time on this endeavor, as de-duplication is not an exact science and will undoubtedly require some human intervention during the process.

Next post, I’ll discuss ETL Best Practice Subsystem 8: Data Conformance. Data conformity across dimensions and fact tables is one of the hallmarks of a fully integrated dimensional data warehouse.

Tags: , , , , , , , ,

8 Comments

ETL Subsystem 3: Data Extraction

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

One of the major pieces in the Data Integration puzzle is data extraction. Simply put, data extraction is the process of taking data from some source system to be used in some other process. Usually, in an ETL environment, extracted data is staged for further processing (staging is important for lineage and recovery purposes.).

Extracting comes in a few different flavors:

Push
The source system ‘pushes’ the data into the ETL system (through an export, direct write into a staging area, etc.)
Pull
The ETL process ‘pulls’ the data from the source system (usually using SQL to read the data)
Dribble
For real-time/right-time systems, this mechanism makes data available to the ETL system quite frequently throughout the day, possibly through database triggers or message queues (see MS Message Queuing for more details)

Depending on the source system, the extraction may be different. Also, depending on the purpose of the process (complete historical load verses incremental loads) the connection may be different. It is imperative that the decisions on how to extract data from source systems are well documented in the ETL technical documentation. This document will aid integration specialists, together with the data mappings, when developing the extract system and methodology.

Furthermore, you should strive to extract only relevant data. Extra, non-essential data will take up additional disk space and memory during processing. This is especially true in SSIS where memory is quite important.

In the perfect world

Ideally, the source system will push data to the ETL system on some pre-determined, business-relevant interval (hourly, daily, weekly, etc.). The source system will be responsible for determining the changed data; however, you should never fully trust source systems and always build a CDC system (see my previous post ETL Subsystem 2: Changed Data Capture for more information) to pick up any changed data missed by the source system.

The ideal extraction strategy for Pull systems relies on trigger-level insert and last update dates on each row in each table in the source. Pulling the data by date range — perhaps one subset for inserts and another for updates — is quite efficient and as long as the database is supplying the dates through triggers, you can use this method with confidence. If the source system uses application logic to populate these dates: steer clear. You’re asking for trouble. There is always at least one rogue update process that will fail to manually insert the correct date!

Realistically though

Things are usually less than ideal. Your CDC strategy will help determine how you acquire extract data and what to do with it when you do. Some additional ways to access extracted data including using vendor-supplied APIs, web services, direct data access, reading file-based reports, tapping into the transaction logs, or using some external 3rd Party tool. Be concerned with difficult-to-access legacy systems, especially if the experts are beginning to ask for retirement!

It is also possible to use database replication. A Publisher can distribute transactional data to any Subscribers such as an OLTP or ODS server and your ETL system. This scenario is best in a controlled environment where the IT department can manage the replication across systems.

Data Extracted. Now What?

I prefer to stage extracted data for later use (even if ‘later’ is only a few seconds away). You acquire the data from the source, and drop it off on disk in some native format (DBF files for FoxPro, RAW files for SSIS). This allows you to return to this step if problems were encountered downstream, if downstream processes need to wait for some other condition before continuing, for data lineage purposes, and for recovery in case of failure. You will not want to have to go back to the source to re-capture this data. It isn’t efficient and a lot can happen to the data while the ETL process fumbles around.

SQL Server 2005 Integration Services (SSIS)

Data extraction in Integration Services is rather easy. There are many different ways to connect to source systems. Dot NET Providers, for example, give you access to many types of systems including Informix Dynamic Server, SAP, and Sybase. Connection types included:

  • DataReader Source: Allows you to read data from a .NET Provider (see the ADO.NET team blog for more details) using the sqlcommand property. Tip: The DataReader usually performs better than OLE DB even for SQL Server. Also, be aware that you can sort data for further processing by specifying the IsSorted property = True on the Input and Output Properties tab. Remember to set the SortKeyPosition for each Output Column involved in the sort as well.
  • OLE DB Source: Use OLE DB to retrieve data from a variety of sources. You can use this adapter to request a result from a table, view, stored procedure, or SQL query. I like to develop my queries in SSMS or the source system’s editor (MySQL Query Browser or VFP IDE for example) if I have it. This helps to get any syntax quirks out of the way. SSIS is a bit lacking when it comes to writing queries using the provided edit box.
  • Excel Source: Simplifies access to MS Excel data. I prefer to stay away from using Excel as a data source for incremental loads because it is too easy for the user to break the process by changing the format of the file.
  • Flat File Source: A great adapter for reading flat files (fixed, delimited, ragged right). You will actually set up most of the information in the flat file connection and not in the source adapter. I use the flat file adapter more than any other particularly because it is easy to get data from source systems in a flat file. XML is slowly taking over, however.
  • XML Adapter: Used to read XML data and help convert it to a relational form. Use this adapter to talk to XML, either via file or direct data from a variable, shred the collections and hierarchies, and produce multiple outputs. I even use this adapter to shred XML for other purposes (non-data integration purposes) because it is so easy. Tip: Accept XML with XSD schema definitions only; this will save you a ton of time and effort profiling the XML for determining destination details.
  • Raw File Source: Internal SSIS format that can be used to stage data or exchange data among packages. This is a very fast mechanism to read from because the format of the raw file is the same format that SSIS uses in memory; SSIS simply loads the raw data and you’re off to the races.

I also make heavy use of the FTP Control Flow Task and have also used the Web Service Control Flow Task to retrieve data. Both of these tasks are relatively easy to set up and use.

Also, for real-time systems (using the dribble extraction approach), you will likely be looking into the Message Queue Task and WMI Event Watcher Task. I have not had the privilege of working in a dribble environment (yet), so I don’t have much more to offer at this time.

Hand Coding with Visual FoxPro (VFP9)

Of course, much of the above is easily accomplished in Visual Foxpro. You won’t have a colorful drag-and-drop editor to work with but developing your own classes to accomplish what SSIS can do is rather trivial (technically, you can then use the Toolbox Panel and get some drag and drop if you want). With the available adapters (ADO, ODBC, XML, Native), and the ability to read and append from files directly, FoxPro is well positioned as a data extraction tool. Not only that, but you can use the native DBF format as a staging format, making downstream operations that much faster (don’t stage to flat files when you can stage into native tables in a staging database).

I outlined a rather comprehensive data extraction and staging strategy in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy“. I also talked at length about certain techniques in my Fox Forward and AFUG presentations earlier this year (2007 Fox Forward Presentation Examples).

From here

Once data is extracted and sitting in a staging area (or in memory if you don’t use a physical stage), you can move on to cleaning it. The 4th subsystem is just that: The Data Cleansing Subsystem.

Tags: , , , , , , ,

3 Comments