Posts Tagged Quality

SSIS Debugging: Find in Files, Naming Conventions, and a Problem Solved!

I recently inherited several complex SSIS packages that are now being promoted from the development environment (ALPHA) to testing (BETA). There was one lingering issue though that took me a few hours to figure out.

We’ve implemented custom logging and auditing in the packages. In the Alpha environment, we set up Agent jobs to run groups of related tasks. For some odd reason, a log entry was being inserted into our audit table that was from a previous job step and not from the one we expected.

To give a simplified example: There are 3 SSIS packages being called by 3 job steps. Each loads a CSV file into a staging area for dimension processing. Step 1 reports that file A was processed. Step 2 reports on file B, and step 3 on file C. The step 3 job was reporting that file A was processed, but that it was the step 3 job that did it.

This wasn’t noticed until recently for several reasons, but primarily because we generally don’t rely on this particular log entry to determine if our files have been staged. So no one looks at it. In fact, this log entry merely serves as a bracket in the log table for other log entries posted in the same job. Nevertheless, I wanted to fix it.

So, where was the problem? All the logging is done using an Execute SQL Task. Every task checked out OK. The correct variables were being used in all cases. I was about ready to blame SSIS or corruption in the package when I stumbled upon a valuable clue: The name of the exact SQL Task Causing the problem!

I found the pesky task by doing a “Find in Files” search (BIDS -> Edit -> Find and Replace -> Find in Files) on the exact message being inserted into my audit table. You can use regular expressions and wildcards in this dialog so I was sure to find at least a clue. It turned out though that my exact search found the phrase; it was in a task named “Execute SQL Task”.

Red Flag!

I meticulously went through each package last month and renamed each and every task using the naming conventions posted on Jamie Thomson’s blog. This is a great best-practice and I certainly advise all SSIS developers to use this (or similar) convention. Anyway, the developers before me did a fairly good job of renaming tasks so they had meaningful names, but it was nearly impossible to look into sysdtslog90 or our custom audit tables and identify what kind of task the record was referring to. So I felt that while these packages were still in development, it was worth the few days effort to make these changes. All I did was add the suggested prefix before each task and renamed a few that I thought needed better descriptive names (sometimes shortening them in the process).

I know for sure that I didn’t miss any! Right?

Well, yes. I did. And so did the developers before me. When the package was being developed, the developers used cut & paste to move similar tasks from an existing package into this one. Then, a series of Sequence containers and Loop containers were added and tasks were moved around. A Sequence container was sized over the troublesome “Execute SQL Task” – to be forever (almost) lost in the Control Flow maze. I first saw it in the Package Explorer view and then went back to the Control Flow to see where it had gone. The task ran an SQL statement that updated the audit table.

Advice

Take care when using cut & paste to move items from one package to another. Not only could you run into some simple problems like the one detailed here, you could also forget to update a variable or an expression – producing odd or destructive results on your database. For us, this was just a simple audit log entry with zero business impact. Are there other cases buried in the packages? Need to check right now…

Also, use the naming convention best practice as discussed by Jamie. There are a lot of great reasons to do so (ease in development, debugging, readability).

I have never been in the habit of using the “Find in Files” function. But it sure saved me here. If you haven’t been using it in your packages, then give it a try!

Tags: , , , ,

No Comments

ETL Subsystem 7: Removing Duplicates

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the most appropriate version for the warehouse.

Generally, the practice involves an input record and a reference table, but could also span several databases and require access to web services and 3rd-party software. The reference table is usually a dimension (such as Product, Customer, Employee, Address, etc.) or a staging table that contains the relevant attributes to perform the lookup. If a staging table is used (a must for large reference sets), then it must be maintained as part of dimension processing. For more information about reference tables and master data, try searching for “Master Data Management” on Google. If I feel up to the challenge, I’ll post more about MDM in a future series of articles.

It is common to have duplicate data in an organization, especially if that organization has multiple systems for managing customers, products, orders, inventory, etc. Consider a small business with in-house software to track customers and a separate database Online. The same customer could be in both systems with different account numbers and a slightly different name (“Martha Jones” vs. “Martha A. Jones”). As part of data warehousing, we need to eliminate these cases as best we can. That is, of course if it is important for your data warehouse to do so.

If an organization does not need to eliminate these duplicates from their data warehouse, then they should proceed no further. However, in order to do any form of historical analysis, you need data that you can follow uninterrupted back in time and across multiple sources. This subsystem helps you do that.

Fuzzy Matching

Fuzzy matching is the process of determining if two strings are approximately equal (ideally returning some form of confidence score). Algorithms such as Levenshtein Distance and SoundEx have been developed to provide this type of functionality.

If the duplication of information is minimal, or if the volume of data is manageable, then it should be a rather straight-forward exercise to write your own fuzzy matching solution using any of the known algorithms. But you should plan on spending some time on this endeavor. Also, as fuzzy matching is not an exact science, plan on some ongoing human intervention.

One thing is certain: The more atomic your data is, the easier your job of de-duping will be. It is considerably easier to look for duplicate address information when the street number, name, unit number, city, state, and zip are broken out into atomic parts!

SSIS and VFP do not have any built-in, native ability to de-dupe data. Organizations with severe duplication problems will likely need to invest in a 3rd-party cleansing solution. Regardless, here is what you could do using SSIS and hand-coded in VFP:

SQL Server 2005 Integration Services (SSIS)

SSIS provides three very good Data Flow tasks for de-duplication:

Lookup Transformation:
Performs a lookup against a reference table. The lookup uses an equi-join on one or more columns (composite join) and is case sensitive. Use this task for exact matching to a reference set.
Fuzzy Lookup
This transformation uses fuzzy matching (pattern-based) against a reference table in a SQL Server 2005 database. You must supply the reference table, which is typically built from your existing data. You can customize the fuzzy lookup by specifying the maximum number of matches to return per input row, token delimiters, and similarity thresholds. Each match includes a similarity score (how similar the input and reference values are) and a confidence score (the likelihood that a match has been found).
Fuzzy Grouping
The Fuzzy Grouping identifies rows of data that are likely to be duplicates. Then it selects a row that best represents the group and marks that row with a similarity score of 1. The other rows in the group are given a score ranging from 0 to 1. The closer to 1, the better the match.

It is a best practice to first use a Lookup Transformation to find exact matches and then divert the unmatched data into a Fuzzy Lookup. The package will run faster (the Fuzzy Lookup is more expensive than the equi-join Lookup) and will be easier to maintain (you will clearly see what records are being diverted as non exact matches).

You should read more about how the Fuzzy Lookup and Fuzzy Groupings work in Books Online. They are both quite interesting. Also, they’re not too easy to get the hang of at first and have quite a few configuration options. Later, I will demonstrate these tasks in a future posting.

If you need a 3rd-Party tool, then integrating it into the Control Flow would be ideal. Otherwise, you will need to stage the dirty data and run the vendor’s product separately on that stage. You can then pick up the de-duped data in SSIS and continue processing. You may be able to use the Execute External Process task (runs a Win32 Executable or batch file) and put the entire operation into an SSIS Control Flow (which is the second-best solution).

Intelligent Search Technology has an integrated deduplication task for SSIS. I have not evaluated this task, but sounds pretty good. Read about it here.

Hand Coding with Visual FoxPro (VFP9)

FoxPro does not have any native or built-in solution for de-duping data. This isn’t odd, as most (if not all) programming and databases languages don’t either (I would love to know of a language that has some form of de-duping functionality). You can, however, build your own by using functions such as the following:

Levenshtein Algorithm
Check out the discussion on the Fox Wiki.
LIKE/ LIKEC
Can use wildcards to see if expression 1 is like expression 2. The LIKEC version is for double-byte characters. This function might be appropriate under certain circumstances for identifying data to de-dupe.
SOUNDEX
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. More about SoundEx.
Craig Boyd’s SoundsLike Function
Check out Craig Boyd’s improvement for the SOUNDEX function on the VFP Wiki.
MetaPhone and Double Metaphone
Here are a couple of gems translated for us by Craig. The code is posted on the FoxPro Wiki
The ‘Oliver’ Function
I, like Tyler Akins at rumkin.com, first heard of this using PHP. You can see Tyler Akins’ implementation using FoxPro code here. Pretty cool!

Note: The code (zip format) for Craig’s work can be found here at Replacement Software. You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well.

As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just make up a word?), you can handle this easily in FoxPro by splitting compound fields yourself. I wrote a couple posts last year on this very subject: “Parsing firstname / middlename / lastname” and “Parsing City/State/Zip“. You can then use the individual name and address parts to perform the search, increasing your likelihood of finding good matches. Also, it would be a good idea to convert all strings to upper or lower case (I prefer upper) as well.

From Here

Although SSIS and FoxPro do not offer built-in de-duplication algorithms for names, addresses, products, and the like, they both offer enough tools, tasks, and functions so that you can write your own. Plan on spending plenty of time on this endeavor, as de-duplication is not an exact science and will undoubtedly require some human intervention during the process.

Next post, I’ll discuss ETL Best Practice Subsystem 8: Data Conformance. Data conformity across dimensions and fact tables is one of the hallmarks of a fully integrated dimensional data warehouse.

Tags: , , , , , , , ,

8 Comments

ETL Subsystem 4: Data Cleansing

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data Cleansing is a Data Quality process that is designed to catch and resolve data quality violations. Proper data cleansing will add tremendous value to the data, giving your business users plenty of reasons to trust it. Sometimes referred to as ‘scrubbing’, this subsystem is a must for all Data Warehousing projects.

Most quality issues will be discovered during data Profiling, which I discussed here. Other issues will be identified by business users who have had to deal with problem data in the past. Some cleansing operations will be performed because of Master Data initiatives (for example, converting customer codes to universal codes for comparisons).

Cleansing data can consist of tasks like the following:

  1. Parsing city, state, and zip from an address string
  2. Changing the case of a person’s last name to UPPER
  3. Finding the delivery point from a street address and zip code
  4. Stripping out unwanted characters from text
  5. Swapping NULLs for some other value suitable for the DWH
  6. Deriving a value from one or more fields to create a new field
  7. Concatenating two or more fields into one (for example, turning day, month, year fields into a date)

Obviously, the above list is not exhaustive. There are myriad ways to improve the quality of data during integration. The important thing to keep in mind is that this subsystem, more than any other, actually seeks to change the data. As quality is the driving factor, these changes should always be for the better and serve some business purpose. Any change to the data should add value and confidence.

Garbage in, Garbage out

Think of data cleansing as the process of cleaning up all that garbage that got in in the first place. OLTP systems (especially legacy systems that have been kicking around for a few decades) are notorious for doing some strange things with the data. I have encountered cases where fields were re-purposed (“hey we’re not using store_id2 anymore, so now we have some place to put the manager’s phone number!”), algorithms re-written (but historical data not recalculated or stamped with a version), and nulls represented as anything but NULL (99, -99, -1, “N/A”, etc…).

Not only that, but it’s likely that you have run into cases where users were able to free-form type into data entry fields. For example, a list of valid products might include “Widget A” and “Widget B”, but because a user can type what they want, you get “WedgetA” and “WDGT-B”. Of course, from an application architecture point of view, this is a terrible design problem. Anyway, as a data integrator, expect this and worse.

One man’s junk is another man’s treasure

But there is light. Many of these quality problems can be corrected outright through integration, while others can be addressed directly in the source OLTP system. Many home-grown or custom applications compensate for bad data in some way (lookup/mapping tables, loose database constraints, routines that go back and ‘fix’ the data, etc…). Through ETL, you will have to do similar work.

What might seem like worthless data is oftentimes salvageable. In fact, I have been involved in projects where large chunks of data were simply discarded because the business users felt it was unworkable. After some tinkering, it turned out to be far from the truth. Not only was some of the data useful, but we were able to take our ‘dirty data proof’ back to the application developers to get the application fixed (fixing DQ issues at the source is always preferable).

SQL Server 2005 Integration Services (SSIS)

SSIS offers a great set of tasks for cleansing data. If you add the OLEDB Command (for SQL code) and Script Component (for VB code) to the list below, then there really are no limits to what can be achieved. My only criticism is that using expressions in SSIS is quite difficult at first and takes some getting used to (see Integration Services Expression Language). The language and syntax is native to SSIS. On the plus side, the expression builder dialog box is helpful. Expressions are used throughout SSIS, but especially in the Derived Column task.

Typically, data cleansing will involve one or more of the following SSIS tasks:

  • Lookup Transform – performs a lookup using a join to another table, adding columns from the lookup to the data flow
  • Character Map – for string data, this transform applies a variety of string functions to the input column (changing case, for example)
  • Copy/Map – create a copy of a column
  • Data Conversion – convert data types from one to another
  • Derived Column – use this task to create or replace existing data by concatenation, exploding, applying calculations, and doing comparisons
  • Fuzzy Lookup – a component you can use for inexact matching
  • Fuzzy Grouping – uses algorithms to match or group disparate data to a configurable degree of accuracy

In future posts, I will demonstrate how many of these tasks work (likely through a SQL Server 2008 lens).

Hand Coding with Visual FoxPro (VFP9)

I’ve done a ton of VFP-related data cleansing over the past decade. Of course, VFP doesn’t have any fancy squares that turn green when a cleansing operation succeeds (or red when it fails!), but the language itself will give you everything you need to start cleansing data with little effort. I started trying to count the number of different commands you could use for cleaning in VFP. I stopped when I hit 50. There are likely more than a 100. Commands like SUBSTRING, ALLTRIM, CHRTRAN, LEFT, RIGHT, OCCURS, AT, LOWER, PROPER, SOUNDEX, LIKE, TRANSFORM, etc. Consult with the Visual FoxPro documentation or resources such as The Visual FoxPro Wiki for more details regarding VFP’s language features. I know this is a bit of a cop-out on my part but hey, I could spend days trying to summarize the capabilities.

What I can do is share some data cleansing / VFP implementation experience with you. The Logical Data Map (LDM) is a mapping document (sometimes done in Excel or relational database if you prefer) that tells you all of the source-to-target mappings for the integration. In VFP, you can create a VFP table (or tables) and embed cleansing code snippets into memo fields. This code can execute as part of a metadata-driven cleansing engine. I demonstrate how this can work in my articles for Advisor media. You can see some examples in my FoxForward material as well (see: 2007 Fox Forward Presentation Logical Data Map Database).

From here

Both SSIS and VFP have great data cleansing capabilities; in both systems, data cleansing is easy. In VFP, you can expect a mature language and blazing speeds (FoxPro is excellent when working with string data). SSIS makes it very easy to derive columns and change data types. SSIS can also make use of multiple processors and has some good built-in logging functionality. In the end, I prefer VFP because it is much easier to work with FoxPro’s languange (as oposed to VB and the SSIS expression language). Also, in VFP I find it much easier to manage metadata and the LDM.

In the next article, I’ll discuss the Error Event Management subsystem and how VFP and SSIS can deal with that!

Tags: , , , , , , , ,

6 Comments

The Role of Good Database Design and Normalization for Data Integration Projects

Not only is a database designer charged with the task of identifying entities, attributes, and relationships (the fun stuff) but also to ensure data integrity and data quality (often a challenge). This challenge is not limited to building indexes, rules, triggers, and database procedures. Database normalization is crucial in the design of relational databases. A good design is a normalized one, which seeks to minimize duplicate data and isolates entity rows so that they contain the exact data needed to satisfy the entity definition. The entity definition is based on business definitions: “customer demographics file”, “daily orders”, “receipt line items”, “store location”, etc. The database architect takes these business definitions and translates them to entities using ER or similar diagramming. And thus begins the normalization process.

I don’t want to get too academic here, or dive into the theory behind database normalization. That discussion can be saved for another day (but you can read more here or ). But basically, normalization is based on several cumulative rules or ‘Forms’. In the list below, Forms (1NF, 2NF, and 3NF) are most widely implemented. BCNF and 4NF are used but generally are the first to be relaxed when purposefully denormalizing the database (more on this below). I don’t recall ever seeing 5NF, DKNF, or 6NF in any production environment — instead these Forms seem to be relegated to specialized object-based systems, academic, and theoretic projects.

First Normal Form (1NF)
Entity does not have duplicate columns and each row can be identified by a primary key
Second Normal Form (2NF)
The usage of foreign keys to build relationships to remove repeating subsets of data
Third Normal Form (3NF)
Removes all attributes from a table not dependent upon the primary key
Boyce-Codd Normal Form (BCNF)
The only determinants are superkeys, or a superset of candidate key
Fourth Normal Form (4NF)
The removal of all multi-valued dependencies
Fifth Normal Form (5NF)
Projection-based normalization approach where every nontrivial join dependency is implied by a candidate key
Domain/Key Normal Form (DKNF)
Domain constraints and key constraints only
Sixth Normal Form (6NF)
If interested, read about the temporal/time dimension and 6NF in the book Temporal Data & the Relational Model.

Database normalization is an art form and takes practice, good sense, and a thorough understanding of the operational process to be modeled. The reasons why we walk through this exercise before each design is simple: Denormalized data can lead to update, insert, and delete anomalies, which must be avoided for data integrity and quality. Each of the above Normal Forms address one or more type or degree of anomaly:

Update Anomaly
When data to update is stored in many places, and not all updates take place
Insert Anomaly
When data to insert cannot be inserted because additional information (which should be stored as its own entity) is unavailable
Delete Anomaly
When deleting a row causes other information to be deleted

Many of these problems caused by denormalization are caused by bad planning, faulty design, laziness, or database scope overloading.

  • The Planning Factor
  • Normalization was not a consideration when designing the database. There are many reasons for this, but from my experience it is usually one of two things: The designer is a novice and is building his or her first database, or the design is based faithfully off of a spreadsheet (affectionately referred to as Spreadsheet Syndrome). This type of problem I have seen countless times in Access-based database systems, but other user-friendly RAD RDMS systems (like Visual FoxPro) suffer from this as well.

  • Faulty Design
  • Sometimes good intentions just aren’t enough! Poor or uninformed choices were made early on in the design process either because normalization concepts were misunderstood or business rules were not clearly defined. Database refactoring is a big deal and once you get started it is very difficult to backstep. I should note that faulty design is not limited to inadequate normalization but also to entity definitions, metadata considerations (is there any metadata at all?), and naming conventions (account_no, acct_nbr, account, acctnumb, etc…).

  • Laziness
  • Laziness and selective ignorance go hand-in-hand. After all, why create 10 fully normalized entities when you can have 1! The challenge of maintaining a database with dozens upon dozens of tables can be daunting. Especially when considering the need for careful index planning and the challenge of translating many entities into a good GUI design. So the urge is to combine order information with line items, or billing information with the customer demographic data to ease the workload.

  • Overloading the Design
  • I have seen database designs where the architect was clearly trying to satisfy multiple purposes; namely, daily transactions and ad-hoc reports. When building ER diagrams, a designer (who is often the developer as well) may take a step back and wonder what it would take to produce certain reports across multiple entities in their fully normalized model. Not only is writing a report of that nature tough, but performance will certainly be an issue. These are valid concerns. But the solution is not selective denormalization. The transaction database should be specific to transactions. If the same database is needed for reporting and ad-hoc queries, consider creating views or build a separate dimensional model with real-time/right-time updates using triggers on insert, update, and delete (my preferred method and a breeze in VFP).

Denormalization

There are times when denormilization is in order for a relational database. But the rule is this: Normalize the entire design first (usually up to 4NF where possible), approve the model with business users, and build a working prototype. Then, and only then, look for ways to improve system optimization and performance through denormilization. back off of any change that requires any sort of debate. Remember that data integrity and quality are far more important in an operational system than a tiny performance improvement or a couple of saved minutes in programming time.

Of course, in a dimensional model, denormalization is encouraged!

Data Integration

The reason for this post has more to do with data integration than you might have gathered. When conducting a data profile to learn about the source data, it helps tremendously if this data has been normalized properly. Certain assumptions can then be made and the integration process can steam along. A normalized database helps the integration engineers determine which data is reliable and represents the single version of the truth for the data warehouse. If data anomalies are possible because of a bad or lazy design, it could cost your data warehousing project days, weeks, or months to account for.

Therefore, normalized operational systems are not only essential for the data integrity and quality of those systems, but also for Business Intelligence initiatives that rely on them. This is true even if no anomalies exist in the source database (because of heavy application rules and code to ensure integrity on a denormalized structure); the integration team will not be able to make the assumption that anomalies will never exist. Data value is so important for the dimensional model and Business Intelligence that precautions like these must be taken.

Tags: , ,

5 Comments

Find Unmatched Keys from Two Data Sets

Here’s a real fast way to find the keys (natural or surrogate) that are missing from Set A or Set B. This query is similar to finding the symmetric difference between two tables, except that it only compares keys:

SELECT DISTINCT EVL(tableA.pkey, tableB.pkey) AS not_in ;
  FROM tableA ;
    FULL OUTER JOIN tableB ON tableA.pkey  = tableB.pkey ;
  WHERE tableA.pkey IS NULL OR tableB.pkey IS NULL

I used EVL in order to show a unique set of values that are missing from either TableA or TableB. To see exactly which keys are missing from each table, simply leave the EVL out and display both columns.

I tested this on two tables (A & B) with more than a million records using an integer surrogate key. The query took 8.6 seconds. A similar approach works in TSQL. Just use COALESCE instead of EVL and you’re off to the races!

Tags: , , ,

2 Comments