I recently inherited eight SSIS packages that all do essentially the same thing: They extract data from disparate sources, cleanse and conform them, and finally load them into an analytical data warehouse (dimensional model). Welcome to Data Integration 101!

The project calls for another two dozen of these extracts. Unfortunately, the team before me operated with little functional documentation, a limited knowledge of data integration, and did not focus on using tried and true best practices. There was no data profiling, no logical data mapping, no quality controls, no auditing, and very little in the way of logging. Essentially, things are a bit of a mess.

My primary focus over the next 6 months is to clean things up. I’ve already introduced better documentation practices, naming conventions, and this crazy concept of the “package template”. There is no way on earth I am developing 20 similar packages without using a template!

I’ve also started work on a Data Quality Manager Application (for now, it will consist of a table-driven set of SQL statements that can be run via stored procedure on a scheduler) and a Logical Data Map (LDM). The LDM will provide much needed documentation and give the other SSIS developers a little more direction.

I plan on posting much more about the LDM, Data Quality Manager, and package templates in a later post. For now, I want to gloat a little about a custom transformation task component I wrote.

Update: I almost never re-publish a post, but felt I needed to in this case. Darren Green of http://www.sqlis.com, which is an excellent SSIS resource by the way, commented about my use of the term “Task” to refer to Data Flow “components”. He’s right, so I made the change! Sorry for any confusion. In fact, while reading his comment, I had an “Ah-Ha” moment: The way to tell the difference between the two Scripts in SSIS is to call them by their correct names: Script Task (of the Control Flow) and Script Component (of the Data Flow). Duh!

The Problem

When the data comes into SSIS, work needs to be done to relate the rows from the source to those of the target. The mapping is a little different depending on the source file, but in the end, we need to match any number of significant characteristics (aka the file’s candidate key) to a single unique ID Number in our data warehouse database.

For example, one file might contain Social Security number, another might have last name and date of birth, a third might have the last name and the last 4 digits of a credit card number, yet another would have an account number from a billing system. Attached to all these identifiers are some important metrics that need to be added to a fact table in the data warehouse.

To do this, the developers before me used three Data Flow tasks, and over 100 Transformations (no joke). The second (and meatiest) Data Flow task took the extracted data (now in a staging table) and then embarked on a long drive through the various possible match combinations. First, there would be a Conditional Split that checked to see if the row had a valid account number, if not, then the output would hit another Conditional Split that would check on last name and so on. If there was a valid account number, a Lookup component would check the account dimension in the data warehouse for a match. If exactly one match was found (it is possible to have repeating account numbers), it would add the ID to the pipeline and drop the results into another staging table as a “match”. If not found — or there was a duplicate — another Lookup component would be used on some other identifiable column. Repeat. Repeat. Repeat!

The other two data flows did various cleanup work.

To add complexity to this mess, the attributes in each source are different. Sometimes account number has an extra check digit at the end. Other times, dashes are inserted in different positions.

The original data flow

A picture is worth a thousand words. Here is one of the data flows. The quality is bad on purpose: This is business logic, and as messy as it is, I don’t want to give away any company secrets! But, here is a high view so you can taste the pasta.

The developers did an admirable job of wading through all this and managed to implement a pure SSIS solution that works. They did this all without a data profile, LDM, or any real functional documentation. However, maintaining this, or repeating this logic in 20 more packages, makes me want to extract my eyes from my head, transform them into soup, and load them into my mouth.

A Better Approach

So I spent an hour looking at all this and thought to myself how easy and simple this would be using Visual FoxPro. A few SELECT statements, perhaps a CASE here, an IF there, and a REPLACE command for good measure.

Unfortunately, I can’t use VFP for this project. We’re locked into SSIS and there isn’t a single VFP license in my reach. The next best option, of course, is to design a custom component using C#.

In my next post, I’ll explain the nuts and bolts in detail. But to summarize, my Custom Task component (called “ID Lookup”) uses about 100 lines of code in the ProcessInput method, and another 100 lines in PreExecute (give or take). It essentially acts as a Lookup and Conditional Split with some extra logic built in that is specific to our situation. There are properties on the component that allow the developer to specify what input columns should be used in the matching process.

The new data flow using the custom transformation task

The new Data Flow looks a little more manageable.

ID Lookup achieves the exact same results as the three data flows and 100 transformations did. As an added bonus, it performs over 600% better! And I’m sure with a bit more tweaking, I can squeeze out additional performance.

Aesop Time

I suppose that being a senior-level software developer in my previous life helped out a bit. I’m not certain that the team before me even thought about writing some code to solve the problem. I also have the added benefit of seeing eight different packages in a completed state. It is much easier to identify patterns when you have something tangible to look at.

So the moral is, don’t be afraid of writing custom tasks or Data Flow components. This was not my first, and it won’t be my last. Now, when we start to write the new packages, we’ll drop this transformation into the Data Flow, right smack between the appropriate staging tables.

As I said, in my next post, I’ll take a look at some of the code I used to create the component. Keep in mind that I cannot show you our business-logic, but I will show the framework.