An SSIS Custom Transformation Success Story
I recently inherited eight SSIS packages that all do essentially the same thing: They extract data from disparate sources, cleanse and conform them, and finally load them into an analytical data warehouse (dimensional model). Welcome to Data Integration 101!
The project calls for another two dozen of these extracts. Unfortunately, the team before me operated with little functional documentation, a limited knowledge of data integration, and did not focus on using tried and true best practices. There was no data profiling, no logical data mapping, no quality controls, no auditing, and very little in the way of logging. Essentially, things are a bit of a mess.
My primary focus over the next 6 months is to clean things up. I’ve already introduced better documentation practices, naming conventions, and this crazy concept of the “package template”. There is no way on earth I am developing 20 similar packages without using a template!
I’ve also started work on a Data Quality Manager Application (for now, it will consist of a table-driven set of SQL statements that can be run via stored procedure on a scheduler) and a Logical Data Map (LDM). The LDM will provide much needed documentation and give the other SSIS developers a little more direction.
I plan on posting much more about the LDM, Data Quality Manager, and package templates in a later post. For now, I want to gloat a little about a custom transformation task component I wrote.
The Problem
When the data comes into SSIS, work needs to be done to relate the rows from the source to those of the target. The mapping is a little different depending on the source file, but in the end, we need to match any number of significant characteristics (aka the file’s candidate key) to a single unique ID Number in our data warehouse database.
For example, one file might contain Social Security number, another might have last name and date of birth, a third might have the last name and the last 4 digits of a credit card number, yet another would have an account number from a billing system. Attached to all these identifiers are some important metrics that need to be added to a fact table in the data warehouse.
To do this, the developers before me used three Data Flow tasks, and over 100 Transformations (no joke). The second (and meatiest) Data Flow task took the extracted data (now in a staging table) and then embarked on a long drive through the various possible match combinations. First, there would be a Conditional Split that checked to see if the row had a valid account number, if not, then the output would hit another Conditional Split that would check on last name and so on. If there was a valid account number, a Lookup component would check the account dimension in the data warehouse for a match. If exactly one match was found (it is possible to have repeating account numbers), it would add the ID to the pipeline and drop the results into another staging table as a “match”. If not found — or there was a duplicate — another Lookup component would be used on some other identifiable column. Repeat. Repeat. Repeat!
The other two data flows did various cleanup work.
To add complexity to this mess, the attributes in each source are different. Sometimes account number has an extra check digit at the end. Other times, dashes are inserted in different positions.
A picture is worth a thousand words. Here is one of the data flows. The quality is bad on purpose: This is business logic, and as messy as it is, I don’t want to give away any company secrets! But, here is a high view so you can taste the pasta.
The developers did an admirable job of wading through all this and managed to implement a pure SSIS solution that works. They did this all without a data profile, LDM, or any real functional documentation. However, maintaining this, or repeating this logic in 20 more packages, makes me want to extract my eyes from my head, transform them into soup, and load them into my mouth.
A Better Approach
So I spent an hour looking at all this and thought to myself how easy and simple this would be using Visual FoxPro. A few SELECT statements, perhaps a CASE here, an IF there, and a REPLACE command for good measure.
Unfortunately, I can’t use VFP for this project. We’re locked into SSIS and there isn’t a single VFP license in my reach. The next best option, of course, is to design a custom component using C#.
In my next post, I’ll explain the nuts and bolts in detail. But to summarize, my Custom Task component (called “ID Lookup”) uses about 100 lines of code in the ProcessInput method, and another 100 lines in PreExecute (give or take). It essentially acts as a Lookup and Conditional Split with some extra logic built in that is specific to our situation. There are properties on the component that allow the developer to specify what input columns should be used in the matching process.
The new Data Flow looks a little more manageable.
ID Lookup achieves the exact same results as the three data flows and 100 transformations did. As an added bonus, it performs over 600% better! And I’m sure with a bit more tweaking, I can squeeze out additional performance.
Aesop Time
I suppose that being a senior-level software developer in my previous life helped out a bit. I’m not certain that the team before me even thought about writing some code to solve the problem. I also have the added benefit of seeing eight different packages in a completed state. It is much easier to identify patterns when you have something tangible to look at.
So the moral is, don’t be afraid of writing custom tasks or Data Flow components. This was not my first, and it won’t be my last. Now, when we start to write the new packages, we’ll drop this transformation into the Data Flow, right smack between the appropriate staging tables.
As I said, in my next post, I’ll take a look at some of the code I used to create the component. Keep in mind that I cannot show you our business-logic, but I will show the framework.
I'm a Quant Technical Specialist (Data Warehousing and Business Intelligence), with expertise in business analysis, data modeling, and data integration. I have extensive experience developing vertical and integrated desktop, Internet, and BI applications spanning municipal, clinical, and financial industries.

April 8th, 2008 at 11:34 am
Looks like you’ve done a good job for yourself, and discovered a much undervalued SSIS capability, one of my favourites in fact. I do wish however you would not call it a Task.
Tasks are elements found in Control Flow, but you have written a Data Flow “component”, specifically a transformation.
Writing custom tasks is even easier than components so write those too!
April 8th, 2008 at 1:14 pm
LOL
Ok, ok. So I wrote a custom transformation “component”! Now that I’ve re-read my post, calling it a “task” now gets under my skin a bit. I may just have to edit it….
April 16th, 2008 at 3:24 am
Nice to see you improved the processing.
Sad to hear that things are “a bit of a mess”.
Quiet curious on how you wrote the custom transformation, with all the lookups needed to find the various keys based on the different combinations.
Aim was to clearly show the business logic/flow in the package.
OK it looks messy, but at least the test manager was able to scan through the matching process and understand which matching types were needed.
When you put all of this in a custom transformation you make it more complex for a non programmer to understand what is happening in the ETL flow.
I think you have to consider what is easiest, scan through 2 X 100 lines of code or the flow picture…. (as you know a pictures says more than a thousand words).
But you have a point in doing it like that, if you improved performance by 600%.
Anyway, Tod it was nice working with you on the project….
see you.
April 16th, 2008 at 8:11 am
Excellent point Yves! The custom component sits in a black box, which makes it next-to-impossible for a test manager to pick apart. The control flows and data flows gave him and others on the team an opportunity to review each step along the way.
But just to be clear: The mess refers to a lack functional documentation, little to no experience of the team (basically, everyone but you) on data warehousing/data integration, and therefore no focus on implementing best-practices in the initial design. Sadly, this is indicative of many projects. On top of that, we’ve been stuck in the mud with the release (still).
The consequence of missing or limited data profiling (we had some pretty good talks about the data profiling issue in particular), no functional requirements, and a more-or-less ad hoc approach to managing the development processes is that oftentimes what works isn’t very optimized, easy to debug, or easy to repeat (which is the task I have in front of me).
I had the benefit of stepping in after all this, and saw several patterns in the data flow that I’ve been able to shake out and turn into some optimized C#. The code is rather simple — although I won’t post the actual implementation, I’ll be posting most of the foundation. Actually, it would have been great for you to still be here for this. I could have used your knowledge and expertise.
So my decision to go ahead with this was based on the following: ease of future maintenance, ease of future development, and performance improvements. I’ve achieved that at the expense of making it more difficult for business users to poke in and see. To compensate, the documentation for the new component will be kept up to date and easy to follow.
April 17th, 2008 at 10:26 am
[…] As promised, I’ll discuss in this post some of the details of my custom transform component I wrote in C# for SSIS. Unfortunately, I cannot publish the task because it contains business logic and company secrets! I will, however, talk about how I went about it and detail some of its finer and not-so-finer points. […]