Archive for April, 2008

An SSIS Custom Transformation Success Story

I recently inherited eight SSIS packages that all do essentially the same thing: They extract data from disparate sources, cleanse and conform them, and finally load them into an analytical data warehouse (dimensional model). Welcome to Data Integration 101!

The project calls for another two dozen of these extracts. Unfortunately, the team before me operated with little functional documentation, a limited knowledge of data integration, and did not focus on using tried and true best practices. There was no data profiling, no logical data mapping, no quality controls, no auditing, and very little in the way of logging. Essentially, things are a bit of a mess.

My primary focus over the next 6 months is to clean things up. I’ve already introduced better documentation practices, naming conventions, and this crazy concept of the “package template”. There is no way on earth I am developing 20 similar packages without using a template!

I’ve also started work on a Data Quality Manager Application (for now, it will consist of a table-driven set of SQL statements that can be run via stored procedure on a scheduler) and a Logical Data Map (LDM). The LDM will provide much needed documentation and give the other SSIS developers a little more direction.

I plan on posting much more about the LDM, Data Quality Manager, and package templates in a later post. For now, I want to gloat a little about a custom transformation task component I wrote.

Update: I almost never re-publish a post, but felt I needed to in this case. Darren Green of http://www.sqlis.com, which is an excellent SSIS resource by the way, commented about my use of the term “Task” to refer to Data Flow “components”. He’s right, so I made the change! Sorry for any confusion. In fact, while reading his comment, I had an “Ah-Ha” moment: The way to tell the difference between the two Scripts in SSIS is to call them by their correct names: Script Task (of the Control Flow) and Script Component (of the Data Flow). Duh!

The Problem

When the data comes into SSIS, work needs to be done to relate the rows from the source to those of the target. The mapping is a little different depending on the source file, but in the end, we need to match any number of significant characteristics (aka the file’s candidate key) to a single unique ID Number in our data warehouse database.

For example, one file might contain Social Security number, another might have last name and date of birth, a third might have the last name and the last 4 digits of a credit card number, yet another would have an account number from a billing system. Attached to all these identifiers are some important metrics that need to be added to a fact table in the data warehouse.

To do this, the developers before me used three Data Flow tasks, and over 100 Transformations (no joke). The second (and meatiest) Data Flow task took the extracted data (now in a staging table) and then embarked on a long drive through the various possible match combinations. First, there would be a Conditional Split that checked to see if the row had a valid account number, if not, then the output would hit another Conditional Split that would check on last name and so on. If there was a valid account number, a Lookup component would check the account dimension in the data warehouse for a match. If exactly one match was found (it is possible to have repeating account numbers), it would add the ID to the pipeline and drop the results into another staging table as a “match”. If not found — or there was a duplicate — another Lookup component would be used on some other identifiable column. Repeat. Repeat. Repeat!

The other two data flows did various cleanup work.

To add complexity to this mess, the attributes in each source are different. Sometimes account number has an extra check digit at the end. Other times, dashes are inserted in different positions.

 An SSIS Custom Transformation Success Story

A picture is worth a thousand words. Here is one of the data flows. The quality is bad on purpose: This is business logic, and as messy as it is, I don’t want to give away any company secrets! But, here is a high view so you can taste the pasta.

The developers did an admirable job of wading through all this and managed to implement a pure SSIS solution that works. They did this all without a data profile, LDM, or any real functional documentation. However, maintaining this, or repeating this logic in 20 more packages, makes me want to extract my eyes from my head, transform them into soup, and load them into my mouth.

A Better Approach

So I spent an hour looking at all this and thought to myself how easy and simple this would be using Visual FoxPro. A few SELECT statements, perhaps a CASE here, an IF there, and a REPLACE command for good measure.

Unfortunately, I can’t use VFP for this project. We’re locked into SSIS and there isn’t a single VFP license in my reach. The next best option, of course, is to design a custom component using C#.

In my next post, I’ll explain the nuts and bolts in detail. But to summarize, my Custom Task component (called “ID Lookup”) uses about 100 lines of code in the ProcessInput method, and another 100 lines in PreExecute (give or take). It essentially acts as a Lookup and Conditional Split with some extra logic built in that is specific to our situation. There are properties on the component that allow the developer to specify what input columns should be used in the matching process.

 An SSIS Custom Transformation Success Story

The new Data Flow looks a little more manageable.

ID Lookup achieves the exact same results as the three data flows and 100 transformations did. As an added bonus, it performs over 600% better! And I’m sure with a bit more tweaking, I can squeeze out additional performance.

Aesop Time

I suppose that being a senior-level software developer in my previous life helped out a bit. I’m not certain that the team before me even thought about writing some code to solve the problem. I also have the added benefit of seeing eight different packages in a completed state. It is much easier to identify patterns when you have something tangible to look at.

So the moral is, don’t be afraid of writing custom tasks or Data Flow components. This was not my first, and it won’t be my last. Now, when we start to write the new packages, we’ll drop this transformation into the Data Flow, right smack between the appropriate staging tables.

As I said, in my next post, I’ll take a look at some of the code I used to create the component. Keep in mind that I cannot show you our business-logic, but I will show the framework.

Tags: , , , , ,

6 Comments

Formula 409: Private Companies Must Comply with SOX

I’ve been doing a lot of research on Sarbanes-Oxley (SOX) compliance lately in part because I am now working in the financial industry and in part because I am preparing an article on the topic for Advisor Media.

SOX compliance is both complex and vague. There is no official compliance checklist, only various guidelines and advice from agencies, accountants, and vendors. Businesses are left to implement control frameworks, introduce new segregation of powers, add auditing and logging to existing systems, and rely on the advice and expertise of consultants and vendors who promise to deliver various solutions.

And if there is a misstep, the CEO could go to jail.

Section 409

One area I don’t hear a lot of discussion about from the IT world is the implications of Section 409. Not to say that there is no discussion, but that the vast majority of IT articles on SOX compliance focus on Sections 302 and 404. The reality is that Section 409 doesn’t easily translate to any specific IT implementation or control structure.

But it certainly has significant implications for a public company’s IT/R&D department. Here is the text of the Sarbanes-Oxley Act, Section 409:

Section 13 of the Securities Exchange Act of 1934 (15 U.S.C. 78m), as amended by this Act, is amended by adding at the end the following:

“(l) REAL TIME ISSUER DISCLOSURES. – Each issuer reporting under section 13(a) or 15(d) shall disclose to the public on a rapid and current basis such additional information concerning material changes in the financial condition or operations of the issuer, in plain English, which may include trend and qualitative information and graphic presentations, as the Commission determines, by rule, is necessary or useful for the protection of investors and in the public interest.”.

Basically, a public company must disclose material change events that would impact their financial condition or operations. And Big Brother wants pictures!

As an investor, this is great news; for the sake of innovation though, not so much.

Material changes

What is a material change? No clue. Well, I do have some clue, but there is no official definition of a material change in relation to Section 409 compliance. The only requirement seems to be that it is any change that impacts a company’s finances or operations. I suppose outsourcing a project to IBM, laying off a few dozen employees, or significantly cutting supplier costs all apply. Any change in an organization that could change profitability is a candidate. This includes a failed research and development project.

Yes, a failed R&D project.

Innovation takes a hit

The prospect of reporting failure likely makes CEOs a bit weak in the knees. Competitors will sniff the SOX box to find out what their rivals are doing — or not doing, for that matter. This in turn will force public companies to think twice about taking R&D risks. If you like innovation and continuous improvement, this doesn’t bode well.

As a result (directly or indirectly), we’ve seen a flurry of big-time acquisitions. Instead of developing new technologies in-house, companies are more inclined than ever to acquire them from smaller companies. To restate: the prospect of a failed innovative R&D project is forcing large companies to purchase private companies with proven ideas and technologies.

One of many examples

Take Microsoft’s acquisition of Stratature, an MDM vendor, last year. Stratature was recognized as the fastest growing private company in the Southeast in both 2005 and 2006. Microsoft bought them in 2007. Certainly Microsoft could have developed their own MDM solution. Right?

It is my feeling that the purchase had to do in part with Section 409. Microsoft could have started R&D on their own MDM solution. But MDM is complex and evolving. There is no one clear solution. If Microsoft embarked on this path, there would have been a chance they would have failed. Stratature was already a big success. The price was high, but worth it.

Opportunities for the rest of us

It is clear that Section 409 presents an interesting opportunity to small, private companies. If you invent an idea and grow and market it, it is more likely today than ever before that a larger company would seek to acquire you. Larger companies don’t want to take the risk of exposing themselves (and their failed project initiatives) under the “material event” clause of SOX. Besides, larger companies buy up smaller companies anyway: it is good business and often fits their strategic interests. Section 409 merely gives them an additional reason to do so.

Therefore, SOX compliance for all

Now you have a great product, and you have some interest from a larger public company looking to acquire you. But you have no internal control structures in place, no financial audit trail, and your IT department has broad access to all of your data. Because of this, the purchasing company will need to do a lot of work getting your business in shape for public life.

Not only that, but partnering with a public company may force you into compliance as well.

Lastly, your valuation will be higher if you comply with SOX (check out the Aberdeen Group’s “SOX Compliance and Automation: A Benchmark Report”, which can be downloaded from the Compliance Library at ultimate Software). Private companies who comply with SOX — especially sections 302 and 404 — operate better, are trusted, and are more attractive to potential buyers.

Unless you have no plans of being acquired or partnering with a public company, then it seems foolish not to start the process of meeting the requirements of SOX: Especially if you are an innovative company doing one or more progressive research projects.

Tags: , , , ,

3 Comments