Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for April, 2008


Published April 15th, 2008

ETL Subsystem 10: Surrogate Key Generator

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

When integrating data into a Dimensional Model, you need a mechanism to assign new primary keys to each dimension. These primary keys will be used in your Fact table as foreign keys. You cannot use natural keys because they are likely to repeat — this is especially true if you are maintaining history using SCD Type 2 (more on this in a bit). Subsystem 10 addresses this important need, by specifying the need to generate surrogate keys for all dimensions.

Surrogates

Integer surrogate keys should always be used as primary keys for a dimension. In addition, each dimension will have at least one candidate key comprised of the natural key and row status (active or inactive dates) if the dimension is a TYPE 2 SCD. Here is what I wrote in a previous post:

Every dimension contains a single-part surrogate primary key field. This primary key is used to relate to one or more fact tables in the schema. Natural keys (like an employee id) are kept in the dimension, but are not used in joins. This technique ensures a unique value for each record, simplifies joins, and keeps indexes small.

So the concept is simple: Every time a new row is inserted into any dimension, it gets a new primary surrogate key. We use integers because they’re compact, perfect for sequencing, and under most relational databases we can control their ranges. How you generating this primary key largely depends on your performance needs and RDBMS capabilities (see the SSIS and VFP subsections below).

Special Cases

You can treat Date dimensions a little differently. Instead of making the first row in your Date Dimension start at 1, use the ISO 8601 date standard YYYYMMDD format. This makes sense for a few reasons. First, it is easier to look at a row in a Fact table and have the date available without a join to the Date dimension or one of its role players. Second, a Date Dimension is treated as a Type 1 SCD, meaning, you would never have the same date repeated with one row being active and the other inactive as in TYPE 2 SCDs (hence throwing off the ISO date format). Third, because Fact tables generally contain several dates, using the integer date key (which is an int in the ISO date format) instead of fetching the formatted date from the Date Dimension may have some performance benefits because you can easily format the date later on in the presentation layer. Lastly, you can usually predict with certainty the total number of rows that the Date Dimension will have, and then pre-populate all the dates before you even do your first integration. This removes the Date Dimension from the surrogate key processing all together.

As I mentioned previously, one row in each dimension should have a primary key of zero, which can be used as a symbolic “null”. Remember that Fact tables reference dimensions through foreign keys. You cannot have a null foreign key, so the only viable alternative is to link to a dummy row in the dimension. In dimensional modeling, the acceptable solution is to create a row in each dimension with a primary key of zero. For example, in a Customer Dimension, the zero key might have a customer name of “Unknown” and a DOB of 0 (which is a foreign key that relates to the Date Dimension’s null row). This is a very cool concept because it becomes incredibly easy to produce a query that could give you all the order transactions that did not involve a known customer. Maybe not the best of examples, but try writing a query to give you results of everything that didn’t happen, or wasn’t there, using a relational model!

You can also use -1, -2, and so on to represent other special conditions. For example, a Promotion Dimension may have a zero-key row that represents “Unknown” and a -1 key row that represents “No Promotion”. Perhaps a -2 key could symbolize a “Canceled Promotion”. And so on…

The following sections give some ideas on how to generate surrogate keys in both SSIS and VFP. I’ll discuss using these keys in greater depth when I discuss Subsystem 14: Surrogate Key Management.

SQL Server 2005 Integration Services (SSIS)

I prefer to let the database generate new, unique surrogate keys in all cases. This is usually the safest approach and doesn’t require any real development around the key-generation process.

For SQL Server, select the best int for the job and make it an IDENTITY. Select among the 1 byte Tinyint (0 to 255), 2 byte Smallint (-32,768 to 32,768), 4 byte Int (-2,147,483,648 to 2,147,483,647), and the 8 byte Bigint (-2^63 to 2^63-1) depending on the expected maximum size of the dimension.

It may not be practical or feasible to make the primary key attribute in your dimensions IDENTITY Ints. Performance may also be an issue when assigning and returning these new keys. There are really two alternatives, my favorite first:

Alternative 1: Use a staging table. Using a lookup/reference table in the staging area is an option that always works. The concept is simple: Create lookup tables that correspond to each dimension in the model. Include the IDENTITY surrogate key column, the natural key(s), and active/inactive dates (for Type 2 dimensions). When processing dimensions, insert new rows in this lookup table just before adding them to the dimension. Use the surrogate key generated from the lookup table to insert into the dimension (you can either return the generated key using SCOPE_IDENTITY() or simply do a lookup/join on the natural key and status). When you need the key for the Fact table, you can do another lookup/join on the staging table. This approach takes pressure off of the dimensional model and creates opportunities to increase performance by optimizing the lookup/reference table. I’ll talk more about the details on this approach when I discuss Subsystem 14.

Alternative 2: Generate your own key. Another way to achieve the same result is to figure out the last key in the dimension and increment it manually. This number will then be used to insert into the dimension and again into the Fact. The problem with this approach is that if you do not process Facts immediately after each dimension, one at a time, then you will need to store and lookup the key in a staging table or in memory. Also, this approach assumes that no other process or thread is inserting data into the dimension at the same time. This, of course, would kill the sequential system and violate the primary key constraint on the table.

To explore Alternative 2 further, have a look at this post from Marco Russo of sqljunkies.com. He explains how to use an Execute SQL Task in SSIS that executes the following:

SELECT COALESCE( MAX( ID_Dimension ), 0 ) AS LastID 
    FROM Dimension

The claim is that his method is the best, most scalable solution. I don’t fully agree, as my experiences dictate otherwise. But I believe that each approach has merit and depends a lot on the environment.

Hand Coding with Visual FoxPro (VFP9)

For VFP, use an int data type with AutoIncrement enabled. As with a SQL Server implementation, I prefer using this approach to generating keys manually. Even still, both alternatives mentioned above are also available in FoxPro:

Alternative 1: Use a staging table. Here, as with SQL Server, use a staging table with AutoIncrement enabled on the primary key. As a consequence, you will not enabled AutoIncrement on the dimension. To return the newly added key from the lookup/reference table, you simply use the GETAUTOINCVALUE() function and return it.

Alternative 2: Generate your own key. You can also determine the last key used in the dimension and manually increment it. Personally, this approach offers very little to a VFP developer. Alternative 1 — or simply using the Dimension for generating the surrogate — is by far the best approach.

I like VFP for these types of tasks. Whatever you decide, you don’t really need to worry about returning the generated key. You typically insert and update all your dimensional data first, and then return to the dimension later to lookup keys just before inserting the facts. Nothing beats an old-fashioned SEEK in this scenario, which can be light years faster than building a select with a where clause depending on how crafty you are with Rushmore and Index building.

Compare this:

SELECT PrimaryKey ;
    FROM Dimension ;
    WHERE NaturalKey = 'SomeValue' ;
        AND StatusFlag = True

to this:

SEEK('SomeCandidateKey','Dimension','CandidateIndex')

If you are using Alternative 1, then instead of SEEK, you could build a JOIN between the lookup table and the data being processed for the fact. This set-based approach can be pretty speedy as well.

If you insist on exploring Alternative 2, the translation from Marco’s TSQL code to VFP is as follows:

SELECT NVL(MAX(ID_Dimension),0) AS LastID ;
    FROM Dimension

In my experience, I simply use VFP9’s Autoincrementing Integer field in each Dimension. I don’t bother with a staging lookup table because with VFP, I can just set the order to the natural key and do a SEEK on the dimension. It’s a fast, simple, and consistent approach.

VFP Really Shines

In this series of articles, I have attempted to highlight some of the differences between implementing the 34 Subsystem best practices in both a hand-coded environment (VFP) and through a tool specifically designed for the job (SSIS). In each case, both VFP and SSIS have really been about equal (this is a non-scientific study, of course!). SSIS has advantages that are hard to duplicate in VFP (some of which I will be highlighting later on in this series), but in this case, VFP really shines. While both databases can handle generating keys for us in similar ways, VFP’s SEEK command allows us to use it!

I don’t want to dwell on this any more. I’ll discuss more details on how to use the surrogate keys when I talk about Subsystem 14: Surrogate Key Management.

Published April 9th, 2008

An SSIS Custom Transformation Success Story

I recently inherited eight SSIS packages that all do essentially the same thing: They extract data from disparate sources, cleanse and conform them, and finally load them into an analytical data warehouse (dimensional model). Welcome to Data Integration 101!

The project calls for another two dozen of these extracts. Unfortunately, the team before me operated with little functional documentation, a limited knowledge of data integration, and did not focus on using tried and true best practices. There was no data profiling, no logical data mapping, no quality controls, no auditing, and very little in the way of logging. Essentially, things are a bit of a mess.

My primary focus over the next 6 months is to clean things up. I’ve already introduced better documentation practices, naming conventions, and this crazy concept of the “package template”. There is no way on earth I am developing 20 similar packages without using a template!

I’ve also started work on a Data Quality Manager Application (for now, it will consist of a table-driven set of SQL statements that can be run via stored procedure on a scheduler) and a Logical Data Map (LDM). The LDM will provide much needed documentation and give the other SSIS developers a little more direction.

I plan on posting much more about the LDM, Data Quality Manager, and package templates in a later post. For now, I want to gloat a little about a custom transformation task component I wrote.

Update: I almost never re-publish a post, but felt I needed to in this case. Darren Green of http://www.sqlis.com, which is an excellent SSIS resource by the way, commented about my use of the term “Task” to refer to Data Flow “components”. He’s right, so I made the change! Sorry for any confusion. In fact, while reading his comment, I had an “Ah-Ha” moment: The way to tell the difference between the two Scripts in SSIS is to call them by their correct names: Script Task (of the Control Flow) and Script Component (of the Data Flow). Duh!

The Problem

When the data comes into SSIS, work needs to be done to relate the rows from the source to those of the target. The mapping is a little different depending on the source file, but in the end, we need to match any number of significant characteristics (aka the file’s candidate key) to a single unique ID Number in our data warehouse database.

For example, one file might contain Social Security number, another might have last name and date of birth, a third might have the last name and the last 4 digits of a credit card number, yet another would have an account number from a billing system. Attached to all these identifiers are some important metrics that need to be added to a fact table in the data warehouse.

To do this, the developers before me used three Data Flow tasks, and over 100 Transformations (no joke). The second (and meatiest) Data Flow task took the extracted data (now in a staging table) and then embarked on a long drive through the various possible match combinations. First, there would be a Conditional Split that checked to see if the row had a valid account number, if not, then the output would hit another Conditional Split that would check on last name and so on. If there was a valid account number, a Lookup component would check the account dimension in the data warehouse for a match. If exactly one match was found (it is possible to have repeating account numbers), it would add the ID to the pipeline and drop the results into another staging table as a “match”. If not found — or there was a duplicate — another Lookup component would be used on some other identifiable column. Repeat. Repeat. Repeat!

The other two data flows did various cleanup work.

To add complexity to this mess, the attributes in each source are different. Sometimes account number has an extra check digit at the end. Other times, dashes are inserted in different positions.

The original data flow

A picture is worth a thousand words. Here is one of the data flows. The quality is bad on purpose: This is business logic, and as messy as it is, I don’t want to give away any company secrets! But, here is a high view so you can taste the pasta.

The developers did an admirable job of wading through all this and managed to implement a pure SSIS solution that works. They did this all without a data profile, LDM, or any real functional documentation. However, maintaining this, or repeating this logic in 20 more packages, makes me want to extract my eyes from my head, transform them into soup, and load them into my mouth.

A Better Approach

So I spent an hour looking at all this and thought to myself how easy and simple this would be using Visual FoxPro. A few SELECT statements, perhaps a CASE here, an IF there, and a REPLACE command for good measure.

Unfortunately, I can’t use VFP for this project. We’re locked into SSIS and there isn’t a single VFP license in my reach. The next best option, of course, is to design a custom component using C#.

In my next post, I’ll explain the nuts and bolts in detail. But to summarize, my Custom Task component (called “ID Lookup”) uses about 100 lines of code in the ProcessInput method, and another 100 lines in PreExecute (give or take). It essentially acts as a Lookup and Conditional Split with some extra logic built in that is specific to our situation. There are properties on the component that allow the developer to specify what input columns should be used in the matching process.

The new data flow using the custom transformation task

The new Data Flow looks a little more manageable.

ID Lookup achieves the exact same results as the three data flows and 100 transformations did. As an added bonus, it performs over 600% better! And I’m sure with a bit more tweaking, I can squeeze out additional performance.

Aesop Time

I suppose that being a senior-level software developer in my previous life helped out a bit. I’m not certain that the team before me even thought about writing some code to solve the problem. I also have the added benefit of seeing eight different packages in a completed state. It is much easier to identify patterns when you have something tangible to look at.

So the moral is, don’t be afraid of writing custom tasks or Data Flow components. This was not my first, and it won’t be my last. Now, when we start to write the new packages, we’ll drop this transformation into the Data Flow, right smack between the appropriate staging tables.

As I said, in my next post, I’ll take a look at some of the code I used to create the component. Keep in mind that I cannot show you our business-logic, but I will show the framework.

Published April 4th, 2008

Formula 409: Private Companies Must Comply with SOX

I’ve been doing a lot of research on Sarbanes-Oxley (SOX) compliance lately in part because I am now working in the financial industry and in part because I am preparing an article on the topic for Advisor Media.

SOX compliance is both complex and vague. There is no official compliance checklist, only various guidelines and advice from agencies, accountants, and vendors. Businesses are left to implement control frameworks, introduce new segregation of powers, add auditing and logging to existing systems, and rely on the advice and expertise of consultants and vendors who promise to deliver various solutions.

And if there is a misstep, the CEO could go to jail.

Section 409

One area I don’t hear a lot of discussion about from the IT world is the implications of Section 409. Not to say that there is no discussion, but that the vast majority of IT articles on SOX compliance focus on Sections 302 and 404. The reality is that Section 409 doesn’t easily translate to any specific IT implementation or control structure.

But it certainly has significant implications for a public company’s IT/R&D department. Here is the text of the Sarbanes-Oxley Act, Section 409:

Section 13 of the Securities Exchange Act of 1934 (15 U.S.C. 78m), as amended by this Act, is amended by adding at the end the following:

“(l) REAL TIME ISSUER DISCLOSURES. - Each issuer reporting under section 13(a) or 15(d) shall disclose to the public on a rapid and current basis such additional information concerning material changes in the financial condition or operations of the issuer, in plain English, which may include trend and qualitative information and graphic presentations, as the Commission determines, by rule, is necessary or useful for the protection of investors and in the public interest.”.

Basically, a public company must disclose material change events that would impact their financial condition or operations. And Big Brother wants pictures!

As an investor, this is great news; for the sake of innovation though, not so much.

Material changes

What is a material change? No clue. Well, I do have some clue, but there is no official definition of a material change in relation to Section 409 compliance. The only requirement seems to be that it is any change that impacts a company’s finances or operations. I suppose outsourcing a project to IBM, laying off a few dozen employees, or significantly cutting supplier costs all apply. Any change in an organization that could change profitability is a candidate. This includes a failed research and development project.

Yes, a failed R&D project.

Innovation takes a hit

The prospect of reporting failure likely makes CEOs a bit weak in the knees. Competitors will sniff the SOX box to find out what their rivals are doing — or not doing, for that matter. This in turn will force public companies to think twice about taking R&D risks. If you like innovation and continuous improvement, this doesn’t bode well.

As a result (directly or indirectly), we’ve seen a flurry of big-time acquisitions. Instead of developing new technologies in-house, companies are more inclined than ever to acquire them from smaller companies. To restate: the prospect of a failed innovative R&D project is forcing large companies to purchase private companies with proven ideas and technologies.

One of many examples

Take Microsoft’s acquisition of Stratature, an MDM vendor, last year. Stratature was recognized as the fastest growing private company in the Southeast in both 2005 and 2006. Microsoft bought them in 2007. Certainly Microsoft could have developed their own MDM solution. Right?

It is my feeling that the purchase had to do in part with Section 409. Microsoft could have started R&D on their own MDM solution. But MDM is complex and evolving. There is no one clear solution. If Microsoft embarked on this path, there would have been a chance they would have failed. Stratature was already a big success. The price was high, but worth it.

Opportunities for the rest of us

It is clear that Section 409 presents an interesting opportunity to small, private companies. If you invent an idea and grow and market it, it is more likely today than ever before that a larger company would seek to acquire you. Larger companies don’t want to take the risk of exposing themselves (and their failed project initiatives) under the “material event” clause of SOX. Besides, larger companies buy up smaller companies anyway: it is good business and often fits their strategic interests. Section 409 merely gives them an additional reason to do so.

Therefore, SOX compliance for all

Now you have a great product, and you have some interest from a larger public company looking to acquire you. But you have no internal control structures in place, no financial audit trail, and your IT department has broad access to all of your data. Because of this, the purchasing company will need to do a lot of work getting your business in shape for public life.

Not only that, but partnering with a public company may force you into compliance as well.

Lastly, your valuation will be higher if you comply with SOX (check out the Aberdeen Group’s “SOX Compliance and Automation: A Benchmark Report”, which can be downloaded from the Compliance Library at ultimate Software). Private companies who comply with SOX — especially sections 302 and 404 — operate better, are trusted, and are more attractive to potential buyers.

Unless you have no plans of being acquired or partnering with a public company, then it seems foolish not to start the process of meeting the requirements of SOX: Especially if you are an innovative company doing one or more progressive research projects.