This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

When I sat down at my desk and began putting together my notes on this subsystem a couple of weeks ago, I was reminded how important version control can really be. In fact, as I was looking through a few articles and books on the matter, I realized that I don’t do a great job of version control management. And unfortunately for data-warehoused businesses around the planet, I’m not alone.

What I currently do

I use version control software for all packages, database scripts, and configurations. Although not the greatest of the bunch, MS Visual SourceSafe does the job (we’ll be switching to Serena Dimensions soon, though). From reading various articles and from a few first-hand experiences I can tell you that many teams don’t even do this much. Ouch.

I tag each package, job, and source with a version number and relate these versions to the database and BI application versions. I will also label the source code in VSS to match. For example: version 4.0 of the database aligns with version 3.1 of the data integration systems which supports version 1.7 of the BI Portal. I currently coordinate this using a spreadsheet in Excel, with high hopes to get this into the database in the form of metadata that can be queried and used for planning and deployment purposes.

I should mention that versioning is particularly important in compliance-laden environments. If it is important to show an auditor how data got from point A to point B, then you’ll need to dig out the exact versions of the software that loaded the data at the time. This is why including version numbers in your logging and auditing routines is very important. I wrote earlier this year on this topic in my article “Auditing Your Warehouse for Sarbanes-Oxley Compliance“. Take a look if you want more information and to see how I implemented this using SSIS.

At this time, I could segue into impact analysis; I’ll leave that for another post.

In addition to source control software and versioning, I use an LDM (Logical Data Mapping) that is stored as “business” metadata in the database. This metadata explains all source-to-target mappings and is sensitive to start and end dates. For example, if a vendor adds a column to one of our input files, I add the new column to the LDM and set the begin/end dates appropriately so that the historical mappings are left in tact. In this scenario, I can load an old file or a new file with little effort. When time permits, I will build logic into my SSIS packages so that, based on the date of the file being processed, the appropriate child package will be loaded.

What I don’t do yet

So far so good, but this is where I fall short. I have source control. I use major and minor versions and connect them to all other elements in the data warehouse environment. I even have a time-sensitive, table-driven LDM system. But I am lacking in the “control” area which has me worried. I have a lot of to-do items and half implementations.

What I need is a (fully automated, metadata-driven) system to coordinate different releases on different schedules and allow the packages to work (or not work) under certain database versions and environments. It would be nice to have some indicator in the ETL system that tell me if a certain version of a certain input file will load with a certain version of the database. To complicate matters more, much of what I do is based on certain mathematical models that sometimes change (yep, they have versions too). The question becomes, what version of what package works with models X, Y, Z for what database. Normally (99% of the time), everything is simple: the latest version of the package works for the latest version of each model on the current version of the data warehouse. That other 1% of the time is what makes my job interesting!

Assuming that you already use source control…

SQL Server 2005 Integration Services (SSIS)

Some elements of version control are simple using SSIS. For starters, you are able to set the VersionMajor and VersionMinor properties to your ETL version. When you make a major change to the package — so that it will no longer work with historical data, for example — then you can also update the package ID (see property “ID” under the Identification group in the Properties window). As a general rule, you should update this ID if the package works fundamentally different than it did before.

When I get around to implementing the system for selecting the correct version of the package that matches the input file version, I’ll be sure to write about it. For now, it’s merely a figment of my imagination and needs some fleshing out.

What I have done, though, is create a custom Task which examines the input source file against the LDM. If a column name or type is out of place, the file is rejected. This prevents loading a file that might have an extra column suddenly or if columns were swapped for some reason. We deal with input data that is (mostly) trustworthy, but sometimes dirty. As a result, this extra checking has become necessary. If anyone is interested, I would be happy to share the code and logic of this custom Task in a future post. Just let me know!

The above is not a scenario that exists in every environment. For me, detecting the file version is very important. For you, it might not be a challenging issue.

I have also linked the package’s ID to the LDM and my logging system so that each package — as represented by its GUID — can relate to one or more versions of my mappings. This allows me to modify packages, but keeps the modifications tied to a particular mapping.

Hand Coding with Visual FoxPro (VFP9)

Good developers will appropriately version their software. This should be part of How to Write Software 101, but is surprising to me how many times I’ve seen poor or non-existent versioning schemes. Hand-coding ETL is also a software development project and deserves the same versioning-attention you would give to any other good software.

In addition to versioning and source control, you should also consider versioning your metadata as I discussed above. I have written about and given presentations on this topic in the past with respect to the Logical Data Mappings. Check the following for more details:

The LDM is not the only metadata to add to your version control, but this should give you a real good start!

From Here

So you have a good version control system in place. Now you need to migrate your current versions from your development environment into production. That’s what Subsystem 26 is all about. And lucky enough, that’s next on my list!