Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for November, 2008


Published November 20th, 2008

ETL Subsystem 26: Version Migration

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

migrationDepending on the lifecycle in place at your business, you will likely need to move your data warehousing project from the development environment into a production environment. I am currently using an alpha / beta / gamma system (with mixed emotions), where alpha is our development environment, beta is our integration, and gamma is production. I have also worked with systems that combined alpha and beta and had a special user acceptance environment instead.

Whatever you do, you’ll need to consider migrating from one environment to another.

Some things to consider:

  • Environments: How similar are your environments (both hardware and software)? Are there major differences between the development servers and the production servers? Ideally, all environments should resemble each other (integration, UAT, and production environments should really be identical). You don’t want to develop on a 64-bit Windows Server 2008 and deploy to a 32-bit Windows Server 2003 machine!
  • Security: Have you considered the different security and access requirements that exist between development and production environments? Will you, as the data warehousing team, have access to manage server jobs, file transfers, and file shares?
  • Version Control: Have you set up your version control in such a way that if an installation fails, you can revert back to a previous version? This is tricky, so be careful to set aside enough time for planning. You also need to keep in mind potential compliance issues, as it may be necessary to “pull out” an older version to see how data might have been manipulated differently than the current installation.
  • Resources: Who are the people in charge of the migration? Are they available when you need them?
  • Documentation: Does your organization require installation documents, verification plans, quality assurance sign-offs, user acceptance sign-offs, and back-out plans?
  • Protocol: What type of red tape must you cut through to get your data warehouse online? Who needs to sign-off on the deployment? What forms must be filled out? Is there an official change request procedure? ITIL and other systems can be good in theory but difficult to work with in practice. Take the time to learn what you need to do to get your system online.
  • Packaging: How will you package your software and database scripts? Will you use an installation program or will you have someone manually copy files? Will scripts be run individually? Will you use a deployment tool such as MSBuild?
  • Verification: Have you thought about how you can test the system in a way that flexes all new and updated features, while providing adequate regression to be sure you didn’t break something that was once working? Have you done a proper impact analysis on all new or changed components?
  • Data: Will you be migrating data changes as well? How will you get these data changes from one environment to the next? Has everything been scripted and tested thoroughly?

 

SQL Server 2005 Integration Services (SSIS)

SSIS makes deployment of packages from one environment to another relatively easy. You merely create a single package and supply different configurations for each environment. I am currently using XML configurations, but you might do better with database configurations, environment variables, or using a table in SQL Server. The basic premise is simple: Create variables in your package for things like file locations and connection strings. Using the Package Configurations dialog (SSIS -> Package Configurations…), enable configurations, click Add and then follow the Wizard. Check the SQL Server Development Center for details.

Remember that you can always set a configurable item (e.g. variables, connections, tasks) from the command line or in the SQL Agent Job. For the Agent Jobs, open the Job Step properties and from the General page, select “Set values”. When entering your values to configure, use the following syntax:

\Package\container.property

Example:

\Package.Variable[ServerName].value

For more insight on setting up jobs you can see my post “SSIS Packages and setting up Jobs“.

SSIS comes with a Deployment Utility that will bundle all the files necessary for deployment of the package into a single folder. Copy this folder to the destination server and run the Package Installation Wizard to install your packages. While this is a good idea, and does remove the possibility of human error during the installation process, I do not use the Deployment Utility. Call me old-school, but I prefer to do this bundling and installing myself through the dtexec command line utility — I like to see what is going on and it gives me an opportunity to catch surprises before they become trouble.

Hand Coding with Visual FoxPro (VFP9)

You can also use FoxPro to mimic the same kind of functionality you get in SSIS using package configurations. This is no different than using an ini file, a “setup” table, registry entries, etc. I would warn against using free tables if your deployment team does not have access to FoxPro. You would be better served using XML or some other plain text formatted file.

You can also automate much of your builds by utilizing VFPBuild. Find the related CodePlex page here.

From Here

As a developer, you are responsible for ensuring that the right version of your software makes it through the migration path with little or no hassle. It can be challenging enough dealing with hardware and software compatibility issues. Do your best to know the routine, package appropriately, and plan ahead!

Next post, I’ll talk about Work Flow Monitoring…

 

Published November 10th, 2008

ETL Subsystem 25: Version Control

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

When I sat down at my desk and began putting together my notes on this subsystem a couple of weeks ago, I was reminded how important version control can really be. In fact, as I was looking through a few articles and books on the matter, I realized that I don’t do a great job of version control management. And unfortunately for data-warehoused businesses around the planet, I’m not alone.

What I currently do

I use version control software for all packages, database scripts, and configurations. Although not the greatest of the bunch, MS Visual SourceSafe does the job (we’ll be switching to Serena Dimensions soon, though). From reading various articles and from a few first-hand experiences I can tell you that many teams don’t even do this much. Ouch.

I tag each package, job, and source with a version number and relate these versions to the database and BI application versions. I will also label the source code in VSS to match. For example: version 4.0 of the database aligns with version 3.1 of the data integration systems which supports version 1.7 of the BI Portal. I currently coordinate this using a spreadsheet in Excel, with high hopes to get this into the database in the form of metadata that can be queried and used for planning and deployment purposes.

I should mention that versioning is particularly important in compliance-laden environments. If it is important to show an auditor how data got from point A to point B, then you’ll need to dig out the exact versions of the software that loaded the data at the time. This is why including version numbers in your logging and auditing routines is very important. I wrote earlier this year on this topic in my article “Auditing Your Warehouse for Sarbanes-Oxley Compliance“. Take a look if you want more information and to see how I implemented this using SSIS.

At this time, I could segue into impact analysis; I’ll leave that for another post.

In addition to source control software and versioning, I use an LDM (Logical Data Mapping) that is stored as “business” metadata in the database. This metadata explains all source-to-target mappings and is sensitive to start and end dates. For example, if a vendor adds a column to one of our input files, I add the new column to the LDM and set the begin/end dates appropriately so that the historical mappings are left in tact. In this scenario, I can load an old file or a new file with little effort. When time permits, I will build logic into my SSIS packages so that, based on the date of the file being processed, the appropriate child package will be loaded.

What I don’t do yet

So far so good, but this is where I fall short. I have source control. I use major and minor versions and connect them to all other elements in the data warehouse environment. I even have a time-sensitive, table-driven LDM system. But I am lacking in the “control” area which has me worried. I have a lot of to-do items and half implementations.

What I need is a (fully automated, metadata-driven) system to coordinate different releases on different schedules and allow the packages to work (or not work) under certain database versions and environments. It would be nice to have some indicator in the ETL system that tell me if a certain version of a certain input file will load with a certain version of the database. To complicate matters more, much of what I do is based on certain mathematical models that sometimes change (yep, they have versions too). The question becomes, what version of what package works with models X, Y, Z for what database. Normally (99% of the time), everything is simple: the latest version of the package works for the latest version of each model on the current version of the data warehouse. That other 1% of the time is what makes my job interesting!

Assuming that you already use source control…

SQL Server 2005 Integration Services (SSIS)

Some elements of version control are simple using SSIS. For starters, you are able to set the VersionMajor and VersionMinor properties to your ETL version. When you make a major change to the package — so that it will no longer work with historical data, for example — then you can also update the package ID (see property “ID” under the Identification group in the Properties window). As a general rule, you should update this ID if the package works fundamentally different than it did before.

When I get around to implementing the system for selecting the correct version of the package that matches the input file version, I’ll be sure to write about it. For now, it’s merely a figment of my imagination and needs some fleshing out.

What I have done, though, is create a custom Task which examines the input source file against the LDM. If a column name or type is out of place, the file is rejected. This prevents loading a file that might have an extra column suddenly or if columns were swapped for some reason. We deal with input data that is (mostly) trustworthy, but sometimes dirty. As a result, this extra checking has become necessary. If anyone is interested, I would be happy to share the code and logic of this custom Task in a future post. Just let me know!

The above is not a scenario that exists in every environment. For me, detecting the file version is very important. For you, it might not be a challenging issue.

I have also linked the package’s ID to the LDM and my logging system so that each package — as represented by its GUID — can relate to one or more versions of my mappings. This allows me to modify packages, but keeps the modifications tied to a particular mapping.

Hand Coding with Visual FoxPro (VFP9)

Good developers will appropriately version their software. This should be part of How to Write Software 101, but is surprising to me how many times I’ve seen poor or non-existent versioning schemes. Hand-coding ETL is also a software development project and deserves the same versioning-attention you would give to any other good software.

In addition to versioning and source control, you should also consider versioning your metadata as I discussed above. I have written about and given presentations on this topic in the past with respect to the Logical Data Mappings. Check the following for more details:

The LDM is not the only metadata to add to your version control, but this should give you a real good start!

From Here

So you have a good version control system in place. Now you need to migrate your current versions from your development environment into production. That’s what Subsystem 26 is all about. And lucky enough, that’s next on my list!

Published November 3rd, 2008

ETL Subsystem 24: Recovery and Restart System

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

As eluded to in my previous post (Backup System), bad things happen to good people. There are a ton of reasons to justify having a good backup procedure. A lot can go wrong with the hardware and software that make up your system. From network failures to bad service pack installs, a backup system is a must. But the best backup system in the world only gets you so far if you don’t have a system to recover or restart a failed process.

This is an ETL / Data Integration topic, so I’ll leave out discussion on restore capabilities found in backup tools. Of course, there may be times when restoring entire files, intermediate results, and staged data is necessary, but this is a job for your IT department. I’ll focus instead on how the ETL system should handle exceptions and failures.

At its most basic level, a recovery and restart system must allow you to stop, start, and retry your ETL processes.

man-pulling-hair-out-2.jpgSome might argue (and I have been in the room with these debaters) that it is simpler and more manageable to simply restart the jobs from the beginning and not bother with the added complexities of recovery and restart. While this might be true for simple environments or small jobs, it is certainly not true for larger, more critical projects. Restarting, for example, saves you from having to re-acquire data from your sources. This is a huge benefit, as some sources are only available to you during certain load windows. In some cases — like in my environment — some source files are overwritten each day! Other times, your source data might come from an FTP site. Downloading the same (large) data over and over takes time.

Keep in mind that your recovery and restart systems must be smart enough to clean up partial or damaged work. You’ll need the ability to back out of processes that partially finished, or even to rollback committed changes in order to ensure continued quality and integrity of your data warehouse and staging area. This is not an easy task and requires a lot of planning. Make your life easier by planning well ahead of time and try to link your recovery and restart procedures directly to the times when you write data to disk (”staging”).

SQL Server 2005 Integration Services (SSIS)

To start, use transactions in your packages. Transactions are an often underused capability of SSIS. They should be a part of your package designs so that blocks of tasks can be handled together as logical units. Transactions can be defined on the Task and container level, giving you the flexibility to nest and work with transactions in parallel. To get this working, you will need to set TransactionOption on the parent container to “Required”. This will initiate a new transaction if its parent hasn’t already done so. Child Tasks can participate by also setting TransactionOption to “Required” or “Supported”. A common technique you could employ: Use a Sequence Container to group related Tasks together, setting the Sequence Container’s TransactionOption to “Required” and all of its contained Tasks TransactionOption to “Supported”.

But transactions only get you part of the way there, and are not suited for every situation.

SSIS comes with a powerful and also underused feature called “Checkpoints”. Checkpoints allow you to restart failed packages from the point of failure, instead of rerunning the whole package again. The process is rather simple. SSIS will record information about its current execution to a special log file (this log file is overwritten each time the package is started from the beginning). When the package fails, and is subsequently rerun, the package will read the checkpoint log and attempt to pick up where it left off.

What’s particularly handy about SSIS’s implementation is that you can control checkpointing on Tasks and containers using the FailPackageOnFailure property. Set this property to True for each Task you want to participate. Furthermore, checkpointing is smart enough to recognize and handle transactions.

Hand Coding with Visual FoxPro (VFP9)

VFP supports transactions for both database tables and free tables (the latter works in VFP9 if you use the MAKETRANSACTABLE() function). This is nice, but doesn’t allow you — without some creative coding — to wrap transactions in the same way SSIS does with its containers and tasks.

You also don’t get built-in functionality to resume processes in Visual FoxPro after a failure. You might be able to use the RETRY command in combination with an error handler to fix problems as they are occurring, but for the sake of this subsystem, you need a “retry” mechanism that can span multiple sessions (i.e. restarting a process after the executable has already exited). The consequence is that you will need to design your ETL system in such a way that it can pick up where it left off, or go back to a previous step. You will need to:

  • Create “markers” in the workflow of your code (the most natural place for any marker is right after data has been staged to disk, but you could also create intermediate markers as well)
  • Log successful passes through each marker so the system will know where to pick up where it left off
  • Make each marker smart enough to clean up any partial work left over by a previous failed run

The easiest way to implement the third point above is to use transactions in combination with cursors for all temporary work. If a job fails (the lights go out) then your cursors will be lost, but you should be able to restart from the last good marker. Since the cursors didn’t do anything permanent, there is nothing to clean up. This creates a natural and easy-to-manage recovery and restart system.

The bottom line: You’ll need to be creative when implementing a recovery and restart system using VFP (or any other hand coded solution).

From Here

Next, I’ll talk about version control (Subsystem 25) and version migration (Subsystem 26). Both of these subsystems deal with deployment and have tripped up many teams (I’ve been a part of such teams and have hopefully learned from the mistakes).