This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

As eluded to in my previous post (Backup System), bad things happen to good people. There are a ton of reasons to justify having a good backup procedure. A lot can go wrong with the hardware and software that make up your system. From network failures to bad service pack installs, a backup system is a must. But the best backup system in the world only gets you so far if you don’t have a system to recover or restart a failed process.

This is an ETL / Data Integration topic, so I’ll leave out discussion on restore capabilities found in backup tools. Of course, there may be times when restoring entire files, intermediate results, and staged data is necessary, but this is a job for your IT department. I’ll focus instead on how the ETL system should handle exceptions and failures.

At its most basic level, a recovery and restart system must allow you to stop, start, and retry your ETL processes.

man-pulling-hair-out-2.jpgSome might argue (and I have been in the room with these debaters) that it is simpler and more manageable to simply restart the jobs from the beginning and not bother with the added complexities of recovery and restart. While this might be true for simple environments or small jobs, it is certainly not true for larger, more critical projects. Restarting, for example, saves you from having to re-acquire data from your sources. This is a huge benefit, as some sources are only available to you during certain load windows. In some cases — like in my environment — some source files are overwritten each day! Other times, your source data might come from an FTP site. Downloading the same (large) data over and over takes time.

Keep in mind that your recovery and restart systems must be smart enough to clean up partial or damaged work. You’ll need the ability to back out of processes that partially finished, or even to rollback committed changes in order to ensure continued quality and integrity of your data warehouse and staging area. This is not an easy task and requires a lot of planning. Make your life easier by planning well ahead of time and try to link your recovery and restart procedures directly to the times when you write data to disk (”staging”).

SQL Server 2005 Integration Services (SSIS)

To start, use transactions in your packages. Transactions are an often underused capability of SSIS. They should be a part of your package designs so that blocks of tasks can be handled together as logical units. Transactions can be defined on the Task and container level, giving you the flexibility to nest and work with transactions in parallel. To get this working, you will need to set TransactionOption on the parent container to “Required”. This will initiate a new transaction if its parent hasn’t already done so. Child Tasks can participate by also setting TransactionOption to “Required” or “Supported”. A common technique you could employ: Use a Sequence Container to group related Tasks together, setting the Sequence Container’s TransactionOption to “Required” and all of its contained Tasks TransactionOption to “Supported”.

But transactions only get you part of the way there, and are not suited for every situation.

SSIS comes with a powerful and also underused feature called “Checkpoints”. Checkpoints allow you to restart failed packages from the point of failure, instead of rerunning the whole package again. The process is rather simple. SSIS will record information about its current execution to a special log file (this log file is overwritten each time the package is started from the beginning). When the package fails, and is subsequently rerun, the package will read the checkpoint log and attempt to pick up where it left off.

What’s particularly handy about SSIS’s implementation is that you can control checkpointing on Tasks and containers using the FailPackageOnFailure property. Set this property to True for each Task you want to participate. Furthermore, checkpointing is smart enough to recognize and handle transactions.

Hand Coding with Visual FoxPro (VFP9)

VFP supports transactions for both database tables and free tables (the latter works in VFP9 if you use the MAKETRANSACTABLE() function). This is nice, but doesn’t allow you — without some creative coding — to wrap transactions in the same way SSIS does with its containers and tasks.

You also don’t get built-in functionality to resume processes in Visual FoxPro after a failure. You might be able to use the RETRY command in combination with an error handler to fix problems as they are occurring, but for the sake of this subsystem, you need a “retry” mechanism that can span multiple sessions (i.e. restarting a process after the executable has already exited). The consequence is that you will need to design your ETL system in such a way that it can pick up where it left off, or go back to a previous step. You will need to:

  • Create “markers” in the workflow of your code (the most natural place for any marker is right after data has been staged to disk, but you could also create intermediate markers as well)
  • Log successful passes through each marker so the system will know where to pick up where it left off
  • Make each marker smart enough to clean up any partial work left over by a previous failed run

The easiest way to implement the third point above is to use transactions in combination with cursors for all temporary work. If a job fails (the lights go out) then your cursors will be lost, but you should be able to restart from the last good marker. Since the cursors didn’t do anything permanent, there is nothing to clean up. This creates a natural and easy-to-manage recovery and restart system.

The bottom line: You’ll need to be creative when implementing a recovery and restart system using VFP (or any other hand coded solution).

From Here

Next, I’ll talk about version control (Subsystem 25) and version migration (Subsystem 26). Both of these subsystems deal with deployment and have tripped up many teams (I’ve been a part of such teams and have hopefully learned from the mistakes).