Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for October, 2008


Published October 28th, 2008

ETL Subsystem 23: Backup System

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

You’ve just spent the better part of the year designing a perfect ETL system - from data profiling to data propagation. You’re in production, and all jobs are running on schedule in what seems to be an eerie harmony. You lean back in your chair. Victory, you think.

Then comes the call: Your database server crashed. Now what?

You must have a reliable backup system in place — and not just overnight tape backups conducted by your IT department. Those are a good first (baby) step, but they are usually not good enough, especially to help you recover from failures in mid ETL stride. Even with a fully redundant environment, you have got to be sure that you can re-run jobs from your staging area no matter where they failed. This is an ETL responsibility not to be passed off to IT. This means that an accessible backup of all staging data is essential. Remember, you should stage data at each major step in the ETL process (after load, after cleansing, conforming, and just before delivery). Doing so allows you to return to these staging points in order to recover from failures when they occur during ETL processing. I’ll discuss this more in my next post “ETL Subsystem 24: Recovery and Restart System”.

If you use a staging area, then you’re already part of the way there. The only thing you need to manage at this point is to plan and implement your backup strategy. Backing up staged data as it is being staged can be tricky. You can write to two different locations, on different machines or on different drives of the same machine. Consider this the poor-man’s approach to staging area backups. I’ve done this often; as it is easy and inexpensive (disk space is cheap and getting cheaper). More sophisticated mechanisms exist (for example, check out IBM Tivoli). Your policies, existing services, compliance concerns, and your ability to recreate lost data will all play a factor in determining what level of backups you need to make.

Storage of these backups, including the question on how long to keep the archives, is something you’ll need to consider as well. For critical data that cannot be reproduced, you will want to store it for a very long time in a safe location. For easily reproduced data, you might want to keep it long enough to appease the auditors. For the staging area, it might be safe to keep it only until the next run.

From Here

I won’t frame this subsystem using SSIS or Visual FoxPro, as I’ve done with the other subsystems to date. Backing up your data and files — unless you use the poor-man’s approach — is not a function that you can do using the ETL tooling. And doing a manual copy of staging data might be as simple as moving files from one directory to another, or using database replication to replicate changes from your main database to a linked, backup database on another machine.

You have a lot of decisions to make when it comes to your backup system. Next post, I’ll discuss recovery and restart, ETL Subsystem 24.

Published October 20th, 2008

ETL Subsystem 22: Scheduler

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

ScheduleThis subsystem addresses one of the more basic management-related functions of ETL: Scheduling. You must provide a way to reliably run your packages and scripts. Data moves into and out of the Data Warehouse continuously; scheduling provides a structured and predictable way of managing the process.

I use the term “predictable” because it is often necessary to provide business users and other applications with data at specific intervals throughout the day and week. In my current project, for example, we provide a reporting data mart to a specific analytical application every morning at 9am. This reporting mart must have data up until the close of the previous day (and is not refreshed throughout the current day). If the timing wasn’t right (if, for example, some of yesterday’s data came into the database at 9:30), then the reporting database would not be ready for the business users. All of these timing issues are solved with a scheduler (and of course some agreements with the data providers to ensure data is available to process by the time the scheduled task runs!).

Scheduling is also used to organize and arrange processes so that dependencies and relationships among disparate data sources are processed logically. If data A must be loaded before data B and C, then data A will be scheduled first (worth mentioning again: a prerequisite for this is that the data is available from the source at the right time as well!). If one load fails, a graceful exit from the process must be made with adequate notifications and alerts. Restarting at failure points (without having to rerun the entire process) is also a highly desireable feature of any scheduler.

I’ll get into these topics more in a few of my next subsystem posts.

SQL Server 2005 Integration Services (SSIS)

Scheduling SSIS packages is easy. The best option you have is also the most intuitive: The SQL Server Agent.

SQL Agent Job SchedulerThe SQL Server Agent is a Microsoft Windows service that executes scheduled tasks. These tasks are referred to as Jobs. A single job can consist of multiple steps and be run on multiple schedules. Schedules can be configured to respond to a specific event, at intervals, specific times, or on demand. In a single job, for example, you can run a few SSIS packages, some database stored procedures, and a few command line utilities. These can all be scheduled at various times throughout the day (perhaps each hour, or once in the morning and again at lunch), and you can get an email telling you when it finished.

The scheduling aspect of SQL Agent is quite powerful. Not only can you create a seemingly endless combination of schedules, you also have full control over how to handle each step if it fails or succeeds. All of these steps can be linked together in such a way that if any one job step fails, processing can halt or move on to another step (of your choice). If you think about it, you have built-in If/Else/EndIf and CASE logic available right in the scheduler. I find that developers who come from a programming background can make quick sense of this put together some rather elaborate and logical flows, with dependancies, prerequisites, and alternative paths built in.

Knowing this ahead of time can save you some effort in your ETL packages. So here’s a tip: Use the scheduler to manage package dependencies where you can, rather than trying to build these complexities into your Control Flows. Of course, every package should start out by examining the environment, to be sure that it has everything it needs to start processing. This might be the existence of a CSV file in some directory, a flag in a staging table that indicates a previous process has finished, or any other check-able condition. If you can get into the habit of using the scheduler for some dependency management, and you are familiar with using parent/child packages, then you really have full control over your ETL processes.

One drawback to using the scheduler for managing dependencies is that your packages are now dependent on the scheduler! If, for example, you need to debug a package, you may be forced to run other packages first, or you may need to manually create the environment ahead of time (set flags, move files, run other scripts, etc.). This is in contrast to having a single super master package call dozens of other master packages and SQL Tasks in a more visual way. The scheduler in this environment would simply schedule the super master package and that’s all.

Hand Coding with Visual FoxPro (VFP9)

Finding solutions for management-related and process-related development problems in VFP is easy if you utilize the wealth of information and tools available from the FoxPro community. The community has always been one of FoxPro’s greatest assets. And this subsystem, more than any other discussed so far, highlights this.

Scheduling is not something you can do out-of-the-box with VFP. One very good option is to simply use the Windows Task Scheduler to run your compiled executables or application files. If you’ve been following Doug Hennig’s blog, then you might have seen his excellent Task Scheduler class. He writes:

This provides classes that can be used to create tasks in the Windows Task Scheduler. There are separate classes for Windows Vista, which uses Task Scheduler 2.0 and earlier Windows versions, which use Task Scheduler 1.0. Updated 2007/04/26.

TaskScheduler.zip is available for download on Stonefield’s Whitepaper and SourceCode page.

If you’re using SQL Server for your data warehouse database, then you could (and rightfully should) use the SQL Agent to run your VFP ETL applications from the command line (see my previous section on SSIS). In fact, in one system I worked with, I used a combination of SSIS packages with VFP compiled applications to handle a fairly complex VFP/SQL Server data integration environment. I used VFP much like SQL Server would rely on stored procedures and the CLR to conduct some complex row and set-based analytics. The SQL Agent — as with any scheduler — is a great way to bind different technologies together.

An option you should not explore is using the VFP Timer control in a running VFP application to schedule tasks. There are better options out there. If you disagree, feel free to start a discussion in the comments section of this post. I’ve attempted to use the Timer a few times in the past for these sorts of things, only to scrap it and go another way. One area worth exploring, however, is using a Timer in an application designed for the SysTray. Microsoft provided a SysTray class as part of their Solutions Samples when VFP8 was released. But even still, the scheduling capabilities of both SQL Agent and the Window’s Task Scheduler are very good and built in to both Windows and SQL Server environments.

From Here

I’ll continue discussing the management-related ETL Subsystems in my next series of posts. The next one is the Backup System, ETL Subsystem 23. Feel free to pass along any questions in the meantime!

Published October 5th, 2008

Busy Stretch

Man I’ve been busy! Not only has my posting been limited in the past 3 weeks, but almost all of my extra-curricular activities have been put on hold as well (it feels like months ago I last played my guitar). There are a few things going on currently. Topmost on the list is the rollout of the first real significant upgrade to our DW/BI Project here at ING. When I mull over the “feature” list, it amazes me how our team of a few developers and analysts (all quant’s), and 2 contractors (an SSIS developer and a QA Manager) got it done. What’s more, it looks like we’ll be right on schedule with the release with a few days to spare. Of course, I hope now I haven’t jinxed us!

I joined this project in mid-stride back in January when I was hired as an external consultant. I was coming over fresh from the United States, to a new and exciting land, Belgium (where the waffles are hot and the chocolate is worth dying for) and a new industry, financials. After a year and a half in healthcare at Cardinal Health, I felt I was ready to enter the financial world — particularly investment management. And so I joined the quant team in the quantitative research department at ING Investment Management.

I was hired as a BI/DW consultant (functioning primarily as an Integration Architect) and was immediately charged with supporting the in-place data integration (ETL) processes for the then un-released DW/BI system. I began my work by documenting (some functional, bust mostly technical) the entire system. In a few weeks I had uncovered many issues and offered a long list of improvements to the team. Many items on my list were directly related to many of the ETL Subsystems that I’ve been discussing. Taking these to heart, we began the long process of refactoring, renaming, and redeveloping some of the processes that were in place. The focus was on future maintainability and performance. I’ve chronicled some of these in a few of my recent posts.

In addition to this, I slowly took over the project management duties and have now been managing the development activities for the entire team. By mid-summer, and due in part to several converging circumstances, I was hired as a full time member of the quant team. Now, my stake in this project’s success is even greater!

I expect to be just as busy over the next two weeks as the final touches are added to this release of the project. I’ve also started to compile a list of “lessons learned”, most of which will be relevant-enough to share with you all. I think that some of these might surprise you!

This week I’ll continue my posts on the ETL Subsystems. I’ll also likely start working on another article for Advisor (topic yet unknown). Lastly, planning for the next release of our DW/BI system is already underway, with development starting as early as November!

PS: If you ever find yourself feeling a bit overwhelmed or too engaged in your work, I highly recommend re-reading Walden (ahh, come on — everyone’s read Walden, right?!?). It has always managed to level me, reminding me how important it is to just sit back and relax once in a while.