
ETL Subsystem 28: Sorting

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Before you can aggregate, summarize, look up, or merge data, you will need it ordered in some way. Sorting, which changes the physical order of the underlying data, is seldom necessary. Instead, indexing, which changes the logical order of the underlying data, is often all you need. The Kimball Group seems to use Sort interchangeably with Order, so I won’t deviate in this post. This may be because some database vendors do so as well.

If you’re loading data from a mainframe, then the utility programs DFSORT or SORT will be of some use. Other utilities exist depending on the vendor, but these are the main ones. The idea is to sort the raw physical data before it hits the ETL pipeline. Doing so will save processing time. Otherwise, you can simply load raw data into your staging area and then use standard SQL to order your data.

Sometimes sorting is implicit, so be careful. In SQL Server, sorting occurs when you use ORDER BY, GROUP BY, SELECT DISTINCT, and UNION. You should, for example, avoid SELECT DISTINCT whenever possible, as the additional sorting can hinder performance. Many programmers and (ahem) trained DBAs automatically add the DISTINCT clause to their queries. Be certain it is actually needed; otherwise performance will take a hit.
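As an illustration (the table and column names are made up), here is the kind of reflexive DISTINCT that adds a needless sort when the result is already guaranteed to be unique:

-- Hypothetical staging table; customer_id is its primary key, so the rows
-- are already unique and DISTINCT only adds a sort/aggregate to the plan.
SELECT DISTINCT customer_id, customer_name
FROM stg_customer;

-- Same result set, no extra sort.
SELECT customer_id, customer_name
FROM stg_customer;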

Beware that sometimes too much sorting can be a problem. Here is a general rule: Only sort when you need to! Sorting is an expensive operation and I’ve seen it overused and abused.

Clusters and Heaps

A clustered index determines the physical order of data in a table. That is, data is inserted into the table in order. Pages are linked for faster sequential access. This is in contrast to a heap table, in which data is added to the end of the table and the data pages are not linked.

Clustered indexes can make a big impact in the staging area. Consider a surrogate key lookup dimension that contains 4 columns: surrogate key, business key, valid_from, and valid_to. A clustered index on business key, valid_from, and valid_to will return the surrogate key upon lookup faster than if your lookup dimension was a heap. This is mainly because “queries that return a range of values using operators such as BETWEEN, >, >=, <, and <=” are well suited for the clustered index structure.

This works well during surrogate key lookups where we typically want the surrogate key of some business key on a particular date. Of course INSERTs and UPDATEs take longer on clustered tables because the physical structure of the table needs to be maintained. So be sure to balance speed with need. In my experience: Maintaining a clustered lookup dimension is almost always well worth the effort, especially when the original dimension is big and wide.
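To make this concrete, here is a minimal T-SQL sketch of such a lookup dimension (all names are illustrative, not from any particular project):

-- Surrogate key lookup dimension in the staging area (illustrative names).
CREATE TABLE stg.customer_key_lookup (
    customer_sk  INT         NOT NULL,  -- surrogate key returned by the lookup
    customer_bk  VARCHAR(20) NOT NULL,  -- business (natural) key
    valid_from   DATETIME    NOT NULL,
    valid_to     DATETIME    NOT NULL
);

-- Cluster on the lookup columns so range predicates on the validity dates
-- can be answered directly from the clustered structure.
CREATE CLUSTERED INDEX cix_customer_key_lookup
    ON stg.customer_key_lookup (customer_bk, valid_from, valid_to);

-- Typical lookup: the surrogate key in effect for a business key on a given date.
SELECT customer_sk
FROM stg.customer_key_lookup
WHERE customer_bk = @customer_bk
  AND @event_date >= valid_from
  AND @event_date <  valid_to;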

Best Practices

Here are some best practices that might help you keep your sorting operations under control:

  • Create one tempdb data file for each physical processor on the machine (more; a T-SQL sketch follows this list)
  • Put the tempdb files on their own drive (more)
  • Make sure your data types are appropriate for the column (storing numbers in a varchar field is not very efficient)
  • Keep the columns you sort on as narrow as possible (here is where good data profiling will help)
  • Integer sorting is much faster than character sorting, so consider using integers instead of character data whenever possible (more)
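For the first two bullets, a minimal sketch of adding a tempdb file is shown below; the file name, size, and drive letter are assumptions you would adjust to your own hardware:

-- Illustrative only: add a second tempdb data file on a dedicated drive.
-- Repeat with additional files until the file count matches your CPU count.
ALTER DATABASE tempdb
ADD FILE (
    NAME = tempdev2,
    FILENAME = N'T:\tempdb\tempdev2.ndf',
    SIZE = 512MB,
    FILEGROWTH = 256MB
);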

 

SQL Server 2005 Integration Services (SSIS)

As I have stated, sorting is often overused. The Sort component that ships with SSIS is good for small data sets that fit entirely into memory. Performance will degrade as soon as there is overflow. One way around the problem:

  1. In the Source Adapter, set the data access mode to “The results of an SQL statement”
  2. Use an ORDER BY clause in the SQL statement (see the example after this list)
  3. Open the “Advanced Editor” of the Source component
  4. On the Input and Output Properties tab, click the Source Output node and set the property IsSorted to True.
  5. Under Output Columns, set the SortKeyPosition for each column involved in the sort (start at 1; use a negative number to denote a descending sort order)
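For step 2, the source statement might look like the hypothetical query below; the ORDER BY columns and directions must line up with the SortKeyPosition values you set in step 5:

-- Hypothetical source query; the sort keys declared in the Advanced Editor
-- must match this ORDER BY exactly (columns and direction).
SELECT customer_bk, customer_sk, valid_from, valid_to
FROM stg.customer_key_lookup
ORDER BY customer_bk ASC, valid_from ASC;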

 

Hand Coding with Visual FoxPro (VFP9)

Physically sorting a table in FoxPro is not necessary. The physical order of the table is irrelevant because of how VFP uses its indexes. You can test this easily by creating a table with a few million rows containing a randomly generated integer value and another column with some other data (perhaps the system time). Using the SORT command (or SELECT INTO with an ORDER BY), create a second table ordered by the integer column. Without creating an index, attempt to retrieve a row from each table based on some integer value. Now create an index on the column in both tables and try the retrieval again. You should notice no performance difference between the two tables.

So while you’ll need to understand indexes and Rushmore optimization in VFP, you won’t need to bother with physical sorts.

From Here

In the next post, I’ll discuss lineage and dependency analysis. If you work in the financial sector (like me) then this topic will be of critical interest to you. If you don’t, I suspect you still care about lineage and dependencies!


ETL Subsystem 26: Version Migration

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Depending on the lifecycle in place at your business, you will likely need to move your data warehousing project from the development environment into a production environment. I am currently using an alpha / beta / gamma system (with mixed emotions), where alpha is our development environment, beta is our integration environment, and gamma is production. I have also worked with systems that combined alpha and beta and had a special user acceptance environment instead.

Whatever you do, you’ll need to consider migrating from one environment to another.

Some things to consider:

  • Environments: How similar are your environments (both hardware and software)? Are there major differences between the development servers and the production servers? Ideally, all environments should resemble each other (integration, UAT, and production environments should really be identical). You don’t want to develop on a 64-bit Windows Server 2008 and deploy to a 32-bit Windows Server 2003 machine!
  • Security: Have you considered the different security and access requirements that exist between development and production environments? Will you, as the data warehousing team, have access to manage server jobs, file transfers, and file shares?
  • Version Control: Have you set up your version control in such a way that if an installation fails, you can revert back to a previous version? This is tricky, so be careful to set aside enough time for planning. You also need to keep in mind potential compliance issues, as it may be necessary to “pull out” an older version to see how data might have been manipulated differently than the current installation.
  • Resources: Who are the people in charge of the migration? Are they available when you need them?
  • Documentation: Does your organization require installation documents, verification plans, quality assurance sign-offs, user acceptance sign-offs, and back-out plans?
  • Protocol: What type of red tape must you cut through to get your data warehouse online? Who needs to sign-off on the deployment? What forms must be filled out? Is there an official change request procedure? ITIL and other systems can be good in theory but difficult to work with in practice. Take the time to learn what you need to do to get your system online.
  • Packaging: How will you package your software and database scripts? Will you use an installation program or will you have someone manually copy files? Will scripts be run individually? Will you use a deployment tool such as MSBuild?
  • Verification: Have you thought about how you can test the system in a way that flexes all new and updated features, while providing adequate regression to be sure you didn’t break something that was once working? Have you done a proper impact analysis on all new or changed components?
  • Data: Will you be migrating data changes as well? How will you get these data changes from one environment to the next? Has everything been scripted and tested thoroughly?

 

SQL Server 2005 Integration Services (SSIS)

SSIS makes deployment of packages from one environment to another relatively easy. You merely create a single package and supply different configurations for each environment. I am currently using XML configurations, but you might do better with database configurations, environment variables, or using a table in SQL Server. The basic premise is simple: Create variables in your package for things like file locations and connection strings. Using the Package Configurations dialog (SSIS -> Package Configurations…), enable configurations, click Add and then follow the Wizard. Check the SQL Server Development Center for details.
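If you go the SQL Server configuration route, the Package Configuration Wizard generates a configuration table for you. For reference, here is a sketch of the default-style table it creates (the table name and column sizes can be changed when you run the wizard):

-- Sketch of the default table used for SQL Server package configurations.
CREATE TABLE dbo.[SSIS Configurations] (
    ConfigurationFilter NVARCHAR(255) NOT NULL, -- groups settings per environment or package
    ConfiguredValue     NVARCHAR(255) NULL,     -- e.g. a server name or connection string
    PackagePath         NVARCHAR(255) NOT NULL, -- e.g. \Package.Variables[User::ServerName].Properties[Value]
    ConfiguredValueType NVARCHAR(20)  NOT NULL  -- e.g. String
);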

Remember that you can always set a configurable item (e.g. variables, connections, tasks) from the command line or in the SQL Agent Job. For the Agent Jobs, open the Job Step properties and from the General page, select “Set values”. When entering your values to configure, use the following syntax:

\Package\container.property

Example:

\Package.Variables[User::ServerName].Value

For more insight on setting up jobs you can see my post “SSIS Packages and setting up Jobs”.

SSIS comes with a Deployment Utility that will bundle all the files necessary for deployment of the package into a single folder. Copy this folder to the destination server and run the Package Installation Wizard to install your packages. While this is a good idea, and does reduce the possibility of human error during the installation process, I do not use the Deployment Utility. Call me old-school, but I prefer to do the bundling and installing myself and run the packages through the dtexec command line utility — I like to see what is going on, and it gives me an opportunity to catch surprises before they become trouble.

Hand Coding with Visual FoxPro (VFP9)

You can also use FoxPro to mimic the kind of functionality you get from SSIS package configurations. This is no different from using an INI file, a “setup” table, registry entries, and so on. I would warn against using free tables if your deployment team does not have access to FoxPro; you would be better served by XML or some other plain-text format.

You can also automate much of your builds by utilizing VFPBuild. Find the related CodePlex page here.

From Here

As a developer, you are responsible for ensuring that the right version of your software makes it through the migration path with little or no hassle. It can be challenging enough dealing with hardware and software compatibility issues. Do your best to know the routine, package appropriately, and plan ahead!

Next post, I’ll talk about Workflow Monitoring…

 


ETL Subsystem 24: Recovery and Restart System

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

As alluded to in my previous post (Backup System), bad things happen to good people. There are a ton of reasons to justify having a good backup procedure. A lot can go wrong with the hardware and software that make up your system. From network failures to bad service pack installs, a backup system is a must. But the best backup system in the world only gets you so far if you don’t have a system to recover or restart a failed process.

This is an ETL / Data Integration topic, so I’ll leave out discussion on restore capabilities found in backup tools. Of course, there may be times when restoring entire files, intermediate results, and staged data is necessary, but this is a job for your IT department. I’ll focus instead on how the ETL system should handle exceptions and failures.

At its most basic level, a recovery and restart system must allow you to stop, start, and retry your ETL processes.

Some might argue (and I have been in the room with these debaters) that it is simpler and more manageable to restart the jobs from the beginning and not bother with the added complexities of recovery and restart. While this might be true for simple environments or small jobs, it is certainly not true for larger, more critical projects. Restarting from the point of failure, for example, saves you from having to re-acquire data from your sources. This is a huge benefit, as some sources are only available to you during certain load windows. In some cases — like in my environment — some source files are overwritten each day! Other times, your source data might come from an FTP site. Downloading the same (large) data over and over takes time.

Keep in mind that your recovery and restart systems must be smart enough to clean up partial or damaged work. You’ll need the ability to back out of processes that partially finished, or even to rollback committed changes in order to ensure continued quality and integrity of your data warehouse and staging area. This is not an easy task and requires a lot of planning. Make your life easier by planning well ahead of time and try to link your recovery and restart procedures directly to the times when you write data to disk (“staging”).

SQL Server 2005 Integration Services (SSIS)

To start, use transactions in your packages. Transactions are an often underused capability of SSIS. They should be a part of your package designs so that blocks of tasks can be handled together as logical units. Transactions can be defined at the Task and container level, giving you the flexibility to nest transactions and work with them in parallel. To get this working, set TransactionOption on the parent container to “Required”; this initiates a new transaction if the container’s parent hasn’t already started one. Child Tasks can participate by setting their own TransactionOption to “Required” or “Supported”. A common technique you could employ: use a Sequence Container to group related Tasks together, set the Sequence Container’s TransactionOption to “Required”, and set the TransactionOption of all its contained Tasks to “Supported”.

But transactions only get you part of the way there, and are not suited for every situation.

SSIS comes with a powerful and also underused feature called “Checkpoints”. Checkpoints allow you to restart failed packages from the point of failure, instead of rerunning the whole package. The process is rather simple. SSIS records information about its current execution to a special checkpoint file, controlled through the package’s SaveCheckpoints, CheckpointFileName, and CheckpointUsage properties (this file is overwritten each time the package is started from the beginning). When the package fails and is subsequently rerun, the package reads the checkpoint file and attempts to pick up where it left off.

What’s particularly handy about SSIS’s implementation is that you can control checkpointing on Tasks and containers using the FailPackageOnFailure property. Set this property to True for each Task you want to participate. Furthermore, checkpointing is smart enough to recognize and handle transactions.

Hand Coding with Visual FoxPro (VFP9)

VFP supports transactions for both database tables and free tables (the latter works in VFP9 if you use the MAKETRANSACTABLE() function). This is nice, but doesn’t allow you — without some creative coding — to wrap transactions in the same way SSIS does with its containers and tasks.

You also don’t get built-in functionality to resume processes in Visual FoxPro after a failure. You might be able to use the RETRY command in combination with an error handler to fix problems as they are occurring, but for the sake of this subsystem, you need a “retry” mechanism that can span multiple sessions (i.e. restarting a process after the executable has already exited). The consequence is that you will need to design your ETL system in such a way that it can pick up where it left off, or go back to a previous step. You will need to:

  • Create “markers” in the workflow of your code (the most natural place for any marker is right after data has been staged to disk, but you could also create intermediate markers as well)
  • Log successful passes through each marker so the system knows where to pick up after a failure
  • Make each marker smart enough to clean up any partial work left over by a previous failed run

The easiest way to implement the third point above is to use transactions in combination with cursors for all temporary work. If a job fails (the lights go out) then your cursors will be lost, but you should be able to restart from the last good marker. Since the cursors didn’t do anything permanent, there is nothing to clean up. This creates a natural and easy-to-manage recovery and restart system.
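If your staging area happens to live in SQL Server, the marker log itself can be a trivial table; the sketch below uses made-up names (if you stage in DBFs, the same structure works as a VFP table):

-- Hypothetical marker log: one row per successful pass through a marker.
CREATE TABLE stg.etl_marker_log (
    job_name        VARCHAR(100) NOT NULL,
    batch_id        INT          NOT NULL, -- identifies the run being restarted
    marker_sequence INT          NOT NULL, -- position of the marker in the workflow
    marker_name     VARCHAR(100) NOT NULL, -- e.g. 'customers_staged'
    completed_at    DATETIME     NOT NULL DEFAULT GETDATE(),
    CONSTRAINT pk_etl_marker_log PRIMARY KEY (job_name, batch_id, marker_sequence)
);

-- On restart, resume with the step after the last marker this batch completed.
SELECT MAX(marker_sequence)
FROM stg.etl_marker_log
WHERE job_name = @job_name
  AND batch_id = @batch_id;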

The bottom line: You’ll need to be creative when implementing a recovery and restart system using VFP (or any other hand coded solution).

From Here

Next, I’ll talk about version control (Subsystem 25) and version migration (Subsystem 26). Both of these subsystems deal with deployment and have tripped up many teams (I’ve been a part of such teams and have hopefully learned from the mistakes).


ETL Subsystem 23: Backup System

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

You’ve just spent the better part of the year designing a perfect ETL system – from data profiling to data propagation. You’re in production, and all jobs are running on schedule in what seems to be an eerie harmony. You lean back in your chair. Victory, you think.

Then comes the call: Your database server crashed. Now what?

You must have a reliable backup system in place — and not just overnight tape backups conducted by your IT department. Those are a good first (baby) step, but they are usually not good enough, especially to help you recover from failures in mid-ETL stride. Even with a fully redundant environment, you have to be sure that you can re-run jobs from your staging area no matter where they failed. This is an ETL responsibility not to be passed off to IT. This means that an accessible backup of all staging data is essential. Remember, you should stage data at each major step in the ETL process (after load, after cleansing and conforming, and just before delivery). Doing so allows you to return to these staging points in order to recover from failures when they occur during ETL processing. I’ll discuss this more in my next post “ETL Subsystem 24: Recovery and Restart System”.

If you use a staging area, then you’re already part of the way there. The only thing left to manage at this point is planning and implementing your backup strategy. Backing up staged data as it is being staged can be tricky. You can write to two different locations: on different machines, or on different drives of the same machine. Consider this the poor-man’s approach to staging area backups. I’ve done this often, as it is easy and inexpensive (disk space is cheap and getting cheaper). More sophisticated mechanisms exist (for example, check out IBM Tivoli). Your policies, existing services, compliance concerns, and your ability to recreate lost data will all play a part in determining what level of backups you need to make.
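For the poor-man’s approach on SQL Server, a plain backup written to two locations is often enough. Here is a rough sketch; the database name and paths are made up, and MIRROR TO requires Enterprise Edition, so a normal backup followed by a file copy achieves the same thing elsewhere:

-- Hypothetical staging database; write the same backup set to two locations.
BACKUP DATABASE StagingDB
TO DISK = N'E:\Backup\StagingDB_after_cleansing.bak'
MIRROR TO DISK = N'\\backupserver\etl\StagingDB_after_cleansing.bak'
WITH FORMAT, INIT,
     NAME = N'Staging backup taken after the cleansing step';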

Storage of these backups, including the question of how long to keep the archives, is something you’ll need to consider as well. Critical data that cannot be reproduced should be stored for a very long time in a safe location. Easily reproduced data might only need to be kept long enough to appease the auditors. Staged data might be safe to keep only until the next run.

From Here

I won’t frame this subsystem using SSIS or Visual FoxPro, as I’ve done with the other subsystems to date. Backing up your data and files — unless you use the poor-man’s approach — is not a function that you can do using the ETL tooling. And doing a manual copy of staging data might be as simple as moving files from one directory to another, or using database replication to replicate changes from your main database to a linked, backup database on another machine.

You have a lot of decisions to make when it comes to your backup system. Next post, I’ll discuss recovery and restart, ETL Subsystem 24.


ETL Subsystem 22: Scheduler

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

This subsystem addresses one of the more basic management-related functions of ETL: Scheduling. You must provide a way to reliably run your packages and scripts. Data moves into and out of the Data Warehouse continuously; scheduling provides a structured and predictable way of managing the process.

I use the term “predictable” because it is often necessary to provide business users and other applications with data at specific intervals throughout the day and week. In my current project, for example, we provide a reporting data mart to a specific analytical application every morning at 9am. This reporting mart must have data up until the close of the previous day (and is not refreshed throughout the current day). If the timing wasn’t right (if, for example, some of yesterday’s data came into the database at 9:30), then the reporting database would not be ready for the business users. All of these timing issues are solved with a scheduler (and of course some agreements with the data providers to ensure data is available to process by the time the scheduled task runs!).

Scheduling is also used to organize and arrange processes so that dependencies and relationships among disparate data sources are processed logically. If data A must be loaded before data B and C, then data A will be scheduled first (worth mentioning again: a prerequisite for this is that the data is available from the source at the right time as well!). If one load fails, a graceful exit from the process must be made with adequate notifications and alerts. Restarting at failure points (without having to rerun the entire process) is also a highly desirable feature of any scheduler.

I’ll get into these topics more in a few of my next subsystem posts.

SQL Server 2005 Integration Services (SSIS)

Scheduling SSIS packages is easy. The best option you have is also the most intuitive: The SQL Server Agent.

The SQL Server Agent is a Microsoft Windows service that executes scheduled tasks. These tasks are referred to as Jobs. A single job can consist of multiple steps and be run on multiple schedules. Schedules can be configured to respond to a specific event, run at set intervals or specific times, or run on demand. In a single job, for example, you can run a few SSIS packages, some database stored procedures, and a few command line utilities. These can all be scheduled at various times throughout the day (perhaps each hour, or once in the morning and again at lunch), and you can get an email telling you when it finished.

The scheduling aspect of SQL Agent is quite powerful. Not only can you create a seemingly endless combination of schedules, you also have full control over how to handle each step if it fails or succeeds. All of these steps can be linked together in such a way that if any one job step fails, processing can halt or move on to another step (of your choice). If you think about it, you have built-in If/Else/EndIf and CASE logic available right in the scheduler. I find that developers who come from a programming background can make quick sense of this and put together some rather elaborate and logical flows, with dependencies, prerequisites, and alternative paths built in.
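Everything above can be clicked together in Management Studio, but the same job can also be scripted. Below is a minimal sketch (the package paths, names, and schedule are invented) of a two-step job that stops if the dimension load fails and otherwise runs the fact load, every day at 06:00:

USE msdb;
GO

EXEC dbo.sp_add_job @job_name = N'Nightly DW Load';

-- Step 1: load dimensions; on failure, quit the job so facts are not loaded.
EXEC dbo.sp_add_jobstep
    @job_name = N'Nightly DW Load',
    @step_name = N'Load dimensions',
    @subsystem = N'SSIS',
    @command = N'/FILE "D:\ETL\LoadDimensions.dtsx"',
    @on_success_action = 3,  -- go to the next step
    @on_fail_action = 2;     -- quit the job reporting failure

-- Step 2: load facts (the job quits with success when this step finishes).
EXEC dbo.sp_add_jobstep
    @job_name = N'Nightly DW Load',
    @step_name = N'Load facts',
    @subsystem = N'SSIS',
    @command = N'/FILE "D:\ETL\LoadFacts.dtsx"';

-- Run the job every day at 06:00.
EXEC dbo.sp_add_jobschedule
    @job_name = N'Nightly DW Load',
    @name = N'Daily 0600',
    @freq_type = 4,       -- daily
    @freq_interval = 1,   -- every day
    @active_start_time = 060000;

-- Attach the job to the local server so the Agent will actually run it.
EXEC dbo.sp_add_jobserver @job_name = N'Nightly DW Load';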

Knowing this ahead of time can save you some effort in your ETL packages. So here’s a tip: Use the scheduler to manage package dependencies where you can, rather than trying to build these complexities into your Control Flows. Of course, every package should start out by examining the environment, to be sure that it has everything it needs to start processing. This might be the existence of a CSV file in some directory, a flag in a staging table that indicates a previous process has finished, or any other check-able condition. If you can get into the habit of using the scheduler for some dependency management, and you are familiar with using parent/child packages, then you really have full control over your ETL processes.
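Such a check can be as simple as a single query run by an Execute SQL Task (or by a first job step) before any real work starts; the table and values below are invented for illustration:

-- Hypothetical readiness check: abort this run if the upstream load
-- has not yet flagged the data as ready.
IF NOT EXISTS (
    SELECT 1
    FROM stg.load_status
    WHERE load_name = N'daily_customers'
      AND status = N'READY'
)
    RAISERROR('Upstream load not ready; aborting this run.', 16, 1);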

One drawback to using the scheduler for managing dependencies is that your packages are now dependent on the scheduler! If, for example, you need to debug a package, you may be forced to run other packages first, or you may need to manually create the environment ahead of time (set flags, move files, run other scripts, etc.). This is in contrast to having a single super master package call dozens of other master packages and SQL Tasks in a more visual way. The scheduler in this environment would simply schedule the super master package and that’s all.

Hand Coding with Visual FoxPro (VFP9)

Finding solutions for management-related and process-related development problems in VFP is easy if you utilize the wealth of information and tools available from the FoxPro community. The community has always been one of FoxPro’s greatest assets. And this subsystem, more than any other discussed so far, highlights this.

Scheduling is not something you can do out-of-the-box with VFP. One very good option is to simply use the Windows Task Scheduler to run your compiled executables or application files. If you’ve been following Doug Hennig’s blog, then you might have seen his excellent Task Scheduler class. He writes:

This provides classes that can be used to create tasks in the Windows Task Scheduler. There are separate classes for Windows Vista, which uses Task Scheduler 2.0 and earlier Windows versions, which use Task Scheduler 1.0. Updated 2007/04/26.

TaskScheduler.zip is available for download on Stonefield’s Whitepaper and SourceCode page.

If you’re using SQL Server for your data warehouse database, then you could (and rightfully should) use the SQL Agent to run your VFP ETL applications from the command line (see my previous section on SSIS). In fact, in one system I worked with, I used a combination of SSIS packages with VFP compiled applications to handle a fairly complex VFP/SQL Server data integration environment. I used VFP much like SQL Server would rely on stored procedures and the CLR to conduct some complex row and set-based analytics. The SQL Agent — as with any scheduler — is a great way to bind different technologies together.

An option you should not explore is using the VFP Timer control in a running VFP application to schedule tasks. There are better options out there. If you disagree, feel free to start a discussion in the comments section of this post. I’ve attempted to use the Timer a few times in the past for these sorts of things, only to scrap it and go another way. One area worth exploring, however, is using a Timer in an application designed for the SysTray. Microsoft provided a SysTray class as part of their Solutions Samples when VFP8 was released. But even so, the scheduling capabilities of both SQL Agent and the Windows Task Scheduler are very good and built into both Windows and SQL Server environments.

From Here

I’ll continue discussing the management-related ETL Subsystems in my next series of posts. The next one is the Backup System, ETL Subsystem 23. Feel free to pass along any questions in the meantime!
