This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

CPU Load Monitoring (image from www.isymon.com)You should keep a careful eye on the performance of your ETL system. It isn’t enough to simply schedule jobs and watch them run and run and run. You have got to put some effort into improving your processes so that the entire data warehousing operation works to its optimal level.

In order to discover areas of improvement, you will need to monitor as many processes as you can.

That said, I’ll skip a discussion on network, server, and operating system level performance monitoring (which includes monitoring disk access, memory usage, processor time, etc.). I am not an expert in this area and I simply do not know all the best practices. If you have fast, up-to-date servers, and you do a good job of developing ETL processes that mind their business and play nice, then there is likely not much you can uncover at this level anyway. You’ll need an expert. However, if you have some strange problems (like a load process that seems to take longer and longer over time; a package locks up mid-run; if you can’t seem to get anything else to run on the server while you load data), then exploring an application such as Performance Monitor in Windows is a good place to start.

Here are some good resources (which I had bookmarked):

 

What the ETL System Should Do

When you run a job, package, or script, you should log starting and ending times. In addition to logging these timestamps, include other details such as machine name, logged in user, process name, package versioning information, number of rows processed, number of errors, and any other useful bits of information that can be used to discover areas in which you can improve your ETL system. All of this data should be stored in your database as process metadata (more about process metadata here).

Combining this process metadata with other metrics on the server (CPU usage, disk space, memory allocation, etc.) should give you enough information to identify problem areas. The Kimball Group suggests that all this data be stored as process metadata. While I agree in principle, getting all this “extra” data into the database seems like overkill. I prefer to write to logs when I need to, and then examine the logs. This might be what the Kimball Group is actually suggesting, but it isn’t clear to me. Ideally, this logging can be switched on and off so that it does not interfere with production execution.

So what can you do with all this process metadata? You can use it to improve performance! From the Data Warehouse LifeCycle Toolkit, the following list (a bit paraphrased) will give you some things to look for:

  • Poorly indexed queries
  • SQL syntax causing wrong optimizer choice
  • Insufficient RAM
  • Sorting
  • Slow transforms
  • Excessive I/O
  • Unnecessary writes followed by reads
  • Bad aggregate management
  • CDC applied too late
  • Oportunities to run processes in parallel
  • Unnecessary transaction logging
  • Network traffic and file transfer issues

 

Planning for Performance

Now that you can see there are benefits to storing process metadata as part of an overall workflow monitoring capability in the data warehouse, you should also see that doing so is an ongoing effort. Workflow monitoring is not something you do once or twice, and then get busy on something else. I do a lot of planning along side my developing and I am very careful to allot adequate resources to this subsystem. Many managers don’t. If you are managing a group, or being managed by someone, and workflow monitoring is not on the plan, then do the necessary to get it there!

If you would like to know more about how I estimate time for this task, send me a message and I’ll be happy to discuss it with you.

SSIS and VFP

The tool you use makes little difference in this subsystem. All you need is access to the various variables (e.g. machine name, package name) and an ability to write them to the database. ETL Subsystem 6: Auditing is tied very closely to this subsystem. For details on how to implement an auditing subsystem (and hence have the ability to monitor your ETL) check out that post.

From Here

Remember that your job is not finished when your packages run without error. As well, it can be argued that poor performance should be treated like any other bug. Every effort should be made to improve the overall workflow and processing in order to squeeze every last second out of the system. You cannot do this unless you set aside ongoing resources for the life of the data warehouse system.

In my next post, I’ll talk a little about sorting, ETL Subsystem 28. We’re almost at the end of the line.

 

Tags: , , , , ,