Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for February, 2008


Published February 8th, 2008

ETL Subsystem 5: Error Event Management

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

The Error Event Management Subsystem is designed to report and respond to all integration error events and exceptions. The event logs are then made available to the integration team so that improvements in the design can be made. As with the last subsystem I discussed (data cleansing), this one is also driven by Data Quality improvement. To be clear, error event management is not about catching programming errors (although it will do this too), but rather to catch data quality and ETL design problems.

A well-designed error event management system will have logic to handle a variety of events, expected and unexpected. These events can be anything from lost database connections to catching field-level data truncation. A more advanced system could provide you with real-time monitoring of all error and warning events that occur in the ETL pipeline.

Report and Respond

There are two key responsibilities of this subsystem: report and respond.

Error reporting can be as simple as logging error events to a table or flat file. It can also be quite elaborate — perhaps emailing error events to a system administrator as they happen, or by providing a web interface that operators could monitor as part of their daily tasks. When you have dozens or hundreds of ETL jobs running on various schedules, reporting errors become more and more important.

Responding to error conditions is a bit trickier than simply reporting that they occurred. In a programming environment, we can utilize structured error handling (TRY..CATCH..FINALLY), transaction processing, buffers, and object-level error handlers to trap and possibly fix errors. For an ETL pipeline, you’ll have to think much in the same way. How you implement this is largely dependent on the technology you use.

SQL Server 2005 Integration Services (SSIS)

SSIS has built in event handling that will help you plan for and trap exceptions. What’s more, you can use all the same control flow elements in the “Event Handlers” view as you can in the “Control Flow” view — including Data Flows. In theory, you could branch your ETL packages in many different directions depending on the unfavorable event (such as an error) that another task raises.

The default behavior of a Control Flow task is to fail the entire package after MaximumErrorCount is reached (defaulted to 1). This is hardly the type of “error event management” I’ve been talking about in this article! Perhaps in the earliest stages of development, failing the entire package when an error occurs is a good idea. But normally, you will need to build in error handing.

For Control Flow items, you can rely on elaborate event handling coupled with Precedence Constraints. Precedence Constraints allow you to attach one control flow item to another, defining the workflow between the two tasks. You can set up a constraint that prevents workflow downstream when the task fails. To do this, simply double-click the green constraint arrow that connects the two objects and adjust a couple of settings. Simple!

For Data Flow Tasks, you have more options. As you may know, it is the Data Flow that does most of the heavy lifting in SSIS. Specific to this subsystem, you can use Conditional Splits to divert output in different directions. You can use the lookup tasks to search for known values. Row Counts can be used to ensure that the same number of rows that passed through task A also passed through task Z. The Script Component can do just about everything else. But beyond all of these, you can configure error outputs for most Data Flow tasks.

Error Outputs allow you to divert rows through a different pipeline when an error occurs. You can specify general error handling or truncation error handling. The diversion can be to a log file, a table, or back into the pipeline (once fixed). For example, if data truncation occurs on a “last name” field, you can redirect this condition to another transformation that will attempt to remedy the truncation problem (perhaps by adding ‘…’ to the string) or, you could log the truncation to an XML file for later examination. You can use these logs to improve the ETL design, avoiding errors altogether. If you fix the problem, you can merge the data back into the pipeline for downstream processing.

Because reporting and responding to events is so important during data integration, I promise to dedicate a future article (or series of articles) to the different approaches you can use (with plenty of examples). For now, know that SSIS has a lot of good options that satisfy the requirements of ETL subsystem 5.

Note, in addition to handling errors and failing the package when they occur, you can also purposefully ignore them, although I rarely find the justification for this!

Hand Coding with Visual FoxPro (VFP9)

There are many things you can do in VFP to handle error events and exceptions in your hand-coded ETL solution.

  • Transactions and Buffers - Honestly, these can save your life. Always wrap critical inserts and updates in a transaction and use buffering when needed.
  • ON ERROR - a global ‘catch all’ error handler is a must.
  • Class and Object Error events - Building an error event strategy that starts at the object level and propagates up the hierarchy is a great practice. Individual classes will handle any expected errors it can first; unhandled conditions are passed to its container.
  • Structured Error Handling - Through TRY/CATCH/FINALLY you can plan for many different conditions and wrap all assumptions so that you can handle cases when they’re wrong.
  • Defensive Coding - In my articles “Boundary Testing Parameters“, “Using ASSERTS with VARTYPE to Test Parameters“, and “Handling Errors in Visual FoxPro“, I talk about the importance of testing parameter values, coding in a way to eliminate assumptions, and other thoughts on the subject of error handling in Visual FoxPro.

VFP has excellent error-handling capabilities and under most circumstances, you can recover gracefully from any exception. This is especially true for structured error handling. Everything SSIS can do, VFP can do too. You may need to get creative (for example, VFP has no such events called “OnWarning” or “OnInformation”), but generally speaking VFP satisfies the requirements of this subsystem nicely.

At this time I would like to point you to a great resource written by Randy Pearson on the subject of structured error handling in VFP. I’ve had this link bookmarked for quite some time. The web page is entitled “Visual FoxPro Structured Error Handling Matrix” and will provide you with just about everything you need to create a structured error handing strategy for your hand-coded VFP ETL project.

In the end, both SSIS and VFP can be used to create a rich and extensive error event management system. Of course, in both systems, you need to design, plan, and code it. I have stepped into many projects that have had inadequate error event management. It turns out that this best practice is routinely overlooked. For data quality’s sake, spend some time designing a good error event management system for your ETL solution!

Published February 5th, 2008

Compare a String to a List of Formats

Conforming data and building master data are two very important aspects of data warehousing, migrating legacy systems, and federating reports. A common task is to compare a string to a known format and return the matching value. For example, you may need to take parcel identifier information from multiple towns and do some smart parsing to obtain atomic map, blocks, and lots. Town A might store their Parcel ID in the format MMM-BBB/LLL, while another in the format MM/LL.BBB. You could store all known formats in a table, do some magic to convert all parts to a format code, and finally find the match in the formats table.

In VFP, there are many ways to compare the contents of a string to a particular format. TRANSFORM() and LIKE() come to mind. Using TRANSFORM, you can do something like the following:

lcTestVal = "SOMETEST_VAL"
? TRANSFORM(lcTestVal,"XXXXXXXX_XXX") == lcTestVal     && returns .T.
? TRANSFORM(lcTestVal,"XXXXXXX_XX") == lcTestVal    && returns .F.

LIKEwise, you can do some comparisons with LIKE:

lcTestVal = "SOMETEST_VAL"
? LIKE("????????_???",lcTestVal)    && returns .T.
? LIKE("???????_??",lcTestVal)    && returns .F.

But in some cases, it might be necessary to compare a single string against multiple formats, returning the format that the string matches (a common task when conforming data for data warehousing). There are a lot of ways to handle this, but simply using TRANSFORM or LIKE won’t do it alone.

Take the following example, where we have three different account number formats that come from three disparate source systems. You could manage this easily in FoxPro:

lcFormat1 = "NCCCCCN"
lcFormat2 = "CCCNNNNNN"
lcFormat3 = "NNNNNNNCCC"
 
DIMENSION aTestValues [10]
aTestValues[1] = "3UUFGP7"
aTestValues[2] = "MCK000989"
aTestValues[3] = "0090892LLG"
aTestValues[4] = "9ABCD2"
aTestValues[5] = "JKL230945"
aTestValues[6] = "JKLM00989"
aTestValues[7] = "JJ2000989"
aTestValues[8] = "3U99G7"
aTestValues[9] = "3UUFGP"
aTestValues[10] = "7XYZA1"
 
FOR x = 1 TO 10
 
    lcTestValMasked = CHRTRAN(CHRTRAN(UPPER(aTestValues[x]),"ABCDEFGHIJKLMNOPQRSTUVWXYZ","CCCCCCCCCCCCCCCCCCCCCCCCCC"),"1234567890","NNNNNNNNNN")
 
    ? ICASE( lcTestValMasked == lcFormat1 , "String " + aTestValues[x] + " matches format 1",;
            lcTestValMasked == lcFormat2 , "String " + aTestValues[x] + " matches format 2",;
            lcTestValMasked == lcFormat3 , "String " + aTestValues[x] + " matches format 3",;
            "String " + aTestValues[x] + " does not match any format")
 
NEXT

In the above example, there are three acceptable format codes that we’re using to compare against 10 values. Using ICASE, we can easily determine which codes match the format string. The inner CHRTRAN converts all character values to the letter “C”, while the outer CHRTRAN converts numbers to “N”. If we did the numbers first, our “N”s would become “C”s! Of course, you can use “X” and 9, “A” and “#”, or whatever letters you want to use as your mask.

Ideally, the format codes would belong to a table which would contain additional information about the code, such as where it came from and its status (active or inactive). Then, a loop through the table, or a seek on the format column would point you to the corresponding record.

I would love to hear some additional solutions on this. If you have any, please comment!