ETL Subsystem 6: Auditing
This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Auditing is the process that tracks the activities of users by recording selected types of events in the security log of a computer (cite). In the context of data warehousing and data integration, auditing is the process of attaching metadata to each fact (and in some scenarios, some dimensions) in a dimensional model. This metadata is then available to various applications that use the model to examine data quality and reliability. The audit record then provides some context for the data and can be instrumental in turning that data into information.
In a dimensional model, the audit table is actually another dimension and adheres to the normal rules for dimensions. This includes the use of surrogate keys, SCD types, and ETL procedures. For more information about these subjects, see my following entries:
Why Bother?
Data warehousing is truly an ongoing struggle to provide high value data to business users so that they can make the best decisions possible for their function. The audit dimension helps us provide high quality, accurate, and complete data. In addition, it also helps us with following the lineage of data, which in this context means the ability to trace origin and ownership of a piece of information. Having adequate data lineage is crucial for complying with the government reporting requirements of legislation such as HIPAA and Sarbanes-Oxley, and for answering the tough questions auditors might have for you some day.
With a properly populated audit dimension, you can actually answer questions like: “How did that data get in there?”; “Where did that information come from?”; and “How did you arrive at that number?”.
As data warehouses continue to become the “system of record” in various organizations, it is critical that every fact is traceable from source to presentation.
Making it Work
Although I will reserve complete details on implementing an audit dimension for another posting (I already have a problem of being too long-winded in my entries!), here are the basics:
- At the very least, the audit dimension should contain the following attributes:
- primary surrogate key
- date loaded into the warehouse
- version of the data warehouse when added
- name, location, and version of the source system that the fact came from
- information about the process that loaded the data (ex: SSIS execution ID, computer name, user, etc.)
- information about missing data
- information about how the fact was altered in transit
- optionally (but very helpful), a quality identifier or score for the fact
- optionally (and equally helpful), statistical information about the fact (deviations from the mean, outlier indications, etc.)
- Generate audit information as part of fact table generation, which normally occurs after dimensions have been processed. This would include gathering all of the necessary quality indicators, statistics, and lineage information.
- Create the audit record just before inserting into the fact table. This makes available the audit key, which you will need to add into the fact record.
- To query audit information later on, simply include attributes from the audit dimension in your query that cuts across the fact table.
It is reasonable to create an audit outrigger which would contain even more details about the quality and lineage of the fact. The approach I outline here uses a one-to-one relationship between the fact and the audit dimension; the outrigger would allow one-to-many relationships from a single fact to multiple audit conditions. I’ll reserve this design for a future posting as well.
SQL Server 2005 Integration Services (SSIS)
SSIS can handle auditing quite easily. In fact, there is a special Audit Transformation designed to add important lineage, package, and environment information to the workflow. The Derived Column transformation also exposes a significant amount of system and package information. The Aggregate and Row Count transforms will also prove quite helpful as they can be used to provide statistical information about sets of data.
If you need to include auditing information from dimensions into your fact audit, you might need to work with intermediate staging tables. When processing dimensions (like Customer, Product, Employee, etc.), you can build a staging table (preferably, a RAW formatted file) with information about the dimension (statistics, date loaded, etc.). Then tap into this staging table later on when processing individual facts.
You can check out the webcast “MSDN Architecture Webcast: Using SQL Server 2005 Integration Services to Populate a Kimball Method Data Warehouse” at microsoft.com for more information about adding auditing using SSIS.
Hand Coding with Visual FoxPro (VFP9)
In VFP, handling audit information is also easy, but requires some additional planning and coding to be comprehensive. In SSIS, many environmental variables (package name, user information, etc.) are available automatically. Also, in SSIS, it is trivial to add these variables to the data being moved from source to target. I’m not trying to suggest that SSIS gives you a comprehensive auditing system that requires little planning, but I am saying that it is comparably easier to add ad hoc auditing information using what SSIS provides out of the box.
In VFP, you will have to make this audit information available yourself. You can obtain this information directly from metadata you create (your process name, ETL version, etc) or from any number of VFP commands that can provide insight into the quality and lineage of the data you are processing. These commands can include those such as: CALCULATE; RECCOUNT; SYS(0) (network machine information); SYS(16) (executing program file); DATETIME; and many others.
If you have taken my advice and created a table-driven ETL system (see my Fox Forward presentation materials), then you’ll find that processing audit information is quite easy. Take a look at the information provided in those files and get back to me with any questions.
From here
In the end, both SSIS and FoxPro can handle the requirements of this subsystem. Auditing in this case is not really limited by the technology, but by the design, constraints, and requirements of the project.
Auditing, logging and other instrumentation are all important for compliance and data lineage. These topics will be discussed in detail in future posts on subsystems 29 “Lineage and Dependency” and 33 “Compliance Manager”. At the rate I’m going, these should be ready some time this spring!
In the next article, I’ll discuss removing duplicates and how VFP and SSIS can deal with the various challenges associated.
Data Warehousing and Business Intelligence consultant, with expertise in business analysis, data modeling, and data integration. Extensive experience developing vertical and integrated desktop and Internet applications spanning municipal, clinical, and financial industries.
February 29th, 2008 at 12:25 pm
[…] Auditing […]
June 13th, 2008 at 10:04 pm
[…] approach I outline here uses a one-to-one relationship between the fact and the audit dimensionhttp://blog.todmeansfox.com/2008/02/14/etl-subsystem-6-auditing/CBITWork with development teams around the globe to improve database design and performance. Manage […]