Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘VFP’


Published April 15th, 2008

ETL Subsystem 10: Surrogate Key Generator

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

When integrating data into a Dimensional Model, you need a mechanism to assign new primary keys to each dimension. These primary keys will be used in your Fact table as foreign keys. You cannot use natural keys because they are likely to repeat — this is especially true if you are maintaining history using SCD Type 2 (more on this in a bit). Subsystem 10 addresses this important need, by specifying the need to generate surrogate keys for all dimensions.

Surrogates

Integer surrogate keys should always be used as primary keys for a dimension. In addition, each dimension will have at least one candidate key comprised of the natural key and row status (active or inactive dates) if the dimension is a TYPE 2 SCD. Here is what I wrote in a previous post:

Every dimension contains a single-part surrogate primary key field. This primary key is used to relate to one or more fact tables in the schema. Natural keys (like an employee id) are kept in the dimension, but are not used in joins. This technique ensures a unique value for each record, simplifies joins, and keeps indexes small.

So the concept is simple: Every time a new row is inserted into any dimension, it gets a new primary surrogate key. We use integers because they’re compact, perfect for sequencing, and under most relational databases we can control their ranges. How you generating this primary key largely depends on your performance needs and RDBMS capabilities (see the SSIS and VFP subsections below).

Special Cases

You can treat Date dimensions a little differently. Instead of making the first row in your Date Dimension start at 1, use the ISO 8601 date standard YYYYMMDD format. This makes sense for a few reasons. First, it is easier to look at a row in a Fact table and have the date available without a join to the Date dimension or one of its role players. Second, a Date Dimension is treated as a Type 1 SCD, meaning, you would never have the same date repeated with one row being active and the other inactive as in TYPE 2 SCDs (hence throwing off the ISO date format). Third, because Fact tables generally contain several dates, using the integer date key (which is an int in the ISO date format) instead of fetching the formatted date from the Date Dimension may have some performance benefits because you can easily format the date later on in the presentation layer. Lastly, you can usually predict with certainty the total number of rows that the Date Dimension will have, and then pre-populate all the dates before you even do your first integration. This removes the Date Dimension from the surrogate key processing all together.

As I mentioned previously, one row in each dimension should have a primary key of zero, which can be used as a symbolic “null”. Remember that Fact tables reference dimensions through foreign keys. You cannot have a null foreign key, so the only viable alternative is to link to a dummy row in the dimension. In dimensional modeling, the acceptable solution is to create a row in each dimension with a primary key of zero. For example, in a Customer Dimension, the zero key might have a customer name of “Unknown” and a DOB of 0 (which is a foreign key that relates to the Date Dimension’s null row). This is a very cool concept because it becomes incredibly easy to produce a query that could give you all the order transactions that did not involve a known customer. Maybe not the best of examples, but try writing a query to give you results of everything that didn’t happen, or wasn’t there, using a relational model!

You can also use -1, -2, and so on to represent other special conditions. For example, a Promotion Dimension may have a zero-key row that represents “Unknown” and a -1 key row that represents “No Promotion”. Perhaps a -2 key could symbolize a “Canceled Promotion”. And so on…

The following sections give some ideas on how to generate surrogate keys in both SSIS and VFP. I’ll discuss using these keys in greater depth when I discuss Subsystem 14: Surrogate Key Management.

SQL Server 2005 Integration Services (SSIS)

I prefer to let the database generate new, unique surrogate keys in all cases. This is usually the safest approach and doesn’t require any real development around the key-generation process.

For SQL Server, select the best int for the job and make it an IDENTITY. Select among the 1 byte Tinyint (0 to 255), 2 byte Smallint (-32,768 to 32,768), 4 byte Int (-2,147,483,648 to 2,147,483,647), and the 8 byte Bigint (-2^63 to 2^63-1) depending on the expected maximum size of the dimension.

It may not be practical or feasible to make the primary key attribute in your dimensions IDENTITY Ints. Performance may also be an issue when assigning and returning these new keys. There are really two alternatives, my favorite first:

Alternative 1: Use a staging table. Using a lookup/reference table in the staging area is an option that always works. The concept is simple: Create lookup tables that correspond to each dimension in the model. Include the IDENTITY surrogate key column, the natural key(s), and active/inactive dates (for Type 2 dimensions). When processing dimensions, insert new rows in this lookup table just before adding them to the dimension. Use the surrogate key generated from the lookup table to insert into the dimension (you can either return the generated key using SCOPE_IDENTITY() or simply do a lookup/join on the natural key and status). When you need the key for the Fact table, you can do another lookup/join on the staging table. This approach takes pressure off of the dimensional model and creates opportunities to increase performance by optimizing the lookup/reference table. I’ll talk more about the details on this approach when I discuss Subsystem 14.

Alternative 2: Generate your own key. Another way to achieve the same result is to figure out the last key in the dimension and increment it manually. This number will then be used to insert into the dimension and again into the Fact. The problem with this approach is that if you do not process Facts immediately after each dimension, one at a time, then you will need to store and lookup the key in a staging table or in memory. Also, this approach assumes that no other process or thread is inserting data into the dimension at the same time. This, of course, would kill the sequential system and violate the primary key constraint on the table.

To explore Alternative 2 further, have a look at this post from Marco Russo of sqljunkies.com. He explains how to use an Execute SQL Task in SSIS that executes the following:

SELECT COALESCE( MAX( ID_Dimension ), 0 ) AS LastID 
    FROM Dimension

The claim is that his method is the best, most scalable solution. I don’t fully agree, as my experiences dictate otherwise. But I believe that each approach has merit and depends a lot on the environment.

Hand Coding with Visual FoxPro (VFP9)

For VFP, use an int data type with AutoIncrement enabled. As with a SQL Server implementation, I prefer using this approach to generating keys manually. Even still, both alternatives mentioned above are also available in FoxPro:

Alternative 1: Use a staging table. Here, as with SQL Server, use a staging table with AutoIncrement enabled on the primary key. As a consequence, you will not enabled AutoIncrement on the dimension. To return the newly added key from the lookup/reference table, you simply use the GETAUTOINCVALUE() function and return it.

Alternative 2: Generate your own key. You can also determine the last key used in the dimension and manually increment it. Personally, this approach offers very little to a VFP developer. Alternative 1 — or simply using the Dimension for generating the surrogate — is by far the best approach.

I like VFP for these types of tasks. Whatever you decide, you don’t really need to worry about returning the generated key. You typically insert and update all your dimensional data first, and then return to the dimension later to lookup keys just before inserting the facts. Nothing beats an old-fashioned SEEK in this scenario, which can be light years faster than building a select with a where clause depending on how crafty you are with Rushmore and Index building.

Compare this:

SELECT PrimaryKey ;
    FROM Dimension ;
    WHERE NaturalKey = 'SomeValue' ;
        AND StatusFlag = True

to this:

SEEK('SomeCandidateKey','Dimension','CandidateIndex')

If you are using Alternative 1, then instead of SEEK, you could build a JOIN between the lookup table and the data being processed for the fact. This set-based approach can be pretty speedy as well.

If you insist on exploring Alternative 2, the translation from Marco’s TSQL code to VFP is as follows:

SELECT NVL(MAX(ID_Dimension),0) AS LastID ;
    FROM Dimension

In my experience, I simply use VFP9’s Autoincrementing Integer field in each Dimension. I don’t bother with a staging lookup table because with VFP, I can just set the order to the natural key and do a SEEK on the dimension. It’s a fast, simple, and consistent approach.

VFP Really Shines

In this series of articles, I have attempted to highlight some of the differences between implementing the 34 Subsystem best practices in both a hand-coded environment (VFP) and through a tool specifically designed for the job (SSIS). In each case, both VFP and SSIS have really been about equal (this is a non-scientific study, of course!). SSIS has advantages that are hard to duplicate in VFP (some of which I will be highlighting later on in this series), but in this case, VFP really shines. While both databases can handle generating keys for us in similar ways, VFP’s SEEK command allows us to use it!

I don’t want to dwell on this any more. I’ll discuss more details on how to use the surrogate keys when I talk about Subsystem 14: Surrogate Key Management.

Published March 29th, 2008

FoxPro a Big Hit in China

Did you know that “a survey showed that Microsoft’s FoxPro database application accounted for 65% of database installations in
China, yet Microsoft had not sold a single legal copy in the country.” (source)

This blows me a way. I know pirating is bad in China but this is just unbelievable. Imagine if 10% of those installs were legal. Microsoft would have additional pressure to carry on the language (and would have plenty of additional revenue as well).

Anyway, that article is a great read for anyone interested. Now I can blame Bill AND China for VFP’s fate.

Published March 25th, 2008

ETL Subsystem 9: Slowly Changing Dimensions (SCD)

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

To this point, I’ve discussed “Data Extraction” and “Cleansing and Conforming” ETL Subsystems. Now, I’ll shift attention to data delivery (the “L” in ETL). As I have already discussed Slowly Changing Dimensions in a previous post, I’ll only highlight some of the points I think are most important here. So please, if you haven’t already, read my previous post “Dimensional Modeling: Loading the Slowly Changing Dimension” before you continue.

Some important takeaways:

  • Dimensions change over time. We can identify how to process these changes by using SCD Types 1, 2, and 3 techniques.
  • Dimensions (especially large ones) sometimes need a hybrid approach. This approach is sometimes referred to as a Type 1.5 SCD, which allow a dimension to be added with incomplete data. Updates are processed later as if the dimension was a Type 1 (empty and incomplete attributes are overwritten instead of retaining history) until the entire row can be labelled as “complete”. To indicate this, I have found including a “complete” flag in each dimension is a clear way to mark Type 1.5 dimension rows as complete, or in need of additional data.
  • Like the Type 1.5, sometimes a dimension row must be added with no details. These are called Inferred Members. In this scenario, just a natural key and the system-generated surrogate key are added. The dimension is then updated much like the Type 1.5 SCD until its data is complete.
  • For Type 1.5 and Type 2 SCDs to work, the row needs some sort of status indicator, the best being active “to” and “from” dates.

 

SQL Server 2005 Integration Services (SSIS)

I already mentioned in my previous post how SSIS’s Slowly Changing Dimension Task can be used to meet the requirements of this subsystem. Again, my only word of advice is to be careful about the amount of data you plan on passing through the generated SCD Tasks. I have seen performance degrade substantially on large datasets. It is best, in these scenarios, to introduce additional logic before the SCD logic to filter out inserts, updates, and deletes into different trees.

Also, it is a good idea to fully understand the different SCD types (inferred dimensions as well) before running through the wizard. Not only will the entire process be clearer, but you will better understand how to optimize the generated data flow when all is created.

Hand Coding with Visual FoxPro (VFP9)

I’ve had great success using FoxPro for hand-coded dimensional modeling projects. My first stab at loading Slowly Changing Dimensions was in 1998, using VFP6. Using a database trigger, I wrote code to update a denormalized data model that was used for reporting and ad-hoc queries. At the time, I had no idea what a SCD was, but somehow, instinctively I implemented it pretty close to the tried-and-true Kimball methods. That system is still in use today (the app is in VFP9 now) and I have redesigned it to match the Kimbal SCD methodology much closer than I originally conceived it.

If you are involved in a similar VFP project, or are at all interested in learning more about how VFP can be used in this way, please do contact me. I’m pretty passionate about VFP and Dimensional Modeling, so I’d love to hear from you!

Published March 17th, 2008

ETL Subsystem 8: Data Conformance

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data conformity is one of the most important and fundamental aspects of data warehousing. Data conformity means that a data element’s label (column header) and content (which includes data types, lengths, and nullability) are consistent across multiple tables and across multiple business processes.

In dimensional modeling, a single dimension shared across business processes is a conformed dimension. Shared dimensions allow for drill-across queries where a user could conduct enterprise-level analysis across several business processes (inventory, sales, procurement, etc.). Conformed dimensions, as a data quality technique, also cut down the likelihood of data inconsistencies in an organization.

Margy Ross of the Kimball Group states that “Using conformed dimensions ensures that the data warehouse is delivering consistently defined attributes for labeling, grouping, filtering and integrating data from multiple business processes.”

Dimensional conformity also ensures reusability. Reusing dimensions across multiple business processes is what makes dimensional modeling so simple and scalable. The Product dimension, for example, would be the exact same physical table used in the Orders, Inventory, and Procurement dimensional models.

Obtaining Conformity

There are some challenges associated with obtaining conformity. Some of these challenges might be political, where business users refuse to agree on common business definitions. Most of them, however, are in the planning, data modeling, and system architecture, where individuals fail to identify or recognize where conformity exists. This is typically a direct result of failing to do adequate data profiling! Not having data stewards available to consult with on specific data issues could also lead to some slow times.

The first step to conforming dimensions is to define the business processes that will be used in the dimensional model. The next step is to conduct a data profile to identify and document the entities and relationships in the source system. Finally, identify the common dimensions across business processes.

The question then becomes: How do we determine which dimensions are conformed across business processes?

The answer is the Bus Matrix. The Kimball Group does an excellent job making the case for the Bus Matrix in their books and articles. In short, it acts as a roadmap for a dimensional modeling project. Check out “The Matrix” and “The Matrix: Revisited” on the Intelligent Enterprise Website. I won’t attempt to describe the Bus Matrix here for the sake of brevity, but may share some thoughts in a future posting (which I will likely title “The Matrix: Reloaded” just to be cheeky!).

The Bus Matrix will tell you exactly what your conformed dimensions are.

Conforming Attributes

By using conformed dimensions, it is possible to conduct drill-across queries. You might, for example, want to do some analysis on retail sales and inventory. Retail Sales and Inventory business processes exist in the data warehouse as two separate dimensional models. The cool thing is that they share dimensions (ex. Product). It is rather simple then to relate the fact tables together using the Product dimension based on some product characteristics such as “size”, “weight”, and “cost”. These drill-across queries can lend a great deal of insight into an organization’s various business entities.

Filtering and searching for data is also simplified if, for example, all names are in UPPER case and stored in atomic parts (last, first, middle, title).

You should strive to get as many attributes in various dimensions to conform. Attributes such as Name, Length, Weight, Size, and Color should mean exactly the same thing across all dimensions. If not, then they must be named differently. Conformity on this level lends to the simplicity of the overall design and makes operator queries simpler to write and understand.

Take for example Customer and Shipper dimensions, both with an attribute called “state”. In the Customer dimension, state is abbreviated to 2 characters, but in the Shipper dimension it is spelled out. You would not be able to conduct a query on all shippers and customers in the same state without the need for some lookup table. This could confuse and likely frustrate the operator. If you must spell out the state’s name, re-name the column to “state_name” (and add a column called ’state’ with the abbreviated value)!

Allowing business users to relate multiple business processes together using commonly defined metrics (”facts”) also adds tremendous value to the data warehouse database. For example, a metric such as “discount” would not be a percentage in one fact table and an actual dollar amount in another. Instead, convert one to the other during data integration, or create two metrics: “discount_percent” and “discount_amount”. To take this a step further, in order for “discount_amount” to be conformed, it must be stored in some common currency (such as EUR or USD).

Not all attributes and metrics can be conformed, but the more you can, the more flexibility you will have with your drill-across queries. If you cannot conform a particular attribute or fact, then be sure to change its label so that end-users won’t make an assumption that it means something that it’s not!

I find it intuitive to identify attributes and metrics that are conformed across multiple source systems once the conformed dimensions have been identified and a data profile conducted. Where you need to be careful is when two column labels are named the same but mean (sometimes slightly) different things. Product color, for example, might be entered as a color’s name (red, black, blue) in one system, and a hex or RGB value in another. Multiple systems might have different product identifiers, all labeled “Product ID”. “Distance” could be in kilometers in once system, and in miles in another. Fahrenheit to Celsius conversions may be necessary. All of these issues need to be worked out during the data integration process so that “distance”, “color”, and “Product ID” mean exactly the same thing across Dimensions and Facts.

A 3lb red widget from NY is a 3lb red widget from NY is a 3lb red widget from NY!

SQL Server 2005 Integration Services (SSIS)

SSIS has many transformation tasks that you can use to enforce conformity in the dimensional model. In fact, there is not a single data conformity task that I have run across that SSIS does not provide a solution for out of the box (well, with some help from the Script Task, that is). Because many transformation tasks can be used to enforce conformity, I won’t enumerate them here. The challenges of conforming data truly lie in the planning and design process.

Hand Coding with Visual FoxPro (VFP9)

As with SSIS, there is nothing specific about hand-coding regarding conformed dimensions. So much rests on planning, mapping, data profiling, de-duping, and cleansing, that the process of conforming is just a matter of implementing the conversions and mappings necessary to build the dimensions so that they adhere to conformity. Check out my post on Loading Dimensions for some further details.

For some examples on some things you might do in FoxPro code to achieve conformity, take a look at my previous posts “Compare a String to a List of Formats” and “Conforming Temperature in FoxPro (some conversions)“. These are two examples of things you may need to do to enforce conformity.

From Here

In my next ETL Subsystem post, I will discuss Subsystem 9: Slowly Changing Dimensions (SCD). If you want to know more about Conformed Dimensions and the theory behind them, I recommend reading any of Kimball’s data warehousing books or articles that discuss the subject. He lays out the case quite nicely for conformed dimensions. I would like to stress that the success of the dimensional model rests in the ability of the data modelers and integration architects to design the data warehouse using as many conformed dimensions, attributes, and facts as possible.

There are not too many other blog postings (ex: 1 2 3) on the topic of data conformity that truly dive into the topic. I would love to see more, as this is such an important subsystem and best practice.

Published February 29th, 2008

ETL Subsystem 7: Removing Duplicates

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the most appropriate version for the warehouse.

Generally, the practice involves an input record and a reference table, but could also span several databases and require access to web services and 3rd-party software. The reference table is usually a dimension (such as Product, Customer, Employee, Address, etc.) or a staging table that contains the relevant attributes to perform the lookup. If a staging table is used (a must for large reference sets), then it must be maintained as part of dimension processing. For more information about reference tables and master data, try searching for “Master Data Management” on Google. If I feel up to the challenge, I’ll post more about MDM in a future series of articles.

It is common to have duplicate data in an organization, especially if that organization has multiple systems for managing customers, products, orders, inventory, etc. Consider a small business with in-house software to track customers and a separate database Online. The same customer could be in both systems with different account numbers and a slightly different name (”Martha Jones” vs. “Martha A. Jones”). As part of data warehousing, we need to eliminate these cases as best we can. That is, of course if it is important for your data warehouse to do so.

If an organization does not need to eliminate these duplicates from their data warehouse, then they should proceed no further. However, in order to do any form of historical analysis, you need data that you can follow uninterrupted back in time and across multiple sources. This subsystem helps you do that.

Fuzzy Matching

Fuzzy matching is the process of determining if two strings are approximately equal (ideally returning some form of confidence score). Algorithms such as Levenshtein Distance and SoundEx have been developed to provide this type of functionality.

If the duplication of information is minimal, or if the volume of data is manageable, then it should be a rather straight-forward exercise to write your own fuzzy matching solution using any of the known algorithms. But you should plan on spending some time on this endeavor. Also, as fuzzy matching is not an exact science, plan on some ongoing human intervention.

One thing is certain: The more atomic your data is, the easier your job of de-duping will be. It is considerably easier to look for duplicate address information when the street number, name, unit number, city, state, and zip are broken out into atomic parts!

SSIS and VFP do not have any built-in, native ability to de-dupe data. Organizations with severe duplication problems will likely need to invest in a 3rd-party cleansing solution. Regardless, here is what you could do using SSIS and hand-coded in VFP:

SQL Server 2005 Integration Services (SSIS)

SSIS provides three very good Data Flow tasks for de-duplication:

Lookup Transformation:
Performs a lookup against a reference table. The lookup uses an equi-join on one or more columns (composite join) and is case sensitive. Use this task for exact matching to a reference set.
Fuzzy Lookup
This transformation uses fuzzy matching (pattern-based) against a reference table in a SQL Server 2005 database. You must supply the reference table, which is typically built from your existing data. You can customize the fuzzy lookup by specifying the maximum number of matches to return per input row, token delimiters, and similarity thresholds. Each match includes a similarity score (how similar the input and reference values are) and a confidence score (the likelihood that a match has been found).
Fuzzy Grouping
The Fuzzy Grouping identifies rows of data that are likely to be duplicates. Then it selects a row that best represents the group and marks that row with a similarity score of 1. The other rows in the group are given a score ranging from 0 to 1. The closer to 1, the better the match.

It is a best practice to first use a Lookup Transformation to find exact matches and then divert the unmatched data into a Fuzzy Lookup. The package will run faster (the Fuzzy Lookup is more expensive than the equi-join Lookup) and will be easier to maintain (you will clearly see what records are being diverted as non exact matches).

You should read more about how the Fuzzy Lookup and Fuzzy Groupings work in Books Online. They are both quite interesting. Also, they’re not too easy to get the hang of at first and have quite a few configuration options. Later, I will demonstrate these tasks in a future posting.

If you need a 3rd-Party tool, then integrating it into the Control Flow would be ideal. Otherwise, you will need to stage the dirty data and run the vendor’s product separately on that stage. You can then pick up the de-duped data in SSIS and continue processing. You may be able to use the Execute External Process task (runs a Win32 Executable or batch file) and put the entire operation into an SSIS Control Flow (which is the second-best solution).

Intelligent Search Technology has an integrated deduplication task for SSIS. I have not evaluated this task, but sounds pretty good. Read about it here.

Hand Coding with Visual FoxPro (VFP9)

FoxPro does not have any native or built-in solution for de-duping data. This isn’t odd, as most (if not all) programming and databases languages don’t either (I would love to know of a language that has some form of de-duping functionality). You can, however, build your own by using functions such as the following:

Levenshtein Algorithm
Check out the discussion on the Fox Wiki.
LIKE/ LIKEC
Can use wildcards to see if expression 1 is like expression 2. The LIKEC version is for double-byte characters. This function might be appropriate under certain circumstances for identifying data to de-dupe.
SOUNDEX
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. More about SoundEx.
Craig Boyd’s SoundsLike Function
Check out Craig Boyd’s improvement for the SOUNDEX function on the VFP Wiki.
MetaPhone and Double Metaphone
Here are a couple of gems translated for us by Craig. The code is posted on the FoxPro Wiki
The ‘Oliver’ Function
I, like Tyler Akins at rumkin.com, first heard of this using PHP. You can see Tyler Akins’ implementation using FoxPro code here. Pretty cool!

Note: The code (zip format) for Craig’s work can be found here at Replacement Software. You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well.

As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just make up a word?), you can handle this easily in FoxPro by splitting compound fields yourself. I wrote a couple posts last year on this very subject: “Parsing firstname / middlename / lastname” and “Parsing City/State/Zip“. You can then use the individual name and address parts to perform the search, increasing your likelihood of finding good matches. Also, it would be a good idea to convert all strings to upper or lower case (I prefer upper) as well.

From Here

Although SSIS and FoxPro do not offer built-in de-duplication algorithms for names, addresses, products, and the like, they both offer enough tools, tasks, and functions so that you can write your own. Plan on spending plenty of time on this endeavor, as de-duplication is not an exact science and will undoubtedly require some human intervention during the process.

Next post, I’ll discuss ETL Best Practice Subsystem 8: Data Conformance. Data conformity across dimensions and fact tables is one of the hallmarks of a fully integrated dimensional data warehouse.

Published February 14th, 2008

ETL Subsystem 6: Auditing

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Auditing is the process that tracks the activities of users by recording selected types of events in the security log of a computer (cite). In the context of data warehousing and data integration, auditing is the process of attaching metadata to each fact (and in some scenarios, some dimensions) in a dimensional model. This metadata is then available to various applications that use the model to examine data quality and reliability. The audit record then provides some context for the data and can be instrumental in turning that data into information.

In a dimensional model, the audit table is actually another dimension and adheres to the normal rules for dimensions. This includes the use of surrogate keys, SCD types, and ETL procedures. For more information about these subjects, see my following entries:

 

Why Bother?

Data warehousing is truly an ongoing struggle to provide high value data to business users so that they can make the best decisions possible for their function. The audit dimension helps us provide high quality, accurate, and complete data. In addition, it also helps us with following the lineage of data, which in this context means the ability to trace origin and ownership of a piece of information. Having adequate data lineage is crucial for complying with the government reporting requirements of legislation such as HIPAA and Sarbanes-Oxley, and for answering the tough questions auditors might have for you some day.

With a properly populated audit dimension, you can actually answer questions like: “How did that data get in there?”; “Where did that information come from?”; and “How did you arrive at that number?”.

As data warehouses continue to become the “system of record” in various organizations, it is critical that every fact is traceable from source to presentation.

Making it Work

Although I will reserve complete details on implementing an audit dimension for another posting (I already have a problem of being too long-winded in my entries!), here are the basics:

  1. At the very least, the audit dimension should contain the following attributes:
    • primary surrogate key
    • date loaded into the warehouse
    • version of the data warehouse when added
    • name, location, and version of the source system that the fact came from
    • information about the process that loaded the data (ex: SSIS execution ID, computer name, user, etc.)
    • information about missing data
    • information about how the fact was altered in transit
    • optionally (but very helpful), a quality identifier or score for the fact
    • optionally (and equally helpful), statistical information about the fact (deviations from the mean, outlier indications, etc.)
  2. Generate audit information as part of fact table generation, which normally occurs after dimensions have been processed. This would include gathering all of the necessary quality indicators, statistics, and lineage information.
  3. Create the audit record just before inserting into the fact table. This makes available the audit key, which you will need to add into the fact record.
  4. To query audit information later on, simply include attributes from the audit dimension in your query that cuts across the fact table.

It is reasonable to create an audit outrigger which would contain even more details about the quality and lineage of the fact. The approach I outline here uses a one-to-one relationship between the fact and the audit dimension; the outrigger would allow one-to-many relationships from a single fact to multiple audit conditions. I’ll reserve this design for a future posting as well.

SQL Server 2005 Integration Services (SSIS)

SSIS can handle auditing quite easily. In fact, there is a special Audit Transformation designed to add important lineage, package, and environment information to the workflow. The Derived Column transformation also exposes a significant amount of system and package information. The Aggregate and Row Count transforms will also prove quite helpful as they can be used to provide statistical information about sets of data.

If you need to include auditing information from dimensions into your fact audit, you might need to work with intermediate staging tables. When processing dimensions (like Customer, Product, Employee, etc.), you can build a staging table (preferably, a RAW formatted file) with information about the dimension (statistics, date loaded, etc.). Then tap into this staging table later on when processing individual facts.

You can check out the webcast “MSDN Architecture Webcast: Using SQL Server 2005 Integration Services to Populate a Kimball Method Data Warehouse” at microsoft.com for more information about adding auditing using SSIS.

Hand Coding with Visual FoxPro (VFP9)

In VFP, handling audit information is also easy, but requires some additional planning and coding to be comprehensive. In SSIS, many environmental variables (package name, user information, etc.) are available automatically. Also, in SSIS, it is trivial to add these variables to the data being moved from source to target. I’m not trying to suggest that SSIS gives you a comprehensive auditing system that requires little planning, but I am saying that it is comparably easier to add ad hoc auditing information using what SSIS provides out of the box.

In VFP, you will have to make this audit information available yourself. You can obtain this information directly from metadata you create (your process name, ETL version, etc) or from any number of VFP commands that can provide insight into the quality and lineage of the data you are processing. These commands can include those such as: CALCULATE; RECCOUNT; SYS(0) (network machine information); SYS(16) (executing program file); DATETIME; and many others.

If you have taken my advice and created a table-driven ETL system (see my Fox Forward presentation materials), then you’ll find that processing audit information is quite easy. Take a look at the information provided in those files and get back to me with any questions.

From here

In the end, both SSIS and FoxPro can handle the requirements of this subsystem. Auditing in this case is not really limited by the technology, but by the design, constraints, and requirements of the project.

Auditing, logging and other instrumentation are all important for compliance and data lineage. These topics will be discussed in detail in future posts on subsystems 29 “Lineage and Dependency” and 33 “Compliance Manager”. At the rate I’m going, these should be ready some time this spring!

In the next article, I’ll discuss removing duplicates and how VFP and SSIS can deal with the various challenges associated.