Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Data Integration’


Published July 2nd, 2008

ETL Subsystem 15: Bridge Table Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Before I can discuss this subsystem, I need to discuss bridge tables. This is a topic I haven’t covered much until now. So to help me illustrate how bridge tables are used in a dimensional model, I’ll present a case for a real-estate property transaction.

Sometimes the defined grain of a fact has multiple dimension rows that characterize it. These scenarios are referred to as many-to-many or multi-valued dimensions. A patient encounter at a hospital might have multiple procedures. Multiple bank accounts might belong to a single person. In the case that follows, I’ll show how multi-valued dimensions help support property sales (i.e. a real estate transaction).

I would like to make one important point before I begin: If you can thwart the need for bridge tables by redefining the grain of your fact table to be more atomic, then you should do so. Bridge tables add complexity to the model, ETL process, and the end-user application layer. With that said, multi-values bridge tables in many environments are unavoidable and absolutely necessary.

Case Study: A Real Estate Transaction

In the case of a property transaction (aka a property “sale”), you are potentially dealing with several many-to-many relationships. First of all, you may not be buying alone (one-to-many buyers). You may also not be buying your house from a single person (one-to-many sellers). Your house may be on two parcels of land or no land at all if you are buying an apartment or condo (none-to-many parcels). A barn or in-law apartment may be included or perhaps you are just buying land to build your dream home later (none-to-many buildings). These and other complexities make the case for bridge tables to represent the many-to-many relationships inherent in a complex transaction such as this.

When you buy land or a house (in the US, at least!) you must “record” the sale with the clerk of the municipality in which the property exists. This recording becomes part of the chain of ownership and tax record for the property. This recording can accommodate multiple grantors (the people or organizations selling the property), grantees (those who are buying the property), and properties (multiple parcels, land lines, lots, buildings, outbuildings, units, etc). This recording is given a book and page and filed away in a database. A hard copy is typically stored in a physical book in the clerks’ vault as well. It’s pretty fun (for some people who are into history) to go to your clerk’s office and peruse the vault. Anyway…

What follows is a simplified ER model from an OLTP system I designed several years ago. This model represents the complexities of the above case in a transaction environment. It has been simplified to only show the more important business entities:

Property Sales Schema - OLTP

Note that in the above OLTP system, table Real represents land and building values from tax billing software. These figures have been aggregated into a single row (i.e. total land value and total building value, number of buildings, total acres, etc.). For the above model, knowing the exact building or land line is not important. For the Data Warehouse, though, we want and need this atomic detail for some detailed analysis. The Deed table represents the actual recording (book, page, date, time, etc.). The bridge table ND_link contains grantor and grantee information. NR_link represents ownership history between the names and entities in Namefile and the aggregate property information in table Real.

The business process dimensional model could therefore be modeled as follows:

Business Process Dimensional Model for a Property Sale

As you can see, there are bridge tables between the SaleFact and DimBuilding, SaleFact and DimLand, and SaleFact and DimPerson dimensions. I typically name these tables by including the word “Group” in its name (LandGroup, BuildingGroup, etc.) but you could use the word “Bridge” or any other term that is consistent throughout your models. These bridges define the many-to-many relationships that are normal for a Real Estate transaction. Because a property sale is a natural business event, and cannot be logically defined to be any more atomic, multi-value bridge tables are a must.

In the above schema, a particularly astute and observant modeler will notice the many-to-many relationship between the keys in the fact and bridge. FoxPro, which is where I put together these examples, will not allow a many-to-many join between two keys. So, I Photoshopped the graphic in order to show you the logical representation! In FoxPro, and in many other database systems, you will need another table in between the Group and the Fact with a single attribute primary key. This would represent the physical implementation of the above design. You would therefore need two physical tables to represent the many-to-many relationship. I try my best to avoid this. In FoxPro, for example, I would simply not create the relation at all! Or in some cases, the bridge group would be unique for each event, making the foreign key in the fact table always unique. Being unique, it can be defined as a candidate key and then happily joined to the bridge table (more on this in the section below on “special considerations”).

Bridge Table Characteristics

Bridge tables are not just janitors or pianos (well, they do have all they keys!). Most bridge tables also contain attributes. For the above case, I include two attributes in each bridge: weight and primary indicator.

Weights are used to allocate metrics in the fact. Weights always add up to 1 across the group. If you need to allocate the sale price across all buildings involved in the sale, you would use the weight. Weights are calculated as part of ETL process and are an important part of the mechanisms of Subsystem 15. Sometimes, it is perfectly OK to divide 1 by the number of group records to get the weight (applying any fraction to the first row perhaps). But more often than not, there is a more complex algorithm to determining the weight. For land, you might base the weight on lot size or price code. For buildings, you may base the weight on the replacement cost of the building, or you might use square footage. All of these are defined by the business and applied during ETL.

I use primary indicators because some towns only care about the primary grantor, grantee, building, or parcel for analysis. One row in the bridge table will be identified as ‘primary’. During ETL, the primary row is identified by using a sequence number (number 1 = primary) or by determining the highest valued property. Sometimes, its arbitrary, in that you just mark the first one encountered as primary. These decisions are part of your business requirements and must be carefully implemented during data transformation.

Other Considerations

Guess what? Your bridge table is really a type of fact table. In the above schema, LandGroup is a sort of fact with some semi-additive metrics that represent the relationship between a parcel of land and a sale.

In my example, a sale is a static event. The buildings, land, and people involved will change over time, but the groupings used at the time of the sale will remain the same. For this reason, there is no need to make the group tables Type2 dimensions. If we ever need to change a value in one of these bridge tables, then it is likely to correct a data problem and not to represent a changing dimension. Additionally, in this scenario, it would be a safe assumption — especially because there are weighting factors and primary indicators involved in each bridge — that each grouping is unique to the fact. For modeling purposes, you could avoid the many-to-many relationship issue discussed in the previous paragraphs by making the bridge keys in the fact table candidate keys. There would be a unique set of group keys for these bridge tables. But be cautioned: The volume of property sales is small in any given year for any given municipality. You’ll never see a million transactions in a year. This makes the candidate key approach attractive. If you are dealing with patient diagnosis across multiple hospitals then your bridge tables will become too large. In these cases, you’ll need to consider re-using groups. The consequence of reusing groups is that the key in the fact table can no longer be a candidate.

To be clear: Bridge tables can also be a Type 2 SCD. If this is needed, and tracking bridge table changes over time is important, then you will need to combine bridge building with your SCD logic. You will need activity dates (from and to), and a current row indicator.

As I said at the top: If you can thwart the need for bridge tables by redefining the grain of your fact table to be more atomic, then you should do so. Bridge tables are not for the faint of heart!

SQL Server 2005 Integration Services (SSIS)

Bridge table building is largely a modeling and business rules problem. There is nothing in SSIS that will prohibit you from implementing even the most complex bridge table design. Most difficulties will come when managing the keys and implementing weighting factors. For the keys, you can use the same techniques described in my posts about Subsystem 10 and Subsystem 14. For weightings, I’ve implemented weighting algorithms using both stored procedures (Execute SQL Task) and VB (Script Component) when appropriate.

Hand Coding with Visual FoxPro (VFP9)

The above case is taken from an actual implementation I created in Visual FoxPro. I must say, it was rather simple once I ironed out the nuances and had my weighting algorithms in place. Developing the bridge table builder subsystem was a simple matter of inserting and linking the fact with the multi-valued entity and applying the weight and primary indicator logic. The resulting analytics that can be derived now far outshines that which could be done with the original OLTP model. Not to mention, the dimensional model has atomic detail (not like the aggregate Real table shown in the first schema) which allows for more slicing and dicing. But I digress!

From Here

Next post, I’ll discuss Late Arriving Data (Subsystem 16). That’s a big topic, so I’ll distill it as much as possible to keep it brief!

Published June 24th, 2008

ETL Subsystem 14: Surrogate Key Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Surrogate key management is an important part of designing an ETL system using the Kimball Methodology. Slowly Changing Dimensions require the use of surrogates because natural keys, like a Customer ID, Product ID, and so on, will be repeated in dimensions whenever Type 2 changes are tracked. Surrogates are also desirable because they are single-part and fast for joins. I’ve already discussed the need for surrogates in several posts. Here, I’ll talk about some techniques for replacing the natural keys that come in from the source systems with the new surrogate keys that exist in the data warehouse.

The idea is simple in theory: For each dimension row, you generate a surrogate key. This surrogate key replaces the row’s candidate key and is used to link to one or more Facts.

For example, the candidate for a Customer dimension may be CustomerID + ActiveFromDate. The ActiveFromDate is used to sequence the row if Type 2 change tracking is used. It is entirely possible to have the same CustomerID listed dozens of times in the dimension. The differentiator is the ActiveFromDate. This combination is guaranteed to be unique in the Customer dimension (although you would not normally enforce this using an index, you would certainly enforce this through ETL). This is a simple example. When modeling Parts or Real Estate, you may have several attributes combining to form a candidate key. This also becomes a bit more complex when Customer data is refreshed throughout the day — meaning you cannot rely on the date alone, you also need to consider time.

In dimensional modeling, you will normally process dimensions before facts. Because of this, you will have all surrogates defined for each row you need. In a subsequent post, I’ll talk more about situations where you load a fact before a dimension, or situations when late arriving data must be handled. For now, assume that all dimensions have rows that correspond to a fact. When you process the fact, you simply need to look up the surrogate key assigned in the dimension and use that key as the foreign key to that dimension in the fact.

Dimensional modeling essentially requires surrogate keys, and does so by taking a performance hit during ETL. It takes time to look up surrogate keys. You will spend a lot of time yourself making this process as fast and efficient as possible. I do have a few suggestions, though:

Load by the Day

Process facts one day at a time (all of yesterday’s data plus any changed data or late arriving data from previous days). Doing so allows you to join your staging fact row by the natural key in the dimension for that day. For example, in SQL, you could acquire all dimensions that are valid for a particular day by doing the following:

SELECT DISTINCT NaturalKey 
    FROM Dimension 
    WHERE DataDate BETWEEN ActiveFromDate AND ActiveToDate

This returns a subset of the dimension for the day in which the fact occurred (DataDate). If the candidate key involves more than the natural key, you will need to adjust the above SQL to be more robust. For example, you may have the name and birthdate of a patient at a hospital with an insurance number and insurance company code that uniquely identifies the row! That’s six attributes (if you include the ActiveFromDate) that uniquely identifies the row when combined.

With the above, you can incorporate the query into a JOIN on the staging fact data (by selecting all facts that occurred on DataDate), or you could use it as the source for a lookup operation. By narrowing the focus to be a single day (and you load by the day), you eliminate the need to worry about SCDs.

You can further enhance the above by maintaining special staging tables for each of your dimensions. These staging tables contain all candidate attributes and the surrogate key for each dimension. These special staging tables are maintained while you process your dimensions. Instead of selecting against the actual, production dimension, you can use the staging table. Although there is a little extra overhead in maintaining the staging tables, you get it all back — and then some — when you do your key lookups.

If you need to load by time of day (essential for real-time or intra-day ETL), you can still use the above method, except that you will need to join your fact data on time of day as well.

You can get around using BETWEEN if you are processing the most recent data. Instead, you could filter results by the dimension’s “current” flag. The current flag is set while processing the dimension. Only one row, for particular natural key, can be marked as current. A predicate on a current flag would be faster than checking between dates. A common trick is to split the rows into two streams: One with current data (anything that happened yesterday), and one with historical data (anything that happened the day before yesterday and back). Use the current flag for the current data, and use the BETWEEN for historical data.

I find that this technique works well for small dimensions with a low number of facts (a few thousand transactions a day). With a good index plan, this might be all you need to do.

Store the Surrogate Right Away

Another technique worth mentioning is particularly useful when dealing with very large dimensions and lots of facts. You should also consider this technique if you are manually generating your surrogates (see my post “ETL Subsystem 10: Surrogate Key Generator” for specifics). Basically, when you generate your surrogate key for a dimension row, store it with the fact data at the same time.

This technique requires writing the surrogate to the dimension and the fact staging table. Using an UPDATE statement is a simple approach, but a more exotic and better performing method would be to INSERT the new keys into another staging table, and joining to the staging fact table later on.

Either way, there is a performance hit with storing they key right away, so be sure to thoroughly review your environment and needs. I have used this technique before with great success, but it took me a long time to determine it was the best approach. In fact, my original design worked great until the dimension grew to a few million rows and the transaction load doubled. Doing a lookup using BETWEEN proved to be too slow compared with saving the key immediately.

SQL Server 2005 Integration Services (SSIS)

Jamie Thomson wrote an article on how to use SSIS to update fact tables (actually, most of it is a case study written by “Mag”). I won’t repeat what was in that post. Have a look. Jamie’s technique is basically what I have come to adopt myself. You’ll notice the liberal usage of Lookups instead of Joins (joins, at least for me seemed like a more logical starting point). Lookups perform better for a few reasons. The Merge Join operation requires a Sort — either by using the Sort Component or by ORDERing the data through the Source Component. This sorting can be expensive. Lookups, in addition to not requiring that the input be sorted, can use caching and limit access to the database. For these reasons, I use Lookups now almost exclusively.

When using the Lookup approach you must be aware of two possible conditions: The Lookup could return more than one match if you don’t properly define your candidate key. Also, if no match is found, common for late arriving data, you need to either divert the errors and handle them separately, or convert the NULLs to your default “unknown” value for each dimension. For me, I typically use negative numbers to define various types of null (-1 for “not known”, -2 for “not applicable”, etc…).

If you process by the day, use staging tables as dimension lookups, and rely on Lookups instead of Sorts and Merge Joins, then fact processing should go quite well.

Hand Coding with Visual FoxPro (VFP9)

As I discussed in my post “ETL Subsystem 10: Surrogate Key Generator “, VFP’s SEEKing capability far outshines set-based approaches to this problem. Check out my comments in that posting for some ideas on how to use VFP to get the surrogate keys for your fact data.

Not to get lost though in the wonders of SEEK, VFP is perfectly suited for set-based approaches as well. So all of the methods discussed in the article (Joins and lookups) can be implemented easily in FoxPro. As I wrote in that post, VFP really shines for this type of task.

From Here

Next post, hopefully later this week, I’ll discuss bridge tables and how to best build them. I have an interesting case study to walk through. I have refrained from using cases in my ETL subsystem posts, but for bridge tables, I am making an exception!

Published June 16th, 2008

ETL Subsystem 13: Fact Table Builders

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Fact tables are the heart and soul of the business process dimensional model. I discussed fact tables already in a previous post (”Fact Tables“), so I won’t repeat myself here. Recently, I also provided a more formal definition of a fact table as the: “central table of a Star Schema with numeric performance measurements, identified by a composite key, each of whose elements is a foreign key drawn from a dimension table”.

Three grains define fact tables: Transaction-level, Accumulating Snapshot, and the Periodic Snapshot grain. Refer to my post referenced above (”Fact Tables”) for more information on granularity.

Subsystem 13 is responsible for the construction of all types of fact grains.

Transaction grain fact tables are always the largest and should contain data at the finest grain possible. This would include a line on an invoice, a patient diagnosis for a series of emergency room events, or even a Josh Beckett fastball. Data is almost always inserted (via bulk load) into transaction-grain facts. Deleting rows is dangerous (more on this in a second), and updates should only occur in order to correct an error or update some late-arriving key. In some cases, I have even gone back into a transaction grain fact table in order to update a calculated field.

Instead of deleting data in a transaction-grain fact table, “negate” the row instead. Otherwise, you run the risk that some of your historical reports and analysis will be inconsistent. To negate a row, simply add a new row to the fact, with identical foreign keys to the dimensions as the row you are negating, but update the metric so that when both rows are summed (or counted, etc…) they result in zero activity. For some metrics, this is tricky, but if you are using mostly additive and semi-additive metrics in your facts, you should be OK with this approach.

For the periodic snapshot grain, there are some differences in the loading process to consider. First, as these facts deal with aggregate data (based on daily, weekly, monthly, quarterly, etc…) time spans, it is appropriate to only load them when a period is complete. Secondly, extremely careful consideration for the grain of the periodic snapshot must be honored because it is too easy to apply an aggregate calculation inappropriately against the stated grain. Third, and as a compliment to my first point, it may be necessary to aggregate some running totals. If for example you are building a monthly periodic snapshot for a checking account at the bank, you would normally also provide data from the end of the last period to date.

Accumulating snapshots represent some process as it evolves over time. Therefore, rows in the fact are constantly updated to represent this evolution. For example, a patient may enter the ER, check in at triage, get assigned a room, see the nurse, see the doctor, etc. A building inspector may visit a new construction before work is done, when the foundation is set, when the framing is up, when the electrician as finished, etc. Loading this type of snapshot is much different from the other two, in that it tracks some defined process over time in a single row. There are many updates, and additional logic is needed to handle the loads.

SQL Server 2005 Integration Services (SSIS)

I find working with the various types of fact tables in SSIS to be rather simple. The key for success is usually in how the Control Flow is sequenced (facts generally come last), when dimensions are loaded, and where you do your dimension key lookups. In essence, all you need is a source containing your metrics (line items on an invoice pulled during the extract phase), a series of lookups (or joins, whichever provides the best performance in your situation), and some way to load the fact (bulk insert if possible).

Some advice: Do your UPDATEs in an Execute SQL Task. Do not try to use the OLD DB Command. Especially — and I mean especially - if you are doing a large amount of updates and your fact table is also large. To get this to work, offload the data you need for your updates into a separate staging table. Use the Execute SQL Task and write an appropriate UPDATE statement which joins your fact to your new staging table. The same approach holds true for any rows you might be deleting.

Hand Coding with Visual FoxPro (VFP9)

I find working with fact tables in FoxPro to be just as easy as in SSIS. There is no clear advantage of one over the other. One major benefit with VFP is that it is an excellent OOP language with very fast cursors (a VFP Cursor is much easier to work with than an SSIS Raw file). This gives you many more options on how you look up and populate foreign keys for the fact at load time.

If you are using VFP tables for your RDBMS, you will need to be careful for the 2 gig size limitation for your fact tables. Your Fact Table Builder may have to handle partitions manually. This is a major disadvantage with using the VFP database. Especially because 2 gigs is not a lot of space for daily transactions!

From here

In the next article, I’ll discuss ETL Subsystem 14: Surrogate Key Manager. Subsystem 14 helps to maintain the integrity among facts and dimensions for each business process, and is a complement to your Fact Table Builders.

Published May 26th, 2008

ETL Subsystem 12: Special Dimensions Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Each dimensional model that you create will have special design characteristics and requirements. This typically means that you may need to rely on specialized dimension tables. Not all dimensions will be “Product” or “Customer”. In fact, you will likely have more specialized dimensions in your models than actual normal, business dimensions.

There are several categories of special dimensions. They are:

  • Date and Time Dimensions
  • Bridges to support many-to-many relationships
  • Special Indicators and Flags Dimensions
  • Study or Research Group Dimensions
  • Mini-Dimensions
  • “Current” Dimensions
  • Lookup and other Static Dimensions
  • Administration Dimensions (such as special logs, monitors, and audits)

Much of the above has already been discussed in my posts to date (date dimensions, bridges, and lookup dimensions for example). So not to beat a dead horse, I’ll spend the next few paragraphs explaining those I haven’t really talked much about yet.

Special Indicators and Flags Dimensions
These dimensions, also referred to under some circumstances as “junk” dimensions, exist to store extra information about the fact when it occurred. I typically use these special dimensions to hold flags (for example, if a tax payer has a history of delinquency, if a certain diagnosis and age creates greater risk for death, etc..). I usually store these flags as a tiny int with three settings: -1 null, 0 off, and 1 on. For researchers, this table acts as a way to screen large amounts of data and focus efforts on a subset of events with certain characteristics.
Study or Research Group Dimensions
Whether I’m working with Qualitative Analysts or Quantitative Analysts, they always seem to need specialized research databases (and they all seem to like using SAS too!). In the past, I got in the habit of creating copies of the required business processes dimensional models for their use, filtered and sampled appropriately. These databases would be installed in an isolated environment. This isolation allowed them to develop models, conduct what-if analysis, and to manipulate the data to test theories. This gave the researches their own data to play with and everyone was happy. Another solution — a better solution — which I have only recently begun to implement with more regularity is creating a special study group and/or research dimension which defines the research being conducted (”Infection Research”, “buying patterns of customers in Alaska in January”, etc..). A bridge table between the subject (Customer? Product? Stock? Web Page?) links the study with the sample used. All queries for that study simply use the new special dimension as an additional filter. And everyone’s happy!
Mini-Dimensions
Mini-dimensions are necessary when a subset of attributes (typically related) need Type 2 change behavior tracking, the changes happen frequently, and the dimension is rather large (such as customer, webpage hits, or stock index characteristics). A specialized mini dimension is created with the required attributes, stored at a grain that represents a combination of attributes. This mini dimension does not contain a link back to the parent dimension, nor does it contain the natural or primary key of the dimension. Each row is instead given a unique key that is referenced alongside the parent dimension in the fact table. For a large Customer dimension, you may create a mini-dimension with income bands, year-to-date behaviors, job status, and other related financial and purchase information. This “profile” might be shared by hundreds or thousands of customers and change frequently. Therefore, the data you store would not be identifiable on its own to any single customer, but to all customers who share the characteristics.
“Current” Dimensions
There are times when large Type 2 SCD dimensions become too unwieldy for querying and other analysis; or, what you really want is a Type 1 dimension to do most of your work, but need to maintain Type 2 changes over time to accommodate some special need (like compliance). The solution is to copy out the “current” rows of a dimension (using a current flag or date range) into a second, special Type 1 dimension. I’ve had success using this technique using three methods: (a) create a role-playing view on the dimension using only the current rows (one challenge with this method is that you need to store a special key in the parent dimension so that you can use the view to relate to a fact table row); (b) by creating a new Type 1 dimension through ETL and placing a new foreign key to my table in the fact; and (c) by using the Type 2 dimension as a bridge table, linking to the special dimension through its natural key (which can be a view or a physical table created through ETL).

SQL Server 2005 Integration Services (SSIS)

Other than building the dimensions, there is nothing special from an SSIS point of view. You will generally handle the creation of the special dimensions in separate packages (dates and lookups for example) or as part of normal dimensional processing (which would be the case for mini-dimensions and current dimensions for example).

Hand Coding with Visual FoxPro (VFP9)

Same as with SSIS. Nothing special. If you can process a dimension, you can process a special dimension. Most of the onus falls on business requirements, data modeling, and overall data warehouse design. Non of which are inhibited by Visual FoxPro.

From Here

This subsystem is deigned to be a place-holder for handling special dimensions. I have found that a great deal of value found in a DW/BI solution actually stem from these. All those special flags, scoring results, quality assessments, and additional features that we’ve painstakingly added through ETL are now available.

In my next post in this series, I’ll discuss lucky subsystem 13: Fact Table Builders.

Published May 20th, 2008

ETL Subsystem 11: Hierarchy Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

This subsystem exists to ensure that hierarchies are appropriately translated and represented in the dimensional model. There are two types of hierarchies that you’ll need to contend with: fixed and ragged.

Managing hierarchies can be a complex process especially when you have hierarchies that are extremely ragged (for example, manufacturers’ parts or an organizational chart). You’ll also run into complexities when a dimension entity (like a single customer or a single product) exists in simultaneous hierarchies. I’ll talk more about these later in this post. In contrast, the fixed variety is easier to work with.

Hierarchies as Attributes of a Dimension

Generally speaking, you can think of hierarchies as many-to-one relationships. In the dimensional modeling world, these relationships are represented in a single table. This would include stores to regions to state, children to parents to grandparents, and greens to vegetables to produce. ETL Subsystem 11 seeks to maintain the integrity of these relationships.

To simplify even further, instead of thinking in “tree” structures, think in “lines” from the child up the hierarchy to the parent. This will help you build the dimensions to accommodate the relationships. In the simplest form: A single store is in a single region in a single state. This is actually a very interesting topic from a modeling perspective — one in which I’ll get to in more detail in a future posting.

state_region_storeDimensions are denormalized structures, which means that you will have many repeated elements. This is normal and desired. For example, a store region will be repeated for each store and state will be repeated for each region all in the same dimension (look left). This is normal and desirable in a dimension. The trick to getting this to work correctly is that the hierarchy must be represented as a single value with the dimension row’s primary (surrogate) key.

Data modelers who are used to 3rd Normal Form might look at the above and cringe. But remember: normalized models are for preventing data anomalies in a transaction environment. In a data warehousing environment, the rules are different. First, there are no opportunities for data anomalies due to data integration controls. Secondly, normalizing the data warehouse makes absolutely no sense from a usability perspective: it only complicates and slows down reports and queries.

The Ragged Kind

So fixed hierarchies are easiest to work with: got it. It is not so easy to work with variable and ragged hierarchies because of their variable depth. The classic example is an organization chart, where any employee can be at the top or at the bottom of the hierarchy. Knowing how deep the organization runs from any point is a challenge that usually requires self joins and bridge tables to represent the relationships.

I have always solved these types of hierarchy “problems” using “helper tables” in a dimensional model. Ralph Kimball wrote a great article a decade ago on this subject. Check it out for more details.

Helper tables look like bridge tables that sit between the fact and the dimension. They facilitate the representation complex hierarchal information. This design complicates user queries though, so be sure that helper tables are absolutely needed. It might be, for example, you only really need the manager’s name and not the entire chain of command with each employee. You don’t need a helper table for that (see the following code)!

SELECT
   c.FirstName AS empFirstName,
   c.LastName AS empLastName,
   e.title AS empTitle,
   COALESCE(m.mgrFirstName,'N/A'),
   COALESCE(m.mgrLastName,'N/A'),
   COALESCE(m.mgrTitle,'I am the boss')
FROM
    HumanResources.Employee e 
JOIN Person.contact c ON c.ContactId = e.EmployeeID 
LEFT JOIN (
   SELECT
      e2.EmployeeID,
      c2.FirstName AS mgrFirstName,
      c2.LastName AS mgrLastName,
      e2.title AS mgrTitle 
   FROM
      HumanResources.Employee e2 
   JOIN Person.contact c2 ON c2.ContactId = e2.EmployeeID  ) 
  AS m ON m.EmployeeID = e.ManagerID

I wrote a while ago on ragged hierarchies from a programming perspective. Take a look at that post for more details.

Date Hierarchies

For another example, let’s look at a common hierarchy we can all relate to: year, quarter, month, and week. You can see how this hierarchy is modeled by looking at any date dimension. As you’ll see, each day contains information about how it is grouped on the calendar. The information is repeated for each day until one of the groups change. When designing reports that allow your users to drill down into the data, it is a common approach to start at the highest group (sales by year) and then look a bit deeper as necessary (sales by quarter or sales by month, for example).

Also, in the date dimension, weeks don’t line up too well within a year, quarter, or month. This is a classic example of how one group can fit entirely or partly into another. You see this often when a sales region crosses multiple states, or when an employee serves multiple roles within the company. These are all situations that must be accounted for by your hierarchy manager. For the date dimension, one technique I’ve adopted is to include the week of the year number, week of the quarter, and week of the month in the date dimension to give users the ability to drill into weekly data much easier (for example, you might want to measure holiday sales in the US from the 3rd week of November to the 4th week of December).

Snowflakes and Hierarchies

SnowflakeSnowflakes are usually a sign that a hierarchy has been normalized. This is bad. Don’t fall into the trap! In order to keep the dimensional model as simple as possible, you should avoid snowflaking; however, snowflake designs are perfectly legal under certain circumstances. Carefully examine the reasons though. If you are snowflaking to accommodate a hierarchy, hold your horses. Hierarchies are a natural part of a dimension. In fact, most things in life are categorized and need to be grouped in some way. Remember that one of the primary purposes of delivering a denormalized dimension is to remove almost all complexity from the user’s perspective. This usually means hierarchies.

SQL Server 2005 Integration Services (SSIS)

Denormalizing hierarchies using SSIS is not difficult. The hardest part is usually in writing the SQL that correctly fetches the right data. In a Data Flow, you can use a series of Lookup and Merge components. In some cases, especially for the more complex ragged hierarchies, your best bet is to use SQL statement in an Execute SQL Task.

Although I won’t get into details in this post, CTEs (Common Table Expressions) are excellent for working with recursion and hierarchies.

The real joy of hierarchy management using the SQL Server Business Intelligence suite comes at the end when you want to start using your hierarchies to allow users to drill down into the data. Even if you have complex ragged hierarchies structured with bridge tables in your model, Analysis (SSAS) and Reporting (SSRS) Services can really make sense of it all. But I’m digressing; this series of posts is about ETL and not presentation tools!

Hand Coding with Visual FoxPro (VFP9)

It’s easy to write recursive functions in Visual FoxPro. Recursion is one of the “secretes” to flattening out hierarchal structures. For more details on writing recursive functions in FoxPro, check out my post “Ragged Hierarchy Alert” or even better, visit the FoxWiki page on the subject.

From Here

In the end, as I continue to compare SSIS with VFP on data integration, I find that hierarchy management is about equal between the two. FoxPro seems to perform much better but I have no benchmarking (yet) to prove it. One advantage with SSIS is being able to utilize CTEs.

This was a long one, but a heavy subject. In my next post, I’ll discuss the Special Dimensions Manager, ETL Subsystem 12. Not as heavy and hopefully not as long!

Published April 25th, 2008

ID Lookup Custom Transform (Part 3 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In Part 2, I walked through AcquireConnections and PreExecute. In this post, I’ll discuss ProcessInput. Please read the introduction , part 1, and part 2 if you haven’t already!

ProcessInput

ProcessInput accepts each row from the incoming buffer. This buffer is available via a parameter apply named “buffer” of type PipelineBuffer. The buffer contains each column defined in the column collection (IDTSInputColumnCollection90) of the input.

For ID Lookup, each row from the buffer is examined, and based on properties set by the user, appropriate action is taken.

The first step is to see if I can find an ID in the master lookup table based on key fields in the incoming data file. This functionality is controlled by a series of properties on the component. Normally, this search would be conducted on one or more input columns (lname, fname, and dob for example). The developer, after having done a data profile on the source file, will know exactly what columns will be used in the lookup. The code to build the SQL statement is defined like so:

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
 
  bool nullOutput = false;      // if the input contains a NULL, we want to reject the row     
 
  if (!buffer.EndOfRowset)
  {
    while (buffer.NextRow())
    {
      string strSelectWhere = "";   // to store the where clause for the search
 
      // if the columnIndex is in the input string. Recall that I figured this out in PreExecute
      if (indexMatchValue > -1) 
      {
        // build a where string for the search
        if (buffer.IsNull(indexMatchValue))
          nullOutput = true;      // fail a search on NULL as an UNMATCHED row
        else
          strSelectWhere += " lname='" + buffer.GetString(indexMatchValue);
      }

The block of code starting with line ” if (indexMatchValue > -1)” is repeated several times, with some variation, for many additional properties set by the user. The purpose is to build up a where string (stored in strSelectWhere) for the search to be made on the data table (dtMstr). This data table was filled in the PreExecute method.

Before going any further, I want to see if there are any nulls in any of my search columns. If so, I want to redirect to one of my outputs (specifically, IDLookupNoMatchOut), and move on to the next row:

      if (nullOutput == true)
      {
         buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
         continue;
      }

Once the strSelectWhere statement is built, I try to find a match in the lookup table:

      DataRow[] aDR;
      aDR = dtMstr.Select(strSelectWhere);

I had my doubts about the speed of this approach. I really thought that the Select method, repeated for each row in the buffer, would kill my grand idea. But I was pleasantly surprised to see this perform well beyond my expectations. Someday as time permits, I’ll try benchmarking it.

Based on the results from the Select, the logic flows along the following set of actions:

  • (1) If exactly one match is found in the lookup table, use buffer.DirectRow(int outputID) to output the row to the “IDLookupOut” Output.
  • (2) If more than one match is found, the candidate key in the source file is a duplicate in the master data table. Use buffer.DirectRow(int outputID) to output the row to the “IDLookupDuplicateOut” Output for later handling.
  • If no matches are found, do another Select on a historical lookup table in a different database. The code for this is not shown to keep this posting simple.
    • (3) If exactly one match is found in the historical lookup table, direct the output to both the “IDLookupOut” Output and to IDAddToMasterData Output. The latter will allow me to populate the current master data with the ID found so that we can maintain referential integrity in the database.
    • (4) If more than one match is found, direct the output to the “IDLookupDuplicateOut” Output for later handling.
  • (5) If no matches are found in the local or historical data, direct the row to the “IDLookupNoMatchOut” Output.

For components that conditionally direct inputs to multiple outputs (like a Conditional Split or my ID Lookup component), you need to use buffer.DirectRow to tell the component what output to use.

Here’s the code:

      if (aDR.Length == 1)
      {
        // (1) One record found. Action: Direct row to IDLookupOut
        buffer.SetString(indexIDValue, aDR[0]["id"].ToString());   // the column to put the ID
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupOut"].ID);
      }
      else if (aDR.Length > 1)
      {
        // (2) More than one entry found. Action: Direct row to IDLookupDuplicateOut
        buffer.SetString(indexIDValueDupe, aDR[0]["id"].ToString()); // store dupe ids 
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupDuplicateOut"].ID);
      }
      else if (aDR.Length == 0)
      {   
        if (Allow3rdPartyLookup== true)
        {
          // (3) Search the historical database, if single match found Action: 
          //       Direct row to IDLookupOut
          //       and Direct Row to IDAddToMasterData 
          // (4) If duplicates found in 3rd party database, Action: 
          //       Direct row to IDLookupDuplicateOut
          // (5.a) Nothing found. Action: 
          //       Direct row to IDLookupNoMatchOut
        }
        else
        {
          // (5.b) Nothing found. Action: Direct row to IDLookupNoMatchOut
          buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
        }
      }
    }
  }
}

Notice my use of buffer.SetString() to plug in the ID once I find it. I also add additional information to the pipeline (which is not shown in the code examples) to identify how the match was derived.

Conclusion

And there you have it! I ended up leaving some of the code out. As I said, this is business logic and proprietary. I did want to show, though, how I went about solving a fairly complex problem using a custom component. Again, I did have the benefit of examining a fully functional set of packages that had implemented all the logic in a Data Flow. This helped immensely, as the logic was plain as day.

So in the end, the new component makes creating new packages easier by only having to drop a single component on the Data Flow. There are a handful of properties to set (around 6 depending on the incoming data) and four Outputs to account for. This component also performs much, much better. My estimates put it at around 600% over the Data Flow approach.

Lastly, while writing these entries, I forced myself to examine the code in a different way (i.e. for public consumption and scrutiny). This made me think harder about the logic and how I used C#. The exercise was quite valuable and I ended up learning a lot in the process. With that said, please feel free to identify any mistakes I might have made, or more importantly, any places where you think improvements can be made. Thanks for reading!