Posts Tagged SCD

ETL Subsystem 16: Late Arriving Data Handler

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Sometimes, not all dimension data is available when a transaction (i.e. business event, such as an order) comes into the Warehouse. This is a fact of data warehousing life. Under normal circumstances, you process dimensions first so that you have all your keys ready for inserting into the fact. When dimensional data is unavailable, missing, or incomplete you end up with a bit of a problem. NULL foreign keys in the fact table are not allowed.

Missed the trainOther times, your fact comes in late. In this scenario, you have all your dimensions, but now you need to insert a fact into the fact table and link it to existing dimension rows that were active at the time of the fact. Late arriving facts are easier to work with than late arriving dimensions, in my opinion. I already discussed some techniques for handling this in my post “ETL Subsystem 14: Surrogate Key Manager“.

I find late arriving dimensions (also called “early arriving facts”) to be a more difficult problem. There are a few of good solutions though, all of which are handled during ETL:

  • Hold onto the fact until all dimensions arrive
  • Create a dimension called ‘unknown’ or ‘not available’ with a primary key of -1 and use that
  • Insert a dummy row in the dimension and populate it with whatever you can

None of these options are terribly appealing across all scenarios; however, each has a place under certain circumstances.

Hold Onto the Fact

Holding onto the fact in your staging area may be a great option if, for example, you expect the late arriving data to be coming in soon. Perhaps you process data from multiple source systems. System A contains the transactions, while systems B, C, and D combine to give you your dimensions. Because of your extraction windows, you may not be able to get data from system D until later in the morning, meaning that the business events from system A are missing some context. Holding onto these events and waiting for the data from system D is the appropriate action!

Even if the situation is not predictable, holding onto facts might be the best option. If you are loading daily stocks, for example, and for some reason a particular company is not available (a new company perhaps), but the stock is showing up in your extracts, you should hold onto that stock in your staging area until the company is available. You don’t want that stock in the warehouse until the full details of the company comes in (which could be available in a separate extract, or may need to be manually entered).

As you can see, deciding to work this way has implications on the ETL processes and staging area. Planning for these circumstances up front is a must.

The ‘Unknown’ Dimension

Creating an empty row in the dimension called ‘Unknown’ is also an option that is good under some circumstances. If you are processing orders and no information about a promotion comes in, then it would be safe to link the fact to a special row in the dimension that denotes that no promotional information was available. Depending on your dimension and business requirements, you could have many different levels of ‘unknown’.

Here are a few suggestions:

  • -1, ‘none’
  • -2, ‘unknown’
  • -3, ‘not applicable’

If you use key -1 in your fact row for the foreign key to the promotion table, you allow your users to write queries and reports for all orders that were made when there was no promotion (promotion.name = ‘none’). The same is true for all facts that have an unknown promotion status, meaning there could have been a promotion, but the data was incomplete. Lastly, there are cases — if you do online orders for example — where promotions simply don’t apply.

You can expand these keys heading backwards with lots of detailed descriptions telling you why a dimension might not exist for an event. Don’t go overboard though! Having a few options is a good idea for data quality and analysis purposes. But too many may overload users with too many choices all too similar to be useful.

This approach works well when the likelihood of acquiring the context later is slim-to-none. If you use this method, it gets very complex if you later do in fact find that there was some context (i.e. a promotion). Your only option then is to revisit your facts and look for, and update, the fact rows corresponding to the late arriving data. On the bright side, if you’ve already tagged these rows with a foreign key of -2, for example, you may be able to focus only on those facts — saving you some process time.

Inferring a Dimension

In the company stock example I gave above, you have another option. If the company details of a stock is not known — and all you have is a SEDOL or Ticker — then simply insert a new row into the dimension with all of the information you know to be true about the dimension. This only works if you have enough details about the dimension to construct the natural key. Without this, you would never (although, I’ve learned to never say never) be able to go back and update this dimension row with complete attributes.

To make this work, and if you are using Type 2 SCDs, you will need another attribute in your dimension. Call it “inferred” or similar, make it a bit or char(1), and mark it as true (1, or “Y”).

This has a major implication on your dimensional processing. Now, before you process any Type 2 changes on any dimension that contains an attribute called “inferred”, you need to check if the row you are ready to work on is in fact inferred. If so, treat the entire row as Type 1. Because it is inferred, we don’t have the details anyway. If we went along and processed the row as a Type 2, then your fact will continue to point to an essentially empty row.

SQL Server 2005 Integration Services (SSIS)

The SCD Component handles the inferred dimension approach quite nicely. In fact, “Inferred Dimension” is a term introduced by SSIS (I believe). I talked briefly about the SCD Component and inferred dimensions in my post “ETL Subsystem 9: Slowly Changing Dimensions (SCD)” so I won’t repeat myself here. The bottom line is that SSIS has a very elegant solution to the late arriving dimension problem. If you have them — and almost every data warehouse has late arriving dimensions — then using the SCD Component and the inferred dimension option is must. Also check out this external link for more inspiration: “Handling Late-Arriving Facts“.

Hand Coding with Visual FoxPro (VFP9)

While SSIS has built-in capabilities to handle the inferred rows, VFP does not. This is not surprising of course because VFP is a language and not an ETL tool! With that said, writing SQL to accommodate the “inferred” flag in a dimension row is not a big deal. The processing logic for a SCD Type 2 dimension then becomes (in pseudo-code):

 
For each row
  If any attribute marked as 'Type 2' changes AND inferred = "N" Then
    update status of active/current row
    insert into dimension a new row
  End If
  For each attributes marked as a 'Type 1' change OR inferred = "Y" Then
    If attribute is subtype 1
        update all rows that match the key
    Else
        update only the current/active row that matches the key
    End if
  End For
End For

I discussed the above pseudo-code in more detail in my post “Loading the Slowly Changing Dimension“. Check that out for more insight into some of the logic. I should mention again that this is a very basic structure and not intended to be comprehensive or de jure.

From Here

Enough about late arriving data! Late arriving data, changed data capture, and surrogate key lookups might be the most challenging aspects of ETL — especially when it comes to performance and efficiency. It will take you some time to get these right. You may — like me — be forever tweaking these areas to squeeze every last second out of the process.

Next, I’ll move on to discuss Dimension Management. A much lighter topic!

Tags: , , , , , , , ,

1 Comment

ETL Subsystem 15: Bridge Table Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Before I can discuss this subsystem, I need to discuss bridge tables. This is a topic I haven’t covered much until now. So to help me illustrate how bridge tables are used in a dimensional model, I’ll present a case for a real-estate property transaction.

Sometimes the defined grain of a fact has multiple dimension rows that characterize it. These scenarios are referred to as many-to-many or multi-valued dimensions. A patient encounter at a hospital might have multiple procedures. Multiple bank accounts might belong to a single person. In the case that follows, I’ll show how multi-valued dimensions help support property sales (i.e. a real estate transaction).

I would like to make one important point before I begin: If you can thwart the need for bridge tables by redefining the grain of your fact table to be more atomic, then you should do so. Bridge tables add complexity to the model, ETL process, and the end-user application layer. With that said, multi-values bridge tables in many environments are unavoidable and absolutely necessary.

Case Study: A Real Estate Transaction

In the case of a property transaction (aka a property “sale”), you are potentially dealing with several many-to-many relationships. First of all, you may not be buying alone (one-to-many buyers). You may also not be buying your house from a single person (one-to-many sellers). Your house may be on two parcels of land or no land at all if you are buying an apartment or condo (none-to-many parcels). A barn or in-law apartment may be included or perhaps you are just buying land to build your dream home later (none-to-many buildings). These and other complexities make the case for bridge tables to represent the many-to-many relationships inherent in a complex transaction such as this.

When you buy land or a house (in the US, at least!) you must “record” the sale with the clerk of the municipality in which the property exists. This recording becomes part of the chain of ownership and tax record for the property. This recording can accommodate multiple grantors (the people or organizations selling the property), grantees (those who are buying the property), and properties (multiple parcels, land lines, lots, buildings, outbuildings, units, etc). This recording is given a book and page and filed away in a database. A hard copy is typically stored in a physical book in the clerks’ vault as well. It’s pretty fun (for some people who are into history) to go to your clerk’s office and peruse the vault. Anyway…

What follows is a simplified ER model from an OLTP system I designed several years ago. This model represents the complexities of the above case in a transaction environment. It has been simplified to only show the more important business entities:

Property Sales Schema - OLTP

Note that in the above OLTP system, table Real represents land and building values from tax billing software. These figures have been aggregated into a single row (i.e. total land value and total building value, number of buildings, total acres, etc.). For the above model, knowing the exact building or land line is not important. For the Data Warehouse, though, we want and need this atomic detail for some detailed analysis. The Deed table represents the actual recording (book, page, date, time, etc.). The bridge table ND_link contains grantor and grantee information. NR_link represents ownership history between the names and entities in Namefile and the aggregate property information in table Real.

The business process dimensional model could therefore be modeled as follows:

Business Process Dimensional Model for a Property Sale

As you can see, there are bridge tables between the SaleFact and DimBuilding, SaleFact and DimLand, and SaleFact and DimPerson dimensions. I typically name these tables by including the word “Group” in its name (LandGroup, BuildingGroup, etc.) but you could use the word “Bridge” or any other term that is consistent throughout your models. These bridges define the many-to-many relationships that are normal for a Real Estate transaction. Because a property sale is a natural business event, and cannot be logically defined to be any more atomic, multi-value bridge tables are a must.

In the above schema, a particularly astute and observant modeler will notice the many-to-many relationship between the keys in the fact and bridge. FoxPro, which is where I put together these examples, will not allow a many-to-many join between two keys. So, I Photoshopped the graphic in order to show you the logical representation! In FoxPro, and in many other database systems, you will need another table in between the Group and the Fact with a single attribute primary key. This would represent the physical implementation of the above design. You would therefore need two physical tables to represent the many-to-many relationship. I try my best to avoid this. In FoxPro, for example, I would simply not create the relation at all! Or in some cases, the bridge group would be unique for each event, making the foreign key in the fact table always unique. Being unique, it can be defined as a candidate key and then happily joined to the bridge table (more on this in the section below on “special considerations”).

Bridge Table Characteristics

Bridge tables are not just janitors or pianos (well, they do have all they keys!). Most bridge tables also contain attributes. For the above case, I include two attributes in each bridge: weight and primary indicator.

Weights are used to allocate metrics in the fact. Weights always add up to 1 across the group. If you need to allocate the sale price across all buildings involved in the sale, you would use the weight. Weights are calculated as part of ETL process and are an important part of the mechanisms of Subsystem 15. Sometimes, it is perfectly OK to divide 1 by the number of group records to get the weight (applying any fraction to the first row perhaps). But more often than not, there is a more complex algorithm to determining the weight. For land, you might base the weight on lot size or price code. For buildings, you may base the weight on the replacement cost of the building, or you might use square footage. All of these are defined by the business and applied during ETL.

I use primary indicators because some towns only care about the primary grantor, grantee, building, or parcel for analysis. One row in the bridge table will be identified as ‘primary’. During ETL, the primary row is identified by using a sequence number (number 1 = primary) or by determining the highest valued property. Sometimes, its arbitrary, in that you just mark the first one encountered as primary. These decisions are part of your business requirements and must be carefully implemented during data transformation.

Other Considerations

Guess what? Your bridge table is really a type of fact table. In the above schema, LandGroup is a sort of fact with some semi-additive metrics that represent the relationship between a parcel of land and a sale.

In my example, a sale is a static event. The buildings, land, and people involved will change over time, but the groupings used at the time of the sale will remain the same. For this reason, there is no need to make the group tables Type2 dimensions. If we ever need to change a value in one of these bridge tables, then it is likely to correct a data problem and not to represent a changing dimension. Additionally, in this scenario, it would be a safe assumption — especially because there are weighting factors and primary indicators involved in each bridge — that each grouping is unique to the fact. For modeling purposes, you could avoid the many-to-many relationship issue discussed in the previous paragraphs by making the bridge keys in the fact table candidate keys. There would be a unique set of group keys for these bridge tables. But be cautioned: The volume of property sales is small in any given year for any given municipality. You’ll never see a million transactions in a year. This makes the candidate key approach attractive. If you are dealing with patient diagnosis across multiple hospitals then your bridge tables will become too large. In these cases, you’ll need to consider re-using groups. The consequence of reusing groups is that the key in the fact table can no longer be a candidate.

To be clear: Bridge tables can also be a Type 2 SCD. If this is needed, and tracking bridge table changes over time is important, then you will need to combine bridge building with your SCD logic. You will need activity dates (from and to), and a current row indicator.

As I said at the top: If you can thwart the need for bridge tables by redefining the grain of your fact table to be more atomic, then you should do so. Bridge tables are not for the faint of heart!

SQL Server 2005 Integration Services (SSIS)

Bridge table building is largely a modeling and business rules problem. There is nothing in SSIS that will prohibit you from implementing even the most complex bridge table design. Most difficulties will come when managing the keys and implementing weighting factors. For the keys, you can use the same techniques described in my posts about Subsystem 10 and Subsystem 14. For weightings, I’ve implemented weighting algorithms using both stored procedures (Execute SQL Task) and VB (Script Component) when appropriate.

Hand Coding with Visual FoxPro (VFP9)

The above case is taken from an actual implementation I created in Visual FoxPro. I must say, it was rather simple once I ironed out the nuances and had my weighting algorithms in place. Developing the bridge table builder subsystem was a simple matter of inserting and linking the fact with the multi-valued entity and applying the weight and primary indicator logic. The resulting analytics that can be derived now far outshines that which could be done with the original OLTP model. Not to mention, the dimensional model has atomic detail (not like the aggregate Real table shown in the first schema) which allows for more slicing and dicing. But I digress!

From Here

Next post, I’ll discuss Late Arriving Data (Subsystem 16). That’s a big topic, so I’ll distill it as much as possible to keep it brief!

Tags: , , , , , , , ,

10 Comments

ETL Subsystem 14: Surrogate Key Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Surrogate key management is an important part of designing an ETL system using the Kimball Methodology. Slowly Changing Dimensions require the use of surrogates because natural keys, like a Customer ID, Product ID, and so on, will be repeated in dimensions whenever Type 2 changes are tracked. Surrogates are also desirable because they are single-part and fast for joins. I’ve already discussed the need for surrogates in several posts. Here, I’ll talk about some techniques for replacing the natural keys that come in from the source systems with the new surrogate keys that exist in the data warehouse.

The idea is simple in theory: For each dimension row, you generate a surrogate key. This surrogate key replaces the row’s candidate key and is used to link to one or more Facts.

For example, the candidate for a Customer dimension may be CustomerID + ActiveFromDate. The ActiveFromDate is used to sequence the row if Type 2 change tracking is used. It is entirely possible to have the same CustomerID listed dozens of times in the dimension. The differentiator is the ActiveFromDate. This combination is guaranteed to be unique in the Customer dimension (although you would not normally enforce this using an index, you would certainly enforce this through ETL). This is a simple example. When modeling Parts or Real Estate, you may have several attributes combining to form a candidate key. This also becomes a bit more complex when Customer data is refreshed throughout the day — meaning you cannot rely on the date alone, you also need to consider time.

In dimensional modeling, you will normally process dimensions before facts. Because of this, you will have all surrogates defined for each row you need. In a subsequent post, I’ll talk more about situations where you load a fact before a dimension, or situations when late arriving data must be handled. For now, assume that all dimensions have rows that correspond to a fact. When you process the fact, you simply need to look up the surrogate key assigned in the dimension and use that key as the foreign key to that dimension in the fact.

Dimensional modeling essentially requires surrogate keys, and does so by taking a performance hit during ETL. It takes time to look up surrogate keys. You will spend a lot of time yourself making this process as fast and efficient as possible. I do have a few suggestions, though:

Load by the Day

Process facts one day at a time (all of yesterday’s data plus any changed data or late arriving data from previous days). Doing so allows you to join your staging fact row by the natural key in the dimension for that day. For example, in SQL, you could acquire all dimensions that are valid for a particular day by doing the following:

SELECT DISTINCT NaturalKey 
    FROM Dimension 
    WHERE DataDate BETWEEN ActiveFromDate AND ActiveToDate

This returns a subset of the dimension for the day in which the fact occurred (DataDate). If the candidate key involves more than the natural key, you will need to adjust the above SQL to be more robust. For example, you may have the name and birthdate of a patient at a hospital with an insurance number and insurance company code that uniquely identifies the row! That’s six attributes (if you include the ActiveFromDate) that uniquely identifies the row when combined.

With the above, you can incorporate the query into a JOIN on the staging fact data (by selecting all facts that occurred on DataDate), or you could use it as the source for a lookup operation. By narrowing the focus to be a single day (and you load by the day), you eliminate the need to worry about SCDs.

You can further enhance the above by maintaining special staging tables for each of your dimensions. These staging tables contain all candidate attributes and the surrogate key for each dimension. These special staging tables are maintained while you process your dimensions. Instead of selecting against the actual, production dimension, you can use the staging table. Although there is a little extra overhead in maintaining the staging tables, you get it all back — and then some — when you do your key lookups.

If you need to load by time of day (essential for real-time or intra-day ETL), you can still use the above method, except that you will need to join your fact data on time of day as well.

You can get around using BETWEEN if you are processing the most recent data. Instead, you could filter results by the dimension’s “current” flag. The current flag is set while processing the dimension. Only one row, for particular natural key, can be marked as current. A predicate on a current flag would be faster than checking between dates. A common trick is to split the rows into two streams: One with current data (anything that happened yesterday), and one with historical data (anything that happened the day before yesterday and back). Use the current flag for the current data, and use the BETWEEN for historical data.

I find that this technique works well for small dimensions with a low number of facts (a few thousand transactions a day). With a good index plan, this might be all you need to do.

Store the Surrogate Right Away

Another technique worth mentioning is particularly useful when dealing with very large dimensions and lots of facts. You should also consider this technique if you are manually generating your surrogates (see my post “ETL Subsystem 10: Surrogate Key Generator” for specifics). Basically, when you generate your surrogate key for a dimension row, store it with the fact data at the same time.

This technique requires writing the surrogate to the dimension and the fact staging table. Using an UPDATE statement is a simple approach, but a more exotic and better performing method would be to INSERT the new keys into another staging table, and joining to the staging fact table later on.

Either way, there is a performance hit with storing they key right away, so be sure to thoroughly review your environment and needs. I have used this technique before with great success, but it took me a long time to determine it was the best approach. In fact, my original design worked great until the dimension grew to a few million rows and the transaction load doubled. Doing a lookup using BETWEEN proved to be too slow compared with saving the key immediately.

SQL Server 2005 Integration Services (SSIS)

Jamie Thomson wrote an article on how to use SSIS to update fact tables (actually, most of it is a case study written by “Mag”). I won’t repeat what was in that post. Have a look. Jamie’s technique is basically what I have come to adopt myself. You’ll notice the liberal usage of Lookups instead of Joins (joins, at least for me seemed like a more logical starting point). Lookups perform better for a few reasons. The Merge Join operation requires a Sort — either by using the Sort Component or by ORDERing the data through the Source Component. This sorting can be expensive. Lookups, in addition to not requiring that the input be sorted, can use caching and limit access to the database. For these reasons, I use Lookups now almost exclusively.

When using the Lookup approach you must be aware of two possible conditions: The Lookup could return more than one match if you don’t properly define your candidate key. Also, if no match is found, common for late arriving data, you need to either divert the errors and handle them separately, or convert the NULLs to your default “unknown” value for each dimension. For me, I typically use negative numbers to define various types of null (-1 for “not known”, -2 for “not applicable”, etc…).

If you process by the day, use staging tables as dimension lookups, and rely on Lookups instead of Sorts and Merge Joins, then fact processing should go quite well.

Hand Coding with Visual FoxPro (VFP9)

As I discussed in my post “ETL Subsystem 10: Surrogate Key Generator “, VFP’s SEEKing capability far outshines set-based approaches to this problem. Check out my comments in that posting for some ideas on how to use VFP to get the surrogate keys for your fact data.

Not to get lost though in the wonders of SEEK, VFP is perfectly suited for set-based approaches as well. So all of the methods discussed in the article (Joins and lookups) can be implemented easily in FoxPro. As I wrote in that post, VFP really shines for this type of task.

From Here

Next post, hopefully later this week, I’ll discuss bridge tables and how to best build them. I have an interesting case study to walk through. I have refrained from using cases in my ETL subsystem posts, but for bridge tables, I am making an exception!

Tags: , , , , , , ,

2 Comments

A Data Warehouser’s Vocabulary (Part 2)

This post is part 2 (read part 1) of a series of posts containing a glossary of terms and concepts that I feel has some relevance to the data warehousing and business intelligence world. Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! When finished, these terms will be compiled and made a static page on TmF.

Aggregation
The process of redefining data into a summarization based on some rules or criteria. Aggregation may also encompass de-normalization for data access and retrieval.
Analytical Processing
Producing analysis for management decisions, usually involving trend analysis, drill-down analysis, demographic analysis, profiling, and so on.
Attribute
Any detail that serves to qualify, identify, classify, quantify, or express the state of an entity.
Data Mining
The process of analyzing large amounts of data in search of previously undiscovered business patterns.
Dimension
A denormalized table in a dimensional model with a single part primary key and descriptive attribute columns.
Event
A signal that some activity (usually a business transaction) has occurred.
Fact
Central table of a Star Schema which numeric performance measurements identified by a composite key, each of whose elements is a foreign key drawn from a dimension table.
Heuristic Analysis
Heuristic Analysis is a method to help to solve a problem, commonly informal. It is particularly used for a method that often rapidly leads to a solution that is usually reasonably close to the best possible answer. Heuristics are “rules of thumb”, educated guesses, intuitive judgments or simply common sense.
Online Analytical Processing (OLAP, also MOLAP)
On-line retrieval and analysis of data to reveal business trends and statistics not directly visible in the data directly retrieved from a data warehouse. Also known as multidimensional analysis.
Outrigger
A secondary dimension table attached to a dimension table. An outrigger is not used to normalize a dimension.
Relational OLAP (ROLAP)
“Relational” OLAP, in which the OLAP processes use a relational, normalized model for its source.
Slowly Changing Dimension (SCD)
The tendency for dimension attributes to change gradually or occasionally over time. The techniques for handling these changes include Type 1 (overwrite), Type 2 (keep history), and Type 3 (alternate realities).
Snowflake
A normalized dimension where a flat, single dimension table is deconstructed into a tree structure with potentially many nesting levels. Snowflaking a dimension generally compromises user understandability and browsing performance.
Snowflaking
The (undesirable) act of normalizing a dimensional model.
Star Schema
A generic representation of a dimensional model in a relational database in which a fact table with a composite key is joined to a number of single level dimension tables, each with a single primary key.

Tags: , , ,

1 Comment

ETL Subsystem 12: Special Dimensions Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Each dimensional model that you create will have special design characteristics and requirements. This typically means that you may need to rely on specialized dimension tables. Not all dimensions will be “Product” or “Customer”. In fact, you will likely have more specialized dimensions in your models than actual normal, business dimensions.

There are several categories of special dimensions. They are:

  • Date and Time Dimensions
  • Bridges to support many-to-many relationships
  • Special Indicators and Flags Dimensions
  • Study or Research Group Dimensions
  • Mini-Dimensions
  • “Current” Dimensions
  • Lookup and other Static Dimensions
  • Administration Dimensions (such as special logs, monitors, and audits)

Much of the above has already been discussed in my posts to date (date dimensions, bridges, and lookup dimensions for example). So not to beat a dead horse, I’ll spend the next few paragraphs explaining those I haven’t really talked much about yet.

Special Indicators and Flags Dimensions
These dimensions, also referred to under some circumstances as “junk” dimensions, exist to store extra information about the fact when it occurred. I typically use these special dimensions to hold flags (for example, if a tax payer has a history of delinquency, if a certain diagnosis and age creates greater risk for death, etc..). I usually store these flags as a tiny int with three settings: -1 null, 0 off, and 1 on. For researchers, this table acts as a way to screen large amounts of data and focus efforts on a subset of events with certain characteristics.
Study or Research Group Dimensions
Whether I’m working with Qualitative Analysts or Quantitative Analysts, they always seem to need specialized research databases (and they all seem to like using SAS too!). In the past, I got in the habit of creating copies of the required business processes dimensional models for their use, filtered and sampled appropriately. These databases would be installed in an isolated environment. This isolation allowed them to develop models, conduct what-if analysis, and to manipulate the data to test theories. This gave the researches their own data to play with and everyone was happy. Another solution — a better solution — which I have only recently begun to implement with more regularity is creating a special study group and/or research dimension which defines the research being conducted (”Infection Research”, “buying patterns of customers in Alaska in January”, etc..). A bridge table between the subject (Customer? Product? Stock? Web Page?) links the study with the sample used. All queries for that study simply use the new special dimension as an additional filter. And everyone’s happy!
Mini-Dimensions
Mini-dimensions are necessary when a subset of attributes (typically related) need Type 2 change behavior tracking, the changes happen frequently, and the dimension is rather large (such as customer, webpage hits, or stock index characteristics). A specialized mini dimension is created with the required attributes, stored at a grain that represents a combination of attributes. This mini dimension does not contain a link back to the parent dimension, nor does it contain the natural or primary key of the dimension. Each row is instead given a unique key that is referenced alongside the parent dimension in the fact table. For a large Customer dimension, you may create a mini-dimension with income bands, year-to-date behaviors, job status, and other related financial and purchase information. This “profile” might be shared by hundreds or thousands of customers and change frequently. Therefore, the data you store would not be identifiable on its own to any single customer, but to all customers who share the characteristics.
“Current” Dimensions
There are times when large Type 2 SCD dimensions become too unwieldy for querying and other analysis; or, what you really want is a Type 1 dimension to do most of your work, but need to maintain Type 2 changes over time to accommodate some special need (like compliance). The solution is to copy out the “current” rows of a dimension (using a current flag or date range) into a second, special Type 1 dimension. I’ve had success using this technique using three methods: (a) create a role-playing view on the dimension using only the current rows (one challenge with this method is that you need to store a special key in the parent dimension so that you can use the view to relate to a fact table row); (b) by creating a new Type 1 dimension through ETL and placing a new foreign key to my table in the fact; and (c) by using the Type 2 dimension as a bridge table, linking to the special dimension through its natural key (which can be a view or a physical table created through ETL).

SQL Server 2005 Integration Services (SSIS)

Other than building the dimensions, there is nothing special from an SSIS point of view. You will generally handle the creation of the special dimensions in separate packages (dates and lookups for example) or as part of normal dimensional processing (which would be the case for mini-dimensions and current dimensions for example).

Hand Coding with Visual FoxPro (VFP9)

Same as with SSIS. Nothing special. If you can process a dimension, you can process a special dimension. Most of the onus falls on business requirements, data modeling, and overall data warehouse design. Non of which are inhibited by Visual FoxPro.

From Here

This subsystem is deigned to be a place-holder for handling special dimensions. I have found that a great deal of value found in a DW/BI solution actually stem from these. All those special flags, scoring results, quality assessments, and additional features that we’ve painstakingly added through ETL are now available.

In my next post in this series, I’ll discuss lucky subsystem 13: Fact Table Builders.

Tags: , , , , , ,

1 Comment

Live from Kimball University: Day 2 (SCDs, the Mini, Modeling Process)

I am thoroughly exhausted! Margy followed up day one with another day packed with information.

We went through the various types of Slowly Changing Dimensions (SCD) and even talked about Type 6 SCDs, which I’ll get to in a bit. We then rolled right into mini dimensions and other related concepts. After that, we talked about the modeling process, different types of fact tables, and then finally how to deal with a mature data warehouse.

Slowly Changing Dimensions

Although I’ve posted a few entries (1 2) on the SCD Types, I’ll go over them once again here, but hopefully in a slightly new light. We talked specifically about:

Type 1: Overwrite the Attribute
In short, when an attribute’s value changes, we overwrite the old value with the new value. This technique does not preserve history. While it might be easy to implement, its use must be fully justified! In most every case, we will want to keep some sort of history in our dimensions.
Type 2: Preserve History / Add a New Record
The most common type, this technique preserves the old value by inserting a copy of the current row with the updated attribute values into the Dimension. The old row is marked as archived (typically by using active and inactive dates and a status indicator).
Type 3: Alternate Reality / Add a New Attribute
New attributes are added to the dimension to represent the old and new value. In some cases this is a handy technique because you can preserve the old value along side the new value. This technique is best used when there is a massive, dimension-wide change that affects many rows. For example, if a voter’s representative district changes during redistricting, we might want to see what election results would have been if the old districts were used in the analysis.
Type 6: Advanced Hybrid
This is a hybrid approach that combines the other 3 types (1 + 2 + 3 = 6). According to Margy: “We’re creating new rows to capture change (Type 2), adding attributes to reflect an alternative view of the world (Type 3), which are overwritten for all earlier dimension rows for a given product (Type 1).” Read more about this here and be sure to check out this article by Margy and Ralph for more information. I hope to have some examples posted in the coming weeks as well.

We didn’t talk about inferred dimensions (although Margy did mention the late arriving kind). I suppose the idea of an inferred dimension (or Type 1.5) might be discussed with Warren in the coming days.

The Mini

After all that SCD talk, we talked about dealing with very large dimensions. Especially those that have Type 2 attributes. Margy introduced the concept of the Mini-Dimension, and stated that for extremely large dimensions, using Type 2 may not be a good idea for attributes that might change often. For example, suppose you wanted to store an airline passenger’s ‘age’ in your Passenger dimension. If the passenger flies 7 times over an 8 year period, you could have as many as 7 rows reflecting that change.

Instead, create a mini-dimension with attributes that change frequently and link them to the Fact table using a separate foreign key. It is hard to explain this in any detail without a picture. I promise once I get back from Amsterdam I will prepare a better example for this (with pictures!).

The Modeling Process

I was happy to see a segment dedicated to the overall modeling process. When it first came up, I felt it was out of place (perhaps it could have been discussed before we dove into sketching our case study models?), but in the end, it seemed to fit the flow.

The modeling process includes the following steps: Preparation, a high level design session, detailed model development, Dimensional model review and validation, and final documentation to be handed to the ETL team. It is important to note that this process is iterative.

Next we went over the various participants that need to be involved: the data modeler, business analyst, power users, BI application developer, data steward, source experts, and the ETL team.

During this process, we’ll also start to conduct our first rounds of data profiling. I find this part of the process to be most crucial because it is at this level where you can start to determine the feasibility of implementing a particular business process.

Types of Fact Tables

After a brief case study, Margy took a little more time to go over the different types of Fact table grains. These include the transaction, periodic snapshot, and accumulating snapshot. For more information on the different types, please read my post “Fact Tables“.

Some additional notes on Accumulating Snapshots: They generally represent some sequential, short lived, and predictable process that needs to be expressed by various milestones. These milestones are most often dates. These fact tables often span multiple business processes as well. For example, an order is placed, it is released, shipped, and delivered. These are separate milestones that have some serious significance to the business for analysis.

Lastly…

We finished the day discussing mature BI projects and how to handle existing warehouses that might not be in great shape. Margy went over several important common disorders (ranging from lack of business sponsor to infrastructure issues), some symptoms, and finally she proposed a treatment plan. Every slide in this section was packed with good, solid advice. When I can get my notes together on this, I plan to write about some of my own experiences with these disorders.

Tomorrow Warren will take over. I trust he’ll be as good as Margy!

Tags: , , , , ,

No Comments