In my entry “Dimension Tables”, I discussed some of the different types of dimension tables that could exist in a dimensional model. In this entry, I will discuss how they are loaded.
Dimensions change over time: Data entry errors are fixed, new customers arrive, employees get married and change their last names, products are re-categorized, and new store promotions are initiated. There are several challenges associated with these updates — challenges that do not exist in operational database applications. To face these challenges, the concept of the Slowly Changing Dimension has emerged.
There are 3 types of Slowly Changing Dimensions:
- Type 1 SCD
- When a type 1 attribute changes, it is Ok to overwrite the attribute on the current or all matching rows
- Type 2 SCD
- When a Type 2 attribute changes, make a copy of the row, mark the copy as ‘expired’ or ‘old’, and place the new Type 2 attribute change into the new row
- Type 3 SCD
- These represent alternate realities and keep the previous and current value of an attribute in a row
There is also a Type 1.5 SCD which is used to keep very large dimensions from expanding with too much unnecessary history. Consider a case where a Customer dimension is loaded from several sources. Several Type 2 attributes may not be available at the time the row is first loaded. When this data becomes available, it is updated as if it were a Type 1 change, rather than Type 2. If the customer then changes the attribute at a future date, the attribute would be treated as a Type 2. In this scenario, additional information must be made available to the ETL program so that it knows how to treat the attribute(s).
Inferred Members (which are treated somewhat like a Type 1.5 change) exists when a fact table references a dimension row that is not yet loaded. A new row is added to the dimension with a new surrogate key along with whatever additional data is available (at a minimum, the natural key). This almost-empty dimension row would be updated appropriately when the data becomes available. Just like the Type 1.5 change, additional information must be made available to the ETL program so that it knows how to treat the attribute(s) and not create unnecessary history.
To make Type 2 Slowly Changing Dimensions work, the dimension must have some attribute that distinguishes the active or current row from historical or expired rows. This can be done in one of three ways (my preferred method is to use dates):
- Dates: Create a from_date and a to_date in the dimension. The active row will have a null ‘to_date’.
- Flag: Create a status flag to distinguish active rows from inactive rows. Only one row should be active at a time for a give natural key.
- Version: Either a sequence number or some other number that will indicate which row is most recent in the dimension.
To simplify the discussion, when deciding on what SCD Type to tag an attribute, you need to ask yourself (well, ask your business users!): Do I need to keep a history for this attribute? Depending on your answer, consider the following:
Type 1
When an attribute marked as a “Type 1″ attribute is changed, the new value overwrites the existing value in the dimension without keeping a history. There are usually two flavors of this Type: Overwrite all that match the key (SCD subtype 1) or only overwrite the current row (SCD subtype 2). The “current” row is determined by the date, status, or version of the row.
Type 2
Type 2 changes always keep a history by marking the current or active row (determined by the date, status, or version attribute) as inactive and inserting a brand new row into the dimension with matching values from the previous row, plus any Type 2 changes.
For Hand-coders
If you know me, than you know I like to store this sort of information as metadata. For example, in my integration metadata (my Logical Data Map), I have a table that stores all the attributes for each dimension. I have a column called SCD_Type and SCD_SubType. Here is a view of this metadata taken from one of my hand-coded projects:
In pseudo-code, the processing would look like the following:
For each row
If any attribute marked as 'Type 2' changes Then
update status of active/current row
insert into dimension a new row
End If
For each attributes marked as a 'Type 1' change Then
If attribute is subtype 1
update all rows that match the key
Else
update only the current/active row that matches the key
End if
End For
End For
As you can see, I handle Type 2 changes first and then process Type 1 changes. This ensures that historical data is kept in tact and that the correct active row is updated if the subtype is marked as 2. Additional logic is needed to process inferred members and Type 1.5 as mentioned earlier.
When no SCD Types Are Assigned
Not all dimensions contain SCD attributes. A Temporal/Time dimension, for example, does not change (January 15th, 2008 is always January 15th, 2008). These and other dimensions are simply appended to, such as a weather or promotion dimension (i.e. Causal Dimensions). I have found that dimensions that operate over time work this way, although this is a generalization and not a rule.
In my metadata, I usually tag these types of non-SCD dimensions as type 0. My ETL code then treats these as a pure insert (or Upserts depending on the nature of the dimension) if all attributes for the dimension are marked as Type 0.
SSIS
The Slowly Changing Dimension Wizard in SSIS makes a lot of this work easy for you. In my opinion, it is one of the better transformations available through SQL Server. For very large dimensions though, I have found that the SCD transformation does not perform as well as using conditional transforms and lookups on filtered data. I will be sure to talk more about this in a future post. I should also mention that hand-coding SCD logic is tricky; my pseudo-code posted above is only a bird’s eye view of the logic involved.
#1 by Chris Woodard at September 2nd, 2008
| Quote
I am working on a data governance project with a team who bring a range of experience to the table. We are currently developing some standard logical data models, one of which is for representing Time.
The goal is to represent time consistently when captured in source OLTP databases so that warehouse type, generally dimensional, BI environments can reliably aggregate and compare (at least as regards dates) data from disparate sources within the enterprise.
I am coming to the conclusion that we may as well model Time logically in a relational way that 3NF oriented team members are comfortable with, but create the physical model for the Time, actually Calendar Day, dimension table as it will be used in the warehousing environments. So the physical model would be very denormalized, more or less one table with a row for each day, including robust attributes such as Month Number or Week In Year Number.
What do you think of the idea of using a replica of the actual Calendar Day dimension table as a reference table for OLTP databases, with business data tables representing dates using the Calendar Day Surrogate Key as a foreign key?
Thanks, Chris.
#2 by Tod McKenna at September 3rd, 2008
| Quote
Hi Chris,
I once had a similar problem to solve. I had just developed my first dimensional model and loved the way the Date Dimension worked. When I started programming again in our OLTP environment I noticed that to get the same functionality on reports, screens, and in queries, my users and developers were using date and time functions. All we stored in the database was a datetime stamp or just simply a date value as in 12/15/2007.
To make matters more complex, we had some mySQL and SQL Server Databases together with Visual FoxPro — all of which had subtle differences in the way they handle dates.
I solved the problem by introducing a Date Dimension (an exact replica of the one in the data warehouse) into the OLTP systems. So in the end, 4 OLTP systems and the data warehouse all shared the same Date Dimension. For distribution reasons, a copy of the date dimension was shipped with the product, while the master copy was maintained in the ETL environment.
The only other change I ended up making was to use the ISO date as an integer in my OLTP systems instead of using the database’s date or datetime data type. This allowed for a much more consistent way of accessing the date dimension in the different environments. It was a lot of work to make this change, though… a bit painful at times! But it was well worth it. Now, all dates in all systems are stored as an integer like 20071215.
So, to answer your question, I think its a great idea!
#3 by Todd McDermid at January 19th, 2009
| Quote
Hi Tod,
Stumbled across this post while searching for custom tasks. (I found your other post, but alas, no code - I’m compiling a directory of open-source tasks and components).
I wanted to bring to your attention an open-source alternative to the MS SCD transformation that performs well (for me) for (what I think are) larger data sets. Of course, I’m biased, ’cause I wrote it.
Head over to http://www.codeplex.com/ssisctc for that and other open (and semi-open) source tasks and components for SSIS.
#4 by Tod McKenna at January 31st, 2009
| Quote
Hi Todd, I’ve actually got a need at the moment for such a component. Thanks for the heads-up! Over the next week or so I’ll give it a try and let you know how I make out with it.