In my entry “Dimension Tables”, I discussed some of the different types of dimension tables that could exist in a dimensional model. In this entry, I will discuss how they are loaded.

Dimensions change over time: Data entry errors are fixed, new customers arrive, employees get married and change their last names, products are re-categorized, and new store promotions are initiated. There are several challenges associated with these updates — challenges that do not exist in operational database applications. To face these challenges, the concept of the Slowly Changing Dimension has emerged.

There are 3 types of Slowly Changing Dimensions:

Type 1 SCD
When a type 1 attribute changes, it is Ok to overwrite the attribute on the current or all matching rows
Type 2 SCD
When a Type 2 attribute changes, make a copy of the row, mark the copy as ‘expired’ or ‘old’, and place the new Type 2 attribute change into the new row
Type 3 SCD
These represent alternate realities and keep the previous and current value of an attribute in a row

There is also a Type 1.5 SCD which is used to keep very large dimensions from expanding with too much unnecessary history. Consider a case where a Customer dimension is loaded from several sources. Several Type 2 attributes may not be available at the time the row is first loaded. When this data becomes available, it is updated as if it were a Type 1 change, rather than Type 2. If the customer then changes the attribute at a future date, the attribute would be treated as a Type 2. In this scenario, additional information must be made available to the ETL program so that it knows how to treat the attribute(s).

Inferred Members (which are treated somewhat like a Type 1.5 change) exists when a fact table references a dimension row that is not yet loaded. A new row is added to the dimension with a new surrogate key along with whatever additional data is available (at a minimum, the natural key). This almost-empty dimension row would be updated appropriately when the data becomes available. Just like the Type 1.5 change, additional information must be made available to the ETL program so that it knows how to treat the attribute(s) and not create unnecessary history.

To make Type 2 Slowly Changing Dimensions work, the dimension must have some attribute that distinguishes the active or current row from historical or expired rows. This can be done in one of three ways (my preferred method is to use dates):

  1. Dates: Create a from_date and a to_date in the dimension. The active row will have a null ‘to_date’.
  2. Flag: Create a status flag to distinguish active rows from inactive rows. Only one row should be active at a time for a give natural key.
  3. Version: Either a sequence number or some other number that will indicate which row is most recent in the dimension.

To simplify the discussion, when deciding on what SCD Type to tag an attribute, you need to ask yourself (well, ask your business users!): Do I need to keep a history for this attribute? Depending on your answer, consider the following:

Type 1

When an attribute marked as a “Type 1″ attribute is changed, the new value overwrites the existing value in the dimension without keeping a history. There are usually two flavors of this Type: Overwrite all that match the key (SCD subtype 1) or only overwrite the current row (SCD subtype 2). The “current” row is determined by the date, status, or version of the row.

Type 2

Type 2 changes always keep a history by marking the current or active row (determined by the date, status, or version attribute) as inactive and inserting a brand new row into the dimension with matching values from the previous row, plus any Type 2 changes.

For Hand-coders

If you know me, than you know I like to store this sort of information as metadata. For example, in my integration metadata (my Logical Data Map), I have a table that stores all the attributes for each dimension. I have a column called SCD_Type and SCD_SubType. Here is a view of this metadata taken from one of my hand-coded projects:

Slowly Changing Dimension metadata view

In pseudo-code, the processing would look like the following:

For each row
  If any attribute marked as 'Type 2' changes Then
    update status of active/current row
    insert into dimension a new row
  End If
  For each attributes marked as a 'Type 1' change Then
    If attribute is subtype 1
        update all rows that match the key
    Else
        update only the current/active row that matches the key
    End if
  End For
End For

As you can see, I handle Type 2 changes first and then process Type 1 changes. This ensures that historical data is kept in tact and that the correct active row is updated if the subtype is marked as 2. Additional logic is needed to process inferred members and Type 1.5 as mentioned earlier.

When no SCD Types Are Assigned

Not all dimensions contain SCD attributes. A Temporal/Time dimension, for example, does not change (January 15th, 2008 is always January 15th, 2008). These and other dimensions are simply appended to, such as a weather or promotion dimension (i.e. Causal Dimensions). I have found that dimensions that operate over time work this way, although this is a generalization and not a rule.

In my metadata, I usually tag these types of non-SCD dimensions as type 0. My ETL code then treats these as a pure insert (or Upserts depending on the nature of the dimension) if all attributes for the dimension are marked as Type 0.

SSIS

The Slowly Changing Dimension Wizard in SSIS makes a lot of this work easy for you. In my opinion, it is one of the better transformations available through SQL Server. For very large dimensions though, I have found that the SCD transformation does not perform as well as using conditional transforms and lookups on filtered data. I will be sure to talk more about this in a future post. I should also mention that hand-coding SCD logic is tricky; my pseudo-code posted above is only a bird’s eye view of the logic involved.