This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Sometimes, not all dimension data is available when a transaction (i.e. business event, such as an order) comes into the Warehouse. This is a fact of data warehousing life. Under normal circumstances, you process dimensions first so that you have all your keys ready for inserting into the fact. When dimensional data is unavailable, missing, or incomplete you end up with a bit of a problem. NULL foreign keys in the fact table are not allowed.
Other times, your fact comes in late. In this scenario, you have all your dimensions, but now you need to insert a fact into the fact table and link it to existing dimension rows that were active at the time of the fact. Late arriving facts are easier to work with than late arriving dimensions, in my opinion. I already discussed some techniques for handling this in my post “ETL Subsystem 14: Surrogate Key Manager“.
I find late arriving dimensions (also called “early arriving facts”) to be a more difficult problem. There are a few of good solutions though, all of which are handled during ETL:
- Hold onto the fact until all dimensions arrive
- Create a dimension called ‘unknown’ or ‘not available’ with a primary key of -1 and use that
- Insert a dummy row in the dimension and populate it with whatever you can
None of these options are terribly appealing across all scenarios; however, each has a place under certain circumstances.
Hold Onto the Fact
Holding onto the fact in your staging area may be a great option if, for example, you expect the late arriving data to be coming in soon. Perhaps you process data from multiple source systems. System A contains the transactions, while systems B, C, and D combine to give you your dimensions. Because of your extraction windows, you may not be able to get data from system D until later in the morning, meaning that the business events from system A are missing some context. Holding onto these events and waiting for the data from system D is the appropriate action!
Even if the situation is not predictable, holding onto facts might be the best option. If you are loading daily stocks, for example, and for some reason a particular company is not available (a new company perhaps), but the stock is showing up in your extracts, you should hold onto that stock in your staging area until the company is available. You don’t want that stock in the warehouse until the full details of the company comes in (which could be available in a separate extract, or may need to be manually entered).
As you can see, deciding to work this way has implications on the ETL processes and staging area. Planning for these circumstances up front is a must.
The ‘Unknown’ Dimension
Creating an empty row in the dimension called ‘Unknown’ is also an option that is good under some circumstances. If you are processing orders and no information about a promotion comes in, then it would be safe to link the fact to a special row in the dimension that denotes that no promotional information was available. Depending on your dimension and business requirements, you could have many different levels of ‘unknown’.
Here are a few suggestions:
- -1, ‘none’
- -2, ‘unknown’
- -3, ‘not applicable’
If you use key -1 in your fact row for the foreign key to the promotion table, you allow your users to write queries and reports for all orders that were made when there was no promotion (promotion.name = ‘none’). The same is true for all facts that have an unknown promotion status, meaning there could have been a promotion, but the data was incomplete. Lastly, there are cases — if you do online orders for example — where promotions simply don’t apply.
You can expand these keys heading backwards with lots of detailed descriptions telling you why a dimension might not exist for an event. Don’t go overboard though! Having a few options is a good idea for data quality and analysis purposes. But too many may overload users with too many choices all too similar to be useful.
This approach works well when the likelihood of acquiring the context later is slim-to-none. If you use this method, it gets very complex if you later do in fact find that there was some context (i.e. a promotion). Your only option then is to revisit your facts and look for, and update, the fact rows corresponding to the late arriving data. On the bright side, if you’ve already tagged these rows with a foreign key of -2, for example, you may be able to focus only on those facts — saving you some process time.
Inferring a Dimension
In the company stock example I gave above, you have another option. If the company details of a stock is not known — and all you have is a SEDOL or Ticker — then simply insert a new row into the dimension with all of the information you know to be true about the dimension. This only works if you have enough details about the dimension to construct the natural key. Without this, you would never (although, I’ve learned to never say never) be able to go back and update this dimension row with complete attributes.
To make this work, and if you are using Type 2 SCDs, you will need another attribute in your dimension. Call it “inferred” or similar, make it a bit or char(1), and mark it as true (1, or “Y”).
This has a major implication on your dimensional processing. Now, before you process any Type 2 changes on any dimension that contains an attribute called “inferred”, you need to check if the row you are ready to work on is in fact inferred. If so, treat the entire row as Type 1. Because it is inferred, we don’t have the details anyway. If we went along and processed the row as a Type 2, then your fact will continue to point to an essentially empty row.
SQL Server 2005 Integration Services (SSIS)
The SCD Component handles the inferred dimension approach quite nicely. In fact, “Inferred Dimension” is a term introduced by SSIS (I believe). I talked briefly about the SCD Component and inferred dimensions in my post “ETL Subsystem 9: Slowly Changing Dimensions (SCD)” so I won’t repeat myself here. The bottom line is that SSIS has a very elegant solution to the late arriving dimension problem. If you have them — and almost every data warehouse has late arriving dimensions — then using the SCD Component and the inferred dimension option is must. Also check out this external link for more inspiration: “Handling Late-Arriving Facts“.
Hand Coding with Visual FoxPro (VFP9)
While SSIS has built-in capabilities to handle the inferred rows, VFP does not. This is not surprising of course because VFP is a language and not an ETL tool! With that said, writing SQL to accommodate the “inferred” flag in a dimension row is not a big deal. The processing logic for a SCD Type 2 dimension then becomes (in pseudo-code):
For each row
If any attribute marked as 'Type 2' changes AND inferred = "N" Then
update status of active/current row
insert into dimension a new row
End If
For each attributes marked as a 'Type 1' change OR inferred = "Y" Then
If attribute is subtype 1
update all rows that match the key
Else
update only the current/active row that matches the key
End if
End For
End ForI discussed the above pseudo-code in more detail in my post “Loading the Slowly Changing Dimension“. Check that out for more insight into some of the logic. I should mention again that this is a very basic structure and not intended to be comprehensive or de jure.
From Here
Enough about late arriving data! Late arriving data, changed data capture, and surrogate key lookups might be the most challenging aspects of ETL — especially when it comes to performance and efficiency. It will take you some time to get these right. You may — like me — be forever tweaking these areas to squeeze every last second out of the process.
Next, I’ll move on to discuss Dimension Management. A much lighter topic!
