ETL Subsystem 10: Surrogate Key Generator
This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
When integrating data into a Dimensional Model, you need a mechanism to assign new primary keys to each dimension. These primary keys will be used in your Fact table as foreign keys. You cannot use natural keys because they are likely to repeat — this is especially true if you are maintaining history using SCD Type 2 (more on this in a bit). Subsystem 10 addresses this important need, by specifying the need to generate surrogate keys for all dimensions.
Surrogates
Integer surrogate keys should always be used as primary keys for a dimension. In addition, each dimension will have at least one candidate key comprised of the natural key and row status (active or inactive dates) if the dimension is a TYPE 2 SCD. Here is what I wrote in a previous post:
Every dimension contains a single-part surrogate primary key field. This primary key is used to relate to one or more fact tables in the schema. Natural keys (like an employee id) are kept in the dimension, but are not used in joins. This technique ensures a unique value for each record, simplifies joins, and keeps indexes small.
So the concept is simple: Every time a new row is inserted into any dimension, it gets a new primary surrogate key. We use integers because they’re compact, perfect for sequencing, and under most relational databases we can control their ranges. How you generating this primary key largely depends on your performance needs and RDBMS capabilities (see the SSIS and VFP subsections below).
Special Cases
You can treat Date dimensions a little differently. Instead of making the first row in your Date Dimension start at 1, use the ISO 8601 date standard YYYYMMDD format. This makes sense for a few reasons. First, it is easier to look at a row in a Fact table and have the date available without a join to the Date dimension or one of its role players. Second, a Date Dimension is treated as a Type 1 SCD, meaning, you would never have the same date repeated with one row being active and the other inactive as in TYPE 2 SCDs (hence throwing off the ISO date format). Third, because Fact tables generally contain several dates, using the integer date key (which is an int in the ISO date format) instead of fetching the formatted date from the Date Dimension may have some performance benefits because you can easily format the date later on in the presentation layer. Lastly, you can usually predict with certainty the total number of rows that the Date Dimension will have, and then pre-populate all the dates before you even do your first integration. This removes the Date Dimension from the surrogate key processing all together.
As I mentioned previously, one row in each dimension should have a primary key of zero, which can be used as a symbolic “null”. Remember that Fact tables reference dimensions through foreign keys. You cannot have a null foreign key, so the only viable alternative is to link to a dummy row in the dimension. In dimensional modeling, the acceptable solution is to create a row in each dimension with a primary key of zero. For example, in a Customer Dimension, the zero key might have a customer name of “Unknown” and a DOB of 0 (which is a foreign key that relates to the Date Dimension’s null row). This is a very cool concept because it becomes incredibly easy to produce a query that could give you all the order transactions that did not involve a known customer. Maybe not the best of examples, but try writing a query to give you results of everything that didn’t happen, or wasn’t there, using a relational model!
You can also use -1, -2, and so on to represent other special conditions. For example, a Promotion Dimension may have a zero-key row that represents “Unknown” and a -1 key row that represents “No Promotion”. Perhaps a -2 key could symbolize a “Canceled Promotion”. And so on…
The following sections give some ideas on how to generate surrogate keys in both SSIS and VFP. I’ll discuss using these keys in greater depth when I discuss Subsystem 14: Surrogate Key Management.
SQL Server 2005 Integration Services (SSIS)
I prefer to let the database generate new, unique surrogate keys in all cases. This is usually the safest approach and doesn’t require any real development around the key-generation process.
For SQL Server, select the best int for the job and make it an IDENTITY. Select among the 1 byte Tinyint (0 to 255), 2 byte Smallint (-32,768 to 32,768), 4 byte Int (-2,147,483,648 to 2,147,483,647), and the 8 byte Bigint (-2^63 to 2^63-1) depending on the expected maximum size of the dimension.
It may not be practical or feasible to make the primary key attribute in your dimensions IDENTITY Ints. Performance may also be an issue when assigning and returning these new keys. There are really two alternatives, my favorite first:
Alternative 1: Use a staging table. Using a lookup/reference table in the staging area is an option that always works. The concept is simple: Create lookup tables that correspond to each dimension in the model. Include the IDENTITY surrogate key column, the natural key(s), and active/inactive dates (for Type 2 dimensions). When processing dimensions, insert new rows in this lookup table just before adding them to the dimension. Use the surrogate key generated from the lookup table to insert into the dimension (you can either return the generated key using SCOPE_IDENTITY() or simply do a lookup/join on the natural key and status). When you need the key for the Fact table, you can do another lookup/join on the staging table. This approach takes pressure off of the dimensional model and creates opportunities to increase performance by optimizing the lookup/reference table. I’ll talk more about the details on this approach when I discuss Subsystem 14.
Alternative 2: Generate your own key. Another way to achieve the same result is to figure out the last key in the dimension and increment it manually. This number will then be used to insert into the dimension and again into the Fact. The problem with this approach is that if you do not process Facts immediately after each dimension, one at a time, then you will need to store and lookup the key in a staging table or in memory. Also, this approach assumes that no other process or thread is inserting data into the dimension at the same time. This, of course, would kill the sequential system and violate the primary key constraint on the table.
To explore Alternative 2 further, have a look at this post from Marco Russo of sqljunkies.com. He explains how to use an Execute SQL Task in SSIS that executes the following:
SELECT COALESCE( MAX( ID_Dimension ), 0 ) AS LastID FROM Dimension
The claim is that his method is the best, most scalable solution. I don’t fully agree, as my experiences dictate otherwise. But I believe that each approach has merit and depends a lot on the environment.
Hand Coding with Visual FoxPro (VFP9)
For VFP, use an int data type with AutoIncrement enabled. As with a SQL Server implementation, I prefer using this approach to generating keys manually. Even still, both alternatives mentioned above are also available in FoxPro:
Alternative 1: Use a staging table. Here, as with SQL Server, use a staging table with AutoIncrement enabled on the primary key. As a consequence, you will not enabled AutoIncrement on the dimension. To return the newly added key from the lookup/reference table, you simply use the GETAUTOINCVALUE() function and return it.
Alternative 2: Generate your own key. You can also determine the last key used in the dimension and manually increment it. Personally, this approach offers very little to a VFP developer. Alternative 1 — or simply using the Dimension for generating the surrogate — is by far the best approach.
I like VFP for these types of tasks. Whatever you decide, you don’t really need to worry about returning the generated key. You typically insert and update all your dimensional data first, and then return to the dimension later to lookup keys just before inserting the facts. Nothing beats an old-fashioned SEEK in this scenario, which can be light years faster than building a select with a where clause depending on how crafty you are with Rushmore and Index building.
Compare this:
SELECT PrimaryKey ; FROM Dimension ; WHERE NaturalKey = 'SomeValue' ; AND StatusFlag = True
to this:
SEEK('SomeCandidateKey','Dimension','CandidateIndex')
If you are using Alternative 1, then instead of SEEK, you could build a JOIN between the lookup table and the data being processed for the fact. This set-based approach can be pretty speedy as well.
If you insist on exploring Alternative 2, the translation from Marco’s TSQL code to VFP is as follows:
SELECT NVL(MAX(ID_Dimension),0) AS LastID ; FROM Dimension
In my experience, I simply use VFP9’s Autoincrementing Integer field in each Dimension. I don’t bother with a staging lookup table because with VFP, I can just set the order to the natural key and do a SEEK on the dimension. It’s a fast, simple, and consistent approach.
VFP Really Shines
In this series of articles, I have attempted to highlight some of the differences between implementing the 34 Subsystem best practices in both a hand-coded environment (VFP) and through a tool specifically designed for the job (SSIS). In each case, both VFP and SSIS have really been about equal (this is a non-scientific study, of course!). SSIS has advantages that are hard to duplicate in VFP (some of which I will be highlighting later on in this series), but in this case, VFP really shines. While both databases can handle generating keys for us in similar ways, VFP’s SEEK command allows us to use it!
I don’t want to dwell on this any more. I’ll discuss more details on how to use the surrogate keys when I talk about Subsystem 14: Surrogate Key Management.
I'm a Quant Technical Specialist (Data Warehousing and Business Intelligence), with expertise in business analysis, data modeling, and data integration. I have extensive experience developing vertical and integrated desktop, Internet, and BI applications spanning municipal, clinical, and financial industries.

June 16th, 2008 at 4:49 am
[…] Surrogate Key Generator […]