Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘VFP’


Published June 24th, 2008

ETL Subsystem 14: Surrogate Key Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand-coded in Visual FoxPro.

Surrogate key management is an important part of designing an ETL system using the Kimball Methodology. Slowly Changing Dimensions require the use of surrogates because natural keys, like a Customer ID, Product ID, and so on, will be repeated in dimensions whenever Type 2 changes are tracked. Surrogates are also desirable because they are single-part and fast for joins. I’ve already discussed the need for surrogates in several posts. Here, I’ll talk about some techniques for replacing the natural keys that come in from the source systems with the new surrogate keys that exist in the data warehouse.

The idea is simple in theory: For each dimension row, you generate a surrogate key. This surrogate key replaces the row’s candidate key and is used to link to one or more Facts.

For example, the candidate key for a Customer dimension may be CustomerID + ActiveFromDate. The ActiveFromDate is used to sequence the rows when Type 2 change tracking is used. It is entirely possible to have the same CustomerID listed dozens of times in the dimension. The differentiator is the ActiveFromDate. This combination is guaranteed to be unique in the Customer dimension (although you would not normally enforce this using an index, you would certainly enforce it through ETL). This is a simple example. When modeling Parts or Real Estate, you may have several attributes combining to form a candidate key. It also becomes a bit more complex when Customer data is refreshed throughout the day: you cannot rely on the date alone, and you also need to consider time.
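As a rough sketch of what "enforce this through ETL" could look like, the check below looks for candidate keys that appear more than once. It uses Python's built-in sqlite3 module purely to stay self-contained, and the table and column names (dim_customer, customer_id, active_from_date) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical Type 2 dimension; names are illustrative only.
    CREATE TABLE dim_customer (
        customer_key     INTEGER PRIMARY KEY,  -- surrogate key
        customer_id      TEXT,                 -- natural key
        active_from_date TEXT
    );
    INSERT INTO dim_customer VALUES
        (1, 'C100', '2008-01-01'),
        (2, 'C100', '2008-06-01'),
        (3, 'C200', '2008-01-01'),
        (4, 'C200', '2008-01-01');  -- deliberate duplicate candidate key
""")

# Any row returned here is a candidate key that is not unique.
duplicates = conn.execute("""
    SELECT customer_id, active_from_date, COUNT(*)
    FROM dim_customer
    GROUP BY customer_id, active_from_date
    HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)
```

An ETL run could fail, or divert the offending rows, whenever this query returns anything.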

In dimensional modeling, you will normally process dimensions before facts. Because of this, you will have all surrogates defined for each row you need. In a subsequent post, I’ll talk more about situations where you load a fact before a dimension, or situations when late arriving data must be handled. For now, assume that all dimensions have rows that correspond to a fact. When you process the fact, you simply need to look up the surrogate key assigned in the dimension and use that key as the foreign key to that dimension in the fact.

Dimensional modeling essentially requires surrogate keys, and the price is a performance hit during ETL, because it takes time to look up surrogate keys. You will spend a lot of time making this process as fast and efficient as possible. I do have a few suggestions, though:

Load by the Day

Process facts one day at a time (all of yesterday’s data plus any changed data or late arriving data from previous days). Doing so allows you to join your staging fact rows to the dimension by natural key for that day. For example, in SQL, you could acquire all dimension rows that are valid for a particular day by doing the following:

SELECT DISTINCT NaturalKey 
    FROM Dimension 
    WHERE DataDate BETWEEN ActiveFromDate AND ActiveToDate

This returns a subset of the dimension for the day on which the fact occurred (DataDate). If the candidate key involves more than the natural key, you will need to adjust the above SQL to be more robust. For example, you may have the name and birthdate of a patient at a hospital with an insurance number and insurance company code that together uniquely identify the row! That’s six attributes (if you include the ActiveFromDate) that uniquely identify the row when combined.

With the above, you can incorporate the query into a JOIN on the staging fact data (by selecting all facts that occurred on DataDate), or you could use it as the source for a lookup operation. Because you narrow the focus to a single day (and load by the day), each natural key should match only one version of its dimension row, so you no longer need to worry about SCDs.
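Here is a minimal sketch of the JOIN variant, where the staging fact picks up the surrogate key that was valid on the day being loaded. SQLite (through Python's sqlite3 module) stands in for the real database, and every table and column name is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical Type 2 customer dimension.
    CREATE TABLE dim_customer (
        customer_key     INTEGER PRIMARY KEY,  -- surrogate key
        customer_id      TEXT,                 -- natural key
        city             TEXT,
        active_from_date TEXT,
        active_to_date   TEXT
    );
    INSERT INTO dim_customer VALUES
        (1, 'C100', 'Boston',  '2008-01-01', '2008-05-31'),
        (2, 'C100', 'Chicago', '2008-06-01', '9999-12-31'),
        (3, 'C200', 'Denver',  '2008-01-01', '9999-12-31');

    -- Hypothetical staging fact rows, still carrying the natural key.
    CREATE TABLE stg_sales (customer_id TEXT, data_date TEXT, amount REAL);
    INSERT INTO stg_sales VALUES
        ('C100', '2008-06-23', 25.0),
        ('C200', '2008-06-23', 40.0),
        ('C100', '2008-05-15', 10.0);  -- a different day, not part of this load
""")

data_date = "2008-06-23"  # the single day being loaded

# Join on the natural key, restricted to the dimension rows valid on data_date.
rows = conn.execute(
    """
    SELECT d.customer_key, s.customer_id, s.amount
    FROM stg_sales AS s
    JOIN dim_customer AS d
      ON d.customer_id = s.customer_id
     AND s.data_date BETWEEN d.active_from_date AND d.active_to_date
    WHERE s.data_date = ?
    ORDER BY d.customer_key
    """,
    (data_date,),
).fetchall()

print(rows)
```

Dates are stored as ISO-formatted text here so that BETWEEN compares them correctly as strings; a real warehouse would use proper date types or integer date keys.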

You can further enhance the above by maintaining special staging tables for each of your dimensions. These staging tables contain all candidate attributes and the surrogate key for each dimension. These special staging tables are maintained while you process your dimensions. Instead of selecting against the actual, production dimension, you can use the staging table. Although there is a little extra overhead in maintaining the staging tables, you get it all back — and then some — when you do your key lookups.

If you need to load by time of day (essential for real-time or intra-day ETL), you can still use the above method, except that you will need to join your fact data on time of day as well.

You can get around using BETWEEN if you are processing the most recent data. Instead, you could filter results by the dimension’s “current” flag. The current flag is set while processing the dimension. Only one row for a particular natural key can be marked as current. A predicate on a current flag would generally be faster than checking between dates. A common trick is to split the rows into two streams: one with current data (anything that happened yesterday), and one with historical data (anything that happened the day before yesterday and back). Use the current flag for the current data, and use the BETWEEN for historical data.
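The two-stream split might look something like the sketch below. It again uses sqlite3 for convenience, and the names (is_current, stg_sales, and so on) are hypothetical. It also assumes the dimension has not changed again since yesterday's facts occurred, otherwise the current flag would point at the wrong version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY, customer_id TEXT,
        active_from_date TEXT, active_to_date TEXT, is_current INTEGER
    );
    INSERT INTO dim_customer VALUES
        (1, 'C100', '2008-01-01', '2008-05-31', 0),
        (2, 'C100', '2008-06-01', '9999-12-31', 1);

    CREATE TABLE stg_sales (sale_id INTEGER, customer_id TEXT, data_date TEXT);
    INSERT INTO stg_sales VALUES
        (10, 'C100', '2008-06-23'),  -- yesterday: current stream
        (11, 'C100', '2008-05-15');  -- older: historical stream
""")

yesterday = "2008-06-23"

# Current stream: a simple equality predicate on the flag, no date-range check.
current = conn.execute("""
    SELECT s.sale_id, d.customer_key
    FROM stg_sales s
    JOIN dim_customer d
      ON d.customer_id = s.customer_id AND d.is_current = 1
    WHERE s.data_date = ?""", (yesterday,)).fetchall()

# Historical stream: fall back to the BETWEEN lookup.
historical = conn.execute("""
    SELECT s.sale_id, d.customer_key
    FROM stg_sales s
    JOIN dim_customer d
      ON d.customer_id = s.customer_id
     AND s.data_date BETWEEN d.active_from_date AND d.active_to_date
    WHERE s.data_date < ?""", (yesterday,)).fetchall()

print(current, historical)
```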

I find that this technique works well for small dimensions with a low number of facts (a few thousand transactions a day). With a good index plan, this might be all you need to do.

Store the Surrogate Right Away

Another technique worth mentioning is particularly useful when dealing with very large dimensions and lots of facts. You should also consider this technique if you are manually generating your surrogates (see my post “ETL Subsystem 10: Surrogate Key Generator” for specifics). Basically, when you generate your surrogate key for a dimension row, store it with the fact data at the same time.

This technique requires writing the surrogate to the dimension and the fact staging table. Using an UPDATE statement is a simple approach, but a more exotic and better performing method would be to INSERT the new keys into another staging table, and joining to the staging fact table later on.
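One possible shape for the INSERT variant is sketched below: every time a dimension row is created, its natural key and new surrogate key are also written to a small key-map staging table, and the staging fact joins to that table later. The table names and the add_dimension_row helper are hypothetical, and SQLite's automatic row id merely stands in for whatever key generator you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_id TEXT);
    CREATE TABLE stg_customer_keys (customer_id TEXT PRIMARY KEY, customer_key INTEGER);
    CREATE TABLE stg_sales (sale_id INTEGER, customer_id TEXT);
    INSERT INTO stg_sales VALUES (10, 'C100'), (11, 'C200'), (12, 'C100');
""")

def add_dimension_row(customer_id):
    """Insert a dimension row and record its surrogate key right away."""
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_id) VALUES (?)", (customer_id,)
    )
    key = cur.lastrowid  # stand-in for a real surrogate key generator
    conn.execute(
        "INSERT INTO stg_customer_keys VALUES (?, ?)", (customer_id, key)
    )
    return key

for cid in ("C100", "C200"):
    add_dimension_row(cid)

# Later, the fact load is a plain equality join with no date-range logic.
keyed_facts = conn.execute("""
    SELECT s.sale_id, k.customer_key
    FROM stg_sales s
    JOIN stg_customer_keys k ON k.customer_id = s.customer_id
    ORDER BY s.sale_id
""").fetchall()

print(keyed_facts)
```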

Either way, there is a performance hit with storing the key right away, so be sure to thoroughly review your environment and needs. I have used this technique before with great success, but it took me a long time to determine it was the best approach. In fact, my original design worked great until the dimension grew to a few million rows and the transaction load doubled. Doing a lookup using BETWEEN proved to be too slow compared with saving the key immediately.

SQL Server 2005 Integration Services (SSIS)

Jamie Thomson wrote an article on how to use SSIS to update fact tables (actually, most of it is a case study written by “Mag”). I won’t repeat what was in that post. Have a look. Jamie’s technique is basically what I have come to adopt myself. You’ll notice the liberal usage of Lookups instead of Joins (joins, at least for me, seemed like a more logical starting point). Lookups perform better for a few reasons. The Merge Join operation requires sorted input, either by using the Sort Component or by ORDERing the data through the Source Component. This sorting can be expensive. Lookups, in addition to not requiring that the input be sorted, can use caching and limit access to the database. For these reasons, I use Lookups now almost exclusively.

When using the Lookup approach you must be aware of two possible conditions: The Lookup could return more than one match if you don’t properly define your candidate key. Also, if no match is found, common for late arriving data, you need to either divert the errors and handle them separately, or convert the NULLs to your default “unknown” value for each dimension. For me, I typically use negative numbers to define various types of null (-1 for “not known”, -2 for “not applicable”, etc…).
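To make those two conditions concrete outside of SSIS, here is a small, tool-agnostic sketch of a cached lookup. It is not the SSIS Lookup component, just an illustration of the same idea. The function names are made up, and -1 follows my “not known” convention above.

```python
UNKNOWN_KEY = -1  # "not known", per the negative-number convention

def build_lookup(dimension_rows):
    """Cache natural key -> surrogate key, rejecting ambiguous candidate keys."""
    cache = {}
    for natural_key, surrogate_key in dimension_rows:
        if natural_key in cache:
            # Condition 1: the candidate key matches more than one row.
            raise ValueError(f"candidate key {natural_key!r} is not unique")
        cache[natural_key] = surrogate_key
    return cache

def resolve(cache, natural_key):
    """Condition 2: no match found, so fall back to the 'unknown' key."""
    return cache.get(natural_key, UNKNOWN_KEY)

lookup = build_lookup([("C100", 2), ("C200", 3)])
resolved = [resolve(lookup, key) for key in ("C100", "C999")]

print(resolved)
```

Diverting unmatched rows to a separate error stream, rather than defaulting them, would replace the fallback in resolve with whatever handling you prefer.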

If you process by the day, use staging tables as dimension lookups, and rely on Lookups instead of Sorts and Merge Joins, then fact processing should go quite well.

Hand Coding with Visual FoxPro (VFP9)

As I discussed in my post “ETL Subsystem 10: Surrogate Key Generator”, VFP’s SEEKing capability far outshines set-based approaches to this problem. Check out my comments in that posting for some ideas on how to use VFP to get the surrogate keys for your fact data.

Not to get lost though in the wonders of SEEK, VFP is perfectly suited for set-based approaches as well. So all of the methods discussed in the article (Joins and lookups) can be implemented easily in FoxPro. As I wrote in that post, VFP really shines for this type of task.

From Here

Next post, hopefully later this week, I’ll discuss bridge tables and how to best build them. I have an interesting case study to walk through. I have refrained from using cases in my ETL subsystem posts, but for bridge tables, I am making an exception!

Published June 16th, 2008

ETL Subsystem 13: Fact Table Builders

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand-coded in Visual FoxPro.

Fact tables are the heart and soul of the business process dimensional model. I discussed fact tables already in a previous post (“Fact Tables”), so I won’t repeat myself here. Recently, I also provided a more formal definition of a fact table as the “central table of a Star Schema with numeric performance measurements, identified by a composite key, each of whose elements is a foreign key drawn from a dimension table”.

Three grains define fact tables: Transaction-level, Accumulating Snapshot, and the Periodic Snapshot grain. Refer to my post referenced above (“Fact Tables”) for more information on granularity.

Subsystem 13 is responsible for the construction of all types of fact grains.

Transaction grain fact tables are always the largest and should contain data at the finest grain possible. This would include a line on an invoice, a patient diagnosis for a series of emergency room events, or even a Josh Beckett fastball. Data is almost always inserted (via bulk load) into transaction-grain facts. Deleting rows is dangerous (more on this in a second), and updates should only occur in order to correct an error or update some late-arriving key. In some cases, I have even gone back into a transaction grain fact table in order to update a calculated field.

Instead of deleting data in a transaction-grain fact table, “negate” the row instead. Otherwise, you run the risk that some of your historical reports and analysis will be inconsistent. To negate a row, simply add a new row to the fact, with identical foreign keys to the dimensions as the row you are negating, but update the metric so that when both rows are summed (or counted, etc…) they result in zero activity. For some metrics, this is tricky, but if you are using mostly additive and semi-additive metrics in your facts, you should be OK with this approach.
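As a quick illustration of negating a row with additive metrics, consider the sketch below. The fact_sales table and its columns are invented for the example, and sqlite3 is used only so the snippet runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (
        date_key INTEGER, customer_key INTEGER, quantity INTEGER, amount REAL
    );
    INSERT INTO fact_sales VALUES
        (20080610, 2, 3, 30.0),
        (20080610, 3, 1, 12.5);
""")

# Negate the first row: identical foreign keys, metrics with the opposite sign.
conn.execute("""
    INSERT INTO fact_sales (date_key, customer_key, quantity, amount)
    SELECT date_key, customer_key, -quantity, -amount
    FROM fact_sales
    WHERE date_key = 20080610 AND customer_key = 2
""")

totals = conn.execute("""
    SELECT customer_key, SUM(quantity), SUM(amount)
    FROM fact_sales
    GROUP BY customer_key
    ORDER BY customer_key
""").fetchall()

print(totals)
```

The original row stays in place, so earlier reports can still be reproduced, while the sums for the negated customer come out to zero.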

For the periodic snapshot grain, there are some differences in the loading process to consider. First, as these facts deal with data aggregated over time spans (daily, weekly, monthly, quarterly, etc.), it is appropriate to load them only when a period is complete. Second, the grain of the periodic snapshot must be considered extremely carefully, because it is all too easy to apply an aggregate calculation inappropriately against the stated grain. Third, and as a complement to my first point, it may be necessary to aggregate some running totals. If, for example, you are building a monthly periodic snapshot for a checking account at the bank, you would normally also provide data from the end of the last period to date.

Accumulating snapshots represent some process as it evolves over time. Therefore, rows in the fact are constantly updated to represent this evolution. For example, a patient may enter the ER, check in at triage, get assigned a room, see the nurse, see the doctor, etc. A building inspector may visit a new construction before work is done, when the foundation is set, when the framing is up, when the electrician has finished, etc. Loading this type of snapshot is much different from the other two, in that it tracks some defined process over time in a single row. There are many updates, and additional logic is needed to handle the loads.
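A bare-bones sketch of that update pattern for the ER example follows. One row per visit is created the first time the visit is seen and then updated as each milestone arrives. The table, the milestone names, and the use of 0 as the “not yet happened” date key are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_er_visit (
        visit_id         INTEGER PRIMARY KEY,
        checkin_date_key INTEGER NOT NULL DEFAULT 0,
        triage_date_key  INTEGER NOT NULL DEFAULT 0,
        room_date_key    INTEGER NOT NULL DEFAULT 0,
        doctor_date_key  INTEGER NOT NULL DEFAULT 0
    )
""")

# Whitelist of milestone -> column, so column names are never taken from input.
MILESTONE_COLUMNS = {
    "checkin": "checkin_date_key",
    "triage": "triage_date_key",
    "room": "room_date_key",
    "doctor": "doctor_date_key",
}

def record_milestone(visit_id, milestone, date_key):
    """Create the visit row if needed, then fill in one milestone date."""
    column = MILESTONE_COLUMNS[milestone]
    conn.execute(
        "INSERT OR IGNORE INTO fact_er_visit (visit_id) VALUES (?)", (visit_id,)
    )
    conn.execute(
        f"UPDATE fact_er_visit SET {column} = ? WHERE visit_id = ?",
        (date_key, visit_id),
    )

record_milestone(1, "checkin", 20080610)
record_milestone(1, "triage", 20080610)
record_milestone(1, "doctor", 20080611)

visit = conn.execute("SELECT * FROM fact_er_visit").fetchall()
print(visit)
```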

SQL Server 2005 Integration Services (SSIS)

I find working with the various types of fact tables in SSIS to be rather simple. The key for success is usually in how the Control Flow is sequenced (facts generally come last), when dimensions are loaded, and where you do your dimension key lookups. In essence, all you need is a source containing your metrics (line items on an invoice pulled during the extract phase), a series of lookups (or joins, whichever provides the best performance in your situation), and some way to load the fact (bulk insert if possible).

Some advice: Do your UPDATEs in an Execute SQL Task. Do not try to use the OLE DB Command, which issues its statement once per row, especially (and I mean especially) if you are doing a large number of updates and your fact table is also large. To get this to work, offload the data you need for your updates into a separate staging table. Use the Execute SQL Task and write an appropriate UPDATE statement which joins your fact to your new staging table. The same approach holds true for any rows you might be deleting.
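The set-based update could be sketched as follows, with hypothetical table names. In SQL Server you would more likely write this as an UPDATE with a FROM clause and a JOIN. The correlated subquery form below is used only because it runs unchanged on the SQLite engine bundled with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY, customer_key INTEGER, amount REAL
    );
    INSERT INTO fact_sales VALUES (10, -1, 25.0), (11, 3, 40.0), (12, -1, 10.0);

    -- Staging table holding only the rows that need to change.
    CREATE TABLE stg_fact_updates (sale_id INTEGER PRIMARY KEY, customer_key INTEGER);
    INSERT INTO stg_fact_updates VALUES (10, 2), (12, 1);
""")

# One statement updates every staged row, instead of one command per row.
conn.execute("""
    UPDATE fact_sales
    SET customer_key = (
        SELECT u.customer_key
        FROM stg_fact_updates u
        WHERE u.sale_id = fact_sales.sale_id
    )
    WHERE sale_id IN (SELECT sale_id FROM stg_fact_updates)
""")

updated = conn.execute(
    "SELECT sale_id, customer_key FROM fact_sales ORDER BY sale_id"
).fetchall()
print(updated)
```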

Hand Coding with Visual FoxPro (VFP9)

I find working with fact tables in FoxPro to be just as easy as in SSIS. There is no clear advantage of one over the other. One major benefit with VFP is that it is an excellent OOP language with very fast cursors (a VFP Cursor is much easier to work with than an SSIS Raw file). This gives you many more options on how you look up and populate foreign keys for the fact at load time.

If you are using VFP tables for your RDBMS, you will need to be mindful of the 2 GB file size limit for your fact tables. Your Fact Table Builder may have to handle partitions manually. This is a major disadvantage of using the VFP database, especially because 2 GB is not a lot of space for daily transactions!

From Here

In the next article, I’ll discuss ETL Subsystem 14: Surrogate Key Manager. Subsystem 14 helps to maintain the integrity among facts and dimensions for each business process, and is a complement to your Fact Table Builders.

Published May 26th, 2008

ETL Subsystem 12: Special Dimensions Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand-coded in Visual FoxPro.

Each dimensional model that you create will have special design characteristics and requirements. This typically means that you may need to rely on specialized dimension tables. Not all dimensions will be “Product” or “Customer”. In fact, you will likely have more specialized dimensions in your models than actual normal, business dimensions.

There are several categories of special dimensions. They are:

  • Date and Time Dimensions
  • Bridges to support many-to-many relationships
  • Special Indicators and Flags Dimensions
  • Study or Research Group Dimensions
  • Mini-Dimensions
  • “Current” Dimensions
  • Lookup and other Static Dimensions
  • Administration Dimensions (such as special logs, monitors, and audits)

Much of the above has already been discussed in my posts to date (date dimensions, bridges, and lookup dimensions, for example). So as not to beat a dead horse, I’ll spend the next few paragraphs explaining those I haven’t really talked much about yet.

Special Indicators and Flags Dimensions

These dimensions, also referred to under some circumstances as “junk” dimensions, exist to store extra information about the fact when it occurred. I typically use these special dimensions to hold flags (for example, whether a taxpayer has a history of delinquency, or whether a certain diagnosis and age create a greater risk of death). I usually store these flags as a tiny int with three settings: -1 null, 0 off, and 1 on. For researchers, this table acts as a way to screen large amounts of data and focus efforts on a subset of events with certain characteristics.

Study or Research Group Dimensions

Whether I’m working with Qualitative Analysts or Quantitative Analysts, they always seem to need specialized research databases (and they all seem to like using SAS too!). In the past, I got in the habit of creating copies of the required business process dimensional models for their use, filtered and sampled appropriately. These databases would be installed in an isolated environment. This isolation allowed them to develop models, conduct what-if analysis, and manipulate the data to test theories. This gave the researchers their own data to play with and everyone was happy. Another solution, and a better one, which I have only recently begun to implement with more regularity, is creating a special study group and/or research dimension that defines the research being conducted (“Infection Research”, “buying patterns of customers in Alaska in January”, etc.). A bridge table between the subject (Customer? Product? Stock? Web Page?) and the study dimension links the study with the sample used. All queries for that study simply use the new special dimension as an additional filter. And everyone’s happy!

Mini-Dimensions

Mini-dimensions are necessary when a subset of attributes (typically related) needs Type 2 change tracking, the changes happen frequently, and the dimension is rather large (such as customer, webpage hits, or stock index characteristics). A specialized mini-dimension is created with the required attributes, stored at a grain that represents a combination of attributes. This mini-dimension does not contain a link back to the parent dimension, nor does it contain the natural or primary key of the dimension. Each row is instead given a unique key that is referenced alongside the parent dimension in the fact table. For a large Customer dimension, you may create a mini-dimension with income bands, year-to-date behaviors, job status, and other related financial and purchase information. This “profile” might be shared by hundreds or thousands of customers and change frequently. Therefore, the data you store would not be identifiable on its own to any single customer, but to all customers who share the characteristics.

“Current” Dimensions

There are times when large Type 2 SCD dimensions become too unwieldy for querying and other analysis; or, what you really want is a Type 1 dimension to do most of your work, but you need to maintain Type 2 changes over time to accommodate some special need (like compliance). The solution is to copy out the “current” rows of a dimension (using a current flag or date range) into a second, special Type 1 dimension. I’ve had success applying this technique with three methods: (a) creating a role-playing view on the dimension using only the current rows (one challenge with this method is that you need to store a special key in the parent dimension so that you can use the view to relate to a fact table row); (b) creating a new Type 1 dimension through ETL and placing a new foreign key to that table in the fact; and (c) using the Type 2 dimension as a bridge table, linking to the special dimension (which can be a view or a physical table created through ETL) through its natural key.
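To give a feel for method (c), here is a small sketch in which the “current” dimension is a view over the Type 2 dimension, and facts reach it through the natural key. All names are hypothetical, and sqlite3 is used only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY, customer_id TEXT,
        city TEXT, is_current INTEGER
    );
    INSERT INTO dim_customer VALUES
        (1, 'C100', 'Boston',  0),
        (2, 'C100', 'Chicago', 1),
        (3, 'C200', 'Denver',  1);

    -- The special Type 1 "current" dimension, here as a view.
    CREATE VIEW dim_customer_current AS
        SELECT customer_id, city FROM dim_customer WHERE is_current = 1;

    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 25.0), (3, 40.0);
""")

# The Type 2 dimension acts as the bridge: fact -> Type 2 row -> current row.
by_current_city = conn.execute("""
    SELECT cur.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    JOIN dim_customer_current cur ON cur.customer_id = d.customer_id
    GROUP BY cur.city
    ORDER BY cur.city
""").fetchall()

print(by_current_city)
```

Both of customer C100's sales are reported under the customer's current city, even though one of them happened while the customer was still in Boston.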

SQL Server 2005 Integration Services (SSIS)

Other than building the dimensions, there is nothing special from an SSIS point of view. You will generally handle the creation of the special dimensions in separate packages (dates and lookups for example) or as part of normal dimensional processing (which would be the case for mini-dimensions and current dimensions for example).

Hand Coding with Visual FoxPro (VFP9)

Same as with SSIS. Nothing special. If you can process a dimension, you can process a special dimension. Most of the onus falls on business requirements, data modeling, and overall data warehouse design, none of which are inhibited by Visual FoxPro.

From Here

This subsystem is designed to be a placeholder for handling special dimensions. I have found that a great deal of the value in a DW/BI solution actually stems from these. All those special flags, scoring results, quality assessments, and additional features that we’ve painstakingly added through ETL are now available.

In my next post in this series, I’ll discuss lucky subsystem 13: Fact Table Builders.

Published May 20th, 2008

ETL Subsystem 11: Hierarchy Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand-coded in Visual FoxPro.

This subsystem exists to ensure that hierarchies are appropriately translated and represented in the dimensional model. There are two types of hierarchies that you’ll need to contend with: fixed and ragged.

Managing hierarchies can be a complex process especially when you have hierarchies that are extremely ragged (for example, manufacturers’ parts or an organizational chart). You’ll also run into complexities when a dimension entity (like a single customer or a single product) exists in simultaneous hierarchies. I’ll talk more about these later in this post. In contrast, the fixed variety is easier to work with.

Hierarchies as Attributes of a Dimension

Generally speaking, you can think of hierarchies as many-to-one relationships. In the dimensional modeling world, these relationships are represented in a single table. This would include stores to regions to state, children to parents to grandparents, and greens to vegetables to produce. ETL Subsystem 11 seeks to maintain the integrity of these relationships.

To simplify even further, instead of thinking in “tree” structures, think in “lines” from the child up the hierarchy to the parent. This will help you build the dimensions to accommodate the relationships. In the simplest form: A single store is in a single region in a single state. This is actually a very interesting topic from a modeling perspective, and one that I’ll get to in more detail in a future posting.

Dimensions are denormalized structures, which means that you will have many repeated elements. For example, a store region will be repeated for each store and state will be repeated for each region, all in the same dimension (see the state_region_store figure). This is normal and desirable in a dimension. The trick to getting this to work correctly is that the hierarchy must be represented as a single value with the dimension row’s primary (surrogate) key.

Data modelers who are used to 3rd Normal Form might look at the above and cringe. But remember: normalized models are for preventing data anomalies in a transaction environment. In a data warehousing environment, the rules are different. First, data integration controls leave no opportunity for data anomalies. Second, normalizing the data warehouse makes absolutely no sense from a usability perspective: it only complicates and slows down reports and queries.

The Ragged Kind

So fixed hierarchies are easiest to work with: got it. It is not so easy to work with variable and ragged hierarchies because of their variable depth. The classic example is an organization chart, where any employee can be at the top or at the bottom of the hierarchy. Knowing how deep the organization runs from any point is a challenge that usually requires self joins and bridge tables to represent the relationships.

I have always solved these types of hierarchy “problems” using “helper tables” in a dimensional model. Ralph Kimball wrote a great article a decade ago on this subject. Check it out for more details.

Helper tables look like bridge tables that sit between the fact and the dimension. They facilitate the representation of complex hierarchical information. This design complicates user queries though, so be sure that helper tables are absolutely needed. It might be, for example, that you only really need the manager’s name and not the entire chain of command with each employee. You don’t need a helper table for that (see the following code)!

-- Employee name and title, with the manager's details where one exists
SELECT
   c.FirstName AS empFirstName,
   c.LastName AS empLastName,
   e.Title AS empTitle,
   COALESCE(m.mgrFirstName, 'N/A') AS mgrFirstName,
   COALESCE(m.mgrLastName, 'N/A') AS mgrLastName,
   COALESCE(m.mgrTitle, 'I am the boss') AS mgrTitle
FROM
    HumanResources.Employee e
-- Contacts are related to employees through ContactID, not EmployeeID
JOIN Person.Contact c ON c.ContactID = e.ContactID
LEFT JOIN (
   SELECT
      e2.EmployeeID,
      c2.FirstName AS mgrFirstName,
      c2.LastName AS mgrLastName,
      e2.Title AS mgrTitle
   FROM
      HumanResources.Employee e2
   JOIN Person.Contact c2 ON c2.ContactID = e2.ContactID )
  AS m ON m.EmployeeID = e.ManagerID

I wrote a while ago on ragged hierarchies from a programming perspective. Take a look at that post for more details.

Date Hierarchies

For another example, let’s look at a common hierarchy we can all relate to: year, quarter, month, and week. You can see how this hierarchy is modeled by looking at any date dimension. As you’ll see, each day contains information about how it is grouped on the calendar. The information is repeated for each day until one of the groups changes. When designing reports that allow your users to drill down into the data, it is a common approach to start at the highest group (sales by year) and then look a bit deeper as necessary (sales by quarter or sales by month, for example).

Also, in the date dimension, weeks don’t line up too well within a year, quarter, or month. This is a classic example of how one group can fit entirely or partly into another. You see this often when a sales region crosses multiple states, or when an employee serves multiple roles within the company. These are all situations that must be accounted for by your hierarchy manager. For the date dimension, one technique I’ve adopted is to include the week of the year, week of the quarter, and week of the month in the date dimension, which makes it much easier for users to drill into weekly data (for example, you might want to measure holiday sales in the US from the 3rd week of November to the 4th week of December).
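Those week attributes can be derived when the date dimension is built. The sketch below uses one simple definition, counting seven-day blocks from the first day of the year, quarter, or month. That definition is my assumption for the example. Fiscal calendars and ISO week numbering define weeks differently, so adjust to whatever your business uses.

```python
from datetime import date

def week_attributes(d):
    """Derive week-of-year, week-of-quarter and week-of-month for one date."""
    quarter = (d.month - 1) // 3 + 1
    quarter_start = date(d.year, 3 * (quarter - 1) + 1, 1)
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,
        "quarter": quarter,
        # Each count starts at 1 and advances every seven days.
        "week_of_year": (d.timetuple().tm_yday - 1) // 7 + 1,
        "week_of_quarter": (d - quarter_start).days // 7 + 1,
        "week_of_month": (d.day - 1) // 7 + 1,
    }

attrs = week_attributes(date(2008, 11, 20))
print(attrs)
```

With these columns in place, “3rd week of November” becomes a simple filter on month and week_of_month.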

Snowflakes and Hierarchies

Snowflakes are usually a sign that a hierarchy has been normalized. This is bad. Don’t fall into the trap! In order to keep the dimensional model as simple as possible, you should avoid snowflaking; however, snowflake designs are perfectly legal under certain circumstances. Carefully examine the reasons though. If you are snowflaking to accommodate a hierarchy, hold your horses. Hierarchies are a natural part of a dimension. In fact, most things in life are categorized and need to be grouped in some way. Remember that one of the primary purposes of delivering a denormalized dimension is to remove almost all complexity from the user’s perspective. This usually means hierarchies.

SQL Server 2005 Integration Services (SSIS)

Denormalizing hierarchies using SSIS is not difficult. The hardest part is usually in writing the SQL that correctly fetches the right data. In a Data Flow, you can use a series of Lookup and Merge components. In some cases, especially for the more complex ragged hierarchies, your best bet is to use a SQL statement in an Execute SQL Task.

Although I won’t get into details in this post, CTEs (Common Table Expressions) are excellent for working with recursion and hierarchies.
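For a taste of what a recursive CTE does with a parent-child hierarchy, here is a small sketch against a made-up employee table. It runs on SQLite through Python so that it is self-contained. In SQL Server the CTE would be written without the RECURSIVE keyword and would use + rather than || for string concatenation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical ragged org chart: Ann manages Bob and Di, Bob manages Cy.
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER
    );
    INSERT INTO employee VALUES
        (1, 'Ann', NULL), (2, 'Bob', 1), (3, 'Cy', 2), (4, 'Di', 1);
""")

chain = conn.execute("""
    WITH RECURSIVE org (employee_id, name, depth, path) AS (
        -- Anchor: the top of the hierarchy.
        SELECT employee_id, name, 0, name
        FROM employee
        WHERE manager_id IS NULL
        UNION ALL
        -- Recursive step: attach each employee to the manager already found.
        SELECT e.employee_id, e.name, o.depth + 1, o.path || ' > ' || e.name
        FROM employee e
        JOIN org o ON e.manager_id = o.employee_id
    )
    SELECT employee_id, depth, path
    FROM org
    ORDER BY employee_id
""").fetchall()

for row in chain:
    print(row)
```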

The real joy of hierarchy management using the SQL Server Business Intelligence suite comes at the end when you want to start using your hierarchies to allow users to drill down into the data. Even if you have complex ragged hierarchies structured with bridge tables in your model, Analysis (SSAS) and Reporting (SSRS) Services can really make sense of it all. But I’m digressing; this series of posts is about ETL and not presentation tools!

Hand Coding with Visual FoxPro (VFP9)

It’s easy to write recursive functions in Visual FoxPro. Recursion is one of the “secrets” to flattening out hierarchical structures. For more details on writing recursive functions in FoxPro, check out my post “Ragged Hierarchy Alert” or, even better, visit the FoxWiki page on the subject.
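The recursive idea itself is language-neutral, so here is a sketch in Python rather than VFP. It walks a parent-to-children map and emits one (ancestor, descendant, distance) row for every pair in the tree, which is roughly the shape of a helper table. The flatten function and the sample org chart are hypothetical.

```python
def flatten(children, node, ancestors=()):
    """Yield (ancestor, descendant, distance) rows for node and all below it."""
    yield (node, node, 0)  # every node is its own ancestor at distance 0
    for distance, ancestor in enumerate(reversed(ancestors), start=1):
        yield (ancestor, node, distance)
    for child in children.get(node, ()):
        # Recurse, extending the chain of ancestors by the current node.
        yield from flatten(children, child, ancestors + (node,))

# Hypothetical org chart: Ann manages Bob and Di, Bob manages Cy.
org_chart = {"Ann": ["Bob", "Di"], "Bob": ["Cy"]}
helper_rows = sorted(flatten(org_chart, "Ann"))

for row in helper_rows:
    print(row)
```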

From Here

In the end, as I continue to compare SSIS with VFP on data integration, I find that hierarchy management is about equal between the two. FoxPro seems to perform much better but I have no benchmarking (yet) to prove it. One advantage with SSIS is being able to utilize CTEs.

This was a long one, but a heavy subject. In my next post, I’ll discuss the Special Dimensions Manager, ETL Subsystem 12. Not as heavy and hopefully not as long!

Published April 30th, 2008

Missing Him Already…

I’m saddened by the news that Ken Murphy has passed away. I have learned so much from Ken (although we’ve never met) and I already miss him. My condolences to his family…

Here’s how he felt about the “end” of FoxPro:

The decision to cease further development of VFP is one that I believe Microsoft, and especially, the SQL server division will come to regret. I develop database apps for charities, and it has been my experience that most of these charities eventually move to a SQL server back end. I would suggest that the same is true of many small businesses. VFP is a fantastic language for developing entry level database apps for small to medium businesses. The power and speed of VFP allows people like me to develop tools for these smaller organizations that allow them to grow. Evenutally, they outgrow the VFP back end and typically move to a SQL solution. If SQL were a major league baseball team, VFP would be their farm team. I wonder how many major league teams would succeed if they were to get rid of all their farm teams?

This will be how I remember him.

Published April 15th, 2008

ETL Subsystem 10: Surrogate Key Generator

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand-coded in Visual FoxPro.

When integrating data into a Dimensional Model, you need a mechanism to assign new primary keys to each dimension. These primary keys will be used in your Fact table as foreign keys. You cannot use natural keys because they are likely to repeat — this is especially true if you are maintaining history using SCD Type 2 (more on this in a bit). Subsystem 10 addresses this important need, by specifying the need to generate surrogate keys for all dimensions.

Surrogates

Integer surrogate keys should always be used as primary keys for a dimension. In addition, each dimension will have at least one candidate key comprised of the natural key and row status (active or inactive dates) if the dimension is a TYPE 2 SCD. Here is what I wrote in a previous post:

Every dimension contains a single-part surrogate primary key field. This primary key is used to relate to one or more fact tables in the schema. Natural keys (like an employee id) are kept in the dimension, but are not used in joins. This technique ensures a unique value for each record, simplifies joins, and keeps indexes small.

So the concept is simple: Every time a new row is inserted into any dimension, it gets a new primary surrogate key. We use integers because they’re compact, perfect for sequencing, and under most relational databases we can control their ranges. How you generate this primary key largely depends on your performance needs and RDBMS capabilities (see the SSIS and VFP subsections below).

Special Cases

You can treat Date dimensions a little differently. Instead of making the first row in your Date Dimension start at 1, use the date itself as an integer in the ISO 8601 basic format, YYYYMMDD. This makes sense for a few reasons. First, it is easier to look at a row in a Fact table and have the date available without a join to the Date dimension or one of its role players. Second, a Date Dimension is treated as a Type 1 SCD, meaning you would never have the same date repeated with one row being active and the other inactive as in Type 2 SCDs (which would throw off the ISO date format). Third, because Fact tables generally contain several dates, using the integer date key (which is an int in the ISO date format) instead of fetching the formatted date from the Date Dimension may have some performance benefits because you can easily format the date later on in the presentation layer. Lastly, you can usually predict with certainty the total number of rows that the Date Dimension will have, and then pre-populate all the dates before you even do your first integration. This removes the Date Dimension from the surrogate key processing altogether.
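As a rough sketch of that date key, the Python below derives the YYYYMMDD integer from a date and generates the rows needed to pre-populate a Date Dimension for a fixed range. The function names and the one-year range are assumptions made for the example.

```python
from datetime import date, timedelta


def date_key(d):
    # Build the integer YYYYMMDD key from a date's year, month, and day.
    return d.year * 10000 + d.month * 100 + d.day


def date_dimension_rows(start, end):
    # Yield (DateKey, ISO date string) for every day from start to end, inclusive.
    current = start
    while current <= end:
        yield date_key(current), current.isoformat()
        current += timedelta(days=1)


# Pre-populate one calendar year before any integration runs.
rows_2008 = list(date_dimension_rows(date(2008, 1, 1), date(2008, 12, 31)))
```

Because the key is computed from the date alone, the Fact load can derive it directly without consulting the dimension.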

As I mentioned previously, one row in each dimension should have a primary key of zero, which can be used as a symbolic “null”. Remember that Fact tables reference dimensions through foreign keys. You cannot have a null foreign key, so the only viable alternative is to link to a dummy row in the dimension. In dimensional modeling, the acceptable solution is to create a row in each dimension with a primary key of zero. For example, in a Customer Dimension, the zero key might have a customer name of “Unknown” and a DOB of 0 (which is a foreign key that relates to the Date Dimension’s null row). This is a very cool concept because it becomes incredibly easy to produce a query that could give you all the order transactions that did not involve a known customer. Maybe not the best of examples, but try writing a query to give you results of everything that didn’t happen, or wasn’t there, using a relational model!

You can also use -1, -2, and so on to represent other special conditions. For example, a Promotion Dimension may have a zero-key row that represents “Unknown” and a -1 key row that represents “No Promotion”. Perhaps a -2 key could symbolize a “Canceled Promotion”. And so on…
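Here is a small sketch of the special-row idea, again in Python with SQLite as a stand-in. The table names, column names, and sample rows are all made up for the example. The Promotion Dimension carries the 0, -1, and -2 rows described above, and the "what didn't happen" question becomes an ordinary join.

```python
import sqlite3


def build_promotion_example():
    # Hypothetical dimension and fact tables with a few sample rows.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE DimPromotion (
            PromotionKey  INTEGER PRIMARY KEY,
            PromotionName TEXT NOT NULL
        );
        CREATE TABLE FactOrder (
            OrderID      INTEGER PRIMARY KEY,
            PromotionKey INTEGER NOT NULL REFERENCES DimPromotion (PromotionKey),
            Amount       REAL NOT NULL
        );
        INSERT INTO DimPromotion VALUES
            (0, 'Unknown'), (-1, 'No Promotion'),
            (-2, 'Canceled Promotion'), (1, 'Spring Sale');
        INSERT INTO FactOrder VALUES
            (1, 1, 50.0), (2, 0, 20.0), (3, -1, 75.0), (4, 0, 10.0);
        """
    )
    return conn


def order_ids_for_promotion(conn, promotion_name):
    # Every fact row has a non-null key, so an inner join never drops orders.
    rows = conn.execute(
        """
        SELECT f.OrderID
        FROM FactOrder AS f
        JOIN DimPromotion AS p ON p.PromotionKey = f.PromotionKey
        WHERE p.PromotionName = ?
        ORDER BY f.OrderID
        """,
        (promotion_name,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_promotion_example()
unknown_orders = order_ids_for_promotion(conn, "Unknown")
```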

The following sections give some ideas on how to generate surrogate keys in both SSIS and VFP. I’ll discuss using these keys in greater depth when I discuss Subsystem 14: Surrogate Key Management.

SQL Server 2005 Integration Services (SSIS)

I prefer to let the database generate new, unique surrogate keys in all cases. This is usually the safest approach and doesn’t require any real development around the key-generation process.

For SQL Server, select the best int for the job and make it an IDENTITY. Select among the 1 byte Tinyint (0 to 255), 2 byte Smallint (-32,768 to 32,767), 4 byte Int (-2,147,483,648 to 2,147,483,647), and the 8 byte Bigint (-2^63 to 2^63-1) depending on the expected maximum size of the dimension.
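The sizing decision can be written down as a small helper. This is only an illustrative sketch: the function name and the needs_negative_keys flag are my own, and it assumes keys count upward from zero or one.

```python
# Upper bounds of the SQL Server integer types listed above.
INT_TYPES = [
    ("tinyint", 255),
    ("smallint", 32_767),
    ("int", 2_147_483_647),
    ("bigint", 2**63 - 1),
]


def smallest_int_type(expected_max_key, needs_negative_keys=False):
    # Return the narrowest type whose upper bound covers the expected key.
    for name, upper in INT_TYPES:
        if needs_negative_keys and name == "tinyint":
            # tinyint is unsigned, so it cannot hold special keys like -1 or -2.
            continue
        if expected_max_key <= upper:
            return name
    raise ValueError("expected_max_key exceeds the bigint range")


promotion_key_type = smallest_int_type(200, needs_negative_keys=True)
```

Note the interaction with the special rows discussed earlier: a dimension that uses negative keys cannot use Tinyint, however small it is.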

It may not be practical or feasible to make the primary key attribute in your dimensions IDENTITY Ints. Performance may also be an issue when assigning and returning these new keys. There are really two alternatives, my favorite first:

Alternative 1: Use a staging table. Using a lookup/reference table in the staging area is an option that always works. The concept is simple: Create lookup tables that correspond to each dimension in the model. Include the IDENTITY surrogate key column, the natural key(s), and active/inactive dates (for Type 2 dimensions). When processing dimensions, insert new rows in this lookup table just before adding them to the dimension. Use the surrogate key generated from the lookup table to insert into the dimension (you can either return the generated key using SCOPE_IDENTITY() or simply do a lookup/join on the natural key and status). When you need the key for the Fact table, you can do another lookup/join on the staging table. This approach takes pressure off of the dimensional model and creates opportunities to increase performance by optimizing the lookup/reference table. I’ll talk more about the details on this approach when I discuss Subsystem 14.
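A minimal sketch of Alternative 1 follows, in Python with SQLite standing in for SQL Server. The table and function names are hypothetical, and the lookup table keeps only an active-from date to stay short. The key is generated by the staging lookup table, reused for the dimension insert, and later looked up by natural key and date. SQLite's lastrowid plays the role that SCOPE_IDENTITY() would play in T-SQL.

```python
import sqlite3


def build_schema(conn):
    conn.executescript(
        """
        -- Staging lookup table: owns key generation.
        CREATE TABLE StgCustomerKey (
            CustomerKey    INTEGER PRIMARY KEY AUTOINCREMENT,
            CustomerID     TEXT NOT NULL,
            ActiveFromDate TEXT NOT NULL
        );
        -- Dimension: the key is supplied by the ETL, not generated here.
        CREATE TABLE DimCustomer (
            CustomerKey    INTEGER PRIMARY KEY,
            CustomerID     TEXT NOT NULL,
            ActiveFromDate TEXT NOT NULL,
            CustomerName   TEXT
        );
        """
    )


def add_dimension_row(conn, customer_id, active_from, name):
    # Insert into the lookup table first, then reuse its key for the dimension.
    cur = conn.execute(
        "INSERT INTO StgCustomerKey (CustomerID, ActiveFromDate) VALUES (?, ?)",
        (customer_id, active_from),
    )
    key = cur.lastrowid
    conn.execute(
        "INSERT INTO DimCustomer (CustomerKey, CustomerID, ActiveFromDate, CustomerName) "
        "VALUES (?, ?, ?, ?)",
        (key, customer_id, active_from, name),
    )
    return key


def lookup_key(conn, customer_id, as_of):
    # Find the version in effect on as_of; fall back to the zero-key row.
    row = conn.execute(
        """
        SELECT CustomerKey FROM StgCustomerKey
        WHERE CustomerID = ? AND ActiveFromDate <= ?
        ORDER BY ActiveFromDate DESC
        LIMIT 1
        """,
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else 0
```

When the Fact rows are processed, only lookup_key touches the staging table; the dimension itself is never queried.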

Alternative 2: Generate your own key. Another way to achieve the same result is to figure out the last key in the dimension and increment it manually. This number will then be used to insert into the dimension and again into the Fact. The problem with this approach is that if you do not process Facts immediately after each dimension, one at a time, then you will need to store and look up the key in a staging table or in memory. Also, this approach assumes that no other process or thread is inserting data into the dimension at the same time. A concurrent insert would, of course, break the sequence and violate the primary key constraint on the table.

To explore Alternative 2 further, have a look at this post from Marco Russo of sqljunkies.com. He explains how to use an Execute SQL Task in SSIS that executes the following:

SELECT COALESCE( MAX( ID_Dimension ), 0 ) AS LastID 
    FROM Dimension

The claim is that his method is the best, most scalable solution. I don’t fully agree, as my experiences dictate otherwise. But I believe that each approach has merit and depends a lot on the environment.
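For comparison, here is a rough sketch of Alternative 2 in Python with SQLite. It is not Marco's SSIS package, only an illustration of the same idea. It reads the last key once with the query above, numbers a batch of new rows in memory, and returns the natural-key-to-surrogate mapping for later use. The NaturalKey column and the function names are assumptions.

```python
import sqlite3


def create_dimension(conn):
    # Hypothetical dimension whose key is supplied by the caller.
    conn.execute(
        "CREATE TABLE Dimension ("
        "ID_Dimension INTEGER PRIMARY KEY, NaturalKey TEXT NOT NULL)"
    )


def assign_keys(conn, natural_keys):
    # Read the last key once, then increment it in memory for each new row.
    last_id = conn.execute(
        "SELECT COALESCE(MAX(ID_Dimension), 0) AS LastID FROM Dimension"
    ).fetchone()[0]
    keyed = [
        (last_id + offset, natural_key)
        for offset, natural_key in enumerate(natural_keys, start=1)
    ]
    # 'with conn' commits on success and rolls back on error. It does not
    # protect against another writer inserting between the read and the write.
    with conn:
        conn.executemany(
            "INSERT INTO Dimension (ID_Dimension, NaturalKey) VALUES (?, ?)",
            keyed,
        )
    # Keep the mapping in memory so Facts can be keyed without a second lookup.
    return {natural_key: key for key, natural_key in keyed}
```

The comment inside assign_keys is the weak spot mentioned above: if a second process inserts between the SELECT and the INSERT, the primary key constraint will reject one of the batches.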

Hand Coding with Visual FoxPro (VFP9)

For VFP, use an int data type with AutoIncrement enabled. As with a SQL Server implementation, I prefer using this approach to generating keys manually. Even still, both alternatives mentioned above are also available in FoxPro:

Alternative 1: Use a staging table. Here, as with SQL Server, use a staging table with AutoIncrement enabled on the primary key. As a consequence, you will not enable AutoIncrement on the dimension. To return the newly added key from the lookup/reference table, you simply use the GETAUTOINCVALUE() function and return it.

Alternative 2: Generate your own key. You can also determine the last key used in the dimension and manually increment it. Personally, I think this approach offers very little to a VFP developer. Alternative 1, or simply using the Dimension for generating the surrogate, is by far the best approach.

I like VFP for these types of tasks. Whatever you decide, you don’t really need to worry about returning the generated key. You typically insert and update all your dimensional data first, and then return to the dimension later to look up keys just before inserting the facts. Nothing beats an old-fashioned SEEK in this scenario, which can be light years faster than building a select with a where clause, depending on how crafty you are with Rushmore and index building.

Compare this:

SELECT PrimaryKey ;
    FROM Dimension ;
    WHERE NaturalKey = 'SomeValue' ;
        AND StatusFlag = .T.

to this:

* SEEK() returns .T. and moves the record pointer when a match is found
llFound = SEEK('SomeCandidateKey', 'Dimension', 'CandidateIndex')

If you are using Alternative 1, then instead of SEEK, you could build a JOIN between the lookup table and the data being processed for the fact. This set-based approach can be pretty speedy as well.
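The set-based variant might look something like the sketch below. It is written in Python with SQLite rather than VFP, and every table and column name is hypothetical. A LEFT JOIN against the lookup table resolves the keys for a whole staged batch in one statement, and COALESCE sends unmatched rows to the zero-key "Unknown" row.

```python
import sqlite3


def build_example(conn):
    # Hypothetical lookup, staging, and fact tables with sample rows.
    conn.executescript(
        """
        CREATE TABLE CustomerKeyLookup (
            CustomerKey INTEGER PRIMARY KEY,
            CustomerID  TEXT NOT NULL,
            IsActive    INTEGER NOT NULL
        );
        CREATE TABLE StgSales (
            SaleID     INTEGER PRIMARY KEY,
            CustomerID TEXT NOT NULL,
            Amount     REAL NOT NULL
        );
        CREATE TABLE FactSales (
            SaleID      INTEGER PRIMARY KEY,
            CustomerKey INTEGER NOT NULL,
            Amount      REAL NOT NULL
        );
        INSERT INTO CustomerKeyLookup VALUES
            (1, 'C100', 0), (2, 'C100', 1), (3, 'C200', 1);
        INSERT INTO StgSales VALUES
            (1, 'C100', 10.0), (2, 'C200', 5.0), (3, 'C999', 7.0);
        """
    )


def load_facts(conn):
    # Resolve surrogate keys for the whole staged batch in one statement.
    with conn:
        conn.execute(
            """
            INSERT INTO FactSales (SaleID, CustomerKey, Amount)
            SELECT s.SaleID, COALESCE(k.CustomerKey, 0), s.Amount
            FROM StgSales AS s
            LEFT JOIN CustomerKeyLookup AS k
                ON k.CustomerID = s.CustomerID AND k.IsActive = 1
            """
        )
    return conn.execute(
        "SELECT SaleID, CustomerKey FROM FactSales ORDER BY SaleID"
    ).fetchall()
```

In this sample data, the staged sale for C999 has no entry in the lookup table, so it lands on key 0 instead of being dropped.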

If you insist on exploring Alternative 2, the translation from Marco’s TSQL code to VFP is as follows:

SELECT NVL(MAX(ID_Dimension),0) AS LastID ;
    FROM Dimension

In practice, I simply use VFP9’s Autoincrementing Integer field in each Dimension. I don’t bother with a staging lookup table because with VFP, I can just set the order to the natural key and do a SEEK on the dimension. It’s a fast, simple, and consistent approach.

VFP Really Shines

In this series of articles, I have attempted to highlight some of the differences between implementing the 34 Subsystem best practices in both a hand-coded environment (VFP) and through a tool specifically designed for the job (SSIS). In each case, both VFP and SSIS have really been about equal (this is a non-scientific study, of course!). SSIS has advantages that are hard to duplicate in VFP (some of which I will be highlighting later on in this series), but in this case, VFP really shines. While both databases can handle generating keys for us in similar ways, VFP’s SEEK command makes it fast and simple to look those keys up again when it is time to use them!

I don’t want to dwell on this any more. I’ll discuss more details on how to use the surrogate keys when I talk about Subsystem 14: Surrogate Key Management.