Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘SQL Server 2005’


Published July 10th, 2008

ETL Subsystem 16: Late Arriving Data Handler

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Sometimes, not all dimension data is available when a transaction (i.e. business event, such as an order) comes into the Warehouse. This is a fact of data warehousing life. Under normal circumstances, you process dimensions first so that you have all your keys ready for inserting into the fact. When dimensional data is unavailable, missing, or incomplete you end up with a bit of a problem. NULL foreign keys in the fact table are not allowed.

Missed the trainOther times, your fact comes in late. In this scenario, you have all your dimensions, but now you need to insert a fact into the fact table and link it to existing dimension rows that were active at the time of the fact. Late arriving facts are easier to work with than late arriving dimensions, in my opinion. I already discussed some techniques for handling this in my post “ETL Subsystem 14: Surrogate Key Manager“.

I find late arriving dimensions (also called “early arriving facts”) to be a more difficult problem. There are a few of good solutions though, all of which are handled during ETL:

  • Hold onto the fact until all dimensions arrive
  • Create a dimension called ‘unknown’ or ‘not available’ with a primary key of -1 and use that
  • Insert a dummy row in the dimension and populate it with whatever you can

None of these options are terribly appealing across all scenarios; however, each has a place under certain circumstances.

Hold Onto the Fact

Holding onto the fact in your staging area may be a great option if, for example, you expect the late arriving data to be coming in soon. Perhaps you process data from multiple source systems. System A contains the transactions, while systems B, C, and D combine to give you your dimensions. Because of your extraction windows, you may not be able to get data from system D until later in the morning, meaning that the business events from system A are missing some context. Holding onto these events and waiting for the data from system D is the appropriate action!

Even if the situation is not predictable, holding onto facts might be the best option. If you are loading daily stocks, for example, and for some reason a particular company is not available (a new company perhaps), but the stock is showing up in your extracts, you should hold onto that stock in your staging area until the company is available. You don’t want that stock in the warehouse until the full details of the company comes in (which could be available in a separate extract, or may need to be manually entered).

As you can see, deciding to work this way has implications on the ETL processes and staging area. Planning for these circumstances up front is a must.

The ‘Unknown’ Dimension

Creating an empty row in the dimension called ‘Unknown’ is also an option that is good under some circumstances. If you are processing orders and no information about a promotion comes in, then it would be safe to link the fact to a special row in the dimension that denotes that no promotional information was available. Depending on your dimension and business requirements, you could have many different levels of ‘unknown’.

Here are a few suggestions:

  • -1, ‘none’
  • -2, ‘unknown’
  • -3, ‘not applicable’

If you use key -1 in your fact row for the foreign key to the promotion table, you allow your users to write queries and reports for all orders that were made when there was no promotion (promotion.name = ‘none’). The same is true for all facts that have an unknown promotion status, meaning there could have been a promotion, but the data was incomplete. Lastly, there are cases — if you do online orders for example — where promotions simply don’t apply.

You can expand these keys heading backwards with lots of detailed descriptions telling you why a dimension might not exist for an event. Don’t go overboard though! Having a few options is a good idea for data quality and analysis purposes. But too many may overload users with too many choices all too similar to be useful.

This approach works well when the likelihood of acquiring the context later is slim-to-none. If you use this method, it gets very complex if you later do in fact find that there was some context (i.e. a promotion). Your only option then is to revisit your facts and look for, and update, the fact rows corresponding to the late arriving data. On the bright side, if you’ve already tagged these rows with a foreign key of -2, for example, you may be able to focus only on those facts — saving you some process time.

Inferring a Dimension

In the company stock example I gave above, you have another option. If the company details of a stock is not known — and all you have is a SEDOL or Ticker — then simply insert a new row into the dimension with all of the information you know to be true about the dimension. This only works if you have enough details about the dimension to construct the natural key. Without this, you would never (although, I’ve learned to never say never) be able to go back and update this dimension row with complete attributes.

To make this work, and if you are using Type 2 SCDs, you will need another attribute in your dimension. Call it “inferred” or similar, make it a bit or char(1), and mark it as true (1, or “Y”).

This has a major implication on your dimensional processing. Now, before you process any Type 2 changes on any dimension that contains an attribute called “inferred”, you need to check if the row you are ready to work on is in fact inferred. If so, treat the entire row as Type 1. Because it is inferred, we don’t have the details anyway. If we went along and processed the row as a Type 2, then your fact will continue to point to an essentially empty row.

SQL Server 2005 Integration Services (SSIS)

The SCD Component handles the inferred dimension approach quite nicely. In fact, “Inferred Dimension” is a term introduced by SSIS (I believe). I talked briefly about the SCD Component and inferred dimensions in my post “ETL Subsystem 9: Slowly Changing Dimensions (SCD)” so I won’t repeat myself here. The bottom line is that SSIS has a very elegant solution to the late arriving dimension problem. If you have them — and almost every data warehouse has late arriving dimensions — then using the SCD Component and the inferred dimension option is must. Also check out this external link for more inspiration: “Handling Late-Arriving Facts“.

Hand Coding with Visual FoxPro (VFP9)

While SSIS has built-in capabilities to handle the inferred rows, VFP does not. This is not surprising of course because VFP is a language and not an ETL tool! With that said, writing SQL to accommodate the “inferred” flag in a dimension row is not a big deal. The processing logic for a SCD Type 2 dimension then becomes (in pseudo-code):

 
For each row
  If any attribute marked as 'Type 2' changes AND inferred = "N" Then
    update status of active/current row
    insert into dimension a new row
  End If
  For each attributes marked as a 'Type 1' change OR inferred = "Y" Then
    If attribute is subtype 1
        update all rows that match the key
    Else
        update only the current/active row that matches the key
    End if
  End For
End For

I discussed the above pseudo-code in more detail in my post “Loading the Slowly Changing Dimension“. Check that out for more insight into some of the logic. I should mention again that this is a very basic structure and not intended to be comprehensive or de jure.

From Here

Enough about late arriving data! Late arriving data, changed data capture, and surrogate key lookups might be the most challenging aspects of ETL — especially when it comes to performance and efficiency. It will take you some time to get these right. You may — like me — be forever tweaking these areas to squeeze every last second out of the process.

Next, I’ll move on to discuss Dimension Management. A much lighter topic!

Published July 2nd, 2008

ETL Subsystem 15: Bridge Table Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Before I can discuss this subsystem, I need to discuss bridge tables. This is a topic I haven’t covered much until now. So to help me illustrate how bridge tables are used in a dimensional model, I’ll present a case for a real-estate property transaction.

Sometimes the defined grain of a fact has multiple dimension rows that characterize it. These scenarios are referred to as many-to-many or multi-valued dimensions. A patient encounter at a hospital might have multiple procedures. Multiple bank accounts might belong to a single person. In the case that follows, I’ll show how multi-valued dimensions help support property sales (i.e. a real estate transaction).

I would like to make one important point before I begin: If you can thwart the need for bridge tables by redefining the grain of your fact table to be more atomic, then you should do so. Bridge tables add complexity to the model, ETL process, and the end-user application layer. With that said, multi-values bridge tables in many environments are unavoidable and absolutely necessary.

Case Study: A Real Estate Transaction

In the case of a property transaction (aka a property “sale”), you are potentially dealing with several many-to-many relationships. First of all, you may not be buying alone (one-to-many buyers). You may also not be buying your house from a single person (one-to-many sellers). Your house may be on two parcels of land or no land at all if you are buying an apartment or condo (none-to-many parcels). A barn or in-law apartment may be included or perhaps you are just buying land to build your dream home later (none-to-many buildings). These and other complexities make the case for bridge tables to represent the many-to-many relationships inherent in a complex transaction such as this.

When you buy land or a house (in the US, at least!) you must “record” the sale with the clerk of the municipality in which the property exists. This recording becomes part of the chain of ownership and tax record for the property. This recording can accommodate multiple grantors (the people or organizations selling the property), grantees (those who are buying the property), and properties (multiple parcels, land lines, lots, buildings, outbuildings, units, etc). This recording is given a book and page and filed away in a database. A hard copy is typically stored in a physical book in the clerks’ vault as well. It’s pretty fun (for some people who are into history) to go to your clerk’s office and peruse the vault. Anyway…

What follows is a simplified ER model from an OLTP system I designed several years ago. This model represents the complexities of the above case in a transaction environment. It has been simplified to only show the more important business entities:

Property Sales Schema - OLTP

Note that in the above OLTP system, table Real represents land and building values from tax billing software. These figures have been aggregated into a single row (i.e. total land value and total building value, number of buildings, total acres, etc.). For the above model, knowing the exact building or land line is not important. For the Data Warehouse, though, we want and need this atomic detail for some detailed analysis. The Deed table represents the actual recording (book, page, date, time, etc.). The bridge table ND_link contains grantor and grantee information. NR_link represents ownership history between the names and entities in Namefile and the aggregate property information in table Real.

The business process dimensional model could therefore be modeled as follows:

Business Process Dimensional Model for a Property Sale

As you can see, there are bridge tables between the SaleFact and DimBuilding, SaleFact and DimLand, and SaleFact and DimPerson dimensions. I typically name these tables by including the word “Group” in its name (LandGroup, BuildingGroup, etc.) but you could use the word “Bridge” or any other term that is consistent throughout your models. These bridges define the many-to-many relationships that are normal for a Real Estate transaction. Because a property sale is a natural business event, and cannot be logically defined to be any more atomic, multi-value bridge tables are a must.

In the above schema, a particularly astute and observant modeler will notice the many-to-many relationship between the keys in the fact and bridge. FoxPro, which is where I put together these examples, will not allow a many-to-many join between two keys. So, I Photoshopped the graphic in order to show you the logical representation! In FoxPro, and in many other database systems, you will need another table in between the Group and the Fact with a single attribute primary key. This would represent the physical implementation of the above design. You would therefore need two physical tables to represent the many-to-many relationship. I try my best to avoid this. In FoxPro, for example, I would simply not create the relation at all! Or in some cases, the bridge group would be unique for each event, making the foreign key in the fact table always unique. Being unique, it can be defined as a candidate key and then happily joined to the bridge table (more on this in the section below on “special considerations”).

Bridge Table Characteristics

Bridge tables are not just janitors or pianos (well, they do have all they keys!). Most bridge tables also contain attributes. For the above case, I include two attributes in each bridge: weight and primary indicator.

Weights are used to allocate metrics in the fact. Weights always add up to 1 across the group. If you need to allocate the sale price across all buildings involved in the sale, you would use the weight. Weights are calculated as part of ETL process and are an important part of the mechanisms of Subsystem 15. Sometimes, it is perfectly OK to divide 1 by the number of group records to get the weight (applying any fraction to the first row perhaps). But more often than not, there is a more complex algorithm to determining the weight. For land, you might base the weight on lot size or price code. For buildings, you may base the weight on the replacement cost of the building, or you might use square footage. All of these are defined by the business and applied during ETL.

I use primary indicators because some towns only care about the primary grantor, grantee, building, or parcel for analysis. One row in the bridge table will be identified as ‘primary’. During ETL, the primary row is identified by using a sequence number (number 1 = primary) or by determining the highest valued property. Sometimes, its arbitrary, in that you just mark the first one encountered as primary. These decisions are part of your business requirements and must be carefully implemented during data transformation.

Other Considerations

Guess what? Your bridge table is really a type of fact table. In the above schema, LandGroup is a sort of fact with some semi-additive metrics that represent the relationship between a parcel of land and a sale.

In my example, a sale is a static event. The buildings, land, and people involved will change over time, but the groupings used at the time of the sale will remain the same. For this reason, there is no need to make the group tables Type2 dimensions. If we ever need to change a value in one of these bridge tables, then it is likely to correct a data problem and not to represent a changing dimension. Additionally, in this scenario, it would be a safe assumption — especially because there are weighting factors and primary indicators involved in each bridge — that each grouping is unique to the fact. For modeling purposes, you could avoid the many-to-many relationship issue discussed in the previous paragraphs by making the bridge keys in the fact table candidate keys. There would be a unique set of group keys for these bridge tables. But be cautioned: The volume of property sales is small in any given year for any given municipality. You’ll never see a million transactions in a year. This makes the candidate key approach attractive. If you are dealing with patient diagnosis across multiple hospitals then your bridge tables will become too large. In these cases, you’ll need to consider re-using groups. The consequence of reusing groups is that the key in the fact table can no longer be a candidate.

To be clear: Bridge tables can also be a Type 2 SCD. If this is needed, and tracking bridge table changes over time is important, then you will need to combine bridge building with your SCD logic. You will need activity dates (from and to), and a current row indicator.

As I said at the top: If you can thwart the need for bridge tables by redefining the grain of your fact table to be more atomic, then you should do so. Bridge tables are not for the faint of heart!

SQL Server 2005 Integration Services (SSIS)

Bridge table building is largely a modeling and business rules problem. There is nothing in SSIS that will prohibit you from implementing even the most complex bridge table design. Most difficulties will come when managing the keys and implementing weighting factors. For the keys, you can use the same techniques described in my posts about Subsystem 10 and Subsystem 14. For weightings, I’ve implemented weighting algorithms using both stored procedures (Execute SQL Task) and VB (Script Component) when appropriate.

Hand Coding with Visual FoxPro (VFP9)

The above case is taken from an actual implementation I created in Visual FoxPro. I must say, it was rather simple once I ironed out the nuances and had my weighting algorithms in place. Developing the bridge table builder subsystem was a simple matter of inserting and linking the fact with the multi-valued entity and applying the weight and primary indicator logic. The resulting analytics that can be derived now far outshines that which could be done with the original OLTP model. Not to mention, the dimensional model has atomic detail (not like the aggregate Real table shown in the first schema) which allows for more slicing and dicing. But I digress!

From Here

Next post, I’ll discuss Late Arriving Data (Subsystem 16). That’s a big topic, so I’ll distill it as much as possible to keep it brief!

Published June 24th, 2008

ETL Subsystem 14: Surrogate Key Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Surrogate key management is an important part of designing an ETL system using the Kimball Methodology. Slowly Changing Dimensions require the use of surrogates because natural keys, like a Customer ID, Product ID, and so on, will be repeated in dimensions whenever Type 2 changes are tracked. Surrogates are also desirable because they are single-part and fast for joins. I’ve already discussed the need for surrogates in several posts. Here, I’ll talk about some techniques for replacing the natural keys that come in from the source systems with the new surrogate keys that exist in the data warehouse.

The idea is simple in theory: For each dimension row, you generate a surrogate key. This surrogate key replaces the row’s candidate key and is used to link to one or more Facts.

For example, the candidate for a Customer dimension may be CustomerID + ActiveFromDate. The ActiveFromDate is used to sequence the row if Type 2 change tracking is used. It is entirely possible to have the same CustomerID listed dozens of times in the dimension. The differentiator is the ActiveFromDate. This combination is guaranteed to be unique in the Customer dimension (although you would not normally enforce this using an index, you would certainly enforce this through ETL). This is a simple example. When modeling Parts or Real Estate, you may have several attributes combining to form a candidate key. This also becomes a bit more complex when Customer data is refreshed throughout the day — meaning you cannot rely on the date alone, you also need to consider time.

In dimensional modeling, you will normally process dimensions before facts. Because of this, you will have all surrogates defined for each row you need. In a subsequent post, I’ll talk more about situations where you load a fact before a dimension, or situations when late arriving data must be handled. For now, assume that all dimensions have rows that correspond to a fact. When you process the fact, you simply need to look up the surrogate key assigned in the dimension and use that key as the foreign key to that dimension in the fact.

Dimensional modeling essentially requires surrogate keys, and does so by taking a performance hit during ETL. It takes time to look up surrogate keys. You will spend a lot of time yourself making this process as fast and efficient as possible. I do have a few suggestions, though:

Load by the Day

Process facts one day at a time (all of yesterday’s data plus any changed data or late arriving data from previous days). Doing so allows you to join your staging fact row by the natural key in the dimension for that day. For example, in SQL, you could acquire all dimensions that are valid for a particular day by doing the following:

SELECT DISTINCT NaturalKey 
    FROM Dimension 
    WHERE DataDate BETWEEN ActiveFromDate AND ActiveToDate

This returns a subset of the dimension for the day in which the fact occurred (DataDate). If the candidate key involves more than the natural key, you will need to adjust the above SQL to be more robust. For example, you may have the name and birthdate of a patient at a hospital with an insurance number and insurance company code that uniquely identifies the row! That’s six attributes (if you include the ActiveFromDate) that uniquely identifies the row when combined.

With the above, you can incorporate the query into a JOIN on the staging fact data (by selecting all facts that occurred on DataDate), or you could use it as the source for a lookup operation. By narrowing the focus to be a single day (and you load by the day), you eliminate the need to worry about SCDs.

You can further enhance the above by maintaining special staging tables for each of your dimensions. These staging tables contain all candidate attributes and the surrogate key for each dimension. These special staging tables are maintained while you process your dimensions. Instead of selecting against the actual, production dimension, you can use the staging table. Although there is a little extra overhead in maintaining the staging tables, you get it all back — and then some — when you do your key lookups.

If you need to load by time of day (essential for real-time or intra-day ETL), you can still use the above method, except that you will need to join your fact data on time of day as well.

You can get around using BETWEEN if you are processing the most recent data. Instead, you could filter results by the dimension’s “current” flag. The current flag is set while processing the dimension. Only one row, for particular natural key, can be marked as current. A predicate on a current flag would be faster than checking between dates. A common trick is to split the rows into two streams: One with current data (anything that happened yesterday), and one with historical data (anything that happened the day before yesterday and back). Use the current flag for the current data, and use the BETWEEN for historical data.

I find that this technique works well for small dimensions with a low number of facts (a few thousand transactions a day). With a good index plan, this might be all you need to do.

Store the Surrogate Right Away

Another technique worth mentioning is particularly useful when dealing with very large dimensions and lots of facts. You should also consider this technique if you are manually generating your surrogates (see my post “ETL Subsystem 10: Surrogate Key Generator” for specifics). Basically, when you generate your surrogate key for a dimension row, store it with the fact data at the same time.

This technique requires writing the surrogate to the dimension and the fact staging table. Using an UPDATE statement is a simple approach, but a more exotic and better performing method would be to INSERT the new keys into another staging table, and joining to the staging fact table later on.

Either way, there is a performance hit with storing they key right away, so be sure to thoroughly review your environment and needs. I have used this technique before with great success, but it took me a long time to determine it was the best approach. In fact, my original design worked great until the dimension grew to a few million rows and the transaction load doubled. Doing a lookup using BETWEEN proved to be too slow compared with saving the key immediately.

SQL Server 2005 Integration Services (SSIS)

Jamie Thomson wrote an article on how to use SSIS to update fact tables (actually, most of it is a case study written by “Mag”). I won’t repeat what was in that post. Have a look. Jamie’s technique is basically what I have come to adopt myself. You’ll notice the liberal usage of Lookups instead of Joins (joins, at least for me seemed like a more logical starting point). Lookups perform better for a few reasons. The Merge Join operation requires a Sort — either by using the Sort Component or by ORDERing the data through the Source Component. This sorting can be expensive. Lookups, in addition to not requiring that the input be sorted, can use caching and limit access to the database. For these reasons, I use Lookups now almost exclusively.

When using the Lookup approach you must be aware of two possible conditions: The Lookup could return more than one match if you don’t properly define your candidate key. Also, if no match is found, common for late arriving data, you need to either divert the errors and handle them separately, or convert the NULLs to your default “unknown” value for each dimension. For me, I typically use negative numbers to define various types of null (-1 for “not known”, -2 for “not applicable”, etc…).

If you process by the day, use staging tables as dimension lookups, and rely on Lookups instead of Sorts and Merge Joins, then fact processing should go quite well.

Hand Coding with Visual FoxPro (VFP9)

As I discussed in my post “ETL Subsystem 10: Surrogate Key Generator “, VFP’s SEEKing capability far outshines set-based approaches to this problem. Check out my comments in that posting for some ideas on how to use VFP to get the surrogate keys for your fact data.

Not to get lost though in the wonders of SEEK, VFP is perfectly suited for set-based approaches as well. So all of the methods discussed in the article (Joins and lookups) can be implemented easily in FoxPro. As I wrote in that post, VFP really shines for this type of task.

From Here

Next post, hopefully later this week, I’ll discuss bridge tables and how to best build them. I have an interesting case study to walk through. I have refrained from using cases in my ETL subsystem posts, but for bridge tables, I am making an exception!

Published June 16th, 2008

ETL Subsystem 13: Fact Table Builders

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Fact tables are the heart and soul of the business process dimensional model. I discussed fact tables already in a previous post (”Fact Tables“), so I won’t repeat myself here. Recently, I also provided a more formal definition of a fact table as the: “central table of a Star Schema with numeric performance measurements, identified by a composite key, each of whose elements is a foreign key drawn from a dimension table”.

Three grains define fact tables: Transaction-level, Accumulating Snapshot, and the Periodic Snapshot grain. Refer to my post referenced above (”Fact Tables”) for more information on granularity.

Subsystem 13 is responsible for the construction of all types of fact grains.

Transaction grain fact tables are always the largest and should contain data at the finest grain possible. This would include a line on an invoice, a patient diagnosis for a series of emergency room events, or even a Josh Beckett fastball. Data is almost always inserted (via bulk load) into transaction-grain facts. Deleting rows is dangerous (more on this in a second), and updates should only occur in order to correct an error or update some late-arriving key. In some cases, I have even gone back into a transaction grain fact table in order to update a calculated field.

Instead of deleting data in a transaction-grain fact table, “negate” the row instead. Otherwise, you run the risk that some of your historical reports and analysis will be inconsistent. To negate a row, simply add a new row to the fact, with identical foreign keys to the dimensions as the row you are negating, but update the metric so that when both rows are summed (or counted, etc…) they result in zero activity. For some metrics, this is tricky, but if you are using mostly additive and semi-additive metrics in your facts, you should be OK with this approach.

For the periodic snapshot grain, there are some differences in the loading process to consider. First, as these facts deal with aggregate data (based on daily, weekly, monthly, quarterly, etc…) time spans, it is appropriate to only load them when a period is complete. Secondly, extremely careful consideration for the grain of the periodic snapshot must be honored because it is too easy to apply an aggregate calculation inappropriately against the stated grain. Third, and as a compliment to my first point, it may be necessary to aggregate some running totals. If for example you are building a monthly periodic snapshot for a checking account at the bank, you would normally also provide data from the end of the last period to date.

Accumulating snapshots represent some process as it evolves over time. Therefore, rows in the fact are constantly updated to represent this evolution. For example, a patient may enter the ER, check in at triage, get assigned a room, see the nurse, see the doctor, etc. A building inspector may visit a new construction before work is done, when the foundation is set, when the framing is up, when the electrician as finished, etc. Loading this type of snapshot is much different from the other two, in that it tracks some defined process over time in a single row. There are many updates, and additional logic is needed to handle the loads.

SQL Server 2005 Integration Services (SSIS)

I find working with the various types of fact tables in SSIS to be rather simple. The key for success is usually in how the Control Flow is sequenced (facts generally come last), when dimensions are loaded, and where you do your dimension key lookups. In essence, all you need is a source containing your metrics (line items on an invoice pulled during the extract phase), a series of lookups (or joins, whichever provides the best performance in your situation), and some way to load the fact (bulk insert if possible).

Some advice: Do your UPDATEs in an Execute SQL Task. Do not try to use the OLD DB Command. Especially — and I mean especially - if you are doing a large amount of updates and your fact table is also large. To get this to work, offload the data you need for your updates into a separate staging table. Use the Execute SQL Task and write an appropriate UPDATE statement which joins your fact to your new staging table. The same approach holds true for any rows you might be deleting.

Hand Coding with Visual FoxPro (VFP9)

I find working with fact tables in FoxPro to be just as easy as in SSIS. There is no clear advantage of one over the other. One major benefit with VFP is that it is an excellent OOP language with very fast cursors (a VFP Cursor is much easier to work with than an SSIS Raw file). This gives you many more options on how you look up and populate foreign keys for the fact at load time.

If you are using VFP tables for your RDBMS, you will need to be careful for the 2 gig size limitation for your fact tables. Your Fact Table Builder may have to handle partitions manually. This is a major disadvantage with using the VFP database. Especially because 2 gigs is not a lot of space for daily transactions!

From here

In the next article, I’ll discuss ETL Subsystem 14: Surrogate Key Manager. Subsystem 14 helps to maintain the integrity among facts and dimensions for each business process, and is a complement to your Fact Table Builders.

Published June 6th, 2008

Fun with Serial Dates

I ran into a bit of an oddity today while working with serial dates in SQL Server 2005 and MS Excel 2003. I’ve dealt with this before with Excel using FoxPro, but don’t recall the specifics. Anyway…

A serial date is a sequential number, starting on January 1st, 1900 (or if you want compatibility with a Mac serial date, 1904) that represents the number of days since any date to that point in time (you can also represent time by using a decimal, but I won’t get into that here). Excel has a bug (or a “Lotus 123 compatibility feature” as some like to call it) in it that counts Feb 29, 1900 as an actual date, when in fact 1900 was not a leap year.

Hence the first bit of fun. Run this TSQL code:

SELECT DATEDIFF(dd,'19000101','20080606')

The above query returns ‘39603′. So far so good. In SQL Server, you can now derive the date from the serial value by using convert:

SELECT CONVERT(DATETIME,DATEDIFF(dd,'19000101','20080606'))

…which gives me ‘2008-06-06 00:00:00.000′. So far so good.

The reason we’re using serial dates in the first place is to accommodate some models developed in Excel. The data is fed in directly from the database and calculations are performed.

When I take value 39603, put it into an Excel cell and re-format the column from General to Date, I get ‘June 4, 2008′! Two days off from what I get from SQL Server.

I already mentioned that Excel counts Feb 29, 1900. That’s one day. So why am I off by two? Here’s some more fun. In SQL Server, try the following:

SELECT CONVERT(DATETIME,0)

You get ‘1900-01-01 00:00:00.000′. But January 1st is supposed to be day one. Not day zero (must be a zero-based array sort of thing). When you try the same in Excel, you get a more appropriate (but equally odd) value of ‘January 0, 1900′. You’d think I’m dealing with scientific ephemeris or something.

SELECT CONVERT(DATETIME,1)

So, in SQL Server, we’re off by a day. The first day shouldn’t result in the second day (’1900-01-02 00:00:00.000′) of the year as the above TSQL shows.

This helps to explain the 2 day offset between Excel and SQL Server.

Come on Redmond! Work with me here!

To solve (read: work around) the problem, we’ve decided to stick with keeping Excel happy (against my recommendation, but at the end of the day, this is a business decision). Our data integration packages add two days to the serial date we calculate from a YYYYMMDD value. When data is moved to Excel from the data warehouse, no transformation is done on the serial value and Excel calculates as expected. When we use the serial date in a SQL Server context, we need to subtract 2 before the convert.

This is not a very good design and will undoubtedly cause someone problems some day.

I recommend storing the ISO YYYYMMDD format instead. But Excel won’t format a value such as 20080606 to a date. What a shame! If anyone knows of a way, or can enlighten me more about this subject — do not hesitate!

Published May 26th, 2008

ETL Subsystem 12: Special Dimensions Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Each dimensional model that you create will have special design characteristics and requirements. This typically means that you may need to rely on specialized dimension tables. Not all dimensions will be “Product” or “Customer”. In fact, you will likely have more specialized dimensions in your models than actual normal, business dimensions.

There are several categories of special dimensions. They are:

  • Date and Time Dimensions
  • Bridges to support many-to-many relationships
  • Special Indicators and Flags Dimensions
  • Study or Research Group Dimensions
  • Mini-Dimensions
  • “Current” Dimensions
  • Lookup and other Static Dimensions
  • Administration Dimensions (such as special logs, monitors, and audits)

Much of the above has already been discussed in my posts to date (date dimensions, bridges, and lookup dimensions for example). So not to beat a dead horse, I’ll spend the next few paragraphs explaining those I haven’t really talked much about yet.

Special Indicators and Flags Dimensions
These dimensions, also referred to under some circumstances as “junk” dimensions, exist to store extra information about the fact when it occurred. I typically use these special dimensions to hold flags (for example, if a tax payer has a history of delinquency, if a certain diagnosis and age creates greater risk for death, etc..). I usually store these flags as a tiny int with three settings: -1 null, 0 off, and 1 on. For researchers, this table acts as a way to screen large amounts of data and focus efforts on a subset of events with certain characteristics.
Study or Research Group Dimensions
Whether I’m working with Qualitative Analysts or Quantitative Analysts, they always seem to need specialized research databases (and they all seem to like using SAS too!). In the past, I got in the habit of creating copies of the required business processes dimensional models for their use, filtered and sampled appropriately. These databases would be installed in an isolated environment. This isolation allowed them to develop models, conduct what-if analysis, and to manipulate the data to test theories. This gave the researches their own data to play with and everyone was happy. Another solution — a better solution — which I have only recently begun to implement with more regularity is creating a special study group and/or research dimension which defines the research being conducted (”Infection Research”, “buying patterns of customers in Alaska in January”, etc..). A bridge table between the subject (Customer? Product? Stock? Web Page?) links the study with the sample used. All queries for that study simply use the new special dimension as an additional filter. And everyone’s happy!
Mini-Dimensions
Mini-dimensions are necessary when a subset of attributes (typically related) need Type 2 change behavior tracking, the changes happen frequently, and the dimension is rather large (such as customer, webpage hits, or stock index characteristics). A specialized mini dimension is created with the required attributes, stored at a grain that represents a combination of attributes. This mini dimension does not contain a link back to the parent dimension, nor does it contain the natural or primary key of the dimension. Each row is instead given a unique key that is referenced alongside the parent dimension in the fact table. For a large Customer dimension, you may create a mini-dimension with income bands, year-to-date behaviors, job status, and other related financial and purchase information. This “profile” might be shared by hundreds or thousands of customers and change frequently. Therefore, the data you store would not be identifiable on its own to any single customer, but to all customers who share the characteristics.
“Current” Dimensions
There are times when large Type 2 SCD dimensions become too unwieldy for querying and other analysis; or, what you really want is a Type 1 dimension to do most of your work, but need to maintain Type 2 changes over time to accommodate some special need (like compliance). The solution is to copy out the “current” rows of a dimension (using a current flag or date range) into a second, special Type 1 dimension. I’ve had success using this technique using three methods: (a) create a role-playing view on the dimension using only the current rows (one challenge with this method is that you need to store a special key in the parent dimension so that you can use the view to relate to a fact table row); (b) by creating a new Type 1 dimension through ETL and placing a new foreign key to my table in the fact; and (c) by using the Type 2 dimension as a bridge table, linking to the special dimension through its natural key (which can be a view or a physical table created through ETL).

SQL Server 2005 Integration Services (SSIS)

Other than building the dimensions, there is nothing special from an SSIS point of view. You will generally handle the creation of the special dimensions in separate packages (dates and lookups for example) or as part of normal dimensional processing (which would be the case for mini-dimensions and current dimensions for example).

Hand Coding with Visual FoxPro (VFP9)

Same as with SSIS. Nothing special. If you can process a dimension, you can process a special dimension. Most of the onus falls on business requirements, data modeling, and overall data warehouse design. Non of which are inhibited by Visual FoxPro.

From Here

This subsystem is deigned to be a place-holder for handling special dimensions. I have found that a great deal of value found in a DW/BI solution actually stem from these. All those special flags, scoring results, quality assessments, and additional features that we’ve painstakingly added through ETL are now available.

In my next post in this series, I’ll discuss lucky subsystem 13: Fact Table Builders.