This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Fact tables are the heart and soul of the business process dimensional model. I discussed fact tables already in a previous post (”Fact Tables“), so I won’t repeat myself here. Recently, I also provided a more formal definition of a fact table as the: “central table of a Star Schema with numeric performance measurements, identified by a composite key, each of whose elements is a foreign key drawn from a dimension table”.
Three grains define fact tables: Transaction-level, Accumulating Snapshot, and the Periodic Snapshot grain. Refer to my post referenced above (”Fact Tables”) for more information on granularity.
Subsystem 13 is responsible for the construction of all types of fact grains.
Transaction grain fact tables are always the largest and should contain data at the finest grain possible. This would include a line on an invoice, a patient diagnosis for a series of emergency room events, or even a Josh Beckett fastball. Data is almost always inserted (via bulk load) into transaction-grain facts. Deleting rows is dangerous (more on this in a second), and updates should only occur in order to correct an error or update some late-arriving key. In some cases, I have even gone back into a transaction grain fact table in order to update a calculated field.
Instead of deleting data in a transaction-grain fact table, “negate” the row instead. Otherwise, you run the risk that some of your historical reports and analysis will be inconsistent. To negate a row, simply add a new row to the fact, with identical foreign keys to the dimensions as the row you are negating, but update the metric so that when both rows are summed (or counted, etc…) they result in zero activity. For some metrics, this is tricky, but if you are using mostly additive and semi-additive metrics in your facts, you should be OK with this approach.
For the periodic snapshot grain, there are some differences in the loading process to consider. First, as these facts deal with aggregate data (based on daily, weekly, monthly, quarterly, etc…) time spans, it is appropriate to only load them when a period is complete. Secondly, extremely careful consideration for the grain of the periodic snapshot must be honored because it is too easy to apply an aggregate calculation inappropriately against the stated grain. Third, and as a compliment to my first point, it may be necessary to aggregate some running totals. If for example you are building a monthly periodic snapshot for a checking account at the bank, you would normally also provide data from the end of the last period to date.
Accumulating snapshots represent some process as it evolves over time. Therefore, rows in the fact are constantly updated to represent this evolution. For example, a patient may enter the ER, check in at triage, get assigned a room, see the nurse, see the doctor, etc. A building inspector may visit a new construction before work is done, when the foundation is set, when the framing is up, when the electrician as finished, etc. Loading this type of snapshot is much different from the other two, in that it tracks some defined process over time in a single row. There are many updates, and additional logic is needed to handle the loads.
SQL Server 2005 Integration Services (SSIS)
I find working with the various types of fact tables in SSIS to be rather simple. The key for success is usually in how the Control Flow is sequenced (facts generally come last), when dimensions are loaded, and where you do your dimension key lookups. In essence, all you need is a source containing your metrics (line items on an invoice pulled during the extract phase), a series of lookups (or joins, whichever provides the best performance in your situation), and some way to load the fact (bulk insert if possible).
Some advice: Do your UPDATEs in an Execute SQL Task. Do not try to use the OLD DB Command. Especially — and I mean especially - if you are doing a large amount of updates and your fact table is also large. To get this to work, offload the data you need for your updates into a separate staging table. Use the Execute SQL Task and write an appropriate UPDATE statement which joins your fact to your new staging table. The same approach holds true for any rows you might be deleting.
Hand Coding with Visual FoxPro (VFP9)
I find working with fact tables in FoxPro to be just as easy as in SSIS. There is no clear advantage of one over the other. One major benefit with VFP is that it is an excellent OOP language with very fast cursors (a VFP Cursor is much easier to work with than an SSIS Raw file). This gives you many more options on how you look up and populate foreign keys for the fact at load time.
If you are using VFP tables for your RDBMS, you will need to be careful for the 2 gig size limitation for your fact tables. Your Fact Table Builder may have to handle partitions manually. This is a major disadvantage with using the VFP database. Especially because 2 gigs is not a lot of space for daily transactions!
From here
In the next article, I’ll discuss ETL Subsystem 14: Surrogate Key Manager. Subsystem 14 helps to maintain the integrity among facts and dimensions for each business process, and is a complement to your Fact Table Builders.