Posts Tagged Star Schema

ETL Subsystem 13: Fact Table Builders

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Fact tables are the heart and soul of the business process dimensional model. I discussed fact tables already in a previous post (“Fact Tables“), so I won’t repeat myself here. Recently, I also provided a more formal definition of a fact table as the: “central table of a Star Schema with numeric performance measurements, identified by a composite key, each of whose elements is a foreign key drawn from a dimension table”.

Three grains define fact tables: Transaction-level, Accumulating Snapshot, and the Periodic Snapshot grain. Refer to my post referenced above (“Fact Tables”) for more information on granularity.

Subsystem 13 is responsible for the construction of all types of fact grains.

Transaction grain fact tables are always the largest and should contain data at the finest grain possible. This would include a line on an invoice, a patient diagnosis for a series of emergency room events, or even a Josh Beckett fastball. Data is almost always inserted (via bulk load) into transaction-grain facts. Deleting rows is dangerous (more on this in a second), and updates should only occur in order to correct an error or update some late-arriving key. In some cases, I have even gone back into a transaction grain fact table in order to update a calculated field.

Instead of deleting data in a transaction-grain fact table, “negate” the row instead. Otherwise, you run the risk that some of your historical reports and analysis will be inconsistent. To negate a row, simply add a new row to the fact, with identical foreign keys to the dimensions as the row you are negating, but update the metric so that when both rows are summed (or counted, etc…) they result in zero activity. For some metrics, this is tricky, but if you are using mostly additive and semi-additive metrics in your facts, you should be OK with this approach.

For the periodic snapshot grain, there are some differences in the loading process to consider. First, as these facts deal with aggregate data (based on daily, weekly, monthly, quarterly, etc…) time spans, it is appropriate to only load them when a period is complete. Secondly, extremely careful consideration for the grain of the periodic snapshot must be honored because it is too easy to apply an aggregate calculation inappropriately against the stated grain. Third, and as a compliment to my first point, it may be necessary to aggregate some running totals. If for example you are building a monthly periodic snapshot for a checking account at the bank, you would normally also provide data from the end of the last period to date.

Accumulating snapshots represent some process as it evolves over time. Therefore, rows in the fact are constantly updated to represent this evolution. For example, a patient may enter the ER, check in at triage, get assigned a room, see the nurse, see the doctor, etc. A building inspector may visit a new construction before work is done, when the foundation is set, when the framing is up, when the electrician as finished, etc. Loading this type of snapshot is much different from the other two, in that it tracks some defined process over time in a single row. There are many updates, and additional logic is needed to handle the loads.

SQL Server 2005 Integration Services (SSIS)

I find working with the various types of fact tables in SSIS to be rather simple. The key for success is usually in how the Control Flow is sequenced (facts generally come last), when dimensions are loaded, and where you do your dimension key lookups. In essence, all you need is a source containing your metrics (line items on an invoice pulled during the extract phase), a series of lookups (or joins, whichever provides the best performance in your situation), and some way to load the fact (bulk insert if possible).

Some advice: Do your UPDATEs in an Execute SQL Task. Do not try to use the OLD DB Command. Especially — and I mean especially – if you are doing a large amount of updates and your fact table is also large. To get this to work, offload the data you need for your updates into a separate staging table. Use the Execute SQL Task and write an appropriate UPDATE statement which joins your fact to your new staging table. The same approach holds true for any rows you might be deleting.

Hand Coding with Visual FoxPro (VFP9)

I find working with fact tables in FoxPro to be just as easy as in SSIS. There is no clear advantage of one over the other. One major benefit with VFP is that it is an excellent OOP language with very fast cursors (a VFP Cursor is much easier to work with than an SSIS Raw file). This gives you many more options on how you look up and populate foreign keys for the fact at load time.

If you are using VFP tables for your RDBMS, you will need to be careful for the 2 gig size limitation for your fact tables. Your Fact Table Builder may have to handle partitions manually. This is a major disadvantage with using the VFP database. Especially because 2 gigs is not a lot of space for daily transactions!

From here

In the next article, I’ll discuss ETL Subsystem 14: Surrogate Key Manager. Subsystem 14 helps to maintain the integrity among facts and dimensions for each business process, and is a complement to your Fact Table Builders.

Tags: , , , , , ,

1 Comment

Dimension Tables

In my post, Fact Tables, I discussed three types of fact tables as defined by their grain: transactional, accumulating, and periodic. In this post, I’ll go over the basic types of dimension tables typically found in a dimensional model.

What are Dimensions?

A Dimension table is a denormalized entity in a dimensional model that contains detailed information about a fact stored in a fact table. Dimensions are independent of each other, joined through the fact table or in the case of an outrigger, to other specialized dimensions that are not connected to facts. This type of design facilitates simple many-to-many analysis where an operator can pick attributes from a broad set of dimensions and “slice-and-dice” through them via the fact table.

Every dimension contains a single-part surrogate primary key field. This primary key is used to relate to one or more fact tables in the schema. Natural keys (like an employee id) are kept in the dimension, but are not used in joins. This technique ensures a unique value for each record, simplifies joins, and keeps indexes small.

A zero-key row represents the “null” or “unknown” condition in a dimension. The fact table, if the data dimension is unknown, will use a foreign key of zero, and still link to the dimension. This facilitates “did not happen” or “does not exist” queries and related analysis on missing or incomplete data.

Dimensions can take on different forms in different contexts. The grain of the fact table dictates if a particular dimension can be a part of the star schema involving the fact. A Star schema represents a single business process and its facts are stored at a particular grain. Multiple Star schemas can exist in a single database, where dimensions are shared among business processes. A Product dimension might be used in both the “procurement” and “order” Stars for example — provided the facts are at compatible grain. A time dimension, which I discuss below, will most likely be shared by every Star.

Types of Dimensions

The following list is not exhaustive, but should give you an idea of the types dimensions that can be found in a typical dimensional model:

Primary Dimension: Primary dimensions represent the major elements of the business process. For example, in an order system, you will find Order, Salesperson, Product, and Shipper dimensions (among others). Primary dimensions are familiar to business users because they represent some tangible or recognizable aspect of the business (typically the “nouns”). Most ad-hoc queries and reports will use attributes from Primary Dimensions as predicates and in Select and Order By clauses. It follows that significant time will be devoted to end-user training on the structure and content of these Primary Dimensions.

Time Dimension: Date and time dimension are crucial for the data warehouse and to derive Business Intelligence. Most business processes are tracked and analyzed over time; these points in time are stored in the fact table along with the appropriate metrics. Instead of embedding a date in the fact table, dimensional modelers create a date dimension that stores information about each day of the year, for how ever many years is relevant to the business.

Time can be stored in a separate table as well, that is, if it makes business sense. Typically, though I see time stored as a degenerate dimension in the fact table as in integer representing seconds since midnight.

I’ll spend more time on time dimensions in a future post. The uses of this type of table extend beyond the Dimensional Model!

Junk, or “Mystery” Dimension: A Junk Dimension is a dimension that stores extra “junk” data about the fact. A junk dimension can take the place of degenerate dimensions, as well as store other semi-useful information about a particular fact. Transaction codes, operation stamps, yes/no flags, and free text attributes are good candidates for junk dimensions.

Causal Dimensions: These dimensions allow you to store, and therefore analyze external events which can measure causality of a business process. Causal data gives an extra dimension (no pun intended) to your existing data. These dimensions can help explain why a fact exits in the first place! For example, storing weather conditions might be beneficial to a retailer so that last month’s increase in uumbrella sales can be explained (it rained more!). Other causal dimensions would include things like rate specials, store promotions, a newspaper ad, or some major world event.

Outriggers: A type of dimension, outriggers are not directly related to a single fact but rather to a dimension. An example of an outrigger might be postal code, where postal and delivery point information is retrieved from the USPS in order to link and verify address data stored in an Employee, Customer, or perhaps Shipper dimension.

For more information about Dimensions, feel free to download my code examples and presentation slides from FoxForward. Also, bookmark this page as I plan to expand upon these ideas in future posts. You can also read my articles on the subject at Advisor.com.

Tags: , , , ,

5 Comments

Fact Tables

Fact tables represent business measurements in a dimensional model (more info: 1, 2, 3). They form the center of the star schema and act as a hub among dimensions for “slice-and-dice” operations to create interesting many-to-many relationships. Measurement data in a fact table is most useful if it is additive (prices and number of units for example); however, non-additive data (such as a percentage) as well as partly-additive data (a metric that is only additive in certain contexts) can also be useful under the right circumstances.

Granularity is the defined level of detail of a measurement. The more granular, the greater the detail. The granularity of the fact table will help determine its type and application. More detail allows you to aggregate data in different ways. If there is not enough detail, then drilling down into the data won’t be possible. Here are the three basic types of fact tables as defined by their grain:

Transaction-level grain

These fact tables are common and represent a business function at the transaction level, often representing the lowest-possible level of data available. An example would be a line item on an invoice or sales receipt; a visit, or “hit”, to a page on a website; or perhaps a single pitch in a baseball game.

Accumulating Snapshot grain

This type of fact is not as common as the transaction-level fact, but very important, especially if you are interested in tracking the lifetime of a process. For example, you could use an accumulating snapshot to store a product’s movement through the supply chain. Each row in the fact table would contain date foreign keys that represent milestones in the process (product ordered, removed from inventory, shipped, received, etc.). This fact would compliment your transaction-level fact table, giving you the ability to view a process’s performance.

Periodic Snapshot grain

The periodic snapshot, like the type implies, is a view of a business process at the end of a specific interval. A summary of measurement data is taken at consistent intervals, such as by hour, day, week, or month. These measurements can then be used for trend analysis and to test business performance over time, without the overhead of aggregating transaction-level details.

Each of these fact table types have different applications in the warehouse. Transaction grain is most intuitive to business users and queries against these tables tend to be simple. If the level of detail in the transaction fact is fine enough, then a periodic snapshot will simply be an aggregate of this activity. If the grain is not fine enough, then the periodic snapshot will give additional insight into the operations of the business by providing a better and complimentary view of business data. Accumulating snapshots work exceedingly well when tracking a process over time. Product orders are a great example, as they move from point-to-point in the supply chain. Any business process that expands beyond a single moment in time is a candidate for an accumulating snapshot.

Next, I will discuss different types of dimensions, followed by examples using facts and dimensions in a data warehouse.

Tags: , , , ,

8 Comments