This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
I spent the last two posts describing aggregate facts and dimensions and explaining the differences between aggregates and summaries. I wanted to do that in a separate thread so that I could focus on Kimball’s “Aggregate Builder” without spending too much time introducing aggregates. So, if you haven’t already, please have a look at those two previous posts.
The key to subsystem 19 is that you budget for and build a system to maintain your aggregates. This can be easily overlooked. I repeat: This can be easily overlooked. Much like what happens with the very important and easily cast aside data profiling subsystem. Don’t let it happen to you!
Building Aggregates
Aggregates can be a challenge to build — especially if your relational or OLAP engines don’t do it for you. Actually, if your OLAP tool doesn’t handle building aggregates, and you are relying on it for your aggregates, then you should think about getting another OLAP tool! I’ll talk more about OLAP in my next Subsystem post.
When designing aggregates to build, consider two things: user access patterns and distribution of data along the hierarchies. You cannot hope to construct aggregates for every possible level of every hierarchy across all metrics in a fact. Instead, you need to assess how your users query the data and then, taking it a bit further, decide if creating an aggregate will make any real difference for performance based on attribute distribution. The main goal of creating aggregates in a data warehouse is to speed query performance mainly by limiting the amount of data required to obtain results. But building them can be expensive.
Generally speaking, every time you update, insert, or delete data into any dimension or fact, you will need to rebuild all aggregates that are tied to those attributes and metrics. If you have two or three dozen aggregates to build every time you load a few business processes dimensional models, then you can imagine the time and processor loads required. In many instances, there is a balancing act between getting the data into the warehouse and getting the data out.
Technically speaking, building aggregates is a simple matter in SQL. Often it consists merely of a GROUPed query, some aggregate or custom functions, and surrogate key management.
Here’s a tip: Embed surrogate keys in your dimensions for each hierarchy used in aggregates. This practice will allow you to avoid the costly process of generating new keys for your hierarchy at the expense of a few administrative attributes stored in each dimension. For example, if you are building a set of region aggregates, you might embed region_key, state_key, city_key along with region, state, and city in your geography dimension. The keys will be repeated in the dimension until they are rolled up, at which time they become unique. The new aggregate fact table will point to these keys, eliminating the need for a specialized surrogate key manager.
Navigating Aggregates
Depending on your technology (here, I talk about SQL Server 2005 via SSIS and Visual FoxPro only — see below), navigating aggregates may or may not be a simple matter. If you put all your aggregates into OLAP engines, then chances are good that your OLAP technology will handle aggregates for you. To do it yourself, say, if you were hand-coding this part of the data warehouse using Visual FoxPro, you have a few design questions to answer.
In an old, but still relevant article published by DBMS, Ralph Kimball defines four key design requirements when dealing with aggregates. They are:
- Aggregates must be stored in their own fact tables, separate from the base-level data. In addition, each distinct aggregation level must occupy its own unique fact table.
- The dimension tables attached to the aggregate fact tables must, wherever possible, be shrunken versions of the dimension tables associated with the base fact table.
- The base Fact table and all of its related aggregate Fact tables must be associated together as a “family of schemas” so that the aggregate navigator knows which tables are related to one another.
- Force all SQL created by any end user or application to refer exclusively to the base fact table and its associated full-size dimension tables.
Which leads to the simple process of navigating the schemas when a user queries the dimensional data. The process can be distilled into the following three processing steps proposed by Ralph Kimball in his article:
- Find the smallest fact table in the “family of schemas” referenced by a user’s query.
- test the query against the schema to be sure that all attributes and metrics are present. If so, proceed. If not, go back to step 1
- Rewrite the user’s query to use the smaller aggregates where possible.
If you’re faced with the challenge of developing your own navigator, this article is a must read!
SQL Server 2005 Integration Services (SSIS)
Build aggregates after loading the atomic fact data for each business process. This is usually done in a separate package that can be called from the master package. It may be appropriate to schedule these aggregate-building packages periodically throughout the week as opposed to after each load, but business requirements will dictate this approach.
The aggregate package will contain three important types of data flows.
- Dimension shrinking
- Surrogate key management
- Aggregate fact loading
Within the dimension shrinking data flows, start collapsing the relevant dimensions to the grain of the aggregate fact table. At this time, you would also assign surrogate keys (or, if you use my tip described above, just include the appropriate key and designate it as the primary key).
If you’re particularly crafty, you could design this in a way that would allow you to collapse dimensions from the previously shrunken dimension — saving significant time and resources. For example, collapse the geography dimension first into geography_region, then from geography_region create geography_state, and then from geography_state create geography_city.
Surrogate key management (where you look up the keys in the dimensions to use as foreign keys in the fact) and fact loading (actually inserting into the fact table) are very similar to the normal ETL processes. The only difference is that in order to look up your keys, you need to generate a special dataset from the base atomic data at the proper grain using a GROUP BY which contains all natural keys of the newly shrunken dimensions.
Hand Coding with Visual FoxPro (VFP9)
I find that hand-coding works well with aggregates. You can hand-code dimension shrinking, surrogate key management, and aggregate fact loading quite easily. All that we’re dealing with is some basic workflow and simple SQL. The challenge comes in scheduling and paralleling. Certainly, we want aggregates to run after the atomic data is ready. How you do that in FoxPro is entirely up to you. I prefer to chain events, so that when one finishes, the other kicks off. Scheduling requires additional reliance on the OS or the use of a timer (a bad idea).
From Here
There is so much more to say about this topic. I would like to follow this up with some examples in both SSIS and FoxPro. Stay tuned… In the meantime, I’ll move on to talk more about OLAP.
Aggregate fact tables are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc..) or from other specialized functions that output totals derived from a grouping of the base data. These new metrics, called “aggregate facts” or “summary statistics” are stored and maintained in the data warehouse database in special fact tables at the grain of the aggregation. Likewise, the corresponding dimensions are rolled up and condensed to match the new grain of the fact.