Posts Tagged Aggregates

ETL Subsystem 19: Aggregate Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

I spent the last two posts describing aggregate facts and dimensions and explaining the differences between aggregates and summaries. I wanted to do that in a separate thread so that I could focus on Kimball’s “Aggregate Builder” without spending too much time introducing aggregates. So, if you haven’t already, please have a look at those two previous posts.

The key to subsystem 19 is that you budget for and build a system to maintain your aggregates. This can be easily overlooked. I repeat: This can be easily overlooked. Much like what happens with the very important and easily cast aside data profiling subsystem. Don’t let it happen to you!

Building Aggregates

Aggregates can be a challenge to build — especially if your relational or OLAP engines don’t do it for you. Actually, if your OLAP tool doesn’t handle building aggregates, and you are relying on it for your aggregates, then you should think about getting another OLAP tool! I’ll talk more about OLAP in my next Subsystem post.

When designing aggregates to build, consider two things: user access patterns and distribution of data along the hierarchies. You cannot hope to construct aggregates for every possible level of every hierarchy across all metrics in a fact. Instead, you need to assess how your users query the data and then, taking it a bit further, decide if creating an aggregate will make any real difference for performance based on attribute distribution. The main goal of creating aggregates in a data warehouse is to speed query performance mainly by limiting the amount of data required to obtain results. But building them can be expensive.

Generally speaking, every time you update, insert, or delete data into any dimension or fact, you will need to rebuild all aggregates that are tied to those attributes and metrics. If you have two or three dozen aggregates to build every time you load a few business processes dimensional models, then you can imagine the time and processor loads required. In many instances, there is a balancing act between getting the data into the warehouse and getting the data out.

Technically speaking, building aggregates is a simple matter in SQL. Often it consists merely of a GROUPed query, some aggregate or custom functions, and surrogate key management.

Here’s a tip: Embed surrogate keys in your dimensions for each hierarchy used in aggregates. This practice will allow you to avoid the costly process of generating new keys for your hierarchy at the expense of a few administrative attributes stored in each dimension. For example, if you are building a set of region aggregates, you might embed region_key, state_key, city_key along with region, state, and city in your geography dimension. The keys will be repeated in the dimension until they are rolled up, at which time they become unique. The new aggregate fact table will point to these keys, eliminating the need for a specialized surrogate key manager.

Navigating Aggregates

Depending on your technology (here, I talk about SQL Server 2005 via SSIS and Visual FoxPro only — see below), navigating aggregates may or may not be a simple matter. If you put all your aggregates into OLAP engines, then chances are good that your OLAP technology will handle aggregates for you. To do it yourself, say, if you were hand-coding this part of the data warehouse using Visual FoxPro, you have a few design questions to answer.

In an old, but still relevant article published by DBMS, Ralph Kimball defines four key design requirements when dealing with aggregates. They are:

  1. Aggregates must be stored in their own fact tables, separate from the base-level data. In addition, each distinct aggregation level must occupy its own unique fact table.
  2. The dimension tables attached to the aggregate fact tables must, wherever possible, be shrunken versions of the dimension tables associated with the base fact table.
  3. The base Fact table and all of its related aggregate Fact tables must be associated together as a “family of schemas” so that the aggregate navigator knows which tables are related to one another.
  4. Force all SQL created by any end user or application to refer exclusively to the base fact table and its associated full-size dimension tables.

Which leads to the simple process of navigating the schemas when a user queries the dimensional data. The process can be distilled into the following three processing steps proposed by Ralph Kimball in his article:

  1. Find the smallest fact table in the “family of schemas” referenced by a user’s query.
  2. test the query against the schema to be sure that all attributes and metrics are present. If so, proceed. If not, go back to step 1
  3. Rewrite the user’s query to use the smaller aggregates where possible.

If you’re faced with the challenge of developing your own navigator, this article is a must read!

SQL Server 2005 Integration Services (SSIS)

Build aggregates after loading the atomic fact data for each business process. This is usually done in a separate package that can be called from the master package. It may be appropriate to schedule these aggregate-building packages periodically throughout the week as opposed to after each load, but business requirements will dictate this approach.

The aggregate package will contain three important types of data flows.

  • Dimension shrinking
  • Surrogate key management
  • Aggregate fact loading

Within the dimension shrinking data flows, start collapsing the relevant dimensions to the grain of the aggregate fact table. At this time, you would also assign surrogate keys (or, if you use my tip described above, just include the appropriate key and designate it as the primary key).

If you’re particularly crafty, you could design this in a way that would allow you to collapse dimensions from the previously shrunken dimension — saving significant time and resources. For example, collapse the geography dimension first into geography_region, then from geography_region create geography_state, and then from geography_state create geography_city.

Surrogate key management (where you look up the keys in the dimensions to use as foreign keys in the fact) and fact loading (actually inserting into the fact table) are very similar to the normal ETL processes. The only difference is that in order to look up your keys, you need to generate a special dataset from the base atomic data at the proper grain using a GROUP BY which contains all natural keys of the newly shrunken dimensions.

Hand Coding with Visual FoxPro (VFP9)

I find that hand-coding works well with aggregates. You can hand-code dimension shrinking, surrogate key management, and aggregate fact loading quite easily. All that we’re dealing with is some basic workflow and simple SQL. The challenge comes in scheduling and paralleling. Certainly, we want aggregates to run after the atomic data is ready. How you do that in FoxPro is entirely up to you. I prefer to chain events, so that when one finishes, the other kicks off. Scheduling requires additional reliance on the OS or the use of a timer (a bad idea).

From Here

There is so much more to say about this topic. I would like to follow this up with some examples in both SSIS and FoxPro. Stay tuned… In the meantime, I’ll move on to talk more about OLAP.

Tags: , , , , , ,

No Comments

Differences Between Aggregates and Summaries

I find the need to put down some of my thoughts regarding aggregates and summaries, especially on how they apply to SQL and DW/BI. To me, aggregates and summaries are not the same, but I have come to terms with the fact that most users, developers, modelers, and architects use the terms interchangeably. There are two pairs of terms this applies to (one pair is the noun form, while the other is the verb form):

  • n. Aggregate / Summary
  • v. Aggregating / Summarizing

Aggregate (plural aggregates) and summary (plural summaries) datum are created from functions placed upon underlying atomic data. The primary differences between the two lies in two areas: (a) what type of function is applied and (b) how the atomic data is organized as inputs to the function.

General rule: Any function that requires a grouping (typically along a hierarchy) is an aggregate, while any function applied to the entire population of data is a summary. Summaries can be derived from aggregates, but I am hard pressed to come up with an example where the reverse is true.

Functions such as SUM, AVG, MIN, and MAX can be used to derive both aggregates and summaries (by aggregating or summarizing), as you can apply them to an entire dataset or to groups within the dataset. In SQL however, you are forced to define a grouping when you include one or more attributes with the summary statistic. In this case, you have created an aggregate:

SELECT region, SUM(sale_amount) 
    FROM customer c 
    JOIN sales s ON c.cid = s.cid 
    GROUP BY region

A summary:

SELECT SUM(sale_amount) FROM sales

Note that you can create summaries from aggregates. Obviously, SELECT sum(sale_amount) FROM sales doesn’t really tell you much about your data. To aggregate by time or product line is much more useful. Consider the following, very useful aggregate and summary table:

-- get sales by region and month for 2007
SELECT region, sale_month, SUM(sale_amount) AS amount 
    INTO #tmpAggregate
    FROM customer c 
    JOIN sales s ON c.cid = s.cid
    WHERE YEAR =  '2007' and region = 'NE'
    GROUP BY region, sale_month
 
-- summarize the above by giving you total sales
-- in the North East for 2007. Note that this type of
-- summary is usually done in a reporting tool
SELECT SUM(amount) 
    INTO #tmpSumary
    FROM #tmpAggregate

Frequencies and other categorical functions are only valuable when applied against groups. Imagine a histogram without categories! Likewise, nearly all of the aggregate functions available in the majority of programming languages are most useful when applied against groups. This includes functions such as standard deviations, variances, means, and so on.

When an elementary arithmetic function (such as division) is performed on groups of data, the resulting calculation can be called a summary. If the function is applied within the row, it is called a calculation. To restate: arithmetic functions that span groups are a special type of summary. An example of this might be to find the ratio of one group as compared to another (a GROUP BY clause applied to two different study groups in the data).

While it might be a good idea to put together a table to illustrate my point about how functions and inputs affect the terminology, I’ll leave that exercise for another day. That type of project will take a little more thought!

Another difference between an aggregate and a summary is how the two are presented. Aggregates are rarely presented with the atomic data — not to suggest that the underlying data should not be available. Summaries most always come before or after the underlying data is presented to the user.

Am I off base? Do I have it all wrong?

Tags: ,

3 Comments

Aggregate Facts and Dimensions

Aggregate — picture from http://www.stonetohome.comAggregate fact tables are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc..) or from other specialized functions that output totals derived from a grouping of the base data. These new metrics, called “aggregate facts” or “summary statistics” are stored and maintained in the data warehouse database in special fact tables at the grain of the aggregation. Likewise, the corresponding dimensions are rolled up and condensed to match the new grain of the fact.

These specialized tables are used as substitutes whenever possible for returning user queries. The reason? Speed. Querying a tidy aggregate table is much faster and uses much less disk I/O than the base, atomic fact table, especially if the dimensions are large as well. If you want to wow your users, start adding aggregates. You can even use this “trick” in your operational systems to serve as a foundation for operational reports. I’ve always done this for any report referred to by my users as a “Summary”. (As an aside, there is a difference between an “aggregate” and a “summary”. I’ll explore these differences in my next post.)

For example, take the “Orders” business process from an online catalog company where you might have customer orders in a fact table called FactOrders with dimensions Customer, Product, and OrderDate. With possibly millions of orders in the transaction fact, it makes sense to start thinking about aggregates.

To further the above example, assume that the business is interested in a report: “Monthly orders by state and product type”. While you could generate this easily enough using the FactOrders fact table, you could likely speed up the data retrieval for the report by at least half (but likely much, much more) using an aggregate.

Here, using the atomic transaction FactOrders table:

SELECT c.STATE, p.product_type, t.YEAR, t.MONTH, SUM(f.order_amount)
    FROM FactOrders f
    JOIN DimCustomer c ON c.CustomerID = f.CustomerID
    JOIN DimProduct p ON p.ProductID = f.ProductID
    JOIN DimTime t ON 
        t.YEAR = DATEPART(yy, f.YearMonthID) AND 
        t.MONTH = DATEPART(mm, f.DateID)
    GROUP BY c.STATE, p.product_type, t.YEAR, t.MONTH

The aggregate is querying much less data and queries against time are now much simpler. In my non-scientific tests, the following query ran many times faster (a few seconds compared to about 30 seconds!).

SELECT c.STATE, p.product_type, t.YEAR, t.MONTH, SUM(f.order_amount)
    FROM FactOrders_Agg1 f
    JOIN DimCustomerState c ON c.CustomerStateID = f.CustomerStateID
    JOIN DimProductType p ON p.ProductTypeID = f.ProductTypeID
    JOIN DimMonth t ON t.YearMonthID = f.YearMonthID
    GROUP BY c.STATE, p.product_type, t.YEAR, t.MONTH

Creating the Fact and Dimensions

To implement, you will need to roll up your fact table by the hierarchies found in your dimensions. The result will be a new fact table, a set of new accompanying dimensions at the grain of the fact, and all new foreign keys for mapping. I usually name the fact table the same as the base fact with some meaningful suffix appended to the end. In SSMS, this keeps the aggregates with the fact in my object explorer. Dimensions usually get new names (like CustomerState and ProductType) and should be conformed so that they can be reused across business processes. You could even create views instead of new dimensions, but this does not eliminate the need to regenerate new surrogate keys.

When rolling up dimensions, you are provided with an excellent opportunity to perform aggregate functions on the dimension itself and store the results as new attributes. For example, you may want to know how many customers are living in each state. This could be used as the denominator in some population calculation you plan to use against the aggregate fact. Your new dimension might therefore look like the following:

SELECT Cust_Country, Cust_Region, Cust_State, COUNT(Cust_ID) 
FROM DimCustomer 
GROUP BY Cust_Country, Cust_Region, Cust_State

The most obvious aggregate function to use is COUNT, but depending on the type of data you have in your dimensions, other functions may prove useful. Just be warned: If you find that aggregate functions are being used a lot on your dimensions, you may need to revisit your design. There may be opportunities to pull out those metrics into existing or new fact tables!

Generating aggregates is largely an incremental process, where you examine query and reporting usage looking for places to improve performance. Aggregates stored in the RDBMS are maintained through ETL and/or your OLAP engine.

A Note About OLAP Aggregates

Theoretically, storing aggregates in a fact table in a RDBMS is the same as storing them in an OLAP cube. In OLAP storage, aggregates are precalculated summaries of data from the different dimensions of the cube, such that a query that seeks to know the aggregate (sum) of some metric (order amount) for X (customer state) and Y (product type) over T (monthly orders) would only need to look inside the cube at those exact coordinates to get the answer. I won’t pretend to know how this stuff is physically stored, but OLAP engines across the board offer better performance, management, and navigation mechanisms for the aggregations than is available through the RDBMS (even when using Indexed or Materialized Views).

Next post, I’ll write some thoughts on the differences between “Summaries” and “Aggregates”!

Tags: , , , , , , ,

3 Comments