I find the need to put down some of my thoughts regarding aggregates and summaries, especially on how they apply to SQL and DW/BI. To me, aggregates and summaries are not the same, but I have come to terms with the fact that most users, developers, modelers, and architects use the terms interchangeably. There are two pairs of terms this applies to (one pair is the noun form, while the other is the verb form):
- n. Aggregate / Summary
- v. Aggregating / Summarizing
Aggregate (plural aggregates) and summary (plural summaries) datum are created from functions placed upon underlying atomic data. The primary differences between the two lies in two areas: (a) what type of function is applied and (b) how the atomic data is organized as inputs to the function.
General rule: Any function that requires a grouping (typically along a hierarchy) is an aggregate, while any function applied to the entire population of data is a summary. Summaries can be derived from aggregates, but I am hard pressed to come up with an example where the reverse is true.
Functions such as SUM, AVG, MIN, and MAX can be used to derive both aggregates and summaries (by aggregating or summarizing), as you can apply them to an entire dataset or to groups within the dataset. In SQL however, you are forced to define a grouping when you include one or more attributes with the summary statistic. In this case, you have created an aggregate:
SELECT region, SUM(sale_amount) FROM customer c JOIN sales s ON c.cid = s.cid GROUP BY region
A summary:
SELECT SUM(sale_amount) FROM sales
Note that you can create summaries from aggregates. Obviously, SELECT sum(sale_amount) FROM sales doesn’t really tell you much about your data. To aggregate by time or product line is much more useful. Consider the following, very useful aggregate and summary table:
-- get sales by region and month for 2007 SELECT region, sale_month, SUM(sale_amount) AS amount INTO #tmpAggregate FROM customer c JOIN sales s ON c.cid = s.cid WHERE YEAR = '2007' and region = 'NE' GROUP BY region, sale_month -- summarize the above by giving you total sales -- in the North East for 2007. Note that this type of -- summary is usually done in a reporting tool SELECT SUM(amount) INTO #tmpSumary FROM #tmpAggregate
Frequencies and other categorical functions are only valuable when applied against groups. Imagine a histogram without categories! Likewise, nearly all of the aggregate functions available in the majority of programming languages are most useful when applied against groups. This includes functions such as standard deviations, variances, means, and so on.
When an elementary arithmetic function (such as division) is performed on groups of data, the resulting calculation can be called a summary. If the function is applied within the row, it is called a calculation. To restate: arithmetic functions that span groups are a special type of summary. An example of this might be to find the ratio of one group as compared to another (a GROUP BY clause applied to two different study groups in the data).
While it might be a good idea to put together a table to illustrate my point about how functions and inputs affect the terminology, I’ll leave that exercise for another day. That type of project will take a little more thought!
Another difference between an aggregate and a summary is how the two are presented. Aggregates are rarely presented with the atomic data — not to suggest that the underlying data should not be available. Summaries most always come before or after the underlying data is presented to the user.
Am I off base? Do I have it all wrong?
#1 by Arielsigma at August 14th, 2008
| Quote
great, usefull 0_0