Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Business Intelligence’


Published August 26th, 2008

Analytical Databases

ying yang of relational verses analytical databasesMost of what I discuss on Tod means Fox (and especially in my 34 Subsystem series) has to do with relational database engines (SQL Server 2005 and Visual FoxPro in particular). For most data integration projects, desktop and web solutions, and data warehouses, the relational database is all you’ll need. It can be used to create and manage both relational and dimensional models. But in Business Intelligence applications, chances are that you’ll need an additional, more specialized form of multidimensional data storage and retrieval. Analytical databases cover this need. In fact, Dr. Codd, the inventor of the relational database, often explained how analytical databases are a “necessary companion” to relational databases.

The OLAP Story

OLAP stands for Online Analytical Processing. It is a terrible name to describe multidimensional analytical databases. Instead of a more meaningful name, like FASMI (Fast Analysis of Shared Multidimensional Information) proposed by Nigel Pendse of “OLAP Report” we seem to be stuck with OLAP. With great pain, I will continue to discuss analytical databases using the OLAP terminology.

Basically, an OLAP (*sigh*) database has certain characteristics that set it apart from relational databases. And, coincidentally, Pendse included all characteristics in the FASMI acronym. Check out the OLAP Report for more details. If you’re just getting into business analytics, or are in the process of evaluating analytical database software – you must visit this site!

The Smartness Factor

I like to think of the relational database as “dumb” and the analytic database as “smart”. Let me explain:

The primary focus of Analytic servers is to get the (often pre-calculated) data out of the database as quickly as possible, allowing the user to zoom in and out along different hierarchies. Contrast this with the Codd’s relational model which seeks to eliminate data anomalies at transaction time through normalization: data retrieval is slow and often complex. With this in mind, you can get a feel for how a database engine can store, catalog, and retrieve data differently. (In fact, this is what makes dimensional modeling in general so favorable for querying and analytics – it is not bound by the restrictiveness imposed by normalization. I’d also like to note that as with dimensional models, OLAP databases are multidimensional.)

While the relational database plays a very important role, so does the multidimensional (OLAP) database. It’s simply built differently to service a different need.

Primarily, analytic servers can manage aggregates, aggregate navigation, and complex calculations and summaries across hierarchies at blazing speeds. These skills were borrowed from both navigational database designs (think of the DOM or a Wiki) and hierarchal designs (trees with leaves and branches). A lot of this has to do with how the data is stored.

Storage

Analytics servers offer a different type of storage. MOLAP, or Multidimensional OLAP, can be much more efficient than relational engines. Some tools (like SQL Server Analysis Services) allow you to store your analytical databases in a relational way (ROLAP) or using a hybrid approach (HOLAP). Personally, I see no benefit at all with ROLAP, aside from real-time systems where using the relational database is a must because of the latency involved with updating the OLAP database. (Not to digress too much here, but even real-time environments can use MOLAP storage by using partitions and caches correctly.)

MOLAP is more natural and faster. This may differ depending on your tool choice, but I beg someone to tell me differently. For a detailed discussion on how MOLAP and ROLAP engines store data, you can try this resource “Analysis Services 2005 Performance Guide” (which defines the way Analysis Services does it).

HOLAP storage is an interesting option and could actually perform well –even better than MOLAP in some instances — with less disk usage. But a lot of this depends on how many aggregates are defined and how often the system needs to query the relational database. It might be a good consideration, but if you’re not sure and you have plenty of disk space (disk space is cheap), then go with MOLAP.

The smartness factor2 (the MDX language)

I like to think of SQL as “dumb” and MDX as “smart”. Let me explain:

The MultiDimensional eXpression (MDX) language was created about 10 years ago by Microsoft to work specifically with multidimensional data stored in analytical servers. This OLAP-specific language improves upon the SQL syntax, removing much of the bulkiness associated with the language. MDX is an elegant and highly relevant partner to the analytical database. While you could get out what you need using SQL, most every SQL statement will be a challenging one. It’s like cutting down a tree with a hand saw as opposed to using a chain saw. MDX is painless and intuitive. Concepts such as dimensions, hierarchies, and sets are built into the syntax. For more on MDX and the mdXML standard, please visit XML For Analysis.

Central to the MDX language is the cube concept, which deserves a proper introduction:

Cubes

OLAP data is stored in structures called cubes. As you know, a cube is a 3-dimensional solid that, given a point on three of its faces which form a vertex, can take you to a precise point somewhere within the cube itself. A cubeThis point represents the aggregate metric you want to view. The faces represent the different dimensions (like Product, Customer, and Time) that are used to find the point. The dimensions are further broken down by its hierarchies on each face.

Cubes are sometimes — and more aptly — called hypercubes (or a tesseract or “n-cube”) because analysis often makes it necessary to examine more than three points. As soon as you look at more than three faces, you need higher dimensionality.

I admit, when I first heard about cubes I was a bit intimidated. I felt that they were too complex and advanced. Now, working with cubes is natural. Notice though that the concept of a cube is almost the same thing as a star schema. The only difference (at least that I can really think of) is that a star schema generally stores atomic data, and barring any usable aggregate fact table, calculations need to be done on the fly. The cube theoretically stores the results of these calculations inside the cube.

Dimensional Modeling

If you read my blog, then you know I advocate dimensional models for proper data warehousing. If you also advocate and use dimensional models, then the concepts discussed so far will fit you well.

What’s great about using Dimensional Models in your relational database is that your OLAP database will almost mirror the design, making the development and deployment of your cubes so much easier. Not only that, but the cubes will likely load faster and be easier to maintain when built off of the dimensional model. Note you can still achieve most of what you can do without an analytics server, just by using a properly constructed set of business process dimensional models. The OLAP database is just smarter, making several tasks (navigating hierarchies, security, and calculations to name a few) easier.

Published August 11th, 2008

ETL Subsystem 19: Aggregate Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

I spent the last two posts describing aggregate facts and dimensions and explaining the differences between aggregates and summaries. I wanted to do that in a separate thread so that I could focus on Kimball’s “Aggregate Builder” without spending too much time introducing aggregates. So, if you haven’t already, please have a look at those two previous posts.

The key to subsystem 19 is that you budget for and build a system to maintain your aggregates. This can be easily overlooked. I repeat: This can be easily overlooked. Much like what happens with the very important and easily cast aside data profiling subsystem. Don’t let it happen to you!

Building Aggregates

Aggregates can be a challenge to build — especially if your relational or OLAP engines don’t do it for you. Actually, if your OLAP tool doesn’t handle building aggregates, and you are relying on it for your aggregates, then you should think about getting another OLAP tool! I’ll talk more about OLAP in my next Subsystem post.

When designing aggregates to build, consider two things: user access patterns and distribution of data along the hierarchies. You cannot hope to construct aggregates for every possible level of every hierarchy across all metrics in a fact. Instead, you need to assess how your users query the data and then, taking it a bit further, decide if creating an aggregate will make any real difference for performance based on attribute distribution. The main goal of creating aggregates in a data warehouse is to speed query performance mainly by limiting the amount of data required to obtain results. But building them can be expensive.

Generally speaking, every time you update, insert, or delete data into any dimension or fact, you will need to rebuild all aggregates that are tied to those attributes and metrics. If you have two or three dozen aggregates to build every time you load a few business processes dimensional models, then you can imagine the time and processor loads required. In many instances, there is a balancing act between getting the data into the warehouse and getting the data out.

Technically speaking, building aggregates is a simple matter in SQL. Often it consists merely of a GROUPed query, some aggregate or custom functions, and surrogate key management.

Here’s a tip: Embed surrogate keys in your dimensions for each hierarchy used in aggregates. This practice will allow you to avoid the costly process of generating new keys for your hierarchy at the expense of a few administrative attributes stored in each dimension. For example, if you are building a set of region aggregates, you might embed region_key, state_key, city_key along with region, state, and city in your geography dimension. The keys will be repeated in the dimension until they are rolled up, at which time they become unique. The new aggregate fact table will point to these keys, eliminating the need for a specialized surrogate key manager.

Navigating Aggregates

Depending on your technology (here, I talk about SQL Server 2005 via SSIS and Visual FoxPro only — see below), navigating aggregates may or may not be a simple matter. If you put all your aggregates into OLAP engines, then chances are good that your OLAP technology will handle aggregates for you. To do it yourself, say, if you were hand-coding this part of the data warehouse using Visual FoxPro, you have a few design questions to answer.

In an old, but still relevant article published by DBMS, Ralph Kimball defines four key design requirements when dealing with aggregates. They are:

  1. Aggregates must be stored in their own fact tables, separate from the base-level data. In addition, each distinct aggregation level must occupy its own unique fact table.
  2. The dimension tables attached to the aggregate fact tables must, wherever possible, be shrunken versions of the dimension tables associated with the base fact table.
  3. The base Fact table and all of its related aggregate Fact tables must be associated together as a “family of schemas” so that the aggregate navigator knows which tables are related to one another.
  4. Force all SQL created by any end user or application to refer exclusively to the base fact table and its associated full-size dimension tables.

Which leads to the simple process of navigating the schemas when a user queries the dimensional data. The process can be distilled into the following three processing steps proposed by Ralph Kimball in his article:

  1. Find the smallest fact table in the “family of schemas” referenced by a user’s query.
  2. test the query against the schema to be sure that all attributes and metrics are present. If so, proceed. If not, go back to step 1
  3. Rewrite the user’s query to use the smaller aggregates where possible.

If you’re faced with the challenge of developing your own navigator, this article is a must read!

SQL Server 2005 Integration Services (SSIS)

Build aggregates after loading the atomic fact data for each business process. This is usually done in a separate package that can be called from the master package. It may be appropriate to schedule these aggregate-building packages periodically throughout the week as opposed to after each load, but business requirements will dictate this approach.

The aggregate package will contain three important types of data flows.

  • Dimension shrinking
  • Surrogate key management
  • Aggregate fact loading

Within the dimension shrinking data flows, start collapsing the relevant dimensions to the grain of the aggregate fact table. At this time, you would also assign surrogate keys (or, if you use my tip described above, just include the appropriate key and designate it as the primary key).

If you’re particularly crafty, you could design this in a way that would allow you to collapse dimensions from the previously shrunken dimension — saving significant time and resources. For example, collapse the geography dimension first into geography_region, then from geography_region create geography_state, and then from geography_state create geography_city.

Surrogate key management (where you look up the keys in the dimensions to use as foreign keys in the fact) and fact loading (actually inserting into the fact table) are very similar to the normal ETL processes. The only difference is that in order to look up your keys, you need to generate a special dataset from the base atomic data at the proper grain using a GROUP BY which contains all natural keys of the newly shrunken dimensions.

Hand Coding with Visual FoxPro (VFP9)

I find that hand-coding works well with aggregates. You can hand-code dimension shrinking, surrogate key management, and aggregate fact loading quite easily. All that we’re dealing with is some basic workflow and simple SQL. The challenge comes in scheduling and paralleling. Certainly, we want aggregates to run after the atomic data is ready. How you do that in FoxPro is entirely up to you. I prefer to chain events, so that when one finishes, the other kicks off. Scheduling requires additional reliance on the OS or the use of a timer (a bad idea).

From Here

There is so much more to say about this topic. I would like to follow this up with some examples in both SSIS and FoxPro. Stay tuned… In the meantime, I’ll move on to talk more about OLAP.

Published August 4th, 2008

Differences Between Aggregates and Summaries

I find the need to put down some of my thoughts regarding aggregates and summaries, especially on how they apply to SQL and DW/BI. To me, aggregates and summaries are not the same, but I have come to terms with the fact that most users, developers, modelers, and architects use the terms interchangeably. There are two pairs of terms this applies to (one pair is the noun form, while the other is the verb form):

  • n. Aggregate / Summary
  • v. Aggregating / Summarizing

Aggregate (plural aggregates) and summary (plural summaries) datum are created from functions placed upon underlying atomic data. The primary differences between the two lies in two areas: (a) what type of function is applied and (b) how the atomic data is organized as inputs to the function.

General rule: Any function that requires a grouping (typically along a hierarchy) is an aggregate, while any function applied to the entire population of data is a summary. Summaries can be derived from aggregates, but I am hard pressed to come up with an example where the reverse is true.

Functions such as SUM, AVG, MIN, and MAX can be used to derive both aggregates and summaries (by aggregating or summarizing), as you can apply them to an entire dataset or to groups within the dataset. In SQL however, you are forced to define a grouping when you include one or more attributes with the summary statistic. In this case, you have created an aggregate:

SELECT region, SUM(sale_amount) 
    FROM customer c 
    JOIN sales s ON c.cid = s.cid 
    GROUP BY region

A summary:

SELECT SUM(sale_amount) FROM sales

Note that you can create summaries from aggregates. Obviously, SELECT sum(sale_amount) FROM sales doesn’t really tell you much about your data. To aggregate by time or product line is much more useful. Consider the following, very useful aggregate and summary table:

-- get sales by region and month for 2007
SELECT region, sale_month, SUM(sale_amount) AS amount 
    INTO #tmpAggregate
    FROM customer c 
    JOIN sales s ON c.cid = s.cid
    WHERE YEAR =  '2007' and region = 'NE'
    GROUP BY region, sale_month
 
-- summarize the above by giving you total sales
-- in the North East for 2007. Note that this type of
-- summary is usually done in a reporting tool
SELECT SUM(amount) 
    INTO #tmpSumary
    FROM #tmpAggregate

Frequencies and other categorical functions are only valuable when applied against groups. Imagine a histogram without categories! Likewise, nearly all of the aggregate functions available in the majority of programming languages are most useful when applied against groups. This includes functions such as standard deviations, variances, means, and so on.

When an elementary arithmetic function (such as division) is performed on groups of data, the resulting calculation can be called a summary. If the function is applied within the row, it is called a calculation. To restate: arithmetic functions that span groups are a special type of summary. An example of this might be to find the ratio of one group as compared to another (a GROUP BY clause applied to two different study groups in the data).

While it might be a good idea to put together a table to illustrate my point about how functions and inputs affect the terminology, I’ll leave that exercise for another day. That type of project will take a little more thought!

Another difference between an aggregate and a summary is how the two are presented. Aggregates are rarely presented with the atomic data — not to suggest that the underlying data should not be available. Summaries most always come before or after the underlying data is presented to the user.

Am I off base? Do I have it all wrong?

Published July 30th, 2008

Aggregate Facts and Dimensions

Aggregate — picture from http://www.stonetohome.comAggregate fact tables are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc..) or from other specialized functions that output totals derived from a grouping of the base data. These new metrics, called “aggregate facts” or “summary statistics” are stored and maintained in the data warehouse database in special fact tables at the grain of the aggregation. Likewise, the corresponding dimensions are rolled up and condensed to match the new grain of the fact.

These specialized tables are used as substitutes whenever possible for returning user queries. The reason? Speed. Querying a tidy aggregate table is much faster and uses much less disk I/O than the base, atomic fact table, especially if the dimensions are large as well. If you want to wow your users, start adding aggregates. You can even use this “trick” in your operational systems to serve as a foundation for operational reports. I’ve always done this for any report referred to by my users as a “Summary”. (As an aside, there is a difference between an “aggregate” and a “summary”. I’ll explore these differences in my next post.)

For example, take the “Orders” business process from an online catalog company where you might have customer orders in a fact table called FactOrders with dimensions Customer, Product, and OrderDate. With possibly millions of orders in the transaction fact, it makes sense to start thinking about aggregates.

To further the above example, assume that the business is interested in a report: “Monthly orders by state and product type”. While you could generate this easily enough using the FactOrders fact table, you could likely speed up the data retrieval for the report by at least half (but likely much, much more) using an aggregate.

Here, using the atomic transaction FactOrders table:

SELECT c.STATE, p.product_type, t.YEAR, t.MONTH, SUM(f.order_amount)
    FROM FactOrders f
    JOIN DimCustomer c ON c.CustomerID = f.CustomerID
    JOIN DimProduct p ON p.ProductID = f.ProductID
    JOIN DimTime t ON 
        t.YEAR = DATEPART(yy, f.YearMonthID) AND 
        t.MONTH = DATEPART(mm, f.DateID)
    GROUP BY c.STATE, p.product_type, t.YEAR, t.MONTH

The aggregate is querying much less data and queries against time are now much simpler. In my non-scientific tests, the following query ran many times faster (a few seconds compared to about 30 seconds!).

SELECT c.STATE, p.product_type, t.YEAR, t.MONTH, SUM(f.order_amount)
    FROM FactOrders_Agg1 f
    JOIN DimCustomerState c ON c.CustomerStateID = f.CustomerStateID
    JOIN DimProductType p ON p.ProductTypeID = f.ProductTypeID
    JOIN DimMonth t ON t.YearMonthID = f.YearMonthID
    GROUP BY c.STATE, p.product_type, t.YEAR, t.MONTH

Creating the Fact and Dimensions

To implement, you will need to roll up your fact table by the hierarchies found in your dimensions. The result will be a new fact table, a set of new accompanying dimensions at the grain of the fact, and all new foreign keys for mapping. I usually name the fact table the same as the base fact with some meaningful suffix appended to the end. In SSMS, this keeps the aggregates with the fact in my object explorer. Dimensions usually get new names (like CustomerState and ProductType) and should be conformed so that they can be reused across business processes. You could even create views instead of new dimensions, but this does not eliminate the need to regenerate new surrogate keys.

When rolling up dimensions, you are provided with an excellent opportunity to perform aggregate functions on the dimension itself and store the results as new attributes. For example, you may want to know how many customers are living in each state. This could be used as the denominator in some population calculation you plan to use against the aggregate fact. Your new dimension might therefore look like the following:

SELECT Cust_Country, Cust_Region, Cust_State, COUNT(Cust_ID) 
FROM DimCustomer 
GROUP BY Cust_Country, Cust_Region, Cust_State

The most obvious aggregate function to use is COUNT, but depending on the type of data you have in your dimensions, other functions may prove useful. Just be warned: If you find that aggregate functions are being used a lot on your dimensions, you may need to revisit your design. There may be opportunities to pull out those metrics into existing or new fact tables!

Generating aggregates is largely an incremental process, where you examine query and reporting usage looking for places to improve performance. Aggregates stored in the RDBMS are maintained through ETL and/or your OLAP engine.

A Note About OLAP Aggregates

Theoretically, storing aggregates in a fact table in a RDBMS is the same as storing them in an OLAP cube. In OLAP storage, aggregates are precalculated summaries of data from the different dimensions of the cube, such that a query that seeks to know the aggregate (sum) of some metric (order amount) for X (customer state) and Y (product type) over T (monthly orders) would only need to look inside the cube at those exact coordinates to get the answer. I won’t pretend to know how this stuff is physically stored, but OLAP engines across the board offer better performance, management, and navigation mechanisms for the aggregations than is available through the RDBMS (even when using Indexed or Materialized Views).

Next post, I’ll write some thoughts on the differences between “Summaries” and “Aggregates”!

Published July 18th, 2008

ETL Subsystem 18: Fact Table Provider

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

This subsystem mirrors the previous one I discussed: ETL Subsystem 17: Dimension Manager. It exists to ensure that fact tables are properly maintained and delivered to the organization. The functionality to accomplish this is done through ETL (with some help by other technologies, such as replication). The functions include (but are not limited to) the following significant tasks:

Note: I’ll discuss Aggregates more in my next post.

SQL Server 2005 Integration Services (SSIS)

Fact distribution can be done using ETL or through replication. While other methods exist (RPCs, Web Services, and other SOA-type solutions for example), ETL and replication are by far the most widespread and documented. SSIS works in both conditions just fine. For more thoughts on this, refer back to ETL Subsytem 17.

Hand Coding with Visual FoxPro (VFP9)

Nothing new to report! VFP is limited, without a native replication feature, on its own. It can move data in distributed environments quite well though. So using ETL as a solution is certainly high on the list. If replication is a must, then the VFP database backend will need to be replaced with SQL Server or similar RDBMS product.

From Here

As stated, in the next subsystem, I’ll talk about aggregates. I love aggregates (my wife wants a divorce), so my next post looks to be quite interesting!

Published July 14th, 2008

ETL Subsystem 17: Dimension Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

This subsystem is largely a management task. It involves ensuring that conformed dimensions (see my post “ETL Subsystem 8: Data Conformance“) are properly published to the organization. Conformed Dimensions are one of the most important and fundamental aspects of data warehousing. In order to deliver these conformed dimensions to business users spread across the organization — including geographically — a system must be in place to deal them out.

In simple environments (a single RDBMS server that services a handful of analysts, for example), dimension management is as easy as inserting new rows through ETL and updating data model metadata by the DBA on a single server. In more complex environments — where servers and applications are scaled up and out, across large geographies — dimension management becomes more of a challenge.

It is common in these distributed environments (horizontally and vertically) to have copies of dimensions sitting at various locations in the enterprise. Don’t confuse this with “Mart Madness” or “Silos”. These copies are managed by a single ETL process which not only handles normal ETL, but likely handles the distribution as well (replication, as I’ll talk about in a moment, is another option).

Why not have remote users connect to a single, centralized database? A Dimensional Model exists to facilitate analytics and query performance. To this end, it is necessary at times to move data closer to the analyst. Not only is this necessary, but it is also smart. If you have presentation servers in Brazil, Costa Rica, Singapore, and Sri Lanka, then guess what: You will distribute your data directly to these locations. How you do it is more of a question of technology and policy, rather than a process tied to ETL directly.

SQL Server 2005 Integration Services (SSIS)

If you choose to distribute dimensions using ETL, then SSIS will work just fine. This is nothing more than a set of special packages that you design to select changes from the master dimensions (stored on your central integration server), multicast the data, and load to the destination dimension or dimensions. This incremental approach is desirable when you are processing a small amount of changes to medium and large dimensions.

Alternately, if your dimensions are small enough, it might make sense to drop and recreate them on the destination servers. The overhead associated with collecting the changes from the master dimensions might not be worth the effort. But before you think this method the best, be aware that all constraints to this dimension (including those connected through the fact) will need to be dropped. Also, the database will be unavailable to the users during this operation.

Another good option is to simply use Replication to propagate the changes to the various distributed dimensions. I am currently using replication now as my standard, but only because my situation currently dictates it. I have large data volumes, which are updated frequently (several times a day, with sensitivity on global timezones).

Hand Coding with Visual FoxPro (VFP9)

I am starting to sound like a broken record: VFP is not an ETL tool, but an OOP language. Dimension distribution will need to be done manually by either making copies of dimensions and pushing them out to the client systems, or designing a mechanism for the client systems to pull changed data from the master dimensions. I have found this to be a major challenge using VFP as the sole solution. Obviously, if your backend database is SQL Server or another RDBMS, you likely have a replication option. That being said, it certainly is not an impossible task, but one in which you need to prepare yourself for the various complexities. I would go out on a limb and suggest that this is true for all hand-coded solutions.

From Here

In my next post, I will discuss ETL Subsystem 18: Fact Table Provider. As with this subsystem, the Fact Table Provider is largely a management task designed to bring facts to the user community.