Posts Tagged Analytical Databases

ETL Subsystem 20: OLAP Cube Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

In my previous post, I discussed OLAP, the Analytics Server, and Cubes as a primer to this post. If you’re new to this topic, please have a look there for more details. This post is specific to building cubes from the ETL environment.

Cube building is simply the process of populating the analytical server database with data from the relational database. The relational database can contain relational or dimensional models.

You will want to re-build your analytical (OLAP) database after loading data into your dimensional model. This will keep your analytical data stored in the analytical database in sync with your dimensional models stored in the relational database. From the ETL point-of-view, the basic process works like this:

  1. Extract data from disparate sources for a business process
  2. Cleanse and conform the data
  3. Load Dimensions
  4. Load Facts, rebuilding any aggregate fact tables
  5. Build cubes and other analytical server objects (like aggregates and dimensions)

These steps are repeated for each atomic business process dimensional model in the data warehouse. The first four are familiar, as I have discussed them at length. The last one should seem natural: You need to devise a way to alert the analytical database that the data warehouse has just been updated. The best place to do this is directly in the ETL system. How you handle the rebuild process is dependent upon how the OLAP vendor’s database works (I offer some suggestions below).

One thing to consider before building cube-updating code in your ETL workflow is latency. Latency, in this context, describes the time delay between the moment the atomic business process dimensional model is up-to-date, and the moment the related cube is up-to-date. It might seem natural to keep this in sync as close as possible. Certainly placing the code to update the cubes right after processing the fact tables in the ETL system is the way to go. But check your requirements. You may only need to have data refreshed once a day, in the early mornings for example (whereas your ETL system might run every 4 hours). You may also have a research environment where OLAP data need only be one week fresh. You can save yourself a lot of processing time by getting these requirements right.

To accommodate this and other scenarios, I highly recommend putting any cube-building or OLAP-related update code in a separate ETL package or module. As with fact and dimension processing, a master package or program (depending on your tool) will “call” the child packages or modules. You could embed some logic in the final step to only process cubes after the morning ETL run, for example.

I do not recommend processing OLAP cubes on a fixed schedule unless you can predict with certainty when your dimensional models will be ready. Normally, it makes most sense to kick off this process as soon as the facts have been loaded.

SQL Server 2005 Integration Services (SSIS)

If you’re using SSIS to update your cubes, then you are likely using SQL Server Analysis Services as your analytical database. If not, and you use SSIS, I’d like to know what you’re using for your analytic server and how you’re building your cubes!

SSIS comes with a powerful task that handles cube-building for you. It’s called “Analysis Services Processing Task” and can do the following: (a) connect to a SSAS database through the Analysis Services connection; (b) specify various objects in the database to update (measure groups, partitions, dimensions, cubes, mining models, mining structures, and databases); (c) define batch-mode or sequence mode to process objects together or one after another; (d) perform an impact analysis on objects associated with the cube; and (e) specify whether dependent and affected objects are also processed.

The Analysis Services Execute DDL Task also allows you to work with cubes by giving you the ability to run DDL statements, as defined by the Analysis Services Scripting Language (ASSL). These statements are included in an XMLA command (more on XMLA below). Although this task can’t process a cube, it can manipulate it. For details on the ASSL language, check out the Analysis Services Scripting Language Reference.

Hand Coding with Visual FoxPro (VFP9)

FoxPro, like all programming languages, just isn’t an analytical database engine so there is no native cube or cube builder. So a lot here depends on what type of technology you are using to handle your cubes.

Your first stop should be to check with the vendor to see if there is an api or technique available to update and rebuild cubes. Even if this functionality exists, it might not be the best option.

For your second stop, look into running XML for Analysis (XMLA) scripts against the database. XMLA is a SOAP-based protocol for interacting with OLAP databases. Through web services, you simply issue commands (like “process”) and receive responses. For more information on XMLA, please check out the XML for Analysis website.

From Here

It is normal for an ETL developer to be involved in generating cubes for analysis services, so this is a skill you should try to foster. Understanding dimensional modeling will give you a huge head start.

Next post (I’m getting closer and closer to the magic 34 — I just hope Ralph and the gang don’t decide to renumber them on me or something), I’ll go through ETL Subsystem 21: Data Propagation Manager!

Tags: , , , , , , ,

No Comments

Analytical Databases

ying yang of relational verses analytical databasesMost of what I discuss on Tod means Fox (and especially in my 34 Subsystem series) has to do with relational database engines (SQL Server 2005 and Visual FoxPro in particular). For most data integration projects, desktop and web solutions, and data warehouses, the relational database is all you’ll need. It can be used to create and manage both relational and dimensional models. But in Business Intelligence applications, chances are that you’ll need an additional, more specialized form of multidimensional data storage and retrieval. Analytical databases cover this need. In fact, Dr. Codd, the inventor of the relational database, often explained how analytical databases are a “necessary companion” to relational databases.

The OLAP Story

OLAP stands for Online Analytical Processing. It is a terrible name to describe multidimensional analytical databases. Instead of a more meaningful name, like FASMI (Fast Analysis of Shared Multidimensional Information) proposed by Nigel Pendse of “OLAP Report” we seem to be stuck with OLAP. With great pain, I will continue to discuss analytical databases using the OLAP terminology.

Basically, an OLAP (*sigh*) database has certain characteristics that set it apart from relational databases. And, coincidentally, Pendse included all characteristics in the FASMI acronym. Check out the OLAP Report for more details. If you’re just getting into business analytics, or are in the process of evaluating analytical database software – you must visit this site!

The Smartness Factor

I like to think of the relational database as “dumb” and the analytic database as “smart”. Let me explain:

The primary focus of Analytic servers is to get the (often pre-calculated) data out of the database as quickly as possible, allowing the user to zoom in and out along different hierarchies. Contrast this with the Codd’s relational model which seeks to eliminate data anomalies at transaction time through normalization: data retrieval is slow and often complex. With this in mind, you can get a feel for how a database engine can store, catalog, and retrieve data differently. (In fact, this is what makes dimensional modeling in general so favorable for querying and analytics – it is not bound by the restrictiveness imposed by normalization. I’d also like to note that as with dimensional models, OLAP databases are multidimensional.)

While the relational database plays a very important role, so does the multidimensional (OLAP) database. It’s simply built differently to service a different need.

Primarily, analytic servers can manage aggregates, aggregate navigation, and complex calculations and summaries across hierarchies at blazing speeds. These skills were borrowed from both navigational database designs (think of the DOM or a Wiki) and hierarchal designs (trees with leaves and branches). A lot of this has to do with how the data is stored.

Storage

Analytics servers offer a different type of storage. MOLAP, or Multidimensional OLAP, can be much more efficient than relational engines. Some tools (like SQL Server Analysis Services) allow you to store your analytical databases in a relational way (ROLAP) or using a hybrid approach (HOLAP). Personally, I see no benefit at all with ROLAP, aside from real-time systems where using the relational database is a must because of the latency involved with updating the OLAP database. (Not to digress too much here, but even real-time environments can use MOLAP storage by using partitions and caches correctly.)

MOLAP is more natural and faster. This may differ depending on your tool choice, but I beg someone to tell me differently. For a detailed discussion on how MOLAP and ROLAP engines store data, you can try this resource “Analysis Services 2005 Performance Guide” (which defines the way Analysis Services does it).

HOLAP storage is an interesting option and could actually perform well –even better than MOLAP in some instances — with less disk usage. But a lot of this depends on how many aggregates are defined and how often the system needs to query the relational database. It might be a good consideration, but if you’re not sure and you have plenty of disk space (disk space is cheap), then go with MOLAP.

The smartness factor2 (the MDX language)

I like to think of SQL as “dumb” and MDX as “smart”. Let me explain:

The MultiDimensional eXpression (MDX) language was created about 10 years ago by Microsoft to work specifically with multidimensional data stored in analytical servers. This OLAP-specific language improves upon the SQL syntax, removing much of the bulkiness associated with the language. MDX is an elegant and highly relevant partner to the analytical database. While you could get out what you need using SQL, most every SQL statement will be a challenging one. It’s like cutting down a tree with a hand saw as opposed to using a chain saw. MDX is painless and intuitive. Concepts such as dimensions, hierarchies, and sets are built into the syntax. For more on MDX and the mdXML standard, please visit XML For Analysis.

Central to the MDX language is the cube concept, which deserves a proper introduction:

Cubes

OLAP data is stored in structures called cubes. As you know, a cube is a 3-dimensional solid that, given a point on three of its faces which form a vertex, can take you to a precise point somewhere within the cube itself. A cubeThis point represents the aggregate metric you want to view. The faces represent the different dimensions (like Product, Customer, and Time) that are used to find the point. The dimensions are further broken down by its hierarchies on each face.

Cubes are sometimes — and more aptly — called hypercubes (or a tesseract or “n-cube”) because analysis often makes it necessary to examine more than three points. As soon as you look at more than three faces, you need higher dimensionality.

I admit, when I first heard about cubes I was a bit intimidated. I felt that they were too complex and advanced. Now, working with cubes is natural. Notice though that the concept of a cube is almost the same thing as a star schema. The only difference (at least that I can really think of) is that a star schema generally stores atomic data, and barring any usable aggregate fact table, calculations need to be done on the fly. The cube theoretically stores the results of these calculations inside the cube.

Dimensional Modeling

If you read my blog, then you know I advocate dimensional models for proper data warehousing. If you also advocate and use dimensional models, then the concepts discussed so far will fit you well.

What’s great about using Dimensional Models in your relational database is that your OLAP database will almost mirror the design, making the development and deployment of your cubes so much easier. Not only that, but the cubes will likely load faster and be easier to maintain when built off of the dimensional model. Note you can still achieve most of what you can do without an analytics server, just by using a properly constructed set of business process dimensional models. The OLAP database is just smarter, making several tasks (navigating hierarchies, security, and calculations to name a few) easier.

Tags: , , , ,

2 Comments