Posts Tagged OLAP

ETL Subsystem 20: OLAP Cube Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

In my previous post, I discussed OLAP, the Analytics Server, and Cubes as a primer to this post. If you’re new to this topic, please have a look there for more details. This post is specific to building cubes from the ETL environment.

Cube building is simply the process of populating the analytical server database with data from the relational database. The relational database can contain relational or dimensional models.

You will want to re-build your analytical (OLAP) database after loading data into your dimensional model. This will keep your analytical data stored in the analytical database in sync with your dimensional models stored in the relational database. From the ETL point-of-view, the basic process works like this:

  1. Extract data from disparate sources for a business process
  2. Cleanse and conform the data
  3. Load Dimensions
  4. Load Facts, rebuilding any aggregate fact tables
  5. Build cubes and other analytical server objects (like aggregates and dimensions)

These steps are repeated for each atomic business process dimensional model in the data warehouse. The first four are familiar, as I have discussed them at length. The last one should seem natural: You need to devise a way to alert the analytical database that the data warehouse has just been updated. The best place to do this is directly in the ETL system. How you handle the rebuild process is dependent upon how the OLAP vendor’s database works (I offer some suggestions below).

One thing to consider before building cube-updating code in your ETL workflow is latency. Latency, in this context, describes the time delay between the moment the atomic business process dimensional model is up-to-date, and the moment the related cube is up-to-date. It might seem natural to keep this in sync as close as possible. Certainly placing the code to update the cubes right after processing the fact tables in the ETL system is the way to go. But check your requirements. You may only need to have data refreshed once a day, in the early mornings for example (whereas your ETL system might run every 4 hours). You may also have a research environment where OLAP data need only be one week fresh. You can save yourself a lot of processing time by getting these requirements right.

To accommodate this and other scenarios, I highly recommend putting any cube-building or OLAP-related update code in a separate ETL package or module. As with fact and dimension processing, a master package or program (depending on your tool) will “call” the child packages or modules. You could embed some logic in the final step to only process cubes after the morning ETL run, for example.

I do not recommend processing OLAP cubes on a fixed schedule unless you can predict with certainty when your dimensional models will be ready. Normally, it makes most sense to kick off this process as soon as the facts have been loaded.

SQL Server 2005 Integration Services (SSIS)

If you’re using SSIS to update your cubes, then you are likely using SQL Server Analysis Services as your analytical database. If not, and you use SSIS, I’d like to know what you’re using for your analytic server and how you’re building your cubes!

SSIS comes with a powerful task that handles cube-building for you. It’s called “Analysis Services Processing Task” and can do the following: (a) connect to a SSAS database through the Analysis Services connection; (b) specify various objects in the database to update (measure groups, partitions, dimensions, cubes, mining models, mining structures, and databases); (c) define batch-mode or sequence mode to process objects together or one after another; (d) perform an impact analysis on objects associated with the cube; and (e) specify whether dependent and affected objects are also processed.

The Analysis Services Execute DDL Task also allows you to work with cubes by giving you the ability to run DDL statements, as defined by the Analysis Services Scripting Language (ASSL). These statements are included in an XMLA command (more on XMLA below). Although this task can’t process a cube, it can manipulate it. For details on the ASSL language, check out the Analysis Services Scripting Language Reference.

Hand Coding with Visual FoxPro (VFP9)

FoxPro, like all programming languages, just isn’t an analytical database engine so there is no native cube or cube builder. So a lot here depends on what type of technology you are using to handle your cubes.

Your first stop should be to check with the vendor to see if there is an api or technique available to update and rebuild cubes. Even if this functionality exists, it might not be the best option.

For your second stop, look into running XML for Analysis (XMLA) scripts against the database. XMLA is a SOAP-based protocol for interacting with OLAP databases. Through web services, you simply issue commands (like “process”) and receive responses. For more information on XMLA, please check out the XML for Analysis website.

From Here

It is normal for an ETL developer to be involved in generating cubes for analysis services, so this is a skill you should try to foster. Understanding dimensional modeling will give you a huge head start.

Next post (I’m getting closer and closer to the magic 34 — I just hope Ralph and the gang don’t decide to renumber them on me or something), I’ll go through ETL Subsystem 21: Data Propagation Manager!

Tags: , , , , , , ,

No Comments

Analytical Databases

 Analytical DatabasesMost of what I discuss on Tod means Fox (and especially in my 34 Subsystem series) has to do with relational database engines (SQL Server 2005 and Visual FoxPro in particular). For most data integration projects, desktop and web solutions, and data warehouses, the relational database is all you’ll need. It can be used to create and manage both relational and dimensional models. But in Business Intelligence applications, chances are that you’ll need an additional, more specialized form of multidimensional data storage and retrieval. Analytical databases cover this need. In fact, Dr. Codd, the inventor of the relational database, often explained how analytical databases are a “necessary companion” to relational databases.

The OLAP Story

OLAP stands for Online Analytical Processing. It is a terrible name to describe multidimensional analytical databases. Instead of a more meaningful name, like FASMI (Fast Analysis of Shared Multidimensional Information) proposed by Nigel Pendse of “OLAP Report” we seem to be stuck with OLAP. With great pain, I will continue to discuss analytical databases using the OLAP terminology.

Basically, an OLAP (*sigh*) database has certain characteristics that set it apart from relational databases. And, coincidentally, Pendse included all characteristics in the FASMI acronym. Check out the OLAP Report for more details. If you’re just getting into business analytics, or are in the process of evaluating analytical database software – you must visit this site!

The Smartness Factor

I like to think of the relational database as “dumb” and the analytic database as “smart”. Let me explain:

The primary focus of Analytic servers is to get the (often pre-calculated) data out of the database as quickly as possible, allowing the user to zoom in and out along different hierarchies. Contrast this with the Codd’s relational model which seeks to eliminate data anomalies at transaction time through normalization: data retrieval is slow and often complex. With this in mind, you can get a feel for how a database engine can store, catalog, and retrieve data differently. (In fact, this is what makes dimensional modeling in general so favorable for querying and analytics – it is not bound by the restrictiveness imposed by normalization. I’d also like to note that as with dimensional models, OLAP databases are multidimensional.)

While the relational database plays a very important role, so does the multidimensional (OLAP) database. It’s simply built differently to service a different need.

Primarily, analytic servers can manage aggregates, aggregate navigation, and complex calculations and summaries across hierarchies at blazing speeds. These skills were borrowed from both navigational database designs (think of the DOM or a Wiki) and hierarchal designs (trees with leaves and branches). A lot of this has to do with how the data is stored.

Storage

Analytics servers offer a different type of storage. MOLAP, or Multidimensional OLAP, can be much more efficient than relational engines. Some tools (like SQL Server Analysis Services) allow you to store your analytical databases in a relational way (ROLAP) or using a hybrid approach (HOLAP). Personally, I see no benefit at all with ROLAP, aside from real-time systems where using the relational database is a must because of the latency involved with updating the OLAP database. (Not to digress too much here, but even real-time environments can use MOLAP storage by using partitions and caches correctly.)

MOLAP is more natural and faster. This may differ depending on your tool choice, but I beg someone to tell me differently. For a detailed discussion on how MOLAP and ROLAP engines store data, you can try this resource “Analysis Services 2005 Performance Guide” (which defines the way Analysis Services does it).

HOLAP storage is an interesting option and could actually perform well –even better than MOLAP in some instances — with less disk usage. But a lot of this depends on how many aggregates are defined and how often the system needs to query the relational database. It might be a good consideration, but if you’re not sure and you have plenty of disk space (disk space is cheap), then go with MOLAP.

The smartness factor2 (the MDX language)

I like to think of SQL as “dumb” and MDX as “smart”. Let me explain:

The MultiDimensional eXpression (MDX) language was created about 10 years ago by Microsoft to work specifically with multidimensional data stored in analytical servers. This OLAP-specific language improves upon the SQL syntax, removing much of the bulkiness associated with the language. MDX is an elegant and highly relevant partner to the analytical database. While you could get out what you need using SQL, most every SQL statement will be a challenging one. It’s like cutting down a tree with a hand saw as opposed to using a chain saw. MDX is painless and intuitive. Concepts such as dimensions, hierarchies, and sets are built into the syntax. For more on MDX and the mdXML standard, please visit XML For Analysis.

Central to the MDX language is the cube concept, which deserves a proper introduction:

Cubes

OLAP data is stored in structures called cubes. As you know, a cube is a 3-dimensional solid that, given a point on three of its faces which form a vertex, can take you to a precise point somewhere within the cube itself.  Analytical DatabasesThis point represents the aggregate metric you want to view. The faces represent the different dimensions (like Product, Customer, and Time) that are used to find the point. The dimensions are further broken down by its hierarchies on each face.

Cubes are sometimes — and more aptly — called hypercubes (or a tesseract or “n-cube”) because analysis often makes it necessary to examine more than three points. As soon as you look at more than three faces, you need higher dimensionality.

I admit, when I first heard about cubes I was a bit intimidated. I felt that they were too complex and advanced. Now, working with cubes is natural. Notice though that the concept of a cube is almost the same thing as a star schema. The only difference (at least that I can really think of) is that a star schema generally stores atomic data, and barring any usable aggregate fact table, calculations need to be done on the fly. The cube theoretically stores the results of these calculations inside the cube.

Dimensional Modeling

If you read my blog, then you know I advocate dimensional models for proper data warehousing. If you also advocate and use dimensional models, then the concepts discussed so far will fit you well.

What’s great about using Dimensional Models in your relational database is that your OLAP database will almost mirror the design, making the development and deployment of your cubes so much easier. Not only that, but the cubes will likely load faster and be easier to maintain when built off of the dimensional model. Note you can still achieve most of what you can do without an analytics server, just by using a properly constructed set of business process dimensional models. The OLAP database is just smarter, making several tasks (navigating hierarchies, security, and calculations to name a few) easier.

Tags: , , , ,

2 Comments

ETL Subsystem 19: Aggregate Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

I spent the last two posts describing aggregate facts and dimensions and explaining the differences between aggregates and summaries. I wanted to do that in a separate thread so that I could focus on Kimball’s “Aggregate Builder” without spending too much time introducing aggregates. So, if you haven’t already, please have a look at those two previous posts.

The key to subsystem 19 is that you budget for and build a system to maintain your aggregates. This can be easily overlooked. I repeat: This can be easily overlooked. Much like what happens with the very important and easily cast aside data profiling subsystem. Don’t let it happen to you!

Building Aggregates

Aggregates can be a challenge to build — especially if your relational or OLAP engines don’t do it for you. Actually, if your OLAP tool doesn’t handle building aggregates, and you are relying on it for your aggregates, then you should think about getting another OLAP tool! I’ll talk more about OLAP in my next Subsystem post.

When designing aggregates to build, consider two things: user access patterns and distribution of data along the hierarchies. You cannot hope to construct aggregates for every possible level of every hierarchy across all metrics in a fact. Instead, you need to assess how your users query the data and then, taking it a bit further, decide if creating an aggregate will make any real difference for performance based on attribute distribution. The main goal of creating aggregates in a data warehouse is to speed query performance mainly by limiting the amount of data required to obtain results. But building them can be expensive.

Generally speaking, every time you update, insert, or delete data into any dimension or fact, you will need to rebuild all aggregates that are tied to those attributes and metrics. If you have two or three dozen aggregates to build every time you load a few business processes dimensional models, then you can imagine the time and processor loads required. In many instances, there is a balancing act between getting the data into the warehouse and getting the data out.

Technically speaking, building aggregates is a simple matter in SQL. Often it consists merely of a GROUPed query, some aggregate or custom functions, and surrogate key management.

Here’s a tip: Embed surrogate keys in your dimensions for each hierarchy used in aggregates. This practice will allow you to avoid the costly process of generating new keys for your hierarchy at the expense of a few administrative attributes stored in each dimension. For example, if you are building a set of region aggregates, you might embed region_key, state_key, city_key along with region, state, and city in your geography dimension. The keys will be repeated in the dimension until they are rolled up, at which time they become unique. The new aggregate fact table will point to these keys, eliminating the need for a specialized surrogate key manager.

Navigating Aggregates

Depending on your technology (here, I talk about SQL Server 2005 via SSIS and Visual FoxPro only — see below), navigating aggregates may or may not be a simple matter. If you put all your aggregates into OLAP engines, then chances are good that your OLAP technology will handle aggregates for you. To do it yourself, say, if you were hand-coding this part of the data warehouse using Visual FoxPro, you have a few design questions to answer.

In an old, but still relevant article published by DBMS, Ralph Kimball defines four key design requirements when dealing with aggregates. They are:

  1. Aggregates must be stored in their own fact tables, separate from the base-level data. In addition, each distinct aggregation level must occupy its own unique fact table.
  2. The dimension tables attached to the aggregate fact tables must, wherever possible, be shrunken versions of the dimension tables associated with the base fact table.
  3. The base Fact table and all of its related aggregate Fact tables must be associated together as a “family of schemas” so that the aggregate navigator knows which tables are related to one another.
  4. Force all SQL created by any end user or application to refer exclusively to the base fact table and its associated full-size dimension tables.

Which leads to the simple process of navigating the schemas when a user queries the dimensional data. The process can be distilled into the following three processing steps proposed by Ralph Kimball in his article:

  1. Find the smallest fact table in the “family of schemas” referenced by a user’s query.
  2. test the query against the schema to be sure that all attributes and metrics are present. If so, proceed. If not, go back to step 1
  3. Rewrite the user’s query to use the smaller aggregates where possible.

If you’re faced with the challenge of developing your own navigator, this article is a must read!

SQL Server 2005 Integration Services (SSIS)

Build aggregates after loading the atomic fact data for each business process. This is usually done in a separate package that can be called from the master package. It may be appropriate to schedule these aggregate-building packages periodically throughout the week as opposed to after each load, but business requirements will dictate this approach.

The aggregate package will contain three important types of data flows.

  • Dimension shrinking
  • Surrogate key management
  • Aggregate fact loading

Within the dimension shrinking data flows, start collapsing the relevant dimensions to the grain of the aggregate fact table. At this time, you would also assign surrogate keys (or, if you use my tip described above, just include the appropriate key and designate it as the primary key).

If you’re particularly crafty, you could design this in a way that would allow you to collapse dimensions from the previously shrunken dimension — saving significant time and resources. For example, collapse the geography dimension first into geography_region, then from geography_region create geography_state, and then from geography_state create geography_city.

Surrogate key management (where you look up the keys in the dimensions to use as foreign keys in the fact) and fact loading (actually inserting into the fact table) are very similar to the normal ETL processes. The only difference is that in order to look up your keys, you need to generate a special dataset from the base atomic data at the proper grain using a GROUP BY which contains all natural keys of the newly shrunken dimensions.

Hand Coding with Visual FoxPro (VFP9)

I find that hand-coding works well with aggregates. You can hand-code dimension shrinking, surrogate key management, and aggregate fact loading quite easily. All that we’re dealing with is some basic workflow and simple SQL. The challenge comes in scheduling and paralleling. Certainly, we want aggregates to run after the atomic data is ready. How you do that in FoxPro is entirely up to you. I prefer to chain events, so that when one finishes, the other kicks off. Scheduling requires additional reliance on the OS or the use of a timer (a bad idea).

From Here

There is so much more to say about this topic. I would like to follow this up with some examples in both SSIS and FoxPro. Stay tuned… In the meantime, I’ll move on to talk more about OLAP.

Tags: , , , , , ,

No Comments

Aggregate Facts and Dimensions

aggregate Aggregate Facts and DimensionsAggregate fact tables are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc..) or from other specialized functions that output totals derived from a grouping of the base data. These new metrics, called “aggregate facts” or “summary statistics” are stored and maintained in the data warehouse database in special fact tables at the grain of the aggregation. Likewise, the corresponding dimensions are rolled up and condensed to match the new grain of the fact.

These specialized tables are used as substitutes whenever possible for returning user queries. The reason? Speed. Querying a tidy aggregate table is much faster and uses much less disk I/O than the base, atomic fact table, especially if the dimensions are large as well. If you want to wow your users, start adding aggregates. You can even use this “trick” in your operational systems to serve as a foundation for operational reports. I’ve always done this for any report referred to by my users as a “Summary”. (As an aside, there is a difference between an “aggregate” and a “summary”. I’ll explore these differences in my next post.)

For example, take the “Orders” business process from an online catalog company where you might have customer orders in a fact table called FactOrders with dimensions Customer, Product, and OrderDate. With possibly millions of orders in the transaction fact, it makes sense to start thinking about aggregates.

To further the above example, assume that the business is interested in a report: “Monthly orders by state and product type”. While you could generate this easily enough using the FactOrders fact table, you could likely speed up the data retrieval for the report by at least half (but likely much, much more) using an aggregate.

Here, using the atomic transaction FactOrders table:

SELECT c.state, p.product_type, t.year, t.month, SUM(f.order_amount)
    FROM FactOrders f
    JOIN DimCustomer c ON c.CustomerID = f.CustomerID
    JOIN DimProduct p ON p.ProductID = f.ProductID
    JOIN DimTime t ON 
        t.year = DATEPART(yy, f.YearMonthID) AND 
        t.month = DATEPART(mm, f.DateID)
    GROUP BY c.state, p.product_type, t.year, t.month

The aggregate is querying much less data and queries against time are now much simpler. In my non-scientific tests, the following query ran many times faster (a few seconds compared to about 30 seconds!).

SELECT c.state, p.product_type, t.year, t.month, SUM(f.order_amount)
    FROM FactOrders_Agg1 f
    JOIN DimCustomerState c ON c.CustomerStateID = f.CustomerStateID
    JOIN DimProductType p ON p.ProductTypeID = f.ProductTypeID
    JOIN DimMonth t ON t.YearMonthID = f.YearMonthID
    GROUP BY c.state, p.product_type, t.year, t.month

Creating the Fact and Dimensions

To implement, you will need to roll up your fact table by the hierarchies found in your dimensions. The result will be a new fact table, a set of new accompanying dimensions at the grain of the fact, and all new foreign keys for mapping. I usually name the fact table the same as the base fact with some meaningful suffix appended to the end. In SSMS, this keeps the aggregates with the fact in my object explorer. Dimensions usually get new names (like CustomerState and ProductType) and should be conformed so that they can be reused across business processes. You could even create views instead of new dimensions, but this does not eliminate the need to regenerate new surrogate keys.

When rolling up dimensions, you are provided with an excellent opportunity to perform aggregate functions on the dimension itself and store the results as new attributes. For example, you may want to know how many customers are living in each state. This could be used as the denominator in some population calculation you plan to use against the aggregate fact. Your new dimension might therefore look like the following:

SELECT Cust_Country, Cust_Region, Cust_State, COUNT(Cust_ID) 
FROM DimCustomer 
GROUP BY Cust_Country, Cust_Region, Cust_State

The most obvious aggregate function to use is COUNT, but depending on the type of data you have in your dimensions, other functions may prove useful. Just be warned: If you find that aggregate functions are being used a lot on your dimensions, you may need to revisit your design. There may be opportunities to pull out those metrics into existing or new fact tables!

Generating aggregates is largely an incremental process, where you examine query and reporting usage looking for places to improve performance. Aggregates stored in the RDBMS are maintained through ETL and/or your OLAP engine.

A Note About OLAP Aggregates

Theoretically, storing aggregates in a fact table in a RDBMS is the same as storing them in an OLAP cube. In OLAP storage, aggregates are precalculated summaries of data from the different dimensions of the cube, such that a query that seeks to know the aggregate (sum) of some metric (order amount) for X (customer state) and Y (product type) over T (monthly orders) would only need to look inside the cube at those exact coordinates to get the answer. I won’t pretend to know how this stuff is physically stored, but OLAP engines across the board offer better performance, management, and navigation mechanisms for the aggregations than is available through the RDBMS (even when using Indexed or Materialized Views).

Next post, I’ll write some thoughts on the differences between “Summaries” and “Aggregates”!

Tags: , , , , , , ,

3 Comments

A Data Warehouser’s Vocabulary (Part 2)

This post is part 2 (read part 1) of a series of posts containing a glossary of terms and concepts that I feel has some relevance to the data warehousing and business intelligence world. Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! When finished, these terms will be compiled and made a static page on TmF.

Aggregation
The process of redefining data into a summarization based on some rules or criteria. Aggregation may also encompass de-normalization for data access and retrieval.
Analytical Processing
Producing analysis for management decisions, usually involving trend analysis, drill-down analysis, demographic analysis, profiling, and so on.
Attribute
Any detail that serves to qualify, identify, classify, quantify, or express the state of an entity.
Data Mining
The process of analyzing large amounts of data in search of previously undiscovered business patterns.
Dimension
A denormalized table in a dimensional model with a single part primary key and descriptive attribute columns.
Event
A signal that some activity (usually a business transaction) has occurred.
Fact
Central table of a Star Schema which numeric performance measurements identified by a composite key, each of whose elements is a foreign key drawn from a dimension table.
Heuristic Analysis
Heuristic Analysis is a method to help to solve a problem, commonly informal. It is particularly used for a method that often rapidly leads to a solution that is usually reasonably close to the best possible answer. Heuristics are “rules of thumb”, educated guesses, intuitive judgments or simply common sense.
Online Analytical Processing (OLAP, also MOLAP)
On-line retrieval and analysis of data to reveal business trends and statistics not directly visible in the data directly retrieved from a data warehouse. Also known as multidimensional analysis.
Outrigger
A secondary dimension table attached to a dimension table. An outrigger is not used to normalize a dimension.
Relational OLAP (ROLAP)
“Relational” OLAP, in which the OLAP processes use a relational, normalized model for its source.
Slowly Changing Dimension (SCD)
The tendency for dimension attributes to change gradually or occasionally over time. The techniques for handling these changes include Type 1 (overwrite), Type 2 (keep history), and Type 3 (alternate realities).
Snowflake
A normalized dimension where a flat, single dimension table is deconstructed into a tree structure with potentially many nesting levels. Snowflaking a dimension generally compromises user understandability and browsing performance.
Snowflaking
The (undesirable) act of normalizing a dimensional model.
Star Schema
A generic representation of a dimensional model in a relational database in which a fact table with a composite key is joined to a number of single level dimension tables, each with a single primary key.

Tags: , , ,

1 Comment

Live from Kimball University: Day 4 (ETL, Addresses, Applications, BI Toolkit?)

Ok, so I’m a day late! I had to catch a train back to Brussels, and considering I hadn’t seen my family since Monday morning, I didn’t spend last night blogging! Besides, the four days were fairly taxing. Whether you think you know everything about dimensional modeling or not, the material was fairly dense and Warren and Margy both moved fast.

If you are serious about data warehousing, and call yourself a data warehousing professional (especially if you are a “dimensional modeler”), then attending this and other sessions is a must. I learned more Warren and Margy directly than I could have from their books. And hearing about their various experiences added an extra dimension (no pun) to the education.

Now, if we can just convince them to create a certification exam

ETL System

The entire morning was focused on ETL Design and Development. The first step in designing the ETL processes is to meet a basic set of requirements. These would include having all business requirements fleshed out; a “solid” dimensional model ready to go; data quality, archiving, lineage, and management infrastructures defined; compliance, latency, and security issues addressed; and finally a good understanding of all application requirements. With these in place, it is possible to create the ETL System.

Basically, the ETL system is in charge of converting source system data to the data warehouse database. Don’t get hung up on the acronym though. Depending on the circumstances, the process may feel more like ELT, or ELTL, or E&T, more T, some L, and more T.

Warren recommends using an ETL vendor tool mainly because the tools handle technical metadata requirements and likely have a lot of other very helpful functionality built in (logging, standard tasks, connection managers, dimension management, surrogate key lookup capability, in-memory pipelining, etc..).

As many of you know, I am an advocate of hand-coding ETL, especially using Visual FoxPro. VFP makes it easy to build in all the things that other ETL tools do with little effort. In addition, VFP is data centric and object-oriented, and if the ETL processes are designed correctly can be totally metadata driven. With that said, ETL tools are getting better and better. If you have a couple of developers in house that are experienced with FoxPro or another data-centric language, then by all means go for it. If not, then using an ETL tool is likely your best bet.

Regardless if you hand-code or buy, Warren defined the following steps that are needed to develop the ETL system:

  1. Create a high-level plan for the target model
  2. Install and learn the ETL tool!
  3. Determine your default strategies
  4. Drill down by target table
  5. Develop the historical load for dimensions
  6. Populate historical facts
  7. Dimension table incremental processing
  8. Fact table incremental processing
  9. Aggregate and OLAP loads
  10. ETL operation and automation

Nothing earth-shattering, but I have seen first-hand how ETL systems can stall and flutter along because one or more steps were ignored or done out of order. The Kimball Group has spent a long time together developing their best-practice and practical techniques so that we can all benefit. This list is just another example; it’s a simple pattern to follow that will likely lead to greater successes.

Name and Address Fun

As part of our ETL discussion, we spoke of data cleansing and conforming. One major challenge that data integrators have is dealing with very dirty data from multiple source systems. One good example is the challenges associated with name and address matching.

Warren showed us a series of slides with scanned address labels delivered (miraculously in some cases if you saw the slides) to his house. His name was butchered in every which way. Some had his name as Thorn. Another had him as a woman. It goes to show that data quality and name and address cleansing is very hard to do. When you consider that some companies spend millions on their mailings, this is a big deal: And they can’t even get it right!

BI Applications

After lunch, we dove right into a discussion on Business Intelligence (BI) Applications. One thing Warren said really stuck out to me, and it is something I wrote in big letters on my copy of the presentation slides: BRANDING.

BI Applications are in fact a way for us — as DW/BI professionals — to brand our services. We create our image through the BI Application layer. Despite all the background and expertise we need to do all other phases of the LifeCycle, the executives, managers, and users will see us through our applications.

I admit, I had an “ah-ha” moment when this came up. Everything from Portal design to reporting templates to our logo (yes, we need a logo too!) is equally as important as cleansing and conforming data, designing a perfect data model, and conducting hours upon ours of requirement-gathering interviews.

I only wish that we had more time to spend on this topic. An entire day would have been great.

BI Toolkit

We need a BI Toolkit. A vendor-neutral, general discussion on what it means to deliver BI Applications to our users. I talked with both Margy and Warren about this and both seemed to agree (as does their publisher). It’s been in their sights for some time now, I suppose. While the other toolkit books have been an invaluable reference for the entire community — this book could literally change the way data warehousing teams and businesses consider BI, BI Applications, and the various possibilities that BI offers.

Personally, I would like this book to take more time delving into user requirements. I don’t want it to tell me why I need one p-value over another, nor do I want it to tell me to use Business Objects because it can support feature X or Y. I need a reference that reminds me that developing a scorecard requires multiple business processes, that building a reporting template to pass around to business users can be a great tool to flesh out design decisions, and that perhaps a “BI Matrix” needs to be developed so that each BI Application can be resolved against each underlying business process dimensional model. Ideally, it would be a book that a business sponsor can read and digest, while providing the Data Warehousing team with enough technical knowledge to appropriately evaluate tools, build strategies, and formulate roll out plans that give the user the biggest and quickest return on their investments.

Lastly…

It’s a bummer that Warren had to race through the deployment and maintenance materials. We really did have a good class though, and a lot of great questions were asked. You can tell that most people in the room really “get it”, and their questions all reflect that. Of course, this led to a few diversions and a tangent or two. We paid for that in the last section. Thankfully, I have a brand new, 2cnd Edition Lifecylce Toolkit book to refer to!

This was a great experience for me, and I met a lot of great people. If you haven’t been to one of these courses, I recommend it no matter the cost.

If you have questions or comments for me, don’t be shy. I will be writing more and more about these topics in future postings, and I’ll also be picking up where I left off with my postings on the ETL Subsytems 1 through 34. So if there is something you would like me to address first, let me know!

Tags: , , , , , ,

8 Comments

Business Intelligence Through Web Analysis

I just received a surprising email regarding my previous post about Analyzing Apache’s Raw Access Logs in FoxPro. The commenter wrote “dude, you’re nuts. Why waste your time on this?!? I use Awstats and that works fine”.

I sat back in my chair quite puzzled. After all, I know that Awstats doesn’t even come close to giving me the answers I need to grow my business and website. Then I realized my folly: I jumped right into my example in my last post without fully explaining the goal of the project. I’ll try to redeem myself now.

Awstats and similar tools (in my case, provided with cPanel) are retrospective reporting tools. They give you nice charts and display some very interesting numbers and facts about the types of traffic generated on your site. If you’re good with numbers and can associate events to dates (in your head) then you may be able to notice some interesting patterns (like, “hey – it seems that whenever I post a new blog entry, my hits double!”).

But this isn’t good enough for more serious projects (but isn’t a bad place to start either).

When growing your business (whether you’re a blogger or selling widgets), this type of information is invaluable. You need good, consistent, scientific analysis to pull it all together (hunches and gut feelings don’t count). Pool raw data (from a variety of sources), integrate it, clean and add value to it, and compile it to create some incredibly useful and valuable information (Read: Business Intelligence). This information can help you make decisions like (a) how much to spend on advertising, (b) whether or not to sponsor an event (such as FoxForward), (c) how many blog posts to make per week to keep the interest of readers, (d) should I sign up with Google AdSense, or (e) what design elements and layout plans are making the greatest impact.

Data smog is a real issue, however. Too much data can not only waste your valuable analysis time, but the integration of this meaningless data will do nothing but eat away at your resources (and give you a headache). The key then is to do a little preparation before you begin a project like this (duh!). I think there are two important steps (1) monetize all elements of your business, and (2) identify all key performance indicators (KPIs). Armed with this information, you will be able to build a dimensional model (in VFP of course!) with an incredibly rich fact table. Monetizing helps you assign value to all your tasks, and KPIs allow you to measure the benefits of these endeavors. Throughout the course of this project, I’ll be itemizing these two items in greater detail. As an example, for my blog todmeansfox, I’ve monetized the following items:

  • Posting a new blog entry: $90.00 / post (my time and effort to post based on my current rates, discounted)
  • Responding to posts: $10.00 / comment
  • Social networking: $90.00 / contact (includes setting up accounts on del.ico.us for example)
  • Advertising: $40.00 / free advertising (example, updating my fox.wiki profile)
  • Advertising: case by case (I have not done any advertising yet)
  • Sponsoring: case by case (I have not sponsored an event yet)
  • Updating CSS files: $60.00 / hour (how much does it ‘cost’ me to update my site’s layout and colors?)

Next, I tried to identify all important KPIs:

  • referrals
  • length of stay
  • out clicks
  • total number of hits and page views
  • total number of comments made
  • total number of emails received
  • Blackstone Providence leads
  • consulting job hours from website leads

Of course, as I do more data profiling, I may uncover some other very useful pieces of data that I can later integrate.

The goal of this project, therefore, is rather simple: make better business decisions. Using the Internet, I will gather the right data, integrate it in a meaningful way, and use OLAP to analyze and report on the findings. I’ll use data mining, profiling, and trend analysis to identify abuse and spam, as well as identify areas were improvements in layout, methodology, and content can make a greater impact. My hope is to generate more business by leveraging my current assets. On top of that, I want to do it in the open so others can benefit.

The first step in my process is data profiling, where I’ll gather the data I think I can use, analyze it, test it for quality, and prepare it for my dimensional model. My last blog post attempted to start that process by simply downloading and parsing the daily Apache raw access log file (which I should mention is in the NCSA combined/XLF/ELF log format, in case you were wondering).

As you can see, Awstats can only get me so far (actually, not that far at all).

Tags: , , , ,

2 Comments