Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for August, 2008


Published August 29th, 2008

Top 7 Reasons I Wear a Suit

I dislike wearing suits.

It used to be that I could code in my favorite Phish t-shirt wearing sandals. I had a key instead of a badge, and lunch usually meant a few greasy pizzas or clam cakes. In those days, my attire only meant something if there was an off-site or if clients were coming to visit “the shop” (which was a tiny building several miles from the heart of the big city). I could easily bounce back and forth between long and short hair and between full beard and cleaned-shaved. Ahh… those were the days.

Now I work in a major international city for a rather large bank. I code in a suit when I’m not in meetings, wear nice shoes, carry a badge, and eat salads and yogurt for lunch. *sigh*

To be fair, I enjoy the new challenges and the big city. And if wearing a suit on occasion is a consequence, I can live with it. So while a suit is not fully mandatory, I still wear one at times. Here’s why:

  1. It easily puts me in line with the dress code
  2. Dressing is simpler in the morning (although sometimes it takes a couple of tries to get the perfect knot in my tie)
  3. My wife tells me I look great
  4. Dressing down on Friday never felt so good
  5. I look more important than I am
  6. I feel more important than I am
  7. My jacket flaps behind me in the wind when I ride my bike to the train station, which makes me feel like a super hero with a cape

Other than those fantastic reasons, wearing a suit is a real drag.

Some man on a bike wearing a suit in Europe (I do admit, there is something rather Monty Pythonish about wearing a suit on a bike. I bet I look pretty silly to the folks driving past me. But riding my bike gives me more than 30 minutes a day of much-needed exercise, and on top of that, the price of gas here in Europe would blow your mind!)

Published August 26th, 2008

Analytical Databases

ying yang of relational verses analytical databasesMost of what I discuss on Tod means Fox (and especially in my 34 Subsystem series) has to do with relational database engines (SQL Server 2005 and Visual FoxPro in particular). For most data integration projects, desktop and web solutions, and data warehouses, the relational database is all you’ll need. It can be used to create and manage both relational and dimensional models. But in Business Intelligence applications, chances are that you’ll need an additional, more specialized form of multidimensional data storage and retrieval. Analytical databases cover this need. In fact, Dr. Codd, the inventor of the relational database, often explained how analytical databases are a “necessary companion” to relational databases.

The OLAP Story

OLAP stands for Online Analytical Processing. It is a terrible name to describe multidimensional analytical databases. Instead of a more meaningful name, like FASMI (Fast Analysis of Shared Multidimensional Information) proposed by Nigel Pendse of “OLAP Report” we seem to be stuck with OLAP. With great pain, I will continue to discuss analytical databases using the OLAP terminology.

Basically, an OLAP (*sigh*) database has certain characteristics that set it apart from relational databases. And, coincidentally, Pendse included all characteristics in the FASMI acronym. Check out the OLAP Report for more details. If you’re just getting into business analytics, or are in the process of evaluating analytical database software – you must visit this site!

The Smartness Factor

I like to think of the relational database as “dumb” and the analytic database as “smart”. Let me explain:

The primary focus of Analytic servers is to get the (often pre-calculated) data out of the database as quickly as possible, allowing the user to zoom in and out along different hierarchies. Contrast this with the Codd’s relational model which seeks to eliminate data anomalies at transaction time through normalization: data retrieval is slow and often complex. With this in mind, you can get a feel for how a database engine can store, catalog, and retrieve data differently. (In fact, this is what makes dimensional modeling in general so favorable for querying and analytics – it is not bound by the restrictiveness imposed by normalization. I’d also like to note that as with dimensional models, OLAP databases are multidimensional.)

While the relational database plays a very important role, so does the multidimensional (OLAP) database. It’s simply built differently to service a different need.

Primarily, analytic servers can manage aggregates, aggregate navigation, and complex calculations and summaries across hierarchies at blazing speeds. These skills were borrowed from both navigational database designs (think of the DOM or a Wiki) and hierarchal designs (trees with leaves and branches). A lot of this has to do with how the data is stored.

Storage

Analytics servers offer a different type of storage. MOLAP, or Multidimensional OLAP, can be much more efficient than relational engines. Some tools (like SQL Server Analysis Services) allow you to store your analytical databases in a relational way (ROLAP) or using a hybrid approach (HOLAP). Personally, I see no benefit at all with ROLAP, aside from real-time systems where using the relational database is a must because of the latency involved with updating the OLAP database. (Not to digress too much here, but even real-time environments can use MOLAP storage by using partitions and caches correctly.)

MOLAP is more natural and faster. This may differ depending on your tool choice, but I beg someone to tell me differently. For a detailed discussion on how MOLAP and ROLAP engines store data, you can try this resource “Analysis Services 2005 Performance Guide” (which defines the way Analysis Services does it).

HOLAP storage is an interesting option and could actually perform well –even better than MOLAP in some instances — with less disk usage. But a lot of this depends on how many aggregates are defined and how often the system needs to query the relational database. It might be a good consideration, but if you’re not sure and you have plenty of disk space (disk space is cheap), then go with MOLAP.

The smartness factor2 (the MDX language)

I like to think of SQL as “dumb” and MDX as “smart”. Let me explain:

The MultiDimensional eXpression (MDX) language was created about 10 years ago by Microsoft to work specifically with multidimensional data stored in analytical servers. This OLAP-specific language improves upon the SQL syntax, removing much of the bulkiness associated with the language. MDX is an elegant and highly relevant partner to the analytical database. While you could get out what you need using SQL, most every SQL statement will be a challenging one. It’s like cutting down a tree with a hand saw as opposed to using a chain saw. MDX is painless and intuitive. Concepts such as dimensions, hierarchies, and sets are built into the syntax. For more on MDX and the mdXML standard, please visit XML For Analysis.

Central to the MDX language is the cube concept, which deserves a proper introduction:

Cubes

OLAP data is stored in structures called cubes. As you know, a cube is a 3-dimensional solid that, given a point on three of its faces which form a vertex, can take you to a precise point somewhere within the cube itself. A cubeThis point represents the aggregate metric you want to view. The faces represent the different dimensions (like Product, Customer, and Time) that are used to find the point. The dimensions are further broken down by its hierarchies on each face.

Cubes are sometimes — and more aptly — called hypercubes (or a tesseract or “n-cube”) because analysis often makes it necessary to examine more than three points. As soon as you look at more than three faces, you need higher dimensionality.

I admit, when I first heard about cubes I was a bit intimidated. I felt that they were too complex and advanced. Now, working with cubes is natural. Notice though that the concept of a cube is almost the same thing as a star schema. The only difference (at least that I can really think of) is that a star schema generally stores atomic data, and barring any usable aggregate fact table, calculations need to be done on the fly. The cube theoretically stores the results of these calculations inside the cube.

Dimensional Modeling

If you read my blog, then you know I advocate dimensional models for proper data warehousing. If you also advocate and use dimensional models, then the concepts discussed so far will fit you well.

What’s great about using Dimensional Models in your relational database is that your OLAP database will almost mirror the design, making the development and deployment of your cubes so much easier. Not only that, but the cubes will likely load faster and be easier to maintain when built off of the dimensional model. Note you can still achieve most of what you can do without an analytics server, just by using a properly constructed set of business process dimensional models. The OLAP database is just smarter, making several tasks (navigating hierarchies, security, and calculations to name a few) easier.

Published August 18th, 2008

FoxPro on the Mac via Parallels

As you may know, last weekend I bought my first Mac. In a week, I’ve made a ton of progress and am starting to feel rather comfortable with the new OS. My first impression remains strong: The Mac is awesome. It takes fewer clicks to get things done, it is faster than PCs of equivalent power, and it is much more intuitive. There are still a few things to do, like check with Adobe about cross-platform upgrades of Dreamweaver, Photoshop, and Acrobat.

I have, however, managed to place one very important piece of the migration puzzle: Getting FoxPro up and running.

Virtualization

To get Windows running on the Mac, you basically have two options: Dual boot via Boot Camp or use Virtualization. Boot Camp was not a great option for me because I want to work in the Mac and share items between the Windows install and OSX. For virtualization, I seemed to have three choices: VirtualBox by Sun, VMware Fusion, or Parallels Desktop for Mac.

Reviews for Parallels were slightly better overall, and I got a recommendation from a colleague, so I went with it. For less than 100 Euro, this software is simply incredible. Installation was as simple as downloading and running the installer. These were the steps I took:

  1. Downloaded the latest version from parallels.com
  2. Double-clicked “Install Parallels Desktop”
  3. Went through the Wizard, clicking Continue through each step
  4. Accepted the terms of the license
  5. Selected my Mac’s HD for the install destination
  6. After the install finished, I moved the icon to the Dock
  7. Started the software and from the Help menu clicked “Activate Product”
  8. Entered the activation code that Parallels sent me via email
  9. Clicked ‘New’ from the File menu to create a new VM via the OS Installation Assistant
  10. Selected “Windows Express Installation”, Next
  11. Selected “Windows XP”, Next
  12. Entered the XP product key, set some basic settings, and gave the VM a name
  13. Chose to allocate more resourced to the VM than to Mac OSX
  14. Clicked Finish

Next, I installed Windows XP Professional using the new VM window created by Parallels. This was the fastest XP install I’ve ever done. Not only that, but the entire process was the same as if I had been on a PC (just in a window — which is a bit ironic). When Windows booted for the first time, and I heard the familiar Windows theme, I felt a sense of cool calm overcome me. This was going to work.

Ejecting the CD

During install, I had to insert my Windows 2000 disk (I have an Upgrade copy of XP Pro). I think it took me about 30 minutes to figure out how to eject the XP Pro CD from the Mac! Here’s how I did it:

  1. Right-click on the CD icon in the footer of the Parallels Desktop window
  2. Choose ‘Disconnect’
  3. The CD drive will re-appear in OSX’s Disk Utility application, where it can be ejected.
  4. Right-click on the CD icon in the footer of the Parallels Desktop window to reconnect when ready

When the VM is running, it ‘owns’ the CD drive. Because I was installing XP Pro, I didn’t have any obvious way to eject the disk. There’s no button on the Mac and I wasn’t yet ready to straighten out a paper clip! The above steps are a bit of a hassle, but certainly doable.

Two-times the Charm

Of course, I had to do it all twice (my fault). I have an XP Pro version 2002 CD without any Service Packs. After Windows installed, I went straight for Windows update and let it install SP2. It failed with some read errors and then Windows would no longer boot. So, I deleted the VM and started over. No biggie. This was easy. This time, I installed SP1 and then went straight for SP3. That worked and now I’m golden. After getting all the other updates, I was ready for the Fox.

Installing Visual FoxPro v9

Ok, so there was nothing to installing FoxPro. I literally put the CD in, installed the prerequisites, and then VFP. No surprises, no issues, no problems. I ran some basic checks and so far everything seems to run fine. I’ve yet to do any development but if I have any issues, I’ll be sure to report them.

I recall at FoxForward 2007 seeing a few Macs. If you’re using VFP on your Mac — let me know about it. I’d like to know how you’ve done it (Boot Camp? VM Ware?) and if you’ve encountered any issues.

Lord Voldemort

Lastly, I’ve named my XP Pro Virtual Machine “Lord Voldemort” of Harry Potter fame. My entire home network is named after various Harry Potter characters. I named my iMac Dumbledore.

This seems totally appropriate.

Published August 11th, 2008

ETL Subsystem 19: Aggregate Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

I spent the last two posts describing aggregate facts and dimensions and explaining the differences between aggregates and summaries. I wanted to do that in a separate thread so that I could focus on Kimball’s “Aggregate Builder” without spending too much time introducing aggregates. So, if you haven’t already, please have a look at those two previous posts.

The key to subsystem 19 is that you budget for and build a system to maintain your aggregates. This can be easily overlooked. I repeat: This can be easily overlooked. Much like what happens with the very important and easily cast aside data profiling subsystem. Don’t let it happen to you!

Building Aggregates

Aggregates can be a challenge to build — especially if your relational or OLAP engines don’t do it for you. Actually, if your OLAP tool doesn’t handle building aggregates, and you are relying on it for your aggregates, then you should think about getting another OLAP tool! I’ll talk more about OLAP in my next Subsystem post.

When designing aggregates to build, consider two things: user access patterns and distribution of data along the hierarchies. You cannot hope to construct aggregates for every possible level of every hierarchy across all metrics in a fact. Instead, you need to assess how your users query the data and then, taking it a bit further, decide if creating an aggregate will make any real difference for performance based on attribute distribution. The main goal of creating aggregates in a data warehouse is to speed query performance mainly by limiting the amount of data required to obtain results. But building them can be expensive.

Generally speaking, every time you update, insert, or delete data into any dimension or fact, you will need to rebuild all aggregates that are tied to those attributes and metrics. If you have two or three dozen aggregates to build every time you load a few business processes dimensional models, then you can imagine the time and processor loads required. In many instances, there is a balancing act between getting the data into the warehouse and getting the data out.

Technically speaking, building aggregates is a simple matter in SQL. Often it consists merely of a GROUPed query, some aggregate or custom functions, and surrogate key management.

Here’s a tip: Embed surrogate keys in your dimensions for each hierarchy used in aggregates. This practice will allow you to avoid the costly process of generating new keys for your hierarchy at the expense of a few administrative attributes stored in each dimension. For example, if you are building a set of region aggregates, you might embed region_key, state_key, city_key along with region, state, and city in your geography dimension. The keys will be repeated in the dimension until they are rolled up, at which time they become unique. The new aggregate fact table will point to these keys, eliminating the need for a specialized surrogate key manager.

Navigating Aggregates

Depending on your technology (here, I talk about SQL Server 2005 via SSIS and Visual FoxPro only — see below), navigating aggregates may or may not be a simple matter. If you put all your aggregates into OLAP engines, then chances are good that your OLAP technology will handle aggregates for you. To do it yourself, say, if you were hand-coding this part of the data warehouse using Visual FoxPro, you have a few design questions to answer.

In an old, but still relevant article published by DBMS, Ralph Kimball defines four key design requirements when dealing with aggregates. They are:

  1. Aggregates must be stored in their own fact tables, separate from the base-level data. In addition, each distinct aggregation level must occupy its own unique fact table.
  2. The dimension tables attached to the aggregate fact tables must, wherever possible, be shrunken versions of the dimension tables associated with the base fact table.
  3. The base Fact table and all of its related aggregate Fact tables must be associated together as a “family of schemas” so that the aggregate navigator knows which tables are related to one another.
  4. Force all SQL created by any end user or application to refer exclusively to the base fact table and its associated full-size dimension tables.

Which leads to the simple process of navigating the schemas when a user queries the dimensional data. The process can be distilled into the following three processing steps proposed by Ralph Kimball in his article:

  1. Find the smallest fact table in the “family of schemas” referenced by a user’s query.
  2. test the query against the schema to be sure that all attributes and metrics are present. If so, proceed. If not, go back to step 1
  3. Rewrite the user’s query to use the smaller aggregates where possible.

If you’re faced with the challenge of developing your own navigator, this article is a must read!

SQL Server 2005 Integration Services (SSIS)

Build aggregates after loading the atomic fact data for each business process. This is usually done in a separate package that can be called from the master package. It may be appropriate to schedule these aggregate-building packages periodically throughout the week as opposed to after each load, but business requirements will dictate this approach.

The aggregate package will contain three important types of data flows.

  • Dimension shrinking
  • Surrogate key management
  • Aggregate fact loading

Within the dimension shrinking data flows, start collapsing the relevant dimensions to the grain of the aggregate fact table. At this time, you would also assign surrogate keys (or, if you use my tip described above, just include the appropriate key and designate it as the primary key).

If you’re particularly crafty, you could design this in a way that would allow you to collapse dimensions from the previously shrunken dimension — saving significant time and resources. For example, collapse the geography dimension first into geography_region, then from geography_region create geography_state, and then from geography_state create geography_city.

Surrogate key management (where you look up the keys in the dimensions to use as foreign keys in the fact) and fact loading (actually inserting into the fact table) are very similar to the normal ETL processes. The only difference is that in order to look up your keys, you need to generate a special dataset from the base atomic data at the proper grain using a GROUP BY which contains all natural keys of the newly shrunken dimensions.

Hand Coding with Visual FoxPro (VFP9)

I find that hand-coding works well with aggregates. You can hand-code dimension shrinking, surrogate key management, and aggregate fact loading quite easily. All that we’re dealing with is some basic workflow and simple SQL. The challenge comes in scheduling and paralleling. Certainly, we want aggregates to run after the atomic data is ready. How you do that in FoxPro is entirely up to you. I prefer to chain events, so that when one finishes, the other kicks off. Scheduling requires additional reliance on the OS or the use of a timer (a bad idea).

From Here

There is so much more to say about this topic. I would like to follow this up with some examples in both SSIS and FoxPro. Stay tuned… In the meantime, I’ll move on to talk more about OLAP.

Published August 10th, 2008

iTod

I have officially crawled out from under the rock and joined the land of the living. As a proud owner of a new iMac I feel both unshackled and invigorated.

iMacFor the past 10 years I have been convincing myself that I needed a PC and Windows (before that, I didn’t really take the Mac seriously). For one, all of my desktop development was in Visual FoxPro and .NET. Secondly, since my desktop applications were designed for PC users, I felt that it was important for me to have and use PCs (and Windows) like my clients. I would be developing in and using the environment that my users did, giving me more comfort and confidence when helping them use the software. But now that I’m involved strictly in Internet development and data warehousing, I no longer can justify the need for a PC; I simply do not have as much contact with my end-users as I used to.

My first order of business: I’ve got to set up a virtual machine on my iMac so I can install VFP9, .NET, and SQL Server. I may explore some other options, including Remote Desktop (or maybe VNC) into my PC, but I believe VM will be the best route. I’m dead set against having two computers on my desk at this point. The iMac just looks cool and I don’t want to cramp its style with my ugly PC box.

It is likely going to take me a while to get fully set up. I need to repurchase some of my key software (I need upgrades in all of them anyway): Dreamweaver & Photoshop CS3, Office 2008, and something to replace my old version of Cakewalk Pro Audio. And now that SQL Server 2008 is released, I’ll likely get that installed as well.

Published August 4th, 2008

Differences Between Aggregates and Summaries

I find the need to put down some of my thoughts regarding aggregates and summaries, especially on how they apply to SQL and DW/BI. To me, aggregates and summaries are not the same, but I have come to terms with the fact that most users, developers, modelers, and architects use the terms interchangeably. There are two pairs of terms this applies to (one pair is the noun form, while the other is the verb form):

  • n. Aggregate / Summary
  • v. Aggregating / Summarizing

Aggregate (plural aggregates) and summary (plural summaries) datum are created from functions placed upon underlying atomic data. The primary differences between the two lies in two areas: (a) what type of function is applied and (b) how the atomic data is organized as inputs to the function.

General rule: Any function that requires a grouping (typically along a hierarchy) is an aggregate, while any function applied to the entire population of data is a summary. Summaries can be derived from aggregates, but I am hard pressed to come up with an example where the reverse is true.

Functions such as SUM, AVG, MIN, and MAX can be used to derive both aggregates and summaries (by aggregating or summarizing), as you can apply them to an entire dataset or to groups within the dataset. In SQL however, you are forced to define a grouping when you include one or more attributes with the summary statistic. In this case, you have created an aggregate:

SELECT region, SUM(sale_amount) 
    FROM customer c 
    JOIN sales s ON c.cid = s.cid 
    GROUP BY region

A summary:

SELECT SUM(sale_amount) FROM sales

Note that you can create summaries from aggregates. Obviously, SELECT sum(sale_amount) FROM sales doesn’t really tell you much about your data. To aggregate by time or product line is much more useful. Consider the following, very useful aggregate and summary table:

-- get sales by region and month for 2007
SELECT region, sale_month, SUM(sale_amount) AS amount 
    INTO #tmpAggregate
    FROM customer c 
    JOIN sales s ON c.cid = s.cid
    WHERE YEAR =  '2007' and region = 'NE'
    GROUP BY region, sale_month
 
-- summarize the above by giving you total sales
-- in the North East for 2007. Note that this type of
-- summary is usually done in a reporting tool
SELECT SUM(amount) 
    INTO #tmpSumary
    FROM #tmpAggregate

Frequencies and other categorical functions are only valuable when applied against groups. Imagine a histogram without categories! Likewise, nearly all of the aggregate functions available in the majority of programming languages are most useful when applied against groups. This includes functions such as standard deviations, variances, means, and so on.

When an elementary arithmetic function (such as division) is performed on groups of data, the resulting calculation can be called a summary. If the function is applied within the row, it is called a calculation. To restate: arithmetic functions that span groups are a special type of summary. An example of this might be to find the ratio of one group as compared to another (a GROUP BY clause applied to two different study groups in the data).

While it might be a good idea to put together a table to illustrate my point about how functions and inputs affect the terminology, I’ll leave that exercise for another day. That type of project will take a little more thought!

Another difference between an aggregate and a summary is how the two are presented. Aggregates are rarely presented with the atomic data — not to suggest that the underlying data should not be available. Summaries most always come before or after the underlying data is presented to the user.

Am I off base? Do I have it all wrong?