Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for March, 2008


Published March 29th, 2008

FoxPro a Big Hit in China

Did you know that “a survey showed that Microsoft’s FoxPro database application accounted for 65% of database installations in
China, yet Microsoft had not sold a single legal copy in the country.” (source)

This blows me a way. I know pirating is bad in China but this is just unbelievable. Imagine if 10% of those installs were legal. Microsoft would have additional pressure to carry on the language (and would have plenty of additional revenue as well).

Anyway, that article is a great read for anyone interested. Now I can blame Bill AND China for VFP’s fate.

Published March 27th, 2008

What is MDM Anyway?

What is master data and what is the master data domain? What does it cover? Where is the business value in MDM? Is MDM a data warehousing function? How can business users be sold on the MDM investment?

While the industry dukes it out over answering these questions (you can’t get two experts or vendors to give you the same answer to any of those questions), I thought I’d share my thoughts. I admit, that my MDM experience to this point is minimal, having only been involved in one official MDM project (and only at the planning and data modeling level). But nevertheless, I do have some of my own ideas.

The bottom line, though, is that MDM is important and should not be overlooked when planning for new business intelligence or enterprise integration initiatives.

What MDM is

Master Data Management, an information activity, is the process of ensuring that an organization’s data (including its metadata) is consistent, reliable, accessible, distinctive and well defined. This master data, once it meets these requirements, can then be used by the enterprise as a system of record for key business entities and attributes.

There are different types of MDM: Analytical (A-MDM), Operational (O-MDM), and Enterprise (E-MDM). I see it like this: A-MDM is related to data warehousing and would therefore be implemented solely as a compliment to (or a reaction from) a data warehouse/business intelligence project (I don’t agree that simply having conformed dimensions constitutes “having” A-MDM, though). O-MDM is about collecting, managing, and redistributing master data to be used by operational systems (which is largely an effort in synchronization). E-MDM, is the Holy Grail, and would be a combination of both O-MDM and A-MDM. E-MDM reminds me of Enterprise Data Modeling (EDM), in that you truly need a 360-degree view of the business to make it work (READ: get business and IT together for tea).

Data governance and stewardship, as well as data quality management, data integration, and service-oriented architecture (SOA) are all functions and processes related in some way to Master Data.

What MDM is Not

MDM is not a technology. It is a business function. You are either managing master data, or you are not. It is unlikely that any single approach or software technology will solve the Master Data problem. It drives me nuts to hear about how vendors sell MDM (and BI, for that matter), but I digress…

The work of developing and maintaining master data is not a data warehousing function. To restate: Data warehouses are not developed to create master data (even A-MDM). Some will disagree with me on this. In my opinion/observation, MDM is a separate, highly important activity that works in conjunction with the data warehouse and all other applications and processes that exist in the enterprise.

Other Thoughts

I think that a common mistake made with regard to planning for, and implementing MDM, is that there is not enough emphasis on the data quality gains that will be realized. In addition, the data integration process for data warehousing projects will be infinitely easier, as the complex work of conforming dimensions and attributes will already be complete (at least in definition, with plenty of good metadata to use as a reference).

Master data, in many ways, provides a new way of looking at data as an asset with tangible, strategic value.

I’ll post more on this topic as time rolls on. As always, thoughts, questions, and criticisms welcome!

Published March 25th, 2008

ETL Subsystem 9: Slowly Changing Dimensions (SCD)

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

To this point, I’ve discussed “Data Extraction” and “Cleansing and Conforming” ETL Subsystems. Now, I’ll shift attention to data delivery (the “L” in ETL). As I have already discussed Slowly Changing Dimensions in a previous post, I’ll only highlight some of the points I think are most important here. So please, if you haven’t already, read my previous post “Dimensional Modeling: Loading the Slowly Changing Dimension” before you continue.

Some important takeaways:

  • Dimensions change over time. We can identify how to process these changes by using SCD Types 1, 2, and 3 techniques.
  • Dimensions (especially large ones) sometimes need a hybrid approach. This approach is sometimes referred to as a Type 1.5 SCD, which allow a dimension to be added with incomplete data. Updates are processed later as if the dimension was a Type 1 (empty and incomplete attributes are overwritten instead of retaining history) until the entire row can be labelled as “complete”. To indicate this, I have found including a “complete” flag in each dimension is a clear way to mark Type 1.5 dimension rows as complete, or in need of additional data.
  • Like the Type 1.5, sometimes a dimension row must be added with no details. These are called Inferred Members. In this scenario, just a natural key and the system-generated surrogate key are added. The dimension is then updated much like the Type 1.5 SCD until its data is complete.
  • For Type 1.5 and Type 2 SCDs to work, the row needs some sort of status indicator, the best being active “to” and “from” dates.

 

SQL Server 2005 Integration Services (SSIS)

I already mentioned in my previous post how SSIS’s Slowly Changing Dimension Task can be used to meet the requirements of this subsystem. Again, my only word of advice is to be careful about the amount of data you plan on passing through the generated SCD Tasks. I have seen performance degrade substantially on large datasets. It is best, in these scenarios, to introduce additional logic before the SCD logic to filter out inserts, updates, and deletes into different trees.

Also, it is a good idea to fully understand the different SCD types (inferred dimensions as well) before running through the wizard. Not only will the entire process be clearer, but you will better understand how to optimize the generated data flow when all is created.

Hand Coding with Visual FoxPro (VFP9)

I’ve had great success using FoxPro for hand-coded dimensional modeling projects. My first stab at loading Slowly Changing Dimensions was in 1998, using VFP6. Using a database trigger, I wrote code to update a denormalized data model that was used for reporting and ad-hoc queries. At the time, I had no idea what a SCD was, but somehow, instinctively I implemented it pretty close to the tried-and-true Kimball methods. That system is still in use today (the app is in VFP9 now) and I have redesigned it to match the Kimbal SCD methodology much closer than I originally conceived it.

If you are involved in a similar VFP project, or are at all interested in learning more about how VFP can be used in this way, please do contact me. I’m pretty passionate about VFP and Dimensional Modeling, so I’d love to hear from you!

Published March 22nd, 2008

Geek Love

Sorry I missed this the other week, but this graphic is absolutely priceless.

It is by Sam Potts and was published by The New York Times Company on March 9th in response to the death of D&D creator Gary Gygax. I realize that exposing myself as a longtime D&D fanatic might get me the “did you play with real swords”-type-questions, but hey, it was great fun.

Geek Love

Click the image to enlarge.

At the risk of fully exposing myself as a geek (no–I am not into nerdcore, though), I have to confess: I worked at the Georgia Renaissance Festival (and have been a fully costumed patron on many occasions), own pewter figurines, am a web designer, taught myself FoxPro (not Pascal), I do wonder about the technical accuracy of the diagram, I am blogging about it, and the analogy between D&D and Google is pretty darn good!

By the way, the secret to mixing girls and D&D is to find the girls who play too. If you can’t find them… then convert them (sorry Katie, Kathy, and all the other girls exposed to Orcs, Green Dragons, and a group of geeky guys eating Taco Bell and drinking Mountain Dew till 4 in the morning)!

Ok, I promise: no more D&D postings on the blog…

Published March 20th, 2008

More Reviews on the Way

Anyone have some good reading suggestions?

I read a lot, and now that I have a long train commute each day into and out of Brussels, I’ll be reading much more. Some books on my radar include “Blink: The Power of Thinking Without Thinking” by Malcolm Gladwell, “SOA Approach to Integration” by Ramesh Loganathan, et. al., and a re-read of the “Data Modeler’s Workbench: Tools and Techniques for Analysis and Design” by Steve Hoberman.

I really enjoyed Gladwell’s “Tipping Point”, and have had Blink on my shelf for several months. As soon as my US taxes are finished (*sigh*), I’ll start on it.

On the tech front, I’m pretty excited about reading “SOA Approach to Integration”. The book focuses on WS-BPEL (see WS-BPEL 2.0) as well as Enterprise Service Bus.

I read the Data Modeler’s Workbench a few years ago. Since I’ve matured as a data modeler and have entered a few new industries (clinical, financial), I think it is time for a re-read. But I’ve got to buy it (again) first! Steve gets my money twice :-s

Published March 17th, 2008

ETL Subsystem 8: Data Conformance

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data conformity is one of the most important and fundamental aspects of data warehousing. Data conformity means that a data element’s label (column header) and content (which includes data types, lengths, and nullability) are consistent across multiple tables and across multiple business processes.

In dimensional modeling, a single dimension shared across business processes is a conformed dimension. Shared dimensions allow for drill-across queries where a user could conduct enterprise-level analysis across several business processes (inventory, sales, procurement, etc.). Conformed dimensions, as a data quality technique, also cut down the likelihood of data inconsistencies in an organization.

Margy Ross of the Kimball Group states that “Using conformed dimensions ensures that the data warehouse is delivering consistently defined attributes for labeling, grouping, filtering and integrating data from multiple business processes.”

Dimensional conformity also ensures reusability. Reusing dimensions across multiple business processes is what makes dimensional modeling so simple and scalable. The Product dimension, for example, would be the exact same physical table used in the Orders, Inventory, and Procurement dimensional models.

Obtaining Conformity

There are some challenges associated with obtaining conformity. Some of these challenges might be political, where business users refuse to agree on common business definitions. Most of them, however, are in the planning, data modeling, and system architecture, where individuals fail to identify or recognize where conformity exists. This is typically a direct result of failing to do adequate data profiling! Not having data stewards available to consult with on specific data issues could also lead to some slow times.

The first step to conforming dimensions is to define the business processes that will be used in the dimensional model. The next step is to conduct a data profile to identify and document the entities and relationships in the source system. Finally, identify the common dimensions across business processes.

The question then becomes: How do we determine which dimensions are conformed across business processes?

The answer is the Bus Matrix. The Kimball Group does an excellent job making the case for the Bus Matrix in their books and articles. In short, it acts as a roadmap for a dimensional modeling project. Check out “The Matrix” and “The Matrix: Revisited” on the Intelligent Enterprise Website. I won’t attempt to describe the Bus Matrix here for the sake of brevity, but may share some thoughts in a future posting (which I will likely title “The Matrix: Reloaded” just to be cheeky!).

The Bus Matrix will tell you exactly what your conformed dimensions are.

Conforming Attributes

By using conformed dimensions, it is possible to conduct drill-across queries. You might, for example, want to do some analysis on retail sales and inventory. Retail Sales and Inventory business processes exist in the data warehouse as two separate dimensional models. The cool thing is that they share dimensions (ex. Product). It is rather simple then to relate the fact tables together using the Product dimension based on some product characteristics such as “size”, “weight”, and “cost”. These drill-across queries can lend a great deal of insight into an organization’s various business entities.

Filtering and searching for data is also simplified if, for example, all names are in UPPER case and stored in atomic parts (last, first, middle, title).

You should strive to get as many attributes in various dimensions to conform. Attributes such as Name, Length, Weight, Size, and Color should mean exactly the same thing across all dimensions. If not, then they must be named differently. Conformity on this level lends to the simplicity of the overall design and makes operator queries simpler to write and understand.

Take for example Customer and Shipper dimensions, both with an attribute called “state”. In the Customer dimension, state is abbreviated to 2 characters, but in the Shipper dimension it is spelled out. You would not be able to conduct a query on all shippers and customers in the same state without the need for some lookup table. This could confuse and likely frustrate the operator. If you must spell out the state’s name, re-name the column to “state_name” (and add a column called ’state’ with the abbreviated value)!

Allowing business users to relate multiple business processes together using commonly defined metrics (”facts”) also adds tremendous value to the data warehouse database. For example, a metric such as “discount” would not be a percentage in one fact table and an actual dollar amount in another. Instead, convert one to the other during data integration, or create two metrics: “discount_percent” and “discount_amount”. To take this a step further, in order for “discount_amount” to be conformed, it must be stored in some common currency (such as EUR or USD).

Not all attributes and metrics can be conformed, but the more you can, the more flexibility you will have with your drill-across queries. If you cannot conform a particular attribute or fact, then be sure to change its label so that end-users won’t make an assumption that it means something that it’s not!

I find it intuitive to identify attributes and metrics that are conformed across multiple source systems once the conformed dimensions have been identified and a data profile conducted. Where you need to be careful is when two column labels are named the same but mean (sometimes slightly) different things. Product color, for example, might be entered as a color’s name (red, black, blue) in one system, and a hex or RGB value in another. Multiple systems might have different product identifiers, all labeled “Product ID”. “Distance” could be in kilometers in once system, and in miles in another. Fahrenheit to Celsius conversions may be necessary. All of these issues need to be worked out during the data integration process so that “distance”, “color”, and “Product ID” mean exactly the same thing across Dimensions and Facts.

A 3lb red widget from NY is a 3lb red widget from NY is a 3lb red widget from NY!

SQL Server 2005 Integration Services (SSIS)

SSIS has many transformation tasks that you can use to enforce conformity in the dimensional model. In fact, there is not a single data conformity task that I have run across that SSIS does not provide a solution for out of the box (well, with some help from the Script Task, that is). Because many transformation tasks can be used to enforce conformity, I won’t enumerate them here. The challenges of conforming data truly lie in the planning and design process.

Hand Coding with Visual FoxPro (VFP9)

As with SSIS, there is nothing specific about hand-coding regarding conformed dimensions. So much rests on planning, mapping, data profiling, de-duping, and cleansing, that the process of conforming is just a matter of implementing the conversions and mappings necessary to build the dimensions so that they adhere to conformity. Check out my post on Loading Dimensions for some further details.

For some examples on some things you might do in FoxPro code to achieve conformity, take a look at my previous posts “Compare a String to a List of Formats” and “Conforming Temperature in FoxPro (some conversions)“. These are two examples of things you may need to do to enforce conformity.

From Here

In my next ETL Subsystem post, I will discuss Subsystem 9: Slowly Changing Dimensions (SCD). If you want to know more about Conformed Dimensions and the theory behind them, I recommend reading any of Kimball’s data warehousing books or articles that discuss the subject. He lays out the case quite nicely for conformed dimensions. I would like to stress that the success of the dimensional model rests in the ability of the data modelers and integration architects to design the data warehouse using as many conformed dimensions, attributes, and facts as possible.

There are not too many other blog postings (ex: 1 2 3) on the topic of data conformity that truly dive into the topic. I would love to see more, as this is such an important subsystem and best practice.