Posts Tagged Quality

Fun with Serial Dates

I ran into a bit of an oddity today while working with serial dates in SQL Server 2005 and MS Excel 2003. I’ve dealt with this before with Excel using FoxPro, but don’t recall the specifics. Anyway…

A serial date is a sequential number, starting on January 1st, 1900 (or if you want compatibility with a Mac serial date, 1904) that represents the number of days since any date to that point in time (you can also represent time by using a decimal, but I won’t get into that here). Excel has a bug (or a “Lotus 123 compatibility feature” as some like to call it) in it that counts Feb 29, 1900 as an actual date, when in fact 1900 was not a leap year.

Hence the first bit of fun. Run this TSQL code:

SELECT DATEDIFF(dd,'19000101','20080606')

The above query returns ’39603′. So far so good. In SQL Server, you can now derive the date from the serial value by using convert:

SELECT convert(datetime,DATEDIFF(dd,'19000101','20080606'))

…which gives me ’2008-06-06 00:00:00.000′. So far so good.

The reason we’re using serial dates in the first place is to accommodate some models developed in Excel. The data is fed in directly from the database and calculations are performed.

When I take value 39603, put it into an Excel cell and re-format the column from General to Date, I get ‘June 4, 2008′! Two days off from what I get from SQL Server.

I already mentioned that Excel counts Feb 29, 1900. That’s one day. So why am I off by two? Here’s some more fun. In SQL Server, try the following:

SELECT convert(datetime,0)

You get ’1900-01-01 00:00:00.000′. But January 1st is supposed to be day one. Not day zero (must be a zero-based array sort of thing). When you try the same in Excel, you get a more appropriate (but equally odd) value of ‘January 0, 1900′. You’d think I’m dealing with scientific ephemeris or something.

SELECT convert(datetime,1)

So, in SQL Server, we’re off by a day. The first day shouldn’t result in the second day (’1900-01-02 00:00:00.000′) of the year as the above TSQL shows.

This helps to explain the 2 day offset between Excel and SQL Server.

Come on Redmond! Work with me here!

To solve (read: work around) the problem, we’ve decided to stick with keeping Excel happy (against my recommendation, but at the end of the day, this is a business decision). Our data integration packages add two days to the serial date we calculate from a YYYYMMDD value. When data is moved to Excel from the data warehouse, no transformation is done on the serial value and Excel calculates as expected. When we use the serial date in a SQL Server context, we need to subtract 2 before the convert.

This is not a very good design and will undoubtedly cause someone problems some day.

I recommend storing the ISO YYYYMMDD format instead. But Excel won’t format a value such as 20080606 to a date. What a shame! If anyone knows of a way, or can enlighten me more about this subject — do not hesitate!

Tags: , ,

4 Comments

A Data Warehouser’s Vocabulary (Part 1)

Partly inspired by a post entitled “The most important thing I know about Analytics is that no-one agrees what it means” by James Taylor and partly inspired by the section “Slowly Changing Vocabulary” in the book “Data Warehouse Lifecycle Toolkit 2nd Edition“, I have decided to compile a glossary of terms and concepts that I feel have some relevance to the data warehousing and business intelligence world. I’ll break this list into several postings, and I reserve the right to refine, enhance, clarify, and augment a definition at any time! When finished, I’ll make them a permanent feature of TmF.

With this list, I am not attempting to resolve any debates, nor am I attempting to invalidate or discredit a definition you may be using. These are the definitions I use. Also be aware that certain terms might hold different meanings under different contexts. If I need to use one of those ambiguous terms, I try my best to put a good context around it. For example, when I refer to “Data Mart”, I specifically mean “Atomic Business Process Dimensional Model”. However, there are times when what I mean is to describe a separate (perhaps normalized) database for a specific user or department (i.e. a throw-away sandbox for the big kids).

Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! Also, when I finish this list, and put these all together on a single page, I’ll be sure to include a reference link section as well.

So, without further ado, I give you the first group of many to come (A-Z):

Business Intelligence (BI)
A generic term to describe leveraging the organizations’ internal and external information assets for making better business decisions.
Business Process
The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.
Changed Data Capture (CDC)
Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.
Data Mart
A business process dimensional model.
Data Profiling
Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.
Data Quality
Assurances that the integrated data is consistent, complete, and fit to publish to the business community.
Data Warehouse Database
The largest possible union of queryable presentation data in a DW/BI System.
ETL
A set of processes that prepare source data for a Data Warehouse, adding value and confidence along the way. These processes include extraction, transformations (cleans & conform), and load operations. Note that the order in which ETL processes occur can be varied based on the situation. Some sources refer to the ET or just the E broadly as “Data Acquisition”.
Master Data Management (MDM)
Centralized facilities designed to hold master copies of shared entities, such as Customer and Product.
Metadata
All the information that defines and describes the structures, operations, and contents of a BI/DW system.
Operational Data Store (ODS)
A physical set of tables sitting between the operational systems and the data warehouse, or a specially administered hot partition of the data warehouse itself. The main purpose of an ODS is to provide immediate reporting of operational results if neither the operational system or the data warehouse can provide satisfactory access.
Staging
Physical workspace for data during the ETL process. Some data is temporarily staged, while other data may persist.

Tags: , , , ,

1 Comment

Live from Kimball University: Day 4 (ETL, Addresses, Applications, BI Toolkit?)

Ok, so I’m a day late! I had to catch a train back to Brussels, and considering I hadn’t seen my family since Monday morning, I didn’t spend last night blogging! Besides, the four days were fairly taxing. Whether you think you know everything about dimensional modeling or not, the material was fairly dense and Warren and Margy both moved fast.

If you are serious about data warehousing, and call yourself a data warehousing professional (especially if you are a “dimensional modeler”), then attending this and other sessions is a must. I learned more Warren and Margy directly than I could have from their books. And hearing about their various experiences added an extra dimension (no pun) to the education.

Now, if we can just convince them to create a certification exam

ETL System

The entire morning was focused on ETL Design and Development. The first step in designing the ETL processes is to meet a basic set of requirements. These would include having all business requirements fleshed out; a “solid” dimensional model ready to go; data quality, archiving, lineage, and management infrastructures defined; compliance, latency, and security issues addressed; and finally a good understanding of all application requirements. With these in place, it is possible to create the ETL System.

Basically, the ETL system is in charge of converting source system data to the data warehouse database. Don’t get hung up on the acronym though. Depending on the circumstances, the process may feel more like ELT, or ELTL, or E&T, more T, some L, and more T.

Warren recommends using an ETL vendor tool mainly because the tools handle technical metadata requirements and likely have a lot of other very helpful functionality built in (logging, standard tasks, connection managers, dimension management, surrogate key lookup capability, in-memory pipelining, etc..).

As many of you know, I am an advocate of hand-coding ETL, especially using Visual FoxPro. VFP makes it easy to build in all the things that other ETL tools do with little effort. In addition, VFP is data centric and object-oriented, and if the ETL processes are designed correctly can be totally metadata driven. With that said, ETL tools are getting better and better. If you have a couple of developers in house that are experienced with FoxPro or another data-centric language, then by all means go for it. If not, then using an ETL tool is likely your best bet.

Regardless if you hand-code or buy, Warren defined the following steps that are needed to develop the ETL system:

  1. Create a high-level plan for the target model
  2. Install and learn the ETL tool!
  3. Determine your default strategies
  4. Drill down by target table
  5. Develop the historical load for dimensions
  6. Populate historical facts
  7. Dimension table incremental processing
  8. Fact table incremental processing
  9. Aggregate and OLAP loads
  10. ETL operation and automation

Nothing earth-shattering, but I have seen first-hand how ETL systems can stall and flutter along because one or more steps were ignored or done out of order. The Kimball Group has spent a long time together developing their best-practice and practical techniques so that we can all benefit. This list is just another example; it’s a simple pattern to follow that will likely lead to greater successes.

Name and Address Fun

As part of our ETL discussion, we spoke of data cleansing and conforming. One major challenge that data integrators have is dealing with very dirty data from multiple source systems. One good example is the challenges associated with name and address matching.

Warren showed us a series of slides with scanned address labels delivered (miraculously in some cases if you saw the slides) to his house. His name was butchered in every which way. Some had his name as Thorn. Another had him as a woman. It goes to show that data quality and name and address cleansing is very hard to do. When you consider that some companies spend millions on their mailings, this is a big deal: And they can’t even get it right!

BI Applications

After lunch, we dove right into a discussion on Business Intelligence (BI) Applications. One thing Warren said really stuck out to me, and it is something I wrote in big letters on my copy of the presentation slides: BRANDING.

BI Applications are in fact a way for us — as DW/BI professionals — to brand our services. We create our image through the BI Application layer. Despite all the background and expertise we need to do all other phases of the LifeCycle, the executives, managers, and users will see us through our applications.

I admit, I had an “ah-ha” moment when this came up. Everything from Portal design to reporting templates to our logo (yes, we need a logo too!) is equally as important as cleansing and conforming data, designing a perfect data model, and conducting hours upon ours of requirement-gathering interviews.

I only wish that we had more time to spend on this topic. An entire day would have been great.

BI Toolkit

We need a BI Toolkit. A vendor-neutral, general discussion on what it means to deliver BI Applications to our users. I talked with both Margy and Warren about this and both seemed to agree (as does their publisher). It’s been in their sights for some time now, I suppose. While the other toolkit books have been an invaluable reference for the entire community — this book could literally change the way data warehousing teams and businesses consider BI, BI Applications, and the various possibilities that BI offers.

Personally, I would like this book to take more time delving into user requirements. I don’t want it to tell me why I need one p-value over another, nor do I want it to tell me to use Business Objects because it can support feature X or Y. I need a reference that reminds me that developing a scorecard requires multiple business processes, that building a reporting template to pass around to business users can be a great tool to flesh out design decisions, and that perhaps a “BI Matrix” needs to be developed so that each BI Application can be resolved against each underlying business process dimensional model. Ideally, it would be a book that a business sponsor can read and digest, while providing the Data Warehousing team with enough technical knowledge to appropriately evaluate tools, build strategies, and formulate roll out plans that give the user the biggest and quickest return on their investments.

Lastly…

It’s a bummer that Warren had to race through the deployment and maintenance materials. We really did have a good class though, and a lot of great questions were asked. You can tell that most people in the room really “get it”, and their questions all reflect that. Of course, this led to a few diversions and a tangent or two. We paid for that in the last section. Thankfully, I have a brand new, 2cnd Edition Lifecylce Toolkit book to refer to!

This was a great experience for me, and I met a lot of great people. If you haven’t been to one of these courses, I recommend it no matter the cost.

If you have questions or comments for me, don’t be shy. I will be writing more and more about these topics in future postings, and I’ll also be picking up where I left off with my postings on the ETL Subsytems 1 through 34. So if there is something you would like me to address first, let me know!

Tags: , , , , , ,

8 Comments

Live from Kimball University: Day 1 (planning, requirements, dimensional models)

As you may know, I am currently in Amsterdam attending the Kimball University (organized by Quest for Knowledge / Q4K) training on the Data Warehouse Lifecycle. Margy Ross and Warren Thornthwaite are here and so far so good. Margy is teaching for the first two days, while Warren will take the final two days. I am not disappointed thus far.

Admittedly, much of the material in this course is not new to me. I’ve been exposed to dimensional modeling for some time now, and with a few projects under my belt, I’m starting to feel like I know what I’m talking about (which could be dangerous, I know). I’ve read the books, the articles, the tips, and have even written my own articles and blogged on many subjects. Attending these four day sessions will be primarily to validate and shore-up what I already know and hopefully correct some of my misconceptions and mistakes. With that said, I am getting a lot out of each session so far.

Project and Program Planning

Margy started out with a brief introduction, detailing the roots of the Kimball Group. Its pretty interesting to hear how they started out as 3 2-person companies and essentially merged into one: The Kimball Group. These six people have fundamentally changed the way data modelers and architects think about and approach data warehousing. What sets them apart, I believe, is that their ideas and theories stem from needs of the user.

Tremendous emphasis is placed on getting the requirements right. Margy pointed out that involving the users early and often is a key to success. Oftentimes, she said, users are involved early on and then forgotten once development begins. Three or Four months later, what is delivered might not be exactly what the end user had in mind. As the development team works, they uncover more and more hidden issues that must be resolved. The best person to help resolve many of these issues is the business user — not the data modeler or ETL architect.

In short, planning for a data warehouse must involve the business community. You need a strong and charismatic sponsor who has vision and influence, a data steward who cares about convention and quality, and a business analyst who can straddle the fence between the green grassy fields of IT and the barren wasteland of the business (kidding…).

Margy also developed a list of warning signs that ranged from failing to fill the various shoes I mentioned in the previous paragraph to acknowledging “BI/DW success is tied to user acceptance”. One of her points, not to underestimate the data cleansing workload, is finally starting to become less common. But it still remains one of the main reasons why data warehousing projects fail.

The Bus Architecture

Here’s something you don’t get by reading a book: Margy and the group didn’t quite like Ralph’s name for what we know as the Bus Architecture. I don’t have an electrical engineering background, but Ralph does. aw bus.thumbnail Live from Kimball University: Day 1 (planning, requirements, dimensional models)I suppose he felt that the way business processes and conformed dimensions related together reminded him of how a computer’s bus architecture worked. Essentially, being able to swap in and out components (If you have an electrical engineering background, or are just plain smarter than me on this, then please correct me if I’m off-base!).

I asked Margy what she would have called it, and she didn’t really know. I guess that’s why Ralph won-out. He was the only one with an idea!

For our purposes, the Bus Architecture “provides a standardized master set of conformed dimensions and conformed facts used throughout the data warehouse.” Click the image for a larger picture. This image comes from a SQL Server Mag article “Dimensional Modeling Basics” by Joy Mundy and Warren Thornthwaite.

Dimensional Modeling and Conformed Dimensions

I’ve talked at length in my articles and posts about dimensional modeling. Margy went over the basics today, starting out with a discussion on the differences between the Kimball approach and the Corporate Information Factory (CIF) approach pushed by Inmon. Although the dust between the two camps has settled in recent years, Margy was quite adamant about the benefits of dimensional modeling over the enterprise data warehouse concepts. Margy detailed these, but until I get my notes together, I’ll leave it at that.

snowflake1.thumbnail Live from Kimball University: Day 1 (planning, requirements, dimensional models)She also stressed the reliance on the star schema, and warned of snowflaking designs. Snowflaking occurs mainly when you attempt to normalize a dimension, usually by breaking the dimension up by its hierarchal data. For example, you might be tempted to break the “Store” dimension, which contains city, district, region, and state into separate City, District, Region, and State dimensions. Of course, doing so creates a snowflake and makes the model much more complex than it needs to be.

Conformed dimensions and facts are really the key to a successful dimensional modeling — and by extension data warehousing — project or program. They allow you to conduct drill-across queries, where you query data across business processes, and ultimately conformed dimensions and facts give your end users a consistent version of the data no matter what process they are looking at. Using the Bus Architecture, dimensions that are used across business processes are easy to spot.

Comments and Questions

I am also paying close attention to many of the questions and comments being made during the Q&A moments. The room is roughly composed of about 75% of people who are somewhat new to data warehousing and business intelligence. So it is a great opportunity to hear some of the questions the group has had for Margy.

Some questions: a.) How do you link a promotion to a product to show all the products that did not sell during a promotion? b.) Can you add the time to the date dimension? c.) Why should I conduct interviews during requirement gathering when I know the business well enough myself? d.) When determining the feasibility of implementing a business process, are there other important factors outside of data availability to consider?

I’ll elaborate more on all of these answers (and more) in future postings. In short, the answer key is simply: a.) Use a factless fact between product and promotion; b.) No, use either a Time dimension or store a timestamp in the Fact table; c.) You don’t know as much as you think you know and interviews will form the basis for future documentation; d.) available resources and technology issues can be overcome, perhaps the next biggest factor is dependencies between business processes.

And Lastly…

Almost everyone in attendance is a consultant. I am still getting used to this. I’ve seen and met more consultants in my short time here in Europe (about 5 months so far) than I have my entire life. Margy was also surprised. She did make a very interesting observation though. Consultants, who usually pay their own way through courses and seminars, typically stay away when times are tough. When the economy is good and/or they have plenty of work, consultants attend more conferences. Interesting. I suppose this could mean that the consultant business is doing quite well in this part of Europe.

I couldn’t get everything into this single post. I’ll be sure to post more notes as time goes on. For now, it’s on to tomorrow!

Tags: , , , , ,

1 Comment

An SSIS Custom Transformation Success Story

I recently inherited eight SSIS packages that all do essentially the same thing: They extract data from disparate sources, cleanse and conform them, and finally load them into an analytical data warehouse (dimensional model). Welcome to Data Integration 101!

The project calls for another two dozen of these extracts. Unfortunately, the team before me operated with little functional documentation, a limited knowledge of data integration, and did not focus on using tried and true best practices. There was no data profiling, no logical data mapping, no quality controls, no auditing, and very little in the way of logging. Essentially, things are a bit of a mess.

My primary focus over the next 6 months is to clean things up. I’ve already introduced better documentation practices, naming conventions, and this crazy concept of the “package template”. There is no way on earth I am developing 20 similar packages without using a template!

I’ve also started work on a Data Quality Manager Application (for now, it will consist of a table-driven set of SQL statements that can be run via stored procedure on a scheduler) and a Logical Data Map (LDM). The LDM will provide much needed documentation and give the other SSIS developers a little more direction.

I plan on posting much more about the LDM, Data Quality Manager, and package templates in a later post. For now, I want to gloat a little about a custom transformation task component I wrote.

Update: I almost never re-publish a post, but felt I needed to in this case. Darren Green of http://www.sqlis.com, which is an excellent SSIS resource by the way, commented about my use of the term “Task” to refer to Data Flow “components”. He’s right, so I made the change! Sorry for any confusion. In fact, while reading his comment, I had an “Ah-Ha” moment: The way to tell the difference between the two Scripts in SSIS is to call them by their correct names: Script Task (of the Control Flow) and Script Component (of the Data Flow). Duh!

The Problem

When the data comes into SSIS, work needs to be done to relate the rows from the source to those of the target. The mapping is a little different depending on the source file, but in the end, we need to match any number of significant characteristics (aka the file’s candidate key) to a single unique ID Number in our data warehouse database.

For example, one file might contain Social Security number, another might have last name and date of birth, a third might have the last name and the last 4 digits of a credit card number, yet another would have an account number from a billing system. Attached to all these identifiers are some important metrics that need to be added to a fact table in the data warehouse.

To do this, the developers before me used three Data Flow tasks, and over 100 Transformations (no joke). The second (and meatiest) Data Flow task took the extracted data (now in a staging table) and then embarked on a long drive through the various possible match combinations. First, there would be a Conditional Split that checked to see if the row had a valid account number, if not, then the output would hit another Conditional Split that would check on last name and so on. If there was a valid account number, a Lookup component would check the account dimension in the data warehouse for a match. If exactly one match was found (it is possible to have repeating account numbers), it would add the ID to the pipeline and drop the results into another staging table as a “match”. If not found — or there was a duplicate — another Lookup component would be used on some other identifiable column. Repeat. Repeat. Repeat!

The other two data flows did various cleanup work.

To add complexity to this mess, the attributes in each source are different. Sometimes account number has an extra check digit at the end. Other times, dashes are inserted in different positions.

 An SSIS Custom Transformation Success Story

A picture is worth a thousand words. Here is one of the data flows. The quality is bad on purpose: This is business logic, and as messy as it is, I don’t want to give away any company secrets! But, here is a high view so you can taste the pasta.

The developers did an admirable job of wading through all this and managed to implement a pure SSIS solution that works. They did this all without a data profile, LDM, or any real functional documentation. However, maintaining this, or repeating this logic in 20 more packages, makes me want to extract my eyes from my head, transform them into soup, and load them into my mouth.

A Better Approach

So I spent an hour looking at all this and thought to myself how easy and simple this would be using Visual FoxPro. A few SELECT statements, perhaps a CASE here, an IF there, and a REPLACE command for good measure.

Unfortunately, I can’t use VFP for this project. We’re locked into SSIS and there isn’t a single VFP license in my reach. The next best option, of course, is to design a custom component using C#.

In my next post, I’ll explain the nuts and bolts in detail. But to summarize, my Custom Task component (called “ID Lookup”) uses about 100 lines of code in the ProcessInput method, and another 100 lines in PreExecute (give or take). It essentially acts as a Lookup and Conditional Split with some extra logic built in that is specific to our situation. There are properties on the component that allow the developer to specify what input columns should be used in the matching process.

 An SSIS Custom Transformation Success Story

The new Data Flow looks a little more manageable.

ID Lookup achieves the exact same results as the three data flows and 100 transformations did. As an added bonus, it performs over 600% better! And I’m sure with a bit more tweaking, I can squeeze out additional performance.

Aesop Time

I suppose that being a senior-level software developer in my previous life helped out a bit. I’m not certain that the team before me even thought about writing some code to solve the problem. I also have the added benefit of seeing eight different packages in a completed state. It is much easier to identify patterns when you have something tangible to look at.

So the moral is, don’t be afraid of writing custom tasks or Data Flow components. This was not my first, and it won’t be my last. Now, when we start to write the new packages, we’ll drop this transformation into the Data Flow, right smack between the appropriate staging tables.

As I said, in my next post, I’ll take a look at some of the code I used to create the component. Keep in mind that I cannot show you our business-logic, but I will show the framework.

Tags: , , , , ,

6 Comments

ETL Subsystem 8: Data Conformance

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data conformity is one of the most important and fundamental aspects of data warehousing. Data conformity means that a data element’s label (column header) and content (which includes data types, lengths, and nullability) are consistent across multiple tables and across multiple business processes.

In dimensional modeling, a single dimension shared across business processes is a conformed dimension. Shared dimensions allow for drill-across queries where a user could conduct enterprise-level analysis across several business processes (inventory, sales, procurement, etc.). Conformed dimensions, as a data quality technique, also cut down the likelihood of data inconsistencies in an organization.

Margy Ross of the Kimball Group states that “Using conformed dimensions ensures that the data warehouse is delivering consistently defined attributes for labeling, grouping, filtering and integrating data from multiple business processes.”

Dimensional conformity also ensures reusability. Reusing dimensions across multiple business processes is what makes dimensional modeling so simple and scalable. The Product dimension, for example, would be the exact same physical table used in the Orders, Inventory, and Procurement dimensional models.

Obtaining Conformity

There are some challenges associated with obtaining conformity. Some of these challenges might be political, where business users refuse to agree on common business definitions. Most of them, however, are in the planning, data modeling, and system architecture, where individuals fail to identify or recognize where conformity exists. This is typically a direct result of failing to do adequate data profiling! Not having data stewards available to consult with on specific data issues could also lead to some slow times.

The first step to conforming dimensions is to define the business processes that will be used in the dimensional model. The next step is to conduct a data profile to identify and document the entities and relationships in the source system. Finally, identify the common dimensions across business processes.

The question then becomes: How do we determine which dimensions are conformed across business processes?

The answer is the Bus Matrix. The Kimball Group does an excellent job making the case for the Bus Matrix in their books and articles. In short, it acts as a roadmap for a dimensional modeling project. Check out “The Matrix” and “The Matrix: Revisited” on the Intelligent Enterprise Website. I won’t attempt to describe the Bus Matrix here for the sake of brevity, but may share some thoughts in a future posting (which I will likely title “The Matrix: Reloaded” just to be cheeky!).

The Bus Matrix will tell you exactly what your conformed dimensions are.

Conforming Attributes

By using conformed dimensions, it is possible to conduct drill-across queries. You might, for example, want to do some analysis on retail sales and inventory. Retail Sales and Inventory business processes exist in the data warehouse as two separate dimensional models. The cool thing is that they share dimensions (ex. Product). It is rather simple then to relate the fact tables together using the Product dimension based on some product characteristics such as “size”, “weight”, and “cost”. These drill-across queries can lend a great deal of insight into an organization’s various business entities.

Filtering and searching for data is also simplified if, for example, all names are in UPPER case and stored in atomic parts (last, first, middle, title).

You should strive to get as many attributes in various dimensions to conform. Attributes such as Name, Length, Weight, Size, and Color should mean exactly the same thing across all dimensions. If not, then they must be named differently. Conformity on this level lends to the simplicity of the overall design and makes operator queries simpler to write and understand.

Take for example Customer and Shipper dimensions, both with an attribute called “state”. In the Customer dimension, state is abbreviated to 2 characters, but in the Shipper dimension it is spelled out. You would not be able to conduct a query on all shippers and customers in the same state without the need for some lookup table. This could confuse and likely frustrate the operator. If you must spell out the state’s name, re-name the column to “state_name” (and add a column called ‘state’ with the abbreviated value)!

Allowing business users to relate multiple business processes together using commonly defined metrics (“facts”) also adds tremendous value to the data warehouse database. For example, a metric such as “discount” would not be a percentage in one fact table and an actual dollar amount in another. Instead, convert one to the other during data integration, or create two metrics: “discount_percent” and “discount_amount”. To take this a step further, in order for “discount_amount” to be conformed, it must be stored in some common currency (such as EUR or USD).

Not all attributes and metrics can be conformed, but the more you can, the more flexibility you will have with your drill-across queries. If you cannot conform a particular attribute or fact, then be sure to change its label so that end-users won’t make an assumption that it means something that it’s not!

I find it intuitive to identify attributes and metrics that are conformed across multiple source systems once the conformed dimensions have been identified and a data profile conducted. Where you need to be careful is when two column labels are named the same but mean (sometimes slightly) different things. Product color, for example, might be entered as a color’s name (red, black, blue) in one system, and a hex or RGB value in another. Multiple systems might have different product identifiers, all labeled “Product ID”. “Distance” could be in kilometers in once system, and in miles in another. Fahrenheit to Celsius conversions may be necessary. All of these issues need to be worked out during the data integration process so that “distance”, “color”, and “Product ID” mean exactly the same thing across Dimensions and Facts.

A 3lb red widget from NY is a 3lb red widget from NY is a 3lb red widget from NY!

SQL Server 2005 Integration Services (SSIS)

SSIS has many transformation tasks that you can use to enforce conformity in the dimensional model. In fact, there is not a single data conformity task that I have run across that SSIS does not provide a solution for out of the box (well, with some help from the Script Task, that is). Because many transformation tasks can be used to enforce conformity, I won’t enumerate them here. The challenges of conforming data truly lie in the planning and design process.

Hand Coding with Visual FoxPro (VFP9)

As with SSIS, there is nothing specific about hand-coding regarding conformed dimensions. So much rests on planning, mapping, data profiling, de-duping, and cleansing, that the process of conforming is just a matter of implementing the conversions and mappings necessary to build the dimensions so that they adhere to conformity. Check out my post on Loading Dimensions for some further details.

For some examples on some things you might do in FoxPro code to achieve conformity, take a look at my previous posts “Compare a String to a List of Formats” and “Conforming Temperature in FoxPro (some conversions)“. These are two examples of things you may need to do to enforce conformity.

From Here

In my next ETL Subsystem post, I will discuss Subsystem 9: Slowly Changing Dimensions (SCD). If you want to know more about Conformed Dimensions and the theory behind them, I recommend reading any of Kimball’s data warehousing books or articles that discuss the subject. He lays out the case quite nicely for conformed dimensions. I would like to stress that the success of the dimensional model rests in the ability of the data modelers and integration architects to design the data warehouse using as many conformed dimensions, attributes, and facts as possible.

There are not too many other blog postings (ex: 1 2 3) on the topic of data conformity that truly dive into the topic. I would love to see more, as this is such an important subsystem and best practice.

Tags: , , , , , , , , ,

5 Comments