Posts Tagged Conformity

10 Commandments of Data Integration

  1. You shall compile and document all requirements and mappings; segregate the work by business process. You may have more than one of these business processes, some of which may come before others.
  2. Do not begin without first conducting a thorough data profile; otherwise, you will be punished for your inequities, as will the generations that come after you.
  3. Do not think commandments one or two are in vain, lest you will become overrun by the dead line, scope creepers, and a great exodus of people from your tribe; if this happens to you, do not swear or curse, for you have been warned.
  4. Remember that latency and timeliness are equal in importance to non-volatility and having a traceable lineage; a staging area may lead you to this promised land.
  5. Honor the rules of data conformance.
  6. Do not kill dirty data: you shall clean them, or take them back to their sources for retribution.
  7. Do not commit the worst data integration transgression of all and ignore data quality, your ignorance will not be forgiven.
  8. Do not be shy about stealing your neighbor’s work, for his trials have led to best practices that you can make equally good use of.
  9. Do not rely solely on business keys; surrogates are your friend and will permit you to engage in slowly changing your dimensions.
  10. You shall covet a proper audit and log system; for on the day of judgment, you will need proof of your compliance.

Tags: , , , , ,

No Comments

Naming Conventions and the Underscore

I’ve seen and worked with a lot of naming conventions. When I start a new development project, I always — without exception — document how I intend to name the new items I create, whether they are physical objects such as tables and fields in a database, or names used in code for objects, variables, and the like. This document is shared with the team, adjustments are made where necessary, and adopted as standard for the project or group of projects.

The important thing in this effort is consistency, and not the technique we adopt. If all single-part surrogate key fields are to end in “Key”, and all single-part varchar business key fields are to end in “ID”, then the rule must be obeyed by all developers and DBAs.

A major point of contention that seems to crop up every single time naming conventions are discussed is how and when to use an underscore. Should the field be “CustomerID” or “Customer_ID”? “ProductKey” or “Product_Key”? “SSNumber” or “SS_Number”?

CamelCase

boy and camel 150x150 Naming Conventions and the UnderscoreI have a general rule: If the container (database, property window, etc) can remember case, then only use an underscore when combining an acronym with additional information, when that information comes after the acronym. This is because you can use “CamelCase“, which I find more readable.

Here are a few examples, showing how I would nornally handle underscores when case is remembered:

  • CustomerID and not Customer_ID
  • ProductKey and not Product_Key
  • PhoneNumber and not Phone_Number
  • SS_Number and not SSNumber
  • ISO_Date and not ISODate
  • WP_Theme and not WPTheme
  • ValueEPS and not Value_EPS

And while I’m at it, avoid redundancy: VIN (Vehicle Identification Number) is not VIN_Number; GICS (Global Industry Classification Standard) is not GICS_Classification; and DG (Disease Group) is not DG_Group!

No Case Support?

If the container does not remember case, then always use an underscore to separate the parts of a name. Examples:

  • customer_id and not customerid
  • product_key and not productKey
  • phone_number and not phonenumber
  • ss_number and not ssnumber
  • iso_date and not isodate
  • wp_theme and not wptheme
  • value_eps and not valueeps

If you can be strict about naming, project members and future generations will have an easier time understanding your code, objects, and documentation. I find that the underscore — even within a good naming system — is often used inconsistently. I also find that the underscore can make a major visual impact if used correctly. So it pays to pay special attention to ASCII character 95!

As an aside, naming conventions also play an important role in data warehouse conformity. So while it is important to have good standards, you will also need good governance to be sure that your names are meaningful and consistent.

Tags: , , , ,

6 Comments

Aggregate Facts and Dimensions

Aggregate — picture from http://www.stonetohome.comAggregate fact tables are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc..) or from other specialized functions that output totals derived from a grouping of the base data. These new metrics, called “aggregate facts” or “summary statistics” are stored and maintained in the data warehouse database in special fact tables at the grain of the aggregation. Likewise, the corresponding dimensions are rolled up and condensed to match the new grain of the fact.

These specialized tables are used as substitutes whenever possible for returning user queries. The reason? Speed. Querying a tidy aggregate table is much faster and uses much less disk I/O than the base, atomic fact table, especially if the dimensions are large as well. If you want to wow your users, start adding aggregates. You can even use this “trick” in your operational systems to serve as a foundation for operational reports. I’ve always done this for any report referred to by my users as a “Summary”. (As an aside, there is a difference between an “aggregate” and a “summary”. I’ll explore these differences in my next post.)

For example, take the “Orders” business process from an online catalog company where you might have customer orders in a fact table called FactOrders with dimensions Customer, Product, and OrderDate. With possibly millions of orders in the transaction fact, it makes sense to start thinking about aggregates.

To further the above example, assume that the business is interested in a report: “Monthly orders by state and product type”. While you could generate this easily enough using the FactOrders fact table, you could likely speed up the data retrieval for the report by at least half (but likely much, much more) using an aggregate.

Here, using the atomic transaction FactOrders table:

SELECT c.STATE, p.product_type, t.YEAR, t.MONTH, SUM(f.order_amount)
    FROM FactOrders f
    JOIN DimCustomer c ON c.CustomerID = f.CustomerID
    JOIN DimProduct p ON p.ProductID = f.ProductID
    JOIN DimTime t ON 
        t.YEAR = DATEPART(yy, f.YearMonthID) AND 
        t.MONTH = DATEPART(mm, f.DateID)
    GROUP BY c.STATE, p.product_type, t.YEAR, t.MONTH

The aggregate is querying much less data and queries against time are now much simpler. In my non-scientific tests, the following query ran many times faster (a few seconds compared to about 30 seconds!).

SELECT c.STATE, p.product_type, t.YEAR, t.MONTH, SUM(f.order_amount)
    FROM FactOrders_Agg1 f
    JOIN DimCustomerState c ON c.CustomerStateID = f.CustomerStateID
    JOIN DimProductType p ON p.ProductTypeID = f.ProductTypeID
    JOIN DimMonth t ON t.YearMonthID = f.YearMonthID
    GROUP BY c.STATE, p.product_type, t.YEAR, t.MONTH

Creating the Fact and Dimensions

To implement, you will need to roll up your fact table by the hierarchies found in your dimensions. The result will be a new fact table, a set of new accompanying dimensions at the grain of the fact, and all new foreign keys for mapping. I usually name the fact table the same as the base fact with some meaningful suffix appended to the end. In SSMS, this keeps the aggregates with the fact in my object explorer. Dimensions usually get new names (like CustomerState and ProductType) and should be conformed so that they can be reused across business processes. You could even create views instead of new dimensions, but this does not eliminate the need to regenerate new surrogate keys.

When rolling up dimensions, you are provided with an excellent opportunity to perform aggregate functions on the dimension itself and store the results as new attributes. For example, you may want to know how many customers are living in each state. This could be used as the denominator in some population calculation you plan to use against the aggregate fact. Your new dimension might therefore look like the following:

SELECT Cust_Country, Cust_Region, Cust_State, COUNT(Cust_ID) 
FROM DimCustomer 
GROUP BY Cust_Country, Cust_Region, Cust_State

The most obvious aggregate function to use is COUNT, but depending on the type of data you have in your dimensions, other functions may prove useful. Just be warned: If you find that aggregate functions are being used a lot on your dimensions, you may need to revisit your design. There may be opportunities to pull out those metrics into existing or new fact tables!

Generating aggregates is largely an incremental process, where you examine query and reporting usage looking for places to improve performance. Aggregates stored in the RDBMS are maintained through ETL and/or your OLAP engine.

A Note About OLAP Aggregates

Theoretically, storing aggregates in a fact table in a RDBMS is the same as storing them in an OLAP cube. In OLAP storage, aggregates are precalculated summaries of data from the different dimensions of the cube, such that a query that seeks to know the aggregate (sum) of some metric (order amount) for X (customer state) and Y (product type) over T (monthly orders) would only need to look inside the cube at those exact coordinates to get the answer. I won’t pretend to know how this stuff is physically stored, but OLAP engines across the board offer better performance, management, and navigation mechanisms for the aggregations than is available through the RDBMS (even when using Indexed or Materialized Views).

Next post, I’ll write some thoughts on the differences between “Summaries” and “Aggregates”!

Tags: , , , , , , ,

3 Comments

ETL Subsystem 17: Dimension Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

This subsystem is largely a management task. It involves ensuring that conformed dimensions (see my post “ETL Subsystem 8: Data Conformance“) are properly published to the organization. Conformed Dimensions are one of the most important and fundamental aspects of data warehousing. In order to deliver these conformed dimensions to business users spread across the organization — including geographically — a system must be in place to deal them out.

In simple environments (a single RDBMS server that services a handful of analysts, for example), dimension management is as easy as inserting new rows through ETL and updating data model metadata by the DBA on a single server. In more complex environments — where servers and applications are scaled up and out, across large geographies — dimension management becomes more of a challenge.

It is common in these distributed environments (horizontally and vertically) to have copies of dimensions sitting at various locations in the enterprise. Don’t confuse this with “Mart Madness” or “Silos”. These copies are managed by a single ETL process which not only handles normal ETL, but likely handles the distribution as well (replication, as I’ll talk about in a moment, is another option).

Why not have remote users connect to a single, centralized database? A Dimensional Model exists to facilitate analytics and query performance. To this end, it is necessary at times to move data closer to the analyst. Not only is this necessary, but it is also smart. If you have presentation servers in Brazil, Costa Rica, Singapore, and Sri Lanka, then guess what: You will distribute your data directly to these locations. How you do it is more of a question of technology and policy, rather than a process tied to ETL directly.

SQL Server 2005 Integration Services (SSIS)

If you choose to distribute dimensions using ETL, then SSIS will work just fine. This is nothing more than a set of special packages that you design to select changes from the master dimensions (stored on your central integration server), multicast the data, and load to the destination dimension or dimensions. This incremental approach is desirable when you are processing a small amount of changes to medium and large dimensions.

Alternately, if your dimensions are small enough, it might make sense to drop and recreate them on the destination servers. The overhead associated with collecting the changes from the master dimensions might not be worth the effort. But before you think this method the best, be aware that all constraints to this dimension (including those connected through the fact) will need to be dropped. Also, the database will be unavailable to the users during this operation.

Another good option is to simply use Replication to propagate the changes to the various distributed dimensions. I am currently using replication now as my standard, but only because my situation currently dictates it. I have large data volumes, which are updated frequently (several times a day, with sensitivity on global timezones).

Hand Coding with Visual FoxPro (VFP9)

I am starting to sound like a broken record: VFP is not an ETL tool, but an OOP language. Dimension distribution will need to be done manually by either making copies of dimensions and pushing them out to the client systems, or designing a mechanism for the client systems to pull changed data from the master dimensions. I have found this to be a major challenge using VFP as the sole solution. Obviously, if your backend database is SQL Server or another RDBMS, you likely have a replication option. That being said, it certainly is not an impossible task, but one in which you need to prepare yourself for the various complexities. I would go out on a limb and suggest that this is true for all hand-coded solutions.

From Here

In my next post, I will discuss ETL Subsystem 18: Fact Table Provider. As with this subsystem, the Fact Table Provider is largely a management task designed to bring facts to the user community.

Tags: , , , , , , ,

2 Comments

What is MDM Anyway?

What is master data and what is the master data domain? What does it cover? Where is the business value in MDM? Is MDM a data warehousing function? How can business users be sold on the MDM investment?

While the industry dukes it out over answering these questions (you can’t get two experts or vendors to give you the same answer to any of those questions), I thought I’d share my thoughts. I admit, that my MDM experience to this point is minimal, having only been involved in one official MDM project (and only at the planning and data modeling level). But nevertheless, I do have some of my own ideas.

The bottom line, though, is that MDM is important and should not be overlooked when planning for new business intelligence or enterprise integration initiatives.

What MDM is

Master Data Management, an information activity, is the process of ensuring that an organization’s data (including its metadata) is consistent, reliable, accessible, distinctive and well defined. This master data, once it meets these requirements, can then be used by the enterprise as a system of record for key business entities and attributes.

There are different types of MDM: Analytical (A-MDM), Operational (O-MDM), and Enterprise (E-MDM). I see it like this: A-MDM is related to data warehousing and would therefore be implemented solely as a compliment to (or a reaction from) a data warehouse/business intelligence project (I don’t agree that simply having conformed dimensions constitutes “having” A-MDM, though). O-MDM is about collecting, managing, and redistributing master data to be used by operational systems (which is largely an effort in synchronization). E-MDM, is the Holy Grail, and would be a combination of both O-MDM and A-MDM. E-MDM reminds me of Enterprise Data Modeling (EDM), in that you truly need a 360-degree view of the business to make it work (READ: get business and IT together for tea).

Data governance and stewardship, as well as data quality management, data integration, and service-oriented architecture (SOA) are all functions and processes related in some way to Master Data.

What MDM is Not

MDM is not a technology. It is a business function. You are either managing master data, or you are not. It is unlikely that any single approach or software technology will solve the Master Data problem. It drives me nuts to hear about how vendors sell MDM (and BI, for that matter), but I digress…

The work of developing and maintaining master data is not a data warehousing function. To restate: Data warehouses are not developed to create master data (even A-MDM). Some will disagree with me on this. In my opinion/observation, MDM is a separate, highly important activity that works in conjunction with the data warehouse and all other applications and processes that exist in the enterprise.

Other Thoughts

I think that a common mistake made with regard to planning for, and implementing MDM, is that there is not enough emphasis on the data quality gains that will be realized. In addition, the data integration process for data warehousing projects will be infinitely easier, as the complex work of conforming dimensions and attributes will already be complete (at least in definition, with plenty of good metadata to use as a reference).

Master data, in many ways, provides a new way of looking at data as an asset with tangible, strategic value.

I’ll post more on this topic as time rolls on. As always, thoughts, questions, and criticisms welcome!

Tags: , , ,

2 Comments

ETL Subsystem 9: Slowly Changing Dimensions (SCD)

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

To this point, I’ve discussed “Data Extraction” and “Cleansing and Conforming” ETL Subsystems. Now, I’ll shift attention to data delivery (the “L” in ETL). As I have already discussed Slowly Changing Dimensions in a previous post, I’ll only highlight some of the points I think are most important here. So please, if you haven’t already, read my previous post “Dimensional Modeling: Loading the Slowly Changing Dimension” before you continue.

Some important takeaways:

  • Dimensions change over time. We can identify how to process these changes by using SCD Types 1, 2, and 3 techniques.
  • Dimensions (especially large ones) sometimes need a hybrid approach. This approach is sometimes referred to as a Type 1.5 SCD, which allow a dimension to be added with incomplete data. Updates are processed later as if the dimension was a Type 1 (empty and incomplete attributes are overwritten instead of retaining history) until the entire row can be labelled as “complete”. To indicate this, I have found including a “complete” flag in each dimension is a clear way to mark Type 1.5 dimension rows as complete, or in need of additional data.
  • Like the Type 1.5, sometimes a dimension row must be added with no details. These are called Inferred Members. In this scenario, just a natural key and the system-generated surrogate key are added. The dimension is then updated much like the Type 1.5 SCD until its data is complete.
  • For Type 1.5 and Type 2 SCDs to work, the row needs some sort of status indicator, the best being active “to” and “from” dates.

 

SQL Server 2005 Integration Services (SSIS)

I already mentioned in my previous post how SSIS’s Slowly Changing Dimension Task can be used to meet the requirements of this subsystem. Again, my only word of advice is to be careful about the amount of data you plan on passing through the generated SCD Tasks. I have seen performance degrade substantially on large datasets. It is best, in these scenarios, to introduce additional logic before the SCD logic to filter out inserts, updates, and deletes into different trees.

Also, it is a good idea to fully understand the different SCD types (inferred dimensions as well) before running through the wizard. Not only will the entire process be clearer, but you will better understand how to optimize the generated data flow when all is created.

Hand Coding with Visual FoxPro (VFP9)

I’ve had great success using FoxPro for hand-coded dimensional modeling projects. My first stab at loading Slowly Changing Dimensions was in 1998, using VFP6. Using a database trigger, I wrote code to update a denormalized data model that was used for reporting and ad-hoc queries. At the time, I had no idea what a SCD was, but somehow, instinctively I implemented it pretty close to the tried-and-true Kimball methods. That system is still in use today (the app is in VFP9 now) and I have redesigned it to match the Kimbal SCD methodology much closer than I originally conceived it.

If you are involved in a similar VFP project, or are at all interested in learning more about how VFP can be used in this way, please do contact me. I’m pretty passionate about VFP and Dimensional Modeling, so I’d love to hear from you!

Tags: , , , , , , ,

4 Comments

ETL Subsystem 8: Data Conformance

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data conformity is one of the most important and fundamental aspects of data warehousing. Data conformity means that a data element’s label (column header) and content (which includes data types, lengths, and nullability) are consistent across multiple tables and across multiple business processes.

In dimensional modeling, a single dimension shared across business processes is a conformed dimension. Shared dimensions allow for drill-across queries where a user could conduct enterprise-level analysis across several business processes (inventory, sales, procurement, etc.). Conformed dimensions, as a data quality technique, also cut down the likelihood of data inconsistencies in an organization.

Margy Ross of the Kimball Group states that “Using conformed dimensions ensures that the data warehouse is delivering consistently defined attributes for labeling, grouping, filtering and integrating data from multiple business processes.”

Dimensional conformity also ensures reusability. Reusing dimensions across multiple business processes is what makes dimensional modeling so simple and scalable. The Product dimension, for example, would be the exact same physical table used in the Orders, Inventory, and Procurement dimensional models.

Obtaining Conformity

There are some challenges associated with obtaining conformity. Some of these challenges might be political, where business users refuse to agree on common business definitions. Most of them, however, are in the planning, data modeling, and system architecture, where individuals fail to identify or recognize where conformity exists. This is typically a direct result of failing to do adequate data profiling! Not having data stewards available to consult with on specific data issues could also lead to some slow times.

The first step to conforming dimensions is to define the business processes that will be used in the dimensional model. The next step is to conduct a data profile to identify and document the entities and relationships in the source system. Finally, identify the common dimensions across business processes.

The question then becomes: How do we determine which dimensions are conformed across business processes?

The answer is the Bus Matrix. The Kimball Group does an excellent job making the case for the Bus Matrix in their books and articles. In short, it acts as a roadmap for a dimensional modeling project. Check out “The Matrix” and “The Matrix: Revisited” on the Intelligent Enterprise Website. I won’t attempt to describe the Bus Matrix here for the sake of brevity, but may share some thoughts in a future posting (which I will likely title “The Matrix: Reloaded” just to be cheeky!).

The Bus Matrix will tell you exactly what your conformed dimensions are.

Conforming Attributes

By using conformed dimensions, it is possible to conduct drill-across queries. You might, for example, want to do some analysis on retail sales and inventory. Retail Sales and Inventory business processes exist in the data warehouse as two separate dimensional models. The cool thing is that they share dimensions (ex. Product). It is rather simple then to relate the fact tables together using the Product dimension based on some product characteristics such as “size”, “weight”, and “cost”. These drill-across queries can lend a great deal of insight into an organization’s various business entities.

Filtering and searching for data is also simplified if, for example, all names are in UPPER case and stored in atomic parts (last, first, middle, title).

You should strive to get as many attributes in various dimensions to conform. Attributes such as Name, Length, Weight, Size, and Color should mean exactly the same thing across all dimensions. If not, then they must be named differently. Conformity on this level lends to the simplicity of the overall design and makes operator queries simpler to write and understand.

Take for example Customer and Shipper dimensions, both with an attribute called “state”. In the Customer dimension, state is abbreviated to 2 characters, but in the Shipper dimension it is spelled out. You would not be able to conduct a query on all shippers and customers in the same state without the need for some lookup table. This could confuse and likely frustrate the operator. If you must spell out the state’s name, re-name the column to “state_name” (and add a column called ’state’ with the abbreviated value)!

Allowing business users to relate multiple business processes together using commonly defined metrics (”facts”) also adds tremendous value to the data warehouse database. For example, a metric such as “discount” would not be a percentage in one fact table and an actual dollar amount in another. Instead, convert one to the other during data integration, or create two metrics: “discount_percent” and “discount_amount”. To take this a step further, in order for “discount_amount” to be conformed, it must be stored in some common currency (such as EUR or USD).

Not all attributes and metrics can be conformed, but the more you can, the more flexibility you will have with your drill-across queries. If you cannot conform a particular attribute or fact, then be sure to change its label so that end-users won’t make an assumption that it means something that it’s not!

I find it intuitive to identify attributes and metrics that are conformed across multiple source systems once the conformed dimensions have been identified and a data profile conducted. Where you need to be careful is when two column labels are named the same but mean (sometimes slightly) different things. Product color, for example, might be entered as a color’s name (red, black, blue) in one system, and a hex or RGB value in another. Multiple systems might have different product identifiers, all labeled “Product ID”. “Distance” could be in kilometers in once system, and in miles in another. Fahrenheit to Celsius conversions may be necessary. All of these issues need to be worked out during the data integration process so that “distance”, “color”, and “Product ID” mean exactly the same thing across Dimensions and Facts.

A 3lb red widget from NY is a 3lb red widget from NY is a 3lb red widget from NY!

SQL Server 2005 Integration Services (SSIS)

SSIS has many transformation tasks that you can use to enforce conformity in the dimensional model. In fact, there is not a single data conformity task that I have run across that SSIS does not provide a solution for out of the box (well, with some help from the Script Task, that is). Because many transformation tasks can be used to enforce conformity, I won’t enumerate them here. The challenges of conforming data truly lie in the planning and design process.

Hand Coding with Visual FoxPro (VFP9)

As with SSIS, there is nothing specific about hand-coding regarding conformed dimensions. So much rests on planning, mapping, data profiling, de-duping, and cleansing, that the process of conforming is just a matter of implementing the conversions and mappings necessary to build the dimensions so that they adhere to conformity. Check out my post on Loading Dimensions for some further details.

For some examples on some things you might do in FoxPro code to achieve conformity, take a look at my previous posts “Compare a String to a List of Formats” and “Conforming Temperature in FoxPro (some conversions)“. These are two examples of things you may need to do to enforce conformity.

From Here

In my next ETL Subsystem post, I will discuss Subsystem 9: Slowly Changing Dimensions (SCD). If you want to know more about Conformed Dimensions and the theory behind them, I recommend reading any of Kimball’s data warehousing books or articles that discuss the subject. He lays out the case quite nicely for conformed dimensions. I would like to stress that the success of the dimensional model rests in the ability of the data modelers and integration architects to design the data warehouse using as many conformed dimensions, attributes, and facts as possible.

There are not too many other blog postings (ex: 1 2 3) on the topic of data conformity that truly dive into the topic. I would love to see more, as this is such an important subsystem and best practice.

Tags: , , , , , , , , ,

5 Comments