Posts Tagged Normalization

ETL Subsystem 11: Hierarchy Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

This subsystem exists to ensure that hierarchies are appropriately translated and represented in the dimensional model. There are two types of hierarchies that you’ll need to contend with: fixed and ragged.

Managing hierarchies can be a complex process especially when you have hierarchies that are extremely ragged (for example, manufacturers’ parts or an organizational chart). You’ll also run into complexities when a dimension entity (like a single customer or a single product) exists in simultaneous hierarchies. I’ll talk more about these later in this post. In contrast, the fixed variety is easier to work with.

Hierarchies as Attributes of a Dimension

Generally speaking, you can think of hierarchies as many-to-one relationships. In the dimensional modeling world, these relationships are represented in a single table. This would include stores to regions to state, children to parents to grandparents, and greens to vegetables to produce. ETL Subsystem 11 seeks to maintain the integrity of these relationships.

To simplify even further, instead of thinking in “tree” structures, think in “lines” from the child up the hierarchy to the parent. This will help you build the dimensions to accommodate the relationships. In the simplest form: A single store is in a single region in a single state. This is actually a very interesting topic from a modeling perspective — one in which I’ll get to in more detail in a future posting.

state_region_storeDimensions are denormalized structures, which means that you will have many repeated elements. This is normal and desired. For example, a store region will be repeated for each store and state will be repeated for each region all in the same dimension (look left). This is normal and desirable in a dimension. The trick to getting this to work correctly is that the hierarchy must be represented as a single value with the dimension row’s primary (surrogate) key.

Data modelers who are used to 3rd Normal Form might look at the above and cringe. But remember: normalized models are for preventing data anomalies in a transaction environment. In a data warehousing environment, the rules are different. First, there are no opportunities for data anomalies due to data integration controls. Secondly, normalizing the data warehouse makes absolutely no sense from a usability perspective: it only complicates and slows down reports and queries.

The Ragged Kind

So fixed hierarchies are easiest to work with: got it. It is not so easy to work with variable and ragged hierarchies because of their variable depth. The classic example is an organization chart, where any employee can be at the top or at the bottom of the hierarchy. Knowing how deep the organization runs from any point is a challenge that usually requires self joins and bridge tables to represent the relationships.

I have always solved these types of hierarchy “problems” using “helper tables” in a dimensional model. Ralph Kimball wrote a great article a decade ago on this subject. Check it out for more details.

Helper tables look like bridge tables that sit between the fact and the dimension. They facilitate the representation complex hierarchal information. This design complicates user queries though, so be sure that helper tables are absolutely needed. It might be, for example, you only really need the manager’s name and not the entire chain of command with each employee. You don’t need a helper table for that (see the following code)!

SELECT
   c.FirstName AS empFirstName,
   c.LastName AS empLastName,
   e.title AS empTitle,
   COALESCE(m.mgrFirstName,'N/A'),
   COALESCE(m.mgrLastName,'N/A'),
   COALESCE(m.mgrTitle,'I am the boss')
FROM
    HumanResources.Employee e 
JOIN Person.contact c ON c.ContactId = e.EmployeeID 
LEFT JOIN (
   SELECT
      e2.EmployeeID,
      c2.FirstName AS mgrFirstName,
      c2.LastName AS mgrLastName,
      e2.title AS mgrTitle 
   FROM
      HumanResources.Employee e2 
   JOIN Person.contact c2 ON c2.ContactId = e2.EmployeeID  ) 
  AS m ON m.EmployeeID = e.ManagerID

I wrote a while ago on ragged hierarchies from a programming perspective. Take a look at that post for more details.

Date Hierarchies

For another example, let’s look at a common hierarchy we can all relate to: year, quarter, month, and week. You can see how this hierarchy is modeled by looking at any date dimension. As you’ll see, each day contains information about how it is grouped on the calendar. The information is repeated for each day until one of the groups change. When designing reports that allow your users to drill down into the data, it is a common approach to start at the highest group (sales by year) and then look a bit deeper as necessary (sales by quarter or sales by month, for example).

Also, in the date dimension, weeks don’t line up too well within a year, quarter, or month. This is a classic example of how one group can fit entirely or partly into another. You see this often when a sales region crosses multiple states, or when an employee serves multiple roles within the company. These are all situations that must be accounted for by your hierarchy manager. For the date dimension, one technique I’ve adopted is to include the week of the year number, week of the quarter, and week of the month in the date dimension to give users the ability to drill into weekly data much easier (for example, you might want to measure holiday sales in the US from the 3rd week of November to the 4th week of December).

Snowflakes and Hierarchies

SnowflakeSnowflakes are usually a sign that a hierarchy has been normalized. This is bad. Don’t fall into the trap! In order to keep the dimensional model as simple as possible, you should avoid snowflaking; however, snowflake designs are perfectly legal under certain circumstances. Carefully examine the reasons though. If you are snowflaking to accommodate a hierarchy, hold your horses. Hierarchies are a natural part of a dimension. In fact, most things in life are categorized and need to be grouped in some way. Remember that one of the primary purposes of delivering a denormalized dimension is to remove almost all complexity from the user’s perspective. This usually means hierarchies.

SQL Server 2005 Integration Services (SSIS)

Denormalizing hierarchies using SSIS is not difficult. The hardest part is usually in writing the SQL that correctly fetches the right data. In a Data Flow, you can use a series of Lookup and Merge components. In some cases, especially for the more complex ragged hierarchies, your best bet is to use SQL statement in an Execute SQL Task.

Although I won’t get into details in this post, CTEs (Common Table Expressions) are excellent for working with recursion and hierarchies.

The real joy of hierarchy management using the SQL Server Business Intelligence suite comes at the end when you want to start using your hierarchies to allow users to drill down into the data. Even if you have complex ragged hierarchies structured with bridge tables in your model, Analysis (SSAS) and Reporting (SSRS) Services can really make sense of it all. But I’m digressing; this series of posts is about ETL and not presentation tools!

Hand Coding with Visual FoxPro (VFP9)

It’s easy to write recursive functions in Visual FoxPro. Recursion is one of the “secretes” to flattening out hierarchal structures. For more details on writing recursive functions in FoxPro, check out my post “Ragged Hierarchy Alert” or even better, visit the FoxWiki page on the subject.

From Here

In the end, as I continue to compare SSIS with VFP on data integration, I find that hierarchy management is about equal between the two. FoxPro seems to perform much better but I have no benchmarking (yet) to prove it. One advantage with SSIS is being able to utilize CTEs.

This was a long one, but a heavy subject. In my next post, I’ll discuss the Special Dimensions Manager, ETL Subsystem 12. Not as heavy and hopefully not as long!

Tags: , , , , , ,

1 Comment

Database Design: 4th and 5th Normal Forms

I had the pleasure of speaking with Andrew MacNeill of The FoxShow last week for an interview. We discussed database normalization and denormalization in relational databases. I’d like to thank Andrew for the opportunity to discuss this very important database design topic. He contacted me after a listener asked for a discussion on normalization, and having seen my recent posting “The Role of Good Database Design and Normalization for Data Integration Projects“, Andrew felt that I would be up to the challenge.

You can listen to the Interview here: FoxShow #47: Data Normalization with FoxPro and Tod McKenna.

I promised during the interview to present the 4th and 5th normal forms here on Tod means Fox because trying to explain these forms without pictures and examples is very difficult. I would also like to confess (shame on me!) that I didn’t take the time ahead of time to prepare solid examples for these forms, so my explanation of these forms in the interview was a bit lacking. I hope to clarify what I said here.

4th Normal Form

In my previous post, I defined the 4th Normal Form as a form that seeks to “remove all multi-valued dependencies”. So what exactly does this mean? A multi-valued attribute is an attribute that helps to describe the entity. In order to represent this in a table, the attribute could be repeated several times. Consider this real-world scenario:

Tod Means Fox - Boyce-Codd Normal Form Example

The town, Region, and (Congressional) District together form the candidate key. In this example, Providence has a split congressional district and exists in multiple regions. Lincoln also exists in multiple regions but only has a single district. The above table meets Boyce-Codd Normal Form because all three attributes are part of the candidate key (read more about BCNF here).

This type of table is called a ‘cross-product’, where data is repeated. It is not in 4th normal form because of this repeating data scenario. To fix, you would create two separate tables:

Tod Means Fox - 4th Normal Form Example

In 4th normal form, you reduce the need for inserts, updates and deletes and therefore you eliminate the possibility for data anomalies. It is reasonable to strive for 4th normal form as a baseline in your designs. In some cases, your database will take up less space. The cost for normalizing this far is only in query performance, as you’ll need additional joins to report on the data. But in my opinion, data integrity (especially if you plan on re-purposing your data for data warehousing and data integration projects) should always take precedence in OLTP systems. With that said, the 5th Normal Form goes even further:

5th Normal Form

The 5th Normal Form is a “projection-based normalization approach where every nontrivial join dependency is implied by a candidate key”. In the previous example, I showed how one table became two, reducing the risk of data anomalies. 5th normal form could split a single table into three or more.

Let’s say you have a table with store, supplier, and products. The table contains a row for each store, which is repeated for each supplier the store uses, and again for each product that the supplier provides to the particular store (A store might not sell all the products that a supplier provides). All fields in this table are valuable and necessary to represent the row and is in 4th Normal Form. In fact, this is a common type of table that could represent a many-to-many-to-many relationship among stores, products, and suppliers. But, notice that there is repeating data:

Tod Means Fox - Almost 5th Normal Form Example

In 5NF, this table would become 3: a store_to_prod table, a supp_to_prod table, and finally a store_to_supp.

Tod Means Fox - 5th Normal Form Example

When a store picks up a new supplier, a new row is added to the store_to_supp table, eliminating the redundancy of extra rows for product. When a store sells a new product, the supp_to_prod table will let you know where to get it. A simple example, yes — but it demonstrates 5th normal form.

As you can see, you increase the amount of relations and tables as you normalize deeper and deeper.

Denormalizing

As discussed in the interview, at a certain point it is logical to push back and denormalize. You would do this for performance and manageability purposes, or because the design of the database dictates it. In the previous example, it might not be practical from a development standpoint to bring all tables to 5th form. However, you should weigh the decision to denormalize carefully. Consider the purpose of the database. If it is a transaction-based database with many data access points, you will want to normalize as much as possible to ensure the integrity of the data. If the primary purpose of the database is for ad-hoc queries, then backing off might be well received.

As I said in my previous post, normalization is an art form. I find it difficult to explain at times, but relatively easy to implement. I hope that this post is helpful in some way, and if I have made any errors, or if you don’t agree with any part of this discussion please feel free to contact me!

Tags: , ,

5 Comments

The Role of Good Database Design and Normalization for Data Integration Projects

Not only is a database designer charged with the task of identifying entities, attributes, and relationships (the fun stuff) but also to ensure data integrity and data quality (often a challenge). This challenge is not limited to building indexes, rules, triggers, and database procedures. Database normalization is crucial in the design of relational databases. A good design is a normalized one, which seeks to minimize duplicate data and isolates entity rows so that they contain the exact data needed to satisfy the entity definition. The entity definition is based on business definitions: “customer demographics file”, “daily orders”, “receipt line items”, “store location”, etc. The database architect takes these business definitions and translates them to entities using ER or similar diagramming. And thus begins the normalization process.

I don’t want to get too academic here, or dive into the theory behind database normalization. That discussion can be saved for another day (but you can read more here or ). But basically, normalization is based on several cumulative rules or ‘Forms’. In the list below, Forms (1NF, 2NF, and 3NF) are most widely implemented. BCNF and 4NF are used but generally are the first to be relaxed when purposefully denormalizing the database (more on this below). I don’t recall ever seeing 5NF, DKNF, or 6NF in any production environment — instead these Forms seem to be relegated to specialized object-based systems, academic, and theoretic projects.

First Normal Form (1NF)
Entity does not have duplicate columns and each row can be identified by a primary key
Second Normal Form (2NF)
The usage of foreign keys to build relationships to remove repeating subsets of data
Third Normal Form (3NF)
Removes all attributes from a table not dependent upon the primary key
Boyce-Codd Normal Form (BCNF)
The only determinants are superkeys, or a superset of candidate key
Fourth Normal Form (4NF)
The removal of all multi-valued dependencies
Fifth Normal Form (5NF)
Projection-based normalization approach where every nontrivial join dependency is implied by a candidate key
Domain/Key Normal Form (DKNF)
Domain constraints and key constraints only
Sixth Normal Form (6NF)
If interested, read about the temporal/time dimension and 6NF in the book Temporal Data & the Relational Model.

Database normalization is an art form and takes practice, good sense, and a thorough understanding of the operational process to be modeled. The reasons why we walk through this exercise before each design is simple: Denormalized data can lead to update, insert, and delete anomalies, which must be avoided for data integrity and quality. Each of the above Normal Forms address one or more type or degree of anomaly:

Update Anomaly
When data to update is stored in many places, and not all updates take place
Insert Anomaly
When data to insert cannot be inserted because additional information (which should be stored as its own entity) is unavailable
Delete Anomaly
When deleting a row causes other information to be deleted

Many of these problems caused by denormalization are caused by bad planning, faulty design, laziness, or database scope overloading.

  • The Planning Factor
  • Normalization was not a consideration when designing the database. There are many reasons for this, but from my experience it is usually one of two things: The designer is a novice and is building his or her first database, or the design is based faithfully off of a spreadsheet (affectionately referred to as Spreadsheet Syndrome). This type of problem I have seen countless times in Access-based database systems, but other user-friendly RAD RDMS systems (like Visual FoxPro) suffer from this as well.

  • Faulty Design
  • Sometimes good intentions just aren’t enough! Poor or uninformed choices were made early on in the design process either because normalization concepts were misunderstood or business rules were not clearly defined. Database refactoring is a big deal and once you get started it is very difficult to backstep. I should note that faulty design is not limited to inadequate normalization but also to entity definitions, metadata considerations (is there any metadata at all?), and naming conventions (account_no, acct_nbr, account, acctnumb, etc…).

  • Laziness
  • Laziness and selective ignorance go hand-in-hand. After all, why create 10 fully normalized entities when you can have 1! The challenge of maintaining a database with dozens upon dozens of tables can be daunting. Especially when considering the need for careful index planning and the challenge of translating many entities into a good GUI design. So the urge is to combine order information with line items, or billing information with the customer demographic data to ease the workload.

  • Overloading the Design
  • I have seen database designs where the architect was clearly trying to satisfy multiple purposes; namely, daily transactions and ad-hoc reports. When building ER diagrams, a designer (who is often the developer as well) may take a step back and wonder what it would take to produce certain reports across multiple entities in their fully normalized model. Not only is writing a report of that nature tough, but performance will certainly be an issue. These are valid concerns. But the solution is not selective denormalization. The transaction database should be specific to transactions. If the same database is needed for reporting and ad-hoc queries, consider creating views or build a separate dimensional model with real-time/right-time updates using triggers on insert, update, and delete (my preferred method and a breeze in VFP).

Denormalization

There are times when denormilization is in order for a relational database. But the rule is this: Normalize the entire design first (usually up to 4NF where possible), approve the model with business users, and build a working prototype. Then, and only then, look for ways to improve system optimization and performance through denormilization. back off of any change that requires any sort of debate. Remember that data integrity and quality are far more important in an operational system than a tiny performance improvement or a couple of saved minutes in programming time.

Of course, in a dimensional model, denormalization is encouraged!

Data Integration

The reason for this post has more to do with data integration than you might have gathered. When conducting a data profile to learn about the source data, it helps tremendously if this data has been normalized properly. Certain assumptions can then be made and the integration process can steam along. A normalized database helps the integration engineers determine which data is reliable and represents the single version of the truth for the data warehouse. If data anomalies are possible because of a bad or lazy design, it could cost your data warehousing project days, weeks, or months to account for.

Therefore, normalized operational systems are not only essential for the data integrity and quality of those systems, but also for Business Intelligence initiatives that rely on them. This is true even if no anomalies exist in the source database (because of heavy application rules and code to ensure integrity on a denormalized structure); the integration team will not be able to make the assumption that anomalies will never exist. Data value is so important for the dimensional model and Business Intelligence that precautions like these must be taken.

Tags: , ,

4 Comments