The Role of Good Database Design and Normalization for Data Integration Projects
Not only is a database designer charged with the task of identifying entities, attributes, and relationships (the fun stuff) but also to ensure data integrity and data quality (often a challenge). This challenge is not limited to building indexes, rules, triggers, and database procedures. Database normalization is crucial in the design of relational databases. A good design is a normalized one, which seeks to minimize duplicate data and isolates entity rows so that they contain the exact data needed to satisfy the entity definition. The entity definition is based on business definitions: “customer demographics file”, “daily orders”, “receipt line items”, “store location”, etc. The database architect takes these business definitions and translates them to entities using ER or similar diagramming. And thus begins the normalization process.
I don’t want to get too academic here, or dive into the theory behind database normalization. That discussion can be saved for another day (but you can read more here or ). But basically, normalization is based on several cumulative rules or ‘Forms’. In the list below, Forms (1NF, 2NF, and 3NF) are most widely implemented. BCNF and 4NF are used but generally are the first to be relaxed when purposefully denormalizing the database (more on this below). I don’t recall ever seeing 5NF, DKNF, or 6NF in any production environment — instead these Forms seem to be relegated to specialized object-based systems, academic, and theoretic projects.
- First Normal Form (1NF)
- Entity does not have duplicate columns and each row can be identified by a primary key
- Second Normal Form (2NF)
- The usage of foreign keys to build relationships to remove repeating subsets of data
- Third Normal Form (3NF)
- Removes all attributes from a table not dependent upon the primary key
- Boyce-Codd Normal Form (BCNF)
- The only determinants are superkeys, or a superset of candidate key
- Fourth Normal Form (4NF)
- The removal of all multi-valued dependencies
- Fifth Normal Form (5NF)
- Projection-based normalization approach where every nontrivial join dependency is implied by a candidate key
- Domain/Key Normal Form (DKNF)
- Domain constraints and key constraints only
- Sixth Normal Form (6NF)
- If interested, read about the temporal/time dimension and 6NF in the book Temporal Data & the Relational Model.
Database normalization is an art form and takes practice, good sense, and a thorough understanding of the operational process to be modeled. The reasons why we walk through this exercise before each design is simple: Denormalized data can lead to update, insert, and delete anomalies, which must be avoided for data integrity and quality. Each of the above Normal Forms address one or more type or degree of anomaly:
- Update Anomaly
- When data to update is stored in many places, and not all updates take place
- Insert Anomaly
- When data to insert cannot be inserted because additional information (which should be stored as its own entity) is unavailable
- Delete Anomaly
- When deleting a row causes other information to be deleted
Many of these problems caused by denormalization are caused by bad planning, faulty design, laziness, or database scope overloading.
- The Planning Factor
- Faulty Design
- Laziness
- Overloading the Design
Normalization was not a consideration when designing the database. There are many reasons for this, but from my experience it is usually one of two things: The designer is a novice and is building his or her first database, or the design is based faithfully off of a spreadsheet (affectionately referred to as Spreadsheet Syndrome). This type of problem I have seen countless times in Access-based database systems, but other user-friendly RAD RDMS systems (like Visual FoxPro) suffer from this as well.
Sometimes good intentions just aren’t enough! Poor or uninformed choices were made early on in the design process either because normalization concepts were misunderstood or business rules were not clearly defined. Database refactoring is a big deal and once you get started it is very difficult to backstep. I should note that faulty design is not limited to inadequate normalization but also to entity definitions, metadata considerations (is there any metadata at all?), and naming conventions (account_no, acct_nbr, account, acctnumb, etc…).
Laziness and selective ignorance go hand-in-hand. After all, why create 10 fully normalized entities when you can have 1! The challenge of maintaining a database with dozens upon dozens of tables can be daunting. Especially when considering the need for careful index planning and the challenge of translating many entities into a good GUI design. So the urge is to combine order information with line items, or billing information with the customer demographic data to ease the workload.
I have seen database designs where the architect was clearly trying to satisfy multiple purposes; namely, daily transactions and ad-hoc reports. When building ER diagrams, a designer (who is often the developer as well) may take a step back and wonder what it would take to produce certain reports across multiple entities in their fully normalized model. Not only is writing a report of that nature tough, but performance will certainly be an issue. These are valid concerns. But the solution is not selective denormalization. The transaction database should be specific to transactions. If the same database is needed for reporting and ad-hoc queries, consider creating views or build a separate dimensional model with real-time/right-time updates using triggers on insert, update, and delete (my preferred method and a breeze in VFP).
Denormalization
There are times when denormilization is in order for a relational database. But the rule is this: Normalize the entire design first (usually up to 4NF where possible), approve the model with business users, and build a working prototype. Then, and only then, look for ways to improve system optimization and performance through denormilization. back off of any change that requires any sort of debate. Remember that data integrity and quality are far more important in an operational system than a tiny performance improvement or a couple of saved minutes in programming time.
Of course, in a dimensional model, denormalization is encouraged!
Data Integration
The reason for this post has more to do with data integration than you might have gathered. When conducting a data profile to learn about the source data, it helps tremendously if this data has been normalized properly. Certain assumptions can then be made and the integration process can steam along. A normalized database helps the integration engineers determine which data is reliable and represents the single version of the truth for the data warehouse. If data anomalies are possible because of a bad or lazy design, it could cost your data warehousing project days, weeks, or months to account for.
Therefore, normalized operational systems are not only essential for the data integrity and quality of those systems, but also for Business Intelligence initiatives that rely on them. This is true even if no anomalies exist in the source database (because of heavy application rules and code to ensure integrity on a denormalized structure); the integration team will not be able to make the assumption that anomalies will never exist. Data value is so important for the dimensional model and Business Intelligence that precautions like these must be taken.
Data Warehousing and Business Intelligence consultant, with expertise in business analysis, data modeling, and data integration. Extensive experience developing vertical and integrated desktop and Internet applications spanning municipal, clinical, and financial industries.
November 29th, 2007 at 11:22 am
[…] The Role of Good Database Design and Normalization for Data … […]
December 4th, 2007 at 3:31 pm
[…] after a listener asked for a discussion on normalization, and having seen my recent posting “The Role of Good Database Design and Normalization for Data Integration Projects“, Andrew felt that I would be up to the […]