Database Design: 4th and 5th Normal Forms
I had the pleasure of speaking with Andrew MacNeill of The FoxShow last week for an interview. We discussed database normalization and denormalization in relational databases. I’d like to thank Andrew for the opportunity to discuss this very important database design topic. He contacted me after a listener asked for a discussion on normalization, and having seen my recent posting “The Role of Good Database Design and Normalization for Data Integration Projects“, Andrew felt that I would be up to the challenge.
You can listen to the Interview here: FoxShow #47: Data Normalization with FoxPro and Tod McKenna.
I promised during the interview to present the 4th and 5th normal forms here on Tod means Fox because trying to explain these forms without pictures and examples is very difficult. I would also like to confess (shame on me!) that I didn’t take the time ahead of time to prepare solid examples for these forms, so my explanation of these forms in the interview was a bit lacking. I hope to clarify what I said here.
4th Normal Form
In my previous post, I defined the 4th Normal Form as a form that seeks to “remove all multi-valued dependencies”. So what exactly does this mean? A multi-valued attribute is an attribute that helps to describe the entity. In order to represent this in a table, the attribute could be repeated several times. Consider this real-world scenario:
The town, Region, and (Congressional) District together form the candidate key. In this example, Providence has a split congressional district and exists in multiple regions. Lincoln also exists in multiple regions but only has a single district. The above table meets Boyce-Codd Normal Form because all three attributes are part of the candidate key (read more about BCNF here).
This type of table is called a ‘cross-product’, where data is repeated. It is not in 4th normal form because of this repeating data scenario. To fix, you would create two separate tables:
In 4th normal form, you reduce the need for inserts, updates and deletes and therefore you eliminate the possibility for data anomalies. It is reasonable to strive for 4th normal form as a baseline in your designs. In some cases, your database will take up less space. The cost for normalizing this far is only in query performance, as you’ll need additional joins to report on the data. But in my opinion, data integrity (especially if you plan on re-purposing your data for data warehousing and data integration projects) should always take precedence in OLTP systems. With that said, the 5th Normal Form goes even further:
5th Normal Form
The 5th Normal Form is a “projection-based normalization approach where every nontrivial join dependency is implied by a candidate key”. In the previous example, I showed how one table became two, reducing the risk of data anomalies. 5th normal form could split a single table into three or more.
Let’s say you have a table with store, supplier, and products. The table contains a row for each store, which is repeated for each supplier the store uses, and again for each product that the supplier provides to the particular store (A store might not sell all the products that a supplier provides). All fields in this table are valuable and necessary to represent the row and is in 4th Normal Form. In fact, this is a common type of table that could represent a many-to-many-to-many relationship among stores, products, and suppliers. But, notice that there is repeating data:
In 5NF, this table would become 3: a store_to_prod table, a supp_to_prod table, and finally a store_to_supp.
When a store picks up a new supplier, a new row is added to the store_to_supp table, eliminating the redundancy of extra rows for product. When a store sells a new product, the supp_to_prod table will let you know where to get it. A simple example, yes — but it demonstrates 5th normal form.
As you can see, you increase the amount of relations and tables as you normalize deeper and deeper.
Denormalizing
As discussed in the interview, at a certain point it is logical to push back and denormalize. You would do this for performance and manageability purposes, or because the design of the database dictates it. In the previous example, it might not be practical from a development standpoint to bring all tables to 5th form. However, you should weigh the decision to denormalize carefully. Consider the purpose of the database. If it is a transaction-based database with many data access points, you will want to normalize as much as possible to ensure the integrity of the data. If the primary purpose of the database is for ad-hoc queries, then backing off might be well received.
As I said in my previous post, normalization is an art form. I find it difficult to explain at times, but relatively easy to implement. I hope that this post is helpful in some way, and if I have made any errors, or if you don’t agree with any part of this discussion please feel free to contact me!
Data Warehousing and Business Intelligence consultant, with expertise in business analysis, data modeling, and data integration. Extensive experience developing vertical and integrated desktop and Internet applications spanning municipal, clinical, and financial industries.
December 11th, 2007 at 2:02 am
hi
i m amitabh mishra,my question is that in what situation we denormalize the table. like if we want speed for searchig or for other purpose.
and second question is
>do foreign ket have some space or slow the speed of the table?
March 12th, 2008 at 7:30 am
Hi amitabh, I apologize for being very late here on my reply.
Query performance and data model simplicity are two often cited reasons to denormalize. The former is a much more reasonable reason than the latter. The latter, in fact, sounds like laziness!
OLTP systems can be complex and data integrity and quality should be a primary focus. There are solutions to solve the query and reporting performance and join complexity problem, ranging from views to operational data stores built for query speed.
Denormalized tables could potentially expose the database to anomalies especially if multiple systems interact with the database at different levels.
There is a threshold to how much you can normalize, though. You could normalize every entity up to higher and higher forms but at some point — the tipping point, if you will — maintainability and understandability will erode, rendering the model too difficult to use. You’ll enter the land of academia and theoretical models which might be far from practical.
The balance is really for the modeler to discover (based on many factors such as industry, business needs, application access, compliance concerns, data quality policies, etc.). I don’t think that there is a magic formula. Some entities in the model may be normalized to 5th, while others to 3rd.
Some scenarios where denormalization might be the answer for performance purposes: (a) to repeat attributes in a table row so that calculations can be performed without the need to join to other tables, (b) when repeating groups of rows exist, but must be processed as a group, rather than by the row, or (c) when certain attributes in a table are queries so often that it makes sense to include them in the table.
When denormalizing, consider these factors: (a) the possibility that data integrity will be jeopardized, (b) that performance gains will still not be significant enough to make the risk worthwhile, and (c) that other methods may be available to handle the performance issues associated with highly normalized data.
Personally, I think denormalization makes the most sense on tables that are read-only with a very controlled process for updates, deletes, and inserts. For example, I modeled a database 10 years ago that rests mostly in 4th normal form, with some entities in 5th, and others in 3rd, where necessary. I have a separate database in mostly 2nd form for reporting (I designed this data model before I was aware of dimensional modeling; although my design is similar, it is NOT a true dimensional model). This database is updated using triggers from the primary OLTP database. Not only does this work, but it works very well. Those denormalized tables are “read only” to everyone but the triggers. Yet, several applications use them for reporting and query purposes. It is very fast, while my OLTP database is very secure.
April 16th, 2008 at 3:16 pm
I would say a case for 3NF would be data representing a ternary relationship, i.e. where 4th & 5th become relevant, but where the entities are written like transactions, i.e. INSERTed but never UPDATEd or DELETEd. The danger of not splitting out entities where 4th or 5th is appropriate is not in creating the data, it’s in updating it (where you can easily introduce update anomalies because you are carrying redundant data) or deleting it (where you can lose all the information held about a particular attribute). And on DB2 on the mainframe we used to talk about 2-and-a-half normal form, which was a denormalisation tactic to reduce frequent joins on very read-intensive tables. CRUD analysis will sometimes highlight this sort of situation up front but more often than not it is done post hoc facto to improve performance.
April 17th, 2008 at 1:41 am
Hi Caroline, I like it. You’ve presented a very good example where 3NF is preferable over higher forms.
One thing I always come back to is the original intent of normalization: To reduce the number of data anomalies that can occur during INSERT, UPDATE, and DELETE operations. Normally, if tables are up to 3NF, then INSERT would never be a problem due to the constraints set up by primary and foreign key relationships. But UPDATEs and DELETEs do create potential issues. So, if you are not going to update or delete a row from a table (or there is little likelihood), then it is safe to denormalize an otherwise 4th or 5th model “down” to 3rd.
I’d love to hear more about 2-and-a-half normal form too. And I admit, I have never used CRUD analysis for the purposes of spotting opportunities to denormalize. That’s a great tip!!