Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Education’


Published February 15th, 2008

Cooling Down

It is (or should be) common knowledge that you should never send an email, write a blog or forum post, or make a phone call when you’re totally ticked off about something! You are likely to say something you don’t mean or perhaps you’ll be a little too honest.

First cool down, and then respond. Easy enough, but what if you can’t wait to cool down using traditional methods (you know like, take a long hot bath)?

The solution: Simply write your name a few times on a piece of paper using your non-dominant hand. Apparently, it will force the logical side of your brain to start working, giving your emotional side a few seconds to forget why it is so upset (or sad, or excited, etc.). For all the neurosurgeons out there who might want to debate brain lateralization, I’m not the guy for you! But this technique has worked many times for me (and it recently got my sister-in-law out of a funk).

Over the past several days, I’ve also been looking into other ways to train my brain to either help in logical tasks, management tasks, programming, motivation, etc. I stumbled upon a blog entry (from Gary’s Historical Art) that spoke of the book “Drawing on the right side of the brain“. I remember this book from my childhood and was thrilled to see it has a new addition. It contains some additional information on (a) the latest developments in brain research, and (b) information on using drawing skills for problem solving. I plan to get a copy soon.

Published January 17th, 2008

SSIS Expressions (by Scott Whigham)

In yesterday’s post, I made a comment that SSIS expressions were difficult to get used to. Earlier today, I ran across this video on YouTube. If you’re new to SSIS, or you simply have a problem with expressions, check it out. Here is a blurb:

Expressions are new in SSIS (meaning that they were not in DTS) but they allow you to create extremely flexible packages. Expressions can be used to assign values to variables, help you determine whether to execute a task, and to assign properties. If you ever found yourself writing ActiveX code (VBScript) in DTS to determine which task to execute based on a variable’s value, then you probably can forego the scripting all together in favor of expressions! They are very powerful in SSIS and you’ll want to have a solid understanding of how to write them and how to use them.

I love the comedy around online help. I totally feel for him…

Hope you found this useful!

Published December 4th, 2007

Database Design: 4th and 5th Normal Forms

I had the pleasure of speaking with Andrew MacNeill of The FoxShow last week for an interview. We discussed database normalization and denormalization in relational databases. I’d like to thank Andrew for the opportunity to discuss this very important database design topic. He contacted me after a listener asked for a discussion on normalization, and having seen my recent posting “The Role of Good Database Design and Normalization for Data Integration Projects“, Andrew felt that I would be up to the challenge.

You can listen to the Interview here: FoxShow #47: Data Normalization with FoxPro and Tod McKenna.

I promised during the interview to present the 4th and 5th normal forms here on Tod means Fox because trying to explain these forms without pictures and examples is very difficult. I would also like to confess (shame on me!) that I didn’t take the time ahead of time to prepare solid examples for these forms, so my explanation of these forms in the interview was a bit lacking. I hope to clarify what I said here.

4th Normal Form

In my previous post, I defined the 4th Normal Form as a form that seeks to “remove all multi-valued dependencies”. So what exactly does this mean? A multi-valued attribute is an attribute that helps to describe the entity. In order to represent this in a table, the attribute could be repeated several times. Consider this real-world scenario:

Tod Means Fox - Boyce-Codd Normal Form Example

The town, Region, and (Congressional) District together form the candidate key. In this example, Providence has a split congressional district and exists in multiple regions. Lincoln also exists in multiple regions but only has a single district. The above table meets Boyce-Codd Normal Form because all three attributes are part of the candidate key (read more about BCNF here).

This type of table is called a ‘cross-product’, where data is repeated. It is not in 4th normal form because of this repeating data scenario. To fix, you would create two separate tables:

Tod Means Fox - 4th Normal Form Example

In 4th normal form, you reduce the need for inserts, updates and deletes and therefore you eliminate the possibility for data anomalies. It is reasonable to strive for 4th normal form as a baseline in your designs. In some cases, your database will take up less space. The cost for normalizing this far is only in query performance, as you’ll need additional joins to report on the data. But in my opinion, data integrity (especially if you plan on re-purposing your data for data warehousing and data integration projects) should always take precedence in OLTP systems. With that said, the 5th Normal Form goes even further:

5th Normal Form

The 5th Normal Form is a “projection-based normalization approach where every nontrivial join dependency is implied by a candidate key”. In the previous example, I showed how one table became two, reducing the risk of data anomalies. 5th normal form could split a single table into three or more.

Let’s say you have a table with store, supplier, and products. The table contains a row for each store, which is repeated for each supplier the store uses, and again for each product that the supplier provides to the particular store (A store might not sell all the products that a supplier provides). All fields in this table are valuable and necessary to represent the row and is in 4th Normal Form. In fact, this is a common type of table that could represent a many-to-many-to-many relationship among stores, products, and suppliers. But, notice that there is repeating data:

Tod Means Fox - Almost 5th Normal Form Example

In 5NF, this table would become 3: a store_to_prod table, a supp_to_prod table, and finally a store_to_supp.

Tod Means Fox - 5th Normal Form Example

When a store picks up a new supplier, a new row is added to the store_to_supp table, eliminating the redundancy of extra rows for product. When a store sells a new product, the supp_to_prod table will let you know where to get it. A simple example, yes — but it demonstrates 5th normal form.

As you can see, you increase the amount of relations and tables as you normalize deeper and deeper.

Denormalizing

As discussed in the interview, at a certain point it is logical to push back and denormalize. You would do this for performance and manageability purposes, or because the design of the database dictates it. In the previous example, it might not be practical from a development standpoint to bring all tables to 5th form. However, you should weigh the decision to denormalize carefully. Consider the purpose of the database. If it is a transaction-based database with many data access points, you will want to normalize as much as possible to ensure the integrity of the data. If the primary purpose of the database is for ad-hoc queries, then backing off might be well received.

As I said in my previous post, normalization is an art form. I find it difficult to explain at times, but relatively easy to implement. I hope that this post is helpful in some way, and if I have made any errors, or if you don’t agree with any part of this discussion please feel free to contact me!

Published November 8th, 2007

The Role of Good Database Design and Normalization for Data Integration Projects

Not only is a database designer charged with the task of identifying entities, attributes, and relationships (the fun stuff) but also to ensure data integrity and data quality (often a challenge). This challenge is not limited to building indexes, rules, triggers, and database procedures. Database normalization is crucial in the design of relational databases. A good design is a normalized one, which seeks to minimize duplicate data and isolates entity rows so that they contain the exact data needed to satisfy the entity definition. The entity definition is based on business definitions: “customer demographics file”, “daily orders”, “receipt line items”, “store location”, etc. The database architect takes these business definitions and translates them to entities using ER or similar diagramming. And thus begins the normalization process.

I don’t want to get too academic here, or dive into the theory behind database normalization. That discussion can be saved for another day (but you can read more here or ). But basically, normalization is based on several cumulative rules or ‘Forms’. In the list below, Forms (1NF, 2NF, and 3NF) are most widely implemented. BCNF and 4NF are used but generally are the first to be relaxed when purposefully denormalizing the database (more on this below). I don’t recall ever seeing 5NF, DKNF, or 6NF in any production environment — instead these Forms seem to be relegated to specialized object-based systems, academic, and theoretic projects.

First Normal Form (1NF)
Entity does not have duplicate columns and each row can be identified by a primary key
Second Normal Form (2NF)
The usage of foreign keys to build relationships to remove repeating subsets of data
Third Normal Form (3NF)
Removes all attributes from a table not dependent upon the primary key
Boyce-Codd Normal Form (BCNF)
The only determinants are superkeys, or a superset of candidate key
Fourth Normal Form (4NF)
The removal of all multi-valued dependencies
Fifth Normal Form (5NF)
Projection-based normalization approach where every nontrivial join dependency is implied by a candidate key
Domain/Key Normal Form (DKNF)
Domain constraints and key constraints only
Sixth Normal Form (6NF)
If interested, read about the temporal/time dimension and 6NF in the book Temporal Data & the Relational Model.

Database normalization is an art form and takes practice, good sense, and a thorough understanding of the operational process to be modeled. The reasons why we walk through this exercise before each design is simple: Denormalized data can lead to update, insert, and delete anomalies, which must be avoided for data integrity and quality. Each of the above Normal Forms address one or more type or degree of anomaly:

Update Anomaly
When data to update is stored in many places, and not all updates take place
Insert Anomaly
When data to insert cannot be inserted because additional information (which should be stored as its own entity) is unavailable
Delete Anomaly
When deleting a row causes other information to be deleted

Many of these problems caused by denormalization are caused by bad planning, faulty design, laziness, or database scope overloading.

  • The Planning Factor
  • Normalization was not a consideration when designing the database. There are many reasons for this, but from my experience it is usually one of two things: The designer is a novice and is building his or her first database, or the design is based faithfully off of a spreadsheet (affectionately referred to as Spreadsheet Syndrome). This type of problem I have seen countless times in Access-based database systems, but other user-friendly RAD RDMS systems (like Visual FoxPro) suffer from this as well.

  • Faulty Design
  • Sometimes good intentions just aren’t enough! Poor or uninformed choices were made early on in the design process either because normalization concepts were misunderstood or business rules were not clearly defined. Database refactoring is a big deal and once you get started it is very difficult to backstep. I should note that faulty design is not limited to inadequate normalization but also to entity definitions, metadata considerations (is there any metadata at all?), and naming conventions (account_no, acct_nbr, account, acctnumb, etc…).

  • Laziness
  • Laziness and selective ignorance go hand-in-hand. After all, why create 10 fully normalized entities when you can have 1! The challenge of maintaining a database with dozens upon dozens of tables can be daunting. Especially when considering the need for careful index planning and the challenge of translating many entities into a good GUI design. So the urge is to combine order information with line items, or billing information with the customer demographic data to ease the workload.

  • Overloading the Design
  • I have seen database designs where the architect was clearly trying to satisfy multiple purposes; namely, daily transactions and ad-hoc reports. When building ER diagrams, a designer (who is often the developer as well) may take a step back and wonder what it would take to produce certain reports across multiple entities in their fully normalized model. Not only is writing a report of that nature tough, but performance will certainly be an issue. These are valid concerns. But the solution is not selective denormalization. The transaction database should be specific to transactions. If the same database is needed for reporting and ad-hoc queries, consider creating views or build a separate dimensional model with real-time/right-time updates using triggers on insert, update, and delete (my preferred method and a breeze in VFP).

Denormalization

There are times when denormilization is in order for a relational database. But the rule is this: Normalize the entire design first (usually up to 4NF where possible), approve the model with business users, and build a working prototype. Then, and only then, look for ways to improve system optimization and performance through denormilization. back off of any change that requires any sort of debate. Remember that data integrity and quality are far more important in an operational system than a tiny performance improvement or a couple of saved minutes in programming time.

Of course, in a dimensional model, denormalization is encouraged!

Data Integration

The reason for this post has more to do with data integration than you might have gathered. When conducting a data profile to learn about the source data, it helps tremendously if this data has been normalized properly. Certain assumptions can then be made and the integration process can steam along. A normalized database helps the integration engineers determine which data is reliable and represents the single version of the truth for the data warehouse. If data anomalies are possible because of a bad or lazy design, it could cost your data warehousing project days, weeks, or months to account for.

Therefore, normalized operational systems are not only essential for the data integrity and quality of those systems, but also for Business Intelligence initiatives that rely on them. This is true even if no anomalies exist in the source database (because of heavy application rules and code to ensure integrity on a denormalized structure); the integration team will not be able to make the assumption that anomalies will never exist. Data value is so important for the dimensional model and Business Intelligence that precautions like these must be taken.

Published October 17th, 2007

Who’s Blogging From SWFox 2007?

Just curious, but does anyone know who will be blogging from Southwest Fox 2007?

If it might be YOU (or someone near and dear to you), then please let me know! This looks to be a good Fox-filled weekend. I’m envious and wish I could have been there.

Published October 1st, 2007

The “Not Intersect” or “Symmetric Difference” Between Two Tables

A few months back I wrote an entry on how to achieve an INTERSECT using VFP. That post was inspired by the INTERSECT command that exists in SQL Server 2005. In set theory, the intersect can be defined as follows: “x is an element of sets A and B if and only if x is an element of A and x is an element of B.”

But what about returning non-intersecting data from two sets? This would be like a union, but that we only want a list of unique rows within each table, that is, the difference. The difference between two sets could be defined as: “x is an element of set A or B when x is not an element of A and B.” In set theory, this is called a “symmetric difference” (like XOR for x-basers!).

The SQL to accomplish this is almost identical to the Intersect query. Here is the general form in VFP:

SELECT [key field], [attributes..] ;
  FROM (;
    SELECT [key field], [attributes..] FROM A ;
    UNION ALL ;
    SELECT [key field], [attributes..] FROM B ;
  ) AS tmp ;
  GROUP BY [key field], [attributes..] ;
  HAVING COUNT(*) = 1

The change is in the HAVING clause, where instead of checking to see if COUNT(*) = 2, we check to see if COUNT(*) = 1. This gives us the “Not Intersect” of two sets: perfect for determining how two tables are different! The result is a complete list of non-intersecting data.

Here’s an actual example in VFP (followed by an example using T-SQL):

*-- two sets
CREATE CURSOR crTempL (keyfld i)
CREATE CURSOR crTempR (keyfld i)
 
*-- test data
INSERT INTO crTempL VALUES(1)
INSERT INTO crTempL VALUES(3)
INSERT INTO crTempL VALUES(5)
INSERT INTO crTempL VALUES(7)
INSERT INTO crTempL VALUES(9)
 
INSERT INTO crTempR VALUES(3)
INSERT INTO crTempR VALUES(6)
INSERT INTO crTempR VALUES(9)
INSERT INTO crTempR VALUES(12)
INSERT INTO crTempR VALUES(15)
 
*-- the difference
SELECT keyfld ;
  FROM (;
    SELECT keyfld FROM crTempL;
    UNION ALL ;
    SELECT keyfld FROM crTempR;
  ) AS tmp1 ;
  GROUP BY keyfld  ;
  HAVING COUNT(*) = 1

Now, here is the equivalent in T-SQL.

-- two sets
CREATE TABLE #crTempL (keyfld INTEGER)
CREATE TABLE #crTempR (keyfld INTEGER)
 
-- test data
INSERT INTO #crTempL VALUES(1)
INSERT INTO #crTempL VALUES(3)
INSERT INTO #crTempL VALUES(5)
INSERT INTO #crTempL VALUES(7)
INSERT INTO #crTempL VALUES(9)
 
INSERT INTO #crTempR VALUES(3)
INSERT INTO #crTempR VALUES(6)
INSERT INTO #crTempR VALUES(9)
INSERT INTO #crTempR VALUES(12)
INSERT INTO #crTempR VALUES(15)
 
-- the query
SELECT keyfld 
  FROM (
    SELECT keyfld FROM #crTempL
    UNION ALL 
    SELECT keyfld FROM #crTempR
  ) AS tmp1 
  GROUP BY keyfld  
  HAVING COUNT(*) = 1

As you can see, the syntax is almost identical. Thank you ANSI SQL standards!

Can anyone come up with a better or more efficient way to find the Symmetric Difference between two tables in TSQL, VFP, or any other DB?