Archive for November, 2007

Graphical Execution Plan Notes for SQL Server 2005

I often need a quick execution plan reference for query optimization in SQL Server 2005 that gives me a definition of the icon and a link to more details. The resource on page ms175913.aspx from the msdn website is very helpful but a bit incomplete.

For my personal use, I compiled the following document. I include the description of each plan icon and removed all items that do not exist in SQL Server 2005. The letters under the item name represent its type (N/A=none; P = Physical, L = Logical). If the item has any important optional Arguments, they are listed under the type. To keep the document short, details on optional arguments are not included.

You can download the document: ‘Graphical Execution Plan Notes‘. It is a zip file that contains a single Word document. If you have any suggestions on how to improve it, let me know!

Here’s a preview:

 

Graphical Execution Plan Notes

 

And here are some links to other blogs that discuss query optimization in SQL Server:

And a great book:
Inside Microsoft® SQL Server(TM) 2005: Query Tuning and Optimization

The document can be downloaded here, or on my download page. I hope you find this useful!

 

Tags: ,

2 Comments

34 Subsystems of ETL Data Integration

In this, and in the next series of posts, I will be exploring the 34 subsystems of ETL Data Integration (note: there used to be 38) as defined by the Kimball Group. I introduce the subsystems in this post, and then I will discuss how each fits (or does not fit) into both a SQL Server environment and in a hand-coded environment using VFP9.

The subsystem concept is a best-practice initiative formulated by The Kimball Group to help organizations design effective and efficient Data Integration environments for Data Warehousing using the Dimensional Model. As you’ll see, the subsystems are well-defined, battle-tested, and represent a relatively complete picture of what is needed for ETL Integration.

The Kimball Group categorizes the subsystems into 4 distinct groups: Data Extraction, Cleansing and Conforming Tasks, Data Delivery, and Management. Management is an often forgotten aspect of ETL (If the acronym was ETLM, we’d be OK!). You’ll notice that there are quite a few management-specific subsystems.

    Data Extraction

  1. Data Profiling
  2. Change Data Capture (CDC)
  3. Extraction
  4. Cleansing and Conforming Tasks

  5. Data Cleansing Subsystem
  6. Error Event Management
  7. Auditing
  8. Removing Duplicates
  9. Data Conformance
  10. Data Delivery

  11. Slowly Changing Dimensions (SCD)
  12. Surrogate Key Generator
  13. Hierarchy Manager
  14. Special Dimensions Manager
  15. Fact Table Builders
  16. Surrogate Key Management
  17. Bridge Table Builder
  18. Late Arriving Data Handler
  19. Dimension Manager
  20. Fact Table Provider
  21. Aggregate Generation
  22. OLAP Cube Builder
  23. Data Propagation Manager
  24. Management

  25. Scheduler
  26. Backup System
  27. Recovery and Restart
  28. Version Control
  29. Version Migration
  30. Work flow Monitor
  31. Sorting
  32. Data Lineage and Dependency
  33. Problem Escalation
  34. Paralleling and Pipelining
  35. Security
  36. Compliance Manager
  37. Metadata Repository

Data Integration architects have a lot to be thankful for the work and effort put forth by the Kimball Group. As you can probably guess, to do ETL right, you’ll need adequate time and resources. It is no wonder that data integration is by far the most time-consuming aspect of data warehousing.

In subsequent articles I will define each of these systems, provide examples, and discuss how they can be accomplished in SQL Server as well as VFP9. I have a lot of work to do, so stay tuned!

Tags: , , ,

43 Comments

Dimensional Modeling: Loading The Slowly Changing Dimension

In my entry “Dimension Tables”, I discussed some of the different types of dimension tables that could exist in a dimensional model. In this entry, I will discuss how they are loaded.

Dimensions change over time: Data entry errors are fixed, new customers arrive, employees get married and change their last names, products are re-categorized, and new store promotions are initiated. There are several challenges associated with these updates — challenges that do not exist in operational database applications. To face these challenges, the concept of the Slowly Changing Dimension has emerged.

There are 3 types of Slowly Changing Dimensions:

Type 1 SCD
When a type 1 attribute changes, it is Ok to overwrite the attribute on the current or all matching rows
Type 2 SCD
When a Type 2 attribute changes, make a copy of the row, mark the copy as ‘expired’ or ‘old’, and place the new Type 2 attribute change into the new row
Type 3 SCD
These represent alternate realities and keep the previous and current value of an attribute in a row

There is also a Type 1.5 SCD which is used to keep very large dimensions from expanding with too much unnecessary history. Consider a case where a Customer dimension is loaded from several sources. Several Type 2 attributes may not be available at the time the row is first loaded. When this data becomes available, it is updated as if it were a Type 1 change, rather than Type 2. If the customer then changes the attribute at a future date, the attribute would be treated as a Type 2. In this scenario, additional information must be made available to the ETL program so that it knows how to treat the attribute(s).

Inferred Members (which are treated somewhat like a Type 1.5 change) exists when a fact table references a dimension row that is not yet loaded. A new row is added to the dimension with a new surrogate key along with whatever additional data is available (at a minimum, the natural key). This almost-empty dimension row would be updated appropriately when the data becomes available. Just like the Type 1.5 change, additional information must be made available to the ETL program so that it knows how to treat the attribute(s) and not create unnecessary history.

To make Type 2 Slowly Changing Dimensions work, the dimension must have some attribute that distinguishes the active or current row from historical or expired rows. This can be done in one of three ways (my preferred method is to use dates):

  1. Dates: Create a from_date and a to_date in the dimension. The active row will have a null ‘to_date’.
  2. Flag: Create a status flag to distinguish active rows from inactive rows. Only one row should be active at a time for a give natural key.
  3. Version: Either a sequence number or some other number that will indicate which row is most recent in the dimension.

To simplify the discussion, when deciding on what SCD Type to tag an attribute, you need to ask yourself (well, ask your business users!): Do I need to keep a history for this attribute? Depending on your answer, consider the following:

Type 1

When an attribute marked as a “Type 1″ attribute is changed, the new value overwrites the existing value in the dimension without keeping a history. There are usually two flavors of this Type: Overwrite all that match the key (SCD subtype 1) or only overwrite the current row (SCD subtype 2). The “current” row is determined by the date, status, or version of the row.

Type 2

Type 2 changes always keep a history by marking the current or active row (determined by the date, status, or version attribute) as inactive and inserting a brand new row into the dimension with matching values from the previous row, plus any Type 2 changes.

For Hand-coders

If you know me, than you know I like to store this sort of information as metadata. For example, in my integration metadata (my Logical Data Map), I have a table that stores all the attributes for each dimension. I have a column called SCD_Type and SCD_SubType. Here is a view of this metadata taken from one of my hand-coded projects:

Slowly Changing Dimension metadata view

In pseudo-code, the processing would look like the following:

For each row
  If any attribute marked as 'Type 2' changes Then
    update status of active/current row
    insert into dimension a new row
  End If
  For each attributes marked as a 'Type 1' change Then
    If attribute is subtype 1
        update all rows that match the key
    Else
        update only the current/active row that matches the key
    End if
  End For
End For

As you can see, I handle Type 2 changes first and then process Type 1 changes. This ensures that historical data is kept in tact and that the correct active row is updated if the subtype is marked as 2. Additional logic is needed to process inferred members and Type 1.5 as mentioned earlier.

When no SCD Types Are Assigned

Not all dimensions contain SCD attributes. A Temporal/Time dimension, for example, does not change (January 15th, 2008 is always January 15th, 2008). These and other dimensions are simply appended to, such as a weather or promotion dimension (i.e. Causal Dimensions). I have found that dimensions that operate over time work this way, although this is a generalization and not a rule.

In my metadata, I usually tag these types of non-SCD dimensions as type 0. My ETL code then treats these as a pure insert (or Upserts depending on the nature of the dimension) if all attributes for the dimension are marked as Type 0.

SSIS

The Slowly Changing Dimension Wizard in SSIS makes a lot of this work easy for you. In my opinion, it is one of the better transformations available through SQL Server. For very large dimensions though, I have found that the SCD transformation does not perform as well as using conditional transforms and lookups on filtered data. I will be sure to talk more about this in a future post. I should also mention that hand-coding SCD logic is tricky; my pseudo-code posted above is only a bird’s eye view of the logic involved.

Tags: , , , , , ,

8 Comments

The Role of Good Database Design and Normalization for Data Integration Projects

Not only is a database designer charged with the task of identifying entities, attributes, and relationships (the fun stuff) but also to ensure data integrity and data quality (often a challenge). This challenge is not limited to building indexes, rules, triggers, and database procedures. Database normalization is crucial in the design of relational databases. A good design is a normalized one, which seeks to minimize duplicate data and isolates entity rows so that they contain the exact data needed to satisfy the entity definition. The entity definition is based on business definitions: “customer demographics file”, “daily orders”, “receipt line items”, “store location”, etc. The database architect takes these business definitions and translates them to entities using ER or similar diagramming. And thus begins the normalization process.

I don’t want to get too academic here, or dive into the theory behind database normalization. That discussion can be saved for another day (but you can read more here or ). But basically, normalization is based on several cumulative rules or ‘Forms’. In the list below, Forms (1NF, 2NF, and 3NF) are most widely implemented. BCNF and 4NF are used but generally are the first to be relaxed when purposefully denormalizing the database (more on this below). I don’t recall ever seeing 5NF, DKNF, or 6NF in any production environment — instead these Forms seem to be relegated to specialized object-based systems, academic, and theoretic projects.

First Normal Form (1NF)
Entity does not have duplicate columns and each row can be identified by a primary key
Second Normal Form (2NF)
The usage of foreign keys to build relationships to remove repeating subsets of data
Third Normal Form (3NF)
Removes all attributes from a table not dependent upon the primary key
Boyce-Codd Normal Form (BCNF)
The only determinants are superkeys, or a superset of candidate key
Fourth Normal Form (4NF)
The removal of all multi-valued dependencies
Fifth Normal Form (5NF)
Projection-based normalization approach where every nontrivial join dependency is implied by a candidate key
Domain/Key Normal Form (DKNF)
Domain constraints and key constraints only
Sixth Normal Form (6NF)
If interested, read about the temporal/time dimension and 6NF in the book Temporal Data & the Relational Model.

Database normalization is an art form and takes practice, good sense, and a thorough understanding of the operational process to be modeled. The reasons why we walk through this exercise before each design is simple: Denormalized data can lead to update, insert, and delete anomalies, which must be avoided for data integrity and quality. Each of the above Normal Forms address one or more type or degree of anomaly:

Update Anomaly
When data to update is stored in many places, and not all updates take place
Insert Anomaly
When data to insert cannot be inserted because additional information (which should be stored as its own entity) is unavailable
Delete Anomaly
When deleting a row causes other information to be deleted

Many of these problems caused by denormalization are caused by bad planning, faulty design, laziness, or database scope overloading.

  • The Planning Factor
  • Normalization was not a consideration when designing the database. There are many reasons for this, but from my experience it is usually one of two things: The designer is a novice and is building his or her first database, or the design is based faithfully off of a spreadsheet (affectionately referred to as Spreadsheet Syndrome). This type of problem I have seen countless times in Access-based database systems, but other user-friendly RAD RDMS systems (like Visual FoxPro) suffer from this as well.

  • Faulty Design
  • Sometimes good intentions just aren’t enough! Poor or uninformed choices were made early on in the design process either because normalization concepts were misunderstood or business rules were not clearly defined. Database refactoring is a big deal and once you get started it is very difficult to backstep. I should note that faulty design is not limited to inadequate normalization but also to entity definitions, metadata considerations (is there any metadata at all?), and naming conventions (account_no, acct_nbr, account, acctnumb, etc…).

  • Laziness
  • Laziness and selective ignorance go hand-in-hand. After all, why create 10 fully normalized entities when you can have 1! The challenge of maintaining a database with dozens upon dozens of tables can be daunting. Especially when considering the need for careful index planning and the challenge of translating many entities into a good GUI design. So the urge is to combine order information with line items, or billing information with the customer demographic data to ease the workload.

  • Overloading the Design
  • I have seen database designs where the architect was clearly trying to satisfy multiple purposes; namely, daily transactions and ad-hoc reports. When building ER diagrams, a designer (who is often the developer as well) may take a step back and wonder what it would take to produce certain reports across multiple entities in their fully normalized model. Not only is writing a report of that nature tough, but performance will certainly be an issue. These are valid concerns. But the solution is not selective denormalization. The transaction database should be specific to transactions. If the same database is needed for reporting and ad-hoc queries, consider creating views or build a separate dimensional model with real-time/right-time updates using triggers on insert, update, and delete (my preferred method and a breeze in VFP).

Denormalization

There are times when denormilization is in order for a relational database. But the rule is this: Normalize the entire design first (usually up to 4NF where possible), approve the model with business users, and build a working prototype. Then, and only then, look for ways to improve system optimization and performance through denormilization. back off of any change that requires any sort of debate. Remember that data integrity and quality are far more important in an operational system than a tiny performance improvement or a couple of saved minutes in programming time.

Of course, in a dimensional model, denormalization is encouraged!

Data Integration

The reason for this post has more to do with data integration than you might have gathered. When conducting a data profile to learn about the source data, it helps tremendously if this data has been normalized properly. Certain assumptions can then be made and the integration process can steam along. A normalized database helps the integration engineers determine which data is reliable and represents the single version of the truth for the data warehouse. If data anomalies are possible because of a bad or lazy design, it could cost your data warehousing project days, weeks, or months to account for.

Therefore, normalized operational systems are not only essential for the data integrity and quality of those systems, but also for Business Intelligence initiatives that rely on them. This is true even if no anomalies exist in the source database (because of heavy application rules and code to ensure integrity on a denormalized structure); the integration team will not be able to make the assumption that anomalies will never exist. Data value is so important for the dimensional model and Business Intelligence that precautions like these must be taken.

Tags: , ,

4 Comments

VFPBuild to build Visual FoxPro Applications Using MSBuild

As I mentioned last week, I have been using MSBuild for some time now to build and deploy complex data warehousing applications. Many parts of this project involve Visual FoxPro. I wrote that I had not yet experimented doing builds with VFP using MSBuild. Alan Stevens quickly pointed me towards his CodePlex project “VfpMsBuildTarget” (aka:VFPBuild).

The goals of the project are simple:

  1. Provide a GUI for including projects/files in a scheduled build, and set their appropriate properties
  2. Create a VFP build target to be used by the MSBuild engine, the build system found in Visual Studio.

While the first goal might be a ways off, the second is close to being complete.

I had a chance to look at this project, test it, debug it a little, and compile a short list of improvements I would like to see (or add myself through CodePlex). The bugs were minor and easy to fix, and my list of improvements is pretty short. It should be noted that the body of work gone into this Task already is excellent. With little effort, I was able to get it going. And with a little more effort, it can be used in production. Note: Before you can have a go at it, you will need to compile vfpprojectbuilder.dll in VFP and VfpBuild.dll in C#.

Documentation

We’ll need some documentation. Alan knows this (and has promised some soon). Even without though, I made it through with little problem. I think I am at an advantage because I (a) know VFP well, (b) know C# well, and (c) know MSBuild well. We would need instructions on how to compile the dlls as well as how to construct the MSBuild XML. If you’re new to MSBuild, you might struggle with setting up the xml. For a simple test, here is a sample screenshot. I would have provided the code but the XML would have been eaten alive by the browser:

Visual FoxPro MSBuild VFPBuild sample


To execute the above, your cmd would look similar to the following:

C:>msbuild projectx.msbuild

Functionality

It doesn’t seem to be too much effort to add support for lRebuildAll, lShowErrors, and lBuildNewGUIDs of the project.build() method. Other than that, it gives you the ability to set build options and to specify a project to load.

From Here

I would like to see goal #2 of this project buttoned up soon. I have a production environment (actually, a few) that could really utilize a stable assembly like this. The build I downloaded was a bit buggy (there were references to class properties that did not exist, for example this.nLevel and this.nBuildAction). Fixing these, plus some enhanced bullet-proofing (defensive coding) and documentation would get us there I believe.

I’ll post more as I go. For now, you can get the active VFPMSBuildTarget project files here.

Tags: , , , ,

2 Comments

Dave Aring’s Lazarus

Great post alert.

Lazarus (Heb. “God hath helped”)

Dave Aring posts about Southwest Fox 2007. Check it out, and keep your eye on his blog, he promises more!

Tags: ,

1 Comment