Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for September, 2007


Published September 28th, 2007

Log4Fox Revival: Instrumenting VFP Applications

Lisa Slater Nicholls has re-published her 2006 FoxTalk article “Log4Fox: A Logging API for Visual FoxPro” over at Spacefold. At FoxForward, some of us were introduced to the concept of instrumenting applications, and Dave Bernard gave us some great ideas to get started. Log4Fox — an article I missed in January 2006 — is a VFP instrumenting approach:

Although Visual FoxPro’s SET COVERAGE, SET DEBUGOUT, and SET EVENTTRACKING features have been steadily enhanced in each version, VFP doesn’t offer native support for fine-tuning the levels and types of trace data as completely as many other environments do. No problem; VFP’s OOP and output-handling resources make it easy to provide class libraries to handle all the required tasks.

If you’re interested in instrumentation (I certainly am), then please take a moment to read (or re-read) this article. Lisa does a marvelous job explaining how the class works and why you might want it.

Published September 24th, 2007

Delete Duplicates from a Table

The need to delete duplicate records from a table rises up all the time, whether you are importing data from a third party, integrating systems, or cleaning dirty data. Duplicate data is a reoccurring quality issue.

A duplicate records does not necessarily mean an exact copy of another record (as if you pressed CTRL+Y with SET CARRY ON in VFP). It could mean that all of the important identifying attributes are duplicated but other data, such as a surrogate key or a last updated timestamp are not. So here is a technique that I’ve used to delete a row if it is in fact a duplicate based on attributes in the table:

DELETE FROM [tablename] ;
  WHERE [primary key] NOT IN ;
    (SELECT MAX([primary key]) FROM [tablename] ;
      GROUP BY [attribute1], [attribute2], [attribute3])

To apply this to a Customer entity, for example, you can try the following (the first half of this code is setup, scroll down to see the delete):

*-- example table
CREATE TABLE Customer FREE   (;
    cust_key integer         ,;  && a surrogate key
    cust_id character(3)     ,;  && natural key
    cust_fname varchar(20)   ,;  && first name
    cust_lname varchar(20)   ,;  && last name
    cust_phone character(14) ,;  && phone number
    cust_lupd datetime       ,;  && last updated
    cust_user character(15)   )   && user last updated
 
*-- need data 
INSERT INTO Customer VALUES (1,"AM1","Andrew","MacNeill","555-5551",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (2,"AK1","Andy","Kramek","555-5552",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (3,"CC1","Cesar","Chalom","555-5553",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (4,"CB1","Craig","Baily","555-5554",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (5,"DC1","Dave","Crozier","555-5555",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (6,"DS1","David","Stevenson","555-5556",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (7,"GF1","Garrett","Fitzgerald","555-5557",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (8,"KR1","Kevin","Ragsdale","555-5558",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (9,"KR1","Kevin","Ragsdale","555-5558",DATETIME(),"ZorroMeansFox")
INSERT INTO Customer VALUES (10,"RS1","Rick","Schummer","555-5559",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (11,"RS2","Rick","Strahl","555-5510",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (12,"RS2","Rick","Strahl","555-5510",DATETIME(),"ZorroMeansFox")
INSERT INTO Customer VALUES (13,"RS2","Rick","Strahl","555-5510",DATETIME(),"FanakMeansFox")
INSERT INTO Customer VALUES (14,"SB1","Stephen","Bodnar","555-5511",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (15,"SD1","Stuart","Dunkeld","555-5512",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (16,"TR1","Ted","Roche","555-5513",DATETIME(),"TodMeansFox")
 
*-- delete duplicates (we don't care about the surrogate 
*-- key or the audit information).
DELETE FROM Customer ;
  WHERE cust_key NOT IN ;
    (SELECT MAX(cust_key) FROM Customer ;
      GROUP BY cust_id, cust_fname, cust_lname, cust_phone)
 
*-- results
SET DELETED OFF
BROWSE NORMAL

This works by selecting the most recent issuance of the primary key (using MAX) and discarding the others using the subquery. If you run the subquery on its own, you can see how SELECT MAX plus the GROUP BY clause gives you the most recent version of the row that is distinct from the others (’Rick’ and ‘Kevin’ only show once) by leaving out fields that don’t identify the row (like auditing information or perhaps de-identified attributes). Next, we delete everything that is not in this list.

This works in all ANSI SQL environments were subqueries are supported in the DELETE statement.

Published September 20th, 2007

Defining Business Intelligence and Data Warehousing

I’ve been asked on occasion to define Business Intelligence and Data Warehousing. Typically, I am not sure how to answer. I’m terrible at inventing definitions, but here is what I usually respond with: For Business Intelligence, it is “a broad term describing the acquired knowledge obtained through reporting and analysis of organizational information” and for Data Warehousing, “A subject-oriented, integrated, time-variant, non-volatile process of collecting data for supporting decisions” (close to Inmon’s definition). But those answers don’t really satisfy me or those who have asked the question.

There are dozens of ‘accepted’ definitions for Business Intelligence and Data Warehousing. To make matters more confusing, the community itself seems torn. For example, the wikipedia article on the subject is woeful at best. I’ve seen others define BI as a ‘front-end’ to Data Warehousing, which I wholeheartedly disagree with. There is also a strong misconception that Business Intelligence and Data Warehousing are one-in-the same or connected at the hip. They are not.

I believe some of the onus can be placed on vendors who sell “BI” or “BI\DW” solutions.

First, Business Intelligence

Business Intelligence is not an activity or process. It’s a result. You either have it or you don’t. This is a very important distinction, considering the level of misconception surrounding it.

How can you get it? Well, lots of ways.

Having intelligence about the business allows you to make informed decisions. The intelligence is derived from many sources. One of which is through Data Warehousing. Enterprise Resource Planning (ERP), knowledge management, and technical reporting are others, for example.

Second, Data Warehousing

Data Warehousing is a process. It is not just a database, nor is its purpose in life to just hold historical data (a common misconception). The process involves Data Profiling, Metadata Management, Dimensional Modeling, Data Integration (ex: ETL, EAI), Data Quality, Reporting, and Governance. There are many variations to this process, all of which are driven by business needs.

The Data Warehouse facilitates Business Intelligence by providing the necessary data and processes needed to integrate disparate transactional data into valuable information presented through various Applications such as executive dashboards, ad-hoc reporting tools, analytics, etc. These Applications are used by decisions-makers to make informed decisions.

It follows that Business Intelligence can be an outcome of Data Warehousing, provided that the data in the warehouse is promoted to information and then applied to decision-making.

Purpose and Justification

Strategic initiatives, goals, and competition will drive how organizations approach obtaining Business Intelligence. One organization might look at it for tactical reasons (i.e. decision support). Another might see it as a mission-critical part of running their day-to-day operations. Yet another might see it as a competitive differentiator. These drivers define the tools, techniques, and processes that might be used. Data Warehousing may solve one or more of these requirements.

Additionally, the drivers dictate the type of Applications that will be created on top of the Data Warehouse Database. Some departments may insist on printable reports or Excel sheets. Some Applications might be designed to analyze trends, issues, and events to gain insight into a process. It is very common to see outcome analysis, scoring, data exports, and some automated processes (such as bulk email operations).

The point of this entry is not to “set the record straight”, but rather to help draw the line between Business Intelligence and Data Warehousing. It is an expression of my opinion on the subject. Please understand that this is an important discussion to be having if you are involved or are planning to be involved in a Data Warehousing project.

Published September 17th, 2007

Dimension Tables

In my post, Fact Tables, I discussed three types of fact tables as defined by their grain: transactional, accumulating, and periodic. In this post, I’ll go over the basic types of dimension tables typically found in a dimensional model.

What are Dimensions?

A Dimension table is a denormalized entity in a dimensional model that contains detailed information about a fact stored in a fact table. Dimensions are independent of each other, joined through the fact table or in the case of an outrigger, to other specialized dimensions that are not connected to facts. This type of design facilitates simple many-to-many analysis where an operator can pick attributes from a broad set of dimensions and “slice-and-dice” through them via the fact table.

Every dimension contains a single-part surrogate primary key field. This primary key is used to relate to one or more fact tables in the schema. Natural keys (like an employee id) are kept in the dimension, but are not used in joins. This technique ensures a unique value for each record, simplifies joins, and keeps indexes small.

A zero-key row represents the “null” or “unknown” condition in a dimension. The fact table, if the data dimension is unknown, will use a foreign key of zero, and still link to the dimension. This facilitates “did not happen” or “does not exist” queries and related analysis on missing or incomplete data.

Dimensions can take on different forms in different contexts. The grain of the fact table dictates if a particular dimension can be a part of the star schema involving the fact. A Star schema represents a single business process and its facts are stored at a particular grain. Multiple Star schemas can exist in a single database, where dimensions are shared among business processes. A Product dimension might be used in both the “procurement” and “order” Stars for example — provided the facts are at compatible grain. A time dimension, which I discuss below, will most likely be shared by every Star.

Types of Dimensions

The following list is not exhaustive, but should give you an idea of the types dimensions that can be found in a typical dimensional model:

Primary Dimension: Primary dimensions represent the major elements of the business process. For example, in an order system, you will find Order, Salesperson, Product, and Shipper dimensions (among others). Primary dimensions are familiar to business users because they represent some tangible or recognizable aspect of the business (typically the “nouns”). Most ad-hoc queries and reports will use attributes from Primary Dimensions as predicates and in Select and Order By clauses. It follows that significant time will be devoted to end-user training on the structure and content of these Primary Dimensions.

Time Dimension: Date and time dimension are crucial for the data warehouse and to derive Business Intelligence. Most business processes are tracked and analyzed over time; these points in time are stored in the fact table along with the appropriate metrics. Instead of embedding a date in the fact table, dimensional modelers create a date dimension that stores information about each day of the year, for how ever many years is relevant to the business.

Time can be stored in a separate table as well, that is, if it makes business sense. Typically, though I see time stored as a degenerate dimension in the fact table as in integer representing seconds since midnight.

I’ll spend more time on time dimensions in a future post. The uses of this type of table extend beyond the Dimensional Model!

Junk, or “Mystery” Dimension: A Junk Dimension is a dimension that stores extra “junk” data about the fact. A junk dimension can take the place of degenerate dimensions, as well as store other semi-useful information about a particular fact. Transaction codes, operation stamps, yes/no flags, and free text attributes are good candidates for junk dimensions.

Causal Dimensions: These dimensions allow you to store, and therefore analyze external events which can measure causality of a business process. Causal data gives an extra dimension (no pun intended) to your existing data. These dimensions can help explain why a fact exits in the first place! For example, storing weather conditions might be beneficial to a retailer so that last month’s increase in uumbrella sales can be explained (it rained more!). Other causal dimensions would include things like rate specials, store promotions, a newspaper ad, or some major world event.

Outriggers: A type of dimension, outriggers are not directly related to a single fact but rather to a dimension. An example of an outrigger might be postal code, where postal and delivery point information is retrieved from the USPS in order to link and verify address data stored in an Employee, Customer, or perhaps Shipper dimension.

For more information about Dimensions, feel free to download my code examples and presentation slides from FoxForward. Also, bookmark this page as I plan to expand upon these ideas in future posts. You can also read my articles on the subject at Advisor.com.

Published September 13th, 2007

VFP9 at Amazon.com for $99?

Anyone looking for a copy of VFP9 for a very good price?

I am seeing that there is at least one used copy selling for $99. So, here’s your chance!

(I’m not sure how long this will last, usually prices hover around $450 and $500 for used copies)

Published September 13th, 2007

Hand Coding the Data Warehouse with VFP

I have posted a new whitepaper that supplements the material I presented at FoxForward and at AFUG in Atlanta. This whitepaper was published in the FoxForward 2007 Book that Kevin provided to conference attendees, which contains lots of other great session material.

This whitepaper makes the case for using Visual FoxPro as a hand-coding tool for developing data profiling, metadata, ETL, and other data warehousing applications.

Download it here.