Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for June, 2008


Published June 24th, 2008

ETL Subsystem 14: Surrogate Key Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Surrogate key management is an important part of designing an ETL system using the Kimball Methodology. Slowly Changing Dimensions require the use of surrogates because natural keys, like a Customer ID, Product ID, and so on, will be repeated in dimensions whenever Type 2 changes are tracked. Surrogates are also desirable because they are single-part and fast for joins. I’ve already discussed the need for surrogates in several posts. Here, I’ll talk about some techniques for replacing the natural keys that come in from the source systems with the new surrogate keys that exist in the data warehouse.

The idea is simple in theory: For each dimension row, you generate a surrogate key. This surrogate key replaces the row’s candidate key and is used to link to one or more Facts.

For example, the candidate for a Customer dimension may be CustomerID + ActiveFromDate. The ActiveFromDate is used to sequence the row if Type 2 change tracking is used. It is entirely possible to have the same CustomerID listed dozens of times in the dimension. The differentiator is the ActiveFromDate. This combination is guaranteed to be unique in the Customer dimension (although you would not normally enforce this using an index, you would certainly enforce this through ETL). This is a simple example. When modeling Parts or Real Estate, you may have several attributes combining to form a candidate key. This also becomes a bit more complex when Customer data is refreshed throughout the day — meaning you cannot rely on the date alone, you also need to consider time.

In dimensional modeling, you will normally process dimensions before facts. Because of this, you will have all surrogates defined for each row you need. In a subsequent post, I’ll talk more about situations where you load a fact before a dimension, or situations when late arriving data must be handled. For now, assume that all dimensions have rows that correspond to a fact. When you process the fact, you simply need to look up the surrogate key assigned in the dimension and use that key as the foreign key to that dimension in the fact.

Dimensional modeling essentially requires surrogate keys, and does so by taking a performance hit during ETL. It takes time to look up surrogate keys. You will spend a lot of time yourself making this process as fast and efficient as possible. I do have a few suggestions, though:

Load by the Day

Process facts one day at a time (all of yesterday’s data plus any changed data or late arriving data from previous days). Doing so allows you to join your staging fact row by the natural key in the dimension for that day. For example, in SQL, you could acquire all dimensions that are valid for a particular day by doing the following:

SELECT DISTINCT NaturalKey 
    FROM Dimension 
    WHERE DataDate BETWEEN ActiveFromDate AND ActiveToDate

This returns a subset of the dimension for the day in which the fact occurred (DataDate). If the candidate key involves more than the natural key, you will need to adjust the above SQL to be more robust. For example, you may have the name and birthdate of a patient at a hospital with an insurance number and insurance company code that uniquely identifies the row! That’s six attributes (if you include the ActiveFromDate) that uniquely identifies the row when combined.

With the above, you can incorporate the query into a JOIN on the staging fact data (by selecting all facts that occurred on DataDate), or you could use it as the source for a lookup operation. By narrowing the focus to be a single day (and you load by the day), you eliminate the need to worry about SCDs.

You can further enhance the above by maintaining special staging tables for each of your dimensions. These staging tables contain all candidate attributes and the surrogate key for each dimension. These special staging tables are maintained while you process your dimensions. Instead of selecting against the actual, production dimension, you can use the staging table. Although there is a little extra overhead in maintaining the staging tables, you get it all back — and then some — when you do your key lookups.

If you need to load by time of day (essential for real-time or intra-day ETL), you can still use the above method, except that you will need to join your fact data on time of day as well.

You can get around using BETWEEN if you are processing the most recent data. Instead, you could filter results by the dimension’s “current” flag. The current flag is set while processing the dimension. Only one row, for particular natural key, can be marked as current. A predicate on a current flag would be faster than checking between dates. A common trick is to split the rows into two streams: One with current data (anything that happened yesterday), and one with historical data (anything that happened the day before yesterday and back). Use the current flag for the current data, and use the BETWEEN for historical data.

I find that this technique works well for small dimensions with a low number of facts (a few thousand transactions a day). With a good index plan, this might be all you need to do.

Store the Surrogate Right Away

Another technique worth mentioning is particularly useful when dealing with very large dimensions and lots of facts. You should also consider this technique if you are manually generating your surrogates (see my post “ETL Subsystem 10: Surrogate Key Generator” for specifics). Basically, when you generate your surrogate key for a dimension row, store it with the fact data at the same time.

This technique requires writing the surrogate to the dimension and the fact staging table. Using an UPDATE statement is a simple approach, but a more exotic and better performing method would be to INSERT the new keys into another staging table, and joining to the staging fact table later on.

Either way, there is a performance hit with storing they key right away, so be sure to thoroughly review your environment and needs. I have used this technique before with great success, but it took me a long time to determine it was the best approach. In fact, my original design worked great until the dimension grew to a few million rows and the transaction load doubled. Doing a lookup using BETWEEN proved to be too slow compared with saving the key immediately.

SQL Server 2005 Integration Services (SSIS)

Jamie Thomson wrote an article on how to use SSIS to update fact tables (actually, most of it is a case study written by “Mag”). I won’t repeat what was in that post. Have a look. Jamie’s technique is basically what I have come to adopt myself. You’ll notice the liberal usage of Lookups instead of Joins (joins, at least for me seemed like a more logical starting point). Lookups perform better for a few reasons. The Merge Join operation requires a Sort — either by using the Sort Component or by ORDERing the data through the Source Component. This sorting can be expensive. Lookups, in addition to not requiring that the input be sorted, can use caching and limit access to the database. For these reasons, I use Lookups now almost exclusively.

When using the Lookup approach you must be aware of two possible conditions: The Lookup could return more than one match if you don’t properly define your candidate key. Also, if no match is found, common for late arriving data, you need to either divert the errors and handle them separately, or convert the NULLs to your default “unknown” value for each dimension. For me, I typically use negative numbers to define various types of null (-1 for “not known”, -2 for “not applicable”, etc…).

If you process by the day, use staging tables as dimension lookups, and rely on Lookups instead of Sorts and Merge Joins, then fact processing should go quite well.

Hand Coding with Visual FoxPro (VFP9)

As I discussed in my post “ETL Subsystem 10: Surrogate Key Generator “, VFP’s SEEKing capability far outshines set-based approaches to this problem. Check out my comments in that posting for some ideas on how to use VFP to get the surrogate keys for your fact data.

Not to get lost though in the wonders of SEEK, VFP is perfectly suited for set-based approaches as well. So all of the methods discussed in the article (Joins and lookups) can be implemented easily in FoxPro. As I wrote in that post, VFP really shines for this type of task.

From Here

Next post, hopefully later this week, I’ll discuss bridge tables and how to best build them. I have an interesting case study to walk through. I have refrained from using cases in my ETL subsystem posts, but for bridge tables, I am making an exception!

Published June 16th, 2008

ETL Subsystem 13: Fact Table Builders

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Fact tables are the heart and soul of the business process dimensional model. I discussed fact tables already in a previous post (”Fact Tables“), so I won’t repeat myself here. Recently, I also provided a more formal definition of a fact table as the: “central table of a Star Schema with numeric performance measurements, identified by a composite key, each of whose elements is a foreign key drawn from a dimension table”.

Three grains define fact tables: Transaction-level, Accumulating Snapshot, and the Periodic Snapshot grain. Refer to my post referenced above (”Fact Tables”) for more information on granularity.

Subsystem 13 is responsible for the construction of all types of fact grains.

Transaction grain fact tables are always the largest and should contain data at the finest grain possible. This would include a line on an invoice, a patient diagnosis for a series of emergency room events, or even a Josh Beckett fastball. Data is almost always inserted (via bulk load) into transaction-grain facts. Deleting rows is dangerous (more on this in a second), and updates should only occur in order to correct an error or update some late-arriving key. In some cases, I have even gone back into a transaction grain fact table in order to update a calculated field.

Instead of deleting data in a transaction-grain fact table, “negate” the row instead. Otherwise, you run the risk that some of your historical reports and analysis will be inconsistent. To negate a row, simply add a new row to the fact, with identical foreign keys to the dimensions as the row you are negating, but update the metric so that when both rows are summed (or counted, etc…) they result in zero activity. For some metrics, this is tricky, but if you are using mostly additive and semi-additive metrics in your facts, you should be OK with this approach.

For the periodic snapshot grain, there are some differences in the loading process to consider. First, as these facts deal with aggregate data (based on daily, weekly, monthly, quarterly, etc…) time spans, it is appropriate to only load them when a period is complete. Secondly, extremely careful consideration for the grain of the periodic snapshot must be honored because it is too easy to apply an aggregate calculation inappropriately against the stated grain. Third, and as a compliment to my first point, it may be necessary to aggregate some running totals. If for example you are building a monthly periodic snapshot for a checking account at the bank, you would normally also provide data from the end of the last period to date.

Accumulating snapshots represent some process as it evolves over time. Therefore, rows in the fact are constantly updated to represent this evolution. For example, a patient may enter the ER, check in at triage, get assigned a room, see the nurse, see the doctor, etc. A building inspector may visit a new construction before work is done, when the foundation is set, when the framing is up, when the electrician as finished, etc. Loading this type of snapshot is much different from the other two, in that it tracks some defined process over time in a single row. There are many updates, and additional logic is needed to handle the loads.

SQL Server 2005 Integration Services (SSIS)

I find working with the various types of fact tables in SSIS to be rather simple. The key for success is usually in how the Control Flow is sequenced (facts generally come last), when dimensions are loaded, and where you do your dimension key lookups. In essence, all you need is a source containing your metrics (line items on an invoice pulled during the extract phase), a series of lookups (or joins, whichever provides the best performance in your situation), and some way to load the fact (bulk insert if possible).

Some advice: Do your UPDATEs in an Execute SQL Task. Do not try to use the OLD DB Command. Especially — and I mean especially - if you are doing a large amount of updates and your fact table is also large. To get this to work, offload the data you need for your updates into a separate staging table. Use the Execute SQL Task and write an appropriate UPDATE statement which joins your fact to your new staging table. The same approach holds true for any rows you might be deleting.

Hand Coding with Visual FoxPro (VFP9)

I find working with fact tables in FoxPro to be just as easy as in SSIS. There is no clear advantage of one over the other. One major benefit with VFP is that it is an excellent OOP language with very fast cursors (a VFP Cursor is much easier to work with than an SSIS Raw file). This gives you many more options on how you look up and populate foreign keys for the fact at load time.

If you are using VFP tables for your RDBMS, you will need to be careful for the 2 gig size limitation for your fact tables. Your Fact Table Builder may have to handle partitions manually. This is a major disadvantage with using the VFP database. Especially because 2 gigs is not a lot of space for daily transactions!

From here

In the next article, I’ll discuss ETL Subsystem 14: Surrogate Key Manager. Subsystem 14 helps to maintain the integrity among facts and dimensions for each business process, and is a complement to your Fact Table Builders.

Published June 13th, 2008

A Data Warehouser’s Vocabulary (Part 2)

This post is part 2 (read part 1) of a series of posts containing a glossary of terms and concepts that I feel has some relevance to the data warehousing and business intelligence world. Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! When finished, these terms will be compiled and made a static page on TmF.

Aggregation
The process of redefining data into a summarization based on some rules or criteria. Aggregation may also encompass de-normalization for data access and retrieval.
Analytical Processing
Producing analysis for management decisions, usually involving trend analysis, drill-down analysis, demographic analysis, profiling, and so on.
Attribute
Any detail that serves to qualify, identify, classify, quantify, or express the state of an entity.
Data Mining
The process of analyzing large amounts of data in search of previously undiscovered business patterns.
Dimension
A denormalized table in a dimensional model with a single part primary key and descriptive attribute columns.
Event
A signal that some activity (usually a business transaction) has occurred.
Fact
Central table of a Star Schema which numeric performance measurements identified by a composite key, each of whose elements is a foreign key drawn from a dimension table.
Heuristic Analysis
Heuristic Analysis is a method to help to solve a problem, commonly informal. It is particularly used for a method that often rapidly leads to a solution that is usually reasonably close to the best possible answer. Heuristics are “rules of thumb”, educated guesses, intuitive judgments or simply common sense.
Online Analytical Processing (OLAP, also MOLAP)
On-line retrieval and analysis of data to reveal business trends and statistics not directly visible in the data directly retrieved from a data warehouse. Also known as multidimensional analysis.
Outrigger
A secondary dimension table attached to a dimension table. An outrigger is not used to normalize a dimension.
Relational OLAP (ROLAP)
“Relational” OLAP, in which the OLAP processes use a relational, normalized model for its source.
Slowly Changing Dimension (SCD)
The tendency for dimension attributes to change gradually or occasionally over time. The techniques for handling these changes include Type 1 (overwrite), Type 2 (keep history), and Type 3 (alternate realities).
Snowflake
A normalized dimension where a flat, single dimension table is deconstructed into a tree structure with potentially many nesting levels. Snowflaking a dimension generally compromises user understandability and browsing performance.
Snowflaking
The (undesirable) act of normalizing a dimensional model.
Star Schema
A generic representation of a dimensional model in a relational database in which a fact table with a composite key is joined to a number of single level dimension tables, each with a single primary key.

Published June 6th, 2008

Fun with Serial Dates

I ran into a bit of an oddity today while working with serial dates in SQL Server 2005 and MS Excel 2003. I’ve dealt with this before with Excel using FoxPro, but don’t recall the specifics. Anyway…

A serial date is a sequential number, starting on January 1st, 1900 (or if you want compatibility with a Mac serial date, 1904) that represents the number of days since any date to that point in time (you can also represent time by using a decimal, but I won’t get into that here). Excel has a bug (or a “Lotus 123 compatibility feature” as some like to call it) in it that counts Feb 29, 1900 as an actual date, when in fact 1900 was not a leap year.

Hence the first bit of fun. Run this TSQL code:

SELECT DATEDIFF(dd,'19000101','20080606')

The above query returns ‘39603′. So far so good. In SQL Server, you can now derive the date from the serial value by using convert:

SELECT CONVERT(DATETIME,DATEDIFF(dd,'19000101','20080606'))

…which gives me ‘2008-06-06 00:00:00.000′. So far so good.

The reason we’re using serial dates in the first place is to accommodate some models developed in Excel. The data is fed in directly from the database and calculations are performed.

When I take value 39603, put it into an Excel cell and re-format the column from General to Date, I get ‘June 4, 2008′! Two days off from what I get from SQL Server.

I already mentioned that Excel counts Feb 29, 1900. That’s one day. So why am I off by two? Here’s some more fun. In SQL Server, try the following:

SELECT CONVERT(DATETIME,0)

You get ‘1900-01-01 00:00:00.000′. But January 1st is supposed to be day one. Not day zero (must be a zero-based array sort of thing). When you try the same in Excel, you get a more appropriate (but equally odd) value of ‘January 0, 1900′. You’d think I’m dealing with scientific ephemeris or something.

SELECT CONVERT(DATETIME,1)

So, in SQL Server, we’re off by a day. The first day shouldn’t result in the second day (’1900-01-02 00:00:00.000′) of the year as the above TSQL shows.

This helps to explain the 2 day offset between Excel and SQL Server.

Come on Redmond! Work with me here!

To solve (read: work around) the problem, we’ve decided to stick with keeping Excel happy (against my recommendation, but at the end of the day, this is a business decision). Our data integration packages add two days to the serial date we calculate from a YYYYMMDD value. When data is moved to Excel from the data warehouse, no transformation is done on the serial value and Excel calculates as expected. When we use the serial date in a SQL Server context, we need to subtract 2 before the convert.

This is not a very good design and will undoubtedly cause someone problems some day.

I recommend storing the ISO YYYYMMDD format instead. But Excel won’t format a value such as 20080606 to a date. What a shame! If anyone knows of a way, or can enlighten me more about this subject — do not hesitate!

Published June 5th, 2008

Business Processes and the Integrated Enterprise

It’s time to think about business processes.

In a recent post, I defined a business process as “the complete response that a business makes to an event”. Because this is such an important topic for data warehousing, I thought I’d share some additional thoughts.

IntegrationBusiness processes include such activities as accounts receivable, orders, sales, and inventory management. Each process has a specific event (or goal) that defines the process and in many cases allows us to gauge the health of that process. For example, an order is an event within the orders business process. Inventory movement is an event within the inventory management process. And so on.

For a few years now, there has been a significant push — mainly by service oriented (SOA) and data warehousing architects — to get businesses to think more about business processes and not about departments, applications, and technologies. Traditionally, most organizations have structured IT around specific software purchases and departmental needs. Integrating these disparate systems later becomes a significant challenge for business intelligence, performance management, and master data initiatives.

James Gibson, in his research piece “A Research Strategy for Investigating Business Process Management Approaches”, wrote that it’s time to start thinking about process and process processing rather than data and data processing (I had to read that more than once too!). The key is that the business process — which is tied to a specific event — is a driver that can lead all other initiatives along. Actionable insights (typically what you hope to derive from your Business Intelligence and Performance Management initiatives) are only useful if they’re tied to a process that can be improved.

Thinking more about business processes, and developing architectures to support them, leads to a more integrated enterprise

Data Warehousing with dimensional modeling is solely focused on the business process. In fact, you cannot develop a true dimensional model without modeling it around some business event. And it should be clear that a single business event can span multiple source systems and departments. The dimensional model pulls all this together.

On the transactional and operational side of the fence, SOA is the right approach to take. Essentially, SOA provides a standard way to access myriad resources across a network through RPC, Web Services, and APIs (among other techniques). One application can communicate with another in real-time.

Developing an SOA and a Data Warehouse one-process-at-a-time is smart. I will talk more about this in a future posting, but the idea is simple: start with a single business process that will make the most impact and is most feasible. Then, in an iterative way, expand into additional processes. This allows development to quickly turn over key functionality while leaving room to resolve business process volatility issues and political ramblings. If you are lucky enough to be starting both data warehousing and SOA programs simultaneously, it makes most sense for the same business process to be the subject for both!

Master Data Management is about data governance and forms a core part of the integrated enterprise. Through SOA, applications can access master copies of shared entities, such as Customer and Product. Master data might be derived partly from a data warehouse using ETL and partly by operational applications in a transactional environment through SOA. When it is time to embark on an MDM initiative, it makes a lot of sense to start thinking about business processes, conformed dimensions, and how to maintain this critical data.

So imagine for a moment an enterprise with dozens of departments all using different tools and software solutions to manage their day-to-day operations. Through SOA, these applications can all talk to each other so that when a customer checks on an order, the clerk can also see who took the order, where the product currently is in transit, the customer’s order history and much more. The data is not integrated, but the processes are. At the end of the day, when the regional salespeople need their numbers, the data warehouse — which has integrated data arranged around various business events — provides the results quickly giving all subscribers a complete and integrated view of all relevant business processes.

Adopt architectures that facilitate business’ natural orientation towards the business process. Business Intelligence, Performance Management, Business Process Reengineering, and Master Data Management initiatives will benefit tremendously. I’ve been saddled by the department-oriented mentality by business for too long. Better IT/Business alignment in this area will create more opportunities for defining clear business processes which in turn will lead to a better integrated organization.

Published June 3rd, 2008

News from the Kimball Group

As you might know, I recently attended the Data Warehouse Lifecycle in Depth course from Kimball University. I blogged about each day of the event and also got an opportunity to speak with Margy and Warren at various times and about various topics. One of which was KU Certification.

“Now, if we can just convince them to create a certification exam…” is what I wrote, somewhat jokingly (because I knew full-well what Warren and Margy thought about a KU certification program).

Well, here is the official response, as written in the most recent design tip newsletter, “Kimball Design Tip #102 Server Configuration Considerations”:

We’ve received several inquiries about Kimball certification. After much consideration, we’ve concluded that it’s not meaningful to bestow certification by charging you to take a multiple choice exam. We do believe quality, in-depth education consistent with our proven methodology is critical to your professional development. Rather than embarking on a certification program, we’re giving Kimball University alumni an alternative opportunity to publicize their completion of our courses.

Which of course leads into my post regarding KU’s new LinkedIN Group!

We’ve launched a Kimball University Alumni group on LinkedIn so you can let everyone in your network and other interested parties know that you’ve attended a KU course. We’re not promoting LinkedIn, but it’s a well-accepted networking tool for industry professionals. Joining our alumni group allows you to promote your alumni status on your profile and connect with other alums (and perhaps potential future employers) in your area.

The Kimball University Alumni group is limited to students who have attended a full length 2- to 4-day Kimball University onsite or public course. Unfortunately, 1-day vendor seminars and other industry events do not qualify. Follow this link to join the Kimball University Alumni group.

So there you have it! If you’ve had the pleasure of attending at least one KU course, then go to LinkedIN and sign up!