Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘SQL Server 2005’


Published August 26th, 2008

Analytical Databases

ying yang of relational verses analytical databasesMost of what I discuss on Tod means Fox (and especially in my 34 Subsystem series) has to do with relational database engines (SQL Server 2005 and Visual FoxPro in particular). For most data integration projects, desktop and web solutions, and data warehouses, the relational database is all you’ll need. It can be used to create and manage both relational and dimensional models. But in Business Intelligence applications, chances are that you’ll need an additional, more specialized form of multidimensional data storage and retrieval. Analytical databases cover this need. In fact, Dr. Codd, the inventor of the relational database, often explained how analytical databases are a “necessary companion” to relational databases.

The OLAP Story

OLAP stands for Online Analytical Processing. It is a terrible name to describe multidimensional analytical databases. Instead of a more meaningful name, like FASMI (Fast Analysis of Shared Multidimensional Information) proposed by Nigel Pendse of “OLAP Report” we seem to be stuck with OLAP. With great pain, I will continue to discuss analytical databases using the OLAP terminology.

Basically, an OLAP (*sigh*) database has certain characteristics that set it apart from relational databases. And, coincidentally, Pendse included all characteristics in the FASMI acronym. Check out the OLAP Report for more details. If you’re just getting into business analytics, or are in the process of evaluating analytical database software – you must visit this site!

The Smartness Factor

I like to think of the relational database as “dumb” and the analytic database as “smart”. Let me explain:

The primary focus of Analytic servers is to get the (often pre-calculated) data out of the database as quickly as possible, allowing the user to zoom in and out along different hierarchies. Contrast this with the Codd’s relational model which seeks to eliminate data anomalies at transaction time through normalization: data retrieval is slow and often complex. With this in mind, you can get a feel for how a database engine can store, catalog, and retrieve data differently. (In fact, this is what makes dimensional modeling in general so favorable for querying and analytics – it is not bound by the restrictiveness imposed by normalization. I’d also like to note that as with dimensional models, OLAP databases are multidimensional.)

While the relational database plays a very important role, so does the multidimensional (OLAP) database. It’s simply built differently to service a different need.

Primarily, analytic servers can manage aggregates, aggregate navigation, and complex calculations and summaries across hierarchies at blazing speeds. These skills were borrowed from both navigational database designs (think of the DOM or a Wiki) and hierarchal designs (trees with leaves and branches). A lot of this has to do with how the data is stored.

Storage

Analytics servers offer a different type of storage. MOLAP, or Multidimensional OLAP, can be much more efficient than relational engines. Some tools (like SQL Server Analysis Services) allow you to store your analytical databases in a relational way (ROLAP) or using a hybrid approach (HOLAP). Personally, I see no benefit at all with ROLAP, aside from real-time systems where using the relational database is a must because of the latency involved with updating the OLAP database. (Not to digress too much here, but even real-time environments can use MOLAP storage by using partitions and caches correctly.)

MOLAP is more natural and faster. This may differ depending on your tool choice, but I beg someone to tell me differently. For a detailed discussion on how MOLAP and ROLAP engines store data, you can try this resource “Analysis Services 2005 Performance Guide” (which defines the way Analysis Services does it).

HOLAP storage is an interesting option and could actually perform well –even better than MOLAP in some instances — with less disk usage. But a lot of this depends on how many aggregates are defined and how often the system needs to query the relational database. It might be a good consideration, but if you’re not sure and you have plenty of disk space (disk space is cheap), then go with MOLAP.

The smartness factor2 (the MDX language)

I like to think of SQL as “dumb” and MDX as “smart”. Let me explain:

The MultiDimensional eXpression (MDX) language was created about 10 years ago by Microsoft to work specifically with multidimensional data stored in analytical servers. This OLAP-specific language improves upon the SQL syntax, removing much of the bulkiness associated with the language. MDX is an elegant and highly relevant partner to the analytical database. While you could get out what you need using SQL, most every SQL statement will be a challenging one. It’s like cutting down a tree with a hand saw as opposed to using a chain saw. MDX is painless and intuitive. Concepts such as dimensions, hierarchies, and sets are built into the syntax. For more on MDX and the mdXML standard, please visit XML For Analysis.

Central to the MDX language is the cube concept, which deserves a proper introduction:

Cubes

OLAP data is stored in structures called cubes. As you know, a cube is a 3-dimensional solid that, given a point on three of its faces which form a vertex, can take you to a precise point somewhere within the cube itself. A cubeThis point represents the aggregate metric you want to view. The faces represent the different dimensions (like Product, Customer, and Time) that are used to find the point. The dimensions are further broken down by its hierarchies on each face.

Cubes are sometimes — and more aptly — called hypercubes (or a tesseract or “n-cube”) because analysis often makes it necessary to examine more than three points. As soon as you look at more than three faces, you need higher dimensionality.

I admit, when I first heard about cubes I was a bit intimidated. I felt that they were too complex and advanced. Now, working with cubes is natural. Notice though that the concept of a cube is almost the same thing as a star schema. The only difference (at least that I can really think of) is that a star schema generally stores atomic data, and barring any usable aggregate fact table, calculations need to be done on the fly. The cube theoretically stores the results of these calculations inside the cube.

Dimensional Modeling

If you read my blog, then you know I advocate dimensional models for proper data warehousing. If you also advocate and use dimensional models, then the concepts discussed so far will fit you well.

What’s great about using Dimensional Models in your relational database is that your OLAP database will almost mirror the design, making the development and deployment of your cubes so much easier. Not only that, but the cubes will likely load faster and be easier to maintain when built off of the dimensional model. Note you can still achieve most of what you can do without an analytics server, just by using a properly constructed set of business process dimensional models. The OLAP database is just smarter, making several tasks (navigating hierarchies, security, and calculations to name a few) easier.

Published August 11th, 2008

ETL Subsystem 19: Aggregate Builder

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

I spent the last two posts describing aggregate facts and dimensions and explaining the differences between aggregates and summaries. I wanted to do that in a separate thread so that I could focus on Kimball’s “Aggregate Builder” without spending too much time introducing aggregates. So, if you haven’t already, please have a look at those two previous posts.

The key to subsystem 19 is that you budget for and build a system to maintain your aggregates. This can be easily overlooked. I repeat: This can be easily overlooked. Much like what happens with the very important and easily cast aside data profiling subsystem. Don’t let it happen to you!

Building Aggregates

Aggregates can be a challenge to build — especially if your relational or OLAP engines don’t do it for you. Actually, if your OLAP tool doesn’t handle building aggregates, and you are relying on it for your aggregates, then you should think about getting another OLAP tool! I’ll talk more about OLAP in my next Subsystem post.

When designing aggregates to build, consider two things: user access patterns and distribution of data along the hierarchies. You cannot hope to construct aggregates for every possible level of every hierarchy across all metrics in a fact. Instead, you need to assess how your users query the data and then, taking it a bit further, decide if creating an aggregate will make any real difference for performance based on attribute distribution. The main goal of creating aggregates in a data warehouse is to speed query performance mainly by limiting the amount of data required to obtain results. But building them can be expensive.

Generally speaking, every time you update, insert, or delete data into any dimension or fact, you will need to rebuild all aggregates that are tied to those attributes and metrics. If you have two or three dozen aggregates to build every time you load a few business processes dimensional models, then you can imagine the time and processor loads required. In many instances, there is a balancing act between getting the data into the warehouse and getting the data out.

Technically speaking, building aggregates is a simple matter in SQL. Often it consists merely of a GROUPed query, some aggregate or custom functions, and surrogate key management.

Here’s a tip: Embed surrogate keys in your dimensions for each hierarchy used in aggregates. This practice will allow you to avoid the costly process of generating new keys for your hierarchy at the expense of a few administrative attributes stored in each dimension. For example, if you are building a set of region aggregates, you might embed region_key, state_key, city_key along with region, state, and city in your geography dimension. The keys will be repeated in the dimension until they are rolled up, at which time they become unique. The new aggregate fact table will point to these keys, eliminating the need for a specialized surrogate key manager.

Navigating Aggregates

Depending on your technology (here, I talk about SQL Server 2005 via SSIS and Visual FoxPro only — see below), navigating aggregates may or may not be a simple matter. If you put all your aggregates into OLAP engines, then chances are good that your OLAP technology will handle aggregates for you. To do it yourself, say, if you were hand-coding this part of the data warehouse using Visual FoxPro, you have a few design questions to answer.

In an old, but still relevant article published by DBMS, Ralph Kimball defines four key design requirements when dealing with aggregates. They are:

  1. Aggregates must be stored in their own fact tables, separate from the base-level data. In addition, each distinct aggregation level must occupy its own unique fact table.
  2. The dimension tables attached to the aggregate fact tables must, wherever possible, be shrunken versions of the dimension tables associated with the base fact table.
  3. The base Fact table and all of its related aggregate Fact tables must be associated together as a “family of schemas” so that the aggregate navigator knows which tables are related to one another.
  4. Force all SQL created by any end user or application to refer exclusively to the base fact table and its associated full-size dimension tables.

Which leads to the simple process of navigating the schemas when a user queries the dimensional data. The process can be distilled into the following three processing steps proposed by Ralph Kimball in his article:

  1. Find the smallest fact table in the “family of schemas” referenced by a user’s query.
  2. test the query against the schema to be sure that all attributes and metrics are present. If so, proceed. If not, go back to step 1
  3. Rewrite the user’s query to use the smaller aggregates where possible.

If you’re faced with the challenge of developing your own navigator, this article is a must read!

SQL Server 2005 Integration Services (SSIS)

Build aggregates after loading the atomic fact data for each business process. This is usually done in a separate package that can be called from the master package. It may be appropriate to schedule these aggregate-building packages periodically throughout the week as opposed to after each load, but business requirements will dictate this approach.

The aggregate package will contain three important types of data flows.

  • Dimension shrinking
  • Surrogate key management
  • Aggregate fact loading

Within the dimension shrinking data flows, start collapsing the relevant dimensions to the grain of the aggregate fact table. At this time, you would also assign surrogate keys (or, if you use my tip described above, just include the appropriate key and designate it as the primary key).

If you’re particularly crafty, you could design this in a way that would allow you to collapse dimensions from the previously shrunken dimension — saving significant time and resources. For example, collapse the geography dimension first into geography_region, then from geography_region create geography_state, and then from geography_state create geography_city.

Surrogate key management (where you look up the keys in the dimensions to use as foreign keys in the fact) and fact loading (actually inserting into the fact table) are very similar to the normal ETL processes. The only difference is that in order to look up your keys, you need to generate a special dataset from the base atomic data at the proper grain using a GROUP BY which contains all natural keys of the newly shrunken dimensions.

Hand Coding with Visual FoxPro (VFP9)

I find that hand-coding works well with aggregates. You can hand-code dimension shrinking, surrogate key management, and aggregate fact loading quite easily. All that we’re dealing with is some basic workflow and simple SQL. The challenge comes in scheduling and paralleling. Certainly, we want aggregates to run after the atomic data is ready. How you do that in FoxPro is entirely up to you. I prefer to chain events, so that when one finishes, the other kicks off. Scheduling requires additional reliance on the OS or the use of a timer (a bad idea).

From Here

There is so much more to say about this topic. I would like to follow this up with some examples in both SSIS and FoxPro. Stay tuned… In the meantime, I’ll move on to talk more about OLAP.

Published July 30th, 2008

Aggregate Facts and Dimensions

Aggregate — picture from http://www.stonetohome.comAggregate fact tables are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc..) or from other specialized functions that output totals derived from a grouping of the base data. These new metrics, called “aggregate facts” or “summary statistics” are stored and maintained in the data warehouse database in special fact tables at the grain of the aggregation. Likewise, the corresponding dimensions are rolled up and condensed to match the new grain of the fact.

These specialized tables are used as substitutes whenever possible for returning user queries. The reason? Speed. Querying a tidy aggregate table is much faster and uses much less disk I/O than the base, atomic fact table, especially if the dimensions are large as well. If you want to wow your users, start adding aggregates. You can even use this “trick” in your operational systems to serve as a foundation for operational reports. I’ve always done this for any report referred to by my users as a “Summary”. (As an aside, there is a difference between an “aggregate” and a “summary”. I’ll explore these differences in my next post.)

For example, take the “Orders” business process from an online catalog company where you might have customer orders in a fact table called FactOrders with dimensions Customer, Product, and OrderDate. With possibly millions of orders in the transaction fact, it makes sense to start thinking about aggregates.

To further the above example, assume that the business is interested in a report: “Monthly orders by state and product type”. While you could generate this easily enough using the FactOrders fact table, you could likely speed up the data retrieval for the report by at least half (but likely much, much more) using an aggregate.

Here, using the atomic transaction FactOrders table:

SELECT c.STATE, p.product_type, t.YEAR, t.MONTH, SUM(f.order_amount)
    FROM FactOrders f
    JOIN DimCustomer c ON c.CustomerID = f.CustomerID
    JOIN DimProduct p ON p.ProductID = f.ProductID
    JOIN DimTime t ON 
        t.YEAR = DATEPART(yy, f.YearMonthID) AND 
        t.MONTH = DATEPART(mm, f.DateID)
    GROUP BY c.STATE, p.product_type, t.YEAR, t.MONTH

The aggregate is querying much less data and queries against time are now much simpler. In my non-scientific tests, the following query ran many times faster (a few seconds compared to about 30 seconds!).

SELECT c.STATE, p.product_type, t.YEAR, t.MONTH, SUM(f.order_amount)
    FROM FactOrders_Agg1 f
    JOIN DimCustomerState c ON c.CustomerStateID = f.CustomerStateID
    JOIN DimProductType p ON p.ProductTypeID = f.ProductTypeID
    JOIN DimMonth t ON t.YearMonthID = f.YearMonthID
    GROUP BY c.STATE, p.product_type, t.YEAR, t.MONTH

Creating the Fact and Dimensions

To implement, you will need to roll up your fact table by the hierarchies found in your dimensions. The result will be a new fact table, a set of new accompanying dimensions at the grain of the fact, and all new foreign keys for mapping. I usually name the fact table the same as the base fact with some meaningful suffix appended to the end. In SSMS, this keeps the aggregates with the fact in my object explorer. Dimensions usually get new names (like CustomerState and ProductType) and should be conformed so that they can be reused across business processes. You could even create views instead of new dimensions, but this does not eliminate the need to regenerate new surrogate keys.

When rolling up dimensions, you are provided with an excellent opportunity to perform aggregate functions on the dimension itself and store the results as new attributes. For example, you may want to know how many customers are living in each state. This could be used as the denominator in some population calculation you plan to use against the aggregate fact. Your new dimension might therefore look like the following:

SELECT Cust_Country, Cust_Region, Cust_State, COUNT(Cust_ID) 
FROM DimCustomer 
GROUP BY Cust_Country, Cust_Region, Cust_State

The most obvious aggregate function to use is COUNT, but depending on the type of data you have in your dimensions, other functions may prove useful. Just be warned: If you find that aggregate functions are being used a lot on your dimensions, you may need to revisit your design. There may be opportunities to pull out those metrics into existing or new fact tables!

Generating aggregates is largely an incremental process, where you examine query and reporting usage looking for places to improve performance. Aggregates stored in the RDBMS are maintained through ETL and/or your OLAP engine.

A Note About OLAP Aggregates

Theoretically, storing aggregates in a fact table in a RDBMS is the same as storing them in an OLAP cube. In OLAP storage, aggregates are precalculated summaries of data from the different dimensions of the cube, such that a query that seeks to know the aggregate (sum) of some metric (order amount) for X (customer state) and Y (product type) over T (monthly orders) would only need to look inside the cube at those exact coordinates to get the answer. I won’t pretend to know how this stuff is physically stored, but OLAP engines across the board offer better performance, management, and navigation mechanisms for the aggregations than is available through the RDBMS (even when using Indexed or Materialized Views).

Next post, I’ll write some thoughts on the differences between “Summaries” and “Aggregates”!

Published July 18th, 2008

ETL Subsystem 18: Fact Table Provider

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

This subsystem mirrors the previous one I discussed: ETL Subsystem 17: Dimension Manager. It exists to ensure that fact tables are properly maintained and delivered to the organization. The functionality to accomplish this is done through ETL (with some help by other technologies, such as replication). The functions include (but are not limited to) the following significant tasks:

Note: I’ll discuss Aggregates more in my next post.

SQL Server 2005 Integration Services (SSIS)

Fact distribution can be done using ETL or through replication. While other methods exist (RPCs, Web Services, and other SOA-type solutions for example), ETL and replication are by far the most widespread and documented. SSIS works in both conditions just fine. For more thoughts on this, refer back to ETL Subsytem 17.

Hand Coding with Visual FoxPro (VFP9)

Nothing new to report! VFP is limited, without a native replication feature, on its own. It can move data in distributed environments quite well though. So using ETL as a solution is certainly high on the list. If replication is a must, then the VFP database backend will need to be replaced with SQL Server or similar RDBMS product.

From Here

As stated, in the next subsystem, I’ll talk about aggregates. I love aggregates (my wife wants a divorce), so my next post looks to be quite interesting!

Published July 14th, 2008

ETL Subsystem 17: Dimension Manager

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

This subsystem is largely a management task. It involves ensuring that conformed dimensions (see my post “ETL Subsystem 8: Data Conformance“) are properly published to the organization. Conformed Dimensions are one of the most important and fundamental aspects of data warehousing. In order to deliver these conformed dimensions to business users spread across the organization — including geographically — a system must be in place to deal them out.

In simple environments (a single RDBMS server that services a handful of analysts, for example), dimension management is as easy as inserting new rows through ETL and updating data model metadata by the DBA on a single server. In more complex environments — where servers and applications are scaled up and out, across large geographies — dimension management becomes more of a challenge.

It is common in these distributed environments (horizontally and vertically) to have copies of dimensions sitting at various locations in the enterprise. Don’t confuse this with “Mart Madness” or “Silos”. These copies are managed by a single ETL process which not only handles normal ETL, but likely handles the distribution as well (replication, as I’ll talk about in a moment, is another option).

Why not have remote users connect to a single, centralized database? A Dimensional Model exists to facilitate analytics and query performance. To this end, it is necessary at times to move data closer to the analyst. Not only is this necessary, but it is also smart. If you have presentation servers in Brazil, Costa Rica, Singapore, and Sri Lanka, then guess what: You will distribute your data directly to these locations. How you do it is more of a question of technology and policy, rather than a process tied to ETL directly.

SQL Server 2005 Integration Services (SSIS)

If you choose to distribute dimensions using ETL, then SSIS will work just fine. This is nothing more than a set of special packages that you design to select changes from the master dimensions (stored on your central integration server), multicast the data, and load to the destination dimension or dimensions. This incremental approach is desirable when you are processing a small amount of changes to medium and large dimensions.

Alternately, if your dimensions are small enough, it might make sense to drop and recreate them on the destination servers. The overhead associated with collecting the changes from the master dimensions might not be worth the effort. But before you think this method the best, be aware that all constraints to this dimension (including those connected through the fact) will need to be dropped. Also, the database will be unavailable to the users during this operation.

Another good option is to simply use Replication to propagate the changes to the various distributed dimensions. I am currently using replication now as my standard, but only because my situation currently dictates it. I have large data volumes, which are updated frequently (several times a day, with sensitivity on global timezones).

Hand Coding with Visual FoxPro (VFP9)

I am starting to sound like a broken record: VFP is not an ETL tool, but an OOP language. Dimension distribution will need to be done manually by either making copies of dimensions and pushing them out to the client systems, or designing a mechanism for the client systems to pull changed data from the master dimensions. I have found this to be a major challenge using VFP as the sole solution. Obviously, if your backend database is SQL Server or another RDBMS, you likely have a replication option. That being said, it certainly is not an impossible task, but one in which you need to prepare yourself for the various complexities. I would go out on a limb and suggest that this is true for all hand-coded solutions.

From Here

In my next post, I will discuss ETL Subsystem 18: Fact Table Provider. As with this subsystem, the Fact Table Provider is largely a management task designed to bring facts to the user community.

Published July 10th, 2008

ETL Subsystem 16: Late Arriving Data Handler

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Sometimes, not all dimension data is available when a transaction (i.e. business event, such as an order) comes into the Warehouse. This is a fact of data warehousing life. Under normal circumstances, you process dimensions first so that you have all your keys ready for inserting into the fact. When dimensional data is unavailable, missing, or incomplete you end up with a bit of a problem. NULL foreign keys in the fact table are not allowed.

Missed the trainOther times, your fact comes in late. In this scenario, you have all your dimensions, but now you need to insert a fact into the fact table and link it to existing dimension rows that were active at the time of the fact. Late arriving facts are easier to work with than late arriving dimensions, in my opinion. I already discussed some techniques for handling this in my post “ETL Subsystem 14: Surrogate Key Manager“.

I find late arriving dimensions (also called “early arriving facts”) to be a more difficult problem. There are a few of good solutions though, all of which are handled during ETL:

  • Hold onto the fact until all dimensions arrive
  • Create a dimension called ‘unknown’ or ‘not available’ with a primary key of -1 and use that
  • Insert a dummy row in the dimension and populate it with whatever you can

None of these options are terribly appealing across all scenarios; however, each has a place under certain circumstances.

Hold Onto the Fact

Holding onto the fact in your staging area may be a great option if, for example, you expect the late arriving data to be coming in soon. Perhaps you process data from multiple source systems. System A contains the transactions, while systems B, C, and D combine to give you your dimensions. Because of your extraction windows, you may not be able to get data from system D until later in the morning, meaning that the business events from system A are missing some context. Holding onto these events and waiting for the data from system D is the appropriate action!

Even if the situation is not predictable, holding onto facts might be the best option. If you are loading daily stocks, for example, and for some reason a particular company is not available (a new company perhaps), but the stock is showing up in your extracts, you should hold onto that stock in your staging area until the company is available. You don’t want that stock in the warehouse until the full details of the company comes in (which could be available in a separate extract, or may need to be manually entered).

As you can see, deciding to work this way has implications on the ETL processes and staging area. Planning for these circumstances up front is a must.

The ‘Unknown’ Dimension

Creating an empty row in the dimension called ‘Unknown’ is also an option that is good under some circumstances. If you are processing orders and no information about a promotion comes in, then it would be safe to link the fact to a special row in the dimension that denotes that no promotional information was available. Depending on your dimension and business requirements, you could have many different levels of ‘unknown’.

Here are a few suggestions:

  • -1, ‘none’
  • -2, ‘unknown’
  • -3, ‘not applicable’

If you use key -1 in your fact row for the foreign key to the promotion table, you allow your users to write queries and reports for all orders that were made when there was no promotion (promotion.name = ‘none’). The same is true for all facts that have an unknown promotion status, meaning there could have been a promotion, but the data was incomplete. Lastly, there are cases — if you do online orders for example — where promotions simply don’t apply.

You can expand these keys heading backwards with lots of detailed descriptions telling you why a dimension might not exist for an event. Don’t go overboard though! Having a few options is a good idea for data quality and analysis purposes. But too many may overload users with too many choices all too similar to be useful.

This approach works well when the likelihood of acquiring the context later is slim-to-none. If you use this method, it gets very complex if you later do in fact find that there was some context (i.e. a promotion). Your only option then is to revisit your facts and look for, and update, the fact rows corresponding to the late arriving data. On the bright side, if you’ve already tagged these rows with a foreign key of -2, for example, you may be able to focus only on those facts — saving you some process time.

Inferring a Dimension

In the company stock example I gave above, you have another option. If the company details of a stock is not known — and all you have is a SEDOL or Ticker — then simply insert a new row into the dimension with all of the information you know to be true about the dimension. This only works if you have enough details about the dimension to construct the natural key. Without this, you would never (although, I’ve learned to never say never) be able to go back and update this dimension row with complete attributes.

To make this work, and if you are using Type 2 SCDs, you will need another attribute in your dimension. Call it “inferred” or similar, make it a bit or char(1), and mark it as true (1, or “Y”).

This has a major implication on your dimensional processing. Now, before you process any Type 2 changes on any dimension that contains an attribute called “inferred”, you need to check if the row you are ready to work on is in fact inferred. If so, treat the entire row as Type 1. Because it is inferred, we don’t have the details anyway. If we went along and processed the row as a Type 2, then your fact will continue to point to an essentially empty row.

SQL Server 2005 Integration Services (SSIS)

The SCD Component handles the inferred dimension approach quite nicely. In fact, “Inferred Dimension” is a term introduced by SSIS (I believe). I talked briefly about the SCD Component and inferred dimensions in my post “ETL Subsystem 9: Slowly Changing Dimensions (SCD)” so I won’t repeat myself here. The bottom line is that SSIS has a very elegant solution to the late arriving dimension problem. If you have them — and almost every data warehouse has late arriving dimensions — then using the SCD Component and the inferred dimension option is must. Also check out this external link for more inspiration: “Handling Late-Arriving Facts“.

Hand Coding with Visual FoxPro (VFP9)

While SSIS has built-in capabilities to handle the inferred rows, VFP does not. This is not surprising of course because VFP is a language and not an ETL tool! With that said, writing SQL to accommodate the “inferred” flag in a dimension row is not a big deal. The processing logic for a SCD Type 2 dimension then becomes (in pseudo-code):

 
For each row
  If any attribute marked as 'Type 2' changes AND inferred = "N" Then
    update status of active/current row
    insert into dimension a new row
  End If
  For each attributes marked as a 'Type 1' change OR inferred = "Y" Then
    If attribute is subtype 1
        update all rows that match the key
    Else
        update only the current/active row that matches the key
    End if
  End For
End For

I discussed the above pseudo-code in more detail in my post “Loading the Slowly Changing Dimension“. Check that out for more insight into some of the logic. I should mention again that this is a very basic structure and not intended to be comprehensive or de jure.

From Here

Enough about late arriving data! Late arriving data, changed data capture, and surrogate key lookups might be the most challenging aspects of ETL — especially when it comes to performance and efficiency. It will take you some time to get these right. You may — like me — be forever tweaking these areas to squeeze every last second out of the process.

Next, I’ll move on to discuss Dimension Management. A much lighter topic!