This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand-coded in Visual FoxPro.
This subsystem is largely a management task. It involves ensuring that conformed dimensions (see my post “ETL Subsystem 8: Data Conformance”) are properly published to the organization. Conformed Dimensions are one of the most important and fundamental aspects of data warehousing. To deliver these conformed dimensions to business users spread across the organization, including across geographic locations, a system must be in place to distribute them.
In simple environments (a single RDBMS server that services a handful of analysts, for example), dimension management is as easy as having the ETL insert new rows and having the DBA update the data model metadata, all on a single server. In more complex environments, where servers and applications are scaled up and out across large geographies, dimension management becomes more of a challenge.
In these environments, distributed both horizontally and vertically, it is common to have copies of dimensions sitting at various locations in the enterprise. Don’t confuse this with “Mart Madness” or “Silos”. These copies are managed by a single ETL process which not only handles normal ETL, but likely handles the distribution as well (replication, as I’ll talk about in a moment, is another option).
Why not have remote users connect to a single, centralized database? A Dimensional Model exists to facilitate analytics and query performance. To this end, it is necessary at times to move data closer to the analyst. Not only is this necessary, but it is also smart. If you have presentation servers in Brazil, Costa Rica, Singapore, and Sri Lanka, then guess what: You will distribute your data directly to these locations. How you do it is more of a question of technology and policy, rather than a process tied to ETL directly.
SQL Server 2005 Integration Services (SSIS)
If you choose to distribute dimensions using ETL, then SSIS will work just fine. This is nothing more than a set of special packages that you design to select changes from the master dimensions (stored on your central integration server), multicast the data, and load it into the destination dimension or dimensions. This incremental approach is desirable when you are processing a small number of changes to medium and large dimensions.
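The select, multicast, and load pattern described above can be sketched outside of SSIS. What follows is a minimal illustration in Python with in-memory SQLite databases standing in for the master and destination servers; it is not an SSIS package. The table `dim_customer`, the `row_version` change-tracking column, and the function names are all hypothetical, and the sketch assumes every change to the master bumps `row_version`.

```python
import sqlite3

def create_dim(conn):
    # Hypothetical dimension; row_version is an assumed change-tracking column.
    conn.execute(
        """CREATE TABLE dim_customer (
               customer_key INTEGER PRIMARY KEY,
               name         TEXT    NOT NULL,
               row_version  INTEGER NOT NULL)"""
    )

def distribute_changes(master, destinations, last_version):
    """Select rows changed since last_version and upsert them into each copy."""
    changes = master.execute(
        "SELECT customer_key, name, row_version FROM dim_customer "
        "WHERE row_version > ? ORDER BY row_version",
        (last_version,),
    ).fetchall()
    for dest in destinations:
        with dest:  # one transaction per destination, committed on success
            dest.executemany(
                "INSERT OR REPLACE INTO dim_customer "
                "(customer_key, name, row_version) VALUES (?, ?, ?)",
                changes,
            )
    # New watermark: highest version shipped, or the old one if nothing changed.
    return max((row[2] for row in changes), default=last_version)

# Usage: one master and two distributed copies.
master = sqlite3.connect(":memory:")
brazil = sqlite3.connect(":memory:")
singapore = sqlite3.connect(":memory:")
for conn in (master, brazil, singapore):
    create_dim(conn)

master.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "Acme", 1), (2, "Globex", 2)],
)
watermark = distribute_changes(master, [brazil, singapore], 0)
```

The returned watermark would be stored between runs so that the next run ships only rows changed after it. Deleted rows are not handled here; a real process would need its own convention for those.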
Alternatively, if your dimensions are small enough, it might make sense to drop and recreate them on the destination servers. The overhead associated with collecting the changes from the master dimensions might not be worth the effort. But before you settle on this method, be aware that all constraints that reference this dimension (including those connected through the fact table) will need to be dropped. Also, the database will be unavailable to the users during this operation.
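For comparison, here is a rough sketch of the drop-and-recreate approach, again in Python with in-memory SQLite rather than SSIS. The table `dim_region` and the function `full_refresh` are hypothetical names. The sketch assumes no fact table references the dimension, so it skips the constraint handling mentioned above.

```python
import sqlite3

# Hypothetical small dimension used only for this illustration.
DIM_DDL = (
    "CREATE TABLE dim_region ("
    "region_key INTEGER PRIMARY KEY, region_name TEXT NOT NULL)"
)

def full_refresh(master, dest):
    """Drop the destination copy and rebuild it from the master dimension."""
    rows = master.execute(
        "SELECT region_key, region_name FROM dim_region"
    ).fetchall()
    # Assumption: nothing references dim_region, so no constraints to drop.
    dest.execute("DROP TABLE IF EXISTS dim_region")
    dest.execute(DIM_DDL)
    dest.executemany("INSERT INTO dim_region VALUES (?, ?)", rows)
    dest.commit()
    return len(rows)

# Usage: rebuild one destination copy from the master.
master = sqlite3.connect(":memory:")
master.execute(DIM_DDL)
master.executemany(
    "INSERT INTO dim_region VALUES (?, ?)",
    [(1, "Americas"), (2, "Asia Pacific")],
)
destination = sqlite3.connect(":memory:")
loaded = full_refresh(master, destination)
```

There is no change detection at all, which is the appeal for small dimensions. The cost is the window between the drop and the final commit, during which the destination copy is missing or incomplete.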
Another good option is to simply use Replication to propagate the changes to the various distributed dimensions. I currently use replication as my standard, but only because my situation dictates it. I have large data volumes, which are updated frequently (several times a day, with sensitivity to global time zones).
Hand Coding with Visual FoxPro (VFP9)
I am starting to sound like a broken record: VFP is not an ETL tool, but an OOP language. Dimension distribution will need to be done manually, either by making copies of dimensions and pushing them out to the client systems, or by designing a mechanism for the client systems to pull changed data from the master dimensions. I have found this to be a major challenge using VFP as the sole solution. Obviously, if your backend database is SQL Server or another RDBMS, you likely have a replication option. That being said, it certainly is not an impossible task, but it is one for which you need to prepare yourself for the various complexities. I would go out on a limb and suggest that this is true for all hand-coded solutions.
From Here
In my next post, I will discuss ETL Subsystem 18: Fact Table Provider. As with this subsystem, the Fact Table Provider is largely a management task designed to bring facts to the user community.