In this, and in the next series of posts, I will be exploring the 34 subsystems of ETL Data Integration (note: there used to be 38) as defined by the Kimball Group. I introduce the subsystems in this post, and then I will discuss how each fits (or does not fit) into both a SQL Server environment and in a hand-coded environment using VFP9.
The subsystem concept is a best-practice initiative formulated by The Kimball Group to help organizations design effective and efficient Data Integration environments for Data Warehousing using the Dimensional Model. As you’ll see, the subsystems are well-defined, battle-tested, and represent a relatively complete picture of what is needed for ETL Integration.
The Kimball Group categorizes the subsystems into 4 distinct groups: Data Extraction, Cleansing and Conforming Tasks, Data Delivery, and Management. Management is an often forgotten aspect of ETL (If the acronym was ETLM, we’d be OK!). You’ll notice that there are quite a few management-specific subsystems.
- Data Profiling
- Change Data Capture (CDC)
- Extraction
- Data Cleansing Subsystem
- Error Event Management
- Auditing
- Removing Duplicates
- Data Conformance
- Slowly Changing Dimensions (SCD)
- Surrogate Key Generator
- Hierarchy Manager
- Special Dimensions Manager
- Fact Table Builders
- Surrogate Key Management
- Bridge Table Builder
- Late Arriving Data Handler
- Dimension Manager
- Fact Table Provider
- Aggregate Generation
- OLAP Cube Builder
- Data Propagation Manager
- Scheduler
- Backup System
- Recovery and Restart
- Version Control
- Version Migration
- Work flow Monitor
- Sorting
- Data Lineage and Dependency
- Problem Escalation
- Paralleling and Pipelining
- Security
- Compliance Manager
- Metadata Repository
Data Extraction
Cleansing and Conforming Tasks
Data Delivery
Management
Data Integration architects have a lot to be thankful for the work and effort put forth by the Kimball Group. As you can probably guess, to do ETL right, you’ll need adequate time and resources. It is no wonder that data integration is by far the most time-consuming aspect of data warehousing.
In subsequent articles I will define each of these systems, provide examples, and discuss how they can be accomplished in SQL Server as well as VFP9. I have a lot of work to do, so stay tuned!
#1 by dan dawdy at February 26th, 2008
| Quote
I am unable to access the 1 Data Profiling link as it points to 3-Data-Extraction
#2 by Tod McKenna at February 26th, 2008
| Quote
Sorry about that Dan. I’ve updated the entry. The link to that post is here:
http://blog.todmeansfox.com/2007/12/07/etl-subsystem-1-data-profiling/
Regards,
Tod
#3 by JanaPoornavel at June 22nd, 2008
| Quote
Todd,
Would you be publishing your views about “Bridge Table Builder” soon ?. The fact that the Bridge table is going to support many-to-many relationships in the form of composite keys is making me fear that it is not possible to prepare the data with the help of database itself. I am sorry If am sounding meaning less. I would be glad to give an example, but i figure that replies are meant to be short.
regards,
jana
#4 by Tod McKenna at June 23rd, 2008
| Quote
Dear Jana,
Yes, I have been preparing posts on the following topics: Surrogate Key Management, Bridge Table Builder, and Late Arriving Data Handler. All should be online by early next week.
Bridge tables are quite interesting. I recently designed a model where bridge tables were essential — and in fact the only way — to make it all work. Consider a sale that involves many sellers, buyers, and items (an item can have multiple parts of different types). In the end, the dimensional model makes the design a breeze and although complex, is still easy for business users to understand and navigate. Performance is excellent as well.
Stay tuned, it will be online soon!
#5 by JanaPoornavel at June 23rd, 2008
| Quote
todd,
That sounds like panacea, you have cracked every major problem a datawareshouse might face. Performance,Navigation,Many to Many Relationships. Will you be posting the Schema or ER diagrams by any chance ?
regards,
jana
#6 by Tod McKenna at June 24th, 2008
| Quote
Great word! “Panacea”
I’ll certainly be posting more about data modeling, enterprise mappings, master data management, governance, and so on. My focus now is on getting through all ETL considerations first. Data Integration consumes most DW/BI resources, so I felt it would be a good place to focus my efforts…