This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Data Lineage and Data Dependency are closely related in that they tell you how data elements were born, how they got to their final resting places (lineage), and where those resting places are (dependency).
For example, you may have a report that shows the total number of sales in a given month for a particular client. The Sales metric, used to generate the total sales for a customer for a given month, also exists in 5 other reports, a few exports, your dashboard and scorecard application, a research data mart, and in 3 fact tables. Many of your data warehousing applications are therefore dependent on the sales metric. You also know that this metric is staged in on your integration server (in a 3NF relational model), and that it is derived from your front-office sales database. Not only that, because you practice good logging, version control and release management, you also happen to know exactly when the front office system gave you the sales data, when it was staged, and what version of what package processed it.
Compliance officer: Take that!
Having complete lineage and dependency trails is not an easy task, and must be planned as part of your metadata strategy from the start. If you, like me, have to operate in a heavily regulated environment than you must pay special attention to lineage and dependency. Even if you’re not bombarded with regulation and auditing, there are great benefits to building lineage and dependency metadata into your data warehouse. Consider:
- Being able to find out quickly where and why any element changed type, value, or size (why are my sales figures rounded to two significant digits?)
- Being able to perform a quick impact analysis if any change in type or size of an attribute or metric in the data warehouse will affect downstream reports and applications
- Being able to focus data quality issues back to the source by showing how and why various data elements got clogged in the integration system (I often build in strict rules when dealing with potential dirty data. In some cases I outright reject the row and report back to the source the reason.)
- Being able to accurately plan maintenance and reporting enhancements
The list can go on. But I think that you get the idea. If you want to be a good data governor, then you’ll need to pay close attention to this subsystem. If you are a good data governor, then you already have!
There and Back Again
So how can you get there? Largely, this is a documentation issue. While databases and tools can provide some metadata to help determine dependencies, transformations, and the like, it is often more useful to pull out these details into some form of central repository that you can query and write reports on. I’ll talk more about metadata in a future posting (subsystem 34, to be specific), and with that, I’ll discuss some great ways to automate this and other tedious documentation tasks. Here are a few other things you can do:
- Keep your Bus Matrix documentation up to date. The Bus Matrix offers a great overview of your data warehouse and can often shed some light on dependency and lineage issues.
- Publish your Logical Data Mappings (after you’ve polished it up) to your data governors, making the LDM a centerpiece for your lineage documentation.
- Get to know your database’s dependency analysis capability, and install a third-party tool if the native capability is insufficient. For SQL Server, you can have a look at a company like ApexSQL for some inspiration.
- Be strict with your versioning. This will allow you to tie data elements to the processes that touched them. In a compliance environment, it can be critical to know that build 345 — and not build 346 — processed the sales figures (in a future post, I’ll explain how I do this in my environment).
From here
While there isn’t anything specific in SSIS or VFP to make lineage and dependency analysis possible, you should focus on logging, process metadata, clear documentation, and good old due diligence.
My next subsystem topic involves problem escalation. This is a touchy subject where I work because it takes so long for the “system” to kick in (SLA times vary and every project has a priority score, that often means critical issues for you are blips on the radar for the support staff). I’ve got some thoughts and ideas on this, so I hope you don’t mind me sharing — and venting!
And lastly, I’ve been unable to commit to blogging over the past 3 months. First the holidays were a distraction, but in the start of January I discovered that my group would be moving from Brussels, Belgium to The Hague, Netherlands. This has been a major distraction and source of stress for my family and I. Do we make the move? Do we stick around in Brussels and look for another opportunity in a down market? I’ll write more about this in my “personal” category as more details come to pass. But that’s my excuse for the slow-going of late!