IT Business Edge writer Michael Stevens recently published a document called the “Data Warehousing ETL Checklist” (link to article with document). To get the file you’ll need to register with ITBusinessEdge (if you don’t already have an account).
The document isn’t a development checklist, but rather “a high-level checklist of the important topics”. Stevens has made a good effort to distill the various data integration topics and issues you’ll need to consider before you start your project.
Highlights and feedback
Mind scope and do not “try to boil the ocean”. Good advice. But I disagree with Stevens on how to approach one aspect of the scoping issue. He suggests that “in some cases, it may be better for the business as a whole to leave certain data sources untouched”. As I wrote here, I don’t think that scoping to the data source is a great approach. But you can decide.
I would also add that instead of looking at “Target System Content”, you should focus specifically on Business Processes — and this is regardless of the type of system your integrating into (MDM, DW, OLAP, etc.).
Data Profiling is a very important topic, and I’m ecstatic to see it on the list. But if it were up to me, I would have made it second or third. Almost every single failed or over budget ETL project can be linked to data source woes (quality, data availability, inadequate documentation, etc.). Most of these woes can be relieved by conducting good data profiles — a process that can begin during the initial requirement gathering phase of the lifecycle.
The data profile will reveal which source systems are useful. There may be redundancy, data quality, and data access issues compelling you to exclude a particular source. This isn’t a scoping issue, though. Go back to the project sponsor and to the board; report your findings. You may have other issues in your organization to resolve first.
In addition to Data profiling, I’m very happy to see that Naming Conventions made the list. Tied in closely with naming is the concept of conformity. I’ve written about this before, so please check here. In general, a label and its data must conform and be consistent throughout the entire system. And as Stevens points out, they must be agreed upon by the users.
For metadata, I also consider a third type: Process Metadata. This is technical data about all the processes (ETL, jobs, automated processes, reports, etc.). This data is then utilized directly by business users and administrators to monitor and audit the data warehouse. Is my report ready? consult the process metadata. Did that job finish? consult the process metadata.
Lastly, don’t forget to think about auditing, compliance, and lineage. If you’re making decisions based on the data retrieved from your data warehouse, you’ll need ways to trace data lineage. You’ll need to know when data was loaded, what version of the ETL system and business logic was used, and even why a particular data item might have been changed in transit.
Conclusion
This is a good high level checklist that you can use before you begin to dig deeper. It gives you a snapshot of some of the high-level decisions you’ll need to make before you get started with your next data warehousing project.
Perhaps Stevens will incorporate some of my feedback into a version 2!