This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Data profiling is the most important step in designing a data warehousing solution, and Kimball identifies it as Subsystem 1 for good reason. A data profile forms the foundation for every phase of data integration. As one author puts it, without a data profile you end up in the cycle of code, load, and explode! Trust me (and him, and everyone else) on this: invest in data profiling to ensure integration success.
I have already talked about the data profile in my posts “Data Profiling” and “Getting Started with Data Profiling“. I won’t beat a dead horse here, so please take a moment to review those two short posts for more details. Here is the definition that I have been using:
- Data Profiling
- Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.
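To make that definition concrete, here is a minimal sketch (my own illustration, not code from the article or from any profiling tool) of the kind of column-level statistics a profiling inventory records: row counts, nulls, distinct values, value lengths, and the most frequent values. The function and sample data are hypothetical.

```python
# Minimal column-profile sketch: the content/quality statistics a
# data profiling inventory typically captures for one source column.
from collections import Counter

def profile_column(name, values):
    """Return basic content and quality statistics for one source column."""
    # Treat both NULLs and empty strings as missing data.
    non_null = [v for v in values if v is not None and v != ""]
    lengths = [len(str(v)) for v in non_null]
    return {
        "column": name,
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical source rows standing in for a real table column.
states = ["OH", "OH", "ohio", None, "TX", ""]
print(profile_column("state", states))
```

Even this toy profile surfaces real integration problems: inconsistent codes ("OH" vs. "ohio"), missing values, and a length range that hints at mixed formats.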
What’s more, data profiling is dynamic: it must be repeated often during the data warehousing life cycle. Attribute domains and ranges can change, for example. Also consider a source system that goes through a version upgrade. The developers might have added new data elements or cleaned up some quality issues. You will need to re-run your data profile and reconcile any differences with your current integration. Otherwise, you are left with only the documentation provided, which is often not enough.
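The reconciliation step can be as simple as diffing the value domains captured by two profile runs. A tiny sketch, with hypothetical state codes standing in for a real attribute domain:

```python
# Compare the domain of one attribute before and after a source upgrade.
# Values are hypothetical; real runs would read from stored profile results.
before = {"OH", "TX", "CA"}          # domain captured in the earlier profile
after = {"OH", "TX", "CA", "PR"}     # domain captured after the upgrade

new_values = after - before          # values the integration has never seen
retired_values = before - after      # values that no longer appear
print(new_values, retired_values)
```

Any non-empty difference is a signal to revisit the logical data map before the next load.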
SQL Server 2005 Integration Services (SSIS)
Informatica, Oracle, DataFlux (SAS), IBM and many others have dedicated profiling tools. Microsoft does not (do a Google search for “Data Profiling Tools” and you won’t see Microsoft anywhere on the map). So if it is so important, why doesn’t Microsoft provide a data profiling tool? Good question.
SQL Server 2008 promises a new SSIS task called the Data Profiling Task (and an accompanying viewer). This task will certainly help in designing profiling and quality solutions. Although I have not evaluated it yet, I expect it will take the place of the hodgepodge of other tasks used to accomplish domain- and range-related profiling functions. You can read more about it here at Stacia Misner’s blog. The SQL Server 2008 November CTP, which contains the task, is located here.
For the rest of us using SQL Server 2005 in a production environment, you can use SSIS as a data profiling tool with some effort. The process involves combining the Script Component, Row Count, Multicast, Aggregate, and Conditional Split transformations, among others, in a data flow. The Script Component is especially useful here because it gives you access to regular expressions.
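The regular-expression work that the Script Component enables is typically pattern profiling: reducing each value to a "shape" and counting how many shapes occur in a column. Here is a small Python sketch of that idea (my own illustration; the function name and sample phone numbers are hypothetical, and in SSIS the equivalent logic would live in a Script Component):

```python
# Pattern profiling: collapse each value to a shape signature so that
# format inconsistencies in a column become countable.
import re
from collections import Counter

def pattern_signature(value):
    """Reduce a value to its shape: every digit -> 9, every letter -> A."""
    sig = re.sub(r"[0-9]", "9", str(value))
    sig = re.sub(r"[A-Za-z]", "A", sig)
    return sig

# Hypothetical source column with mixed phone-number formats.
phones = ["614-555-0101", "(614) 555-0102", "6145550103", "n/a"]
shapes = Counter(pattern_signature(p) for p in phones)
print(shapes)
```

A column that should hold one format but profiles into four distinct shapes is exactly the kind of finding that feeds the quality work later in the project.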
Using SSIS in this way also allows you to discover table structures in the source system if available through the connection type. For relationships, you would need to examine the database schema manually.
Through SSIS, you would run a profiling package on each table (or common group of tables) and log the results to a specialized database for analysis. This produces a wealth of useful metadata that can form the basis of your data model and logical data map. None of it comes automatically, though: you will need to develop this system from the ground up.
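As a sketch of that "log the results to a specialized database" step, here is a tiny profile-results store using SQLite. The schema and values are my own illustration of the shape such a repository might take, not the article's design:

```python
# A minimal profile-results repository: each row records the statistics
# one profiling run captured for one column of one source table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE column_profile (
        run_date       TEXT,
        table_name     TEXT,
        column_name    TEXT,
        row_count      INTEGER,
        null_count     INTEGER,
        distinct_count INTEGER
    )
""")

# A hypothetical result logged by one profiling run.
conn.execute(
    "INSERT INTO column_profile VALUES (date('now'), ?, ?, ?, ?, ?)",
    ("Customers", "State", 10000, 42, 53),
)

nulls = conn.execute(
    "SELECT null_count FROM column_profile WHERE column_name = 'State'"
).fetchone()[0]
print(nulls)  # 42
```

Accumulating runs in one place like this is what lets you compare profiles over time and trace a quality issue back to the run that first exposed it.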
I get the feeling that Data Profiling was an afterthought for the design team at Microsoft. I also get the feeling that they’re finally waking up and will begin to provide important profiling tools in subsequent releases. I don’t think that their offering in 2008 will make much difference though for those serious about data profiling.
Hand Coding with Visual FoxPro (VFP9)
Personally, I feel that your best bet for data profiling is to purchase a data profiling tool from a vendor with a proven track record and with good support. Hand coding ETL is difficult enough as it is, but hand-coding a profiler is quite a challenge in its own right. Data profiling is much more than simply running some SQL queries on a table, and it goes well beyond determining candidate keys and identifying nullable columns. Hand coding this type of application requires research, development time, and plenty of quality controls.
With that said, I have a sample profiling application in VFP9 that you can download here (click on “VFP Data Profiler”). It is very basic, but it should give you a feel for what you need to do to get started.
If I could choose a language, I would select FoxPro over anything else I’ve used (PHP, C++, C#, VB). It can shred XML, process strings blazingly fast, and its database container makes a perfect metadata and staging repository. You can do more with less in VFP9 than with many other languages, or even SSIS.
Data profiling is an ongoing effort, but once the initial round is complete, development of the data warehouse shifts to designing the data model (for analytical data warehousing applications, a dimensional model). At the same time, quality issues are being addressed and the ETL team begins to lay the groundwork for the next subsystem: Change Data Capture (CDC).