This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Data profiling is the most important step in designing a data warehousing solution. Kimball identifies it as Subsystem 1 for good reason. A Data profile forms the foundation for all phases of data integration. As one author writes, without a data profile you end up in the cycle of code, load, and explode! Trust me (and him, and everyone else) on this: Invest in Data Profiling to ensure integration success.
I have already talked about the data profile in my posts “Data Profiling” and “Getting Started with Data Profiling“. I won’t beat a dead horse here, so please take a moment to review those two short posts for more details. Here is the definition that I have been using:
- Data Profiling
- Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.
What’s more is that data profiling is dynamic in that it must be repeated often during the data warehousing life cycle. Attribute domains and ranges can change, for example. But also consider a situation where the source system goes through a version upgrade. The developers of the system might have added some new data elements or cleaned up some quality issues. You will need to re-run your data profile and reconcile any differences with your current integration. Without this, you would be left with only the documentation provided which is often not enough.
SQL Server 2005 Integration Services (SSIS)
Informatica, Oracle, DataFlux (SAS), IBM and many others have dedicated profiling tools. Microsoft does not (do a Google search for “Data Profiling Tools” and you won’t see Microsoft anywhere on the map). So if it is so important, why doesn’t Microsoft provide a data profiling tool? Good question.
In SQL Server 2008, there promises to be a new SSIS task called the Data Profiling Task (and accompanying Viewer). This task will certainly help in designing profiling and quality solutions. Although I have not evaluated it yet, I suppose it will take the place of using the hodgepodge of other tasks to accomplish domain and range-related profiling functions. You can read more about it here at Stacia Misner’s Blog. The SQL Server 2008 November CTP, which contains the task is located here.
“SSIS Stuff” by Matt Masson
“SSIS Junkie” by Jamie Thomson
- SSIS: Data Profiling Task: Part 1 - Introduction
- SSIS: Data Profiling Task: Part 2 - Column Length Distribution
- SSIS: Data Profiling Task: Part 3 - Column Null Ratio
- SSIS: Data Profiling Task: Part 4 - Column Pattern
- SSIS: Data Profiling Task: Part 5 - Column Statistics
- SSIS: Data Profiling Task: Part 6 - Column Value Distribution
- SSIS: Data Profiling Task: Part 7 - Functional Dependency
- SSIS: Data Profiling Task: Part 8 - Candidate Key
- SSIS: Data Profiling Task: Part 9 - Value Inclusion
For the rest of us using SQL Server 2005 in a production environment, you can use SSIS as a data profiling tool with some effort. The process involves utilizing the Script, Row Count, Multicast, Aggregate, and Conditional Split tasks among others in a data flow. The Script Component is very useful in this regard, especially for being able to use regular expressions.
Using SSIS in this way also allows you to discover table structures in the source system if available through the connection type. For relationships, you would need to examine the database schema manually.
Through SSIS, you would run a profiling package on each table or common group of tables and log the results to a specialized database for analysis. Tons of very useful metadata will be produced that can form the basis of your data model and logical data map. None of this comes automatically. You will need to develop this system from the ground-up.
I get the feeling that Data Profiling was an afterthought for the design team at Microsoft. I also get the feeling that they’re finally waking up and will begin to provide important profiling tools in subsequent releases. I don’t think that their offering in 2008 will make much difference though for those serious about data profiling.
Hand Coding with Visual FoxPro (VFP9)
Personally, I feel that your best bet for data profiling is to purchase a data profiling tool from a vendor with a proven track record and with good support. Hand coding ETL is difficult enough as it is, but hand-coding a profiler is quite a challenge in its own right. Data profiling is much more than simply running some SQL queries on a table, and it goes well beyond determining candidate keys and identifying nullable columns. Hand coding this type of application requires research, development time, and plenty of quality controls.
With that said, I have a sample profiling application available in VFP9 that you can download here (click on “VFP Data Profiler”). It is very basic, but should give you a feel for what you will need to do to get started.
If you could choose a language to use, I would select FoxPro over anything else I’ve used (PHP, C++, C#, VB). It can shred XML, process strings blazingly fast, and its database container makes for a perfect metadata and staging repository. You can do more for less in VFP9 than you can do using many other languages or even SSIS.
From here
Data Profiling is an ongoing effort, but when the initial round is complete, the development of the datawarehouse shifts to designing the data model (for analytical data warehousing applications it would be a dimensional model). At the same time, quality issues are being addressed and the ETL team begins to prepare the groundwork for the next subsystem: Changed Data Capture (CDC).
#1 by Guru at September 1st, 2008
| Quote
It’s true that you have to “invest in Data Profiling to ensure integration success”, but investing in data profiling does not have to cost a lot. Open Source solutions exist and can be trusted for their robustness.
When you talk about data profiling and open source, one name comes to mind: Talend. Talend’s open source Talend Open Profiler makes it possible to perform data profiling without being an expert.
It is also easy to use, fast to operate and free to download. Talend has an active community able to solve problems and a special debugging team to quickly change features on updates.
You can download Talend Open Profiler from http://www.talend.com/download.php
#2 by Tod McKenna at September 1st, 2008
| Quote
Hi Guru,
If I didn’t agree with you, I would have let my spam filter keep your post in the Ether
But I do agree, and anyone interested in data profiling (ESPECIALLY if you are on Microsoft’s BI platform which has no dedicated data profiling capability (don’t let the 2008 SSIS Data Profiling Task fool you)) should consider Talend’s Open Profiler.
-Tod
#3 by andrea at March 7th, 2009
| Quote
Hi, Tod
When u finished the data profiling works every time, how do u stored and presented the results. excel, word, relational tables?
If they r relationa tables, will u report them via interface to the users?
I have tried to documented my results using MS word, but i found that to much items need recorded, Even some explanation of the bad data that given by BA.
Do u agree that data profiling will sometimes refered to business rules, or we only did it merely a techical process.
eg.
two fields named vhcl_type and vhcl_color,
the BA told that :
when vhcl_type=’bus’, the vhcl_color data quality was quite good.and
when vhcl_type=’truck’, there r too many ‘NULL’ in vhcl_color.
how do u consider this problem?
thanks for ur advice.
#4 by Tod McKenna at March 24th, 2009
| Quote
Hi Andrea,
Data profiling is separate from data quality auditing/management. Profiling is a design and development tool (and process!) that will allow you to examine the content, context, and quality of data coming into your data warehouse. The results of a good data profile will tell you how to build your staging area, your dimensional models, and your data quality procedures.
Currently, I store all data profiling results in relational tables as metadata. You can read more here: http://blog.todmeansfox.com/2007/07/03/data-profiling/ and here: http://blog.todmeansfox.com/2007/07/10/getting-started-with-data-profiling/.
Regarding your business rules question: Data profiling is an exploratory process helping you to explain the data you are receiving. In my experience, there is no room or reason to include business logic in a data profile. You would, however, build these things into a data quality system — preferably one that runs in your staging area or one that executes on data as it is loaded into your data warehouse.
Hope this helps!
#5 by andrea at March 25th, 2009
| Quote
hi,Tod
thanks very much.
your answer is very helpful.
but one thing let me confused that you said data profiling is separate from data quality auditing/management.
do you agree that data profiling is the first and important phase of enterprise data improvement.
what do you think about the relationship between data profiling and data quality.
how do you define the data quality management. in my opinion, data quality management of a DW/BI project includes data profiling, correcting, standadizing, monitoring.
#6 by DavidL at April 15th, 2009
| Quote
Hi Andrea,
When Tod says data profiling is separate from data quality auditing, I think it seems to be the case.
Because data profiling is the process of examining the data available in existing data sources (e.g. databases, files…) and collecting statistics. Data quality entails more than helping companies get correct data into their information systems; it also means getting rid of bad, corrupted, or duplicate data.
Of course they are related, but are not complimentary IMHO.
Hope this helped!
#7 by Tod McKenna at April 19th, 2009
| Quote
Hi Andrea,
Sorry about the confusion. I’ll try to make a better distinction between data profiling and data quality.
Think of data profiling as a precursor to everything else; an introduction. Before you design for or use any data in any of the disparate systems in your enterprise, you need to understand it completely. Data profiling is that process. You will build a ton of useful metadata that you can rely on in later phases of ETL and DW/BI development. Data profiling, consequently, is also a precursor to data quality management.
If you think of data profiling as part of data quality, then you may be inclined to skip a thorough data profile at the start of a new project. The reasoning might be: “I’ll learn the source system while developing the ETL and mappings, and I’ll dig even deeper once the business analysts tell me what to look for regarding potential quality issues. I’ll profile as I go.”
In my experience, this is a death knoll. Profile first and thoroughly. When you move into data quality-related activities (which includes Extract and Transform functions of ETL, as well as complete data quality and monitoring systems you might develop as part of your DW/BI layer), you will have a complete catalog of metadata compiled during your initial profiling to turn to.
Once I’ve been introduced to the data, I am free to test the quality of the relationship. During data quality screenings, I may conduct additional profile-like activities and compile more statistics. I wouldn’t call this a new “data profile”, but rather, simply a part of my data quality research.
Another difference is specificity. Data profiles are broad and should touch upon each and every element you intend to work with. Data quality management tends to be more specific, as you build rules and logic to question the integrity of particular attributes and relationships in the data, oftentimes crossing multiple business processes and systems.
I’d be interested to hear your additional thoughts on this!
-Tod