In my most recent FoxPro Advisor article, “From Source to Warehouse: A VFP Data Integration Strategy“, I discuss an approach to using Visual FoxPro to hand-code an ETL solution. I dedicated three measly paragraphs to Data Profiling. In the first paragraph, I wrote that:

Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more [than] reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.

Whoa. Doesn’t sound as though three paragraphs do this topic justice. In fact, entire books have been written about this very important subject. Because my article was not about data profiling, I had to cut it short and give you — the reader — only a taste of what is involved.

But this should in no way minimize the importance of data profiling.

If you are using VFP to write your ETL solution then chances are you will want to use it to profile your data. I may end up writing a full article on this in the near future, but for now, I want to touch on some of the more important points regarding data profiling. I should also mention that using VFP for data profiling is an excellent decision — especially if one or more of your source systems are written in FoxPro. Heck, if you’re reading from dBase, Excel, or flat text files, no other language out there can compete with the Fox (Ok, I’ll stop cheerleading now…):

  1. First and foremost, data profiling is not the same thing as code profiling, nor does it involve the same mechanisms. Although there are some overlaps in concepts and methods, these are two very separate tasks designed to achieve separate things.
  2. Data Profiling and data quality are closely related, generally falling under a broader ‘Data Analysis’ category of BI and Data Warehousing. Often, data profiling reveals data quality issues as well as many other interesting facts about the data.
  3. Structural analysis, field domains and ranges, and various statistics at the column and row level should be generated during a profile. It is a mistake to assume that running a single SQL statement on a table will suffice.
  4. A data profile should produce valuable metadata that can be used when writing transformations and conforming data. Because source metadata is often out-of-date or inadequate, this part of the profile can prove to be invaluable.
  5. Data Profiling, like designing a data warehouse in general, is an iterative process. A Profiling script may work well on some source data, but be utterly useless or pointless on other source data. This is where hand-coding really benefits you. You can tailor your profiling to meet the requirements of both your BI strategy and the structure and state of the source data.
  6. Data profiling allows you to build a list of assumptions (field A is never NULL, though it accepts them; field B is always a positive integer; table C always has exactly one record; etc.) and action items (We need to atomize the fullname field into first, middle, and last; Why are columns X, Y, and Z empty when the documentation clearly states that they are necessary for process M?; etc.)

As you can probably tell, a proper data profile can take time and effort to get right. But the benefit? A complete and documented understanding of each source data system that will allow you to build your ETL solution with little hassle and fewer surprises. On top of that, it will pinpoint data problems and inefficiencies, oddities, and low-hanging fruits.

What’s wonderful in FoxPro is that there are a whole slew of commands (STREXTRACT, SUBSTRING, CALCULATE, FSIZE, etc) and classes (Cursor Adapters for example) available for you to help out. You need only to rely on your ingenuity and skill as a FoxPro programmer.

Next entry, I’ll post some code to get you started.