Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Data Profiling’


Published December 7th, 2007

ETL Subsystem 1: Data Profiling

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Data profiling is the most important step in designing a data warehousing solution. Kimball identifies it as Subsystem 1 for good reason. A Data profile forms the foundation for all phases of data integration. As one author writes, without a data profile you end up in the cycle of code, load, and explode! Trust me (and him, and everyone else) on this: Invest in Data Profiling to ensure integration success.

I have already talked about the data profile in my posts “Data Profiling” and “Getting Started with Data Profiling“. I won’t beat a dead horse here, so please take a moment to review those two short posts for more details. Here is the definition that I have been using:

Data Profiling
Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.

What’s more is that data profiling is dynamic in that it must be repeated often during the data warehousing life cycle. Attribute domains and ranges can change, for example. But also consider a situation where the source system goes through a version upgrade. The developers of the system might have added some new data elements or cleaned up some quality issues. You will need to re-run your data profile and reconcile any differences with your current integration. Without this, you would be left with only the documentation provided which is often not enough.

SQL Server 2005 Integration Services (SSIS)

Informatica, Oracle, DataFlux (SAS), IBM and many others have dedicated profiling tools. Microsoft does not (do a Google search for “Data Profiling Tools” and you won’t see Microsoft anywhere on the map). So if it is so important, why doesn’t Microsoft provide a data profiling tool? Good question.

In SQL Server 2008, there promises to be a new SSIS task called the Data Profiling Task (and accompanying Viewer). This task will certainly help in designing profiling and quality solutions. Although I have not evaluated it yet, I suppose it will take the place of using the hodgepodge of other tasks to accomplish domain and range-related profiling functions. You can read more about it here at Stacia Misner’s Blog. The SQL Server 2008 November CTP, which contains the task is located here.

For the rest of us using SQL Server 2005 in a production environment, you can use SSIS as a data profiling tool with some effort. The process involves utilizing the Script, Row Count, Multicast, Aggregate, and Conditional Split tasks among others in a data flow. The Script Component is very useful in this regard, especially for being able to use regular expressions.

Using SSIS in this way also allows you to discover table structures in the source system if available through the connection type. For relationships, you would need to examine the database schema manually.

Through SSIS, you would run a profiling package on each table or common group of tables and log the results to a specialized database for analysis. Tons of very useful metadata will be produced that can form the basis of your data model and logical data map. None of this comes automatically. You will need to develop this system from the ground-up.

I get the feeling that Data Profiling was an afterthought for the design team at Microsoft. I also get the feeling that they’re finally waking up and will begin to provide important profiling tools in subsequent releases. I don’t think that their offering in 2008 will make much difference though for those serious about data profiling.

Hand Coding with Visual FoxPro (VFP9)

Personally, I feel that your best bet for data profiling is to purchase a data profiling tool from a vendor with a proven track record and with good support. Hand coding ETL is difficult enough as it is, but hand-coding a profiler is quite a challenge in its own right. Data profiling is much more than simply running some SQL queries on a table, and it goes well beyond determining candidate keys and identifying nullable columns. Hand coding this type of application requires research, development time, and plenty of quality controls.

With that said, I have a sample profiling application available in VFP9 that you can download here (click on “VFP Data Profiler”). It is very basic, but should give you a feel for what you will need to do to get started.

If you could choose a language to use, I would select FoxPro over anything else I’ve used (PHP, C++, C#, VB). It can shred XML, process strings blazingly fast, and its database container makes for a perfect metadata and staging repository. You can do more for less in VFP9 than you can do using many other languages or even SSIS.

From here

Data Profiling is an ongoing effort, but when the initial round is complete, the development of the datawarehouse shifts to designing the data model (for analytical data warehousing applications it would be a dimensional model). At the same time, quality issues are being addressed and the ETL team begins to prepare the groundwork for the next subsystem: Changed Data Capture (CDC).

Published September 13th, 2007

Hand Coding the Data Warehouse with VFP

I have posted a new whitepaper that supplements the material I presented at FoxForward and at AFUG in Atlanta. This whitepaper was published in the FoxForward 2007 Book that Kevin provided to conference attendees, which contains lots of other great session material.

This whitepaper makes the case for using Visual FoxPro as a hand-coding tool for developing data profiling, metadata, ETL, and other data warehousing applications.

Download it here.

Published July 20th, 2007

Parsing firstname / middlename / lastname

There are many ways to parse or tokenize data in VFP. I like using STREXTRACT and GETWORDNUM for example. But there are cases in which the actual parsing needs to be a bit smarter than what these common methods offer. Take a concatenated name field for example. Common to many legacy (and unfortunately, some modern) systems, is a ‘name’ or ‘address1′ field that stores a person or company’s full name. This method is not looked upon favorably by data integrators or people who like atomic, normalized data (like me!). These types of fields don’t work well with indexes either.

Of course, the design solution is to create 5 or more fields that represent a person’s name: prefix / first / middle / last / suffix. These atomic values can then be arranged as needed (”last, first middle”; “first last”, initials only, etc.) in reports and for display. But oftentimes, making this decision now is too late.

An alternative is to write a smart parser that can place each name element into its appropriate box. In the following code example, I provide a solution that can handle this tasks. My only assumption for this code is that the name is stored in prefix / first / middle / last / suffix order. By data profiling, you can determine how true and consistent this assumption is on your data. You may discover that 95% of the file is stored in lastname / firstname order instead. Whatever the case may be, you can use the method below as a starting point.

The following snippet gets us started, and in the end, loops through the names and prints the parsed results to screen:

#DEFINE PREFIX  1
#DEFINE FIRSTNAME  2
#DEFINE MIDDLENAME  3
#DEFINE SURNAME  4
#DEFINE SUFFIX  5
 
CLEAR
 
*-- some names to demonstrate
DIMENSION aMyNames[13]
aMyNames[1] = "Andrew MacNeill"
aMyNames[2] = "Andy Kramek"
aMyNames[3] = "Bernard Bout"
aMyNames[4] = "Calvin Hsia"
aMyNames[5] = "Cesar Chalom"
aMyNames[6] = "Craig Baily"
aMyNames[7] = "Craig Berntson"
aMyNames[8] = "Emerson Santon Reed"
aMyNames[9] = "Eric den Doop"
aMyNames[10] = "gonzomaximus"
aMyNames[11] = "Tod J. McKenna II"
aMyNames[12] = "Dr. Ralph Kimball"
aMyNames[13] = "Miss Lucy Jones"
 
*-- placeholder for tokenized name, passed by reference into split_name
DIMENSION aParts[1]
 
FOR x = 1 TO 13
  cLastFirst = split_name(aMyNames[x], @aParts)
  ? cLastFirst
  ? "firstname: " + aParts[FIRSTNAME] + CHR(9)
  ?? "lastname: " + aParts[SURNAME]
NEXT

Here’s the code in function “split_name”

FUNCTION split_name
LPARAMETERS tcName , taParts
 
*-- variables and arrays I will use
LOCAL cPart AS Character
LOCAL x , nTotParts AS Integer
LOCAL lIsSuffix , lIsPrefix AS Logical
LOCAL ARRAY aTemp[1]
EXTERNAL ARRAY taParts
 
*-- get the data from tcName into an array for easy looping
tcName = STRTRAN(UPPER(ALLT(tcName))," ",CHR(13))
DIMENSION taParts[5]
STORE "" TO taParts
DIMENSION aTemp[1]
FOR x = 1 TO MEMLINES(tcName)
  cPart = ALLTRIM(MLINE(tcName,x))
  IF !EMPTY(cPart)
    aTemp[ALEN(aTemp,1)] = cPart
    IF x < MEMLINES(tcName)
      DIMENSION aTemp[ALEN(aTemp,1)+1]
    ENDIF
  ENDIF
NEXT
 
*-- now loop through temp array and put each name segment into a smart little box
nTotParts = ALEN(aTemp,1)
FOR x = 1 TO nTotParts
  IF VARTYPE(aTemp[x])!="C"
    LOOP
  ENDIF
  *-- get rid of any periods used in the part, and convert to uppercase
  cPart = STRTRAN(ALLTRIM(UPPER(aTemp[x])),".","")
  IF nTotParts = 1
    *-- Madonna? gonzomaximus? Treat single names as a lastname
    taParts[SURNAME] = cPart
  ELSE
    *-- Note: A data profile on a name file I parsed a few years back revealed
    *-- the following common name prefixes. Beef this list up if you have others!
    lIsPrefix = INLIST(cPart,"MR","MRS","MS","MISS","DR","PROF","SIR",;
                  "MASTER","REV","REVERAND","FATHER","SISTER","BROTHER",;
                  "BR","SIS","ATTORNEY","COL","COLONOL","REP","PRES")
    IF lIsPrefix
      taParts[PREFIX] = cPart
      LOOP
    ENDIF
    *-- Note: The same profile mentioned above revealed the following suffixes
    lIsSuffix = INLIST(cPart,"PHD","II","III","IV","V","JR","SR","ESQ","CPA","CEA","DD","MD","JD")
    IF lIsSuffix
      taParts[SUFFIX] = cPart
      LOOP
    ENDIF
    IF x = nTotParts
      IF EMPTY(taParts[SURNAME])
        taParts[SURNAME] = cPart
      ELSE
        IF EMPTY(taParts[FIRSTNAME])
          taParts[FIRSTNAME] = cPart
        ELSE
          taParts[MIDDLENAME] = cPart
        ENDIF
      ENDIF
    ELSE
      IF EMPTY(taParts[FIRSTNAME])
        taParts[FIRSTNAME] = cPart
      ELSE
        IF EMPTY(aParts[MIDDLENAME])
          taParts[MIDDLENAME] = cPart
        ELSE
          taParts[SURNAME] = cPart
        ENDIF
      ENDIF
    ENDIF
  ENDIF
NEXT
 
RETURN ALLTRIM(ALLTRIM(taParts[SURNAME]) + ", " +;
         ALLTRIM(taParts[FIRSTNAME]) + " " + ;
         ALLTRIM(taParts[MIDDLENAME]) )
ENDFUNC

So there you have it. The code above looks at the tcName variable and attempts to dump its parts into the taParts array. Some defined constants exist to help ease the “where the heck am I in this array” confusions (a practice I use for stuff like this all the time). As an added bonus, the code returns the name in last / first middle order, which is usually a good format for display purposes.

Published July 10th, 2007

Getting Started with Data Profiling

In my last post I said I would publish some code that you could use to get started profiling your FoxPro tables. I’ve included the following sample here, as well as in the Download Section for you to use. This example is written as a procedure, but you should consider converting this to objects. You can create patterns and really cut down on the amount of work you do. Although this code is generic, you may in fact need to write modified versions for certain tables in certain environments. OO techniques lend themselves to this nicely.

Anyway, here is how you would call the function listed below:

LOCAL lcSkipList, lcDirectory, lcTableName,;
     lcResultAlias  AS String 
 
lcDirectory = "C:\\Work\\tables_to_profile\\"
lcTableName = 'some_table'
lcResultAlias = 'some_resultset'
 
*-- enter the columns, separated by comma, to skip
lcSkipList = '' 
 
*-- run the data profile
profile_table(lcTableName, lcDirectory, lcResultAlias, lcSkipList)

The following function, ‘profile_table’, is as follows (please note, you will need my get_median function from here):

FUNCTION profile_table
  LPARAMETERS tcTableName, tcDirectory, tcAltName, tcSkipList
 
  LOCAL nFldCount AS Integer
 
  tcAltName = EVL(tcAltName,'profiled')
  tcSkipList = EVL(tcSkipList,'')
  lcCursorName = 'db_' + tcAltName 
 
  TRY
    IF USED(tcAltName)
      USE IN (tcAltName)
    ENDIF
    USE (ADDBS(tcDirectory)+tcTableName) IN 0 ALIAS (tcAltName )
    nFldCount = AFIELDS(aColumns,tcAltName)
  CATCH
    nFldCount = -1
  ENDTRY
 
  IF nFldCount > 0
    IF USED(lcCursorName )
      USE IN (lcCursorName )
    ENDIF
 
    CREATE CURSOR (lcCursorName)(;
			table_name varchar(45)	,;
			fld_name   varchar(45)	,;
			fld_type   char(1)		,;
			fld_size   integer      ,;
			fld_prec   integer      ,;
			fld_null   logical      ,;
			range_low  varchar(60)	,;
			range_high varchar(60)	,;
			domain     memo         ,;
			uniq_cnt   integer      ,;
			empty_cnt  integer      ,;
			full_cnt   integer      ,;
			dele_cnt   integer      ,;
			null_cnt   integer      ,;
			most_freq  varchar(60)  ,;
			data_mean  numeric(20,4),;
			data_var   numeric(20,4),;
			data_std   numeric(20,4),;
			median     numeric(20,4),;
			location   varchar(200) )
 
    FOR n = 1 TO nFldCount 
 
        *-- afield properties
        m.table_name = tcTableName
        m.fld_name = aColumns[n,1]
        m.fld_type = aColumns[n,2]
        m.fld_size = aColumns[n,3]
        m.fld_prec = aColumns[n,4]
        m.fld_null = aColumns[n,5]
 
        ? tcTableName + ' Field ' + m.fld_name
        ? DATETIME()
 
        IF ALLTRIM(LOWER(m.fld_name)) $ ALLTRIM(LOWER(tcSkipList))
          ?? ' skipped'
          LOOP
        ENDIF
 
        m.location = tcDirectory
 
        lcSQL = "SELECT " + m.fld_name + " AS prof_fld FROM " + ;
                        tcAltName + " INTO CURSOR crTmpProf"
        &lcSQL 
 
        *-- note to self, need to test to see if using SQL-Select is more efficient
        SELECT crTmpProf
        CALCULATE MIN(prof_fld) TO m.range_low
        CALCULATE MAX(prof_fld) TO m.range_high
        CALCULATE CNT() FOR EMPTY(prof_fld) TO m.empty_cnt
        CALCULATE CNT() FOR !EMPTY(prof_fld) TO m.full_cnt
        CALCULATE CNT() FOR DELETED() TO m.dele_cnt
        CALCULATE CNT() FOR ISNULL(prof_fld) TO m.null_cnt 
 
        *-- numbers only!
        IF INLIST(m.fld_type,"N","I","Y","B","F")
          CALCULATE AVG(prof_fld) TO m.data_mean
          CALCULATE STD(prof_fld) TO m.data_std
          CALCULATE VAR(prof_fld) TO m.data_var
          m.median = get_median(2,"crTmpProf","prof_fld")
        ELSE
          STORE -1 TO m.data_mean , m.data_std , m.data_var, m.median
        ENDIF
 
        *-- distinct list and domain
        SELECT prof_fld, COUNT(prof_fld) as freq ;
            FROM crTmpProf GROUP BY prof_fld ;
            ORDER BY freq DESC INTO CURSOR crTmpDistProf
 
        SELECT crTmpDistProf
        LOCATE
        m.uniq_cnt = _TALLY
        m.most_freq = crTmpDistProf.prof_fld
        m.domain = ''
        SCAN
          m.domain = m.domain + TRANSFORM(crTmpDistProf.prof_fld) + CHR(13)
        ENDSCAN
 
        SELECT (lcCursorName)
        APPEND BLANK
        GATHER MEMVAR MEMO
 
        ? REPLICATE('+',12)
 
    NEXT
  ENDIF
ENDFUNC

So there you have it. Now you have a table that lists each attribute, with additional, highly useful information. You can use this metadata to conduct index analysis, check for misspellings, verify that lookup codes exist in external tables, verify that numerical data hasn’t deviated too far, etc. This type of metadata is essential to have before you begin any quality or integration project!

Data profiling is a hefty subject — one in which I promise to address again in the near future.

Published July 3rd, 2007

Data Profiling

In my most recent FoxPro Advisor article, “From Source to Warehouse: A VFP Data Integration Strategy“, I discuss an approach to using Visual FoxPro to hand-code an ETL solution. I dedicated three measly paragraphs to Data Profiling. In the first paragraph, I wrote that:

Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more [than] reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.

Whoa. Doesn’t sound as though three paragraphs do this topic justice. In fact, entire books have been written about this very important subject. Because my article was not about data profiling, I had to cut it short and give you — the reader — only a taste of what is involved.

But this should in no way minimize the importance of data profiling.

If you are using VFP to write your ETL solution then chances are you will want to use it to profile your data. I may end up writing a full article on this in the near future, but for now, I want to touch on some of the more important points regarding data profiling. I should also mention that using VFP for data profiling is an excellent decision — especially if one or more of your source systems are written in FoxPro. Heck, if you’re reading from dBase, Excel, or flat text files, no other language out there can compete with the Fox (Ok, I’ll stop cheerleading now…):

  1. First and foremost, data profiling is not the same thing as code profiling, nor does it involve the same mechanisms. Although there are some overlaps in concepts and methods, these are two very separate tasks designed to achieve separate things.
  2. Data Profiling and data quality are closely related, generally falling under a broader ‘Data Analysis’ category of BI and Data Warehousing. Often, data profiling reveals data quality issues as well as many other interesting facts about the data.
  3. Structural analysis, field domains and ranges, and various statistics at the column and row level should be generated during a profile. It is a mistake to assume that running a single SQL statement on a table will suffice.
  4. A data profile should produce valuable metadata that can be used when writing transformations and conforming data. Because source metadata is often out-of-date or inadequate, this part of the profile can prove to be invaluable.
  5. Data Profiling, like designing a data warehouse in general, is an iterative process. A Profiling script may work well on some source data, but be utterly useless or pointless on other source data. This is where hand-coding really benefits you. You can tailor your profiling to meet the requirements of both your BI strategy and the structure and state of the source data.
  6. Data profiling allows you to build a list of assumptions (field A is never NULL, though it accepts them; field B is always a positive integer; table C always has exactly one record; etc.) and action items (We need to atomize the fullname field into first, middle, and last; Why are columns X, Y, and Z empty when the documentation clearly states that they are necessary for process M?; etc.)

As you can probably tell, a proper data profile can take time and effort to get right. But the benefit? A complete and documented understanding of each source data system that will allow you to build your ETL solution with little hassle and fewer surprises. On top of that, it will pinpoint data problems and inefficiencies, oddities, and low-hanging fruits.

What’s wonderful in FoxPro is that there are a whole slew of commands (STREXTRACT, SUBSTRING, CALCULATE, FSIZE, etc) and classes (Cursor Adapters for example) available for you to help out. You need only to rely on your ingenuity and skill as a FoxPro programmer.

Next entry, I’ll post some code to get you started.