Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for July, 2007


Published July 10th, 2007

Getting Started with Data Profiling

In my last post I said I would publish some code that you could use to get started profiling your FoxPro tables. I’ve included the following sample here, as well as in the Download Section for you to use. This example is written as a procedure, but you should consider converting this to objects. You can create patterns and really cut down on the amount of work you do. Although this code is generic, you may in fact need to write modified versions for certain tables in certain environments. OO techniques lend themselves to this nicely.

Anyway, here is how you would call the function listed below:

LOCAL lcSkipList, lcDirectory, lcTableName,;
     lcResultAlias  AS String 
 
lcDirectory = "C:\\Work\\tables_to_profile\\"
lcTableName = 'some_table'
lcResultAlias = 'some_resultset'
 
*-- enter the columns, separated by comma, to skip
lcSkipList = '' 
 
*-- run the data profile
profile_table(lcTableName, lcDirectory, lcResultAlias, lcSkipList)

The following function, ‘profile_table’, is as follows (please note, you will need my get_median function from here):

FUNCTION profile_table
  LPARAMETERS tcTableName, tcDirectory, tcAltName, tcSkipList
 
  LOCAL nFldCount AS Integer
 
  tcAltName = EVL(tcAltName,'profiled')
  tcSkipList = EVL(tcSkipList,'')
  lcCursorName = 'db_' + tcAltName 
 
  TRY
    IF USED(tcAltName)
      USE IN (tcAltName)
    ENDIF
    USE (ADDBS(tcDirectory)+tcTableName) IN 0 ALIAS (tcAltName )
    nFldCount = AFIELDS(aColumns,tcAltName)
  CATCH
    nFldCount = -1
  ENDTRY
 
  IF nFldCount > 0
    IF USED(lcCursorName )
      USE IN (lcCursorName )
    ENDIF
 
    CREATE CURSOR (lcCursorName)(;
			table_name varchar(45)	,;
			fld_name   varchar(45)	,;
			fld_type   char(1)		,;
			fld_size   integer      ,;
			fld_prec   integer      ,;
			fld_null   logical      ,;
			range_low  varchar(60)	,;
			range_high varchar(60)	,;
			domain     memo         ,;
			uniq_cnt   integer      ,;
			empty_cnt  integer      ,;
			full_cnt   integer      ,;
			dele_cnt   integer      ,;
			null_cnt   integer      ,;
			most_freq  varchar(60)  ,;
			data_mean  numeric(20,4),;
			data_var   numeric(20,4),;
			data_std   numeric(20,4),;
			median     numeric(20,4),;
			location   varchar(200) )
 
    FOR n = 1 TO nFldCount 
 
        *-- afield properties
        m.table_name = tcTableName
        m.fld_name = aColumns[n,1]
        m.fld_type = aColumns[n,2]
        m.fld_size = aColumns[n,3]
        m.fld_prec = aColumns[n,4]
        m.fld_null = aColumns[n,5]
 
        ? tcTableName + ' Field ' + m.fld_name
        ? DATETIME()
 
        IF ALLTRIM(LOWER(m.fld_name)) $ ALLTRIM(LOWER(tcSkipList))
          ?? ' skipped'
          LOOP
        ENDIF
 
        m.location = tcDirectory
 
        lcSQL = "SELECT " + m.fld_name + " AS prof_fld FROM " + ;
                        tcAltName + " INTO CURSOR crTmpProf"
        &lcSQL 
 
        *-- note to self, need to test to see if using SQL-Select is more efficient
        SELECT crTmpProf
        CALCULATE MIN(prof_fld) TO m.range_low
        CALCULATE MAX(prof_fld) TO m.range_high
        CALCULATE CNT() FOR EMPTY(prof_fld) TO m.empty_cnt
        CALCULATE CNT() FOR !EMPTY(prof_fld) TO m.full_cnt
        CALCULATE CNT() FOR DELETED() TO m.dele_cnt
        CALCULATE CNT() FOR ISNULL(prof_fld) TO m.null_cnt 
 
        *-- numbers only!
        IF INLIST(m.fld_type,"N","I","Y","B","F")
          CALCULATE AVG(prof_fld) TO m.data_mean
          CALCULATE STD(prof_fld) TO m.data_std
          CALCULATE VAR(prof_fld) TO m.data_var
          m.median = get_median(2,"crTmpProf","prof_fld")
        ELSE
          STORE -1 TO m.data_mean , m.data_std , m.data_var, m.median
        ENDIF
 
        *-- distinct list and domain
        SELECT prof_fld, COUNT(prof_fld) as freq ;
            FROM crTmpProf GROUP BY prof_fld ;
            ORDER BY freq DESC INTO CURSOR crTmpDistProf
 
        SELECT crTmpDistProf
        LOCATE
        m.uniq_cnt = _TALLY
        m.most_freq = crTmpDistProf.prof_fld
        m.domain = ''
        SCAN
          m.domain = m.domain + TRANSFORM(crTmpDistProf.prof_fld) + CHR(13)
        ENDSCAN
 
        SELECT (lcCursorName)
        APPEND BLANK
        GATHER MEMVAR MEMO
 
        ? REPLICATE('+',12)
 
    NEXT
  ENDIF
ENDFUNC

So there you have it. Now you have a table that lists each attribute, with additional, highly useful information. You can use this metadata to conduct index analysis, check for misspellings, verify that lookup codes exist in external tables, verify that numerical data hasn’t deviated too far, etc. This type of metadata is essential to have before you begin any quality or integration project!

Data profiling is a hefty subject — one in which I promise to address again in the near future.

Published July 3rd, 2007

Data Profiling

In my most recent FoxPro Advisor article, “From Source to Warehouse: A VFP Data Integration Strategy“, I discuss an approach to using Visual FoxPro to hand-code an ETL solution. I dedicated three measly paragraphs to Data Profiling. In the first paragraph, I wrote that:

Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more [than] reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.

Whoa. Doesn’t sound as though three paragraphs do this topic justice. In fact, entire books have been written about this very important subject. Because my article was not about data profiling, I had to cut it short and give you — the reader — only a taste of what is involved.

But this should in no way minimize the importance of data profiling.

If you are using VFP to write your ETL solution then chances are you will want to use it to profile your data. I may end up writing a full article on this in the near future, but for now, I want to touch on some of the more important points regarding data profiling. I should also mention that using VFP for data profiling is an excellent decision — especially if one or more of your source systems are written in FoxPro. Heck, if you’re reading from dBase, Excel, or flat text files, no other language out there can compete with the Fox (Ok, I’ll stop cheerleading now…):

  1. First and foremost, data profiling is not the same thing as code profiling, nor does it involve the same mechanisms. Although there are some overlaps in concepts and methods, these are two very separate tasks designed to achieve separate things.
  2. Data Profiling and data quality are closely related, generally falling under a broader ‘Data Analysis’ category of BI and Data Warehousing. Often, data profiling reveals data quality issues as well as many other interesting facts about the data.
  3. Structural analysis, field domains and ranges, and various statistics at the column and row level should be generated during a profile. It is a mistake to assume that running a single SQL statement on a table will suffice.
  4. A data profile should produce valuable metadata that can be used when writing transformations and conforming data. Because source metadata is often out-of-date or inadequate, this part of the profile can prove to be invaluable.
  5. Data Profiling, like designing a data warehouse in general, is an iterative process. A Profiling script may work well on some source data, but be utterly useless or pointless on other source data. This is where hand-coding really benefits you. You can tailor your profiling to meet the requirements of both your BI strategy and the structure and state of the source data.
  6. Data profiling allows you to build a list of assumptions (field A is never NULL, though it accepts them; field B is always a positive integer; table C always has exactly one record; etc.) and action items (We need to atomize the fullname field into first, middle, and last; Why are columns X, Y, and Z empty when the documentation clearly states that they are necessary for process M?; etc.)

As you can probably tell, a proper data profile can take time and effort to get right. But the benefit? A complete and documented understanding of each source data system that will allow you to build your ETL solution with little hassle and fewer surprises. On top of that, it will pinpoint data problems and inefficiencies, oddities, and low-hanging fruits.

What’s wonderful in FoxPro is that there are a whole slew of commands (STREXTRACT, SUBSTRING, CALCULATE, FSIZE, etc) and classes (Cursor Adapters for example) available for you to help out. You need only to rely on your ingenuity and skill as a FoxPro programmer.

Next entry, I’ll post some code to get you started.

Published July 2nd, 2007

Welcome to the new Tod means Fox

I’ve left Blogger and Windows Live Spaces for my own domain, more control, and greater flexibility. I hope you all like the new digs. My intention now is to grow this site by including some downloads, more resources, and additional FoxPro and Business Intelligence information.

So please, take a moment to update your RSS feeds to my new one: http://blog.todmeansfox.com/feed and your Bookmarks (use one of the Social Bookmarking links on the right-hand-side if you want).

And, feel free to contact me (todatgrengamadotcom) if you have any comments or questions!