Getting Started with Data Profiling
In my last post I said I would publish some code that you could use to get started profiling your FoxPro tables. I’ve included the following sample here, as well as in the Download Section for you to use. This example is written as a procedure, but you should consider converting this to objects. You can create patterns and really cut down on the amount of work you do. Although this code is generic, you may in fact need to write modified versions for certain tables in certain environments. OO techniques lend themselves to this nicely.
Anyway, here is how you would call the function listed below:
LOCAL lcSkipList, lcDirectory, lcTableName,; lcResultAlias AS String lcDirectory = "C:\\Work\\tables_to_profile\\" lcTableName = 'some_table' lcResultAlias = 'some_resultset' *-- enter the columns, separated by comma, to skip lcSkipList = '' *-- run the data profile profile_table(lcTableName, lcDirectory, lcResultAlias, lcSkipList)
The following function, ‘profile_table’, is as follows (please note, you will need my get_median function from here):
FUNCTION profile_table LPARAMETERS tcTableName, tcDirectory, tcAltName, tcSkipList LOCAL nFldCount AS Integer tcAltName = EVL(tcAltName,'profiled') tcSkipList = EVL(tcSkipList,'') lcCursorName = 'db_' + tcAltName TRY IF USED(tcAltName) USE IN (tcAltName) ENDIF USE (ADDBS(tcDirectory)+tcTableName) IN 0 ALIAS (tcAltName ) nFldCount = AFIELDS(aColumns,tcAltName) CATCH nFldCount = -1 ENDTRY IF nFldCount > 0 IF USED(lcCursorName ) USE IN (lcCursorName ) ENDIF CREATE CURSOR (lcCursorName)(; table_name varchar(45) ,; fld_name varchar(45) ,; fld_type char(1) ,; fld_size integer ,; fld_prec integer ,; fld_null logical ,; range_low varchar(60) ,; range_high varchar(60) ,; domain memo ,; uniq_cnt integer ,; empty_cnt integer ,; full_cnt integer ,; dele_cnt integer ,; null_cnt integer ,; most_freq varchar(60) ,; data_mean numeric(20,4),; data_var numeric(20,4),; data_std numeric(20,4),; median numeric(20,4),; location varchar(200) ) FOR n = 1 TO nFldCount *-- afield properties m.table_name = tcTableName m.fld_name = aColumns[n,1] m.fld_type = aColumns[n,2] m.fld_size = aColumns[n,3] m.fld_prec = aColumns[n,4] m.fld_null = aColumns[n,5] ? tcTableName + ' Field ' + m.fld_name ? DATETIME() IF ALLTRIM(LOWER(m.fld_name)) $ ALLTRIM(LOWER(tcSkipList)) ?? ' skipped' LOOP ENDIF m.location = tcDirectory lcSQL = "SELECT " + m.fld_name + " AS prof_fld FROM " + ; tcAltName + " INTO CURSOR crTmpProf" &lcSQL *-- note to self, need to test to see if using SQL-Select is more efficient SELECT crTmpProf CALCULATE MIN(prof_fld) TO m.range_low CALCULATE MAX(prof_fld) TO m.range_high CALCULATE CNT() FOR EMPTY(prof_fld) TO m.empty_cnt CALCULATE CNT() FOR !EMPTY(prof_fld) TO m.full_cnt CALCULATE CNT() FOR DELETED() TO m.dele_cnt CALCULATE CNT() FOR ISNULL(prof_fld) TO m.null_cnt *-- numbers only! IF INLIST(m.fld_type,"N","I","Y","B","F") CALCULATE AVG(prof_fld) TO m.data_mean CALCULATE STD(prof_fld) TO m.data_std CALCULATE VAR(prof_fld) TO m.data_var m.median = get_median(2,"crTmpProf","prof_fld") ELSE STORE -1 TO m.data_mean , m.data_std , m.data_var, m.median ENDIF *-- distinct list and domain SELECT prof_fld, COUNT(prof_fld) as freq ; FROM crTmpProf GROUP BY prof_fld ; ORDER BY freq DESC INTO CURSOR crTmpDistProf SELECT crTmpDistProf LOCATE m.uniq_cnt = _TALLY m.most_freq = crTmpDistProf.prof_fld m.domain = '' SCAN m.domain = m.domain + TRANSFORM(crTmpDistProf.prof_fld) + CHR(13) ENDSCAN SELECT (lcCursorName) APPEND BLANK GATHER MEMVAR MEMO ? REPLICATE('+',12) NEXT ENDIF ENDFUNC
So there you have it. Now you have a table that lists each attribute, with additional, highly useful information. You can use this metadata to conduct index analysis, check for misspellings, verify that lookup codes exist in external tables, verify that numerical data hasn’t deviated too far, etc. This type of metadata is essential to have before you begin any quality or integration project!
Data profiling is a hefty subject — one in which I promise to address again in the near future.
I'm a Quant Technical Specialist (Data Warehousing and Business Intelligence), with expertise in business analysis, data modeling, and data integration. I have extensive experience developing vertical and integrated desktop, Internet, and BI applications spanning municipal, clinical, and financial industries.

August 7th, 2007 at 7:50 pm
[…] - via Roy Osherove DateTime Formatting with single custom format specifiers - via Jean Paul Boodhoo Getting Started with Data Profiling - via Tod McKenna Howto: (Almost) Everything In Active Directory via C# Tracing in ASP.NET […]
November 4th, 2007 at 12:18 am
Jessie…
Man i just love your blog, keep the cool posts comin…..
January 16th, 2008 at 3:26 pm
[…] have already talked about the data profile in my posts “Data Profiling” and “Getting Started with Data Profiling“. I won’t beat a dead horse here, so please take a moment to review those two short […]
February 21st, 2008 at 3:33 pm
I could not agree more with your views that “data profiling is the most important step in designing a data warehousing solution.” No truer words could have been spoken. Profiling is more than the profile_table function, it can benefit from numerous existing models as well. I would appreciate your feedback on our sourceforge project (project jumper) that is proposing a metadata language and will be posting some open-source scripts to build the JUMP models.
February 23rd, 2008 at 11:56 am
Sure Steve,
Thanks for the feedback. I’ll be in touch…