Published July 10th, 2007
Getting Started with Data Profiling
In my last post I said I would publish some code that you could use to get started profiling your FoxPro tables. I’ve included the following sample here, as well as in the Download Section for you to use. This example is written as a procedure, but you should consider converting this to objects. You can create patterns and really cut down on the amount of work you do. Although this code is generic, you may in fact need to write modified versions for certain tables in certain environments. OO techniques lend themselves to this nicely.
Anyway, here is how you would call the function listed below:
LOCAL lcSkipList, lcDirectory, lcTableName,; lcResultAlias AS String lcDirectory = "C:\\Work\\tables_to_profile\\" lcTableName = 'some_table' lcResultAlias = 'some_resultset' *-- enter the columns, separated by comma, to skip lcSkipList = '' *-- run the data profile profile_table(lcTableName, lcDirectory, lcResultAlias, lcSkipList)
The following function, ‘profile_table’, is as follows (please note, you will need my get_median function from here):
FUNCTION profile_table LPARAMETERS tcTableName, tcDirectory, tcAltName, tcSkipList LOCAL nFldCount AS Integer tcAltName = EVL(tcAltName,'profiled') tcSkipList = EVL(tcSkipList,'') lcCursorName = 'db_' + tcAltName TRY IF USED(tcAltName) USE IN (tcAltName) ENDIF USE (ADDBS(tcDirectory)+tcTableName) IN 0 ALIAS (tcAltName ) nFldCount = AFIELDS(aColumns,tcAltName) CATCH nFldCount = -1 ENDTRY IF nFldCount > 0 IF USED(lcCursorName ) USE IN (lcCursorName ) ENDIF CREATE CURSOR (lcCursorName)(; table_name varchar(45) ,; fld_name varchar(45) ,; fld_type char(1) ,; fld_size integer ,; fld_prec integer ,; fld_null logical ,; range_low varchar(60) ,; range_high varchar(60) ,; domain memo ,; uniq_cnt integer ,; empty_cnt integer ,; full_cnt integer ,; dele_cnt integer ,; null_cnt integer ,; most_freq varchar(60) ,; data_mean numeric(20,4),; data_var numeric(20,4),; data_std numeric(20,4),; median numeric(20,4),; location varchar(200) ) FOR n = 1 TO nFldCount *-- afield properties m.table_name = tcTableName m.fld_name = aColumns[n,1] m.fld_type = aColumns[n,2] m.fld_size = aColumns[n,3] m.fld_prec = aColumns[n,4] m.fld_null = aColumns[n,5] ? tcTableName + ' Field ' + m.fld_name ? DATETIME() IF ALLTRIM(LOWER(m.fld_name)) $ ALLTRIM(LOWER(tcSkipList)) ?? ' skipped' LOOP ENDIF m.location = tcDirectory lcSQL = "SELECT " + m.fld_name + " AS prof_fld FROM " + ; tcAltName + " INTO CURSOR crTmpProf" &lcSQL *-- note to self, need to test to see if using SQL-Select is more efficient SELECT crTmpProf CALCULATE MIN(prof_fld) TO m.range_low CALCULATE MAX(prof_fld) TO m.range_high CALCULATE CNT() FOR EMPTY(prof_fld) TO m.empty_cnt CALCULATE CNT() FOR !EMPTY(prof_fld) TO m.full_cnt CALCULATE CNT() FOR DELETED() TO m.dele_cnt CALCULATE CNT() FOR ISNULL(prof_fld) TO m.null_cnt *-- numbers only! IF INLIST(m.fld_type,"N","I","Y","B","F") CALCULATE AVG(prof_fld) TO m.data_mean CALCULATE STD(prof_fld) TO m.data_std CALCULATE VAR(prof_fld) TO m.data_var m.median = get_median(2,"crTmpProf","prof_fld") ELSE STORE -1 TO m.data_mean , m.data_std , m.data_var, m.median ENDIF *-- distinct list and domain SELECT prof_fld, COUNT(prof_fld) as freq ; FROM crTmpProf GROUP BY prof_fld ; ORDER BY freq DESC INTO CURSOR crTmpDistProf SELECT crTmpDistProf LOCATE m.uniq_cnt = _TALLY m.most_freq = crTmpDistProf.prof_fld m.domain = '' SCAN m.domain = m.domain + TRANSFORM(crTmpDistProf.prof_fld) + CHR(13) ENDSCAN SELECT (lcCursorName) APPEND BLANK GATHER MEMVAR MEMO ? REPLICATE('+',12) NEXT ENDIF ENDFUNC
So there you have it. Now you have a table that lists each attribute, with additional, highly useful information. You can use this metadata to conduct index analysis, check for misspellings, verify that lookup codes exist in external tables, verify that numerical data hasn’t deviated too far, etc. This type of metadata is essential to have before you begin any quality or integration project!
Data profiling is a hefty subject — one in which I promise to address again in the near future.
I'm a Quant Technical Specialist (Data Warehousing and Business Intelligence), with expertise in business analysis, data modeling, and data integration. I have extensive experience developing vertical and integrated desktop, Internet, and BI applications spanning municipal, clinical, and financial industries.
