In my last post I said I would publish some code that you could use to get started profiling your FoxPro tables. I’ve included the following sample here, as well as in the Download Section for you to use. This example is written as a procedure, but you should consider converting this to objects. You can create patterns and really cut down on the amount of work you do. Although this code is generic, you may in fact need to write modified versions for certain tables in certain environments. OO techniques lend themselves to this nicely.

Anyway, here is how you would call the function listed below:

LOCAL lcSkipList, lcDirectory, lcTableName,;
     lcResultAlias  AS String 
 
lcDirectory = "C:Worktables_to_profile"
lcTableName = 'some_table'
lcResultAlias = 'some_resultset'
 
*-- enter the columns, separated by comma, to skip
lcSkipList = '' 
 
*-- run the data profile
profile_table(lcTableName, lcDirectory, lcResultAlias, lcSkipList)

The following function, ‘profile_table’, is as follows (please note, you will need my get_median function from here):

FUNCTION profile_table
  LPARAMETERS tcTableName, tcDirectory, tcAltName, tcSkipList
 
  LOCAL nFldCount AS Integer
 
  tcAltName = EVL(tcAltName,'profiled')
  tcSkipList = EVL(tcSkipList,'')
  lcCursorName = 'db_' + tcAltName 
 
  TRY
    IF USED(tcAltName)
      USE IN (tcAltName)
    ENDIF
    USE (ADDBS(tcDirectory)+tcTableName) IN 0 ALIAS (tcAltName )
    nFldCount = AFIELDS(aColumns,tcAltName)
  CATCH
    nFldCount = -1
  ENDTRY
 
  IF nFldCount > 0
    IF USED(lcCursorName )
      USE IN (lcCursorName )
    ENDIF
 
    CREATE CURSOR (lcCursorName)(;
			table_name varchar(45)	,;
			fld_name   varchar(45)	,;
			fld_type   char(1)		,;
			fld_size   integer      ,;
			fld_prec   integer      ,;
			fld_null   logical      ,;
			range_low  varchar(60)	,;
			range_high varchar(60)	,;
			domain     memo         ,;
			uniq_cnt   integer      ,;
			empty_cnt  integer      ,;
			full_cnt   integer      ,;
			dele_cnt   integer      ,;
			null_cnt   integer      ,;
			most_freq  varchar(60)  ,;
			data_mean  numeric(20,4),;
			data_var   numeric(20,4),;
			data_std   numeric(20,4),;
			median     numeric(20,4),;
			location   varchar(200) )
 
    FOR n = 1 TO nFldCount 
 
        *-- afield properties
        m.table_name = tcTableName
        m.fld_name = aColumns[n,1]
        m.fld_type = aColumns[n,2]
        m.fld_size = aColumns[n,3]
        m.fld_prec = aColumns[n,4]
        m.fld_null = aColumns[n,5]
 
        ? tcTableName + ' Field ' + m.fld_name
        ? DATETIME()
 
        IF ALLTRIM(LOWER(m.fld_name)) $ ALLTRIM(LOWER(tcSkipList))
          ?? ' skipped'
          LOOP
        ENDIF
 
        m.location = tcDirectory
 
        lcSQL = "SELECT " + m.fld_name + " AS prof_fld FROM " + ;
                        tcAltName + " INTO CURSOR crTmpProf"
        &lcSQL 
 
        *-- note to self, need to test to see if using SQL-Select is more efficient
        SELECT crTmpProf
        CALCULATE MIN(prof_fld) TO m.range_low
        CALCULATE MAX(prof_fld) TO m.range_high
        CALCULATE CNT() FOR EMPTY(prof_fld) TO m.empty_cnt
        CALCULATE CNT() FOR !EMPTY(prof_fld) TO m.full_cnt
        CALCULATE CNT() FOR DELETED() TO m.dele_cnt
        CALCULATE CNT() FOR ISNULL(prof_fld) TO m.null_cnt 
 
        *-- numbers only!
        IF INLIST(m.fld_type,"N","I","Y","B","F")
          CALCULATE AVG(prof_fld) TO m.data_mean
          CALCULATE STD(prof_fld) TO m.data_std
          CALCULATE VAR(prof_fld) TO m.data_var
          m.median = get_median(2,"crTmpProf","prof_fld")
        ELSE
          STORE -1 TO m.data_mean , m.data_std , m.data_var, m.median
        ENDIF
 
        *-- distinct list and domain
        SELECT prof_fld, COUNT(prof_fld) as freq ;
            FROM crTmpProf GROUP BY prof_fld ;
            ORDER BY freq DESC INTO CURSOR crTmpDistProf
 
        SELECT crTmpDistProf
        LOCATE
        m.uniq_cnt = _TALLY
        m.most_freq = crTmpDistProf.prof_fld
        m.domain = ''
        SCAN
          m.domain = m.domain + TRANSFORM(crTmpDistProf.prof_fld) + CHR(13)
        ENDSCAN
 
        SELECT (lcCursorName)
        APPEND BLANK
        GATHER MEMVAR MEMO
 
        ? REPLICATE('+',12)
 
    NEXT
  ENDIF
ENDFUNC

So there you have it. Now you have a table that lists each attribute, with additional, highly useful information. You can use this metadata to conduct index analysis, check for misspellings, verify that lookup codes exist in external tables, verify that numerical data hasn’t deviated too far, etc. This type of metadata is essential to have before you begin any quality or integration project!

Data profiling is a hefty subject — one in which I promise to address again in the near future.