Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Data Quality’


Published June 6th, 2008

Fun with Serial Dates

I ran into a bit of an oddity today while working with serial dates in SQL Server 2005 and MS Excel 2003. I’ve dealt with this before with Excel using FoxPro, but don’t recall the specifics. Anyway…

A serial date is a sequential number, starting on January 1st, 1900 (or if you want compatibility with a Mac serial date, 1904) that represents the number of days since any date to that point in time (you can also represent time by using a decimal, but I won’t get into that here). Excel has a bug (or a “Lotus 123 compatibility feature” as some like to call it) in it that counts Feb 29, 1900 as an actual date, when in fact 1900 was not a leap year.

Hence the first bit of fun. Run this TSQL code:

SELECT DATEDIFF(dd,'19000101','20080606')

The above query returns ‘39603′. So far so good. In SQL Server, you can now derive the date from the serial value by using convert:

SELECT CONVERT(DATETIME,DATEDIFF(dd,'19000101','20080606'))

…which gives me ‘2008-06-06 00:00:00.000′. So far so good.

The reason we’re using serial dates in the first place is to accommodate some models developed in Excel. The data is fed in directly from the database and calculations are performed.

When I take value 39603, put it into an Excel cell and re-format the column from General to Date, I get ‘June 4, 2008′! Two days off from what I get from SQL Server.

I already mentioned that Excel counts Feb 29, 1900. That’s one day. So why am I off by two? Here’s some more fun. In SQL Server, try the following:

SELECT CONVERT(DATETIME,0)

You get ‘1900-01-01 00:00:00.000′. But January 1st is supposed to be day one. Not day zero (must be a zero-based array sort of thing). When you try the same in Excel, you get a more appropriate (but equally odd) value of ‘January 0, 1900′. You’d think I’m dealing with scientific ephemeris or something.

SELECT CONVERT(DATETIME,1)

So, in SQL Server, we’re off by a day. The first day shouldn’t result in the second day (’1900-01-02 00:00:00.000′) of the year as the above TSQL shows.

This helps to explain the 2 day offset between Excel and SQL Server.

Come on Redmond! Work with me here!

To solve (read: work around) the problem, we’ve decided to stick with keeping Excel happy (against my recommendation, but at the end of the day, this is a business decision). Our data integration packages add two days to the serial date we calculate from a YYYYMMDD value. When data is moved to Excel from the data warehouse, no transformation is done on the serial value and Excel calculates as expected. When we use the serial date in a SQL Server context, we need to subtract 2 before the convert.

This is not a very good design and will undoubtedly cause someone problems some day.

I recommend storing the ISO YYYYMMDD format instead. But Excel won’t format a value such as 20080606 to a date. What a shame! If anyone knows of a way, or can enlighten me more about this subject — do not hesitate!

Published October 8th, 2007

Find Unmatched Keys from Two Data Sets

Here’s a real fast way to find the keys (natural or surrogate) that are missing from Set A or Set B. This query is similar to finding the symmetric difference between two tables, except that it only compares keys:

SELECT DISTINCT EVL(tableA.pkey, tableB.pkey) AS not_in ;
  FROM tableA ;
    FULL OUTER JOIN tableB ON tableA.pkey  = tableB.pkey ;
  WHERE tableA.pkey IS NULL OR tableB.pkey IS NULL

I used EVL in order to show a unique set of values that are missing from either TableA or TableB. To see exactly which keys are missing from each table, simply leave the EVL out and display both columns.

I tested this on two tables (A & B) with more than a million records using an integer surrogate key. The query took 8.6 seconds. A similar approach works in TSQL. Just use COALESCE instead of EVL and you’re off to the races!

Published September 24th, 2007

Delete Duplicates from a Table

The need to delete duplicate records from a table rises up all the time, whether you are importing data from a third party, integrating systems, or cleaning dirty data. Duplicate data is a reoccurring quality issue.

A duplicate records does not necessarily mean an exact copy of another record (as if you pressed CTRL+Y with SET CARRY ON in VFP). It could mean that all of the important identifying attributes are duplicated but other data, such as a surrogate key or a last updated timestamp are not. So here is a technique that I’ve used to delete a row if it is in fact a duplicate based on attributes in the table:

DELETE FROM [tablename] ;
  WHERE [primary key] NOT IN ;
    (SELECT MAX([primary key]) FROM [tablename] ;
      GROUP BY [attribute1], [attribute2], [attribute3])

To apply this to a Customer entity, for example, you can try the following (the first half of this code is setup, scroll down to see the delete):

*-- example table
CREATE TABLE Customer FREE   (;
    cust_key integer         ,;  && a surrogate key
    cust_id character(3)     ,;  && natural key
    cust_fname varchar(20)   ,;  && first name
    cust_lname varchar(20)   ,;  && last name
    cust_phone character(14) ,;  && phone number
    cust_lupd datetime       ,;  && last updated
    cust_user character(15)   )   && user last updated
 
*-- need data 
INSERT INTO Customer VALUES (1,"AM1","Andrew","MacNeill","555-5551",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (2,"AK1","Andy","Kramek","555-5552",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (3,"CC1","Cesar","Chalom","555-5553",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (4,"CB1","Craig","Baily","555-5554",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (5,"DC1","Dave","Crozier","555-5555",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (6,"DS1","David","Stevenson","555-5556",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (7,"GF1","Garrett","Fitzgerald","555-5557",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (8,"KR1","Kevin","Ragsdale","555-5558",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (9,"KR1","Kevin","Ragsdale","555-5558",DATETIME(),"ZorroMeansFox")
INSERT INTO Customer VALUES (10,"RS1","Rick","Schummer","555-5559",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (11,"RS2","Rick","Strahl","555-5510",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (12,"RS2","Rick","Strahl","555-5510",DATETIME(),"ZorroMeansFox")
INSERT INTO Customer VALUES (13,"RS2","Rick","Strahl","555-5510",DATETIME(),"FanakMeansFox")
INSERT INTO Customer VALUES (14,"SB1","Stephen","Bodnar","555-5511",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (15,"SD1","Stuart","Dunkeld","555-5512",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (16,"TR1","Ted","Roche","555-5513",DATETIME(),"TodMeansFox")
 
*-- delete duplicates (we don't care about the surrogate 
*-- key or the audit information).
DELETE FROM Customer ;
  WHERE cust_key NOT IN ;
    (SELECT MAX(cust_key) FROM Customer ;
      GROUP BY cust_id, cust_fname, cust_lname, cust_phone)
 
*-- results
SET DELETED OFF
BROWSE NORMAL

This works by selecting the most recent issuance of the primary key (using MAX) and discarding the others using the subquery. If you run the subquery on its own, you can see how SELECT MAX plus the GROUP BY clause gives you the most recent version of the row that is distinct from the others (’Rick’ and ‘Kevin’ only show once) by leaving out fields that don’t identify the row (like auditing information or perhaps de-identified attributes). Next, we delete everything that is not in this list.

This works in all ANSI SQL environments were subqueries are supported in the DELETE statement.

Published July 10th, 2007

Getting Started with Data Profiling

In my last post I said I would publish some code that you could use to get started profiling your FoxPro tables. I’ve included the following sample here, as well as in the Download Section for you to use. This example is written as a procedure, but you should consider converting this to objects. You can create patterns and really cut down on the amount of work you do. Although this code is generic, you may in fact need to write modified versions for certain tables in certain environments. OO techniques lend themselves to this nicely.

Anyway, here is how you would call the function listed below:

LOCAL lcSkipList, lcDirectory, lcTableName,;
     lcResultAlias  AS String 
 
lcDirectory = "C:\\Work\\tables_to_profile\\"
lcTableName = 'some_table'
lcResultAlias = 'some_resultset'
 
*-- enter the columns, separated by comma, to skip
lcSkipList = '' 
 
*-- run the data profile
profile_table(lcTableName, lcDirectory, lcResultAlias, lcSkipList)

The following function, ‘profile_table’, is as follows (please note, you will need my get_median function from here):

FUNCTION profile_table
  LPARAMETERS tcTableName, tcDirectory, tcAltName, tcSkipList
 
  LOCAL nFldCount AS Integer
 
  tcAltName = EVL(tcAltName,'profiled')
  tcSkipList = EVL(tcSkipList,'')
  lcCursorName = 'db_' + tcAltName 
 
  TRY
    IF USED(tcAltName)
      USE IN (tcAltName)
    ENDIF
    USE (ADDBS(tcDirectory)+tcTableName) IN 0 ALIAS (tcAltName )
    nFldCount = AFIELDS(aColumns,tcAltName)
  CATCH
    nFldCount = -1
  ENDTRY
 
  IF nFldCount > 0
    IF USED(lcCursorName )
      USE IN (lcCursorName )
    ENDIF
 
    CREATE CURSOR (lcCursorName)(;
			table_name varchar(45)	,;
			fld_name   varchar(45)	,;
			fld_type   char(1)		,;
			fld_size   integer      ,;
			fld_prec   integer      ,;
			fld_null   logical      ,;
			range_low  varchar(60)	,;
			range_high varchar(60)	,;
			domain     memo         ,;
			uniq_cnt   integer      ,;
			empty_cnt  integer      ,;
			full_cnt   integer      ,;
			dele_cnt   integer      ,;
			null_cnt   integer      ,;
			most_freq  varchar(60)  ,;
			data_mean  numeric(20,4),;
			data_var   numeric(20,4),;
			data_std   numeric(20,4),;
			median     numeric(20,4),;
			location   varchar(200) )
 
    FOR n = 1 TO nFldCount 
 
        *-- afield properties
        m.table_name = tcTableName
        m.fld_name = aColumns[n,1]
        m.fld_type = aColumns[n,2]
        m.fld_size = aColumns[n,3]
        m.fld_prec = aColumns[n,4]
        m.fld_null = aColumns[n,5]
 
        ? tcTableName + ' Field ' + m.fld_name
        ? DATETIME()
 
        IF ALLTRIM(LOWER(m.fld_name)) $ ALLTRIM(LOWER(tcSkipList))
          ?? ' skipped'
          LOOP
        ENDIF
 
        m.location = tcDirectory
 
        lcSQL = "SELECT " + m.fld_name + " AS prof_fld FROM " + ;
                        tcAltName + " INTO CURSOR crTmpProf"
        &lcSQL 
 
        *-- note to self, need to test to see if using SQL-Select is more efficient
        SELECT crTmpProf
        CALCULATE MIN(prof_fld) TO m.range_low
        CALCULATE MAX(prof_fld) TO m.range_high
        CALCULATE CNT() FOR EMPTY(prof_fld) TO m.empty_cnt
        CALCULATE CNT() FOR !EMPTY(prof_fld) TO m.full_cnt
        CALCULATE CNT() FOR DELETED() TO m.dele_cnt
        CALCULATE CNT() FOR ISNULL(prof_fld) TO m.null_cnt 
 
        *-- numbers only!
        IF INLIST(m.fld_type,"N","I","Y","B","F")
          CALCULATE AVG(prof_fld) TO m.data_mean
          CALCULATE STD(prof_fld) TO m.data_std
          CALCULATE VAR(prof_fld) TO m.data_var
          m.median = get_median(2,"crTmpProf","prof_fld")
        ELSE
          STORE -1 TO m.data_mean , m.data_std , m.data_var, m.median
        ENDIF
 
        *-- distinct list and domain
        SELECT prof_fld, COUNT(prof_fld) as freq ;
            FROM crTmpProf GROUP BY prof_fld ;
            ORDER BY freq DESC INTO CURSOR crTmpDistProf
 
        SELECT crTmpDistProf
        LOCATE
        m.uniq_cnt = _TALLY
        m.most_freq = crTmpDistProf.prof_fld
        m.domain = ''
        SCAN
          m.domain = m.domain + TRANSFORM(crTmpDistProf.prof_fld) + CHR(13)
        ENDSCAN
 
        SELECT (lcCursorName)
        APPEND BLANK
        GATHER MEMVAR MEMO
 
        ? REPLICATE('+',12)
 
    NEXT
  ENDIF
ENDFUNC

So there you have it. Now you have a table that lists each attribute, with additional, highly useful information. You can use this metadata to conduct index analysis, check for misspellings, verify that lookup codes exist in external tables, verify that numerical data hasn’t deviated too far, etc. This type of metadata is essential to have before you begin any quality or integration project!

Data profiling is a hefty subject — one in which I promise to address again in the near future.