Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for October, 2007


Published October 5th, 2007

Sys(2017) and Comparing Data

I’ve been showing how I use intersect and the symmetric difference to compare data in two datasets. What’s nice about these approaches is that they are extremly flexible. You can, for example, pick just a few columns to compare from each table. So if you had two similar tables, you could easily detect which primary key + select attributes are either in both sets or not in both sets, returning the result into a third table or cursor. I’ve used these techniques in the following scenarios:

  1. to build master lists from client tables
  2. test code changes (run a procedure once, save the results A, make a code change, save the results B, compare differences)
  3. for data integration projects, especially for Changed Data Capture (CDC or incremental update) operations
  4. to check the status of metadata
  5. to do some down-and-dirty comparisons between two or more tables

Sys(2017)

One limitation of the SQL approach that I’ve been demonstrating is that you need to use a GROUP BY clause, which insists that you identify each attribute used in the select list (true for VFP, T-SQL, and other ANSI compliant SQL-enabled languages). This can be a lot of work, especially if you have a handful of tables with dozens of attributes.

The solution? Use a CRC (Cyclical Redundancy Check):

SELECT p_key, crc ;
  FROM (;
    SELECT p_key, PADR(Sys(2017, '', 1, 1),10) as crc FROM CursorA;
    UNION ALL ;
    SELECT p_key, PADR(Sys(2017, '', 1, 1),10) as crc  FROM CursorB;
  ) AS tmp1 ;
  GROUP BY p_key, crc  ;
  HAVING COUNT(*) = 1

SYS(2017) will Return a checksum value based on each row in the table. The second parameter is an exclusion list of attributes that you do not want included in the algorithm. The third parameter is the seed value, ignored when using the CRC32 version of function. The last parameter allows you to set flags to include memos or to use the CRC32 algorithm. In my example above, I am using CRC32 without memo support.

The resulting table shows me the primary keys that intersect (when HAVING COUNT(*)=2) or do not intersect (when HAVING COUNT(*)=1).

I hope you find this useful!

PS: It’s worth noting that you should avoid SYS(2017) in FoxPro versions 8 and below. There is a memory leak (fixed in VFP9) that could crash your app!

Published October 1st, 2007

The “Not Intersect” or “Symmetric Difference” Between Two Tables

A few months back I wrote an entry on how to achieve an INTERSECT using VFP. That post was inspired by the INTERSECT command that exists in SQL Server 2005. In set theory, the intersect can be defined as follows: “x is an element of sets A and B if and only if x is an element of A and x is an element of B.”

But what about returning non-intersecting data from two sets? This would be like a union, but that we only want a list of unique rows within each table, that is, the difference. The difference between two sets could be defined as: “x is an element of set A or B when x is not an element of A and B.” In set theory, this is called a “symmetric difference” (like XOR for x-basers!).

The SQL to accomplish this is almost identical to the Intersect query. Here is the general form in VFP:

SELECT [key field], [attributes..] ;
  FROM (;
    SELECT [key field], [attributes..] FROM A ;
    UNION ALL ;
    SELECT [key field], [attributes..] FROM B ;
  ) AS tmp ;
  GROUP BY [key field], [attributes..] ;
  HAVING COUNT(*) = 1

The change is in the HAVING clause, where instead of checking to see if COUNT(*) = 2, we check to see if COUNT(*) = 1. This gives us the “Not Intersect” of two sets: perfect for determining how two tables are different! The result is a complete list of non-intersecting data.

Here’s an actual example in VFP (followed by an example using T-SQL):

*-- two sets
CREATE CURSOR crTempL (keyfld i)
CREATE CURSOR crTempR (keyfld i)
 
*-- test data
INSERT INTO crTempL VALUES(1)
INSERT INTO crTempL VALUES(3)
INSERT INTO crTempL VALUES(5)
INSERT INTO crTempL VALUES(7)
INSERT INTO crTempL VALUES(9)
 
INSERT INTO crTempR VALUES(3)
INSERT INTO crTempR VALUES(6)
INSERT INTO crTempR VALUES(9)
INSERT INTO crTempR VALUES(12)
INSERT INTO crTempR VALUES(15)
 
*-- the difference
SELECT keyfld ;
  FROM (;
    SELECT keyfld FROM crTempL;
    UNION ALL ;
    SELECT keyfld FROM crTempR;
  ) AS tmp1 ;
  GROUP BY keyfld  ;
  HAVING COUNT(*) = 1

Now, here is the equivalent in T-SQL.

-- two sets
CREATE TABLE #crTempL (keyfld INTEGER)
CREATE TABLE #crTempR (keyfld INTEGER)
 
-- test data
INSERT INTO #crTempL VALUES(1)
INSERT INTO #crTempL VALUES(3)
INSERT INTO #crTempL VALUES(5)
INSERT INTO #crTempL VALUES(7)
INSERT INTO #crTempL VALUES(9)
 
INSERT INTO #crTempR VALUES(3)
INSERT INTO #crTempR VALUES(6)
INSERT INTO #crTempR VALUES(9)
INSERT INTO #crTempR VALUES(12)
INSERT INTO #crTempR VALUES(15)
 
-- the query
SELECT keyfld 
  FROM (
    SELECT keyfld FROM #crTempL
    UNION ALL 
    SELECT keyfld FROM #crTempR
  ) AS tmp1 
  GROUP BY keyfld  
  HAVING COUNT(*) = 1

As you can see, the syntax is almost identical. Thank you ANSI SQL standards!

Can anyone come up with a better or more efficient way to find the Symmetric Difference between two tables in TSQL, VFP, or any other DB?