Posts Tagged Set Theory

The “Not Intersect” or “Symmetric Difference” Between Two Tables

A few months back I wrote an entry on how to achieve an INTERSECT using VFP. That post was inspired by the INTERSECT command that exists in SQL Server 2005. In set theory, the intersect can be defined as follows: “x is an element of sets A and B if and only if x is an element of A and x is an element of B.”

But what about returning non-intersecting data from two sets? This would be like a union, but that we only want a list of unique rows within each table, that is, the difference. The difference between two sets could be defined as: “x is an element of set A or B when x is not an element of A and B.” In set theory, this is called a “symmetric difference” (like XOR for x-basers!).

The SQL to accomplish this is almost identical to the Intersect query. Here is the general form in VFP:

SELECT [key field], [attributes..] ;
  FROM (;
    SELECT [key field], [attributes..] FROM A ;
    UNION ALL ;
    SELECT [key field], [attributes..] FROM B ;
  ) AS tmp ;
  GROUP BY [key field], [attributes..] ;
  HAVING COUNT(*) = 1

The change is in the HAVING clause, where instead of checking to see if COUNT(*) = 2, we check to see if COUNT(*) = 1. This gives us the “Not Intersect” of two sets: perfect for determining how two tables are different! The result is a complete list of non-intersecting data.

Here’s an actual example in VFP (followed by an example using T-SQL):

*-- two sets
CREATE CURSOR crTempL (keyfld i)
CREATE CURSOR crTempR (keyfld i)

*-- test data
INSERT INTO crTempL VALUES(1)
INSERT INTO crTempL VALUES(3)
INSERT INTO crTempL VALUES(5)
INSERT INTO crTempL VALUES(7)
INSERT INTO crTempL VALUES(9)
 
INSERT INTO crTempR VALUES(3)
INSERT INTO crTempR VALUES(6)
INSERT INTO crTempR VALUES(9)
INSERT INTO crTempR VALUES(12)
INSERT INTO crTempR VALUES(15)
 
*-- the difference
SELECT keyfld ;
  FROM (;
    SELECT keyfld FROM crTempL;
    UNION ALL ;
    SELECT keyfld FROM crTempR;
  ) AS tmp1 ;
  GROUP BY keyfld  ;
  HAVING COUNT(*) =
*-- test data
INSERT INTO crTempL VALUES(1)
INSERT INTO crTempL VALUES(3)
INSERT INTO crTempL VALUES(5)
INSERT INTO crTempL VALUES(7)
INSERT INTO crTempL VALUES(9)
 
INSERT INTO crTempR VALUES(3)
INSERT INTO crTempR VALUES(6)
INSERT INTO crTempR VALUES(9)
INSERT INTO crTempR VALUES(12)
INSERT INTO crTempR VALUES(15)

*-- the difference
SELECT keyfld ;
  FROM (;
    SELECT keyfld FROM crTempL;
    UNION ALL ;
    SELECT keyfld FROM crTempR;
  ) AS tmp1 ;
  GROUP BY keyfld  ;
  HAVING COUNT(*) =
*-- the difference
SELECT keyfld ;
  FROM (;
    SELECT keyfld FROM crTempL;
    UNION ALL ;
    SELECT keyfld FROM crTempR;
  ) AS tmp1 ;
  GROUP BY keyfld  ;
  HAVING COUNT(*) = 1

Now, here is the equivalent in T-SQL.

-- two sets
CREATE TABLE #crTempL (keyfld integer)
CREATE TABLE #crTempR (keyfld integer)
 
-- test data
INSERT INTO #crTempL VALUES(1)
INSERT INTO #crTempL VALUES(3)
INSERT INTO #crTempL VALUES(5)
INSERT INTO #crTempL VALUES(7)
INSERT INTO #crTempL VALUES(9)
 
INSERT INTO #crTempR VALUES(3)
INSERT INTO #crTempR VALUES(6)
INSERT INTO #crTempR VALUES(9)
INSERT INTO #crTempR VALUES(12)
INSERT INTO #crTempR VALUES(15)
 
-- the query
SELECT keyfld 
  FROM (
    SELECT keyfld FROM #crTempL
    UNION ALL 
    SELECT keyfld FROM #crTempR
  ) AS tmp1 
  GROUP BY keyfld  
  HAVING COUNT(*) = 1

As you can see, the syntax is almost identical. Thank you ANSI SQL standards!

Can anyone come up with a better or more efficient way to find the Symmetric Difference between two tables in TSQL, VFP, or any other DB?

Tags: , , ,

1 Comment

VFP INTERSECTing SQL

For a long time, we have had UNION and UNION ALL to write SQL statements that act as an ‘OR’ operation between two SQL statements. The ORing grabs records from either the first or the second SQL. INTERSECT, though, acts as an AND, meaning that the results must exist in both sets. Unfortunately, there is no INTERSECT keyword in VFP for us to use.

I have found that the need for INTERSECT usually accompanies partitioned files, or when doing comparisons between two versions of files. For example, to find all the customers that have placed an order in 2004 and 2005, using two partitioned customer tables (partitioned by calendar years 2004 and 2005), you can use intersect set logic. In SQL Server 2005, the syntax is quite simple:

*-- In SQL Server 2005, you can:
SELECT cust_id, cust_name, cust_dob FROM sales_2004
INTERSECT
SELECT cust_id, cust_name, cust_dob FROM sales_2005

But in VFP9, and in prior versions of SQL Server, you have to do a little more work:

*-- IN VFP9, the following is equivalent
SELECT cust_id, cust_name, cust_dob ;
  FROM (;
    SELECT cust_id, cust_name, cust_dob FROM sales_2004 ;
    UNION ALL ;
    SELECT cust_id, cust_name, cust_dob FROM sales_2005 ;
  ) AS tmp1 ;
  GROUP BY cust_id, cust_name, cust_dob ;
  HAVING COUNT(*) = 2 ;
  INTO CURSOR crIntersect

Please note these examples both make use of DISTINCT (SQL Server 2005 performs an implicit DISTINCT with INTERSECT). This means that it can only compare distinct records. What if the same customer ordered 12 products in 2004 and 8 in 2005? The above INTERSECT example would bring in the customer just once for each year. Although I have never had a need to write such a query, I have a few ideas on how it could be done. If anyone is bored or needs to burn 15 minutes, give it a try!

UPDATE

I failed to mention how the above code is better or different than simply using an inner join (as a colleague suggested). I had to think about this for a minute but then I recalled why using an inner join for intersecting isn’t perfect: NULLS. Inner joins treat nulls as different, whereas the above syntax does not. To test this out for yourself, run the above SQL on two tables that have a column set to NULL (for example, a NULL DOB) for the same customer in both tables. You’ll notice that the inner join refuses to recognize that NULL = NULL.

Tags: , , ,

2 Comments

UNION oddity (or maybe it’s just me)

I had a situation come up today where I needed to add the contents of one cursor (cursorA) to another (cursorB), without adding any duplicates in the process. Both cursors had the exact same structure. CursorA was going to be closed (a little later), while cursorB would persist (for another few processes). Immediately I turned to the UNION clause. My strategy was to select * from both tables, union them together, and put the results back into cursorB like this:

SELECT * FROM cursorB ;
    UNION ;
SELECT * FROM cursorA ;
INTO CURSOR cursorB READWRITE

I was surprised to see the open dialog box pop up, basically asking me to locate cursorB! What happens here is that the INTO clause seems to strike first, essentially closing cursorB. I suppose I was a little surprised because I do stuff like this all the time:

SELECT * FROM cursorB WHERE .T. INTO CURSOR cursorB

I worked around this by modyfing my SQL to the following:

SELECT * FROM (SELECT * FROM cursorB) as tmp ;
    UNION ;
SELECT * FROM cursorA ;
INTO CURSOR cursorB READWRITE

I should mention that this only works in VFP9. VFP7 will complain that the syntax is wrong (I presume the same in VFP8 too). The only solution that I can think of in VFP7 would be in two-steps:

SELECT * FROM cursorB ;
    UNION ;
SELECT * FROM cursorA ;
INTO CURSOR cursorC
 
SELECT * FROM cursorC INTO cursorB READWRITE

Tags: , ,

2 Comments