Delete Duplicates from a Table
The need to delete duplicate records from a table rises up all the time, whether you are importing data from a third party, integrating systems, or cleaning dirty data. Duplicate data is a reoccurring quality issue.
A duplicate records does not necessarily mean an exact copy of another record (as if you pressed CTRL+Y with SET CARRY ON in VFP). It could mean that all of the important identifying attributes are duplicated but other data, such as a surrogate key or a last updated timestamp are not. So here is a technique that I’ve used to delete a row if it is in fact a duplicate based on attributes in the table:
DELETE FROM [tablename] ; WHERE [primary key] NOT IN ; (SELECT MAX([primary key]) FROM [tablename] ; GROUP BY [attribute1], [attribute2], [attribute3])
To apply this to a Customer entity, for example, you can try the following (the first half of this code is setup, scroll down to see the delete):
*-- example table CREATE TABLE Customer FREE (; cust_key integer ,; && a surrogate key cust_id character(3) ,; && natural key cust_fname varchar(20) ,; && first name cust_lname varchar(20) ,; && last name cust_phone character(14) ,; && phone number cust_lupd datetime ,; && last updated cust_user character(15) ) && user last updated *-- need data INSERT INTO Customer VALUES (1,"AM1","Andrew","MacNeill","555-5551",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (2,"AK1","Andy","Kramek","555-5552",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (3,"CC1","Cesar","Chalom","555-5553",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (4,"CB1","Craig","Baily","555-5554",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (5,"DC1","Dave","Crozier","555-5555",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (6,"DS1","David","Stevenson","555-5556",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (7,"GF1","Garrett","Fitzgerald","555-5557",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (8,"KR1","Kevin","Ragsdale","555-5558",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (9,"KR1","Kevin","Ragsdale","555-5558",DATETIME(),"ZorroMeansFox") INSERT INTO Customer VALUES (10,"RS1","Rick","Schummer","555-5559",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (11,"RS2","Rick","Strahl","555-5510",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (12,"RS2","Rick","Strahl","555-5510",DATETIME(),"ZorroMeansFox") INSERT INTO Customer VALUES (13,"RS2","Rick","Strahl","555-5510",DATETIME(),"FanakMeansFox") INSERT INTO Customer VALUES (14,"SB1","Stephen","Bodnar","555-5511",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (15,"SD1","Stuart","Dunkeld","555-5512",DATETIME(),"TodMeansFox") INSERT INTO Customer VALUES (16,"TR1","Ted","Roche","555-5513",DATETIME(),"TodMeansFox") *-- delete duplicates (we don't care about the surrogate *-- key or the audit information). DELETE FROM Customer ; WHERE cust_key NOT IN ; (SELECT MAX(cust_key) FROM Customer ; GROUP BY cust_id, cust_fname, cust_lname, cust_phone) *-- results SET DELETED OFF BROWSE NORMAL
This works by selecting the most recent issuance of the primary key (using MAX) and discarding the others using the subquery. If you run the subquery on its own, you can see how SELECT MAX plus the GROUP BY clause gives you the most recent version of the row that is distinct from the others (’Rick’ and ‘Kevin’ only show once) by leaving out fields that don’t identify the row (like auditing information or perhaps de-identified attributes). Next, we delete everything that is not in this list.
This works in all ANSI SQL environments were subqueries are supported in the DELETE statement.
I'm a Quant Technical Specialist (Data Warehousing and Business Intelligence), with expertise in business analysis, data modeling, and data integration. I have extensive experience developing vertical and integrated desktop, Internet, and BI applications spanning municipal, clinical, and financial industries.

September 24th, 2007 at 11:15 am
Very nice.
How do you handle child records linked on cust_key, perhaps a similar update query? Untested code follows:
UPDATE Customer ac ;
SET cust_key = ;
(SELECT MAX(cust_key) FROM Customer bc;
WHERE bc.cust_id = ac.cust_id ;
GROUP BY cust_id, cust_fname, cust_lname, cust_phone)
September 24th, 2007 at 12:01 pm
If I don’t have a PK then my subquery should work by having this right?
select distinct([attribute1+attribute2]) as x from [tablename] where ([attribute1+attribute2]) in ( select ([attribute1+attribute2]) as x from [tablename] group by x having count(*) > 1)
September 24th, 2007 at 4:03 pm
Hi apaustria, the above code finds the rows that are duplicate:
For example:
select distinct (cust_fname+cust_lname) as x from Customer where (cust_fname+cust_lname) in ( select (cust_fname+cust_lname) as x from Customer group by x having count(*) > 1)
Returns two records:
KevinRagsdale
RickStrahl
But doesn’t tell me which one I can delete. Unless, of course I can identify an attribute that is consistently different. If I change the above select to a DELETE, every occurrence of KevinRagsdale and RickStrahl will be deleted.
Actually, come to think of it, the method outlined in this article is best suited for cases when there is a small difference in the record. I see this sometimes with a timestamp or surrogate key. If rows are identical in every way, then it doesn’t matter. The trick is to identify which attribute is the “anchor” and use that in the WHERE clause.
Another way around this: you could retrofit a primary key by using RECNO or some other method on the Customer entity:
ALTER TABLE Customer ADD COLUMN retrofit integer
REPLACE ALL retrofit WITH RECNO() IN Customer
And then simply use new attribute “retrofit” in the WHERE and MAX clauses.
September 24th, 2007 at 4:07 pm
Brian, that’s a great question — and at first glance, I think the example is right on. By deleting records from Customer, we’re either creating orphans or we’re (maybe) cascading deletes to all children in the database.
If a table with duplicates has children, Brian’s code should be run first to re-assign all keys to the most current (the MAX) record.
Thanks for the input!
September 27th, 2007 at 8:32 am
A similar technique for the old scholl xBasers:
SET DELETED OFF
USE Customers
DELETE ALL
INDEX ON UNIQUE TO _Temp
RECALL ALL
SET ORDER TO
SET DELETED ON
The trick here is the use of an UNIQUE (xBase style, not SQL) indextype, which always only adds the very first occurence into the index. This way the FIRST entry survives (which is usually the correct one) and subsequent entries are ignored.
Based on that index the RECALL activates only those records listed in the index. After that, this UNIQUE-index can get deleted.
September 27th, 2007 at 8:35 am
Upps, that blog eats code… The INDEX line should read:
INDEX ON [expression for doublicate fields] UNIQUE TO temp.idx
Thus, based on your sample code it would be:
INDEX ON STR(cust_id) + cust_fname+ cust_lname+ cust_phone UNIQUE TO temp
September 28th, 2007 at 9:50 am
I just reread my post, I listed the customer table as the target of the update.
UPDATE Orders ao ;
SET ao.cust_key = ;
(SELECT MAX(ac.cust_key) FROM Customer ac ;
WHERE ac.cust_id = ao.cust_id ;
GROUP BY ac.cust_id, ac.cust_fname, ac.cust_lname, ac.cust_phone)
September 28th, 2007 at 5:28 pm
And wOOdy, that’s a great tip!