<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.2" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Delete Duplicates from a Table</title>
	<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/</link>
	<description>Business Intelligence, Data Warehousing, SQL, Visual FoxPro.</description>
	<pubDate>Thu, 20 Nov 2008 23:12:36 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2</generator>

	<item>
		<title>By: Tod McKenna</title>
		<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-161</link>
		<author>Tod McKenna</author>
		<pubDate>Fri, 28 Sep 2007 22:28:32 +0000</pubDate>
		<guid>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-161</guid>
		<description>:-) I knew what you meant!


And wOOdy, that's a great tip!</description>
		<content:encoded><![CDATA[<p> <img src='http://blog.todmeansfox.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> I knew what you meant!</p>
<p>And wOOdy, that&#8217;s a great tip!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian Marquis</title>
		<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-160</link>
		<author>Brian Marquis</author>
		<pubDate>Fri, 28 Sep 2007 14:50:05 +0000</pubDate>
		<guid>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-160</guid>
		<description>I just reread my post and noticed that I listed the Customer table as the target of the update. It should be Orders:

UPDATE Orders ao ;
SET ao.cust_key = ;
(SELECT MAX(ac.cust_key) FROM Customer ac ;
WHERE ac.cust_id = ao.cust_id ;
GROUP BY ac.cust_id, ac.cust_fname, ac.cust_lname, ac.cust_phone)</description>
		<content:encoded><![CDATA[<p>I just reread my post and noticed that I listed the Customer table as the target of the update. It should be Orders:</p>
<p>UPDATE Orders ao ;<br />
SET ao.cust_key = ;<br />
(SELECT MAX(ac.cust_key) FROM Customer ac ;<br />
WHERE ac.cust_id = ao.cust_id ;<br />
GROUP BY ac.cust_id, ac.cust_fname, ac.cust_lname, ac.cust_phone)</p>
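<p>A minimal sketch of the same re-pointing idea, written in Python with the standard sqlite3 module rather than Visual FoxPro. It assumes, as the query above implies, that Orders carries both cust_id and cust_key. The order_id column, the phone numbers, and the helper names build_demo and repoint_orders are made up for illustration.</p>

```python
import sqlite3


def build_demo():
    """Create an in-memory database with duplicate customers and their orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customer (cust_key INTEGER PRIMARY KEY, cust_id INTEGER,
                               cust_fname TEXT, cust_lname TEXT, cust_phone TEXT);
        CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER,
                             cust_key INTEGER);
        INSERT INTO Customer VALUES
            (1, 100, 'Kevin', 'Ragsdale', '555-0100'),
            (2, 100, 'Kevin', 'Ragsdale', '555-0100'),
            (3, 200, 'Rick',  'Strahl',   '555-0200');
        INSERT INTO Orders VALUES (10, 100, 1), (11, 100, 2), (12, 200, 3);
    """)
    return conn


def repoint_orders(conn):
    """Point every order at the highest cust_key that shares its cust_id."""
    conn.execute("""
        UPDATE Orders
        SET cust_key = (SELECT MAX(c.cust_key)
                        FROM Customer AS c
                        WHERE c.cust_id = Orders.cust_id)
        -- Skip orders with no matching customer so cust_key is never set to NULL.
        WHERE EXISTS (SELECT 1 FROM Customer AS c
                      WHERE c.cust_id = Orders.cust_id)
    """)
    conn.commit()


conn = build_demo()
repoint_orders(conn)
rows = conn.execute(
    "SELECT order_id, cust_key FROM Orders ORDER BY order_id").fetchall()
print(rows)
```

<p>After this runs, both of Kevin's orders reference cust_key 2, so the row with cust_key 1 could be deleted without orphaning any orders.</p>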
]]></content:encoded>
	</item>
	<item>
		<title>By: wOOdy</title>
		<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-159</link>
		<author>wOOdy</author>
		<pubDate>Thu, 27 Sep 2007 13:35:42 +0000</pubDate>
		<guid>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-159</guid>
		<description>Oops, this blog eats code... The INDEX line should read:

INDEX ON [expression for duplicate fields] UNIQUE TO temp.idx

Thus, based on your sample code, it would be:
INDEX ON STR(cust_id) + cust_fname + cust_lname + cust_phone UNIQUE TO temp</description>
		<content:encoded><![CDATA[<p>Oops, this blog eats code&#8230; The INDEX line should read:</p>
<p>INDEX ON [expression for duplicate fields] UNIQUE TO temp.idx</p>
<p>Thus, based on your sample code, it would be:<br />
INDEX ON STR(cust_id) + cust_fname + cust_lname + cust_phone UNIQUE TO temp</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: wOOdy</title>
		<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-158</link>
		<author>wOOdy</author>
		<pubDate>Thu, 27 Sep 2007 13:32:07 +0000</pubDate>
		<guid>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-158</guid>
		<description>A similar technique for the old-school xBasers:

SET DELETED OFF
USE Customers
DELETE ALL
INDEX ON   UNIQUE TO _Temp 
RECALL ALL
SET ORDER TO
SET DELETED ON

The trick here is the use of a UNIQUE (xBase-style, not SQL) index type, which adds only the very first occurrence of each key to the index. This way the FIRST entry survives (which is usually the correct one) and subsequent entries are ignored.
Based on that index, RECALL ALL reactivates only the records listed in the index. After that, the UNIQUE index can be deleted.</description>
		<content:encoded><![CDATA[<p>A similar technique for the old-school xBasers:</p>
<p>SET DELETED OFF<br />
USE Customers<br />
DELETE ALL<br />
INDEX ON   UNIQUE TO _Temp<br />
RECALL ALL<br />
SET ORDER TO<br />
SET DELETED ON</p>
<p>The trick here is the use of a UNIQUE (xBase-style, not SQL) index type, which adds only the very first occurrence of each key to the index. This way the FIRST entry survives (which is usually the correct one) and subsequent entries are ignored.<br />
Based on that index, RECALL ALL reactivates only the records listed in the index. After that, the UNIQUE index can be deleted.</p>
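<p>For readers without an xBase environment, here is a rough Python emulation of the three steps (DELETE ALL, build a unique index, RECALL ALL). It is only a sketch of the logic, not FoxPro code: the function name dedupe_keep_first and the sample rows are invented for illustration, and a dictionary stands in for the UNIQUE index.</p>

```python
def dedupe_keep_first(records, key):
    """Emulate DELETE ALL / INDEX ... UNIQUE / RECALL ALL on a list of records."""
    # DELETE ALL: start with every record marked as deleted.
    deleted = [True] * len(records)

    # INDEX ON <key> UNIQUE: only the first record number seen for each
    # key value gets an entry; later duplicates are ignored.
    unique_index = {}
    for recno, record in enumerate(records):
        unique_index.setdefault(key(record), recno)

    # RECALL ALL with the unique index active: only indexed records come back.
    for recno in unique_index.values():
        deleted[recno] = False

    return [rec for rec, is_deleted in zip(records, deleted) if not is_deleted]


# Made-up sample rows: (cust_key, cust_id, cust_fname, cust_lname, cust_phone)
customers = [
    (1, 100, "Kevin", "Ragsdale", "555-0100"),
    (2, 100, "Kevin", "Ragsdale", "555-0100"),
    (3, 200, "Rick", "Strahl", "555-0200"),
    (4, 200, "Rick", "Strahl", "555-0200"),
]

# The key mirrors STR(cust_id) + cust_fname + cust_lname + cust_phone.
survivors = dedupe_keep_first(customers, key=lambda r: (r[1], r[2], r[3], r[4]))
print(survivors)
```

<p>Using setdefault is what gives the "first entry wins" behaviour: once a key is in the dictionary, later records with the same key never replace it.</p>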
]]></content:encoded>
	</item>
	<item>
		<title>By: Tod McKenna</title>
		<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-150</link>
		<author>Tod McKenna</author>
		<pubDate>Mon, 24 Sep 2007 21:07:37 +0000</pubDate>
		<guid>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-150</guid>
		<description>Brian, that's a great question -- and at first glance, I think the example is right on. By deleting records from Customer, we're either creating orphans or we're (maybe) cascading deletes to all children in the database.

If a table with duplicates has children, Brian's code should be run first to re-assign all keys to the most current (the MAX) record.

Thanks for the input!</description>
		<content:encoded><![CDATA[<p>Brian, that&#8217;s a great question &#8212; and at first glance, I think the example is right on. By deleting records from Customer, we&#8217;re either creating orphans or we&#8217;re (maybe) cascading deletes to all children in the database.</p>
<p>If a table with duplicates has children, Brian&#8217;s code should be run first to re-assign all keys to the most current (the MAX) record.</p>
<p>Thanks for the input!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tod McKenna</title>
		<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-149</link>
		<author>Tod McKenna</author>
		<pubDate>Mon, 24 Sep 2007 21:03:42 +0000</pubDate>
		<guid>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-149</guid>
		<description>Hi apaustria, the above code finds the rows that are duplicated:

For example:

select distinct (cust_fname+cust_lname) as x from Customer where (cust_fname+cust_lname) in ( select (cust_fname+cust_lname) as x from Customer  group by x having count(*) &gt; 1)

Returns two records:
KevinRagsdale
RickStrahl

But it doesn't tell me which one I can delete, unless, of course, I can identify an attribute that is consistently different. If I change the above SELECT to a DELETE, every occurrence of KevinRagsdale and RickStrahl will be deleted.

Actually, come to think of it, the method outlined in this article is best suited for cases when there is a small difference in the record. I see this sometimes with a timestamp or surrogate key. If rows are identical in every way, then it doesn't matter. The trick is to identify which attribute is the "anchor" and use that in the WHERE clause.

Another way around this: you could retrofit a primary key by using RECNO or some other method on the Customer entity:

ALTER TABLE Customer ADD COLUMN retrofit integer
REPLACE ALL retrofit WITH RECNO() IN Customer 

And then simply use the new attribute "retrofit" in the WHERE and MAX clauses.</description>
		<content:encoded><![CDATA[<p>Hi apaustria, the above code finds the rows that are duplicated:</p>
<p>For example:</p>
<p>select distinct (cust_fname+cust_lname) as x from Customer where (cust_fname+cust_lname) in ( select (cust_fname+cust_lname) as x from Customer  group by x having count(*) > 1)</p>
<p>Returns two records:<br />
KevinRagsdale<br />
RickStrahl</p>
<p>But it doesn&#8217;t tell me which one I can delete, unless, of course, I can identify an attribute that is consistently different. If I change the above SELECT to a DELETE, every occurrence of KevinRagsdale and RickStrahl will be deleted.</p>
<p>Actually, come to think of it, the method outlined in this article is best suited for cases when there is a small difference in the record. I see this sometimes with a timestamp or surrogate key. If rows are identical in every way, then it doesn&#8217;t matter. The trick is to identify which attribute is the &#8220;anchor&#8221; and use that in the WHERE clause.</p>
<p>Another way around this: you could retrofit a primary key by using RECNO or some other method on the Customer entity:</p>
<p>ALTER TABLE Customer ADD COLUMN retrofit integer<br />
REPLACE ALL retrofit WITH RECNO() IN Customer </p>
<p>And then simply use the new attribute &#8220;retrofit&#8221; in the WHERE and MAX clauses.</p>
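<p>A hedged sketch of the retrofit idea in Python with sqlite3, where SQLite's built-in rowid stands in for RECNO(). The article's exact DELETE statement is not reproduced in this thread, so the query below is an assumption about how "retrofit" might be used with MAX. The sample rows and the helper names build_customers and delete_duplicates are made up.</p>

```python
import sqlite3


def build_customers():
    """Create a Customer table without a primary key, containing exact duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customer (cust_id INTEGER, cust_fname TEXT, cust_lname TEXT);
        INSERT INTO Customer VALUES
            (100, 'Kevin', 'Ragsdale'),
            (100, 'Kevin', 'Ragsdale'),
            (200, 'Rick',  'Strahl'),
            (200, 'Rick',  'Strahl'),
            (300, 'Tod',   'McKenna');
    """)
    return conn


def delete_duplicates(conn):
    """Retrofit a surrogate key, then keep only the highest key in each group."""
    # ALTER TABLE ... ADD COLUMN plus REPLACE ALL retrofit WITH RECNO():
    # here rowid plays the role of the record number.
    conn.execute("ALTER TABLE Customer ADD COLUMN retrofit INTEGER")
    conn.execute("UPDATE Customer SET retrofit = rowid")
    # Delete every row that is not the MAX(retrofit) of its name group.
    conn.execute("""
        DELETE FROM Customer
        WHERE retrofit NOT IN (SELECT MAX(retrofit)
                               FROM Customer
                               GROUP BY cust_fname, cust_lname)
    """)
    conn.commit()


conn = build_customers()
delete_duplicates(conn)
remaining = conn.execute(
    "SELECT retrofit, cust_fname, cust_lname FROM Customer ORDER BY retrofit"
).fetchall()
print(remaining)
```

<p>Because the retrofitted column is different on every row, it can serve as the anchor even when the duplicate rows are otherwise identical.</p>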
]]></content:encoded>
	</item>
	<item>
		<title>By: apaustria</title>
		<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-148</link>
		<author>apaustria</author>
		<pubDate>Mon, 24 Sep 2007 17:01:47 +0000</pubDate>
		<guid>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-148</guid>
		<description>If I don't have a PK, then my subquery should work like this, right?

select distinct([attribute1+attribute2]) as x from [tablename] where ([attribute1+attribute2]) in ( select ([attribute1+attribute2]) as x from [tablename] group by x having count(*) &#62; 1)</description>
		<content:encoded><![CDATA[<p>If I don&#8217;t have a PK, then my subquery should work like this, right?</p>
<p>select distinct([attribute1+attribute2]) as x from [tablename] where ([attribute1+attribute2]) in ( select ([attribute1+attribute2]) as x from [tablename] group by x having count(*) &gt; 1)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian Marquis</title>
		<link>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-147</link>
		<author>Brian Marquis</author>
		<pubDate>Mon, 24 Sep 2007 16:15:54 +0000</pubDate>
		<guid>http://blog.todmeansfox.com/2007/09/24/delete-duplicates-from-a-table/#comment-147</guid>
		<description>Very nice.

How do you handle child records linked on cust_key? Perhaps with a similar update query? Untested code follows:

UPDATE Customer ac ;
  SET cust_key = ;
    (SELECT MAX(cust_key) FROM Customer bc;
      WHERE bc.cust_id = ac.cust_id ; 
      GROUP BY cust_id, cust_fname, cust_lname, cust_phone)</description>
		<content:encoded><![CDATA[<p>Very nice.</p>
<p>How do you handle child records linked on cust_key? Perhaps with a similar update query? Untested code follows:</p>
<p>UPDATE Customer ac ;<br />
  SET cust_key = ;<br />
    (SELECT MAX(cust_key) FROM Customer bc;<br />
      WHERE bc.cust_id = ac.cust_id ;<br />
      GROUP BY cust_id, cust_fname, cust_lname, cust_phone)</p>
]]></content:encoded>
	</item>
</channel>
</rss>
