<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: ETL Subsystem 7: Removing Duplicates</title>
	<atom:link href="http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/</link>
	<description>Supporting decisions through sound data management</description>
	<lastBuildDate>Tue, 23 Aug 2011 03:10:04 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Tod means Fox &#124; 34 Subsystems of ETL Data Integration</title>
		<link>http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/comment-page-1/#comment-1220</link>
		<dc:creator>Tod means Fox &#124; 34 Subsystems of ETL Data Integration</dc:creator>
		<pubDate>Tue, 18 Mar 2008 14:03:31 +0000</pubDate>
		<guid isPermaLink="false">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/#comment-1220</guid>
		<description>[...] Removing Duplicates [...]</description>
		<content:encoded><![CDATA[<p>[...] Removing Duplicates [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tod McKenna</title>
		<link>http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/comment-page-1/#comment-1206</link>
		<dc:creator>Tod McKenna</dc:creator>
		<pubDate>Tue, 11 Mar 2008 11:50:46 +0000</pubDate>
		<guid isPermaLink="false">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/#comment-1206</guid>
		<description>Nice to hear from you, Garrett. The de-duping process is a rather complex one in general, with every solution having some significant downside. Soundex, for example, only works for names in the US (well, that is how I understand it anyway!).

I would guess that de-duping is one of the areas that takes ETL developers the longest to get right -- or close to right.</description>
		<content:encoded><![CDATA[<p>Nice to hear from you, Garrett. The de-duping process is a rather complex one in general, with every solution having some significant downside. Soundex, for example, only works for names in the US (well, that is how I understand it anyway!).</p>
<p>I would guess that de-duping is one of the areas that takes ETL developers the longest to get right &#8212; or close to right.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Garrett Fitzgerald</title>
		<link>http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/comment-page-1/#comment-1195</link>
		<dc:creator>Garrett Fitzgerald</dc:creator>
		<pubDate>Sun, 09 Mar 2008 07:15:51 +0000</pubDate>
		<guid isPermaLink="false">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/#comment-1195</guid>
		<description>Tod, when I worked in the direct mail industry, we had a program from PostalSoft/FirstLogic/Business Objects that did Match/Consolidate, Merge/Purge, or whatever other term for deduping that you want to come up with. :-) It had some pretty elaborate settings, and small tweaks could make great differences in the outcome. After using that, I&#039;m not terribly surprised that the functionality doesn&#039;t come natively. :-)</description>
		<content:encoded><![CDATA[<p>Tod, when I worked in the direct mail industry, we had a program from PostalSoft/FirstLogic/Business Objects that did Match/Consolidate, Merge/Purge, or whatever other term for deduping that you want to come up with. <img src='http://blog.todmeansfox.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> It had some pretty elaborate settings, and small tweaks could make great differences in the outcome. After using that, I&#8217;m not terribly surprised that the functionality doesn&#8217;t come natively. <img src='http://blog.todmeansfox.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tod McKenna</title>
		<link>http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/comment-page-1/#comment-1162</link>
		<dc:creator>Tod McKenna</dc:creator>
		<pubDate>Sat, 01 Mar 2008 07:36:03 +0000</pubDate>
		<guid isPermaLink="false">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/#comment-1162</guid>
		<description>Hi Josh,

Please let me know when it is ready!

Deduplication of names is particularly tricky. At least with addresses, you have a finite number of known &#039;good&#039; values. Lists can be obtained from the USPS, for example. Names, on the other hand, are too variable.</description>
		<content:encoded><![CDATA[<p>Hi Josh,</p>
<p>Please let me know when it is ready!</p>
<p>Deduplication of names is particularly tricky. At least with addresses, you have a finite number of known &#8216;good&#8217; values. Lists can be obtained from the USPS, for example. Names, on the other hand, are too variable.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Blog about Process Checkers &#187; Blog Archive &#187; Fast Friday links</title>
		<link>http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/comment-page-1/#comment-1160</link>
		<dc:creator>Blog about Process Checkers &#187; Blog Archive &#187; Fast Friday links</dc:creator>
		<pubDate>Fri, 29 Feb 2008 20:28:53 +0000</pubDate>
		<guid isPermaLink="false">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/#comment-1160</guid>
		<description>[...] http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/ The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the &#8230; [...]</description>
		<content:encoded><![CDATA[<p>[...] <a href="http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/" rel="nofollow">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/</a> The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the &#8230; [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: josh</title>
		<link>http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/comment-page-1/#comment-1159</link>
		<dc:creator>josh</dc:creator>
		<pubDate>Fri, 29 Feb 2008 20:27:43 +0000</pubDate>
		<guid isPermaLink="false">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/#comment-1159</guid>
		<description>Years ago I worked for Social Services. Our biggest problem was reports of people who needed assistance (OK, not our biggest problem, but it was mine!) where the spelling of names was questionable. Even clients (that&#039;s code for welfare folks) sometimes didn&#039;t know how to spell their names (due to either drug/alcohol abuse or brain damage...). Anyway,
I attended a conference in New York; its outcome was a long document, which I distilled into a Soundex-like replacement function specifically for matching names.
For instance, my last name is Assing, but it&#039;s pronounced Ausing; the nyiis function calls that a match. That is just one example.
I used to sell the function&#039;s source code, but am planning on making it available for free to anyone who wants it under GPL licensing.
-josh</description>
		<content:encoded><![CDATA[<p>Years ago I worked for Social Services. Our biggest problem was reports of people who needed assistance (OK, not our biggest problem, but it was mine!) where the spelling of names was questionable. Even clients (that&#8217;s code for welfare folks) sometimes didn&#8217;t know how to spell their names (due to either drug/alcohol abuse or brain damage&#8230;). Anyway,<br />
I attended a conference in New York; its outcome was a long document, which I distilled into a Soundex-like replacement function specifically for matching names.<br />
For instance, my last name is Assing, but it&#8217;s pronounced Ausing; the nyiis function calls that a match. That is just one example.<br />
I used to sell the function&#8217;s source code, but am planning on making it available for free to anyone who wants it under GPL licensing.<br />
-josh</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: I wanna talk about Plan Checkers &#187; Blog Archive &#187; Fast Friday links</title>
		<link>http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/comment-page-1/#comment-1158</link>
		<dc:creator>I wanna talk about Plan Checkers &#187; Blog Archive &#187; Fast Friday links</dc:creator>
		<pubDate>Fri, 29 Feb 2008 19:46:10 +0000</pubDate>
		<guid isPermaLink="false">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/#comment-1158</guid>
		<description>[...] http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/ You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well. As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just &#8230; [...]</description>
		<content:encoded><![CDATA[<p>[...] <a href="http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/" rel="nofollow">http://blog.todmeansfox.com/2008/02/29/etl-subsystem-7-removing-duplicates/</a> You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well. As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just &#8230; [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>

