- You shall compile and document all requirements and mappings; segregate the work by business process. You may have more than one of these business processes, some of which may come before others.
- Do not begin without first conducting a thorough data profile; otherwise, you will be punished for your inequities, as will the generations that come after you.
- Do not think commandments one or two are in vain, lest you will become overrun by the dead line, scope creepers, and a great exodus of people from your tribe; if this happens to you, do not swear or curse, for you have been warned.
- Remember that latency and timeliness are equal in importance to non-volatility and having a traceable lineage; a staging area may lead you to this promised land.
- Honor the rules of data conformance.
- Do not kill dirty data: you shall clean them, or take them back to their sources for retribution.
- Do not commit the worst data integration transgression of all and ignore data quality, your ignorance will not be forgiven.
- Do not be shy about stealing your neighbor’s work, for his trials have led to best practices that you can make equally good use of.
- Do not rely solely on business keys; surrogates are your friend and will permit you to engage in slowly changing your dimensions.
- You shall covet a proper audit and log system; for on the day of judgment, you will need proof of your compliance.
Posts Tagged Cleansing
This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
The process of de-duping goes well beyond removing or identifying pure duplicates during data integration. This process actually seeks to remove redundant, misspelled, or otherwise ‘almost matches’ from the data stream, selecting the most appropriate version for the warehouse.
Generally, the practice involves an input record and a reference table, but could also span several databases and require access to web services and 3rd-party software. The reference table is usually a dimension (such as Product, Customer, Employee, Address, etc.) or a staging table that contains the relevant attributes to perform the lookup. If a staging table is used (a must for large reference sets), then it must be maintained as part of dimension processing. For more information about reference tables and master data, try searching for “Master Data Management” on Google. If I feel up to the challenge, I’ll post more about MDM in a future series of articles.
It is common to have duplicate data in an organization, especially if that organization has multiple systems for managing customers, products, orders, inventory, etc. Consider a small business with in-house software to track customers and a separate database Online. The same customer could be in both systems with different account numbers and a slightly different name (“Martha Jones” vs. “Martha A. Jones”). As part of data warehousing, we need to eliminate these cases as best we can. That is, of course if it is important for your data warehouse to do so.
If an organization does not need to eliminate these duplicates from their data warehouse, then they should proceed no further. However, in order to do any form of historical analysis, you need data that you can follow uninterrupted back in time and across multiple sources. This subsystem helps you do that.
Fuzzy Matching
Fuzzy matching is the process of determining if two strings are approximately equal (ideally returning some form of confidence score). Algorithms such as Levenshtein Distance and SoundEx have been developed to provide this type of functionality.
If the duplication of information is minimal, or if the volume of data is manageable, then it should be a rather straight-forward exercise to write your own fuzzy matching solution using any of the known algorithms. But you should plan on spending some time on this endeavor. Also, as fuzzy matching is not an exact science, plan on some ongoing human intervention.
One thing is certain: The more atomic your data is, the easier your job of de-duping will be. It is considerably easier to look for duplicate address information when the street number, name, unit number, city, state, and zip are broken out into atomic parts!
SSIS and VFP do not have any built-in, native ability to de-dupe data. Organizations with severe duplication problems will likely need to invest in a 3rd-party cleansing solution. Regardless, here is what you could do using SSIS and hand-coded in VFP:
SQL Server 2005 Integration Services (SSIS)
SSIS provides three very good Data Flow tasks for de-duplication:
- Lookup Transformation:
- Performs a lookup against a reference table. The lookup uses an equi-join on one or more columns (composite join) and is case sensitive. Use this task for exact matching to a reference set.
- Fuzzy Lookup
- This transformation uses fuzzy matching (pattern-based) against a reference table in a SQL Server 2005 database. You must supply the reference table, which is typically built from your existing data. You can customize the fuzzy lookup by specifying the maximum number of matches to return per input row, token delimiters, and similarity thresholds. Each match includes a similarity score (how similar the input and reference values are) and a confidence score (the likelihood that a match has been found).
- Fuzzy Grouping
- The Fuzzy Grouping identifies rows of data that are likely to be duplicates. Then it selects a row that best represents the group and marks that row with a similarity score of 1. The other rows in the group are given a score ranging from 0 to 1. The closer to 1, the better the match.
It is a best practice to first use a Lookup Transformation to find exact matches and then divert the unmatched data into a Fuzzy Lookup. The package will run faster (the Fuzzy Lookup is more expensive than the equi-join Lookup) and will be easier to maintain (you will clearly see what records are being diverted as non exact matches).
You should read more about how the Fuzzy Lookup and Fuzzy Groupings work in Books Online. They are both quite interesting. Also, they’re not too easy to get the hang of at first and have quite a few configuration options. Later, I will demonstrate these tasks in a future posting.
If you need a 3rd-Party tool, then integrating it into the Control Flow would be ideal. Otherwise, you will need to stage the dirty data and run the vendor’s product separately on that stage. You can then pick up the de-duped data in SSIS and continue processing. You may be able to use the Execute External Process task (runs a Win32 Executable or batch file) and put the entire operation into an SSIS Control Flow (which is the second-best solution).
Intelligent Search Technology has an integrated deduplication task for SSIS. I have not evaluated this task, but sounds pretty good. Read about it here.
Hand Coding with Visual FoxPro (VFP9)
FoxPro does not have any native or built-in solution for de-duping data. This isn’t odd, as most (if not all) programming and databases languages don’t either (I would love to know of a language that has some form of de-duping functionality). You can, however, build your own by using functions such as the following:
- Levenshtein Algorithm
- Check out the discussion on the Fox Wiki.
- LIKE/ LIKEC
- Can use wildcards to see if expression 1 is like expression 2. The LIKEC version is for double-byte characters. This function might be appropriate under certain circumstances for identifying data to de-dupe.
- SOUNDEX
- Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. More about SoundEx.
- Craig Boyd’s SoundsLike Function
- Check out Craig Boyd’s improvement for the SOUNDEX function on the VFP Wiki.
- MetaPhone and Double Metaphone
- Here are a couple of gems translated for us by Craig. The code is posted on the FoxPro Wiki
- The ‘Oliver’ Function
- I, like Tyler Akins at rumkin.com, first heard of this using PHP. You can see Tyler Akins’ implementation using FoxPro code here. Pretty cool!
Note: The code (zip format) for Craig’s work can be found here at Replacement Software. You can read more about this over at Sweet Potato Software, article “Spelling Checker and VFP” as well.
As mentioned above, it is best to conduct your searches using atomic data. If the source data is not currently tokenized (did I just make up a word?), you can handle this easily in FoxPro by splitting compound fields yourself. I wrote a couple posts last year on this very subject: “Parsing firstname / middlename / lastname” and “Parsing City/State/Zip“. You can then use the individual name and address parts to perform the search, increasing your likelihood of finding good matches. Also, it would be a good idea to convert all strings to upper or lower case (I prefer upper) as well.
From Here
Although SSIS and FoxPro do not offer built-in de-duplication algorithms for names, addresses, products, and the like, they both offer enough tools, tasks, and functions so that you can write your own. Plan on spending plenty of time on this endeavor, as de-duplication is not an exact science and will undoubtedly require some human intervention during the process.
Next post, I’ll discuss ETL Best Practice Subsystem 8: Data Conformance. Data conformity across dimensions and fact tables is one of the hallmarks of a fully integrated dimensional data warehouse.
Conforming data and building master data are two very important aspects of data warehousing, migrating legacy systems, and federating reports. A common task is to compare a string to a known format and return the matching value. For example, you may need to take parcel identifier information from multiple towns and do some smart parsing to obtain atomic map, blocks, and lots. Town A might store their Parcel ID in the format MMM-BBB/LLL, while another in the format MM/LL.BBB. You could store all known formats in a table, do some magic to convert all parts to a format code, and finally find the match in the formats table.
In VFP, there are many ways to compare the contents of a string to a particular format. TRANSFORM() and LIKE() come to mind. Using TRANSFORM, you can do something like the following:
lcTestVal = "SOMETEST_VAL" ? TRANSFORM(lcTestVal,"XXXXXXXX_XXX") == lcTestVal && returns .T. ? TRANSFORM(lcTestVal,"XXXXXXX_XX") == lcTestVal && returns .F.
LIKEwise, you can do some comparisons with LIKE:
lcTestVal = "SOMETEST_VAL" ? LIKE("????????_???",lcTestVal) && returns .T. ? LIKE("???????_??",lcTestVal) && returns .F.
But in some cases, it might be necessary to compare a single string against multiple formats, returning the format that the string matches (a common task when conforming data for data warehousing). There are a lot of ways to handle this, but simply using TRANSFORM or LIKE won’t do it alone.
Take the following example, where we have three different account number formats that come from three disparate source systems. You could manage this easily in FoxPro:
lcFormat1 = "NCCCCCN" lcFormat2 = "CCCNNNNNN" lcFormat3 = "NNNNNNNCCC" DIMENSION aTestValues [10] aTestValues[1] = "3UUFGP7" aTestValues[2] = "MCK000989" aTestValues[3] = "0090892LLG" aTestValues[4] = "9ABCD2" aTestValues[5] = "JKL230945" aTestValues[6] = "JKLM00989" aTestValues[7] = "JJ2000989" aTestValues[8] = "3U99G7" aTestValues[9] = "3UUFGP" aTestValues[10] = "7XYZA1" FOR x = 1 TO 10 lcTestValMasked = CHRTRAN(CHRTRAN(UPPER(aTestValues[x]),"ABCDEFGHIJKLMNOPQRSTUVWXYZ","CCCCCCCCCCCCCCCCCCCCCCCCCC"),"1234567890","NNNNNNNNNN") ? ICASE( lcTestValMasked == lcFormat1 , "String " + aTestValues[x] + " matches format 1",; lcTestValMasked == lcFormat2 , "String " + aTestValues[x] + " matches format 2",; lcTestValMasked == lcFormat3 , "String " + aTestValues[x] + " matches format 3",; "String " + aTestValues[x] + " does not match any format") NEXT
In the above example, there are three acceptable format codes that we’re using to compare against 10 values. Using ICASE, we can easily determine which codes match the format string. The inner CHRTRAN converts all character values to the letter “C”, while the outer CHRTRAN converts numbers to “N”. If we did the numbers first, our “N”s would become “C”s! Of course, you can use “X” and 9, “A” and “#”, or whatever letters you want to use as your mask.
Ideally, the format codes would belong to a table which would contain additional information about the code, such as where it came from and its status (active or inactive). Then, a loop through the table, or a seek on the format column would point you to the corresponding record.
I would love to hear some additional solutions on this. If you have any, please comment!
This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Data Cleansing is a Data Quality process that is designed to catch and resolve data quality violations. Proper data cleansing will add tremendous value to the data, giving your business users plenty of reasons to trust it. Sometimes referred to as ‘scrubbing’, this subsystem is a must for all Data Warehousing projects.
Most quality issues will be discovered during data Profiling, which I discussed here. Other issues will be identified by business users who have had to deal with problem data in the past. Some cleansing operations will be performed because of Master Data initiatives (for example, converting customer codes to universal codes for comparisons).
Cleansing data can consist of tasks like the following:
- Parsing city, state, and zip from an address string
- Changing the case of a person’s last name to UPPER
- Finding the delivery point from a street address and zip code
- Stripping out unwanted characters from text
- Swapping NULLs for some other value suitable for the DWH
- Deriving a value from one or more fields to create a new field
- Concatenating two or more fields into one (for example, turning day, month, year fields into a date)
Obviously, the above list is not exhaustive. There are myriad ways to improve the quality of data during integration. The important thing to keep in mind is that this subsystem, more than any other, actually seeks to change the data. As quality is the driving factor, these changes should always be for the better and serve some business purpose. Any change to the data should add value and confidence.
Garbage in, Garbage out
Think of data cleansing as the process of cleaning up all that garbage that got in in the first place. OLTP systems (especially legacy systems that have been kicking around for a few decades) are notorious for doing some strange things with the data. I have encountered cases where fields were re-purposed (“hey we’re not using store_id2 anymore, so now we have some place to put the manager’s phone number!”), algorithms re-written (but historical data not recalculated or stamped with a version), and nulls represented as anything but NULL (99, -99, -1, “N/A”, etc…).
Not only that, but it’s likely that you have run into cases where users were able to free-form type into data entry fields. For example, a list of valid products might include “Widget A” and “Widget B”, but because a user can type what they want, you get “WedgetA” and “WDGT-B”. Of course, from an application architecture point of view, this is a terrible design problem. Anyway, as a data integrator, expect this and worse.
One man’s junk is another man’s treasure
But there is light. Many of these quality problems can be corrected outright through integration, while others can be addressed directly in the source OLTP system. Many home-grown or custom applications compensate for bad data in some way (lookup/mapping tables, loose database constraints, routines that go back and ‘fix’ the data, etc…). Through ETL, you will have to do similar work.
What might seem like worthless data is oftentimes salvageable. In fact, I have been involved in projects where large chunks of data were simply discarded because the business users felt it was unworkable. After some tinkering, it turned out to be far from the truth. Not only was some of the data useful, but we were able to take our ‘dirty data proof’ back to the application developers to get the application fixed (fixing DQ issues at the source is always preferable).
SQL Server 2005 Integration Services (SSIS)
SSIS offers a great set of tasks for cleansing data. If you add the OLEDB Command (for SQL code) and Script Component (for VB code) to the list below, then there really are no limits to what can be achieved. My only criticism is that using expressions in SSIS is quite difficult at first and takes some getting used to (see Integration Services Expression Language). The language and syntax is native to SSIS. On the plus side, the expression builder dialog box is helpful. Expressions are used throughout SSIS, but especially in the Derived Column task.
Typically, data cleansing will involve one or more of the following SSIS tasks:
- Lookup Transform – performs a lookup using a join to another table, adding columns from the lookup to the data flow
- Character Map – for string data, this transform applies a variety of string functions to the input column (changing case, for example)
- Copy/Map – create a copy of a column
- Data Conversion – convert data types from one to another
- Derived Column – use this task to create or replace existing data by concatenation, exploding, applying calculations, and doing comparisons
- Fuzzy Lookup – a component you can use for inexact matching
- Fuzzy Grouping – uses algorithms to match or group disparate data to a configurable degree of accuracy
In future posts, I will demonstrate how many of these tasks work (likely through a SQL Server 2008 lens).
Hand Coding with Visual FoxPro (VFP9)
I’ve done a ton of VFP-related data cleansing over the past decade. Of course, VFP doesn’t have any fancy squares that turn green when a cleansing operation succeeds (or red when it fails!), but the language itself will give you everything you need to start cleansing data with little effort. I started trying to count the number of different commands you could use for cleaning in VFP. I stopped when I hit 50. There are likely more than a 100. Commands like SUBSTRING, ALLTRIM, CHRTRAN, LEFT, RIGHT, OCCURS, AT, LOWER, PROPER, SOUNDEX, LIKE, TRANSFORM, etc. Consult with the Visual FoxPro documentation or resources such as The Visual FoxPro Wiki for more details regarding VFP’s language features. I know this is a bit of a cop-out on my part but hey, I could spend days trying to summarize the capabilities.
What I can do is share some data cleansing / VFP implementation experience with you. The Logical Data Map (LDM) is a mapping document (sometimes done in Excel or relational database if you prefer) that tells you all of the source-to-target mappings for the integration. In VFP, you can create a VFP table (or tables) and embed cleansing code snippets into memo fields. This code can execute as part of a metadata-driven cleansing engine. I demonstrate how this can work in my articles for Advisor media. You can see some examples in my FoxForward material as well (see: 2007 Fox Forward Presentation Logical Data Map Database).
From here
Both SSIS and VFP have great data cleansing capabilities; in both systems, data cleansing is easy. In VFP, you can expect a mature language and blazing speeds (FoxPro is excellent when working with string data). SSIS makes it very easy to derive columns and change data types. SSIS can also make use of multiple processors and has some good built-in logging functionality. In the end, I prefer VFP because it is much easier to work with FoxPro’s languange (as oposed to VB and the SSIS expression language). Also, in VFP I find it much easier to manage metadata and the LDM.
In the next article, I’ll discuss the Error Event Management subsystem and how VFP and SSIS can deal with that!
In my series, “Business Intelligence Through Web Analysis“, I talk about using web stats to build a web analytics solution in FoxPro. I have created a webstats table, and have populated it with some data from my access logs. I have a requirement to be able to sort my webstats table by IP address (among other things).
The format of an IP address doesn’t lend itself to proper indexing (it is a 32-bit numeric address written as four numbers (octets) separated by periods). To accommodate my requirement, my first thought was to break the IP address into 4 separate numeric fields when importing. This would be good normalization. After all, the rule is that atomic data is better than concatenated data. But rules are meant to be broken. And, more importantly, does it make sense to atomize an IP address? An IP address is not really 4 separate numbers, the entire address is meaningful, even though each piece holds some significance.
So, I decided to keep the IP address intact.
Still needing a good way to sort and group, I thought about converting the IP address to its binary representation. MySQL has a function, INET_ATON (Adress TO Number), that does exactly that. FoxPro’s CREATEBINARY function seems to do the trick at first. I can pass in the complete IP address like so:
? CREATEBINARY("128.0.0.0") ? CREATEBINARY("168.212.226.204")
But if you see these outputs, you’ll realize that they’re not great for indexing (which is the only reason I want to convert them). CREATEBINARY creates strings of varying length, which is not what I want for indexing.
Back to the drawing board. I could represent each octet as 0 and 1 like this:
lnOct = 122 lcBin = '' DO WHILE lnOct > 0 lcBin = STR(lnOct%2, 1) + lcBin lnOct = INT(lnOct/2) ENDDO ? PADL(lcBin,8,'0')
But this seems terribly inefficient (I admit, no benchmarking was done, but this code is overkill!).
With all this in mind, I think I have a viable solution. It doesn’t require a custom function, nor does it require storing data in separate fields. I know I can easily extract each octet of the IP address by using GETWORDNUM. And, knowing a little something about IP addresses, each octet has 2^8 (or 256, values 0 to 255) possible combinations. Therefore, an IP Address can be represented in a numeric form like so:
? VAL(GETWORDNUM(ip_address,1,'.')) * 256^3 + ; VAL(GETWORDNUM(ip_address,2,'.')) * 256^2 + ; VAL(GETWORDNUM(ip_address,3,'.')) * 256 + ; VAL(GETWORDNUM(ip_address,4,'.'))
So my index for webstats.dbf would be:
SELECT webstats INDEX ON BINTOC( ; VAL(GETWORDNUM(ip_address,1,'.')) * 256^3 + ; VAL(GETWORDNUM(ip_address,2,'.')) * 256^2 + ; VAL(GETWORDNUM(ip_address,3,'.')) * 256 + ; VAL(GETWORDNUM(ip_address,4,'.')); ,8) TAG IP
Your comments and feedback are welcome. Is there a better solution?
Parsing City/State/Zip
Mar 14
Many applications today still carry the city, state, and zip in a composite field. The problem with this type of concatenation is that you can’t easily query the data or use it for other things (such as building a report by state). There really is no benefit or good reason to keep them together. I have yet to hear a good ‘excuse’:
1.) This is how I got the data
2.) It makes it easier for the end-user to enter everything in a single field
3.) I don’t have to worry about combining them on labels
4.) (insert your bad excuse here)
To add value to these legacy composite fields, you can write a program to split them up.
Usually, the data has some form such as “city, state zip” where the state is almost certainly abbreviated. The following program can be used to parse out city, state, and zip from a concatenated field into atomic parts. Two assumption are made with this program (important ones, at that): that state will be 2 digits and that city names don’t begin with a number. Another excellent approach would be to grab a zipcode file from the post office (or other data supplier) so that you could compare each segment to determine not only that it IS a zip code or city or state abbreviation (or full state name for that matter) but that it actually exists! I’ll leave that part up to you! Here is my simple solution to get things atomic:
lcCity = "" lcState = "" lcZip = "" lcCSZ= "Marlborough, MA 01752" *-- lcCSZ= "North Providence RI 02988" *-- lcCSZ= "Boston MA" ? parse_csz(lcCSZ , @lcCity, @lcState , @lcZip ) ? "city: " + lcCity ? "state: " + lcState ? "zip: " + lcZip FUNCTION parse_csz LPARAMETERS tcFullString , rcCity , rcState , rcZip LOCAL lcFullString AS String LOCAL lnSegments AS Number STORE "" TO rcCity , rcState , rcZip && reference parameters *-- remove any commas and convert to spaces lcFullString = STRTRAN( ALLTRIM(tcFullString ) , "," , " " ) *-- get the number of segments lnSegments = GETWORDCOUNT( lcFullString ) *-- try to be smart and figure out what to do depending *-- on the number of segments found DO CASE CASE lnSegments >= 3 rcZip = GETWORDNUM( lcFullString, lnSegments) rcState = GETWORDNUM( lcFullString, lnSegments- 1) FOR x = 1 TO lnSegments - 2 rcCity = rcCity + " " + GETWORDNUM( lcFullString, x ) NEXT rcCity = ALLTRIM( rcCity ) CASE lnSegments = 2 IF ISDIGIT( GETWORDNUM(lcFullString, 2) ) rcZip = GETWORDNUM(lcFullString, 2) ELSE IF LEN( GETWORDNUM(lcFullString, 2) ) > 2 rcCity = GETWORDNUM(lcFullString, 1) + " " + ; GETWORDNUM(lcFullString, 2) ELSE rcState = GETWORDNUM(lcFullString, 2) rcCity = GETWORDNUM(lcFullString, 1) ENDIF ENDIF CASE lnSegments = 1 IF ISDIGIT( lcFullString ) rcZip = lcFullString ELSE IF LEN( lcFullString ) > 2 rcCity = lcFullString ELSE rcState = lcFullString ENDIF ENDIF ENDCASE *-- Return in a mailing format RETURN rcCity + " " + rcState + " " + rcZ *-- lcCSZ= "North Providence RI 02988" *-- lcCSZ= "Boston MA" ? parse_csz(lcCSZ , @lcCity, @lcState , @lcZip ) ? "city: " + lcCity ? "state: " + lcState ? "zip: " + lcZip FUNCTION parse_csz LPARAMETERS tcFullString , rcCity , rcState , rcZip LOCAL lcFullString AS String LOCAL lnSegments AS Number STORE "" TO rcCity , rcState , rcZip && reference parameters *-- remove any commas and convert to spaces lcFullString = STRTRAN( ALLTRIM(tcFullString ) , "," , " " ) *-- get the number of segments lnSegments = GETWORDCOUNT( lcFullString ) *-- try to be smart and figure out what to do depending *-- on the number of segments found DO CASE CASE lnSegments >= 3 rcZip = GETWORDNUM( lcFullString, lnSegments) rcState = GETWORDNUM( lcFullString, lnSegments- 1) FOR x = 1 TO lnSegments - 2 rcCity = rcCity + " " + GETWORDNUM( lcFullString, x ) NEXT rcCity = ALLTRIM( rcCity ) CASE lnSegments = 2 IF ISDIGIT( GETWORDNUM(lcFullString, 2) ) rcZip = GETWORDNUM(lcFullString, 2) ELSE IF LEN( GETWORDNUM(lcFullString, 2) ) > 2 rcCity = GETWORDNUM(lcFullString, 1) + " " + ; GETWORDNUM(lcFullString, 2) ELSE rcState = GETWORDNUM(lcFullString, 2) rcCity = GETWORDNUM(lcFullString, 1) ENDIF ENDIF CASE lnSegments = 1 IF ISDIGIT( lcFullString ) rcZip = lcFullString ELSE IF LEN( lcFullString ) > 2 rcCity = lcFullString ELSE rcState = lcFullString ENDIF ENDIF ENDCASE *-- Return in a mailing format RETURN rcCity + " " + rcState + " " + rcZ *-- lcCSZ= "Boston MA" ? parse_csz(lcCSZ , @lcCity, @lcState , @lcZip ) ? "city: " + lcCity ? "state: " + lcState ? "zip: " + lcZip FUNCTION parse_csz LPARAMETERS tcFullString , rcCity , rcState , rcZip LOCAL lcFullString AS String LOCAL lnSegments AS Number STORE "" TO rcCity , rcState , rcZip && reference parameters *-- remove any commas and convert to spaces lcFullString = STRTRAN( ALLTRIM(tcFullString ) , "," , " " ) *-- get the number of segments lnSegments = GETWORDCOUNT( lcFullString ) *-- try to be smart and figure out what to do depending *-- on the number of segments found DO CASE CASE lnSegments >= 3 rcZip = GETWORDNUM( lcFullString, lnSegments) rcState = GETWORDNUM( lcFullString, lnSegments- 1) FOR x = 1 TO lnSegments - 2 rcCity = rcCity + " " + GETWORDNUM( lcFullString, x ) NEXT rcCity = ALLTRIM( rcCity ) CASE lnSegments = 2 IF ISDIGIT( GETWORDNUM(lcFullString, 2) ) rcZip = GETWORDNUM(lcFullString, 2) ELSE IF LEN( GETWORDNUM(lcFullString, 2) ) > 2 rcCity = GETWORDNUM(lcFullString, 1) + " " + ; GETWORDNUM(lcFullString, 2) ELSE rcState = GETWORDNUM(lcFullString, 2) rcCity = GETWORDNUM(lcFullString, 1) ENDIF ENDIF CASE lnSegments = 1 IF ISDIGIT( lcFullString ) rcZip = lcFullString ELSE IF LEN( lcFullString ) > 2 rcCity = lcFullString ELSE rcState = lcFullString ENDIF ENDIF ENDCASE *-- Return in a mailing format RETURN rcCity + " " + rcState + " " + rcZ *-- Return in a mailing format RETURN rcCity + " " + rcState + " " + rcZip