Posts Tagged Data Profiling

10 Commandments of Data Integration

  1. You shall compile and document all requirements and mappings; segregate the work by business process. You may have more than one of these business processes, some of which may come before others.
  2. Do not begin without first conducting a thorough data profile; otherwise, you will be punished for your inequities, as will the generations that come after you.
  3. Do not think commandments one or two are in vain, lest you will become overrun by the dead line, scope creepers, and a great exodus of people from your tribe; if this happens to you, do not swear or curse, for you have been warned.
  4. Remember that latency and timeliness are equal in importance to non-volatility and having a traceable lineage; a staging area may lead you to this promised land.
  5. Honor the rules of data conformance.
  6. Do not kill dirty data: you shall clean them, or take them back to their sources for retribution.
  7. Do not commit the worst data integration transgression of all and ignore data quality, your ignorance will not be forgiven.
  8. Do not be shy about stealing your neighbor’s work, for his trials have led to best practices that you can make equally good use of.
  9. Do not rely solely on business keys; surrogates are your friend and will permit you to engage in slowly changing your dimensions.
  10. You shall covet a proper audit and log system; for on the day of judgment, you will need proof of your compliance.

Tags: , , , , ,

No Comments

Hand Coding the Data Warehouse with VFP

I have posted a new whitepaper that supplements the material I presented at FoxForward and at AFUG in Atlanta. This whitepaper was published in the FoxForward 2007 Book that Kevin provided to conference attendees, which contains lots of other great session material.

This whitepaper makes the case for using Visual FoxPro as a hand-coding tool for developing data profiling, metadata, ETL, and other data warehousing applications.

Download it here.

Tags: , , , , , ,

No Comments

Parsing firstname / middlename / lastname

There are many ways to parse or tokenize data in VFP. I like using STREXTRACT and GETWORDNUM for example. But there are cases in which the actual parsing needs to be a bit smarter than what these common methods offer. Take a concatenated name field for example. Common to many legacy (and unfortunately, some modern) systems, is a ‘name’ or ‘address1′ field that stores a person or company’s full name. This method is not looked upon favorably by data integrators or people who like atomic, normalized data (like me!). These types of fields don’t work well with indexes either.

Of course, the design solution is to create 5 or more fields that represent a person’s name: prefix / first / middle / last / suffix. These atomic values can then be arranged as needed (”last, first middle”; “first last”, initials only, etc.) in reports and for display. But oftentimes, making this decision now is too late.

An alternative is to write a smart parser that can place each name element into its appropriate box. In the following code example, I provide a solution that can handle this tasks. My only assumption for this code is that the name is stored in prefix / first / middle / last / suffix order. By data profiling, you can determine how true and consistent this assumption is on your data. You may discover that 95% of the file is stored in lastname / firstname order instead. Whatever the case may be, you can use the method below as a starting point.

The following snippet gets us started, and in the end, loops through the names and prints the parsed results to screen:

#DEFINE PREFIX  1
#DEFINE FIRSTNAME  2
#DEFINE MIDDLENAME  3
#DEFINE SURNAME  4
#DEFINE SUFFIX  5
 
CLEAR
 
*-- some names to demonstrate
DIMENSION aMyNames[13]
aMyNames[1] = "Andrew MacNeill"
aMyNames[2] = "Andy Kramek"
aMyNames[3] = "Bernard Bout"
aMyNames[4] = "Calvin Hsia"
aMyNames[5] = "Cesar Chalom"
aMyNames[6] = "Craig Baily"
aMyNames[7] = "Craig Berntson"
aMyNames[8] = "Emerson Santon Reed"
aMyNames[9] = "Eric den Doop"
aMyNames[10] = "gonzomaximus"
aMyNames[11] = "Tod J. McKenna II"
aMyNames[12] = "Dr. Ralph Kimball"
aMyNames[13] = "Miss Lucy Jones"
 
*-- placeholder for tokenized name, passed by reference into split_name
DIMENSION aParts[1]
 
FOR x = 1 TO 13
  cLastFirst = split_name(aMyNames[x], @aParts)
  ? cLastFirst
  ? "firstname: " + aParts[FIRSTNAME] + CHR(9)
  ?? "lastname: " + aParts[SURNAME]
NEXT

Here’s the code in function “split_name”

FUNCTION split_name
LPARAMETERS tcName , taParts
 
*-- variables and arrays I will use
LOCAL cPart AS Character
LOCAL x , nTotParts AS Integer
LOCAL lIsSuffix , lIsPrefix AS Logical
LOCAL ARRAY aTemp[1]
EXTERNAL ARRAY taParts
 
*-- get the data from tcName into an array for easy looping
tcName = STRTRAN(UPPER(ALLT(tcName))," ",CHR(13))
DIMENSION taParts[5]
STORE "" TO taParts
DIMENSION aTemp[1]
FOR x = 1 TO MEMLINES(tcName)
  cPart = ALLTRIM(MLINE(tcName,x))
  IF !EMPTY(cPart)
    aTemp[ALEN(aTemp,1)] = cPart
    IF x < MEMLINES(tcName)
      DIMENSION aTemp[ALEN(aTemp,1)+1]
    ENDIF
  ENDIF
NEXT
 
*-- now loop through temp array and put each name segment into a smart little box
nTotParts = ALEN(aTemp,1)
FOR x = 1 TO nTotParts
  IF VARTYPE(aTemp[x])!="C"
    LOOP
  ENDIF
  *-- get rid of any periods used in the part, and convert to uppercase
  cPart = STRTRAN(ALLTRIM(UPPER(aTemp[x])),".","")
  IF nTotParts = 1
    *-- Madonna? gonzomaximus? Treat single names as a lastname
    taParts[SURNAME] = cPart
  ELSE
    *-- Note: A data profile on a name file I parsed a few years back revealed
    *-- the following common name prefixes. Beef this list up if you have others!
    lIsPrefix = INLIST(cPart,"MR","MRS","MS","MISS","DR","PROF","SIR",;
                  "MASTER","REV","REVERAND","FATHER","SISTER","BROTHER",;
                  "BR","SIS","ATTORNEY","COL","COLONOL","REP","PRES")
    IF lIsPrefix
      taParts[PREFIX] = cPart
      LOOP
    ENDIF
    *-- Note: The same profile mentioned above revealed the following suffixes
    lIsSuffix = INLIST(cPart,"PHD","II","III","IV","V","JR","SR","ESQ","CPA","CEA","DD","MD","JD")
    IF lIsSuffix
      taParts[SUFFIX] = cPart
      LOOP
    ENDIF
    IF x = nTotParts
      IF EMPTY(taParts[SURNAME])
        taParts[SURNAME] = cPart
      ELSE
        IF EMPTY(taParts[FIRSTNAME])
          taParts[FIRSTNAME] = cPart
        ELSE
          taParts[MIDDLENAME] = cPart
        ENDIF
      ENDIF
    ELSE
      IF EMPTY(taParts[FIRSTNAME])
        taParts[FIRSTNAME] = cPart
      ELSE
        IF EMPTY(aParts[MIDDLENAME])
          taParts[MIDDLENAME] = cPart
        ELSE
          taParts[SURNAME] = cPart
        ENDIF
      ENDIF
    ENDIF
  ENDIF
NEXT
 
RETURN ALLTRIM(ALLTRIM(taParts[SURNAME]) + ", " +;
         ALLTRIM(taParts[FIRSTNAME]) + " " + ;
         ALLTRIM(taParts[MIDDLENAME]) )
ENDFUNC

So there you have it. The code above looks at the tcName variable and attempts to dump its parts into the taParts array. Some defined constants exist to help ease the “where the heck am I in this array” confusions (a practice I use for stuff like this all the time). As an added bonus, the code returns the name in last / first middle order, which is usually a good format for display purposes.

Tags: , , ,

1 Comment