Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Statistics’


Published August 21st, 2007

Newton’s Method for Approximating the Zeros of a Real-valued Function

Need to find the root of a real-valued function and simple algebra just won’t cut it? Then Newton’s method is for you (if you can remember back to your calculus days, that is).

Newton’s method is a root-finding algorithm and uses iteration to solve nonlinear equations. The process involves making a guess at the true solution and then applying a formula to get a better guess. This process repeats until the (approximate) answer is found. The methodology is part of numerical analysis, which is the “branch of mathematics dealing with methods for obtaining approximate numerical solutions of mathematical problems (dictionary.com)”. Specifically, the type of problems fit into the “continuous mathematics” category that are common in medicine, engineering, science, and business industries. (For a good discussion on continuous verses discrete mathematics, check out this blog entry by Mark C. Chu-Carroll or this page from ESC.EDU.) As you might guess, continuous mathematics is very important in business intelligence, specifically statistical analysis of data.

The general form to find the approximation is as follows:

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\,\!

Here’s how it works: Start with an initial guess (get out your graph paper!). Then the function draws a tangent to approximate the function based on your initial guess. Next, the function computes the x-intercept. This x-intercept will be better than the initial guess, and closer to the answer (the root). It then repeats with the x-intercept as the new guess.

The FoxPro code:

LOCAL lcFunction, lnInitialGuess, lnIterations 
 
*-- set the fuction, my initial guess, and number of iterations
lcFunction = "(1/4)*x^3 - 3*x^2 + (3/4)*x - 2"
lnInitialGuess = 2
lnIterations = 10 
 
*-- run newton
newton(lcFunction, lnInitialGuess, lnIterations )
 
*-- foxpro does newton:
FUNCTION newton
LPARAMETERS tcFunction, tnInitialGuess , tnIterations 
 
LOCAL lcDerivative AS String
LOCAL lnNewGuess, lnIter , x , lnY , lnSlope , lnCol5 , lnCol6
 
CREATE CURSOR crNewton (col_n n(2), col_Xn n(14,5), col_fXn n(14,5), col_f1Xn n(14,5), col_ys n(14,5) , col_ng n(14,5))
 
limh = .000001
lcFunction = ALLT(tcFunction)
lnInitialGuess = EVL(tnInitialGuess,0)
lnIterations = EVL(tnIterations,1) 
 
lnNewGuess = lnInitialGuess
 
*-- Loop for each user iterations
FOR lnIter = 1 TO lnIterations
 
  *-- Set guess
  x = lnNewGuess
 
  *-- Run Function
  lnY = &lcFunction
 
  *-- Find Derivative
  lnHoldx = x
  x = x + limh
  lnFuncpluslimh = &lcFunction
  x = lnHoldx
  lnFuncminlimh = &lcFunction
  lnSlope = (lnFuncpluslimh - lnFuncminlimh) / limh 
 
  lnys = lny/lnSlope 
 
  *-- Find New guess
  lnNewGuess = x - lnys
 
  *-- Populate the cursor
  INSERT INTO crNewton (col_n  ,col_Xn ,col_fXn ,col_f1Xn ,col_ys ,col_ng) ;
                VALUES (lnIter ,x      ,lnY     ,lnSlope  ,lnys   ,lnNewGuess)
 
  *-- Reset the guess and continue
  lnNewGuess = lnNewGuess
 
NEXT
 
SELECT crNewton
LOCATE
BROWSE NORMAL

In this run, with the equation I have defined, the approximation of x is 11.80326. If you plug this number in for x in the original equation, it evaluates to .00014. Pretty close! And that’s exactly what this algorithm is great for. Try these additional functions to get the hang of it. You can change the guess and iterations as you like as well:

  • “x + 1″
  • “x^3-12″
  • “x*(x^3)-(3/4)*x”
  • “((x^4+1)*(x^3+2))-11″

 

Published July 10th, 2007

Getting Started with Data Profiling

In my last post I said I would publish some code that you could use to get started profiling your FoxPro tables. I’ve included the following sample here, as well as in the Download Section for you to use. This example is written as a procedure, but you should consider converting this to objects. You can create patterns and really cut down on the amount of work you do. Although this code is generic, you may in fact need to write modified versions for certain tables in certain environments. OO techniques lend themselves to this nicely.

Anyway, here is how you would call the function listed below:

LOCAL lcSkipList, lcDirectory, lcTableName,;
     lcResultAlias  AS String 
 
lcDirectory = "C:\\Work\\tables_to_profile\\"
lcTableName = 'some_table'
lcResultAlias = 'some_resultset'
 
*-- enter the columns, separated by comma, to skip
lcSkipList = '' 
 
*-- run the data profile
profile_table(lcTableName, lcDirectory, lcResultAlias, lcSkipList)

The following function, ‘profile_table’, is as follows (please note, you will need my get_median function from here):

FUNCTION profile_table
  LPARAMETERS tcTableName, tcDirectory, tcAltName, tcSkipList
 
  LOCAL nFldCount AS Integer
 
  tcAltName = EVL(tcAltName,'profiled')
  tcSkipList = EVL(tcSkipList,'')
  lcCursorName = 'db_' + tcAltName 
 
  TRY
    IF USED(tcAltName)
      USE IN (tcAltName)
    ENDIF
    USE (ADDBS(tcDirectory)+tcTableName) IN 0 ALIAS (tcAltName )
    nFldCount = AFIELDS(aColumns,tcAltName)
  CATCH
    nFldCount = -1
  ENDTRY
 
  IF nFldCount > 0
    IF USED(lcCursorName )
      USE IN (lcCursorName )
    ENDIF
 
    CREATE CURSOR (lcCursorName)(;
			table_name varchar(45)	,;
			fld_name   varchar(45)	,;
			fld_type   char(1)		,;
			fld_size   integer      ,;
			fld_prec   integer      ,;
			fld_null   logical      ,;
			range_low  varchar(60)	,;
			range_high varchar(60)	,;
			domain     memo         ,;
			uniq_cnt   integer      ,;
			empty_cnt  integer      ,;
			full_cnt   integer      ,;
			dele_cnt   integer      ,;
			null_cnt   integer      ,;
			most_freq  varchar(60)  ,;
			data_mean  numeric(20,4),;
			data_var   numeric(20,4),;
			data_std   numeric(20,4),;
			median     numeric(20,4),;
			location   varchar(200) )
 
    FOR n = 1 TO nFldCount 
 
        *-- afield properties
        m.table_name = tcTableName
        m.fld_name = aColumns[n,1]
        m.fld_type = aColumns[n,2]
        m.fld_size = aColumns[n,3]
        m.fld_prec = aColumns[n,4]
        m.fld_null = aColumns[n,5]
 
        ? tcTableName + ' Field ' + m.fld_name
        ? DATETIME()
 
        IF ALLTRIM(LOWER(m.fld_name)) $ ALLTRIM(LOWER(tcSkipList))
          ?? ' skipped'
          LOOP
        ENDIF
 
        m.location = tcDirectory
 
        lcSQL = "SELECT " + m.fld_name + " AS prof_fld FROM " + ;
                        tcAltName + " INTO CURSOR crTmpProf"
        &lcSQL 
 
        *-- note to self, need to test to see if using SQL-Select is more efficient
        SELECT crTmpProf
        CALCULATE MIN(prof_fld) TO m.range_low
        CALCULATE MAX(prof_fld) TO m.range_high
        CALCULATE CNT() FOR EMPTY(prof_fld) TO m.empty_cnt
        CALCULATE CNT() FOR !EMPTY(prof_fld) TO m.full_cnt
        CALCULATE CNT() FOR DELETED() TO m.dele_cnt
        CALCULATE CNT() FOR ISNULL(prof_fld) TO m.null_cnt 
 
        *-- numbers only!
        IF INLIST(m.fld_type,"N","I","Y","B","F")
          CALCULATE AVG(prof_fld) TO m.data_mean
          CALCULATE STD(prof_fld) TO m.data_std
          CALCULATE VAR(prof_fld) TO m.data_var
          m.median = get_median(2,"crTmpProf","prof_fld")
        ELSE
          STORE -1 TO m.data_mean , m.data_std , m.data_var, m.median
        ENDIF
 
        *-- distinct list and domain
        SELECT prof_fld, COUNT(prof_fld) as freq ;
            FROM crTmpProf GROUP BY prof_fld ;
            ORDER BY freq DESC INTO CURSOR crTmpDistProf
 
        SELECT crTmpDistProf
        LOCATE
        m.uniq_cnt = _TALLY
        m.most_freq = crTmpDistProf.prof_fld
        m.domain = ''
        SCAN
          m.domain = m.domain + TRANSFORM(crTmpDistProf.prof_fld) + CHR(13)
        ENDSCAN
 
        SELECT (lcCursorName)
        APPEND BLANK
        GATHER MEMVAR MEMO
 
        ? REPLICATE('+',12)
 
    NEXT
  ENDIF
ENDFUNC

So there you have it. Now you have a table that lists each attribute, with additional, highly useful information. You can use this metadata to conduct index analysis, check for misspellings, verify that lookup codes exist in external tables, verify that numerical data hasn’t deviated too far, etc. This type of metadata is essential to have before you begin any quality or integration project!

Data profiling is a hefty subject — one in which I promise to address again in the near future.

Published May 31st, 2007

Business Intelligence Through Web Analysis

I just received a surprising email regarding my previous post about Analyzing Apache’s Raw Access Logs in FoxPro. The commenter wrote “dude, you’re nuts. Why waste your time on this?!? I use Awstats and that works fine”.

I sat back in my chair quite puzzled. After all, I know that Awstats doesn’t even come close to giving me the answers I need to grow my business and website. Then I realized my folly: I jumped right into my example in my last post without fully explaining the goal of the project. I’ll try to redeem myself now.

Awstats and similar tools (in my case, provided with cPanel) are retrospective reporting tool. They give you nice charts and display some very interesting numbers and facts about the types of traffic generated on your site. If you’re good with numbers and can associate events to dates (in your head) then you may be able to notice some interesting patterns (like, “hey – it seems that whenever I post a new blog entry, my hits double!”).

But this isn’t good enough for more serious projects (but isn’t a bad place to start either).

When growing your business (whether you’re a blogger or selling widgets), this type of information is invaluable. You need good, consistent, scientific analysis to pull it all together (hunches and gut feelings don’t count). Pool raw data (from a variety of sources), integrate it, clean and add value to it, and compile it to create some incredibly useful and valuable information (Read: Business Intelligence). This information can help you make decisions like (a) how much to spend on advertising, (b) whether or not to sponsor an event (such as FoxForward), (c) how many blog posts to make per week to keep the interest of readers, (d) should I sign up with Google AdSense, or (e) what design elements and layout plans are making the greatest impact.

Data smog is a real issue, however. Too much data can not only waste your valuable analysis time, but the integration of this meaningless data will do nothing but eat away at your resources (and give you a headache). The key then is to do a little preparation before you begin a project like this (duh!). I think there are two important steps (1) monetize all elements of your business, and (2) identify all key performance indicators (KPIs). Armed with this information, you will be able to build a dimensional model (in VFP of course!) with an incredibly rich fact table. Monetizing helps you assign value to all your tasks, and KPIs allow you to measure the benefits of these endeavors. Throughout the course of this project, I’ll be itemizing these two items in greater detail. As an example, for my blog todmeansfox, I’ve monetized the following items:

  • Posting a new blog entry: $90.00 / post (my time and effort to post based on my current rates, discounted)
  • Responding to posts: $10.00 / comment
  • Social networking: $90.00 / contact (includes setting up accounts on del.ico.us for example)
  • Advertising: $40.00 / free advertising (example, updating my fox.wiki profile)
  • Advertising: case by case (I have not done any advertising yet)
  • Sponsoring: case by case (I have not sponsored an event yet)
  • Updating CSS files: $60.00 / hour (how much does it ‘cost’ me to update my site’s layout and colors?)

Next, I tried to identify all important KPIs:

  • referrals
  • length of stay
  • out clicks
  • total number of hits and page views
  • total number of comments made
  • total number of emails received
  • Blackstone Providence leads
  • consulting job hours from website leads

Of course, as I do more data profiling, I may uncover some other very useful pieces of data that I can later integrate.

The goal of this project, therefore, is rather simple: make better business decisions. Using the Internet, I will gather the right data, integrate it in a meaningful way, and use OLAP to analyze and report on the findings. I’ll use data mining, profiling, and trend analysis to identify abuse and spam, as well as identify areas were improvements in layout, methodology, and content can make a greater impact. My hope is to generate more business by leveraging my current assets. On top of that, I want to do it in the open so others can benefit.

The first step in my process is data profiling, where I’ll gather the data I think I can use, analyze it, test it for quality, and prepare it for my dimensional model. My last blog post attempted to start that process by simply downloading and parsing the daily Apache raw access log file (which I should mention is in the NCSA combined/XLF/ELF log format, in case you were wondering).

As you can see, Awstats can only get me so far (actually, not that far at all).

Published January 6th, 2007

Interpolation

For those that aren’t familiar with the term, and since I’m not too good at creating my own definitions (go on, ask Andy Kramek!), I thought I’d borrow a definition from wikipedia to get started:

“In the mathematical subfield of numerical analysis, interpolation is a method of constructing new data points from a discrete set of known data points.” (Source)

If you follow that source link to the wikipedia article, you can read lots of nice things about interpolation and how it can be used in the math and science fields. About 8 years ago, I needed such a function: I needed the ability to find a set of values (x) that fell between two known values (f(x)). I wrote the following linear interpolation algorithm in FoxPro to get my answers:

*-- this sets up a test cursor with some data points
CREATE CURSOR crDataSet (fld_x n(4,2), fld_fx n(4,2))
INDEX ON fld_x TAG fld_x
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (1,1.75)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (3,2.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (4,3.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (5,4.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (7,4.25)
 
? interpolate(2,"crDataSet")   && returns 1.875
? interpolate(3,"crDataSet")     && returns 2.000
? interpolate(3.1,"crDataSet")  && returns 2.100
? interpolate(3.9,"crDataSet")  && returns 2.900
? interpolate(10,"crDataSet")   && returns 4.625
 
FUNCTION interpolate
    LPARAMETERS tnX, tcDataSet
 
    LOCAL lcfld_x , lcfld_fx AS String
    LOCAL lcOldNear , lcOldExac AS String
    LOCAL X1, Y1, X2, Y2 AS Number
    LOCAL lFound AS Logical 
 
    *-- name of fields in tcDataSet cursor
    lcfld_x = FIELD(1)    lcfld_fx = FIELD(2)
 
    *-- important to adjust these settings
    lcOldNear = SET('NEAR')
    lcOldExac = SET('EXACT')
    SET NEAR ON
    SET EXACT OFF 
 
    SELECT (tcDataSet)
 
    lFound = SEEK(tnX,tcDataSet)
    SET NEAR &lcOldNear
    SET EXACT &lcOldExac
 
    IF lFound
        *-- if found, then just return the point
        RETURN EVAL(tcDataSet + '.' + lcfld_fx)
    ELSE
        *- not found, so need to interpolate
        cXStr = tcDataSet + '.' + lcfld_x
        cYStr = tcDataSet + '.' + lcfld_fx
 
        *-- if the seek put us at eof, back up 1
        SELE (tcDataSet)
        IF EOF(tcDataSet)
            SKIP -1
        ENDIF
 
        *-- evaluate the first point
        X2 = EVAL(cXStr)
        Y2 = EVAL(cYStr)
 
        *-- evaluate the second point
        IF RECNO(tcDataSet) = 1
            SKIP +1
        ELSE
            SKIP -1
        ENDIF
        X1 = EVAL(cXStr)
        Y1 = EVAL(cYStr)
 
        *-- draw a line and return the result!
        RETURN ymxb(x1,y1,x2,y2,tnX)
    ENDIF
 
ENDFUNC  
 
FUNCTION ymxb
    LPARAMETERS X1,Y1,X2,Y2,X3
 
    *-- y = mx + b
    LOCAL nSlope , B1
 
    nSlope = (Y1-Y2)/(X1-X2)
    B1 = -((nSlope) * X1 - Y1)
    RETURN (nSlope) * X3 + B1
 
ENDFUNC

There you have it!

A couple of notes: For space reasons, I’ve omitted some key error checking and bullet-proofing (I started doing it, but realized that an extra 20 lines of code for this example was overkill). But it would be a good idea, for example to verify that the columns in the cursor passed were numbers. Come to think of it, you should also very that the correct parameters were passed!

In addition, I remember from math class that such a linear method is not the most accurate method. I am using a straight line between points and not a curve. I suppose a complimentary function could be added along with ymxb to get a curve between the two points, but until I need to do that — I’m not game!

Published December 12th, 2006

Sample Standard Deviation (and SQL)

The CALCULATE command contains a range of statistical operations that can be utilized in FoxPro. These commands work on fields in a table (or cursor) and can be filtered using a FOR clause and SCOPE operator. Calculating the mean, standard deviation, and variance is rather straightforward:

CALCULATE AVG(fieldname) TO variable
CALCULATE STD(fieldname) TO variable
CALCULATE VAR(fieldname) TO variable

And some other useful functions:

CALCULATE MIN(fieldname) TO variable
CALCULATE MAX(fieldname) TO variable
CALCULATE CNT(fieldname) TO variable
CALCULATE SUM(fieldname) TO variable

In a previous blog entry I wrote a program to calculate the median from a set of data. VFP does not have a native median function.

Population vs. Sample Data.

Visual FoxPro doesn’t have a built-in standard deviation formula for a sample. The functions CALCULATE STD() is a population function. The same is true of variance. The question then becomes, how can I get a sample standard deviation? I’m glad you asked! Here’s how:

FUNCTION std_devs
    LPARAMETERS tcTableField, tcPredicates
    LOCAL nRet, nsavg , nscnt AS float  
 
    SELECT(JUSTSTEM(tcTableField))
 
    IF !EMPTY(tcPredicates)
        CALCULATE AVG(&tcTableField) FOR &tcPredicates TO nsavg
        CALCULATE CNT(&tcTableField) FOR &tcPredicates TO nscnt
        SELECT SUM( (&tcTableField- nsavg) ^ 2 ) as ssum ;
            FROM JUSTSTEM(tcTableField)
            WHERE &tcPredicates
            INTO CURSOR crTmpSDEVS2
    ELSE
        CALCULATE AVG(&tcTableField) TO nsavg
        CALCULATE CNT(&tcTableField) TO nscnt
        SELECT SUM( (&tcTableField- nsavg) ^ 2 ) as ssum ;
            FROM JUSTSTEM(tcTableField)
            INTO CURSOR crTmpSDEVS2
    ENDIF
 
    nRet = SQRT(crTmpSDEVS2.ssum / (nscnt-1) )
 
    USE IN crTmpSDEVS2  RETURN nRet
 
ENDFUNC

What about STD or VAR in an SQL statement? Again, glad you asked!

SELECT DISTINCT std_devp("crtmp.value") AS stddev ;
    FROM crtmp INTO CURSOR cResult
SELECT DISTINCT std_devs("crtmp.value") AS stddev ;
    FROM crtmp INTO CURSOR cResult
 
*-- here is the population version fit for SQL:
FUNCTION std_devp
    LPARAMETERS tcTableField, tcPredicates
    LOCAL nRet AS float
 
    IF !EMPTY(tcPredicates)
        CALCULATE STD(&tcTableField) ;
            FOR &tcPredicates TO nRet IN JUSTSTEM(tcTableField)
    ELSE
        CALCULATE STD(&tcTableField) ;
            TO nRet IN JUSTSTEM(tcTableField)
    ENDIF
 
    RETURN nRet 
 
ENDFUNC

A big issue with a custom function in an SQL statement is performance. The sample standard deviation function I wrote above is certainly not going to win any sprints (but might just hold out during a marathon). Some environmental tests also need to be done in the above code (for example, resetting the alias back and checking for deleted would be nice improvements). For what it’s worth, this should be a good stepping stone for those in need of a good sample standard deviation formula in pure FoxPro. In addition, using it as a function call in an SQL statement increases its usability and worth.

Published November 20th, 2006

An Improved Median Function

After my test script found errors with my previous median function, I’ve gone ahead and made some improvements. Again, the goal here isn’t to be thorough but is to demonstrate a boundary-testing technique that is easy to conceive, implement, and build-upon. Like I mentioned in my previous post, this single test script can be the foundation for a testing class. I would also go further by wrapping all of this into a table-driven engine, which would call the function, define its parameters, and identify its expected return value. With VFP9, structured error handling will also help.

I imagine that a revisit of this topic at some point will be in order. Until then, enjoy the much improved get_median function:

*--The following code is an improved get_median
FUNCTION get_median_clean
LPARAMETERS tnType, teName, teColDel
 
    *-- boundary testing
    IF !BETWEEN(tnType,1,3)
        RETURN 0
    ENDIF   
 
    DO CASE
    CASE tnType = 1 .and. VARTYPE(teName)<>"C"
        RETURN 0
    CASE tnType = 1 .and. (!ISDIGIT(teName) .or.;
                VARTYPE(teColDel)<>"C")
        RETURN 0
    CASE INLIST(tnType,2,3) .and. TYPE(teColDel)<>"N"
        RETURN 0
    CASE tnType = 2 .and. (!USED(teName) .or. ;
                 EMPTY(FIELD(teColDel,teName)))
        RETURN 0
    CASE tnType = 2 .and. ;
                 VARTYPE(EVALUATE(FIELD(teColDel,teName))) <> "N"
        RETURN 0
    CASE tnType = 3 .and. ;
                 VARTYPE(EVALUATE(teColDel[1,teColDel])) <> "N"
        RETURN 0
    ENDCASE
 
    *-- local declerations
    LOCAL lcSelCol AS String
    LOCAL lnItems, lnHalf, lnInc AS Integer
    LOCAL lnMedian AS Number
 
    *-- create the aMedian Array
    DO CASE
    CASE tnType = 1        && use a delimted string
        lnItems = ALINES( aMedian , STRTRAN(teName ,teColDel,CHR(13)) )
        FOR lnInc = 1 TO lnItems
            aMedian[lnInc] = VAL(aMedian[lnInc])
        NEXT    CASE tnType = 2        && calculate using an alias
        lcSelCol = FIELD(teColDel,teName)
        lcSQL = "SELECT " + lcSelCol + " FROM " + teName + ;
                " ORDER BY " + lcSelCol + " INTO ARRAY aMedian"
        &lcSQL
        lnItems= _TALLY
    CASE tnType = 3        && calculate using an array
        lnItems= ALEN(&teName,1)
        DIMENSION aMedian[lnItems]
        FOR lnInc = 1 TO lnItems
            aMedian[lnInc] = &teName.[lnInc,teColDel]
        NEXT
    ENDCASE
 
    *-- find the median
    ASORT(aMedian)
    lnHalf = CEILING( lnItems / 2)
    IF MOD(lnHalf,2) = 0
        lnMedian = ( aMedian[lnHalf] + aMedian[lnHalf+1] ) / 2
    ELSE
        lnMedian = aMedian[lnHalf]
    ENDIF
    RETURN lnMedian
 
ENDFUNC