Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Statistics’


Published December 12th, 2006

Sample Standard Deviation (and SQL)

The CALCULATE command contains a range of statistical operations that can be utilized in FoxPro. These commands work on fields in a table (or cursor) and can be filtered using a FOR clause and SCOPE operator. Calculating the mean, standard deviation, and variance is rather straightforward:

CALCULATE AVG(fieldname) TO variable
CALCULATE STD(fieldname) TO variable
CALCULATE VAR(fieldname) TO variable

And some other useful functions:

CALCULATE MIN(fieldname) TO variable
CALCULATE MAX(fieldname) TO variable
CALCULATE CNT(fieldname) TO variable
CALCULATE SUM(fieldname) TO variable

In a previous blog entry I wrote a program to calculate the median from a set of data. VFP does not have a native median function.

Population vs. Sample Data.

Visual FoxPro doesn’t have a built-in standard deviation formula for a sample. The functions CALCULATE STD() is a population function. The same is true of variance. The question then becomes, how can I get a sample standard deviation? I’m glad you asked! Here’s how:

FUNCTION std_devs
    LPARAMETERS tcTableField, tcPredicates
    LOCAL nRet, nsavg , nscnt AS float  
 
    SELECT(JUSTSTEM(tcTableField))
 
    IF !EMPTY(tcPredicates)
        CALCULATE AVG(&tcTableField) FOR &tcPredicates TO nsavg
        CALCULATE CNT(&tcTableField) FOR &tcPredicates TO nscnt
        SELECT SUM( (&tcTableField- nsavg) ^ 2 ) as ssum ;
            FROM JUSTSTEM(tcTableField)
            WHERE &tcPredicates
            INTO CURSOR crTmpSDEVS2
    ELSE
        CALCULATE AVG(&tcTableField) TO nsavg
        CALCULATE CNT(&tcTableField) TO nscnt
        SELECT SUM( (&tcTableField- nsavg) ^ 2 ) as ssum ;
            FROM JUSTSTEM(tcTableField)
            INTO CURSOR crTmpSDEVS2
    ENDIF
 
    nRet = SQRT(crTmpSDEVS2.ssum / (nscnt-1) )
 
    USE IN crTmpSDEVS2  RETURN nRet
 
ENDFUNC

What about STD or VAR in an SQL statement? Again, glad you asked!

SELECT DISTINCT std_devp("crtmp.value") AS stddev ;
    FROM crtmp INTO CURSOR cResult
SELECT DISTINCT std_devs("crtmp.value") AS stddev ;
    FROM crtmp INTO CURSOR cResult
 
*-- here is the population version fit for SQL:
FUNCTION std_devp
    LPARAMETERS tcTableField, tcPredicates
    LOCAL nRet AS float
 
    IF !EMPTY(tcPredicates)
        CALCULATE STD(&tcTableField) ;
            FOR &tcPredicates TO nRet IN JUSTSTEM(tcTableField)
    ELSE
        CALCULATE STD(&tcTableField) ;
            TO nRet IN JUSTSTEM(tcTableField)
    ENDIF
 
    RETURN nRet 
 
ENDFUNC

A big issue with a custom function in an SQL statement is performance. The sample standard deviation function I wrote above is certainly not going to win any sprints (but might just hold out during a marathon). Some environmental tests also need to be done in the above code (for example, resetting the alias back and checking for deleted would be nice improvements). For what it’s worth, this should be a good stepping stone for those in need of a good sample standard deviation formula in pure FoxPro. In addition, using it as a function call in an SQL statement increases its usability and worth.

Published November 20th, 2006

An Improved Median Function

After my test script found errors with my previous median function, I’ve gone ahead and made some improvements. Again, the goal here isn’t to be thorough but is to demonstrate a boundary-testing technique that is easy to conceive, implement, and build-upon. Like I mentioned in my previous post, this single test script can be the foundation for a testing class. I would also go further by wrapping all of this into a table-driven engine, which would call the function, define its parameters, and identify its expected return value. With VFP9, structured error handling will also help.

I imagine that a revisit of this topic at some point will be in order. Until then, enjoy the much improved get_median function:

*--The following code is an improved get_median
FUNCTION get_median_clean
LPARAMETERS tnType, teName, teColDel
 
    *-- boundary testing
    IF !BETWEEN(tnType,1,3)
        RETURN 0
    ENDIF   
 
    DO CASE
    CASE tnType = 1 .and. VARTYPE(teName)<>"C"
        RETURN 0
    CASE tnType = 1 .and. (!ISDIGIT(teName) .or.;
                VARTYPE(teColDel)<>"C")
        RETURN 0
    CASE INLIST(tnType,2,3) .and. TYPE(teColDel)<>"N"
        RETURN 0
    CASE tnType = 2 .and. (!USED(teName) .or. ;
                 EMPTY(FIELD(teColDel,teName)))
        RETURN 0
    CASE tnType = 2 .and. ;
                 VARTYPE(EVALUATE(FIELD(teColDel,teName))) <> "N"
        RETURN 0
    CASE tnType = 3 .and. ;
                 VARTYPE(EVALUATE(teColDel[1,teColDel])) <> "N"
        RETURN 0
    ENDCASE
 
    *-- local declerations
    LOCAL lcSelCol AS String
    LOCAL lnItems, lnHalf, lnInc AS Integer
    LOCAL lnMedian AS Number
 
    *-- create the aMedian Array
    DO CASE
    CASE tnType = 1        && use a delimted string
        lnItems = ALINES( aMedian , STRTRAN(teName ,teColDel,CHR(13)) )
        FOR lnInc = 1 TO lnItems
            aMedian[lnInc] = VAL(aMedian[lnInc])
        NEXT    CASE tnType = 2        && calculate using an alias
        lcSelCol = FIELD(teColDel,teName)
        lcSQL = "SELECT " + lcSelCol + " FROM " + teName + ;
                " ORDER BY " + lcSelCol + " INTO ARRAY aMedian"
        &lcSQL
        lnItems= _TALLY
    CASE tnType = 3        && calculate using an array
        lnItems= ALEN(&teName,1)
        DIMENSION aMedian[lnItems]
        FOR lnInc = 1 TO lnItems
            aMedian[lnInc] = &teName.[lnInc,teColDel]
        NEXT
    ENDCASE
 
    *-- find the median
    ASORT(aMedian)
    lnHalf = CEILING( lnItems / 2)
    IF MOD(lnHalf,2) = 0
        lnMedian = ( aMedian[lnHalf] + aMedian[lnHalf+1] ) / 2
    ELSE
        lnMedian = aMedian[lnHalf]
    ENDIF
    RETURN lnMedian
 
ENDFUNC

Published November 16th, 2006

First test case using get_median

Here is a simple test script that will run get_median seven times, passing in a variety of parameters along the way. The only valid case occurs when the first parameter of “1″ is passed and parameters two and three are valid. This seemingly mindless test case (which many astute and seasoned programmers will scoff at) actually does a lot for me. First, it helps to ensure that any future changes I make to this or similar functions will run when unexpected parameters are passed in. Second, it allowed me to create a few useful functions that can be re-used on other test cases. Third, it forms the basis of a testing project that can be built upon as time and resources permit. The premise is that all my test cases can be chained together (preferably in a table) and run just before the project reaches QA or system testing.

What is left to do? For starters, this test script can be enhanced to not only pass in a string of values, but also an array or cursor. It is possible that doing so will introduce other issues that will need to be fixed. The end result, nevertheless, will be one bullet-proof get_median function that is as flexible as it is safe! In addition, this code will form the basis of a test class and be added to a testing project. For now, I’ll leave it in PRG form for simplicity.

Tomorrow, I’ll post an updated get_median that fixes many of the bugs, while incorporating many of my boundary testing examples from prior entries.

*-- The following code is my first test of get_median
ON ERROR ;
    STRTOFILE(capture_error(ERROR( ), MESSAGE( ), MESSAGE(1),;
    PROGRAM( ), LINENO( )),"test_err.log",1)
 
*-- What if the driver is unexpected, or para 2 and 3 are invalid?
FOR x = -2 TO 4
    ePara2 = get_string(8)
    epara3 = ","
    eRetVal = get_median_clean(x,ePara2,ePara3)
 
    STRTOFILE(TTOC(DATETIME())+ " get_median " + ;
        PADL(TRANSFORM(x),4) + CHR(9) + ;
        TRANSFORM(ePara2) + CHR(9) + ;
        TRANSFORM(epara3) + CHR(9) + ;
        TRANSFORM(eRetVal) + CHR(13) + CHR(10) ,"test_run.log",1)
 
NEXT
 
FUNCTION get_string()
    LPARA tnSize
 
    tnSize = IIF(VARTYPE(tnSize)<>"N",0,tnSize)
    LOCAL lcString
    lcString = ALLTRIM(STR(get_int()))
    FOR n = 2 TO tnSize
        lcString = lcString + "," + ALLTRIM(STR(get_int()))
    NEXT
    RETURN lcString
ENDFUNC
 
FUNCTION get_int
    RETURN (RAND()*1000) - 500
ENDFUNC 
 
FUNCTION capture_error (merror, mess, mess1, mprog, mlineno)
    LOCAL cMess, cCRLF
    cCRLF = CHR(13) + CHR(10)
    cMess = 'Error number: ' + LTRIM(STR(merror)) + cCRLF + ;
            'Error message: ' + mess + cCRLF + ;
            'Line of code with error: ' + mess1 + cCRLF + ;
            'Line number: ' + LTRIM(STR(mlineno)) + cCRLF + ;
            'Program with error: ' + mprog + cCRLF + ;
            REPLICATE("=",30) + cCRLF
    RETURN cMess
ENDFUNC

Published November 2nd, 2006

Boundary Testing — CASE: Get Median function

Last week I talked about boundary testing parameters with a promise to discuss testing methods this week. I’m going to adjust this promise a little due to the magnitude of the task. I felt it would be beneficial to review a case first.

This week, I present a simple program that I wrote several years ago that calculates the median of a range of values. I have enhanced (famous last words) the program to accept a variety of inputs that will help to demonstrate my boundary testing example next week. It has a driver parameter (tnType) and two additional parameters (teName, teColDel) that hold different data types depending on the driver.

There is no concept of overloading functions in VFP. The technique I am using for the get_median fucntion below simulates this concept. Finding the median of a cursor, array or delimeted string is not particularly hard to do, but developing a bullet-proof solution that handles each case in one shot, is.

The following program works, but has several flaws. Next week I will present an improved get_median function that addresses these flaws. I’ll expose some of them using boundary testing routines.

*-- The following creates a simple test to run get_median
cString = "1,2,3,4,5"
CREATE CURSOR crTemp (col1 c(1), value n(1))
INSERT INTO crTemp VALUES ("1",1)
INSERT INTO crTemp VALUES ("2",2)
INSERT INTO crTemp VALUES ("3",3)
INSERT INTO crTemp VALUES ("4",4)
 
DIMENSION aTestMedian[5,2]
aTestMedian[1,1] = 1
aTestMedian[2,1] = 2
aTestMedian[3,1] = 3
aTestMedian[4,1] = 4
aTestMedian[5,1] = 5
 
? get_median(1,cString,',')
? get_median(2,'crTemp','value')
? get_median(3,'aTestMedian',1)
 
*-- The following is the first version of get_median
FUNCTION get_median
    LPARAMETERS tnType, teName, teColDel
 
    *-- local declerations
    LOCAL lcSelCol AS String
    LOCAL lnItems, lnHalf, lnInc AS Integer
    LOCAL lnMedian AS Number
 
    *-- create the aMedian Array
    DO CASE
    CASE tnType = 1        && use a delimted string
        lnItems= ALINES( aMedian , STRTRAN(teName ,teColDel,CHR(13)) )
        FOR lnInc = 1 TO lnItems
            aMedian[lnInc] = VAL(aMedian[lnInc])
        NEXT
    CASE tnType = 2       && calculate using an alias
        lcSelCol = FIELD(teColDel,teName)
        lcSQL = "SELECT " + lcSelCol + " FROM " + teName + ;
              " ORDER BY " + lcSelCol + " INTO ARRAY aMedian"
        &lcSQL
        lnItems= _TALLY
    CASE tnType = 3        && calculate using an array
        lnItems= ALEN(&teName,1)
        DIMENSION aMedian[lnItems]
        FOR lnInc = 1 TO lnItems
            aMedian[lnInc] = &teName.[lnInc,teColDel]
        NEXT
    ENDCASE
 
   *-- find the medianASORT(aMedian)
    lnHalf = CEILING( lnItems / 2)
    IF MOD(lnHalf,2) = 0
        lnMedian = ( aMedian[lnHalf] + aMedian[lnHalf+1] ) / 2
    ELSE
        lnMedian = aMedian[lnHalf]
    ENDIF
 
    RETURN lnMedian
 
ENDFUNC