Posts Tagged Statistics

Analytical Databases

Most of what I discuss on Tod means Fox (and especially in my 34 Subsystem series) has to do with relational database engines (SQL Server 2005 and Visual FoxPro in particular). For most data integration projects, desktop and web solutions, and data warehouses, the relational database is all you’ll need: it can be used to create and manage both relational and dimensional models. But in Business Intelligence applications, chances are you’ll need an additional, more specialized form of multidimensional data storage and retrieval. Analytical databases cover this need. In fact, Dr. Codd, the inventor of the relational database, often described analytical databases as a “necessary companion” to relational databases.

The OLAP Story

OLAP stands for Online Analytical Processing, which is a terrible name for multidimensional analytical databases. Instead of a more meaningful name, like FASMI (Fast Analysis of Shared Multidimensional Information) proposed by Nigel Pendse of The OLAP Report, we seem to be stuck with OLAP. With great pain, I will continue to discuss analytical databases using the OLAP terminology.

Basically, an OLAP (*sigh*) database has certain characteristics that set it apart from relational databases, and Pendse captures all of them in the FASMI acronym. Check out The OLAP Report for more details. If you’re just getting into business analytics, or are in the process of evaluating analytical database software, you must visit that site!

The Smartness Factor

I like to think of the relational database as “dumb” and the analytic database as “smart”. Let me explain:

The primary focus of analytic servers is to get the (often pre-calculated) data out of the database as quickly as possible, allowing the user to zoom in and out along different hierarchies. Contrast this with Codd’s relational model, which seeks to eliminate data anomalies at transaction time through normalization: data retrieval is comparatively slow and often complex. With this in mind, you can get a feel for how a database engine can store, catalog, and retrieve data differently. (In fact, this is what makes dimensional modeling in general so favorable for querying and analytics: it is not bound by the restrictions imposed by normalization. I’d also note that, like dimensional models, OLAP databases are multidimensional.)

While the relational database plays a very important role, so does the multidimensional (OLAP) database; it’s simply built differently to serve a different need.

Primarily, analytic servers can manage aggregates, aggregate navigation, and complex calculations and summaries across hierarchies at blazing speeds. These abilities were borrowed from both navigational database designs (think of the DOM or a wiki) and hierarchical designs (trees with branches and leaves). A lot of this has to do with how the data is stored.

Storage

Analytic servers offer a different type of storage. MOLAP, or Multidimensional OLAP, can be much more efficient than relational engines. Some tools (like SQL Server Analysis Services) let you store your analytical databases relationally (ROLAP) or using a hybrid approach (HOLAP). Personally, I see no benefit at all to ROLAP, aside from real-time systems where using the relational database is a must because of the latency involved in updating the OLAP database. (Not to digress too much here, but even real-time environments can use MOLAP storage by using partitions and caches correctly.)

MOLAP is more natural and faster. This may differ depending on your tool choice, but I challenge someone to tell me differently. For a detailed discussion of how MOLAP and ROLAP engines store data, try the “Analysis Services 2005 Performance Guide,” which describes the way Analysis Services does it.

HOLAP storage is an interesting option and could actually perform well (even better than MOLAP in some instances) with less disk usage. But a lot of this depends on how many aggregates are defined and how often the system needs to query the relational database. It might be a good consideration, but if you’re not sure and you have plenty of disk space (disk space is cheap), then go with MOLAP.

The Smartness Factor 2 (the MDX language)

I like to think of SQL as “dumb” and MDX as “smart”. Let me explain:

The MultiDimensional eXpressions (MDX) language was created about 10 years ago by Microsoft to work specifically with multidimensional data stored in analytical servers. This OLAP-specific language improves upon SQL syntax, removing much of the bulkiness associated with that language. MDX is an elegant and highly relevant partner to the analytical database. While you could get what you need using SQL, almost every SQL statement will be a challenging one; it’s like cutting down a tree with a hand saw instead of a chain saw. MDX is painless and intuitive: concepts such as dimensions, hierarchies, and sets are built into the syntax. For more on MDX and the mdXML standard, please visit XML For Analysis.

Central to the MDX language is the cube concept, which deserves a proper introduction:

Cubes

OLAP data is stored in structures called cubes. A cube is a 3-dimensional solid: given a coordinate on each of three faces that meet at a vertex, you can locate a precise point somewhere within the cube itself. That point represents the aggregate metric you want to view. The faces represent the different dimensions (like Product, Customer, and Time) used to find the point, and each face is further broken down by its dimension’s hierarchies.

Cubes are sometimes (and more aptly) called hypercubes, tesseracts, or “n-cubes,” because analysis often requires more than three dimensions. As soon as you need more than three faces, you need higher dimensionality.

I admit, when I first heard about cubes I was a bit intimidated; I felt they were too complex and advanced. Now, working with cubes feels natural. Notice, though, that the concept of a cube is almost the same as a star schema. The only difference (at least that I can really think of) is that a star schema generally stores atomic data, so, barring a usable aggregate fact table, calculations need to be done on the fly; the cube, in theory, stores the results of those calculations up front.

Dimensional Modeling

If you read my blog, then you know I advocate dimensional models for proper data warehousing. If you also advocate and use dimensional models, then the concepts discussed so far will fit you well.

What’s great about using dimensional models in your relational database is that your OLAP database will almost mirror the design, making the development and deployment of your cubes that much easier. Not only that, but the cubes will likely load faster and be easier to maintain when built off the dimensional model. Note that you can still achieve most of this without an analytic server, just by using a properly constructed set of business process dimensional models; the OLAP database is simply smarter, making several tasks (navigating hierarchies, security, and calculations, to name a few) easier.


Newton’s Method for Approximating the Zeros of a Real-valued Function

Need to find the root of a real-valued function and simple algebra just won’t cut it? Then Newton’s method is for you (if you can remember back to your calculus days, that is).

Newton’s method is a root-finding algorithm that uses iteration to solve nonlinear equations. The process involves making a guess at the true solution and then applying a formula to get a better guess; this repeats until an (approximate) answer is found. The method is part of numerical analysis, the “branch of mathematics dealing with methods for obtaining approximate numerical solutions of mathematical problems” (dictionary.com). Specifically, these problems fall into the “continuous mathematics” category and are common in medicine, engineering, science, and business. (For a good discussion of continuous versus discrete mathematics, check out this blog entry by Mark C. Chu-Carroll or this page from ESC.EDU.) As you might guess, continuous mathematics is very important in business intelligence, specifically statistical analysis of data.

The general form to find the approximation is as follows:

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}

Here’s how it works: start with an initial guess (get out your graph paper!). The method draws the tangent line to the function at that guess, then computes where that tangent crosses the x-axis. The x-intercept is (usually) a better approximation of the root than the initial guess. The process then repeats with the x-intercept as the new guess.
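One line of algebra connects that picture to the formula above: the tangent at the current guess x_n is y = f(x_n) + f'(x_n)(x - x_n), and setting y = 0 gives

0 = f(x_n) + f'(x_n)\,(x_{n+1} - x_n) \quad\Longrightarrow\quad x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}

which is exactly the step the code below repeats, with the derivative approximated numerically by a forward difference.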

The FoxPro code:

LOCAL lcFunction, lnInitialGuess, lnIterations

*-- set the function, my initial guess, and the number of iterations
lcFunction = "(1/4)*x^3 - 3*x^2 + (3/4)*x - 2"
lnInitialGuess = 2
lnIterations = 10

*-- run newton
newton(lcFunction, lnInitialGuess, lnIterations)

*-- foxpro does newton:
FUNCTION newton
LPARAMETERS tcFunction, tnInitialGuess, tnIterations

LOCAL lcFunction, lnInitialGuess, lnIterations, limh
LOCAL lnNewGuess, lnIter, x, lnY, lnSlope, lnHoldx
LOCAL lnFuncpluslimh, lnFuncminlimh, lnys

CREATE CURSOR crNewton (col_n n(2), col_Xn n(14,5), col_fXn n(14,5), col_f1Xn n(14,5), col_ys n(14,5), col_ng n(14,5))

*-- small step for the numeric (forward-difference) derivative
limh = 0.000001
lcFunction = ALLTRIM(tcFunction)
lnInitialGuess = EVL(tnInitialGuess, 0)
lnIterations = EVL(tnIterations, 1)

lnNewGuess = lnInitialGuess

*-- Loop for the requested number of iterations
FOR lnIter = 1 TO lnIterations

  *-- Set the current guess
  x = lnNewGuess

  *-- Evaluate the function at the guess: f(x)
  lnY = &lcFunction

  *-- Approximate the derivative f'(x) with a forward difference: (f(x + h) - f(x)) / h
  lnHoldx = x
  x = x + limh
  lnFuncpluslimh = &lcFunction
  x = lnHoldx
  lnFuncminlimh = &lcFunction
  lnSlope = (lnFuncpluslimh - lnFuncminlimh) / limh

  *-- Newton step: f(x) / f'(x)
  lnys = lnY / lnSlope

  *-- Find the new guess
  lnNewGuess = x - lnys

  *-- Record this iteration
  INSERT INTO crNewton (col_n  ,col_Xn ,col_fXn ,col_f1Xn ,col_ys ,col_ng) ;
                VALUES (lnIter ,x      ,lnY     ,lnSlope  ,lnys   ,lnNewGuess)

NEXT

SELECT crNewton
LOCATE
BROWSE NORMAL

In this run, with the equation I defined, the approximate root is x = 11.80326. If you plug this number in for x in the original equation, it evaluates to .00014. Pretty close! And that’s exactly what this algorithm is great for. Try these additional functions to get the hang of it; you can also change the guess and iteration count as you like:

  • "x + 1"
  • "x^3-12"
  • "x*(x^3)-(3/4)*x"
  • "((x^4+1)*(x^3+2))-11"
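
For example, a quick run against the second function in that list looks like this (the starting guess and iteration count are just my picks; the real root is the cube root of 12, roughly 2.289, so the col_ng column should settle near that value):

*-- try Newton's method on x^3 - 12 = 0
lcFunction = "x^3-12"
lnInitialGuess = 3
lnIterations = 8
newton(lcFunction, lnInitialGuess, lnIterations)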

 


Business Intelligence Through Web Analysis

I just received a surprising email regarding my previous post about Analyzing Apache’s Raw Access Logs in FoxPro. The commenter wrote “dude, you’re nuts. Why waste your time on this?!? I use Awstats and that works fine”.

I sat back in my chair quite puzzled. After all, I know that Awstats doesn’t even come close to giving me the answers I need to grow my business and website. Then I realized my folly: I jumped right into my example in my last post without fully explaining the goal of the project. I’ll try to redeem myself now.

Awstats and similar tools (in my case, provided with cPanel) are retrospective reporting tools. They give you nice charts and display some very interesting numbers and facts about the types of traffic generated on your site. If you’re good with numbers and can associate events to dates (in your head) then you may be able to notice some interesting patterns (like, “hey – it seems that whenever I post a new blog entry, my hits double!”).

But this isn’t good enough for more serious projects (though it isn’t a bad place to start).

When you’re growing a business (whether you’re a blogger or selling widgets), good information is invaluable, and you need consistent, scientific analysis to pull it together (hunches and gut feelings don’t count). Pool raw data from a variety of sources, integrate it, clean and add value to it, and compile it into some incredibly useful and valuable information (read: Business Intelligence). This information can help you make decisions like (a) how much to spend on advertising, (b) whether or not to sponsor an event (such as FoxForward), (c) how many blog posts to make per week to keep readers interested, (d) whether to sign up with Google AdSense, or (e) which design elements and layout plans are making the greatest impact.

Data smog is a real issue, however. Too much data will not only waste your valuable analysis time, but integrating meaningless data will do nothing but eat away at your resources (and give you a headache). The key, then, is to do a little preparation before you begin a project like this (duh!). I think there are two important steps: (1) monetize all elements of your business, and (2) identify all key performance indicators (KPIs). Armed with this information, you will be able to build a dimensional model (in VFP, of course!) with an incredibly rich fact table. Monetizing helps you assign value to all your tasks, and KPIs allow you to measure the benefits of those endeavors. Over the course of this project, I’ll be itemizing these two steps in greater detail. As an example, for my blog todmeansfox, I’ve monetized the following items:

  • Posting a new blog entry: $90.00 / post (my time and effort to post based on my current rates, discounted)
  • Responding to posts: $10.00 / comment
  • Social networking: $90.00 / contact (includes setting up accounts on del.ico.us for example)
  • Advertising: $40.00 / free advertising (example, updating my fox.wiki profile)
  • Advertising: case by case (I have not done any advertising yet)
  • Sponsoring: case by case (I have not sponsored an event yet)
  • Updating CSS files: $60.00 / hour (how much does it ‘cost’ me to update my site’s layout and colors?)

Next, I tried to identify all important KPIs:

  • referrals
  • length of stay
  • out clicks
  • total number of hits and page views
  • total number of comments made
  • total number of emails received
  • Blackstone Providence leads
  • consulting job hours from website leads

Of course, as I do more data profiling, I may uncover some other very useful pieces of data that I can later integrate.
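
To give a feel for where this is headed, here is a rough sketch of the kind of daily fact table those two lists might roll up into. Everything here (the table name, the daily grain, the column names) is hypothetical at this stage; the measures simply combine the monetized activities and KPIs listed above.

*-- hypothetical sketch: one fact row per day, with the KPIs as measures and
*-- the monetized activities rolled up into cost columns
CREATE TABLE fact_site_activity (;
    date_key integer,;
    hits integer,;
    page_views integer,;
    comments_made integer,;
    emails_received integer,;
    posts_published integer,;
    posting_cost n(10,2),;
    comment_cost n(10,2),;
    lead_count integer)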

The goal of this project, therefore, is rather simple: make better business decisions. Using the Internet, I will gather the right data, integrate it in a meaningful way, and use OLAP to analyze and report on the findings. I’ll use data mining, profiling, and trend analysis to identify abuse and spam, as well as identify areas where improvements in layout, methodology, and content can make a greater impact. My hope is to generate more business by leveraging my current assets. On top of that, I want to do it in the open so others can benefit.

The first step in my process is data profiling, where I’ll gather the data I think I can use, analyze it, test it for quality, and prepare it for my dimensional model. My last blog post attempted to start that process by simply downloading and parsing the daily Apache raw access log file (which I should mention is in the NCSA combined/XLF/ELF log format, in case you were wondering).

As you can see, Awstats can only get me so far (actually, not that far at all).


Analyze Apache Raw Access Logs in FoxPro

I’ve been increasingly interested in examining the access logs on my server. For the past 6 or 7 years, I’ve used some of the cool little tools provided by cPanel in its Web/FTP Statistics folder (like Webalizer and Awstats). While nice to look at, I’ve always felt a bit handicapped by the interface and general content. I have my own ideas about what I’d like to see and how I want to see it.

In this and subsequent articles, I’ll build an Apache Raw Access Log analysis application that will (a) download the access logs once or twice a day, (b) dump the data into a database, (c) integrate this data with other information (perhaps some demographics info, error logs, site information, date and times of blog posts, etc) (d) and in the end, build a suite of reports and filters that can be used to identify things like abuse, spam, and positive traffic.

My hope is to do a good enough job with this that, in the end, it could end up in the Fox Open Source community, if there’s even a remote bit of interest. It should evolve in such a way that any access log from any server would be supported, but in the initial stages, I’ll be using my Raw Access Logs from Apache. For those interested, here are the details of my server configuration:

Operating system: Linux
Apache version: 1.3.37 (Unix)
cPanel Build: 10.9.0-STABLE 9966
cPanel Pro: 1.0 (RC36)

The Log File

I started by looking at the raw files created by Awstats on my server (/tmp/awstats/). There is a ton of information in there which is already parsed and used by the various free analysis tools. But what I really need is a raw file that contains a line for each ‘hit’ to my server (including 200, 404, and 500 messages). The Raw Access Logs are perfect (in cPanel, you can manage them by opening the Raw Log Manager). They contain a line for each hit, some referral information, http status codes, etc.

The only problem with the file is its format (NCSA combined/XLF/ELF)! A single line looks like:

206.17.xxx.xxx - - [23/May/2007:07:59:21 -0500] "GET /Styles/color.css HTTP/1.1" 304 - "http://www.rabbitstewlounge.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)"

Fortunately, FoxPro can handle this sort of thing quite easily. With a log file containing potentially thousands of lines each day, you can see how trying to read it without parsing would be nightmarish (not to mention highly inefficient). So this is where I started: I downloaded today’s access file from my site and wrote the following code. As this project matures, I’ll be sure to clean it up (hopefully with some readers’ suggestions!) and provide the source in zip format. For now, follow along:

Get Started

First, download your raw access file. Load cPanel and click “Raw Access Logs”. Select the raw access log for your domain and unzip it. In this example, I’ve also changed its name from “rabbitstewlounge.com” to “raw_log.txt”.

The Basic Code

I left out some environmental settings, such as SET SAFETY and SET MEMOWIDTH, for now; eventually this will be part of a project that sets all of these things up front. Also, since this code is merely meant to test the theory, I haven’t included any real error checking. I’d be interested in hearing ideas on how to tweak it for better speed and reliability.
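
Just to give an idea of what that settings block might look like, here is a minimal sketch (the values are my own assumptions, not part of the project yet). SET MEMOWIDTH matters more than it looks: MEMLINES() and MLINE(), used below to walk the file, wrap lines at that width, and raw log lines are far longer than the default.

*-- sketch of the environment settings omitted from the listing below
SET SAFETY OFF           && don't prompt before overwriting the database and table
SET TALK OFF             && keep command feedback quiet inside the loop
SET MEMOWIDTH TO 8192    && MEMLINES()/MLINE() wrap at this width; log lines are long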

I did consider writing this using the low-level file functions (FOPEN(), FGETS(), FREAD(), and so on), but in the end I felt the format was too wacky and I was better off relying on STREXTRACT() and SUBSTR() instead.
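
For comparison, the low-level route I decided against would look roughly like this. It reads the log one line at a time instead of pulling the whole file into memory with FILETOSTR(), which could matter for very large logs; this is only a sketch, not part of the program below.

*-- low-level alternative (sketch only)
LOCAL lnHandle, lcLine
lnHandle = FOPEN("C:/webstats/raw_log.txt")
IF lnHandle > 0
    DO WHILE !FEOF(lnHandle)
        lcLine = FGETS(lnHandle, 8192)    && one raw log line
        *-- ...parse lcLine just as in the loop below...
    ENDDO
    FCLOSE(lnHandle)
ELSE
    ? "Could not open the log file."
ENDIF

The parsing pass I actually settled on follows.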

CREATE DATABASE C:/webstats/webstats
OPEN DATABASE C:/webstats/webstats
CREATE TABLE C:/webstats/webstats.dbf (;
    stat_id integer AUTOINC,;
    ip_address character(15),;
    date_time datetime,;
    gmt_offset character(5),;
    access_method character(4),;
    location varchar(200),;
    http_version character(10),;
    http_code character(3),;
    bytes integer,;
    referrer varchar(200),;
    agent varchar(200),;
    source_file varchar(60))
 
lcData = FILETOSTR("C:/webstats/raw_log.txt")
 
FOR n = 1 TO MEMLINES(lcData)
 
    lcLine = MLINE(lcData,n)
 
    *-- IP address is first.
    lcIP = ALLTRIM(SUBSTR(lcLine,1,AT("- -",lcLine)-1))
 
    *-- date time and GMT offset info. Use STREXTRACT
    *-- to get the data between the first two quotes
    *-- on the line. Then break the datetime string
    *-- into separate lines to use MLINE to build a
    *-- proper datetime string
 
    lcDateTimeGMT = ;
        STRTRAN(STREXTRACT(lcLine,"[","]")," ",CHR(13))
    lcDT = ;
        STRTRAN(STRTRAN(MLINE(lcDateTimeGMT,1),"/",CHR(13)),":",CHR(13))
    IF !EMPTY(lcDT)
        ltDatetime = DATETIME(;
            VAL(MLINE(lcDT,3)),;
            get_monthnum(MLINE(lcDT,2)),;
            VAL(MLINE(lcDT,1)),;
            VAL(MLINE(lcDT,4)),;
            VAL(MLINE(lcDT,5)),;
            VAL(MLINE(lcDT,6)))
    ELSE
        ltDatetime = {}
    ENDIF
 
    lcGMTOffset = MLINE(lcDateTimeGMT,2)
 
    *-- Request data. Just to 'switch-it-up' a bit,
    *-- I'm using GETWORDNUM here because
    *-- I need 3 chunks of data within the
    *-- STREXTRACT, all conveniently separated
    *-- by a space.
 
    lcRequest = STREXTRACT(lcLine,'"','"')
    lcMethod = GETWORDNUM(lcRequest,1)
    lcLocation = GETWORDNUM(lcRequest,2)
    lcHTTPVer = GETWORDNUM(lcRequest,3)
 
    *-- HTTP Code and bytes returned.
    lcHTTPCode = SUBSTR(lcLine, ;
        AT(lcRequest,lcLine)+LEN(lcRequest)+2,3)
    lnBytes = VAL(SUBSTR(lcLine,AT(" " + lcHTTPCode + ;
        " ",lcLine)+LEN(lcHTTPCode)+2,;
        AT('"',lcLine,3) - (AT(" " + lcHTTPCode + " ",lcLine)+;
        LEN(lcHTTPCode)+2)))
 
    *-- referrer
    lcReferer = STREXTRACT(lcLine,'"','"',3)
 
    *-- User's agent
    lcAgent = STREXTRACT(lcLine,'"','"',5)
 
    INSERT INTO webstats (ip_address,date_time,gmt_offset,;
        access_method, location,http_version,http_code,;
        bytes,referrer,agent,source_file ) ;
    VALUES  (lcIP , ltDatetime , lcGMTOffset , lcMethod ,;
        lcLocation , lcHTTPVer ,lcHTTPCode,lnBytes,lcReferer,;
        lcAgent,"raw_log.txt" )
 
NEXT
 
*-- Functions
FUNCTION get_monthnum
LPARAMETERS tcAbbr
 
DIMENSION aMonthList[12]
aMonthList[1] = "JAN"
aMonthList[2] = "FEB"
aMonthList[3] = "MAR"
aMonthList[4] = "APR"
aMonthList[5] = "MAY"
aMonthList[6] = "JUN"
aMonthList[7] = "JUL"
aMonthList[8] = "AUG"
aMonthList[9] = "SEP"
aMonthList[10] = "OCT"
aMonthList[11] = "NOV"
aMonthList[12] = "DEC"
 
RETURN ASCAN(aMonthList,LEFT(UPPER(ALLTRIM(tcAbbr)),3))
ENDFUNC

Next entry, I’ll see about downloading this file automatically (might take some ingenuity) and building in some logic to do so periodically. After that, I’ll write some algorithms to identify potential abuse situations, spam, and increased traffic, setting up email notifications in the process. Stay tuned!


Interpolation

For those who aren’t familiar with the term, and since I’m not too good at creating my own definitions (go on, ask Andy Kramek!), I thought I’d borrow a definition from Wikipedia to get started:

“In the mathematical subfield of numerical analysis, interpolation is a method of constructing new data points from a discrete set of known data points.” (Source)

If you follow that source link to the Wikipedia article, you can read lots of nice things about interpolation and how it is used in math and science. About 8 years ago, I needed such a function: I needed to estimate values f(x) at points x that fell between known data points. I wrote the following linear interpolation algorithm in FoxPro to get my answers:

*-- this sets up a test cursor with some data points
CREATE CURSOR crDataSet (fld_x n(4,2), fld_fx n(4,2))
INDEX ON fld_x TAG fld_x
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (1,1.75)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (3,2.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (4,3.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (5,4.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (7,4.25)
 
? interpolate(2,"crDataSet")   && returns 1.875
? interpolate(3,"crDataSet")     && returns 2.000
? interpolate(3.1,"crDataSet")  && returns 2.100
? interpolate(3.9,"crDataSet")  && returns 2.900
? interpolate(10,"crDataSet")   && returns 4.625
 
FUNCTION interpolate
    LPARAMETERS tnX, tcDataSet
 
    LOCAL lcfld_x , lcfld_fx AS String
    LOCAL lcOldNear , lcOldExac AS String
    LOCAL X1, Y1, X2, Y2 AS Number
    LOCAL lFound AS Logical 
 
    *-- names of the first two fields in the tcDataSet cursor
    lcfld_x = FIELD(1, tcDataSet)
    lcfld_fx = FIELD(2, tcDataSet)
 
    *-- important to adjust these settings
    lcOldNear = SET('NEAR')
    lcOldExac = SET('EXACT')
    SET NEAR ON
    SET EXACT OFF 
 
    SELECT (tcDataSet)
 
    lFound = SEEK(tnX,tcDataSet)
    SET NEAR &lcOldNear
    SET EXACT &lcOldExac
 
    IF lFound
        *-- if found, then just return the point
        RETURN EVAL(tcDataSet + '.' + lcfld_fx)
    ELSE
        *- not found, so need to interpolate
        cXStr = tcDataSet + '.' + lcfld_x
        cYStr = tcDataSet + '.' + lcfld_fx
 
        *-- if the seek put us at eof, back up 1
        SELE (tcDataSet)
        IF EOF(tcDataSet)
            SKIP -1
        ENDIF
 
        *-- evaluate the first point
        X2 = EVAL(cXStr)
        Y2 = EVAL(cYStr)
 
        *-- evaluate the second point
        IF RECNO(tcDataSet) = 1
            SKIP +1
        ELSE
            SKIP -1
        ENDIF
        X1 = EVAL(cXStr)
        Y1 = EVAL(cYStr)
 
        *-- draw a line and return the result!
        RETURN ymxb(x1,y1,x2,y2,tnX)
    ENDIF
 
ENDFUNC  
 
FUNCTION ymxb
    LPARAMETERS X1,Y1,X2,Y2,X3
 
    *-- y = mx + b
    LOCAL nSlope , B1
 
    nSlope = (Y1-Y2)/(X1-X2)
    B1 = -((nSlope) * X1 - Y1)
    RETURN (nSlope) * X3 + B1
 
ENDFUNC

There you have it!
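
For reference, the ymxb helper is just the two-point form of a line: given the neighboring points (x_1, y_1) and (x_2, y_2), it evaluates

m = \frac{y_1 - y_2}{x_1 - x_2}, \qquad b = y_1 - m\,x_1, \qquad f(x_3) \approx m\,x_3 + b

at the requested point x_3.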

A couple of notes: for space reasons, I’ve omitted some key error checking and bullet-proofing (I started doing it, but realized that an extra 20 lines of code for this example was overkill). It would be a good idea, for example, to verify that the columns in the cursor passed in are numeric. Come to think of it, you should also verify that the correct parameters were passed!
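
A sketch of what those checks might look like at the top of interpolate (the specific tests and the .NULL. return on failure are just my choices):

    *-- sketch of the omitted parameter checks
    IF VARTYPE(tnX) # "N" OR VARTYPE(tcDataSet) # "C" OR !USED(tcDataSet)
        *-- wrong argument types, or the cursor/table isn't open
        RETURN .NULL.
    ENDIF
    IF VARTYPE(EVALUATE(tcDataSet + "." + FIELD(1, tcDataSet))) # "N"
        *-- the first column must be numeric
        RETURN .NULL.
    ENDIF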

In addition, I remember from math class that a linear method like this is not the most accurate: I’m using a straight line between points rather than a curve. I suppose a complementary function could be added alongside ymxb to fit a curve between the two points, but until I actually need that, I’m not game!


Sample Standard Deviation (and SQL)

The CALCULATE command offers a range of statistical operations in FoxPro. It works on fields in a table (or cursor) and can be filtered using a FOR clause and a scope clause. Calculating the mean, standard deviation, and variance is straightforward:

CALCULATE AVG(fieldname) TO variable
CALCULATE STD(fieldname) TO variable
CALCULATE VAR(fieldname) TO variable

And some other useful functions:

CALCULATE MIN(fieldname) TO variable
CALCULATE MAX(fieldname) TO variable
CALCULATE CNT(fieldname) TO variable
CALCULATE SUM(fieldname) TO variable

In a previous blog entry I wrote a program to calculate the median from a set of data. VFP does not have a native median function.

Population vs. Sample Data

Visual FoxPro doesn’t have a built-in standard deviation function for a sample. CALCULATE STD() computes the population standard deviation, and the same is true of CALCULATE VAR() for variance. The question then becomes: how can I get a sample standard deviation? I’m glad you asked! Here’s how:

FUNCTION std_devs
    LPARAMETERS tcTableField, tcPredicates
    LOCAL nRet, nsavg, nscnt AS Float
    LOCAL lcAlias AS String

    *-- the alias is the "stem" of the "alias.field" parameter
    lcAlias = JUSTSTEM(tcTableField)
    SELECT (lcAlias)

    IF !EMPTY(tcPredicates)
        CALCULATE AVG(&tcTableField) FOR &tcPredicates TO nsavg
        CALCULATE CNT(&tcTableField) FOR &tcPredicates TO nscnt
        SELECT SUM((&tcTableField - nsavg) ^ 2) AS ssum ;
            FROM &lcAlias ;
            WHERE &tcPredicates ;
            INTO CURSOR crTmpSDEVS2
    ELSE
        CALCULATE AVG(&tcTableField) TO nsavg
        CALCULATE CNT(&tcTableField) TO nscnt
        SELECT SUM((&tcTableField - nsavg) ^ 2) AS ssum ;
            FROM &lcAlias ;
            INTO CURSOR crTmpSDEVS2
    ENDIF

    *-- the sample standard deviation divides by n - 1
    nRet = SQRT(crTmpSDEVS2.ssum / (nscnt - 1))

    USE IN crTmpSDEVS2
    RETURN nRet

ENDFUNC
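
As a reminder of the difference this implements: with \bar{x} the mean from the CALCULATE AVG step and n the count, std_devs returns the sample formula on the left, while CALCULATE STD() gives the population formula on the right.

s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}
\qquad\qquad
\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}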

What about STD or VAR in an SQL statement? Again, glad you asked!

SELECT DISTINCT std_devp("crtmp.value") AS stddev ;
    FROM crtmp INTO CURSOR cResult
SELECT DISTINCT std_devs("crtmp.value") AS stddev ;
    FROM crtmp INTO CURSOR cResult

*-- here is the population version fit for SQL:
FUNCTION std_devp
    LPARAMETERS tcTableField, tcPredicates
    LOCAL nRet AS Float
    LOCAL lcAlias AS String

    lcAlias = JUSTSTEM(tcTableField)

    IF !EMPTY(tcPredicates)
        CALCULATE STD(&tcTableField) ;
            FOR &tcPredicates TO nRet IN (lcAlias)
    ELSE
        CALCULATE STD(&tcTableField) ;
            TO nRet IN (lcAlias)
    ENDIF

    RETURN nRet

ENDFUNC

A big issue with calling a custom function from an SQL statement is performance. The sample standard deviation function I wrote above is certainly not going to win any sprints (though it might hold out during a marathon). Some environmental housekeeping also needs to be added to the code above (restoring the original work area and accounting for deleted records would be nice improvements, for example). For what it’s worth, this should be a good stepping stone for anyone in need of a sample standard deviation in pure FoxPro, and being callable from an SQL statement increases its usability and worth.
