Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for May, 2007


Published May 31st, 2007

Fox Houses

I’d like a Fox house (heck, I’d settle for a den!).I love this article by Rob Paddock, “The House That Fox Built” featured in CoDe magazine. It’s a well-spirited, honest, and perhaps a bit on the sentimental side. Take a look!

Published May 31st, 2007

Business Intelligence Through Web Analysis

I just received a surprising email regarding my previous post about Analyzing Apache’s Raw Access Logs in FoxPro. The commenter wrote “dude, you’re nuts. Why waste your time on this?!? I use Awstats and that works fine”.

I sat back in my chair quite puzzled. After all, I know that Awstats doesn’t even come close to giving me the answers I need to grow my business and website. Then I realized my folly: I jumped right into my example in my last post without fully explaining the goal of the project. I’ll try to redeem myself now.

Awstats and similar tools (in my case, provided with cPanel) are retrospective reporting tool. They give you nice charts and display some very interesting numbers and facts about the types of traffic generated on your site. If you’re good with numbers and can associate events to dates (in your head) then you may be able to notice some interesting patterns (like, “hey – it seems that whenever I post a new blog entry, my hits double!”).

But this isn’t good enough for more serious projects (but isn’t a bad place to start either).

When growing your business (whether you’re a blogger or selling widgets), this type of information is invaluable. You need good, consistent, scientific analysis to pull it all together (hunches and gut feelings don’t count). Pool raw data (from a variety of sources), integrate it, clean and add value to it, and compile it to create some incredibly useful and valuable information (Read: Business Intelligence). This information can help you make decisions like (a) how much to spend on advertising, (b) whether or not to sponsor an event (such as FoxForward), (c) how many blog posts to make per week to keep the interest of readers, (d) should I sign up with Google AdSense, or (e) what design elements and layout plans are making the greatest impact.

Data smog is a real issue, however. Too much data can not only waste your valuable analysis time, but the integration of this meaningless data will do nothing but eat away at your resources (and give you a headache). The key then is to do a little preparation before you begin a project like this (duh!). I think there are two important steps (1) monetize all elements of your business, and (2) identify all key performance indicators (KPIs). Armed with this information, you will be able to build a dimensional model (in VFP of course!) with an incredibly rich fact table. Monetizing helps you assign value to all your tasks, and KPIs allow you to measure the benefits of these endeavors. Throughout the course of this project, I’ll be itemizing these two items in greater detail. As an example, for my blog todmeansfox, I’ve monetized the following items:

  • Posting a new blog entry: $90.00 / post (my time and effort to post based on my current rates, discounted)
  • Responding to posts: $10.00 / comment
  • Social networking: $90.00 / contact (includes setting up accounts on del.ico.us for example)
  • Advertising: $40.00 / free advertising (example, updating my fox.wiki profile)
  • Advertising: case by case (I have not done any advertising yet)
  • Sponsoring: case by case (I have not sponsored an event yet)
  • Updating CSS files: $60.00 / hour (how much does it ‘cost’ me to update my site’s layout and colors?)

Next, I tried to identify all important KPIs:

  • referrals
  • length of stay
  • out clicks
  • total number of hits and page views
  • total number of comments made
  • total number of emails received
  • Blackstone Providence leads
  • consulting job hours from website leads

Of course, as I do more data profiling, I may uncover some other very useful pieces of data that I can later integrate.

The goal of this project, therefore, is rather simple: make better business decisions. Using the Internet, I will gather the right data, integrate it in a meaningful way, and use OLAP to analyze and report on the findings. I’ll use data mining, profiling, and trend analysis to identify abuse and spam, as well as identify areas were improvements in layout, methodology, and content can make a greater impact. My hope is to generate more business by leveraging my current assets. On top of that, I want to do it in the open so others can benefit.

The first step in my process is data profiling, where I’ll gather the data I think I can use, analyze it, test it for quality, and prepare it for my dimensional model. My last blog post attempted to start that process by simply downloading and parsing the daily Apache raw access log file (which I should mention is in the NCSA combined/XLF/ELF log format, in case you were wondering).

As you can see, Awstats can only get me so far (actually, not that far at all).

Published May 25th, 2007

The CBS/Jericho Nuts Campaign is Really Taking Off!

Wow, what a difference a week makes! I wrote last week of my displeasure with CBS canceling my favorite show, Jericho. A petition was started and fan messageboards were buzzing. But then I got busy with work and my attention drifted away from Jericho and onto other things.

A colleague at work then informed that things have really taken off. First, many of the show’s actors and actresses have joined the discussion on the discussion boards. Skeet Ulrich wrote:

A week ago our show was canceled and it feels like time has stood still ever since. We loved bringing you this story. […] Your support is incredible and means more to all of us than you will ever know!

And from Lennie “The Hawk” James, who played my favorite character, Hawkins:

I write to you from my home in London utterly staggered by the events of the last week or so. I, along with all the cast and crew from the show, was totally gutted (UK slang for ‘very upset’) when news came down that our show was canceled.

Many others have chimed in as well. I have never seen anything like this before. The cast (and I bet even some crew members!) have stepped up and have become involved. Imagine if CBS continued the show now? Fans have had an opportunity to interact with their favorite characters. This can only create an even stronger base audience.

But perhaps the biggest and loudest development over the past week is the Nuts campaign!

As I write this, CBS has had the pleasure of getting MORE THAN 18,000 lbs of NUTS from JERICHO fans. Let me say that again: More than 18,000 POUNDS!

Check out http://www.nutsonline.com/jericho. In true capitalist spirit, this company stepped up and made ordering Jericho nuts super easy. All you have to do is go to their website, specify the amount you want to spend. And off they go.

Here are some YouTube videos:

http://www.youtube.com/watch?v=J9UogT5AWRo
http://www.youtube.com/watch?v=yoxePewKK8o
http://www.youtube.com/watch?v=g07OKqGSb0o

You might not be a fan of Jericho, but why not help Jericho fans make this statement by sending CBS your very own bag of NUTS!

Other websites:
http://www.jericholives.com/
http://www.jerichorallypoint.com/sm/index.php

Published May 25th, 2007

Analyze Apache Raw Access Logs in FoxPro

I’ve been increasingly interested in examining the access logs on my server. For the past 6 or 7 years, I’ve used some of the cool little tools provided by cPanel located in their Web/FPT Statistics folder (like Webalizer and Awstats). While nice to look at, I’ve always felt a bit handicapped by the interface and general content. I have my own ideas on what I’d like to see and how I want to see it.

In this and subsequent articles, I’ll build an Apache Raw Access Log analysis application that will (a) download the access logs once or twice a day, (b) dump the data into a database, (c) integrate this data with other information (perhaps some demographics info, error logs, site information, date and times of blog posts, etc) (d) and in the end, build a suite of reports and filters that can be used to identify things like abuse, spam, and positive traffic.

My hope is to do a good enough job with this that, in the end, could end up in the Fox Open Source community if there’s even a remote bit of interest. This should evolve in such a way that any access log from any server would be supported, but in the initial stages, I’ll be using my Raw Access Logs from Apache. For those interested, here are the details of my server configuration:

Operating system: Linux
Apache version: 1.3.37 (Unix)
cPanel Build: 10.9.0-STABLE 9966
cPanel Pro: 1.0 (RC36)

The Log File

I started by looking at the raw files created by Awstats on my server (/tmp/awstats/). There is a ton of information in there which is already parsed and used by the various free analysis tools. But what I really need is a raw file that contains a line for each ‘hit’ to my server (including 200, 404, and 500 messages). The Raw Access Logs are perfect (in cPanel, you can manage them by opening the Raw Log Manager). They contain a line for each hit, some referral information, http status codes, etc.

The only problem with the file is its format (NCSA combined/XLF/ELF)! A single line looks like:

206.17.xxx.xxx - - [23/May/2007:07:59:21 -0500] “GET /Styles/color.css HTTP/1.1″ 304 - “http://www.rabbitstewlounge.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)”

Fortunately, FoxPro can handle this sort of thing quite easily. With a log file containing potentially thousands of lines each day, you can see how reading this file without parsing can be nightmarish (not to mention highly inefficient). So, this is where I started. I downloaded today’s access file from my site, and wrote the following code. As this project matures, I’ll be sure to clean this up (hopefully with some reader’s suggestions!) and provide the source in zip format. For now, follow along:

Get Started

First, download your raw access file. Load cPanel and click “Raw Access Logs”. Select the raw access log for your domain and unzip it. In this example, I’ve also changed it’s name from “rabbitstewlounge.com” to “raw_log.txt”.

The Basic Code

I left out some environmental settings, such as SET SAFETY and SET MEMOWIDTH for now. Eventually this will be part of a project, which will set all these things later. Also, as this code is merely used to test the theory, I haven’t included any real error checking. I’d be interested in hearing some ideas on how to tweak this for better speed and reliability.

I did consider writing this using low-level file functions (FOPEN, FGETS, FREAD, etc), but in the end I felt that the format was too wacky and I had better rely on STREXTRACT and SUBSTRING instead.

CREATE DATABASE C:/webstats/webstats
OPEN DATABASE C:/webstats/webstats
CREATE TABLE C:/webstats/webstats.dbf (;
    stat_id integer AUTOINC,;
    ip_address character(15),;
    date_time datetime,;
    gmt_offset character(5),;
    access_method character(4),;
    location varchar(200),;
    http_version character(10),;
    http_code character(3),;
    bytes integer,;
    referrer varchar(200),;
    agent varchar(200),;
    source_file varchar(60))
 
lcData = FILETOSTR("C:/webstats/raw_log.txt")
 
FOR n = 1 TO MEMLINES(lcData)
 
    lcLine = MLINE(lcData,n)
 
    *-- IP address is first.
    lcIP = ALLTRIM(SUBSTR(lcLine,1,AT("- -",lcLine)-1))
 
    *-- date time and GMT offset info. Use STREXTRACT
    *-- to get the data between the first two quotes
    *-- on the line. Then break the datetime string
    *-- into separate lines to use MLINE to build a
    *-- proper datetime string
 
    lcDateTimeGMT = ;
        STRTRAN(STREXTRACT(lcLine,"[","]")," ",CHR(13))
    lcDT = ;
        STRTRAN(STRTRAN(MLINE(lcDateTimeGMT,1),"/",CHR(13)),":",CHR(13))
    IF !EMPTY(lcDT)
        ltDatetime = DATETIME(;
            VAL(MLINE(lcDT,3)),;
            get_monthnum(MLINE(lcDT,2)),;
            VAL(MLINE(lcDT,1)),;
            VAL(MLINE(lcDT,4)),;
            VAL(MLINE(lcDT,5)),;
            VAL(MLINE(lcDT,6)))
    ELSE
        ltDatetime = {}
    ENDIF
 
    lcGMTOffset = MLINE(lcDateTimeGMT,2)
 
    *-- Request data. Just to 'switch-it-up' a bit,
    *-- I'm using GETWORDNUM here because
    *-- I need 3 chunks of data within the
    *-- STREXTRACT, all conveniently separated
    *-- by a space.
 
    lcRequest = STREXTRACT(lcLine,'"','"')
    lcMethod = GETWORDNUM(lcRequest,1)
    lcLocation = GETWORDNUM(lcRequest,2)
    lcHTTPVer = GETWORDNUM(lcRequest,3)
 
    *-- HTTP Code and bytes returned.
    lcHTTPCode = SUBSTR(lcLine, ;
        AT(lcRequest,lcLine)+LEN(lcRequest)+2,3)
    lnBytes = VAL(SUBSTR(lcLine,AT(" " + lcHTTPCode + ;
        " ",lcLine)+LEN(lcHTTPCode)+2,;
        AT('"',lcLine,3) - (AT(" " + lcHTTPCode + " ",lcLine)+;
        LEN(lcHTTPCode)+2)))
 
    *-- referrer
    lcReferer = STREXTRACT(lcLine,'"','"',3)
 
    *-- User's agent
    lcAgent = STREXTRACT(lcLine,'"','"',5)
 
    INSERT INTO webstats (ip_address,date_time,gmt_offset,;
        access_method, location,http_version,http_code,;
        bytes,referrer,agent,source_file ) ;
    VALUES  (lcIP , ltDatetime , lcGMTOffset , lcMethod ,;
        lcLocation , lcHTTPVer ,lcHTTPCode,lnBytes,lcReferer,;
        lcAgent,"raw_log.txt" )
 
NEXT
 
*-- Functions
FUNCTION get_monthnum
LPARAMETERS tcAbbr
 
DIMENSION aMonthList[12]
aMonthList[1] = "JAN"
aMonthList[2] = "FEB"
aMonthList[3] = "MAR"
aMonthList[4] = "APR"
aMonthList[5] = "MAY"
aMonthList[6] = "JUN"
aMonthList[7] = "JUL"
aMonthList[8] = "AUG"
aMonthList[9] = "SEP"
aMonthList[10] = "OCT"
aMonthList[11] = "NOV"
aMonthList[12] = "DEC"
 
RETURN ASCAN(aMonthList,LEFT(UPPER(ALLTRIM(tcAbbr)),3))
ENDFUNC

Next entry, I’ll see about downloading this file automatically (might take some ingenuity) and building in some logic to do so periodically. After that, I’ll write some algorithms to identify potential abuse situations, spam, and increased traffic, setting up email notifications in the process. Stay tuned!

Published May 23rd, 2007

David Fulton

Some of you might know Dr. David Fulton as the guy who sold FP for Dos to Microsoft. (Imagine how different things would be today without that transaction!). Others might know him as having one of the worlld’s finest Stradivari and Guarneri violin collections!

More on Dr. Fulton.

Published May 17th, 2007

First FoxPro, now Jericho

It seems like all my favorites are getting canceled. I only watched one network show on television with any kind of regularity: Jericho on CBS. Now, after a mediocre second-half of the first season (CBS’ fault, by the way), CBS opted to cancel it in favor of more reality TV (something about a Lord of the Flies rip-off with kids in Mexico). If you watch Jericho, or want to sign another petition, please do so at: petitiononline.com

If you’re really into Jericho, please mail a bag of nuts to their studio (they’ll know why you sent them — it’s a reference to the final episode in which Jake tells the invading town of New Bern “Nuts” when asked to surrender Jericho):

CBS TV
ATTN: President RE: JERICHO
51 West 52dn Street
New York New York 10019

But I’m not posting this here for Jericho support. I’m wasting my digital space today (and possibly your time!) to rant about how Corporations really run our lives. Soma Somasegar of Microsoft tells us (my words:), ‘Hey, no more FoxPro. I know its hard to hear that you need to learn something else, but why don’t you check out .NET!’. And CBS (Kelly Kahl, CBS’ chief scheduling executive in particular) is saying (my words:), ‘hey, no more Jericho. I know it’s hard to hear that you need to watch something else, but why don’t you check out CSI?’.

Whether its our livelihood or our down-time, we’re being run by a handful of corporations that mold and shape us into their image. So much for the little guy. As FoxPro developers, we’re outnumbered. The old saying that ‘power exists in numbers’ is very true. That’s one of the reasons people like me like to hold on to the niche, the subculture, the atypical, exceptional, and extraordinary!

Jericho has a decent chance of being picked up by the Sci-fi Network, TNT, or others. I wish the same can be said for FoxPro.