Published May 25th, 2007
Analyze Apache Raw Access Logs in FoxPro
I’ve been increasingly interested in examining the access logs on my server. For the past 6 or 7 years, I’ve used some of the cool little tools provided by cPanel located in their Web/FPT Statistics folder (like Webalizer and Awstats). While nice to look at, I’ve always felt a bit handicapped by the interface and general content. I have my own ideas on what I’d like to see and how I want to see it.
In this and subsequent articles, I’ll build an Apache Raw Access Log analysis application that will (a) download the access logs once or twice a day, (b) dump the data into a database, (c) integrate this data with other information (perhaps some demographics info, error logs, site information, date and times of blog posts, etc) (d) and in the end, build a suite of reports and filters that can be used to identify things like abuse, spam, and positive traffic.
My hope is to do a good enough job with this that, in the end, could end up in the Fox Open Source community if there’s even a remote bit of interest. This should evolve in such a way that any access log from any server would be supported, but in the initial stages, I’ll be using my Raw Access Logs from Apache. For those interested, here are the details of my server configuration:
Operating system: Linux
Apache version: 1.3.37 (Unix)
cPanel Build: 10.9.0-STABLE 9966
cPanel Pro: 1.0 (RC36)
The Log File
I started by looking at the raw files created by Awstats on my server (/tmp/awstats/). There is a ton of information in there which is already parsed and used by the various free analysis tools. But what I really need is a raw file that contains a line for each ‘hit’ to my server (including 200, 404, and 500 messages). The Raw Access Logs are perfect (in cPanel, you can manage them by opening the Raw Log Manager). They contain a line for each hit, some referral information, http status codes, etc.
The only problem with the file is its format (NCSA combined/XLF/ELF)! A single line looks like:
206.17.xxx.xxx - - [23/May/2007:07:59:21 -0500] “GET /Styles/color.css HTTP/1.1″ 304 - “http://www.rabbitstewlounge.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)”
Fortunately, FoxPro can handle this sort of thing quite easily. With a log file containing potentially thousands of lines each day, you can see how reading this file without parsing can be nightmarish (not to mention highly inefficient). So, this is where I started. I downloaded today’s access file from my site, and wrote the following code. As this project matures, I’ll be sure to clean this up (hopefully with some reader’s suggestions!) and provide the source in zip format. For now, follow along:
Get Started
First, download your raw access file. Load cPanel and click “Raw Access Logs”. Select the raw access log for your domain and unzip it. In this example, I’ve also changed it’s name from “rabbitstewlounge.com” to “raw_log.txt”.
The Basic Code
I left out some environmental settings, such as SET SAFETY and SET MEMOWIDTH for now. Eventually this will be part of a project, which will set all these things later. Also, as this code is merely used to test the theory, I haven’t included any real error checking. I’d be interested in hearing some ideas on how to tweak this for better speed and reliability.
I did consider writing this using low-level file functions (FOPEN, FGETS, FREAD, etc), but in the end I felt that the format was too wacky and I had better rely on STREXTRACT and SUBSTRING instead.
CREATE DATABASE C:/webstats/webstats OPEN DATABASE C:/webstats/webstats CREATE TABLE C:/webstats/webstats.dbf (; stat_id integer AUTOINC,; ip_address character(15),; date_time datetime,; gmt_offset character(5),; access_method character(4),; location varchar(200),; http_version character(10),; http_code character(3),; bytes integer,; referrer varchar(200),; agent varchar(200),; source_file varchar(60)) lcData = FILETOSTR("C:/webstats/raw_log.txt") FOR n = 1 TO MEMLINES(lcData) lcLine = MLINE(lcData,n) *-- IP address is first. lcIP = ALLTRIM(SUBSTR(lcLine,1,AT("- -",lcLine)-1)) *-- date time and GMT offset info. Use STREXTRACT *-- to get the data between the first two quotes *-- on the line. Then break the datetime string *-- into separate lines to use MLINE to build a *-- proper datetime string lcDateTimeGMT = ; STRTRAN(STREXTRACT(lcLine,"[","]")," ",CHR(13)) lcDT = ; STRTRAN(STRTRAN(MLINE(lcDateTimeGMT,1),"/",CHR(13)),":",CHR(13)) IF !EMPTY(lcDT) ltDatetime = DATETIME(; VAL(MLINE(lcDT,3)),; get_monthnum(MLINE(lcDT,2)),; VAL(MLINE(lcDT,1)),; VAL(MLINE(lcDT,4)),; VAL(MLINE(lcDT,5)),; VAL(MLINE(lcDT,6))) ELSE ltDatetime = {} ENDIF lcGMTOffset = MLINE(lcDateTimeGMT,2) *-- Request data. Just to 'switch-it-up' a bit, *-- I'm using GETWORDNUM here because *-- I need 3 chunks of data within the *-- STREXTRACT, all conveniently separated *-- by a space. lcRequest = STREXTRACT(lcLine,'"','"') lcMethod = GETWORDNUM(lcRequest,1) lcLocation = GETWORDNUM(lcRequest,2) lcHTTPVer = GETWORDNUM(lcRequest,3) *-- HTTP Code and bytes returned. lcHTTPCode = SUBSTR(lcLine, ; AT(lcRequest,lcLine)+LEN(lcRequest)+2,3) lnBytes = VAL(SUBSTR(lcLine,AT(" " + lcHTTPCode + ; " ",lcLine)+LEN(lcHTTPCode)+2,; AT('"',lcLine,3) - (AT(" " + lcHTTPCode + " ",lcLine)+; LEN(lcHTTPCode)+2))) *-- referrer lcReferer = STREXTRACT(lcLine,'"','"',3) *-- User's agent lcAgent = STREXTRACT(lcLine,'"','"',5) INSERT INTO webstats (ip_address,date_time,gmt_offset,; access_method, location,http_version,http_code,; bytes,referrer,agent,source_file ) ; VALUES (lcIP , ltDatetime , lcGMTOffset , lcMethod ,; lcLocation , lcHTTPVer ,lcHTTPCode,lnBytes,lcReferer,; lcAgent,"raw_log.txt" ) NEXT *-- Functions FUNCTION get_monthnum LPARAMETERS tcAbbr DIMENSION aMonthList[12] aMonthList[1] = "JAN" aMonthList[2] = "FEB" aMonthList[3] = "MAR" aMonthList[4] = "APR" aMonthList[5] = "MAY" aMonthList[6] = "JUN" aMonthList[7] = "JUL" aMonthList[8] = "AUG" aMonthList[9] = "SEP" aMonthList[10] = "OCT" aMonthList[11] = "NOV" aMonthList[12] = "DEC" RETURN ASCAN(aMonthList,LEFT(UPPER(ALLTRIM(tcAbbr)),3)) ENDFUNC
Next entry, I’ll see about downloading this file automatically (might take some ingenuity) and building in some logic to do so periodically. After that, I’ll write some algorithms to identify potential abuse situations, spam, and increased traffic, setting up email notifications in the process. Stay tuned!
Data Warehousing and Business Intelligence consultant, with expertise in business analysis, data modeling, and data integration. Extensive experience developing vertical and integrated desktop and Internet applications spanning municipal, clinical, and financial industries.