Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Open Source’


Published March 12th, 2008

The Future of Open Source in BI

At last Thursday’s TDWI Benelux Chapter meeting, Davy Nys of Pentaho gave an overview of how open source could/might/will change the face of Business Intelligence. He gave a reasonably good vendor-neutral presentation (important for TDWI events). His session was a nice compliment to the “BI Trends” presentation given by Steve Hoberman an hour earlier.

Perhaps there is a trend for organizations to turn to more open-source software solutions for BI projects. After all, Davy’s company and others like Talend and CloverETL are making great strides in competing for market share.

As the big players in BI continue to merge and consolidate, it is pretty exciting to see several open source vendors and tools emerge. Is this a reflection of the community’s general dissatisfaction regarding commercial software? Are the open source solutions better? Is this truly a trend to be reckoned with? Should MS and others be worried?

Open Source Considerations

Davy stressed the importance of reducing the TCO of BI software. Without licensing fees, open-source can do just that. As Rick Sherman predicts in an article for DM Review, TCO will become a much more significant factor in the adoption of any and all BI trends. Licensing costs could impact TCO in such a dramatic way that a company can save a significant amount on their investment by switching to open source.

TCO isn’t the only consideration. Before evaluating open-source software, Davy suggests to examine the vibrancy of the community. A vibrant community with contributors and enthusiasts is a good sign for future product development and support.

Licensing is yet another very important consideration. As part of his presentation, Davy initiated a discussion on viral verses non-viral licensing. here’s is how I understand it: In a “viral” agreement, any source code changes to the product must be returned back to the public. Non-viral agreements allow companies to modify the source as they see fit without having to report back to the community. With viral licensing, I would have tremendous concerns about intellectual property and protecting business practices and methods.

What I found strange about the Q&A session and roundtable discussion that followed his presentation was the focus on the Pentaho business model. The concern of some BI professionals is thus: How can the economics associated with running an open source software company be sustainable over a long period of time? The question is relevant because as BI Professionals, we need to supply solutions that will be supportable, scalable, and usable in the future. The concern is that a company — like Pentaho — might not live and thrive long enough to meet the long-term needs of the business. I am not qualified to answer this question, but I admit, I wonder myself. Davy did an excellent job of presenting the case, however. If you need details on how Pentaho and other open source organizations make their money, it would be best to contact them directly. These are valid concerns and should be part of tool analysis that should go on early in a project’s planning.

I am open to the possibilities that open source can provide for BI applications. I use open source software all the time (from Wordpress to MySQl to Codeplex), but for a mission critical business initiative? Before making a decision like that, I would certainly need to have more information and a project to try it out on.

To learn more about Davy, you can check out his LinkedIn profile or visit the Pentaho website.

Published May 25th, 2007

Analyze Apache Raw Access Logs in FoxPro

I’ve been increasingly interested in examining the access logs on my server. For the past 6 or 7 years, I’ve used some of the cool little tools provided by cPanel located in their Web/FPT Statistics folder (like Webalizer and Awstats). While nice to look at, I’ve always felt a bit handicapped by the interface and general content. I have my own ideas on what I’d like to see and how I want to see it.

In this and subsequent articles, I’ll build an Apache Raw Access Log analysis application that will (a) download the access logs once or twice a day, (b) dump the data into a database, (c) integrate this data with other information (perhaps some demographics info, error logs, site information, date and times of blog posts, etc) (d) and in the end, build a suite of reports and filters that can be used to identify things like abuse, spam, and positive traffic.

My hope is to do a good enough job with this that, in the end, could end up in the Fox Open Source community if there’s even a remote bit of interest. This should evolve in such a way that any access log from any server would be supported, but in the initial stages, I’ll be using my Raw Access Logs from Apache. For those interested, here are the details of my server configuration:

Operating system: Linux
Apache version: 1.3.37 (Unix)
cPanel Build: 10.9.0-STABLE 9966
cPanel Pro: 1.0 (RC36)

The Log File

I started by looking at the raw files created by Awstats on my server (/tmp/awstats/). There is a ton of information in there which is already parsed and used by the various free analysis tools. But what I really need is a raw file that contains a line for each ‘hit’ to my server (including 200, 404, and 500 messages). The Raw Access Logs are perfect (in cPanel, you can manage them by opening the Raw Log Manager). They contain a line for each hit, some referral information, http status codes, etc.

The only problem with the file is its format (NCSA combined/XLF/ELF)! A single line looks like:

206.17.xxx.xxx - - [23/May/2007:07:59:21 -0500] “GET /Styles/color.css HTTP/1.1″ 304 - “http://www.rabbitstewlounge.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)”

Fortunately, FoxPro can handle this sort of thing quite easily. With a log file containing potentially thousands of lines each day, you can see how reading this file without parsing can be nightmarish (not to mention highly inefficient). So, this is where I started. I downloaded today’s access file from my site, and wrote the following code. As this project matures, I’ll be sure to clean this up (hopefully with some reader’s suggestions!) and provide the source in zip format. For now, follow along:

Get Started

First, download your raw access file. Load cPanel and click “Raw Access Logs”. Select the raw access log for your domain and unzip it. In this example, I’ve also changed it’s name from “rabbitstewlounge.com” to “raw_log.txt”.

The Basic Code

I left out some environmental settings, such as SET SAFETY and SET MEMOWIDTH for now. Eventually this will be part of a project, which will set all these things later. Also, as this code is merely used to test the theory, I haven’t included any real error checking. I’d be interested in hearing some ideas on how to tweak this for better speed and reliability.

I did consider writing this using low-level file functions (FOPEN, FGETS, FREAD, etc), but in the end I felt that the format was too wacky and I had better rely on STREXTRACT and SUBSTRING instead.

CREATE DATABASE C:/webstats/webstats
OPEN DATABASE C:/webstats/webstats
CREATE TABLE C:/webstats/webstats.dbf (;
    stat_id integer AUTOINC,;
    ip_address character(15),;
    date_time datetime,;
    gmt_offset character(5),;
    access_method character(4),;
    location varchar(200),;
    http_version character(10),;
    http_code character(3),;
    bytes integer,;
    referrer varchar(200),;
    agent varchar(200),;
    source_file varchar(60))
 
lcData = FILETOSTR("C:/webstats/raw_log.txt")
 
FOR n = 1 TO MEMLINES(lcData)
 
    lcLine = MLINE(lcData,n)
 
    *-- IP address is first.
    lcIP = ALLTRIM(SUBSTR(lcLine,1,AT("- -",lcLine)-1))
 
    *-- date time and GMT offset info. Use STREXTRACT
    *-- to get the data between the first two quotes
    *-- on the line. Then break the datetime string
    *-- into separate lines to use MLINE to build a
    *-- proper datetime string
 
    lcDateTimeGMT = ;
        STRTRAN(STREXTRACT(lcLine,"[","]")," ",CHR(13))
    lcDT = ;
        STRTRAN(STRTRAN(MLINE(lcDateTimeGMT,1),"/",CHR(13)),":",CHR(13))
    IF !EMPTY(lcDT)
        ltDatetime = DATETIME(;
            VAL(MLINE(lcDT,3)),;
            get_monthnum(MLINE(lcDT,2)),;
            VAL(MLINE(lcDT,1)),;
            VAL(MLINE(lcDT,4)),;
            VAL(MLINE(lcDT,5)),;
            VAL(MLINE(lcDT,6)))
    ELSE
        ltDatetime = {}
    ENDIF
 
    lcGMTOffset = MLINE(lcDateTimeGMT,2)
 
    *-- Request data. Just to 'switch-it-up' a bit,
    *-- I'm using GETWORDNUM here because
    *-- I need 3 chunks of data within the
    *-- STREXTRACT, all conveniently separated
    *-- by a space.
 
    lcRequest = STREXTRACT(lcLine,'"','"')
    lcMethod = GETWORDNUM(lcRequest,1)
    lcLocation = GETWORDNUM(lcRequest,2)
    lcHTTPVer = GETWORDNUM(lcRequest,3)
 
    *-- HTTP Code and bytes returned.
    lcHTTPCode = SUBSTR(lcLine, ;
        AT(lcRequest,lcLine)+LEN(lcRequest)+2,3)
    lnBytes = VAL(SUBSTR(lcLine,AT(" " + lcHTTPCode + ;
        " ",lcLine)+LEN(lcHTTPCode)+2,;
        AT('"',lcLine,3) - (AT(" " + lcHTTPCode + " ",lcLine)+;
        LEN(lcHTTPCode)+2)))
 
    *-- referrer
    lcReferer = STREXTRACT(lcLine,'"','"',3)
 
    *-- User's agent
    lcAgent = STREXTRACT(lcLine,'"','"',5)
 
    INSERT INTO webstats (ip_address,date_time,gmt_offset,;
        access_method, location,http_version,http_code,;
        bytes,referrer,agent,source_file ) ;
    VALUES  (lcIP , ltDatetime , lcGMTOffset , lcMethod ,;
        lcLocation , lcHTTPVer ,lcHTTPCode,lnBytes,lcReferer,;
        lcAgent,"raw_log.txt" )
 
NEXT
 
*-- Functions
FUNCTION get_monthnum
LPARAMETERS tcAbbr
 
DIMENSION aMonthList[12]
aMonthList[1] = "JAN"
aMonthList[2] = "FEB"
aMonthList[3] = "MAR"
aMonthList[4] = "APR"
aMonthList[5] = "MAY"
aMonthList[6] = "JUN"
aMonthList[7] = "JUL"
aMonthList[8] = "AUG"
aMonthList[9] = "SEP"
aMonthList[10] = "OCT"
aMonthList[11] = "NOV"
aMonthList[12] = "DEC"
 
RETURN ASCAN(aMonthList,LEFT(UPPER(ALLTRIM(tcAbbr)),3))
ENDFUNC

Next entry, I’ll see about downloading this file automatically (might take some ingenuity) and building in some logic to do so periodically. After that, I’ll write some algorithms to identify potential abuse situations, spam, and increased traffic, setting up email notifications in the process. Stay tuned!

Published May 15th, 2007

Microsoft and Open Source

As you know, Microsoft and Open Source have had an oil/water type of relationship. Here’s an article that gives us 10 things that MS loves and hates about Open Source. I post this here because FoxPro made the list:

The year-old open source project hosting Web site started by Microsoft lets users share open source development projects. The big news is that portions of Visual FoxPro will be posted as open source on Codeplex. A new version of the Web site is released every three weeks adding additional features and updates. As of early March, there were 1,029 projects on the site.