Posts Tagged Analysis

Chaos Theory and the Data Warehouse

Have you ever considered the Data Warehouse as a chaotic system? The work of the Data Warehouse team is never complete: new requirements trickle in every day, and user feedback gets more and more sophisticated as time passes. Chaos Theory can help explain this, and in the end, offer us some insight into how we can better plan Data Warehouse development, deployment, and maintenance.

The Data Warehouse is a process that forms the center of an information supply chain, with several inputs and several outputs. Each input and each output is subject to change based on factors such as vendor upgrades, new interfaces, expanded interfaces, and, perhaps most importantly, end-user (client) evolution. All of these changes happen continuously. As people use the Data Warehouse, they become more inquisitive. They want their output and analysis rolled up or down in different ways. Predicting (i.e. planning for) Data Warehouse change can be as difficult as predicting (and therefore planning for) the weather. This environment of ever-changing needs fits neatly into the confines of Chaos Theory. But what is chaos in this context? What is Chaos Theory exactly?

From the book “Chaos Theory Tamed”, author Garnett P. Williams writes:

Chaos is sustained and disorderly-looking long-term evolution that satisfies certain mathematical criteria and that occurs in a deterministic non-linear system. Chaos theory is the principles and mathematical operations underlying chaos. (p. 9)

In the 1960s, meteorologist Edward Lorenz determined that even the tiniest differences in an initial measurement can have a huge impact on an outcome. In other words, as his butterfly effect posits, a butterfly flapping its wings in Africa can affect weather patterns in North America. Weather is a system with a highly sensitive dependence on its initial conditions.

The foundation of the Data Warehouse is only as stable as your control over the tiniest changes to the inputs into its information structure. Like the weather, it has a highly sensitive dependence on its inputs. One tiny change to a source system can have almost catastrophic effects on the Data Warehouse.
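
To see how sensitive dependence plays out, here is a tiny FoxPro sketch, purely an illustration, using the logistic map, a textbook chaotic system. Two starting values that differ by only one millionth end up nowhere near each other after a few dozen iterations:

*-- Illustration only: the logistic map x = r * x * (1 - x) with r = 4
*-- is a classic chaotic system. Two initial values differing by 0.000001
*-- produce completely different trajectories after a few dozen steps.
LOCAL lnA, lnB, lnR, lnI
lnR = 4
lnA = 0.200000
lnB = 0.200001
FOR lnI = 1 TO 40
    lnA = lnR * lnA * (1 - lnA)
    lnB = lnR * lnB * (1 - lnB)
ENDFOR
? lnA, lnB   && the two results no longer resemble each other

The Data Warehouse is not a formula you can iterate, of course, but the same principle applies: small, unmanaged differences at the inputs compound over time.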

Finding Order

However, despite the chaos, we should be able to find some order. This is what Lorenz and scientists after him tried to do. The first step in this process is understanding that even seemingly random changes are not always as random as they seem. If we can understand that changes to our Data Warehouse are not random, then we can build a better Data Warehouse.

There are a few things you can do to tame the chaos:

  • Be consistent and systematic. The more predictable you and your Data Warehouse team are, the easier it will be to handle change. In other words, control any and all variables that you can.
  • Adopt proven analysis and development methodologies that others have had success with. This is not to say that some level of adaptation to your environment, team skills, and situation are not required, but rather, start off with a good foundation and follow along where it makes sense.
  • Keep the team close. Quality and frequent interaction among the people who make and run the DWH is essential.
  • Stay in the groove like an improvisational jazz band. If your data modelers are not in tune with your decision-support analysts who are not in tune with your DBA, then you can’t expect to handle the challenges of chaos.
  • Feedback and evolution are two very important aspects of Data Warehousing. Keep your ear to the ground and try to anticipate changes before they occur. This takes practice, but (back to the improvisational jazz band analogy) practice makes perfect.
  • Keep in step. In the Data Warehouse world, change is natural and will come in waves. More significantly, if changes cannot be implemented quickly, your clients will lose confidence in your ability to keep up.
  • Think and act quickly. The longer you debate, the longer your client must wait. While they wait, they construct workarounds or look elsewhere. If you’re lucky and they do wait for you, their change may become outdated and no longer relevant; an opportunity might have been missed (and you’ve essentially failed them).
  • Don’t be afraid to be wrong. The consequence of acting quickly is that you might get something wrong. Just be agile enough to respond and deliver new change with urgency.

I’ll post more thoughts on this over the next weeks. I’m particularly interested in how users of the Data Warehouse become more and more sophisticated as they use its tools and applications.


Analytical Databases

Most of what I discuss on Tod means Fox (and especially in my 34 Subsystem series) has to do with relational database engines (SQL Server 2005 and Visual FoxPro in particular). For most data integration projects, desktop and web solutions, and data warehouses, the relational database is all you’ll need. It can be used to create and manage both relational and dimensional models. But in Business Intelligence applications, chances are that you’ll need an additional, more specialized form of multidimensional data storage and retrieval. Analytical databases cover this need. In fact, Dr. Codd, the inventor of the relational database, often explained how analytical databases are a “necessary companion” to relational databases.

The OLAP Story

OLAP stands for Online Analytical Processing. It is a terrible name for multidimensional analytical databases. Instead of a more meaningful name, like FASMI (Fast Analysis of Shared Multidimensional Information) proposed by Nigel Pendse of “OLAP Report”, we seem to be stuck with OLAP. With great pain, I will continue to discuss analytical databases using the OLAP terminology.

Basically, an OLAP (*sigh*) database has certain characteristics that set it apart from relational databases, and Pendse captured these characteristics in the FASMI acronym. Check out the OLAP Report for more details. If you’re just getting into business analytics, or are in the process of evaluating analytical database software, you must visit this site!

The Smartness Factor

I like to think of the relational database as “dumb” and the analytic database as “smart”. Let me explain:

The primary focus of analytic servers is to get the (often pre-calculated) data out of the database as quickly as possible, allowing the user to zoom in and out along different hierarchies. Contrast this with Codd’s relational model, which seeks to eliminate data anomalies at transaction time through normalization: data retrieval is slow and often complex. With this in mind, you can get a feel for how a database engine can store, catalog, and retrieve data differently. (In fact, this is what makes dimensional modeling in general so favorable for querying and analytics: it is not bound by the restrictiveness imposed by normalization. I’d also like to note that, as with dimensional models, OLAP databases are multidimensional.)

While the relational database plays a very important role, so does the multidimensional (OLAP) database. It’s simply built differently to service a different need.

Primarily, analytic servers can manage aggregates, aggregate navigation, and complex calculations and summaries across hierarchies at blazing speeds. These skills were borrowed from both navigational database designs (think of the DOM or a Wiki) and hierarchical designs (trees with branches and leaves). A lot of this has to do with how the data is stored.

Storage

Analytic servers offer a different type of storage. MOLAP, or Multidimensional OLAP, can be much more efficient than relational engines. Some tools (like SQL Server Analysis Services) allow you to store your analytical databases in a relational way (ROLAP) or using a hybrid approach (HOLAP). Personally, I see no benefit at all with ROLAP, aside from real-time systems where using the relational database is a must because of the latency involved in updating the OLAP database. (Not to digress too much here, but even real-time environments can use MOLAP storage by using partitions and caches correctly.)

MOLAP is more natural and faster. This may differ depending on your tool choice, but I challenge someone to tell me differently. For a detailed discussion of how MOLAP and ROLAP engines store data, try the “Analysis Services 2005 Performance Guide” (which describes the way Analysis Services does it).

HOLAP storage is an interesting option and could actually perform well (even better than MOLAP in some instances) with less disk usage. But a lot of this depends on how many aggregates are defined and how often the system needs to query the relational database. It might be a good consideration, but if you’re not sure and you have plenty of disk space (disk space is cheap), then go with MOLAP.

The smartness factor, part 2 (the MDX language)

I like to think of SQL as “dumb” and MDX as “smart”. Let me explain:

The MultiDimensional eXpressions (MDX) language was created about 10 years ago by Microsoft to work specifically with multidimensional data stored in analytical servers. This OLAP-specific language improves upon SQL syntax, removing much of the bulkiness associated with that language. MDX is an elegant and highly relevant partner to the analytical database. While you could get out what you need using SQL, almost every SQL statement will be a challenging one. It’s like cutting down a tree with a hand saw as opposed to using a chain saw. MDX is painless and intuitive. Concepts such as dimensions, hierarchies, and sets are built into the syntax. For more on MDX and the mdXML standard, please visit XML For Analysis.
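
To give a feel for the syntax, here is a minimal sketch of running an MDX query from FoxPro through ADO. The server, catalog, cube, and member names below ([Sales], [Product].[Category], [Measures].[Sales Amount], and so on) are placeholders for illustration only; the MSOLAP provider returns the MDX result to plain ADO as a flattened rowset:

*-- Minimal sketch: execute an MDX query against an Analysis Services cube
*-- from FoxPro via ADO. Server, catalog, cube, and member names are placeholders.
LOCAL loConn, loRS, lcMDX
loConn = CREATEOBJECT("ADODB.Connection")
loConn.Open("Provider=MSOLAP;Data Source=localhost;Initial Catalog=SalesDW")

TEXT TO lcMDX NOSHOW
SELECT { [Measures].[Sales Amount] } ON COLUMNS,
       { [Product].[Category].Members } ON ROWS
FROM [Sales]
WHERE ( [Date].[Calendar Year].[CY 2007] )
ENDTEXT

loRS = loConn.Execute(lcMDX)
DO WHILE !loRS.EOF
    *-- first column holds the row member, last column the measure
    ? loRS.Fields(0).Value, loRS.Fields(loRS.Fields.Count - 1).Value
    loRS.MoveNext()
ENDDO
loRS.Close()
loConn.Close()

Notice how the query is expressed directly in terms of measures, dimension members, and sets; there are no joins or GROUP BY clauses to wrestle with.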

Central to the MDX language is the cube concept, which deserves a proper introduction:

Cubes

OLAP data is stored in structures called cubes. As you know, a cube is a 3-dimensional solid; given a point on each of three faces that share a vertex, you can locate a precise point somewhere within the cube itself. That point represents the aggregate metric you want to view. The faces represent the different dimensions (like Product, Customer, and Time) that are used to find the point. Each dimension is further broken down by its hierarchies along its face.

Cubes are sometimes, and more aptly, called hypercubes (or a tesseract or “n-cube”) because analysis often requires more than three dimensions. As soon as you look at more than three faces, you need higher dimensionality.

I admit, when I first heard about cubes I was a bit intimidated. I felt that they were too complex and advanced. Now, working with cubes is natural. Notice, though, that the concept of a cube is almost the same thing as a star schema. The only difference (at least that I can really think of) is that a star schema generally stores atomic data, and barring any usable aggregate fact table, calculations need to be done on the fly. The cube, in theory, stores the results of those calculations ahead of time.
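
To make the “on the fly” point concrete, here is what a typical rollup looks like in plain VFP SQL against a hypothetical star schema (the fact_sales, dim_product, and dim_date names are made up for illustration). A cube would have this kind of summary pre-aggregated and ready to browse along the hierarchy:

*-- On-the-fly rollup against a hypothetical star schema.
*-- Table and field names are placeholders.
SELECT dp.category, dd.cal_year, SUM(fs.sales_amt) AS tot_sales ;
    FROM fact_sales fs ;
    JOIN dim_product dp ON fs.product_key = dp.product_key ;
    JOIN dim_date dd ON fs.date_key = dd.date_key ;
    GROUP BY 1, 2 ;
    ORDER BY 1, 2 ;
    INTO CURSOR crSalesRollup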

Dimensional Modeling

If you read my blog, then you know I advocate dimensional models for proper data warehousing. If you also advocate and use dimensional models, then the concepts discussed so far will fit you well.

What’s great about using dimensional models in your relational database is that your OLAP database will almost mirror the design, making the development and deployment of your cubes much easier. Not only that, but the cubes will likely load faster and be easier to maintain when built off of the dimensional model. Note that you can still achieve most of this without an analytic server, just by using a properly constructed set of business process dimensional models. The OLAP database is simply smarter, making several tasks (navigating hierarchies, security, and calculations, to name a few) easier.


Newton’s Method for Approximating the Zeros of a Real-valued Function

Need to find the root of a real-valued function and simple algebra just won’t cut it? Then Newton’s method is for you (if you can remember back to your calculus days, that is).

Newton’s method is a root-finding algorithm that uses iteration to solve nonlinear equations. The process involves making a guess at the true solution and then applying a formula to get a better guess. This process repeats until an (approximate) answer is found. The methodology is part of numerical analysis, which is the “branch of mathematics dealing with methods for obtaining approximate numerical solutions of mathematical problems” (dictionary.com). Specifically, these problems fall into the “continuous mathematics” category and are common in medicine, engineering, science, and business. (For a good discussion of continuous versus discrete mathematics, check out this blog entry by Mark C. Chu-Carroll or this page from ESC.EDU.) As you might guess, continuous mathematics is very important in business intelligence, specifically the statistical analysis of data.

The general form to find the approximation is as follows:

x_{n+1} = x_n - f(x_n) / f'(x_n)

Here’s how it works: Start with an initial guess (get out your graph paper!). The method draws a tangent line to the function at your initial guess, then computes where that tangent crosses the x-axis. This x-intercept will be a better guess than the initial one, closer to the answer (the root). The process then repeats with the x-intercept as the new guess.

The FoxPro code:

LOCAL lcFunction, lnInitialGuess, lnIterations

*-- set the function, my initial guess, and the number of iterations
lcFunction = "(1/4)*x^3 - 3*x^2 + (3/4)*x - 2"
lnInitialGuess = 2
lnIterations = 10

*-- run newton
newton(lcFunction, lnInitialGuess, lnIterations)

*-- foxpro does newton:
FUNCTION newton
LPARAMETERS tcFunction, tnInitialGuess, tnIterations

LOCAL lcFunction, lnInitialGuess, lnIterations
LOCAL lnNewGuess, lnIter, x, lnY, lnSlope, lnYS
LOCAL limh, lnHoldX, lnFuncPlusH, lnFuncAtX

*-- one row per iteration: n, Xn, f(Xn), f'(Xn), f(Xn)/f'(Xn), and the new guess
CREATE CURSOR crNewton (col_n n(2), col_Xn n(14,5), col_fXn n(14,5), col_f1Xn n(14,5), col_ys n(14,5), col_ng n(14,5))

limh = 0.000001                     && small step used to approximate the derivative
lcFunction = ALLTRIM(tcFunction)
lnInitialGuess = EVL(tnInitialGuess, 0)
lnIterations = EVL(tnIterations, 1)

lnNewGuess = lnInitialGuess

*-- Loop for the requested number of iterations
FOR lnIter = 1 TO lnIterations

  *-- Set the guess
  x = lnNewGuess

  *-- Evaluate the function at the current guess
  lnY = &lcFunction

  *-- Approximate the derivative with a difference quotient: (f(x+h) - f(x)) / h
  lnHoldX = x
  x = x + limh
  lnFuncPlusH = &lcFunction
  x = lnHoldX
  lnFuncAtX = &lcFunction
  lnSlope = (lnFuncPlusH - lnFuncAtX) / limh

  lnYS = lnY / lnSlope

  *-- Newton step: new guess = x - f(x)/f'(x)
  lnNewGuess = x - lnYS

  *-- Record this iteration
  INSERT INTO crNewton (col_n  ,col_Xn ,col_fXn ,col_f1Xn ,col_ys ,col_ng) ;
                VALUES (lnIter ,x      ,lnY     ,lnSlope  ,lnYS   ,lnNewGuess)

NEXT

SELECT crNewton
LOCATE
BROWSE NORMAL

ENDFUNC

In this run, with the equation I have defined, the approximation of x is 11.80326. If you plug this number in for x in the original equation, it evaluates to .00014. Pretty close! And that’s exactly what this algorithm is great for. Try these additional functions to get the hang of it. You can change the guess and iterations as you like as well:

  • "x + 1"
  • "x^3-12"
  • "x*(x^3)-(3/4)*x"
  • "((x^4+1)*(x^3+2))-11"
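
To double-check the approximation above, plug it back into the original function using the same macro-substitution trick the newton function relies on:

*-- plug the approximate root back into the original function
LOCAL x, lcFunction
lcFunction = "(1/4)*x^3 - 3*x^2 + (3/4)*x - 2"
x = 11.80326
? &lcFunction    && prints roughly 0.0001, close enough to zero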

 


Sorting IP Addresses in VFP

In my series, “Business Intelligence Through Web Analysis”, I talk about using web stats to build a web analytics solution in FoxPro. I have created a webstats table, and have populated it with some data from my access logs. I have a requirement to be able to sort my webstats table by IP address (among other things).

The format of an IP address doesn’t lend itself to proper indexing (it is a 32-bit numeric address written as four numbers, or octets, separated by periods). To accommodate my requirement, my first thought was to break the IP address into 4 separate numeric fields when importing. This would be good normalization. After all, the rule is that atomic data is better than concatenated data. But rules are meant to be broken. And, more importantly, does it make sense to atomize an IP address? An IP address is not really 4 separate numbers; the entire address is meaningful, even though each piece holds some significance.

So, I decided to keep the IP address intact.

Still needing a good way to sort and group, I thought about converting the IP address to its binary representation. MySQL has a function, INET_ATON (Address To Number), that does exactly that. FoxPro’s CREATEBINARY function seems to do the trick at first. I can pass in the complete IP address like so:

? CREATEBINARY("128.0.0.0")
? CREATEBINARY("168.212.226.204")

But if you look at these outputs, you’ll realize that they’re not great for indexing (which is the only reason I want to convert them in the first place): CREATEBINARY creates strings of varying length.

Back to the drawing board. I could represent each octet as a string of 0s and 1s like this:

lnOct = 122
lcBin = ''
DO WHILE lnOct > 0
    lcBin = STR(lnOct%2, 1) + lcBin
    lnOct = INT(lnOct/2)
ENDDO
? PADL(lcBin,8,'0')

But this seems terribly inefficient (I admit, no benchmarking was done, but this code is overkill!).

With all this in mind, I think I have a viable solution. It doesn’t require a custom function, nor does it require storing data in separate fields. I know I can easily extract each octet of the IP address by using GETWORDNUM. And, knowing a little something about IP addresses, each octet has 2^8 (or 256, values 0 to 255) possible combinations. Therefore, an IP Address can be represented in a numeric form like so:

? VAL(GETWORDNUM(ip_address,1,'.')) * 256^3 + ;
  VAL(GETWORDNUM(ip_address,2,'.')) * 256^2 + ;
  VAL(GETWORDNUM(ip_address,3,'.')) * 256   + ;
  VAL(GETWORDNUM(ip_address,4,'.'))

So my index for webstats.dbf would be:

SELECT webstats
INDEX ON BINTOC( ;
    VAL(GETWORDNUM(ip_address,1,'.')) * 256^3 + ;
    VAL(GETWORDNUM(ip_address,2,'.')) * 256^2 + ;
    VAL(GETWORDNUM(ip_address,3,'.')) * 256 + ;
    VAL(GETWORDNUM(ip_address,4,'.')), 8) TAG IP
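
With the tag in place, sorting by IP address is just a matter of switching to that index order:

*-- quick check: view webstats in numeric IP order
SELECT webstats
SET ORDER TO TAG IP
GO TOP
BROWSE NORMAL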

Your comments and feedback are welcome. Is there a better solution?


Business Intelligence Through Web Analysis

I just received a surprising email regarding my previous post about Analyzing Apache’s Raw Access Logs in FoxPro. The commenter wrote “dude, you’re nuts. Why waste your time on this?!? I use Awstats and that works fine”.

I sat back in my chair quite puzzled. After all, I know that Awstats doesn’t even come close to giving me the answers I need to grow my business and website. Then I realized my folly: I jumped right into my example in my last post without fully explaining the goal of the project. I’ll try to redeem myself now.

Awstats and similar tools (in my case, provided with cPanel) are retrospective reporting tools. They give you nice charts and display some very interesting numbers and facts about the types of traffic generated on your site. If you’re good with numbers and can associate events to dates (in your head) then you may be able to notice some interesting patterns (like, “hey – it seems that whenever I post a new blog entry, my hits double!”).

But this isn’t good enough for more serious projects (but isn’t a bad place to start either).

When growing your business (whether you’re a blogger or selling widgets), this type of information is invaluable. You need good, consistent, scientific analysis to pull it all together (hunches and gut feelings don’t count). Pool raw data from a variety of sources, integrate it, clean it and add value to it, and compile it into some incredibly useful and valuable information (read: Business Intelligence). This information can help you make decisions like (a) how much to spend on advertising, (b) whether or not to sponsor an event (such as FoxForward), (c) how many blog posts to make per week to keep the interest of readers, (d) whether to sign up with Google AdSense, or (e) which design elements and layout plans are making the greatest impact.

Data smog is a real issue, however. Too much data can not only waste your valuable analysis time, but the integration of meaningless data will do nothing but eat away at your resources (and give you a headache). The key, then, is to do a little preparation before you begin a project like this (duh!). I think there are two important steps: (1) monetize all elements of your business, and (2) identify all key performance indicators (KPIs). Armed with this information, you will be able to build a dimensional model (in VFP, of course!) with an incredibly rich fact table. Monetizing helps you assign value to all your tasks, and KPIs allow you to measure the benefits of those endeavors. Throughout the course of this project, I’ll be itemizing these two steps in greater detail. As an example, for my blog todmeansfox, I’ve monetized the following items:

  • Posting a new blog entry: $90.00 / post (my time and effort to post based on my current rates, discounted)
  • Responding to posts: $10.00 / comment
  • Social networking: $90.00 / contact (includes setting up accounts on del.ico.us for example)
  • Advertising: $40.00 / free advertising (example, updating my fox.wiki profile)
  • Advertising: case by case (I have not done any advertising yet)
  • Sponsoring: case by case (I have not sponsored an event yet)
  • Updating CSS files: $60.00 / hour (how much does it ‘cost’ me to update my site’s layout and colors?)

Next, I tried to identify all important KPIs:

  • referrals
  • length of stay
  • out clicks
  • total number of hits and page views
  • total number of comments made
  • total number of emails received
  • Blackstone Providence leads
  • consulting job hours from website leads

Of course, as I do more data profiling, I may uncover some other very useful pieces of data that I can later integrate.

The goal of this project, therefore, is rather simple: make better business decisions. Using the Internet, I will gather the right data, integrate it in a meaningful way, and use OLAP to analyze and report on the findings. I’ll use data mining, profiling, and trend analysis to identify abuse and spam, as well as identify areas where improvements in layout, methodology, and content can make a greater impact. My hope is to generate more business by leveraging my current assets. On top of that, I want to do it in the open so others can benefit.

The first step in my process is data profiling, where I’ll gather the data I think I can use, analyze it, test it for quality, and prepare it for my dimensional model. My last blog post attempted to start that process by simply downloading and parsing the daily Apache raw access log file (which I should mention is in the NCSA combined/XLF/ELF log format, in case you were wondering).
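
To give a sense of where this is heading, here is a rough sketch of the kind of fact table the dimensional model might center on. Every name below is a placeholder and will almost certainly change once the data profiling is done:

*-- Rough sketch only: a daily activity fact table keyed to date, activity,
*-- and referrer dimensions, carrying the monetized cost and a few KPI counts.
CREATE TABLE C:/webstats/fact_activity (;
    date_key integer,;
    activity_key integer,;
    referrer_key integer,;
    monetized_cost n(10,2),;
    hit_count integer,;
    page_views integer,;
    comment_count integer,;
    lead_count integer)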

As you can see, Awstats can only get me so far (actually, not that far at all).


Analyze Apache Raw Access Logs in FoxPro

I’ve been increasingly interested in examining the access logs on my server. For the past 6 or 7 years, I’ve used some of the cool little tools provided by cPanel located in their Web/FTP Statistics folder (like Webalizer and Awstats). While they are nice to look at, I’ve always felt a bit handicapped by the interface and general content. I have my own ideas on what I’d like to see and how I want to see it.

In this and subsequent articles, I’ll build an Apache Raw Access Log analysis application that will (a) download the access logs once or twice a day, (b) dump the data into a database, (c) integrate this data with other information (perhaps some demographic info, error logs, site information, dates and times of blog posts, etc.), and (d) in the end, build a suite of reports and filters that can be used to identify things like abuse, spam, and positive traffic.

My hope is to do a good enough job with this that, in the end, it could end up in the Fox Open Source community if there’s even a remote bit of interest. This should evolve in such a way that any access log from any server would be supported, but in the initial stages, I’ll be using my Raw Access Logs from Apache. For those interested, here are the details of my server configuration:

Operating system: Linux
Apache version: 1.3.37 (Unix)
cPanel Build: 10.9.0-STABLE 9966
cPanel Pro: 1.0 (RC36)

The Log File

I started by looking at the raw files created by Awstats on my server (/tmp/awstats/). There is a ton of information in there which is already parsed and used by the various free analysis tools. But what I really need is a raw file that contains a line for each ‘hit’ to my server (including 200, 404, and 500 messages). The Raw Access Logs are perfect (in cPanel, you can manage them by opening the Raw Log Manager). They contain a line for each hit, some referral information, http status codes, etc.

The only problem with the file is its format (NCSA combined/XLF/ELF)! A single line looks like:

206.17.xxx.xxx - - [23/May/2007:07:59:21 -0500] "GET /Styles/color.css HTTP/1.1" 304 - "http://www.rabbitstewlounge.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)"

Fortunately, FoxPro can handle this sort of thing quite easily. With a log file containing potentially thousands of lines each day, you can see how reading this file without parsing can be nightmarish (not to mention highly inefficient). So, this is where I started. I downloaded today’s access file from my site, and wrote the following code. As this project matures, I’ll be sure to clean this up (hopefully with some reader’s suggestions!) and provide the source in zip format. For now, follow along:

Get Started

First, download your raw access file. Load cPanel and click “Raw Access Logs”. Select the raw access log for your domain and unzip it. In this example, I’ve also changed its name from “rabbitstewlounge.com” to “raw_log.txt”.

The Basic Code

I left out some environmental settings, such as SET SAFETY and SET MEMOWIDTH for now. Eventually this will be part of a project, which will set all these things later. Also, as this code is merely used to test the theory, I haven’t included any real error checking. I’d be interested in hearing some ideas on how to tweak this for better speed and reliability.

I did consider writing this using low-level file functions (FOPEN, FGETS, FREAD, etc.), but in the end I felt that the format was too wacky and I was better off relying on STREXTRACT and SUBSTR instead.

CREATE DATABASE C:/webstats/webstats
OPEN DATABASE C:/webstats/webstats
CREATE TABLE C:/webstats/webstats.dbf (;
    stat_id integer AUTOINC,;
    ip_address character(15),;
    date_time datetime,;
    gmt_offset character(5),;
    access_method character(4),;
    location varchar(200),;
    http_version character(10),;
    http_code character(3),;
    bytes integer,;
    referrer varchar(200),;
    agent varchar(200),;
    source_file varchar(60))
 
lcData = FILETOSTR("C:/webstats/raw_log.txt")
 
FOR n = 1 TO MEMLINES(lcData)
 
    lcLine = MLINE(lcData,n)
 
    *-- IP address is first.
    lcIP = ALLTRIM(SUBSTR(lcLine,1,AT("- -",lcLine)-1))
 
    *-- date time and GMT offset info. Use STREXTRACT
    *-- to get the data between the first two quotes
    *-- on the line. Then break the datetime string
    *-- into separate lines to use MLINE to build a
    *-- proper datetime string
 
    lcDateTimeGMT = ;
        STRTRAN(STREXTRACT(lcLine,"[","]")," ",CHR(13))
    lcDT = ;
        STRTRAN(STRTRAN(MLINE(lcDateTimeGMT,1),"/",CHR(13)),":",CHR(13))
    IF !EMPTY(lcDT)
        ltDatetime = DATETIME(;
            VAL(MLINE(lcDT,3)),;
            get_monthnum(MLINE(lcDT,2)),;
            VAL(MLINE(lcDT,1)),;
            VAL(MLINE(lcDT,4)),;
            VAL(MLINE(lcDT,5)),;
            VAL(MLINE(lcDT,6)))
    ELSE
        ltDatetime = {}
    ENDIF
 
    lcGMTOffset = MLINE(lcDateTimeGMT,2)
 
    *-- Request data. Just to 'switch-it-up' a bit,
    *-- I'm using GETWORDNUM here because
    *-- I need 3 chunks of data within the
    *-- STREXTRACT, all conveniently separated
    *-- by a space.
 
    lcRequest = STREXTRACT(lcLine,'"','"')
    lcMethod = GETWORDNUM(lcRequest,1)
    lcLocation = GETWORDNUM(lcRequest,2)
    lcHTTPVer = GETWORDNUM(lcRequest,3)
 
    *-- HTTP Code and bytes returned.
    lcHTTPCode = SUBSTR(lcLine, ;
        AT(lcRequest,lcLine)+LEN(lcRequest)+2,3)
    lnBytes = VAL(SUBSTR(lcLine,AT(" " + lcHTTPCode + ;
        " ",lcLine)+LEN(lcHTTPCode)+2,;
        AT('"',lcLine,3) - (AT(" " + lcHTTPCode + " ",lcLine)+;
        LEN(lcHTTPCode)+2)))
 
    *-- referrer
    lcReferer = STREXTRACT(lcLine,'"','"',3)
 
    *-- User's agent
    lcAgent = STREXTRACT(lcLine,'"','"',5)
 
    INSERT INTO webstats (ip_address,date_time,gmt_offset,;
        access_method, location,http_version,http_code,;
        bytes,referrer,agent,source_file ) ;
    VALUES  (lcIP , ltDatetime , lcGMTOffset , lcMethod ,;
        lcLocation , lcHTTPVer ,lcHTTPCode,lnBytes,lcReferer,;
        lcAgent,"raw_log.txt" )
 
NEXT
 
*-- Functions
FUNCTION get_monthnum
LPARAMETERS tcAbbr
 
DIMENSION aMonthList[12]
aMonthList[1] = "JAN"
aMonthList[2] = "FEB"
aMonthList[3] = "MAR"
aMonthList[4] = "APR"
aMonthList[5] = "MAY"
aMonthList[6] = "JUN"
aMonthList[7] = "JUL"
aMonthList[8] = "AUG"
aMonthList[9] = "SEP"
aMonthList[10] = "OCT"
aMonthList[11] = "NOV"
aMonthList[12] = "DEC"
 
RETURN ASCAN(aMonthList,LEFT(UPPER(ALLTRIM(tcAbbr)),3))
ENDFUNC

Next entry, I’ll see about downloading this file automatically (might take some ingenuity) and building in some logic to do so periodically. After that, I’ll write some algorithms to identify potential abuse situations, spam, and increased traffic, setting up email notifications in the process. Stay tuned!


Interpolation

For those who aren’t familiar with the term, and since I’m not too good at creating my own definitions (go on, ask Andy Kramek!), I thought I’d borrow a definition from Wikipedia to get started:

“In the mathematical subfield of numerical analysis, interpolation is a method of constructing new data points from a discrete set of known data points.” (Source)

If you follow that source link to the Wikipedia article, you can read lots of nice things about interpolation and how it can be used in the math and science fields. About 8 years ago, I needed such a function: I needed to estimate the value of f(x) for values of x that fell between known data points. I wrote the following linear interpolation algorithm in FoxPro to get my answers:

*-- this sets up a test cursor with some data points
CREATE CURSOR crDataSet (fld_x n(4,2), fld_fx n(4,2))
INDEX ON fld_x TAG fld_x
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (1,1.75)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (3,2.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (4,3.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (5,4.00)
INSERT INTO crDataSet (fld_x , fld_fx) VALUES (7,4.25)
 
? interpolate(2,"crDataSet")   && returns 1.875
? interpolate(3,"crDataSet")     && returns 2.000
? interpolate(3.1,"crDataSet")  && returns 2.100
? interpolate(3.9,"crDataSet")  && returns 2.900
? interpolate(10,"crDataSet")   && returns 4.625
 
FUNCTION interpolate
    LPARAMETERS tnX, tcDataSet
 
    LOCAL lcfld_x , lcfld_fx AS String
    LOCAL lcOldNear , lcOldExac AS String
    LOCAL cXStr , cYStr AS String
    LOCAL X1, Y1, X2, Y2 AS Number
    LOCAL lFound AS Logical

    *-- name of the fields in the tcDataSet cursor
    lcfld_x = FIELD(1, tcDataSet)
    lcfld_fx = FIELD(2, tcDataSet)
 
    *-- important to adjust these settings
    lcOldNear = SET('NEAR')
    lcOldExac = SET('EXACT')
    SET NEAR ON
    SET EXACT OFF 
 
    SELECT (tcDataSet)
 
    lFound = SEEK(tnX,tcDataSet)
    SET NEAR &lcOldNear
    SET EXACT &lcOldExac
 
    IF lFound
        *-- if found, then just return the point
        RETURN EVAL(tcDataSet + '.' + lcfld_fx)
    ELSE
        *- not found, so need to interpolate
        cXStr = tcDataSet + '.' + lcfld_x
        cYStr = tcDataSet + '.' + lcfld_fx
 
        *-- if the seek put us at eof, back up 1
        SELE (tcDataSet)
        IF EOF(tcDataSet)
            SKIP -1
        ENDIF
 
        *-- evaluate the first point
        X2 = EVAL(cXStr)
        Y2 = EVAL(cYStr)
 
        *-- evaluate the second point
        IF RECNO(tcDataSet) = 1
            SKIP +1
        ELSE
            SKIP -1
        ENDIF
        X1 = EVAL(cXStr)
        Y1 = EVAL(cYStr)
 
        *-- draw a line and return the result!
        RETURN ymxb(x1,y1,x2,y2,tnX)
    ENDIF
 
ENDFUNC  
 
FUNCTION ymxb
    LPARAMETERS X1,Y1,X2,Y2,X3
 
    *-- y = mx + b
    LOCAL nSlope , B1
 
    nSlope = (Y1-Y2)/(X1-X2)
    B1 = -((nSlope) * X1 - Y1)
    RETURN (nSlope) * X3 + B1
 
ENDFUNC

There you have it!

A couple of notes: For space reasons, I’ve omitted some key error checking and bullet-proofing (I started doing it, but realized that an extra 20 lines of code for this example was overkill). It would be a good idea, for example, to verify that the columns in the cursor passed in are numeric. Come to think of it, you should also verify that the correct parameters were passed!

In addition, I remember from math class that such a linear method is not the most accurate: I am using a straight line between points and not a curve. I suppose a complementary function could be added along with ymxb to fit a curve between the two points, but until I need to do that, I’m not game!
