Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Programming’


Published July 11th, 2007

Conforming Temperature in FoxPro (some conversions)

Conforming data is an essential part of data integration. Conforming data takes on many shapes and may include updating data types and lengths, attribute names and conventions, data atomicity, universal conversions, etc. The idea is that, in all circumstances, the attribute means (and therefore acts) the same across all entities in the database.

For example, let’s look at temperature. Business requirements (and perhaps your region of operation) will dictate the format temperature should be stored in. When integrating data, chances are great that you’ll have different temperature formats (Celsius and Fahrenheit) coming from different sources. In order to have a conformed database, you create a rule that states that all temperatures are to be stored in X format. I like Celsius, personally, so I convert all Fahrenheit temperatures to Celsius:

nCelsius = (nFahrenheit  - 32) * (5/9)

If you disagree, then go the other way:

nFahrenheit  = (9/5) * nCelsius + 32

Like Kelvin? (You must be a scientist!)

nKelvin = nCelsius + 273.16

There are virtually hundreds of these types of conforming decisions to be made in a typical large integration project. Build yourself a toolbox of common conversions, and apply them to your data as needed.

Published June 5th, 2007

Sorting IP Addresses in VFP

In my series, “Business Intelligence Through Web Analysis“, I talk about using web stats to build a web analytics solution in FoxPro. I have created a webstats table, and have populated it with some data from my access logs. I have a requirement to be able to sort my webstats table by IP address (among other things).

The format of an IP address doesn’t lend itself to proper indexing (it is a 32-bit numeric address written as four numbers (octets) separated by periods). To accommodate my requirement, my first thought was to break the IP address into 4 separate numeric fields when importing. This would be good normalization. After all, the rule is that atomic data is better than concatenated data. But rules are meant to be broken. And, more importantly, does it make sense to atomize an IP address? An IP address is not really 4 separate numbers, the entire address is meaningful, even though each piece holds some significance.

So, I decided to keep the IP address intact.

Still needing a good way to sort and group, I thought about converting the IP address to its binary representation. MySQL has a function, INET_ATON (Adress TO Number), that does exactly that. FoxPro’s CREATEBINARY function seems to do the trick at first. I can pass in the complete IP address like so:

? CREATEBINARY("128.0.0.0")
? CREATEBINARY("168.212.226.204")

But if you see these outputs, you’ll realize that they’re not great for indexing (which is the only reason I want to convert them). CREATEBINARY creates strings of varying length, which is not what I want for indexing.

Back to the drawing board. I could represent each octet as 0 and 1 like this:

lnOct = 122
lcBin = ''
DO WHILE lnOct > 0
    lcBin = STR(lnOct%2, 1) + lcBin
    lnOct = INT(lnOct/2)
ENDDO
? PADL(lcBin,8,'0')

But this seems terribly inefficient (I admit, no benchmarking was done, but this code is overkill!).

With all this in mind, I think I have a viable solution. It doesn’t require a custom function, nor does it require storing data in separate fields. I know I can easily extract each octet of the IP address by using GETWORDNUM. And, knowing a little something about IP addresses, each octet has 2^8 (or 256, values 0 to 255) possible combinations. Therefore, an IP Address can be represented in a numeric form like so:

? VAL(GETWORDNUM(ip_address,1,'.')) * 256^3 + ;
  VAL(GETWORDNUM(ip_address,2,'.')) * 256^2 + ;
  VAL(GETWORDNUM(ip_address,3,'.')) * 256   + ;
  VAL(GETWORDNUM(ip_address,4,'.'))

So my index for webstats.dbf would be:

SELECT webstats
INDEX ON BINTOC( ;
    VAL(GETWORDNUM(ip_address,1,'.')) * 256^3 + ;
    VAL(GETWORDNUM(ip_address,2,'.')) * 256^2 + ;
    VAL(GETWORDNUM(ip_address,3,'.')) * 256 + ;
    VAL(GETWORDNUM(ip_address,4,'.'));  ,8) TAG IP

Your comments and feedback are welcome. Is there a better solution?

Published May 16th, 2007

Take the Survey!

I took it Take a moment to fill out this quick survey (takes about 5 minutes) published by A List Apart. The purpose of this survey is to create some metrics that can be used to gauge the vitality of Web Development as a whole. From A List Apart’s website:

Designers, developers, project managers. Writers and editors. Information architects and usability specialists. People who make websites have been at it for more than a dozen years, yet almost nothing is known, statistically, about our profession. Who are we? Where do we live? What are our titles, our skills, our educational backgrounds? Where and with whom do we work? What do we earn? What do we value?

Take the Survey here: http://alistapart.com/articles/webdesignsurvey

Published May 10th, 2007

What’s faster? SQL or looping to return a set of records?

When you need to get a set of records (for example, to return a list of items from a lookup table, to prepare a report, or to fill a read-only grid), which method should you use?

Of course, the answer is ‘it depends’ and many factors are at play. It is part of our job as developers (or programmers, if you’d prefer) to pick the solution that best fits the scenario. As much has been written on this topic already, I won’t be revealing anything new here: Just a re-package, if you will.

First, the basics. There are several ways to go through a table with the intention of producing a set of resulting records. You can SCAN, DO WHILE, use a FOR..NEXT loop, or use SQL syntax. The first three options are procedural and process one record at a time. The last (SQL) is set-based. To get an idea on what looping mechanism to use, take a look at Andy Kramek’s blog entry from March 2006 called “Writing Better Code (Part 2)”. In this entry, Andy does a great job of showing you the effects of indexes on the three looping mechanisms, and should give you a good idea which one to use and when.

Using SQL syntax, you can return a result of records as well. You can replace your looping syntax with SQL syntax and get the same results (using much less code). When gathering data for a report, for example, I have often used the loop method to populate a special report cursor. I was inclined to stick with the procedural approach in the past because (a) it was easy enough to write, (b) easy to debug, (c) allowed me to easily insert a custom progress bar if needed, and (d) because I thought it was best. As my SQL skills improved, I found that SQL syntax is actually (a) easier to write, (b) just as easy to debug, (c) capable of supporting a custom progress bar, and (d) may in fact be best under many circumstances. This is especially true if you are trying to write portable code (duh!).

To demonstrate my point, I first took a table with 125,000 records with no index (CDX) file. This first case will process each record in the table. My benchmarking methodology here would make anyone who has ever done benchmarking in a scientific way before blood boil, but hey, I never claimed to be a scientist (I don’t even play one on TV)!

First, are we all on the same page?

SET ANSI OFF
SET EXACT OFF
SET TALK OFF
SET DELETED ON
CLEAR

Carry on…

USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    INSERT INTO crName VALUES (name_list.name)
    SKIP
ENDDO
? SECONDS() - nSecs && results: 0.313 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    INSERT INTO crName VALUES (name_list.name)
NEXT
? SECONDS() - nSecs && results: 0.281 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL
nSecs = SECONDS()
SELECT name FROM name_list INTO CURSOR crName
? SECONDS() - nSecs && results: 0.000 seconds (10 runs)
USE IN name_list

So what on earth is going on? The SQL Statement processes that entire result in (literally) no time at all! The answer can be found in a comment made in a recent blog posting of mine. Sergey explained that VFP doesn’t actually run any SQL. All it does is create a filtered cursor on the underlying dataset. This is why RECCOUNT could be “wrong”, as discussed in my blog entry. Check out that link to Sergey’s site to read more about this. You can check what you got simply by issuing JUSTEXT(DBF()) (as Sergey does): “DBF” means a filtered cursor was created, while “TMP” means that a new cursor was created.

Additional Notes:

When there is a CDX file (with no index set), the FOR loop seems to suffer a little: On average, the FOR loop takes an additional .020 seconds to run. The existence of a CDX file does not seem to bother the DO WHILE or SQL method in the above case.

Now, let’s add a condition that the name was added to the table on or after 01/01/2005. The first run of this test is with no CDX file at all. If a CDX file exists, but no index set, then the results are a bit different (see notes after the code):

 
USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    IF created>={^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
    SKIP
ENDDO
? SECONDS() - nSecs  && results: 0.234 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    IF created>={^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
NEXT
? SECONDS() - nSecs  && results: 0.235 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL
nSecs = SECONDS()
SELECT name FROM name_list WHERE created>={^2005-01-01} INTO CURSOR crName
? SECONDS() - nSecs  && results: 0.125 seconds (10 runs)
USE IN name_list

Here, with no CDX file, VFP seems to actually process the SQL (it doesn’t create a filtered cursor), and does so rather fast. If you add an INDEX on created date, FoxPro goes ahead and creates the filtered cursor (which again is processed in zero seconds).

OK, so now what if I actually use the INDEX, and SET ORDER TO before all of this?

USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    IF created>={^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
    SKIP
ENDDO
? SECONDS() - nSecs  && results: 1.047 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    IF created>={^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
NEXT
? SECONDS() - nSecs  && results: 0.235 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
SELECT name FROM name_list WHERE created>={^2005-01-01} INTO CURSOR crName
? SECONDS() - nSecs  && results: 0.000 seconds (10 runs)
USE IN name_list

The Do WHILE really gets hit hard, as you can see. The FOR LOOP works as
well as before, and the SQL once again gives us a filtered cursor.

As you can see, you can really take advantage of this behavior to create simple cursors with simple predicates. Even if VFP can’t create the filtered cursor (for example, applying an ORDER BY clause on the SQL statements), the results are still usually faster.

In a subsequent posting, I’ll increase the complexity of these tests. In the meantime, I would be interested in any feedback to these cases.

Published May 1st, 2007

VFP gotcha with _TALLY, RECCOUNT, DELETED, and an INDEX

Update: FYI, this is a known “issue” and works this way by default. See comments for more details.

Maybe I’m missing something (and I hope I am). The following condition exists in VFP7 through to VFP9SP1 (and maybe earlier than 7). I don’t remember ever reading anything about it either.

This one is hard to explain (and believe), so you may need to run the example code to see what I mean. And right off the bat: This issue has nothing to do with the RECCOUNT/_TALLY results we all know about regarding deleted records (RECCOUNT always returns the number of rows in the table regardless of the SET DELETED setting). No, this issue has something to do with an index and the current setting of DELETED. When an index is set on a column involved in a WHERE clause of an SQL statement, RECCOUNT will return the total number of records from the ‘FROM’ table instead of the results from the cursor when SET DELETED is OFF! It seems as though the resulting cursor has a bunch of deleted records (the scrollbar in the browse window only goes half-way up (if you know what I mean)), but they can’t be accessed regardless of the SET DELETED setting.

The problem is with code like this:

SET DELETED OFF
SELECT sometable
SET INDEX ON somefield   && can exist in a CDX too
SELECT * FROM sometable WHERE somefield = somevalue INTO CURSOR crTemp
? RECCOUNT('crTemp')   && wrong! RECCOUNT returns the wrong reccount!
? _TALLY               && correct results

Here’s a complete example, using a table called ‘name_list’, which has a column for name c(30) and some deleted records. In the first example, the table name_list does NOT have an index on ‘name’. RECCOUNT and _TALLY return expected results.

USE name_list IN 0 EXCL 
 
SET DELETED OFF
 
SELECT name FROM name_list WHERE name>='H' INTO CURSOR crName
? "WITH DELE OFF:  "
?? _TALLY              && 81,019
?? RECCOUNT('crName')  && 81,019
 
SET DELETED ON
 
SELECT name FROM name_list WHERE name>='H' INTO CURSOR crName
? "WITH DELE ON: "
?? _TALLY              && 80,951
?? RECCOUNT('crName')  && 80,951

Now, add an index to the name column and be amazed:

USE name_list IN 0 EXCL
INDEX ON name TAG name
 
SET DELETED OFF
 
SELECT name FROM name_list WHERE name>='H' INTO CURSOR crName
? "WITH DELE OFF:  "
?? _TALLY              && 81,019
?? RECCOUNT('crName')  && 126,607 -- the actual number of records in name_list!?!
 
SET DELETED ON
 
SELECT name FROM name_list WHERE name>='H' INTO CURSOR crName
? "WITH DELE ON: "
?? _TALLY              && 80,951
?? RECCOUNT('crName')  && 80,951

In the above example, it doesn’t matter if the index is set to name (SET ORDER TO name) or not! Just the fact that the index exists throws off RECCOUNT(). If you do a count on crName with DELETED OFF, you get the expected results:

USE name_list IN 0 EXCL
INDEX ON name TAG name
 
SET DELETED OFF
 
SELECT name FROM name_list WHERE name>='H' INTO CURSOR crName
? "WITH DELE OFF:  "
?? _TALLY              && 81,019
?? RECCOUNT('crName')  && 126,607 -- the actual number of records in name_list!?!
COUNT TO ln
?? ln                  && 81,019

The moral is not to use RECCOUNT() to get the total number of records from
an SQL statement. Always use _TALLY. But unfortunately, this undocumented behavior has certainly crept into someone’s code somewhere. Consider a situation where you want to test to see if the SQL returns any records. You might do something like this:

SELECT * FROM sometable WHERE somefield = somevalue INTO CURSOR crTemp
IF RECCOUNT() > 0
  ...
ENDIF

This code is fatal if in fact there happens to be an index on somefield and SET DELETED is OFF! If anyone has some insight, I’d love to hear it…

Published April 27th, 2007

HTML Standards? Anyone? Anyone?

Only a sleight deviation, I promise…I have a question for the spaces.live team. Why on earth can’t you get the headers right on these pages? Firefox can handle it, but I truly wonder what some older browsers do with this site. The W3C says this about this very page:

The MIME Media Type
(application/xhtml+xml) indicated parse mode should be XML, but the DOCTYPE Declaration indicates
SGML mode. Using XML mode based on the Content-Type header."

This is just one of 469 validation failures.

To make matters worse, this site also fails the Section 508 test (Google: “Cynthia Says”), which is designed for setting a minimum bar for accessibility standards.

OK… sorry. It’s not like my Blogger account fairs much better (but I think David Shae has some Standards-compliant templates in their somewhere…)