Posts Tagged SQL

Differences Between Aggregates and Summaries

I find the need to put down some of my thoughts regarding aggregates and summaries, especially on how they apply to SQL and DW/BI. To me, aggregates and summaries are not the same, but I have come to terms with the fact that most users, developers, modelers, and architects use the terms interchangeably. There are two pairs of terms this applies to (one pair is the noun form, while the other is the verb form):

  • n. Aggregate / Summary
  • v. Aggregating / Summarizing

Aggregate (plural aggregates) and summary (plural summaries) datum are created from functions placed upon underlying atomic data. The primary differences between the two lies in two areas: (a) what type of function is applied and (b) how the atomic data is organized as inputs to the function.

General rule: Any function that requires a grouping (typically along a hierarchy) is an aggregate, while any function applied to the entire population of data is a summary. Summaries can be derived from aggregates, but I am hard pressed to come up with an example where the reverse is true.

Functions such as SUM, AVG, MIN, and MAX can be used to derive both aggregates and summaries (by aggregating or summarizing), as you can apply them to an entire dataset or to groups within the dataset. In SQL however, you are forced to define a grouping when you include one or more attributes with the summary statistic. In this case, you have created an aggregate:

SELECT region, SUM(sale_amount) 
    FROM customer c 
    JOIN sales s ON c.cid = s.cid 
    GROUP BY region

A summary:

SELECT SUM(sale_amount) FROM sales

Note that you can create summaries from aggregates. Obviously, SELECT sum(sale_amount) FROM sales doesn’t really tell you much about your data. To aggregate by time or product line is much more useful. Consider the following, very useful aggregate and summary table:

-- get sales by region and month for 2007
SELECT region, sale_month, SUM(sale_amount) AS amount 
    INTO #tmpAggregate
    FROM customer c 
    JOIN sales s ON c.cid = s.cid
    WHERE YEAR =  '2007' and region = 'NE'
    GROUP BY region, sale_month
 
-- summarize the above by giving you total sales
-- in the North East for 2007. Note that this type of
-- summary is usually done in a reporting tool
SELECT SUM(amount) 
    INTO #tmpSumary
    FROM #tmpAggregate

Frequencies and other categorical functions are only valuable when applied against groups. Imagine a histogram without categories! Likewise, nearly all of the aggregate functions available in the majority of programming languages are most useful when applied against groups. This includes functions such as standard deviations, variances, means, and so on.

When an elementary arithmetic function (such as division) is performed on groups of data, the resulting calculation can be called a summary. If the function is applied within the row, it is called a calculation. To restate: arithmetic functions that span groups are a special type of summary. An example of this might be to find the ratio of one group as compared to another (a GROUP BY clause applied to two different study groups in the data).

While it might be a good idea to put together a table to illustrate my point about how functions and inputs affect the terminology, I’ll leave that exercise for another day. That type of project will take a little more thought!

Another difference between an aggregate and a summary is how the two are presented. Aggregates are rarely presented with the atomic data — not to suggest that the underlying data should not be available. Summaries most always come before or after the underlying data is presented to the user.

Am I off base? Do I have it all wrong?

Tags: ,

3 Comments

Use .sql Files with SQLEXEC for Better Script Management

The SQLEXEC command in FoxPro allows you to sends SQL statements to a database where they can be executed, returning one or more result sets. You might make a call like this in your application:

SQLEXEC(lnConnHand,'SELECT * FROM Customer')

Of course, using SQLEXEC in this way isn’t a great architectural choice, although serviceable. It is difficult to manage code in this way and difficult to support varying back end databases. I would never code customer applications in this way. Instead, I would load an external SQL file and execute that instead.

SQLEXEC(lnConnHand,FILETOSTR("c:sqlserver2005script1.sql"))

This approach not only decouples the proprietary SQL code from my application, but it also allows me to test and review the scripts in the native application (and likely get syntax coloring and intellisense). For example, script1.sql might contain TSQL code that I can review (and test) in SSMS. Even better, if I have a lone client running SQL Server 2000, I can use a different script if a new command (like INTERSECT, for example) is not available. This approach will prove even more valuable as some clients upgrade to SQL Server 2008. I will be able to take advantage of some of the new, faster features of SQL Server 2008 while still supporting older versions.

Using Metadata

To take this a step further, I have built metadata to manage these scripts. Consider an application that supports 4 different back-end databases: Visual FoxPro, SQL Server 2000, SQL Server 2005, and MySQL. Create the following table and store all SQL for each back-end database in the appropriate memo field:

CREATE TABLE sql_meta FREE (;
  query_id character(15),;
  sql_vfp m,;
  sql_ss2005 m,;
  sql_ss2000 m,;
  sql_mysql m)

Include this table in your project. In your beforebuild event in your project hook, import or update the sql scripts from your development folders into this table when you build your application.

Now, when you want to execute a script, your code would look a little like this:

*-- create the query/sql management object and execute the code
oMQ = CREATEOBJECT( "manage_query" ) 
SQLEXEC(lnConnHand, oMQ.get_sql('SQL_ss2005','CUSTOMER'))
 
*-- ok, so not very bullet-proof but you get the idea:
DEFINE CLASS manage_query AS Custom
 
  PROCEDURE get_sql
    LPARAMETERS tcBackEnd, tcQueryId	
    SELECT sql_meta 
    LOCATE FOR query_id = tcQueryId 
    RETURN sql_meta.&tcBackEnd
  ENDPROC 
 
ENDDEFINE

I’m a big fan of writing readable code. In fact, it is a priority in my development tasks. I’m also a big fan of making testing easier. This approach allows me to view and test each SQL script outside of my business and data access objects. It also makes for more readable and manageable code.

I’ve left out some examples on how you could use parameters or SQLPREPARE using this approach. I’ll be sure to make this a topic in a future entry.

Tags: , ,

7 Comments

Find Unmatched Keys from Two Data Sets

Here’s a real fast way to find the keys (natural or surrogate) that are missing from Set A or Set B. This query is similar to finding the symmetric difference between two tables, except that it only compares keys:

SELECT DISTINCT EVL(tableA.pkey, tableB.pkey) AS not_in ;
  FROM tableA ;
    FULL OUTER JOIN tableB ON tableA.pkey  = tableB.pkey ;
  WHERE tableA.pkey IS NULL OR tableB.pkey IS NULL

I used EVL in order to show a unique set of values that are missing from either TableA or TableB. To see exactly which keys are missing from each table, simply leave the EVL out and display both columns.

I tested this on two tables (A & B) with more than a million records using an integer surrogate key. The query took 8.6 seconds. A similar approach works in TSQL. Just use COALESCE instead of EVL and you’re off to the races!

Tags: , , ,

2 Comments

The “Not Intersect” or “Symmetric Difference” Between Two Tables

A few months back I wrote an entry on how to achieve an INTERSECT using VFP. That post was inspired by the INTERSECT command that exists in SQL Server 2005. In set theory, the intersect can be defined as follows: “x is an element of sets A and B if and only if x is an element of A and x is an element of B.”

But what about returning non-intersecting data from two sets? This would be like a union, but that we only want a list of unique rows within each table, that is, the difference. The difference between two sets could be defined as: “x is an element of set A or B when x is not an element of A and B.” In set theory, this is called a “symmetric difference” (like XOR for x-basers!).

The SQL to accomplish this is almost identical to the Intersect query. Here is the general form in VFP:

SELECT [key field], [attributes..] ;
  FROM (;
    SELECT [key field], [attributes..] FROM A ;
    UNION ALL ;
    SELECT [key field], [attributes..] FROM B ;
  ) AS tmp ;
  GROUP BY [key field], [attributes..] ;
  HAVING COUNT(*) = 1

The change is in the HAVING clause, where instead of checking to see if COUNT(*) = 2, we check to see if COUNT(*) = 1. This gives us the “Not Intersect” of two sets: perfect for determining how two tables are different! The result is a complete list of non-intersecting data.

Here’s an actual example in VFP (followed by an example using T-SQL):

*-- two sets
CREATE CURSOR crTempL (keyfld i)
CREATE CURSOR crTempR (keyfld i)
 
*-- test data
INSERT INTO crTempL VALUES(1)
INSERT INTO crTempL VALUES(3)
INSERT INTO crTempL VALUES(5)
INSERT INTO crTempL VALUES(7)
INSERT INTO crTempL VALUES(9)
 
INSERT INTO crTempR VALUES(3)
INSERT INTO crTempR VALUES(6)
INSERT INTO crTempR VALUES(9)
INSERT INTO crTempR VALUES(12)
INSERT INTO crTempR VALUES(15)
 
*-- the difference
SELECT keyfld ;
  FROM (;
    SELECT keyfld FROM crTempL;
    UNION ALL ;
    SELECT keyfld FROM crTempR;
  ) AS tmp1 ;
  GROUP BY keyfld  ;
  HAVING COUNT(*) = 1

Now, here is the equivalent in T-SQL.

-- two sets
CREATE TABLE #crTempL (keyfld INTEGER)
CREATE TABLE #crTempR (keyfld INTEGER)
 
-- test data
INSERT INTO #crTempL VALUES(1)
INSERT INTO #crTempL VALUES(3)
INSERT INTO #crTempL VALUES(5)
INSERT INTO #crTempL VALUES(7)
INSERT INTO #crTempL VALUES(9)
 
INSERT INTO #crTempR VALUES(3)
INSERT INTO #crTempR VALUES(6)
INSERT INTO #crTempR VALUES(9)
INSERT INTO #crTempR VALUES(12)
INSERT INTO #crTempR VALUES(15)
 
-- the query
SELECT keyfld 
  FROM (
    SELECT keyfld FROM #crTempL
    UNION ALL 
    SELECT keyfld FROM #crTempR
  ) AS tmp1 
  GROUP BY keyfld  
  HAVING COUNT(*) = 1

As you can see, the syntax is almost identical. Thank you ANSI SQL standards!

Can anyone come up with a better or more efficient way to find the Symmetric Difference between two tables in TSQL, VFP, or any other DB?

Tags: , , ,

1 Comment

Delete Duplicates from a Table

The need to delete duplicate records from a table rises up all the time, whether you are importing data from a third party, integrating systems, or cleaning dirty data. Duplicate data is a reoccurring quality issue.

A duplicate records does not necessarily mean an exact copy of another record (as if you pressed CTRL+Y with SET CARRY ON in VFP). It could mean that all of the important identifying attributes are duplicated but other data, such as a surrogate key or a last updated timestamp are not. So here is a technique that I’ve used to delete a row if it is in fact a duplicate based on attributes in the table:

DELETE FROM [tablename] ;
  WHERE [primary key] NOT IN ;
    (SELECT MAX([primary key]) FROM [tablename] ;
      GROUP BY [attribute1], [attribute2], [attribute3])

To apply this to a Customer entity, for example, you can try the following (the first half of this code is setup, scroll down to see the delete):

*-- example table
CREATE TABLE Customer FREE   (;
    cust_key integer         ,;  && a surrogate key
    cust_id character(3)     ,;  && natural key
    cust_fname varchar(20)   ,;  && first name
    cust_lname varchar(20)   ,;  && last name
    cust_phone character(14) ,;  && phone number
    cust_lupd datetime       ,;  && last updated
    cust_user character(15)   )   && user last updated
 
*-- need data 
INSERT INTO Customer VALUES (1,"AM1","Andrew","MacNeill","555-5551",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (2,"AK1","Andy","Kramek","555-5552",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (3,"CC1","Cesar","Chalom","555-5553",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (4,"CB1","Craig","Baily","555-5554",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (5,"DC1","Dave","Crozier","555-5555",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (6,"DS1","David","Stevenson","555-5556",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (7,"GF1","Garrett","Fitzgerald","555-5557",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (8,"KR1","Kevin","Ragsdale","555-5558",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (9,"KR1","Kevin","Ragsdale","555-5558",DATETIME(),"ZorroMeansFox")
INSERT INTO Customer VALUES (10,"RS1","Rick","Schummer","555-5559",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (11,"RS2","Rick","Strahl","555-5510",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (12,"RS2","Rick","Strahl","555-5510",DATETIME(),"ZorroMeansFox")
INSERT INTO Customer VALUES (13,"RS2","Rick","Strahl","555-5510",DATETIME(),"FanakMeansFox")
INSERT INTO Customer VALUES (14,"SB1","Stephen","Bodnar","555-5511",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (15,"SD1","Stuart","Dunkeld","555-5512",DATETIME(),"TodMeansFox")
INSERT INTO Customer VALUES (16,"TR1","Ted","Roche","555-5513",DATETIME(),"TodMeansFox")
 
*-- delete duplicates (we don't care about the surrogate 
*-- key or the audit information).
DELETE FROM Customer ;
  WHERE cust_key NOT IN ;
    (SELECT MAX(cust_key) FROM Customer ;
      GROUP BY cust_id, cust_fname, cust_lname, cust_phone)
 
*-- results
SET DELETED OFF
BROWSE NORMAL

This works by selecting the most recent issuance of the primary key (using MAX) and discarding the others using the subquery. If you run the subquery on its own, you can see how SELECT MAX plus the GROUP BY clause gives you the most recent version of the row that is distinct from the others (’Rick’ and ‘Kevin’ only show once) by leaving out fields that don’t identify the row (like auditing information or perhaps de-identified attributes). Next, we delete everything that is not in this list.

This works in all ANSI SQL environments were subqueries are supported in the DELETE statement.

Tags: , , ,

9 Comments

What’s faster? SQL or looping to return a set of records?

When you need to get a set of records (for example, to return a list of items from a lookup table, to prepare a report, or to fill a read-only grid), which method should you use?

Of course, the answer is ‘it depends’ and many factors are at play. It is part of our job as developers (or programmers, if you’d prefer) to pick the solution that best fits the scenario. As much has been written on this topic already, I won’t be revealing anything new here: Just a re-package, if you will.

First, the basics. There are several ways to go through a table with the intention of producing a set of resulting records. You can SCAN, DO WHILE, use a FOR..NEXT loop, or use SQL syntax. The first three options are procedural and process one record at a time. The last (SQL) is set-based. To get an idea on what looping mechanism to use, take a look at Andy Kramek’s blog entry from March 2006 called “Writing Better Code (Part 2)”. In this entry, Andy does a great job of showing you the effects of indexes on the three looping mechanisms, and should give you a good idea which one to use and when.

Using SQL syntax, you can return a result of records as well. You can replace your looping syntax with SQL syntax and get the same results (using much less code). When gathering data for a report, for example, I have often used the loop method to populate a special report cursor. I was inclined to stick with the procedural approach in the past because (a) it was easy enough to write, (b) easy to debug, (c) allowed me to easily insert a custom progress bar if needed, and (d) because I thought it was best. As my SQL skills improved, I found that SQL syntax is actually (a) easier to write, (b) just as easy to debug, (c) capable of supporting a custom progress bar, and (d) may in fact be best under many circumstances. This is especially true if you are trying to write portable code (duh!).

To demonstrate my point, I first took a table with 125,000 records with no index (CDX) file. This first case will process each record in the table. My benchmarking methodology here would make anyone who has ever done benchmarking in a scientific way before blood boil, but hey, I never claimed to be a scientist (I don’t even play one on TV)!

First, are we all on the same page?

SET ANSI OFF
SET EXACT OFF
SET TALK OFF
SET DELETED ON
CLEAR

Carry on…

USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    INSERT INTO crName VALUES (name_list.name)
    SKIP
ENDDO
? SECONDS() - nSecs && results: 0.313 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    INSERT INTO crName VALUES (name_list.name)
NEXT
? SECONDS() - nSecs && results: 0.281 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL
nSecs = SECONDS()
SELECT name FROM name_list INTO CURSOR crName
? SECONDS() - nSecs && results: 0.000 seconds (10 runs)
USE IN name_list

So what on earth is going on? The SQL Statement processes that entire result in (literally) no time at all! The answer can be found in a comment made in a recent blog posting of mine. Sergey explained that VFP doesn’t actually run any SQL. All it does is create a filtered cursor on the underlying dataset. This is why RECCOUNT could be “wrong”, as discussed in my blog entry. Check out that link to Sergey’s site to read more about this. You can check what you got simply by issuing JUSTEXT(DBF()) (as Sergey does): “DBF” means a filtered cursor was created, while “TMP” means that a new cursor was created.

Additional Notes:

When there is a CDX file (with no index set), the FOR loop seems to suffer a little: On average, the FOR loop takes an additional .020 seconds to run. The existence of a CDX file does not seem to bother the DO WHILE or SQL method in the above case.

Now, let’s add a condition that the name was added to the table on or after 01/01/2005. The first run of this test is with no CDX file at all. If a CDX file exists, but no index set, then the results are a bit different (see notes after the code):

 
USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    IF created>={^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
    SKIP
ENDDO
? SECONDS() - nSecs  && results: 0.234 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    IF created>={^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
NEXT
? SECONDS() - nSecs  && results: 0.235 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL
nSecs = SECONDS()
SELECT name FROM name_list WHERE created>={^2005-01-01} INTO CURSOR crName
? SECONDS() - nSecs  && results: 0.125 seconds (10 runs)
USE IN name_list

Here, with no CDX file, VFP seems to actually process the SQL (it doesn’t create a filtered cursor), and does so rather fast. If you add an INDEX on created date, FoxPro goes ahead and creates the filtered cursor (which again is processed in zero seconds).

OK, so now what if I actually use the INDEX, and SET ORDER TO before all of this?

USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    IF created>={^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
    SKIP
ENDDO
? SECONDS() - nSecs  && results: 1.047 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    IF created>={^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
NEXT
? SECONDS() - nSecs  && results: 0.235 seconds (10 runs)
USE IN name_list
 
*-----
 
USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
SELECT name FROM name_list WHERE created>={^2005-01-01} INTO CURSOR crName
? SECONDS() - nSecs  && results: 0.000 seconds (10 runs)
USE IN name_list

The Do WHILE really gets hit hard, as you can see. The FOR LOOP works as
well as before, and the SQL once again gives us a filtered cursor.

As you can see, you can really take advantage of this behavior to create simple cursors with simple predicates. Even if VFP can’t create the filtered cursor (for example, applying an ORDER BY clause on the SQL statements), the results are still usually faster.

In a subsequent posting, I’ll increase the complexity of these tests. In the meantime, I would be interested in any feedback to these cases.

Tags: , , ,

4 Comments

SQL implementations, database servers, and VFP

When connecting to a database server (perhaps using the ADO connection object or SQL Passthrough (SPT)), you need to keep in mind that some familiar VFP SQL syntax won’t work against a database that doesn’t support it. For a simple example, in SPT, you cannot do the following on a SQL Server 2005 database (for more info on connecting to SQL Server, see the article ‘How to Connect to SQL Server from Visual FoxPro’):

cConn = ;
    "driver=SQL Server;server=SERVERNAME;database=DATABASE;uid=;pwd="
nHnd = SQLSTRINGCONNECT(cConn,.f.)
IF nHnd <= 0
  AERROR(aErrs)
  =MESSAGEBOX('Connection failed. ' + aErrs[1,2], 16, 'SQL Error')
ELSE
  *-- VFP allows abbreviation of most keywords to 4 letters,
  *-- SQL Server does not. This code generates an error:
  cSQL = "SELE TOP 50 * FROM store_transactions"
  SQLEXEC(nHnd ,cSQL, "crTemp")
  =SQLDISCONNECT(nHnd)
  SELECT crTemp
  BROWSE NORMAL
ENDIF

While this may be annoying for those of us who are trying to save a few miles in our carpel tunnels, it should be a reminder that the SQL we use is native to the database we’ve connected to. This is extremely important to understand because there may be some behavioral differences between VFP and other languages, which may not trigger a syntax error! I have yet to find a good resource that details these differences from a Visual FoxPro point of view (if anyone has a link, please pass it along!). Here is a site, prepared by Troels Arvin called ‘Comparison of different SQL implementations’ that compares Standard ANSI SQL to PostgreSQL, DB2, MS SQL, MySQL, and Oracle. Troels has licensed this under Creative Commons, which means it might be worth adding the document to the fox wiki with a VFP category.

On the flip side, it opens several doors into the proprietary SQL language of the connected database. For example, SQL Server 2005 supports different types of table operators (like APPLY, PIVOT, and UNPIVOT) and some new SET operations (EXCEPT and INTERSECT). The OVER clause is also very handy (for analytical rankings such as row_number, rank, dense_rank, and ntitle). While I’m no T-SQL guru, consider the following syntax as a healthy example of what I’m talking about. It uses the PIVOT operation (and a CASE statement for good measure) to create a silly little pivot table that lists how many registers are used just a few times, more than a few, and a lot of times.

cConn = ;
    "driver=SQL Server;server=SERVERNAME;database=DATABASE;uid=;pwd="
nHnd = SQLSTRINGCONNECT(cConn,.f.)
IF nHnd <= 0
  AERROR(aErrs)
  =MESSAGEBOX('Connection failed. ' + aErrs[1,2], 16, 'SQL Error')
ELSE
  cSQL = "SELECT store_id, afew, more, alot " + ;
      "FROM ( " + ;
      "  SELECT store_id, store_register, " + ;
      "    CASE WHEN COUNT(tran_id) < 10 THEN 'afew' " + ;
      "      WHEN COUNT(tran_id) < 50 THEN 'more' " + ;
      "      WHEN COUNT(tran_id) >= 50 THEN 'alot' " + ;
      "    END AS totals " + ;
      "  FROM store_transactions " + ;
      "  GROUP BY store_id, store_register " + ;
      ") as tmp " + ;
      "PIVOT( COUNT(store_register) " + ;
      "  FOR totals IN(afew,more,alot)  ) AS tmp2"  
 
  SQLEXEC(nHnd ,cSQL, "crTemp")
  =SQLDISCONNECT(nHnd)
  SELECT crTemp
  BROWSE NORMAL
ENDIF

Also note that using SQLEXEC, you can create stored procedures in the target database, among other things.

With VFP9, there is greater support for subqueries and other SQL elements common to other languages. Because of this, it is easy to write VFP-style SQL code that can be interpreted by a database server such as SQL Server without modification. But be aware of the nuances between the various SQL implementations and certainly take advantage of the propriety features (like PIVOT)!

Tags: , ,

No Comments