This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Before you can aggregate, summarize, look up, or merge data, you will need it ordered in some way. Sorting, which changes the physical order of the underlying data, is seldom necessary. Instead, indexing, which changes the logical order of the underlying data is often all you need. The Kimball Group seems to use Sort interchangeably with Order, so I won’t deviate in this post. This may be because some databases vendors do so as well.
If you’re loading data from a mainframe, then the utility programs DFSORT or SORT will be of some use. Other utilities exist depending on the vendor, but these are the main ones. The idea is to sort the raw physical data before it hits the ETL pipeline. Doing so will save processing time. Otherwise, you can simply load raw data into your staging area and then use standard SQL to order your data.
Sometimes sorting is implicit, so be careful. In SQL Server, sorting occurs when you use ORDER BY, GROUP BY, SELECT DISTINCT, and UNION. You should, for example, avoid the SELECT DISTINCT clause whenever possible, as the additional sorting can hinder performance. Many programmers and (ehem) trained DBAs automatically add the distinct clause to their queries. Be certain that the distinct clause is necessary; otherwise performance will take a hit.
Beware that sometimes too much sorting can be a problem. Here is a general rule: Only sort when you need to! Sorting is an expensive operation and I’ve seen it overused and abused.
Clusters and Heaps
A clustered index determines the physical order of data in a table. That is, data is inserted into the table in order. Pages are linked for faster sequential access. This is in contrast to a heap table, in which data is added to the end of the table and the data pages are not linked.
Clustered indexes can make a big impact in the staging area. Consider a surrogate key lookup dimension that contains 4 columns: surrogate key, business key, valid_from, and valid_to. A clustered index on business key, valid_from, and valid_to will return the surrogate key upon lookup faster than if your lookup dimension was a heap. This is mainly because “queries that return a range of values using operators such as BETWEEN, >, >=, <, and <=" are well suited for the clustered index structure.
This works well during surrogate key lookups where we typically want the surrogate key of some business key on a particular date. Of course INSERTs and UPDATEs take longer on clustered tables because the physical structure of the table needs to be maintained. So be sure to balance speed with need. In my experience: Maintaining a clustered lookup dimension is almost always well worth the effort, especially when the original dimension is big and wide.
Here are some best practices that might help you keep your sorting operations under control:
- Create one tempdb data file for each physical processor on the machine (more)
- Put the tempdb files on their own drive (more)
- Make sure your data types are appropriate for the column (storing numbers in a varchar field is not very efficient)
- Keep the columns you sort on as narrow as possible (here is where good data profiling will help)
- Integer sorting is much faster than character sorting, so consider using integers instead of character data whenever possible (more)
SQL Server 2005 Integration Services (SSIS)
As I have stated, sorting is often overused. The Sort component that ships with SSIS is good for small data sets that fit entirely into memory. Performance will degrade as soon as there is overflow. One way around the problem:
- In the Source Adapter, set the data access mode to “The results of an SQL statement”
- Use and ORDER BY clause in the SQL Statement
- Open the “Advanced Editor” of the Source component
- On the Input and Output Properties tab, click the Source Output node and set the property IsSorted to True.
- Under Output Columns, set the SortKey for each column involved in the index (start at 1, use a negative number to denote a descending index order)
Hand Coding with Visual FoxPro (VFP9)
Physically sorting a table in FoxPro is not necessary. The physical order of the table is irrelevant because of how VFP uses its indexes. You can test this easily by creating a table with a few million rows containing a randomly generated integer value and another column with some other data (perhaps system time or something). Using the SORT command (or SELECT INTO with an ORDER BY) create a second table in the order of the integer column. Without creating an index, attempt to retrieve a row from the table based on some integer value. Now, create an index on the column in both tables. You should notice no performance differences.
So while you’ll need to understand indexes and rushmore in VFP, you won’t need to bother with physical sorts.
In the next post, I’ll discuss lineage and dependency analysis. If you work in the financial sector (like me) then this topic will be of critical interest to you. If you don’t, I suspect you still care about lineage and dependencies!