Posts Tagged VFP

New VFP User Group in Iraq

The FoxPro community has always been diverse and vibrant, with user groups spread out all over the world. Although VFP9 is the last installment we’ll get from Microsoft, I’m happy to see and re-report that a new group has been created: Iraqi Visual FoxPro Programmers Group (Iraqi.vfppg)

You can read the original announcement here on Foxite.

Founder Ammar Hadi writes:

It was my dream to build this group since about more than 4 years ago, and at that time I send an e-mail to one of the Iraqi Foxers who is a member in foxite too (we never met) and told him that I will work on creating a foxpro group in Iraq. He show his readiness to help me on that. But at that time, I was totally engaged in my Medical study to get the certificate of Neurosurgical Profession. Now it is about a year and a half since I finished my study and training and got the certificate to work as a professional neurosurgeon (with very good marks ;-D ….. ). Now I can say that I got some more time to make my dream come true.

I wish Ammar the very best!

Tags: , , ,

ETL Subsystem 31: Paralleling and Pipelining

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

intro 150x150 ETL Subsystem 31: Paralleling and PipeliningOf all the subsystems that I’ve discussed so far, this one resulted in the most research. I had to (a) learn more about how paralleling works and (b) experiment with my environment to better understand it.

Honestly, I’ve taken this subsystem for granted over the years. And for VFP, I’ve done little exploration in this arena. For SSIS, I have tended to adjust the settings I can adjust (more on this below), monitor the results, and tweak my performance as needed. In some environments, this lackluster approach will get you by just fine. If you have very small load windows and performance is critical, then you’ll need to make a better effort.

So what is Paralleling and Pipelining?

Lumped together into a single subsystem, these two performance means are related but different. They’re cousins, I suppose. Running your ETL processes in parallel means that your ETL system is carrying out multiple operations simultaneously. Pipelining your ETL processes means that you can start new operations before the previous ones complete.

Paralleling and Pipelining are quite desirable. And depending on your tool of choice, taking advantage of them can be painless or painful.

How can you apply them?

You can achieve parallel processing by utilizing the CPUs on a single machine, or you can utilize multiple machines. The first option is the easiest to set up and results can be quite good. For a multiple CPU (or core) setup, you are actually running code (programs, algorithms) simultaneously, potentially doubling performance, all on the same box. You can scale out (i.e. scale horizontally) your ETL processes by adding computers (see What is distributed computing? by Kirk Pearson), allowing you to take advantage of the CPUs, RAM, and I/O of each machine. The latter has some significant design implications, but well worth it if your environment needs it.

assembly line women.thumbnail ETL Subsystem 31: Paralleling and PipeliningPipelining increases throughput. Unlike parallelism, it will not allow the instruction to run faster, but rather it permits downstream processes to start before the upstream process finishes. A great analogy is an assembly line, where parts are added to the whole as it travels down the line.

Getting parallelism and pipelining to work together is the Holy Grail of ETL performance. While certain performance techniques are available at all phases of data integration (from Extraction and CDC, to surrogate key handling and using partitions for fast loading), none can compare to the gains you can realize with this subsystem.

You should also keep in mind that CPU multitasking is different from parallel processing, and multithreading is different from pipelining. A multitasking process shares CPU resources, giving the illusion of parallelism (although one man’s illusion can be another man’s reality). Multithreaded applications share the same memory, but operate on different engine threads (i.e. a subtask). Multitasking and multithreading, like pipelining, increase throughput and also play an important role in performance tuning. I’ll talk a little more about this below in my section about FoxPro. Otherwise, if anyone is interested, I can try to elaborate in another post.

Where can this best be utilized?

Here are some ideas on where you can focus your efforts:

  • When loading historical data or retrieving data from multiple similar sources, execute the same package for different date ranges, at the same time (in SSIS for example, use multiple Execute Package Tasks or run the same package multiple times together as Jamie Thomson explores here), or you could design your historical load packages to break apart the data into separate threads.
  • Spread out UPDATE statements. This can be real handy if you have a few accumulating snapshot fact tables.
  • Spread out complex algorithms and routines that can operate on a subset of data.
  • Load staging tables while downstream processing loads your dimensional model
  • Do lookups (especially surrogate key lookups) in parallel
  • Distribute your conformed dimensions to other machines, data marts, etc. in parallel


SQL Server 2005 Integration Services (SSIS)

As you know, I use SSIS and VFP for ETL (not at the same time or on the same project though). With SSIS, I can quickly create complex routines that can automatically take advantage of multiple processors. The native support for buffers, execution trees, and parallelism makes my job pretty easy (which is why I suppose I’ve taken this subsystem for granted over the years). Simply understand how SSIS works, adjust the settings you need to adjust, monitor your performance, and tweak as needed.

To get a grip, the following resources are invaluable:


Hand Coding with Visual FoxPro (VFP9)

While SSIS and SQL Server have built-in mechanisms to manage most of the paralleling and pipelining responsibilities for you, FoxPro does not. You can achieve some very good results using VFP and multithreading, but you have to be extremely creative in how you handle paralleling and pipelining. If you don’t think this is the case, I’d love to hear how paralleling and pipelining can be achieved with VFP!!!

Of course, the VFP community is — and has always been — quite creative. As with most of this sort of thing, Calvin Hsia is near the front of the line. MTmyVFP (True VFP multi-threading) on CodePlex is a creative example using Hsia’s Multithreading class. For more information and a ton of details, check out:

As I’ve stated before, multithreading is not parallelism, nor is it pipelining. But if you utilize MTmyVFP (or similar solution) in your VFP ETL system, you will realize many performance benefits. Lastly, there was a pretty interesting, albiet short, discussion on this issue here.

From here

This post might have come off a bit long-winded, but there were quite a few important points to make. I hope that I’ve been able to distill what I’ve learned and that in the end, it all makes some sense. In my next ETL post, I’ll talk about ETL Subsystem 32: Security.

Tags: , , , , , , , , , ,

20 Year Old FoxPro Marketing Video

I love this stuff. FoxPro 2.0 was revolutionary and ground-breaking in many ways. The focus on both Mac and PC, the graphic interface, the data access speed. Memo fields. Form view. Debugging. I can really see why FoxPro developers simply are not willing to let go (myself included).

It really is a shame that Microsoft didn’t keep this up. For those looking for some nostalgia, take a look (or a second look) at this marketing video from around 1990 or 91:

Tags: , ,

FoxPro in the News

A nice article in Computer World, uncharacteristically highlighting the tool of choice, Visual FoxPro. Titled “A greener environment through better data management. Data supporting farmers to improve productivity”:

The database software, which is written in Microsoft’s Visual FoxPro, gathers data from interactive PDFs and Web site entries. According to database developer Anne Allen, once the data is received from the farmers, the Visual FoxPro software delivers a report which in turn is sent back to the farmer in an email.

I’d like to see more good press on the VFP front!

Tags: ,

ETL Subsystem 28: Sorting

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Before you can aggregate, summarize, look up, or merge data, you will need it ordered in some way. Sorting, which changes the physical order of the underlying data, is seldom necessary. Instead, indexing, which changes the logical order of the underlying data is often all you need. The Kimball Group seems to use Sort interchangeably with Order, so I won’t deviate in this post. This may be because some databases vendors do so as well.

If you’re loading data from a mainframe, then the utility programs DFSORT or SORT will be of some use. Other utilities exist depending on the vendor, but these are the main ones. The idea is to sort the raw physical data before it hits the ETL pipeline. Doing so will save processing time. Otherwise, you can simply load raw data into your staging area and then use standard SQL to order your data.

Sometimes sorting is implicit, so be careful. In SQL Server, sorting occurs when you use ORDER BY, GROUP BY, SELECT DISTINCT, and UNION. You should, for example, avoid the SELECT DISTINCT clause whenever possible, as the additional sorting can hinder performance. Many programmers and (ehem) trained DBAs automatically add the distinct clause to their queries. Be certain that the distinct clause is necessary; otherwise performance will take a hit.

Beware that sometimes too much sorting can be a problem. Here is a general rule: Only sort when you need to! Sorting is an expensive operation and I’ve seen it overused and abused.

Clusters and Heaps

A clustered index determines the physical order of data in a table. That is, data is inserted into the table in order. Pages are linked for faster sequential access. This is in contrast to a heap table, in which data is added to the end of the table and the data pages are not linked.

Clustered indexes can make a big impact in the staging area. Consider a surrogate key lookup dimension that contains 4 columns: surrogate key, business key, valid_from, and valid_to. A clustered index on business key, valid_from, and valid_to will return the surrogate key upon lookup faster than if your lookup dimension was a heap. This is mainly because “queries that return a range of values using operators such as BETWEEN, >, >=, <, and <=" are well suited for the clustered index structure.

This works well during surrogate key lookups where we typically want the surrogate key of some business key on a particular date. Of course INSERTs and UPDATEs take longer on clustered tables because the physical structure of the table needs to be maintained. So be sure to balance speed with need. In my experience: Maintaining a clustered lookup dimension is almost always well worth the effort, especially when the original dimension is big and wide.

Best Practices

Here are some best practices that might help you keep your sorting operations under control:

  • Create one tempdb data file for each physical processor on the machine (more)
  • Put the tempdb files on their own drive (more)
  • Make sure your data types are appropriate for the column (storing numbers in a varchar field is not very efficient)
  • Keep the columns you sort on as narrow as possible (here is where good data profiling will help)
  • Integer sorting is much faster than character sorting, so consider using integers instead of character data whenever possible (more)


SQL Server 2005 Integration Services (SSIS)

As I have stated, sorting is often overused. The Sort component that ships with SSIS is good for small data sets that fit entirely into memory. Performance will degrade as soon as there is overflow. One way around the problem:

  1. In the Source Adapter, set the data access mode to “The results of an SQL statement”
  2. Use and ORDER BY clause in the SQL Statement
  3. Open the “Advanced Editor” of the Source component
  4. On the Input and Output Properties tab, click the Source Output node and set the property IsSorted to True.
  5. Under Output Columns, set the SortKey for each column involved in the index (start at 1, use a negative number to denote a descending index order)


Hand Coding with Visual FoxPro (VFP9)

Physically sorting a table in FoxPro is not necessary. The physical order of the table is irrelevant because of how VFP uses its indexes. You can test this easily by creating a table with a few million rows containing a randomly generated integer value and another column with some other data (perhaps system time or something). Using the SORT command (or SELECT INTO with an ORDER BY) create a second table in the order of the integer column. Without creating an index, attempt to retrieve a row from the table based on some integer value. Now, create an index on the column in both tables. You should notice no performance differences.

So while you’ll need to understand indexes and rushmore in VFP, you won’t need to bother with physical sorts.

From Here

In the next post, I’ll discuss lineage and dependency analysis. If you work in the financial sector (like me) then this topic will be of critical interest to you. If you don’t, I suspect you still care about lineage and dependencies!

Tags: , , , ,