This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand-coded in Visual FoxPro.
One of the major pieces of the Data Integration puzzle is data extraction. Simply put, data extraction is the process of taking data from some source system to be used in some other process. Usually, in an ETL environment, extracted data is staged for further processing (staging is important for lineage and recovery purposes).
Extracting comes in a few different flavors:
- The source system ‘pushes’ the data into the ETL system (through an export, direct write into a staging area, etc.)
- The ETL process ‘pulls’ the data from the source system (usually using SQL to read the data)
- Real-time/right-time systems make data available to the ETL system quite frequently throughout the day, possibly through database triggers or message queues (see Microsoft Message Queuing for more details)
The extraction approach varies with the source system, and the connection strategy may also differ depending on the purpose of the process (a complete historical load versus incremental loads). It is imperative that the decisions on how to extract data from source systems are well documented in the ETL technical documentation. Together with the data mappings, this document will aid integration specialists when developing the extract system and methodology.
Furthermore, you should strive to extract only relevant data. Extra, non-essential data takes up additional disk space and memory during processing. This is especially true in SSIS, where the data flow pipeline works in memory.
In the perfect world
Ideally, the source system will push data to the ETL system at some predetermined, business-relevant interval (hourly, daily, weekly, etc.). The source system will be responsible for determining the changed data; however, you should never fully trust source systems. Always build a CDC system (see my previous post ETL Subsystem 2: Changed Data Capture for more information) to pick up any changed data the source system misses.
The ideal extraction strategy for pull systems relies on trigger-maintained insert and last-update dates on each row in each source table. Pulling the data by date range (perhaps one subset for inserts and another for updates) is quite efficient, and as long as the database itself supplies the dates through triggers, you can use this method with confidence. If the source system uses application logic to populate these dates, steer clear; you’re asking for trouble. There is always at least one rogue update process that will fail to set the correct date!
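To illustrate the pull pattern in FoxPro terms, here is a minimal sketch of a date-range extract over ODBC. The DSN name (SrcSys), the orders table, and the last_update_dt column are hypothetical; in practice you would read the last-run timestamp from an ETL control table.

    * Minimal date-range pull over ODBC; all names here are hypothetical.
    LOCAL lnHandle, lnResult, ltLastRun
    ltLastRun = {^2007-10-01 00:00:00}  && normally read from an ETL control table
    lnHandle = SQLCONNECT("SrcSys")
    IF lnHandle < 1
       MESSAGEBOX("Unable to connect to source system")
       RETURN
    ENDIF
    * ?ltLastRun is a SQL pass-through parameter bound to the local variable
    lnResult = SQLEXEC(lnHandle, ;
       "SELECT * FROM orders WHERE last_update_dt >= ?ltLastRun", "csExtract")
    IF lnResult < 0
       MESSAGEBOX("Extract query failed")
    ENDIF
    SQLDISCONNECT(lnHandle)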
Things are usually less than ideal. Your CDC strategy will help determine how you acquire extract data and what to do with it once you have it. Additional ways to access extract data include vendor-supplied APIs, web services, direct data access, file-based reports, the transaction logs, or some external third-party tool. Be wary of difficult-to-access legacy systems, especially if the resident experts are beginning to ask about retirement!
It is also possible to use database replication. A publisher can distribute transactional data to any number of subscribers, such as an OLTP or ODS server and your ETL system. This scenario works best in a controlled environment where the IT department can manage the replication across systems.
Data Extracted. Now What?
I prefer to stage extracted data for later use (even if ‘later’ is only a few seconds away). You acquire the data from the source and drop it off on disk in some native format (DBF files for FoxPro, RAW files for SSIS). This lets you return to this step if problems are encountered downstream, pause while downstream processes wait on some other condition, preserve data lineage, and recover in case of failure. You do not want to go back to the source to re-capture this data: it isn’t efficient, and a lot can happen to the data while the ETL process fumbles around.
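In FoxPro, that staging drop-off can be as simple as a COPY TO. A minimal sketch, assuming a d:\staging folder and the csExtract cursor from the pull example above:

    * Drop the extract cursor to disk as a native DBF so downstream steps
    * can restart from it without touching the source again.
    LOCAL lcStage
    lcStage = "d:\staging\orders_" + TTOC(DATETIME(), 1)  && yyyymmddhhmmss suffix
    SELECT csExtract
    COPY TO (lcStage)                   && creates orders_<timestamp>.dbf
    * Downstream steps reopen the staged file rather than re-querying the source:
    USE (lcStage) ALIAS csStaged IN 0 SHARED

Timestamping each staged file keeps a lineage trail and lets you replay any run.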
SQL Server 2005 Integration Services (SSIS)
Data extraction in Integration Services is rather easy. There are many different ways to connect to source systems; .NET providers, for example, give you access to many types of systems including Informix Dynamic Server, SAP, and Sybase. Source adapters include:
- DataReader Source: Allows you to read data from a .NET provider (see the ADO.NET team blog for more details) using the SqlCommand property. Tip: the DataReader usually performs better than OLE DB, even for SQL Server. Also, if the incoming data is already sorted, you can declare that for downstream components by setting IsSorted = True on the Input and Output Properties tab (this asserts an existing sort order; it does not sort the data). Remember to set the SortKeyPosition for each Output Column involved in the sort as well.
- OLE DB Source: Uses OLE DB to retrieve data from a variety of sources. You can use this adapter to request a result from a table, view, stored procedure, or SQL query. I like to develop my queries in SSMS or the source system’s editor (MySQL Query Browser or the VFP IDE, for example) if I have one; this helps get any syntax quirks out of the way, since SSIS is a bit lacking when it comes to writing queries in the provided edit box.
- Excel Source: Simplifies access to MS Excel data. I prefer to stay away from using Excel as a data source for incremental loads because it is too easy for the user to break the process by changing the format of the file.
- Flat File Source: A great adapter for reading flat files (fixed width, delimited, ragged right). Most of the configuration actually lives in the flat file connection manager, not in the source adapter itself. I use the flat file adapter more than any other, mainly because it is easy to get data out of source systems as flat files. XML is slowly taking over, however.
- XML Source: Used to read XML data and help convert it to a relational form. Use this adapter to consume XML, either from a file or directly from a variable, shred the collections and hierarchies, and produce multiple outputs. I even use this adapter to shred XML for other (non-data-integration) purposes because it is so easy. Tip: accept only XML with an XSD schema definition; this will save you a ton of time and effort profiling the XML to determine destination details.
- Raw File Source: Reads the internal SSIS format, which can be used to stage data or exchange data among packages. It is a very fast source to read from because the raw file format is the same format SSIS uses in memory; SSIS simply loads the raw data and you’re off to the races.
I also make heavy use of the FTP Control Flow Task and have also used the Web Service Control Flow Task to retrieve data. Both of these tasks are relatively easy to set up and use.
Also, for real-time systems (using the dribble extraction approach), you will likely be looking into the Message Queue Task and WMI Event Watcher Task. I have not had the privilege of working in a dribble environment (yet), so I don’t have much more to offer at this time.
Hand Coding with Visual FoxPro (VFP9)
Of course, much of the above is easily accomplished in Visual FoxPro. You won’t have a colorful drag-and-drop editor to work with, but developing your own classes to accomplish what SSIS can do is rather trivial (technically, you can then use the Toolbox panel and get some drag and drop if you want). With the available adapters (ADO, ODBC, XML, native) and the ability to read from and append from files directly, FoxPro is well positioned as a data extraction tool. Not only that, but you can use the native DBF format as a staging format, making downstream operations that much faster (don’t stage to flat files when you can stage into native tables in a staging database).
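As a sketch of the file-based approach (the file names and table structure below are hypothetical):

    * Land a delimited extract into a staging table...
    CREATE TABLE d:\staging\customers_stage ;
       (cust_id I, cust_name C(60), city C(40), last_update_dt T)
    APPEND FROM d:\incoming\customers.csv TYPE CSV  && VFP9 handles quoted CSV

    * ...then shred an XML extract into a cursor and stage it as a DBF.
    * Flag 512 tells XMLTOCURSOR() that the first argument is a file name.
    XMLTOCURSOR("d:\incoming\orders.xml", "csOrders", 512)
    SELECT csOrders
    COPY TO d:\staging\orders_stage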
I outlined a rather comprehensive data extraction and staging strategy in my article “From Source to Warehouse: A Microsoft Visual FoxPro Data Integration Strategy”. I also talked at length about certain techniques in my Fox Forward and AFUG presentations earlier this year (2007 Fox Forward Presentation Examples).
Once data is extracted and sitting in a staging area (or in memory if you don’t use a physical stage), you can move on to cleaning it. The 4th subsystem is just that: The Data Cleansing Subsystem.