Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Visual Studio’


Published April 25th, 2008

ID Lookup Custom Transform (Part 3 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In Part 2, I walked through AcquireConnections and PreExecute. In this post, I’ll discuss ProcessInput. Please read the introduction , part 1, and part 2 if you haven’t already!

ProcessInput

ProcessInput accepts each row from the incoming buffer. This buffer is available via a parameter apply named “buffer” of type PipelineBuffer. The buffer contains each column defined in the column collection (IDTSInputColumnCollection90) of the input.

For ID Lookup, each row from the buffer is examined, and based on properties set by the user, appropriate action is taken.

The first step is to see if I can find an ID in the master lookup table based on key fields in the incoming data file. This functionality is controlled by a series of properties on the component. Normally, this search would be conducted on one or more input columns (lname, fname, and dob for example). The developer, after having done a data profile on the source file, will know exactly what columns will be used in the lookup. The code to build the SQL statement is defined like so:

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
 
  bool nullOutput = false;      // if the input contains a NULL, we want to reject the row     
 
  if (!buffer.EndOfRowset)
  {
    while (buffer.NextRow())
    {
      string strSelectWhere = "";   // to store the where clause for the search
 
      // if the columnIndex is in the input string. Recall that I figured this out in PreExecute
      if (indexMatchValue > -1) 
      {
        // build a where string for the search
        if (buffer.IsNull(indexMatchValue))
          nullOutput = true;      // fail a search on NULL as an UNMATCHED row
        else
          strSelectWhere += " lname='" + buffer.GetString(indexMatchValue);
      }

The block of code starting with line ” if (indexMatchValue > -1)” is repeated several times, with some variation, for many additional properties set by the user. The purpose is to build up a where string (stored in strSelectWhere) for the search to be made on the data table (dtMstr). This data table was filled in the PreExecute method.

Before going any further, I want to see if there are any nulls in any of my search columns. If so, I want to redirect to one of my outputs (specifically, IDLookupNoMatchOut), and move on to the next row:

      if (nullOutput == true)
      {
         buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
         continue;
      }

Once the strSelectWhere statement is built, I try to find a match in the lookup table:

      DataRow[] aDR;
      aDR = dtMstr.Select(strSelectWhere);

I had my doubts about the speed of this approach. I really thought that the Select method, repeated for each row in the buffer, would kill my grand idea. But I was pleasantly surprised to see this perform well beyond my expectations. Someday as time permits, I’ll try benchmarking it.

Based on the results from the Select, the logic flows along the following set of actions:

  • (1) If exactly one match is found in the lookup table, use buffer.DirectRow(int outputID) to output the row to the “IDLookupOut” Output.
  • (2) If more than one match is found, the candidate key in the source file is a duplicate in the master data table. Use buffer.DirectRow(int outputID) to output the row to the “IDLookupDuplicateOut” Output for later handling.
  • If no matches are found, do another Select on a historical lookup table in a different database. The code for this is not shown to keep this posting simple.
    • (3) If exactly one match is found in the historical lookup table, direct the output to both the “IDLookupOut” Output and to IDAddToMasterData Output. The latter will allow me to populate the current master data with the ID found so that we can maintain referential integrity in the database.
    • (4) If more than one match is found, direct the output to the “IDLookupDuplicateOut” Output for later handling.
  • (5) If no matches are found in the local or historical data, direct the row to the “IDLookupNoMatchOut” Output.

For components that conditionally direct inputs to multiple outputs (like a Conditional Split or my ID Lookup component), you need to use buffer.DirectRow to tell the component what output to use.

Here’s the code:

      if (aDR.Length == 1)
      {
        // (1) One record found. Action: Direct row to IDLookupOut
        buffer.SetString(indexIDValue, aDR[0]["id"].ToString());   // the column to put the ID
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupOut"].ID);
      }
      else if (aDR.Length > 1)
      {
        // (2) More than one entry found. Action: Direct row to IDLookupDuplicateOut
        buffer.SetString(indexIDValueDupe, aDR[0]["id"].ToString()); // store dupe ids 
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupDuplicateOut"].ID);
      }
      else if (aDR.Length == 0)
      {   
        if (Allow3rdPartyLookup== true)
        {
          // (3) Search the historical database, if single match found Action: 
          //       Direct row to IDLookupOut
          //       and Direct Row to IDAddToMasterData 
          // (4) If duplicates found in 3rd party database, Action: 
          //       Direct row to IDLookupDuplicateOut
          // (5.a) Nothing found. Action: 
          //       Direct row to IDLookupNoMatchOut
        }
        else
        {
          // (5.b) Nothing found. Action: Direct row to IDLookupNoMatchOut
          buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
        }
      }
    }
  }
}

Notice my use of buffer.SetString() to plug in the ID once I find it. I also add additional information to the pipeline (which is not shown in the code examples) to identify how the match was derived.

Conclusion

And there you have it! I ended up leaving some of the code out. As I said, this is business logic and proprietary. I did want to show, though, how I went about solving a fairly complex problem using a custom component. Again, I did have the benefit of examining a fully functional set of packages that had implemented all the logic in a Data Flow. This helped immensely, as the logic was plain as day.

So in the end, the new component makes creating new packages easier by only having to drop a single component on the Data Flow. There are a handful of properties to set (around 6 depending on the incoming data) and four Outputs to account for. This component also performs much, much better. My estimates put it at around 600% over the Data Flow approach.

Lastly, while writing these entries, I forced myself to examine the code in a different way (i.e. for public consumption and scrutiny). This made me think harder about the logic and how I used C#. The exercise was quite valuable and I ended up learning a lot in the process. With that said, please feel free to identify any mistakes I might have made, or more importantly, any places where you think improvements can be made. Thanks for reading!

 

Published April 23rd, 2008

ID Lookup Custom Transform (Part 2 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In this post, I’ll discuss the AcquireConnections and PreExecute methods. Please read the introduction and part 1 if you haven’t already!

AcquireConnections

The purpose of AcquireConnections is to establish connections to the specified connection managers at design and runtime. Although not totally necessary, it is a best practice and is recommended to connect to all external data sources using this method.

The first step is to prepare a runtime connection. In the ProvideComponentProperties method, I created a new runtime connection using interface IDTSRuntimeConnection90. I added this connection to the RuntimeConnectionCollection of my ComponentMetaData and named it “IDLookupConnection”. The code in ProvideComponentProperties is straightforward and looks like this:

IDTSRuntimeConnectionCollection90 oRuntimeConnectionCollection 
    = ComponentMetaData.RuntimeConnectionCollection;
IDTSRuntimeConnection90 oRuntimeConnection;
oRuntimeConnection = oRuntimeConnectionCollection.New();
oRuntimeConnection.Name = "IDLookupConnection";

As with my other examples, only the essential code is shown above. In my production copy, I use structured error handling and class members instead of literal strings.

Next, I wrote the following code in the AcquireConnection method:

private SqlConnection dbConn;
 
public override void AcquireConnections(object transaction)
{
IDTSRuntimeConnection90 conn = ComponentMetaData.RuntimeConnectionCollection["IDLookupConnection"];
ConnectionManager cm =  DtsConvert.ToConnectionManage(conn.ConnectionManager);
ConnectionManagerAdoNet cmObj = cm.InnerObject as ConnectionManagerAdoNet;
dbConn = cmObj.AcquireConnection(transaction) as SqlConnection;  
 
}

In the above code, I am converting conn (IDTSConnectionManager90) to a managed ConnectionManager (cm). Then, I can use InnerObject to get an AdoNet Connection Manager (cmObj) object. The AcquireConnection method of cmObj returns a SqlConnection object (dbConn) which opens a connection to my SQL Server database. I’ll use dbConn later in PreExecute to create SqlCommands.

Not to be forgotten, I use the ReleaseConnections method (not shown) to release dbConn when the component finishes its work.

PreExecute

The PreExecute method is called once for the life of the component at runtime. It is a good idea to do as much ‘prep’ work in here as possible. This includes applying behavior based on property values, loading data for lookups, setting up ’static’ items, and preparing the component to start accepting input. There is a temptation at times to use ProcessInput for some of these things: resist it! ProcessInput runs once for each buffer.

It is in PreExecute where I think I could do some better optimization. Currently, I am returning about 100,000 values from a master lookup table in about 6 seconds (crude benchmarking). This data is then used in ProcessInput for matching. I believe I can do better in here, but for now, it works and won’t be modified any time in the near future. Of course, if you see something awry, let me know!

The meat of the programming is in this method, but a lot of the logic is repeated (for example, I fetch LineageIDs for several new output IDTSOutputColumn90 columns added to each of my Outputs in ProvideComponentProperties). Here is such an example:

// get references to the input so we can use its buffer metadata
IDTSInput90 input = ComponentMetaData.InputCollection["IDLookupInput"];
 
// get reference to the output collection and its columns
IDTSOutput90 output = ComponentMetaData.OutputCollection["IDLookupOut"];
IDTSOutputColumnCollection90 outputColumns = output.OutputColumnCollection;
 
foreach (IDTSOutputColumn90 column in outputColumns)
{
  // NOTE: MS Documentation is not clear, cannot use output.buffer 
  // for synch transforms, need to use input.buffer instead (which 
  // is the same as what would be there for the output.buffer)
  intLineageID = (int)BufferManager.FindColumnByLineageID(input.Buffer,column.LineageID);
  strColumnName = column.Name.ToString().ToLower();
 
  if (strColumnName == "id_matchvalue") {
    indexMatchValue = intLineageID;
    exit;
  }
}

 

You might have noticed by now that much of what I am doing is more-or-less boilerplate code that is provided by Books Online. The trick is applying the right pieces for the right situation and building a solution that solves a specific problem better than what can be done using out-of-the-box tasks and components.

Despite all the good code examples in BOL, I still found some parts to be documented poorly. Take for example the issue with finding the LineageID of an output column for synchronous components. At first glance, you might think you can use BufferManager.FindColumnByLineageID(output.Buffer, column.LineageID). But for synchronous transforms, an output buffer is not created — it is the same as the input buffer. To find the LineageId, you need to use input.Buffer and not output.Buffer. Not very intuitive at all! But I digress…

In PreExecute, after I match a variety of LineageIDs to custom members of my IDLookup class, I continue by writing the code to fetch data from my lookup table. Actually, in the real implementation, this is also repeated for more than one data source — depending on how the developer set the properties of the component. But for the sake of simplicity, I’ll show one of the blocks of code for the primary lookup (the names have changed to protect the innocent!):

string sqlStr = "SELECT id, account, fname, lname, ssn, dob, ext_id, empid FROM mstr.MasterLookup WHERE status = 'A'";
SqlCommand sql = new SqlCommand(sqlStr, dbConn);
daMstr = new SqlDataAdapter();
daMstr.SelectCommand = sql;
dtMstr = new DataTable();
daMstr.Fill(dtMstr);

And there you have it! I simply fill a SqlDataAdapter (daMstr) with a DataTable (dtMstr) based on a SQL Statement (SqlCommand sql). The query returns several thousand rows from the master lookup table and stores them in daMstrSec, which I have defined as a custom member of my class.

Next, I’ll talk about ProcessInput!

Published April 17th, 2008

ID Lookup Custom Transform (Part 1 of 3)

As promised, I’ll discuss in this post some of the details of my custom transform component I wrote in C# for SSIS. Unfortunately, I cannot publish the component because it contains business logic and company secrets! I will, however, talk about how I went about it and detail some of its finer and not-so-finer points.

This topic will be presented in multiple parts because of its length. In Part 1, I’ll discuss the ProvideComponentProperties method in depth. Part 2 will focus on AcquireConnections and PreExecute methods. Finally, I’ll look at ProcessInput and give some closing thoughts.

Before reading on, please have a look at my previous post to get some background on the problem.

Setup and Constraints

Looking up ID Numbers in our environment is rather complex due to the number of valid search combinations and the disparity and reliability of the source files. Because all source data comes from many different vendors and affiliates, the structure and content is often provided on a “best effort” basis. Therefore, extra care must be taken when accepting a file for import.

The current environment makes use of staging tables, which I will continue to use. Data without our ID is staged, the ID is added in a Data Flow (or if it is not found, it is flagged as unmatched), and then staged again for later processing. This made implementing the component a breeze. Essentially, I deleted every component in the original Data Flow between the Source and Destination(s) and dropped in the new component.

I also had to focus on recreating the logic in C# that existed in the package, and do so without loosing performance. Gaining a degree of maintainability was not good enough: It needed to perform better too. Thankfully, the team before me went through all the hard work of turning the “business” problem of determining an ID into an easily followed data flow in SSIS. Decomposing the pieces and turning them into C# was infinitely easier as a result.

Settin up the ProvideComponentProperties method

The ProvideComponentProperties method is called when the component is dropped on the Data Flow view. Its purpose is to initialize the component’s ComponentMetaData. The ComponentMetaData contains the custom properties, inputs, and outputs of the component.

To set this up, I’ve done the following in the ProvideComponentProperties method:

// although unlikely that this function would be called more 
// than once, it is a good idea to start out clean
RemoveAllInputsOutputsAndCustomProperties()
 
// define specific properties of ComponentMetaData
ComponentMetaData.ValidateExternalMetadata = false;
ComponentMetaData.UsesDispositions = true;
ComponentMetaData.Name = "IDLookup"

ValidateExternalMetadata, when set to false, tells the component not to perform any validations against external sources at design time. UsesDispositions tells the component to include (true) or not to include (false) error output. I also added additional metadata (Description, ContactInfo, etc..) not shown in the snippet above.

// add custom properties
IDTSCustomProperty90 propAllow3rdPartyLookup = ComponentMetaData.CustomPropertyCollection.New();
 
// set properties on the new, um, property
propAllow3rdPartyLookup.Name = "Allow3rdPartyLookup";
propAllow3rdPartyLookup.Description = "Also search the external database";
propAllow3rdPartyLookup.ExpressionType = DTSCustomPropertyExpressionType.CPET_NOTIFY;

The above chunk of code adds a new property to my component called “Allow3rdPartyLookup”. This property’s value will be used later on to determine if I need to look up the ID in a different database. The ExpressionType property gives me the ability to allow (CPET_NOTIFY) or disallow (CPET_NONE) expressions for the property’s value. I’m not certain if there are any general rules-of-thumb for setting this property: I always seem to allow expressions and have never disallowed them. This provides more flexibility and allows property settings to be dynamic, both good things.

I have added an additional 6 properties (not shown here) that are used to further customize the behavior of the component.

Finally, I set up the inputs and outputs. The component will accept a single input and direct rows to one of five outputs (including the error output if needed). Here is a sample view showing the input and outputs:

ID Lookup

The following code sets up the Input:

IDTSInput90 RuntimeInput = ComponentMetaData.InputCollection.New();
RuntimeInput.Name = "IDLookupInput";
RuntimeInput.ErrorRowDisposition = DTSRowDisposition.RD_FailComponent;

For the Input, the ErrorRowDisposition tells the component how to handle errors. In this case, I want to fail the component.

ID Lookup is synchronous, which means that it processes buffers as it gets them from the upstream component. To put it another way: Its output is processed with its input. An asynchronous component, like a Sort, sends the output at a different time than the input.

Setting up the outputs is rather straightforward. First, create a new IDTSOutput90 object and assign it a name. To be able to feed the input to the output, set the SynchronousInputID of the output to the ID of the input:

// Specify each output
IDTSOutput90 RuntimeOutput = ComponentMetaData.OutputCollection.New();
RuntimeOutput.Name = "IDLookupOut";
RuntimeOutput.SynchronousInputID = RuntimeInput.ID;
RuntimeOutput.ExclusionGroup = 1;
 
// Add a new column to the Output to store the ID
IDTSOutputColumn90 outputColum1 = ComponentMetaData.OutputCollection["IDLookupOut"].OutputColumnCollection.New();
outputColum1.Name = "FoundID";
outputColum1.SetDataTypeProperties(DataType.DT_STR, 12, 0, 0, 1252);

Note: The above is repeated for the remaining 3 outputs: IDAddToMasterData, IDLookupDuplicateOut, and IDLookupNoMatchOut. I’ve not included those here for brevity.

Setting ExclusionGroup to 1 on all non-error Outputs tells the object that I want to send rows — exclusively and at my discretion — to a specific output (or to more than one as you’ll see in Part 3). This is done manually using the DirectRow method in the ProcessInput method.

In my next post, I’ll take you through AcquireConnections and PreExecute methods. In the meantime, your questions, comments, or concerns are welcome!

Published November 5th, 2007

VFPBuild to build Visual FoxPro Applications Using MSBuild

As I mentioned last week, I have been using MSBuild for some time now to build and deploy complex data warehousing applications. Many parts of this project involve Visual FoxPro. I wrote that I had not yet experimented doing builds with VFP using MSBuild. Alan Stevens quickly pointed me towards his CodePlex project “VfpMsBuildTarget” (aka:VFPBuild).

The goals of the project are simple:

  1. Provide a GUI for including projects/files in a scheduled build, and set their appropriate properties
  2. Create a VFP build target to be used by the MSBuild engine, the build system found in Visual Studio.

While the first goal might be a ways off, the second is close to being complete.

I had a chance to look at this project, test it, debug it a little, and compile a short list of improvements I would like to see (or add myself through CodePlex). The bugs were minor and easy to fix, and my list of improvements is pretty short. It should be noted that the body of work gone into this Task already is excellent. With little effort, I was able to get it going. And with a little more effort, it can be used in production. Note: Before you can have a go at it, you will need to compile vfpprojectbuilder.dll in VFP and VfpBuild.dll in C#.

Documentation

We’ll need some documentation. Alan knows this (and has promised some soon). Even without though, I made it through with little problem. I think I am at an advantage because I (a) know VFP well, (b) know C# well, and (c) know MSBuild well. We would need instructions on how to compile the dlls as well as how to construct the MSBuild XML. If you’re new to MSBuild, you might struggle with setting up the xml. For a simple test, here is a sample screenshot. I would have provided the code but the XML would have been eaten alive by the browser:

Visual FoxPro MSBuild VFPBuild sample


To execute the above, your cmd would look similar to the following:

C:>msbuild projectx.msbuild

Functionality

It doesn’t seem to be too much effort to add support for lRebuildAll, lShowErrors, and lBuildNewGUIDs of the project.build() method. Other than that, it gives you the ability to set build options and to specify a project to load.

From Here

I would like to see goal #2 of this project buttoned up soon. I have a production environment (actually, a few) that could really utilize a stable assembly like this. The build I downloaded was a bit buggy (there were references to class properties that did not exist, for example this.nLevel and this.nBuildAction). Fixing these, plus some enhanced bullet-proofing (defensive coding) and documentation would get us there I believe.

I’ll post more as I go. For now, you can get the active VFPMSBuildTarget project files here.

Published August 28th, 2007

David Woods Learns FoxPro

David Woods talks about a recent project where he is connecting to an older legacy FoxPro system using ADO.NET. He lists out 6 “stumbling points“, which include some common errors during the connection process as well as some words of wisdom when writing queries. Actually, the best point is number 6. He writes:

6. This applies to everyone. CODE TO INTERFACES. By coding to the IConnection, ICommand, etc. interfaces my switched back and forth between odbc and oledb took seconds not days.

This is solid advice and should be followed by all OOP developers. Coding to interfaces (and not objects) saves time, simplifies designs, and increases flexibility especially among teams of developers. Thanks David for the advice!

Published August 25th, 2007

Guineu means Fox

If you haven’t discovered yet, the Foxpert himself, Christof Wollenhaupt, is engaged in an interesting and important extension to VFP9: Guineu. It is billed as an “alternative runtime library for Microsoft Visual FoxPro® 9.0 that runs on any Microsoft .NET compatible platform”.

Hopefully I’ll be able to take a closer look by participating in some beta testing.

Oh, and yes, according to Christof:

Guineu is Catalan, a language spoken in Northern Spain, Andorra and some parts of France. La guineu (it’s female) translates to “fox” in English. When pronounced the “U” is silent and “E” and “U” are two separate vowels.

…guineu does means fox!