Published April 25th, 2008
ID Lookup Custom Transform (Part 3 of 3)
In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In Part 2, I walked through AcquireConnections and PreExecute. In this post, I’ll discuss ProcessInput. Please read the introduction , part 1, and part 2 if you haven’t already!
ProcessInput
ProcessInput accepts each row from the incoming buffer. This buffer is available via a parameter apply named “buffer” of type PipelineBuffer. The buffer contains each column defined in the column collection (IDTSInputColumnCollection90) of the input.
For ID Lookup, each row from the buffer is examined, and based on properties set by the user, appropriate action is taken.
The first step is to see if I can find an ID in the master lookup table based on key fields in the incoming data file. This functionality is controlled by a series of properties on the component. Normally, this search would be conducted on one or more input columns (lname, fname, and dob for example). The developer, after having done a data profile on the source file, will know exactly what columns will be used in the lookup. The code to build the SQL statement is defined like so:
public override void ProcessInput(int inputID, PipelineBuffer buffer) { bool nullOutput = false; // if the input contains a NULL, we want to reject the row if (!buffer.EndOfRowset) { while (buffer.NextRow()) { string strSelectWhere = ""; // to store the where clause for the search // if the columnIndex is in the input string. Recall that I figured this out in PreExecute if (indexMatchValue > -1) { // build a where string for the search if (buffer.IsNull(indexMatchValue)) nullOutput = true; // fail a search on NULL as an UNMATCHED row else strSelectWhere += " lname='" + buffer.GetString(indexMatchValue); }
The block of code starting with line ” if (indexMatchValue > -1)” is repeated several times, with some variation, for many additional properties set by the user. The purpose is to build up a where string (stored in strSelectWhere) for the search to be made on the data table (dtMstr). This data table was filled in the PreExecute method.
Before going any further, I want to see if there are any nulls in any of my search columns. If so, I want to redirect to one of my outputs (specifically, IDLookupNoMatchOut), and move on to the next row:
if (nullOutput == true) { buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID); continue; }
Once the strSelectWhere statement is built, I try to find a match in the lookup table:
DataRow[] aDR; aDR = dtMstr.Select(strSelectWhere);
I had my doubts about the speed of this approach. I really thought that the Select method, repeated for each row in the buffer, would kill my grand idea. But I was pleasantly surprised to see this perform well beyond my expectations. Someday as time permits, I’ll try benchmarking it.
Based on the results from the Select, the logic flows along the following set of actions:
- (1) If exactly one match is found in the lookup table, use buffer.DirectRow(int outputID) to output the row to the “IDLookupOut” Output.
- (2) If more than one match is found, the candidate key in the source file is a duplicate in the master data table. Use buffer.DirectRow(int outputID) to output the row to the “IDLookupDuplicateOut” Output for later handling.
- If no matches are found, do another Select on a historical lookup table in a different database. The code for this is not shown to keep this posting simple.
- (3) If exactly one match is found in the historical lookup table, direct the output to both the “IDLookupOut” Output and to IDAddToMasterData Output. The latter will allow me to populate the current master data with the ID found so that we can maintain referential integrity in the database.
- (4) If more than one match is found, direct the output to the “IDLookupDuplicateOut” Output for later handling.
- (5) If no matches are found in the local or historical data, direct the row to the “IDLookupNoMatchOut” Output.
For components that conditionally direct inputs to multiple outputs (like a Conditional Split or my ID Lookup component), you need to use buffer.DirectRow to tell the component what output to use.
Here’s the code:
if (aDR.Length == 1) { // (1) One record found. Action: Direct row to IDLookupOut buffer.SetString(indexIDValue, aDR[0]["id"].ToString()); // the column to put the ID buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupOut"].ID); } else if (aDR.Length > 1) { // (2) More than one entry found. Action: Direct row to IDLookupDuplicateOut buffer.SetString(indexIDValueDupe, aDR[0]["id"].ToString()); // store dupe ids buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupDuplicateOut"].ID); } else if (aDR.Length == 0) { if (Allow3rdPartyLookup== true) { // (3) Search the historical database, if single match found Action: // Direct row to IDLookupOut // and Direct Row to IDAddToMasterData // (4) If duplicates found in 3rd party database, Action: // Direct row to IDLookupDuplicateOut // (5.a) Nothing found. Action: // Direct row to IDLookupNoMatchOut } else { // (5.b) Nothing found. Action: Direct row to IDLookupNoMatchOut buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID); } } } } }
Notice my use of buffer.SetString() to plug in the ID once I find it. I also add additional information to the pipeline (which is not shown in the code examples) to identify how the match was derived.
Conclusion
And there you have it! I ended up leaving some of the code out. As I said, this is business logic and proprietary. I did want to show, though, how I went about solving a fairly complex problem using a custom component. Again, I did have the benefit of examining a fully functional set of packages that had implemented all the logic in a Data Flow. This helped immensely, as the logic was plain as day.
So in the end, the new component makes creating new packages easier by only having to drop a single component on the Data Flow. There are a handful of properties to set (around 6 depending on the incoming data) and four Outputs to account for. This component also performs much, much better. My estimates put it at around 600% over the Data Flow approach.
Lastly, while writing these entries, I forced myself to examine the code in a different way (i.e. for public consumption and scrutiny). This made me think harder about the logic and how I used C#. The exercise was quite valuable and I ended up learning a lot in the process. With that said, please feel free to identify any mistakes I might have made, or more importantly, any places where you think improvements can be made. Thanks for reading!
Data Warehousing and Business Intelligence consultant, with expertise in business analysis, data modeling, and data integration. Extensive experience developing vertical and integrated desktop and Internet applications spanning municipal, clinical, and financial industries.