Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for April, 2008


Published April 30th, 2008

Missing Him Already…

I’m saddened by the news that Ken Murphy has passed away. I have learned so much from Ken (although we’ve never met) and I already miss him. My condolences to his family…

Here’s how he felt about the “end” of FoxPro:

The decision to cease further development of VFP is one that I believe Microsoft, and especially, the SQL server division will come to regret. I develop database apps for charities, and it has been my experience that most of these charities eventually move to a SQL server back end. I would suggest that the same is true of many small businesses. VFP is a fantastic language for developing entry level database apps for small to medium businesses. The power and speed of VFP allows people like me to develop tools for these smaller organizations that allow them to grow. Evenutally, they outgrow the VFP back end and typically move to a SQL solution. If SQL were a major league baseball team, VFP would be their farm team. I wonder how many major league teams would succeed if they were to get rid of all their farm teams?

This will be how I remember him.

Published April 28th, 2008

Life in Redmond

Imagine this:

“Everyone has an office (a few people share), doors are often closed, and people are heads down working. […] Sounds are muted, there’s no animation from the employees, it just seems like it could be an accounting firm as much as it could be a software developer. I’m not sure what I was expecting, maybe more people running around with Nerf dart guns, maybe more people excited and jazzed about what they’re working on. A few loud “Yes”s or “Aha”s, following by someone outside their cube, reveling passerbys with the amazing code they’re just written. ”

Source: Steve Jones of SQLServerCentral.com

That’s what things are like, according to Steve Jones of SQLServerCentral.com, for the SQL Server team at Microsoft as they busily prepare the next great version of SQL Server. Things have been delayed already. And as it goes, they may need to change the name to SQL Server 2009. “2008″ already sounds dated.

Published April 25th, 2008

ID Lookup Custom Transform (Part 3 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In Part 2, I walked through AcquireConnections and PreExecute. In this post, I’ll discuss ProcessInput. Please read the introduction , part 1, and part 2 if you haven’t already!

ProcessInput

ProcessInput accepts each row from the incoming buffer. This buffer is available via a parameter apply named “buffer” of type PipelineBuffer. The buffer contains each column defined in the column collection (IDTSInputColumnCollection90) of the input.

For ID Lookup, each row from the buffer is examined, and based on properties set by the user, appropriate action is taken.

The first step is to see if I can find an ID in the master lookup table based on key fields in the incoming data file. This functionality is controlled by a series of properties on the component. Normally, this search would be conducted on one or more input columns (lname, fname, and dob for example). The developer, after having done a data profile on the source file, will know exactly what columns will be used in the lookup. The code to build the SQL statement is defined like so:

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
 
  bool nullOutput = false;      // if the input contains a NULL, we want to reject the row     
 
  if (!buffer.EndOfRowset)
  {
    while (buffer.NextRow())
    {
      string strSelectWhere = "";   // to store the where clause for the search
 
      // if the columnIndex is in the input string. Recall that I figured this out in PreExecute
      if (indexMatchValue > -1) 
      {
        // build a where string for the search
        if (buffer.IsNull(indexMatchValue))
          nullOutput = true;      // fail a search on NULL as an UNMATCHED row
        else
          strSelectWhere += " lname='" + buffer.GetString(indexMatchValue);
      }

The block of code starting with line ” if (indexMatchValue > -1)” is repeated several times, with some variation, for many additional properties set by the user. The purpose is to build up a where string (stored in strSelectWhere) for the search to be made on the data table (dtMstr). This data table was filled in the PreExecute method.

Before going any further, I want to see if there are any nulls in any of my search columns. If so, I want to redirect to one of my outputs (specifically, IDLookupNoMatchOut), and move on to the next row:

      if (nullOutput == true)
      {
         buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
         continue;
      }

Once the strSelectWhere statement is built, I try to find a match in the lookup table:

      DataRow[] aDR;
      aDR = dtMstr.Select(strSelectWhere);

I had my doubts about the speed of this approach. I really thought that the Select method, repeated for each row in the buffer, would kill my grand idea. But I was pleasantly surprised to see this perform well beyond my expectations. Someday as time permits, I’ll try benchmarking it.

Based on the results from the Select, the logic flows along the following set of actions:

  • (1) If exactly one match is found in the lookup table, use buffer.DirectRow(int outputID) to output the row to the “IDLookupOut” Output.
  • (2) If more than one match is found, the candidate key in the source file is a duplicate in the master data table. Use buffer.DirectRow(int outputID) to output the row to the “IDLookupDuplicateOut” Output for later handling.
  • If no matches are found, do another Select on a historical lookup table in a different database. The code for this is not shown to keep this posting simple.
    • (3) If exactly one match is found in the historical lookup table, direct the output to both the “IDLookupOut” Output and to IDAddToMasterData Output. The latter will allow me to populate the current master data with the ID found so that we can maintain referential integrity in the database.
    • (4) If more than one match is found, direct the output to the “IDLookupDuplicateOut” Output for later handling.
  • (5) If no matches are found in the local or historical data, direct the row to the “IDLookupNoMatchOut” Output.

For components that conditionally direct inputs to multiple outputs (like a Conditional Split or my ID Lookup component), you need to use buffer.DirectRow to tell the component what output to use.

Here’s the code:

      if (aDR.Length == 1)
      {
        // (1) One record found. Action: Direct row to IDLookupOut
        buffer.SetString(indexIDValue, aDR[0]["id"].ToString());   // the column to put the ID
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupOut"].ID);
      }
      else if (aDR.Length > 1)
      {
        // (2) More than one entry found. Action: Direct row to IDLookupDuplicateOut
        buffer.SetString(indexIDValueDupe, aDR[0]["id"].ToString()); // store dupe ids 
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupDuplicateOut"].ID);
      }
      else if (aDR.Length == 0)
      {   
        if (Allow3rdPartyLookup== true)
        {
          // (3) Search the historical database, if single match found Action: 
          //       Direct row to IDLookupOut
          //       and Direct Row to IDAddToMasterData 
          // (4) If duplicates found in 3rd party database, Action: 
          //       Direct row to IDLookupDuplicateOut
          // (5.a) Nothing found. Action: 
          //       Direct row to IDLookupNoMatchOut
        }
        else
        {
          // (5.b) Nothing found. Action: Direct row to IDLookupNoMatchOut
          buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
        }
      }
    }
  }
}

Notice my use of buffer.SetString() to plug in the ID once I find it. I also add additional information to the pipeline (which is not shown in the code examples) to identify how the match was derived.

Conclusion

And there you have it! I ended up leaving some of the code out. As I said, this is business logic and proprietary. I did want to show, though, how I went about solving a fairly complex problem using a custom component. Again, I did have the benefit of examining a fully functional set of packages that had implemented all the logic in a Data Flow. This helped immensely, as the logic was plain as day.

So in the end, the new component makes creating new packages easier by only having to drop a single component on the Data Flow. There are a handful of properties to set (around 6 depending on the incoming data) and four Outputs to account for. This component also performs much, much better. My estimates put it at around 600% over the Data Flow approach.

Lastly, while writing these entries, I forced myself to examine the code in a different way (i.e. for public consumption and scrutiny). This made me think harder about the logic and how I used C#. The exercise was quite valuable and I ended up learning a lot in the process. With that said, please feel free to identify any mistakes I might have made, or more importantly, any places where you think improvements can be made. Thanks for reading!

 

Published April 23rd, 2008

ID Lookup Custom Transform (Part 2 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In this post, I’ll discuss the AcquireConnections and PreExecute methods. Please read the introduction and part 1 if you haven’t already!

AcquireConnections

The purpose of AcquireConnections is to establish connections to the specified connection managers at design and runtime. Although not totally necessary, it is a best practice and is recommended to connect to all external data sources using this method.

The first step is to prepare a runtime connection. In the ProvideComponentProperties method, I created a new runtime connection using interface IDTSRuntimeConnection90. I added this connection to the RuntimeConnectionCollection of my ComponentMetaData and named it “IDLookupConnection”. The code in ProvideComponentProperties is straightforward and looks like this:

IDTSRuntimeConnectionCollection90 oRuntimeConnectionCollection 
    = ComponentMetaData.RuntimeConnectionCollection;
IDTSRuntimeConnection90 oRuntimeConnection;
oRuntimeConnection = oRuntimeConnectionCollection.New();
oRuntimeConnection.Name = "IDLookupConnection";

As with my other examples, only the essential code is shown above. In my production copy, I use structured error handling and class members instead of literal strings.

Next, I wrote the following code in the AcquireConnection method:

private SqlConnection dbConn;
 
public override void AcquireConnections(object transaction)
{
IDTSRuntimeConnection90 conn = ComponentMetaData.RuntimeConnectionCollection["IDLookupConnection"];
ConnectionManager cm =  DtsConvert.ToConnectionManage(conn.ConnectionManager);
ConnectionManagerAdoNet cmObj = cm.InnerObject as ConnectionManagerAdoNet;
dbConn = cmObj.AcquireConnection(transaction) as SqlConnection;  
 
}

In the above code, I am converting conn (IDTSConnectionManager90) to a managed ConnectionManager (cm). Then, I can use InnerObject to get an AdoNet Connection Manager (cmObj) object. The AcquireConnection method of cmObj returns a SqlConnection object (dbConn) which opens a connection to my SQL Server database. I’ll use dbConn later in PreExecute to create SqlCommands.

Not to be forgotten, I use the ReleaseConnections method (not shown) to release dbConn when the component finishes its work.

PreExecute

The PreExecute method is called once for the life of the component at runtime. It is a good idea to do as much ‘prep’ work in here as possible. This includes applying behavior based on property values, loading data for lookups, setting up ’static’ items, and preparing the component to start accepting input. There is a temptation at times to use ProcessInput for some of these things: resist it! ProcessInput runs once for each buffer.

It is in PreExecute where I think I could do some better optimization. Currently, I am returning about 100,000 values from a master lookup table in about 6 seconds (crude benchmarking). This data is then used in ProcessInput for matching. I believe I can do better in here, but for now, it works and won’t be modified any time in the near future. Of course, if you see something awry, let me know!

The meat of the programming is in this method, but a lot of the logic is repeated (for example, I fetch LineageIDs for several new output IDTSOutputColumn90 columns added to each of my Outputs in ProvideComponentProperties). Here is such an example:

// get references to the input so we can use its buffer metadata
IDTSInput90 input = ComponentMetaData.InputCollection["IDLookupInput"];
 
// get reference to the output collection and its columns
IDTSOutput90 output = ComponentMetaData.OutputCollection["IDLookupOut"];
IDTSOutputColumnCollection90 outputColumns = output.OutputColumnCollection;
 
foreach (IDTSOutputColumn90 column in outputColumns)
{
  // NOTE: MS Documentation is not clear, cannot use output.buffer 
  // for synch transforms, need to use input.buffer instead (which 
  // is the same as what would be there for the output.buffer)
  intLineageID = (int)BufferManager.FindColumnByLineageID(input.Buffer,column.LineageID);
  strColumnName = column.Name.ToString().ToLower();
 
  if (strColumnName == "id_matchvalue") {
    indexMatchValue = intLineageID;
    exit;
  }
}

 

You might have noticed by now that much of what I am doing is more-or-less boilerplate code that is provided by Books Online. The trick is applying the right pieces for the right situation and building a solution that solves a specific problem better than what can be done using out-of-the-box tasks and components.

Despite all the good code examples in BOL, I still found some parts to be documented poorly. Take for example the issue with finding the LineageID of an output column for synchronous components. At first glance, you might think you can use BufferManager.FindColumnByLineageID(output.Buffer, column.LineageID). But for synchronous transforms, an output buffer is not created — it is the same as the input buffer. To find the LineageId, you need to use input.Buffer and not output.Buffer. Not very intuitive at all! But I digress…

In PreExecute, after I match a variety of LineageIDs to custom members of my IDLookup class, I continue by writing the code to fetch data from my lookup table. Actually, in the real implementation, this is also repeated for more than one data source — depending on how the developer set the properties of the component. But for the sake of simplicity, I’ll show one of the blocks of code for the primary lookup (the names have changed to protect the innocent!):

string sqlStr = "SELECT id, account, fname, lname, ssn, dob, ext_id, empid FROM mstr.MasterLookup WHERE status = 'A'";
SqlCommand sql = new SqlCommand(sqlStr, dbConn);
daMstr = new SqlDataAdapter();
daMstr.SelectCommand = sql;
dtMstr = new DataTable();
daMstr.Fill(dtMstr);

And there you have it! I simply fill a SqlDataAdapter (daMstr) with a DataTable (dtMstr) based on a SQL Statement (SqlCommand sql). The query returns several thousand rows from the master lookup table and stores them in daMstrSec, which I have defined as a custom member of my class.

Next, I’ll talk about ProcessInput!

Published April 21st, 2008

Book Review: SOA Approach to Integration

I posted a review of the book “SOA Approach to Integration” by Matjaz, B. Juric, Ramesh Loganathan, Dr. P., and G Sarang (published by Packt Publishing) over at Amazon this past weekend. Please check it out if you get the chance. Unlike my last review, this one is more favorable!

I wanted to read more about SOA for two reasons: curiosity and to round-out my knowledge of various integration strategies. Those who know me, know me as a “data guy”. I like to design data models, create databases, normalize things, and sketch integration strategies in UML. Boring. I know.

I suppose this comes directly from my background as a VFP application developer. In the nineties, I developed a dozen or so customized, vertical applications that existed for the most part in departmental islands. Their purpose was to solve business problems, usually at the process level. I soon began writing code to integrate these applications, the fancy term is “Enterprise Application Integration (EAI)”, but I never really called it that. Using Remote Procedure Calls (RPC) and shared objects, I was able to build point-to-point bridges allowing these islands to communicate with one another.

When I had the chance to start developing data warehouses, I jumped. I no longer write applications, instead, I do a lot of data modeling and I write code and design workflows to integrate data from any number of disparate applications spread out across an enterprise. I find this work more than just “satisfying”.

SOA is a different approach to integrating an enterprise. It is like EAI in some ways, but overall, the SOA approach is more advanced and scalable. Up until I read this book, I could not easily draw the line between exposing a few functions in a peer-to-peer api/RPC scenario, to this “Enterprise Service Bus” that coordinates and orchestrates entire business processes off in some far off place using XML and web services.

As you know from my postings and articles, I talk a lot about “Business Processes” in regards to dimensional modeling. This book brought me greater insight into what a “process” is and what it could be. In Dimensional Modeling, we take a bottom-up approach to building an enterprise database. Using conformed dimensions, we start process-by-process to construct a complete data warehouse. Unlike what some detractors and skeptics conclude (are there really any of those still?), we’re not creating new silos or islands, but rather an integrated, highly valuable data warehouse organized by business process, facilitated by the use of conformed dimensions. SOA looks at the business process in much the same way, but while the data warehouse typically gets a hold of a transaction after it occurs, SOA is part of the transaction. They’re two pees in the same pod.

While I agree that SOA is necessary for real-time transactional and document-related (”doc-literal”) integration, I don’t feel that data warehouses are threatened by the emergence of this “technology”. SOA solves a “business logic” problem, where business logic is spread out across an organization. Data warehousing solves reporting, analytical, and data exploration problems. A fully integrated organization will rely on SOA and data warehousing.

To buy this book, click here.

Check out these other reviews as well:

Tech Initiatives
Ken Guest’s online diary
Enterprise Architecture SOA and More

Published April 17th, 2008

ID Lookup Custom Transform (Part 1 of 3)

As promised, I’ll discuss in this post some of the details of my custom transform component I wrote in C# for SSIS. Unfortunately, I cannot publish the component because it contains business logic and company secrets! I will, however, talk about how I went about it and detail some of its finer and not-so-finer points.

This topic will be presented in multiple parts because of its length. In Part 1, I’ll discuss the ProvideComponentProperties method in depth. Part 2 will focus on AcquireConnections and PreExecute methods. Finally, I’ll look at ProcessInput and give some closing thoughts.

Before reading on, please have a look at my previous post to get some background on the problem.

Setup and Constraints

Looking up ID Numbers in our environment is rather complex due to the number of valid search combinations and the disparity and reliability of the source files. Because all source data comes from many different vendors and affiliates, the structure and content is often provided on a “best effort” basis. Therefore, extra care must be taken when accepting a file for import.

The current environment makes use of staging tables, which I will continue to use. Data without our ID is staged, the ID is added in a Data Flow (or if it is not found, it is flagged as unmatched), and then staged again for later processing. This made implementing the component a breeze. Essentially, I deleted every component in the original Data Flow between the Source and Destination(s) and dropped in the new component.

I also had to focus on recreating the logic in C# that existed in the package, and do so without loosing performance. Gaining a degree of maintainability was not good enough: It needed to perform better too. Thankfully, the team before me went through all the hard work of turning the “business” problem of determining an ID into an easily followed data flow in SSIS. Decomposing the pieces and turning them into C# was infinitely easier as a result.

Settin up the ProvideComponentProperties method

The ProvideComponentProperties method is called when the component is dropped on the Data Flow view. Its purpose is to initialize the component’s ComponentMetaData. The ComponentMetaData contains the custom properties, inputs, and outputs of the component.

To set this up, I’ve done the following in the ProvideComponentProperties method:

// although unlikely that this function would be called more 
// than once, it is a good idea to start out clean
RemoveAllInputsOutputsAndCustomProperties()
 
// define specific properties of ComponentMetaData
ComponentMetaData.ValidateExternalMetadata = false;
ComponentMetaData.UsesDispositions = true;
ComponentMetaData.Name = "IDLookup"

ValidateExternalMetadata, when set to false, tells the component not to perform any validations against external sources at design time. UsesDispositions tells the component to include (true) or not to include (false) error output. I also added additional metadata (Description, ContactInfo, etc..) not shown in the snippet above.

// add custom properties
IDTSCustomProperty90 propAllow3rdPartyLookup = ComponentMetaData.CustomPropertyCollection.New();
 
// set properties on the new, um, property
propAllow3rdPartyLookup.Name = "Allow3rdPartyLookup";
propAllow3rdPartyLookup.Description = "Also search the external database";
propAllow3rdPartyLookup.ExpressionType = DTSCustomPropertyExpressionType.CPET_NOTIFY;

The above chunk of code adds a new property to my component called “Allow3rdPartyLookup”. This property’s value will be used later on to determine if I need to look up the ID in a different database. The ExpressionType property gives me the ability to allow (CPET_NOTIFY) or disallow (CPET_NONE) expressions for the property’s value. I’m not certain if there are any general rules-of-thumb for setting this property: I always seem to allow expressions and have never disallowed them. This provides more flexibility and allows property settings to be dynamic, both good things.

I have added an additional 6 properties (not shown here) that are used to further customize the behavior of the component.

Finally, I set up the inputs and outputs. The component will accept a single input and direct rows to one of five outputs (including the error output if needed). Here is a sample view showing the input and outputs:

ID Lookup

The following code sets up the Input:

IDTSInput90 RuntimeInput = ComponentMetaData.InputCollection.New();
RuntimeInput.Name = "IDLookupInput";
RuntimeInput.ErrorRowDisposition = DTSRowDisposition.RD_FailComponent;

For the Input, the ErrorRowDisposition tells the component how to handle errors. In this case, I want to fail the component.

ID Lookup is synchronous, which means that it processes buffers as it gets them from the upstream component. To put it another way: Its output is processed with its input. An asynchronous component, like a Sort, sends the output at a different time than the input.

Setting up the outputs is rather straightforward. First, create a new IDTSOutput90 object and assign it a name. To be able to feed the input to the output, set the SynchronousInputID of the output to the ID of the input:

// Specify each output
IDTSOutput90 RuntimeOutput = ComponentMetaData.OutputCollection.New();
RuntimeOutput.Name = "IDLookupOut";
RuntimeOutput.SynchronousInputID = RuntimeInput.ID;
RuntimeOutput.ExclusionGroup = 1;
 
// Add a new column to the Output to store the ID
IDTSOutputColumn90 outputColum1 = ComponentMetaData.OutputCollection["IDLookupOut"].OutputColumnCollection.New();
outputColum1.Name = "FoundID";
outputColum1.SetDataTypeProperties(DataType.DT_STR, 12, 0, 0, 1252);

Note: The above is repeated for the remaining 3 outputs: IDAddToMasterData, IDLookupDuplicateOut, and IDLookupNoMatchOut. I’ve not included those here for brevity.

Setting ExclusionGroup to 1 on all non-error Outputs tells the object that I want to send rows — exclusively and at my discretion — to a specific output (or to more than one as you’ll see in Part 3). This is done manually using the DirectRow method in the ProcessInput method.

In my next post, I’ll take you through AcquireConnections and PreExecute methods. In the meantime, your questions, comments, or concerns are welcome!