Posts Tagged SSIS

ID Lookup Custom Transform (Part 3 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In Part 2, I walked through AcquireConnections and PreExecute. In this post, I’ll discuss ProcessInput. Please read the introduction , part 1, and part 2 if you haven’t already!

ProcessInput

ProcessInput accepts each row from the incoming buffer. This buffer is available via a parameter apply named “buffer” of type PipelineBuffer. The buffer contains each column defined in the column collection (IDTSInputColumnCollection90) of the input.

For ID Lookup, each row from the buffer is examined, and based on properties set by the user, appropriate action is taken.

The first step is to see if I can find an ID in the master lookup table based on key fields in the incoming data file. This functionality is controlled by a series of properties on the component. Normally, this search would be conducted on one or more input columns (lname, fname, and dob for example). The developer, after having done a data profile on the source file, will know exactly what columns will be used in the lookup. The code to build the SQL statement is defined like so:

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
 
  bool nullOutput = false;      // if the input contains a NULL, we want to reject the row     
 
  if (!buffer.EndOfRowset)
  {
    while (buffer.NextRow())
    {
      string strSelectWhere = "";   // to store the where clause for the search
 
      // if the columnIndex is in the input string. Recall that I figured this out in PreExecute
      if (indexMatchValue > -1) 
      {
        // build a where string for the search
        if (buffer.IsNull(indexMatchValue))
          nullOutput = true;      // fail a search on NULL as an UNMATCHED row
        else
          strSelectWhere += " lname='" + buffer.GetString(indexMatchValue);
      }

The block of code starting with line ” if (indexMatchValue > -1)” is repeated several times, with some variation, for many additional properties set by the user. The purpose is to build up a where string (stored in strSelectWhere) for the search to be made on the data table (dtMstr). This data table was filled in the PreExecute method.

Before going any further, I want to see if there are any nulls in any of my search columns. If so, I want to redirect to one of my outputs (specifically, IDLookupNoMatchOut), and move on to the next row:

      if (nullOutput == true)
      {
         buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
         continue;
      }

Once the strSelectWhere statement is built, I try to find a match in the lookup table:

      DataRow[] aDR;
      aDR = dtMstr.Select(strSelectWhere);

I had my doubts about the speed of this approach. I really thought that the Select method, repeated for each row in the buffer, would kill my grand idea. But I was pleasantly surprised to see this perform well beyond my expectations. Someday as time permits, I’ll try benchmarking it.

Based on the results from the Select, the logic flows along the following set of actions:

  • (1) If exactly one match is found in the lookup table, use buffer.DirectRow(int outputID) to output the row to the “IDLookupOut” Output.
  • (2) If more than one match is found, the candidate key in the source file is a duplicate in the master data table. Use buffer.DirectRow(int outputID) to output the row to the “IDLookupDuplicateOut” Output for later handling.
  • If no matches are found, do another Select on a historical lookup table in a different database. The code for this is not shown to keep this posting simple.
    • (3) If exactly one match is found in the historical lookup table, direct the output to both the “IDLookupOut” Output and to IDAddToMasterData Output. The latter will allow me to populate the current master data with the ID found so that we can maintain referential integrity in the database.
    • (4) If more than one match is found, direct the output to the “IDLookupDuplicateOut” Output for later handling.
  • (5) If no matches are found in the local or historical data, direct the row to the “IDLookupNoMatchOut” Output.

For components that conditionally direct inputs to multiple outputs (like a Conditional Split or my ID Lookup component), you need to use buffer.DirectRow to tell the component what output to use.

Here’s the code:

      if (aDR.Length == 1)
      {
        // (1) One record found. Action: Direct row to IDLookupOut
        buffer.SetString(indexIDValue, aDR[0]["id"].ToString());   // the column to put the ID
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupOut"].ID);
      }
      else if (aDR.Length > 1)
      {
        // (2) More than one entry found. Action: Direct row to IDLookupDuplicateOut
        buffer.SetString(indexIDValueDupe, aDR[0]["id"].ToString()); // store dupe ids 
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupDuplicateOut"].ID);
      }
      else if (aDR.Length == 0)
      {   
        if (Allow3rdPartyLookup== true)
        {
          // (3) Search the historical database, if single match found Action: 
          //       Direct row to IDLookupOut
          //       and Direct Row to IDAddToMasterData 
          // (4) If duplicates found in 3rd party database, Action: 
          //       Direct row to IDLookupDuplicateOut
          // (5.a) Nothing found. Action: 
          //       Direct row to IDLookupNoMatchOut
        }
        else
        {
          // (5.b) Nothing found. Action: Direct row to IDLookupNoMatchOut
          buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
        }
      }
    }
  }
}

Notice my use of buffer.SetString() to plug in the ID once I find it. I also add additional information to the pipeline (which is not shown in the code examples) to identify how the match was derived.

Conclusion

And there you have it! I ended up leaving some of the code out. As I said, this is business logic and proprietary. I did want to show, though, how I went about solving a fairly complex problem using a custom component. Again, I did have the benefit of examining a fully functional set of packages that had implemented all the logic in a Data Flow. This helped immensely, as the logic was plain as day.

So in the end, the new component makes creating new packages easier by only having to drop a single component on the Data Flow. There are a handful of properties to set (around 6 depending on the incoming data) and four Outputs to account for. This component also performs much, much better. My estimates put it at around 600% over the Data Flow approach.

Lastly, while writing these entries, I forced myself to examine the code in a different way (i.e. for public consumption and scrutiny). This made me think harder about the logic and how I used C#. The exercise was quite valuable and I ended up learning a lot in the process. With that said, please feel free to identify any mistakes I might have made, or more importantly, any places where you think improvements can be made. Thanks for reading!

 

Tags: , , , ,

2 Comments

ID Lookup Custom Transform (Part 2 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In this post, I’ll discuss the AcquireConnections and PreExecute methods. Please read the introduction and part 1 if you haven’t already!

AcquireConnections

The purpose of AcquireConnections is to establish connections to the specified connection managers at design and runtime. Although not totally necessary, it is a best practice and is recommended to connect to all external data sources using this method.

The first step is to prepare a runtime connection. In the ProvideComponentProperties method, I created a new runtime connection using interface IDTSRuntimeConnection90. I added this connection to the RuntimeConnectionCollection of my ComponentMetaData and named it “IDLookupConnection”. The code in ProvideComponentProperties is straightforward and looks like this:

IDTSRuntimeConnectionCollection90 oRuntimeConnectionCollection 
    = ComponentMetaData.RuntimeConnectionCollection;
IDTSRuntimeConnection90 oRuntimeConnection;
oRuntimeConnection = oRuntimeConnectionCollection.New();
oRuntimeConnection.Name = "IDLookupConnection";

As with my other examples, only the essential code is shown above. In my production copy, I use structured error handling and class members instead of literal strings.

Next, I wrote the following code in the AcquireConnection method:

private SqlConnection dbConn;
 
public override void AcquireConnections(object transaction)
{
IDTSRuntimeConnection90 conn = ComponentMetaData.RuntimeConnectionCollection["IDLookupConnection"];
ConnectionManager cm =  DtsConvert.ToConnectionManage(conn.ConnectionManager);
ConnectionManagerAdoNet cmObj = cm.InnerObject as ConnectionManagerAdoNet;
dbConn = cmObj.AcquireConnection(transaction) as SqlConnection;  
 
}

In the above code, I am converting conn (IDTSConnectionManager90) to a managed ConnectionManager (cm). Then, I can use InnerObject to get an AdoNet Connection Manager (cmObj) object. The AcquireConnection method of cmObj returns a SqlConnection object (dbConn) which opens a connection to my SQL Server database. I’ll use dbConn later in PreExecute to create SqlCommands.

Not to be forgotten, I use the ReleaseConnections method (not shown) to release dbConn when the component finishes its work.

PreExecute

The PreExecute method is called once for the life of the component at runtime. It is a good idea to do as much ‘prep’ work in here as possible. This includes applying behavior based on property values, loading data for lookups, setting up ’static’ items, and preparing the component to start accepting input. There is a temptation at times to use ProcessInput for some of these things: resist it! ProcessInput runs once for each buffer.

It is in PreExecute where I think I could do some better optimization. Currently, I am returning about 100,000 values from a master lookup table in about 6 seconds (crude benchmarking). This data is then used in ProcessInput for matching. I believe I can do better in here, but for now, it works and won’t be modified any time in the near future. Of course, if you see something awry, let me know!

The meat of the programming is in this method, but a lot of the logic is repeated (for example, I fetch LineageIDs for several new output IDTSOutputColumn90 columns added to each of my Outputs in ProvideComponentProperties). Here is such an example:

// get references to the input so we can use its buffer metadata
IDTSInput90 input = ComponentMetaData.InputCollection["IDLookupInput"];
 
// get reference to the output collection and its columns
IDTSOutput90 output = ComponentMetaData.OutputCollection["IDLookupOut"];
IDTSOutputColumnCollection90 outputColumns = output.OutputColumnCollection;
 
foreach (IDTSOutputColumn90 column in outputColumns)
{
  // NOTE: MS Documentation is not clear, cannot use output.buffer 
  // for synch transforms, need to use input.buffer instead (which 
  // is the same as what would be there for the output.buffer)
  intLineageID = (int)BufferManager.FindColumnByLineageID(input.Buffer,column.LineageID);
  strColumnName = column.Name.ToString().ToLower();
 
  if (strColumnName == "id_matchvalue") {
    indexMatchValue = intLineageID;
    exit;
  }
}

 

You might have noticed by now that much of what I am doing is more-or-less boilerplate code that is provided by Books Online. The trick is applying the right pieces for the right situation and building a solution that solves a specific problem better than what can be done using out-of-the-box tasks and components.

Despite all the good code examples in BOL, I still found some parts to be documented poorly. Take for example the issue with finding the LineageID of an output column for synchronous components. At first glance, you might think you can use BufferManager.FindColumnByLineageID(output.Buffer, column.LineageID). But for synchronous transforms, an output buffer is not created — it is the same as the input buffer. To find the LineageId, you need to use input.Buffer and not output.Buffer. Not very intuitive at all! But I digress…

In PreExecute, after I match a variety of LineageIDs to custom members of my IDLookup class, I continue by writing the code to fetch data from my lookup table. Actually, in the real implementation, this is also repeated for more than one data source — depending on how the developer set the properties of the component. But for the sake of simplicity, I’ll show one of the blocks of code for the primary lookup (the names have changed to protect the innocent!):

string sqlStr = "SELECT id, account, fname, lname, ssn, dob, ext_id, empid FROM mstr.MasterLookup WHERE status = 'A'";
SqlCommand sql = new SqlCommand(sqlStr, dbConn);
daMstr = new SqlDataAdapter();
daMstr.SelectCommand = sql;
dtMstr = new DataTable();
daMstr.Fill(dtMstr);

And there you have it! I simply fill a SqlDataAdapter (daMstr) with a DataTable (dtMstr) based on a SQL Statement (SqlCommand sql). The query returns several thousand rows from the master lookup table and stores them in daMstrSec, which I have defined as a custom member of my class.

Next, I’ll talk about ProcessInput!

Tags: , , , ,

1 Comment

ID Lookup Custom Transform (Part 1 of 3)

As promised, I’ll discuss in this post some of the details of my custom transform component I wrote in C# for SSIS. Unfortunately, I cannot publish the component because it contains business logic and company secrets! I will, however, talk about how I went about it and detail some of its finer and not-so-finer points.

This topic will be presented in multiple parts because of its length. In Part 1, I’ll discuss the ProvideComponentProperties method in depth. Part 2 will focus on AcquireConnections and PreExecute methods. Finally, I’ll look at ProcessInput and give some closing thoughts.

Before reading on, please have a look at my previous post to get some background on the problem.

Setup and Constraints

Looking up ID Numbers in our environment is rather complex due to the number of valid search combinations and the disparity and reliability of the source files. Because all source data comes from many different vendors and affiliates, the structure and content is often provided on a “best effort” basis. Therefore, extra care must be taken when accepting a file for import.

The current environment makes use of staging tables, which I will continue to use. Data without our ID is staged, the ID is added in a Data Flow (or if it is not found, it is flagged as unmatched), and then staged again for later processing. This made implementing the component a breeze. Essentially, I deleted every component in the original Data Flow between the Source and Destination(s) and dropped in the new component.

I also had to focus on recreating the logic in C# that existed in the package, and do so without loosing performance. Gaining a degree of maintainability was not good enough: It needed to perform better too. Thankfully, the team before me went through all the hard work of turning the “business” problem of determining an ID into an easily followed data flow in SSIS. Decomposing the pieces and turning them into C# was infinitely easier as a result.

Settin up the ProvideComponentProperties method

The ProvideComponentProperties method is called when the component is dropped on the Data Flow view. Its purpose is to initialize the component’s ComponentMetaData. The ComponentMetaData contains the custom properties, inputs, and outputs of the component.

To set this up, I’ve done the following in the ProvideComponentProperties method:

// although unlikely that this function would be called more 
// than once, it is a good idea to start out clean
RemoveAllInputsOutputsAndCustomProperties()
 
// define specific properties of ComponentMetaData
ComponentMetaData.ValidateExternalMetadata = false;
ComponentMetaData.UsesDispositions = true;
ComponentMetaData.Name = "IDLookup"

ValidateExternalMetadata, when set to false, tells the component not to perform any validations against external sources at design time. UsesDispositions tells the component to include (true) or not to include (false) error output. I also added additional metadata (Description, ContactInfo, etc..) not shown in the snippet above.

// add custom properties
IDTSCustomProperty90 propAllow3rdPartyLookup = ComponentMetaData.CustomPropertyCollection.New();
 
// set properties on the new, um, property
propAllow3rdPartyLookup.Name = "Allow3rdPartyLookup";
propAllow3rdPartyLookup.Description = "Also search the external database";
propAllow3rdPartyLookup.ExpressionType = DTSCustomPropertyExpressionType.CPET_NOTIFY;

The above chunk of code adds a new property to my component called “Allow3rdPartyLookup”. This property’s value will be used later on to determine if I need to look up the ID in a different database. The ExpressionType property gives me the ability to allow (CPET_NOTIFY) or disallow (CPET_NONE) expressions for the property’s value. I’m not certain if there are any general rules-of-thumb for setting this property: I always seem to allow expressions and have never disallowed them. This provides more flexibility and allows property settings to be dynamic, both good things.

I have added an additional 6 properties (not shown here) that are used to further customize the behavior of the component.

Finally, I set up the inputs and outputs. The component will accept a single input and direct rows to one of five outputs (including the error output if needed). Here is a sample view showing the input and outputs:

ID Lookup

The following code sets up the Input:

IDTSInput90 RuntimeInput = ComponentMetaData.InputCollection.New();
RuntimeInput.Name = "IDLookupInput";
RuntimeInput.ErrorRowDisposition = DTSRowDisposition.RD_FailComponent;

For the Input, the ErrorRowDisposition tells the component how to handle errors. In this case, I want to fail the component.

ID Lookup is synchronous, which means that it processes buffers as it gets them from the upstream component. To put it another way: Its output is processed with its input. An asynchronous component, like a Sort, sends the output at a different time than the input.

Setting up the outputs is rather straightforward. First, create a new IDTSOutput90 object and assign it a name. To be able to feed the input to the output, set the SynchronousInputID of the output to the ID of the input:

// Specify each output
IDTSOutput90 RuntimeOutput = ComponentMetaData.OutputCollection.New();
RuntimeOutput.Name = "IDLookupOut";
RuntimeOutput.SynchronousInputID = RuntimeInput.ID;
RuntimeOutput.ExclusionGroup = 1;
 
// Add a new column to the Output to store the ID
IDTSOutputColumn90 outputColum1 = ComponentMetaData.OutputCollection["IDLookupOut"].OutputColumnCollection.New();
outputColum1.Name = "FoundID";
outputColum1.SetDataTypeProperties(DataType.DT_STR, 12, 0, 0, 1252);

Note: The above is repeated for the remaining 3 outputs: IDAddToMasterData, IDLookupDuplicateOut, and IDLookupNoMatchOut. I’ve not included those here for brevity.

Setting ExclusionGroup to 1 on all non-error Outputs tells the object that I want to send rows — exclusively and at my discretion — to a specific output (or to more than one as you’ll see in Part 3). This is done manually using the DirectRow method in the ProcessInput method.

In my next post, I’ll take you through AcquireConnections and PreExecute methods. In the meantime, your questions, comments, or concerns are welcome!

Tags: , , , ,

2 Comments

An SSIS Custom Transformation Success Story

I recently inherited eight SSIS packages that all do essentially the same thing: They extract data from disparate sources, cleanse and conform them, and finally load them into an analytical data warehouse (dimensional model). Welcome to Data Integration 101!

The project calls for another two dozen of these extracts. Unfortunately, the team before me operated with little functional documentation, a limited knowledge of data integration, and did not focus on using tried and true best practices. There was no data profiling, no logical data mapping, no quality controls, no auditing, and very little in the way of logging. Essentially, things are a bit of a mess.

My primary focus over the next 6 months is to clean things up. I’ve already introduced better documentation practices, naming conventions, and this crazy concept of the “package template”. There is no way on earth I am developing 20 similar packages without using a template!

I’ve also started work on a Data Quality Manager Application (for now, it will consist of a table-driven set of SQL statements that can be run via stored procedure on a scheduler) and a Logical Data Map (LDM). The LDM will provide much needed documentation and give the other SSIS developers a little more direction.

I plan on posting much more about the LDM, Data Quality Manager, and package templates in a later post. For now, I want to gloat a little about a custom transformation task component I wrote.

Update: I almost never re-publish a post, but felt I needed to in this case. Darren Green of http://www.sqlis.com, which is an excellent SSIS resource by the way, commented about my use of the term “Task” to refer to Data Flow “components”. He’s right, so I made the change! Sorry for any confusion. In fact, while reading his comment, I had an “Ah-Ha” moment: The way to tell the difference between the two Scripts in SSIS is to call them by their correct names: Script Task (of the Control Flow) and Script Component (of the Data Flow). Duh!

The Problem

When the data comes into SSIS, work needs to be done to relate the rows from the source to those of the target. The mapping is a little different depending on the source file, but in the end, we need to match any number of significant characteristics (aka the file’s candidate key) to a single unique ID Number in our data warehouse database.

For example, one file might contain Social Security number, another might have last name and date of birth, a third might have the last name and the last 4 digits of a credit card number, yet another would have an account number from a billing system. Attached to all these identifiers are some important metrics that need to be added to a fact table in the data warehouse.

To do this, the developers before me used three Data Flow tasks, and over 100 Transformations (no joke). The second (and meatiest) Data Flow task took the extracted data (now in a staging table) and then embarked on a long drive through the various possible match combinations. First, there would be a Conditional Split that checked to see if the row had a valid account number, if not, then the output would hit another Conditional Split that would check on last name and so on. If there was a valid account number, a Lookup component would check the account dimension in the data warehouse for a match. If exactly one match was found (it is possible to have repeating account numbers), it would add the ID to the pipeline and drop the results into another staging table as a “match”. If not found — or there was a duplicate — another Lookup component would be used on some other identifiable column. Repeat. Repeat. Repeat!

The other two data flows did various cleanup work.

To add complexity to this mess, the attributes in each source are different. Sometimes account number has an extra check digit at the end. Other times, dashes are inserted in different positions.

The original data flow

A picture is worth a thousand words. Here is one of the data flows. The quality is bad on purpose: This is business logic, and as messy as it is, I don’t want to give away any company secrets! But, here is a high view so you can taste the pasta.

The developers did an admirable job of wading through all this and managed to implement a pure SSIS solution that works. They did this all without a data profile, LDM, or any real functional documentation. However, maintaining this, or repeating this logic in 20 more packages, makes me want to extract my eyes from my head, transform them into soup, and load them into my mouth.

A Better Approach

So I spent an hour looking at all this and thought to myself how easy and simple this would be using Visual FoxPro. A few SELECT statements, perhaps a CASE here, an IF there, and a REPLACE command for good measure.

Unfortunately, I can’t use VFP for this project. We’re locked into SSIS and there isn’t a single VFP license in my reach. The next best option, of course, is to design a custom component using C#.

In my next post, I’ll explain the nuts and bolts in detail. But to summarize, my Custom Task component (called “ID Lookup”) uses about 100 lines of code in the ProcessInput method, and another 100 lines in PreExecute (give or take). It essentially acts as a Lookup and Conditional Split with some extra logic built in that is specific to our situation. There are properties on the component that allow the developer to specify what input columns should be used in the matching process.

The new data flow using the custom transformation task

The new Data Flow looks a little more manageable.

ID Lookup achieves the exact same results as the three data flows and 100 transformations did. As an added bonus, it performs over 600% better! And I’m sure with a bit more tweaking, I can squeeze out additional performance.

Aesop Time

I suppose that being a senior-level software developer in my previous life helped out a bit. I’m not certain that the team before me even thought about writing some code to solve the problem. I also have the added benefit of seeing eight different packages in a completed state. It is much easier to identify patterns when you have something tangible to look at.

So the moral is, don’t be afraid of writing custom tasks or Data Flow components. This was not my first, and it won’t be my last. Now, when we start to write the new packages, we’ll drop this transformation into the Data Flow, right smack between the appropriate staging tables.

As I said, in my next post, I’ll take a look at some of the code I used to create the component. Keep in mind that I cannot show you our business-logic, but I will show the framework.

Tags: , , , , ,

6 Comments

SSIS Debugging: Find in Files, Naming Conventions, and a Problem Solved!

I recently inherited several complex SSIS packages that are now being promoted from the development environment (ALPHA) to testing (BETA). There was one lingering issue though that took me a few hours to figure out.

We’ve implemented custom logging and auditing in the packages. In the Alpha environment, we set up Agent jobs to run groups of related tasks. For some odd reason, a log entry was being inserted into our audit table that was from a previous job step and not from the one we expected.

To give a simplified example: There are 3 SSIS packages being called by 3 job steps. Each loads a CSV file into a staging area for dimension processing. Step 1 reports that file A was processed. Step 2 reports on file B, and step 3 on file C. The step 3 job was reporting that file A was processed, but that it was the step 3 job that did it.

This wasn’t noticed until recently for several reasons, but primarily because we generally don’t rely on this particular log entry to determine if our files have been staged. So no one looks at it. In fact, this log entry merely serves as a bracket in the log table for other log entries posted in the same job. Nevertheless, I wanted to fix it.

So, where was the problem? All the logging is done using an Execute SQL Task. Every task checked out OK. The correct variables were being used in all cases. I was about ready to blame SSIS or corruption in the package when I stumbled upon a valuable clue: The name of the exact SQL Task Causing the problem!

I found the pesky task by doing a “Find in Files” search (BIDS -> Edit -> Find and Replace -> Find in Files) on the exact message being inserted into my audit table. You can use regular expressions and wildcards in this dialog so I was sure to find at least a clue. It turned out though that my exact search found the phrase; it was in a task named “Execute SQL Task”.

Red Flag!

I meticulously went through each package last month and renamed each and every task using the naming conventions posted on Jamie Thomson’s blog. This is a great best-practice and I certainly advise all SSIS developers to use this (or similar) convention. Anyway, the developers before me did a fairly good job of renaming tasks so they had meaningful names, but it was nearly impossible to look into sysdtslog90 or our custom audit tables and identify what kind of task the record was referring to. So I felt that while these packages were still in development, it was worth the few days effort to make these changes. All I did was add the suggested prefix before each task and renamed a few that I thought needed better descriptive names (sometimes shortening them in the process).

I know for sure that I didn’t miss any! Right?

Well, yes. I did. And so did the developers before me. When the package was being developed, the developers used cut & paste to move similar tasks from an existing package into this one. Then, a series of Sequence containers and Loop containers were added and tasks were moved around. A Sequence container was sized over the troublesome “Execute SQL Task” – to be forever (almost) lost in the Control Flow maze. I first saw it in the Package Explorer view and then went back to the Control Flow to see where it had gone. The task ran an SQL statement that updated the audit table.

Advice

Take care when using cut & paste to move items from one package to another. Not only could you run into some simple problems like the one detailed here, you could also forget to update a variable or an expression – producing odd or destructive results on your database. For us, this was just a simple audit log entry with zero business impact. Are there other cases buried in the packages? Need to check right now…

Also, use the naming convention best practice as discussed by Jamie. There are a lot of great reasons to do so (ease in development, debugging, readability).

I have never been in the habit of using the “Find in Files” function. But it sure saved me here. If you haven’t been using it in your packages, then give it a try!

Tags: , , , ,

No Comments

Book Review: SQL Server Integration Services Using Visual Studio 2005

I was recently asked to take a look at and review a new book titled “SQL Server Integration Services Using Visual Studio 2005″ released in December of 2007 through PACKT Publishing. As I am always looking for good reference materials, and because I am currently in a position where I will be introducing Integration Services to folks untrained in the fine art of SSIS in the near future, I jumped at the chance.

The book was a fast read (about 4 days, a few hours a day) and contained many of the most common Control Flow and Data Flow Tasks. The author gives context for each task by providing some examples on when they should be used. The author also takes some time orienting the user to BIDS by providing definitions and several pages of menu and other UI feature descriptions.

But with that said:

  • The book was a fast read because more than half of its 296 pages were devoted to screenshots (not a scientific calculation, but about half is how it felt).
  • In many cases, the common tasks demonstrated where done in a vacuum, not allowing a beginner to get an idea of how multiple tasks can work together to provide a more complete solution. I know this is a beginner guide, but surely we could start small and build into something big?
  • The context provided for each task is in the form of an example situation which is nice, but does not give a beginner enough to go on in order to know that a particular task is right for the job they need done.
  • BIDS and VS share the same shell. Why not just say that right away and then call Visual Studio BIDS instead? All the other books and reference materials I have come across use “BIDS” and not “Visual Studio”. I suppose this is a bit of a pet peeve of mine!
  • Many definitions were inadequate or did not really define the subject. A great example is the OLEDB Command! I’m not even sure what the definition refers to. He writes: “Parameterized queries are a powerful means to extract just the information needed to satisfy a given criteria. You would use a parametric query when you try finding patient information, given the patient-ID from a patient database. This transformation requires a data flow task, and an OLEDB data source or a flat file source.” That is word-for-word. Of course, the OLEDB Command merely executes an SQL command for each row in a dataset!
  • The UI orientation was awkward at best. I don’t feel that the author did a good enough job of describing how the menus and toolbars change depending on the user’s view and context.

There are problems on multiple fronts: (a) grammar, (b) style, (c) consistency, (d) completeness, (e) conciseness, and most importantly (f) technical accuracy. See my review on Amazon for some details.

For another example, on page 124, the author writes about preparing a flat file for the Bulk Insert Task:

Normally, this file should be available, as this is the starting point. Since this is just a demo, we will be using a file with 10 rows, of which the 1st row is a column header, a ludicrously small file for this heavy weight tool. This file can be created with a text editor such as notepad, but usually, it is resident in legacy data stores.

What if the file is not available? Is it really ludicrous to import 10 rows into SQL Server using SSIS? Couldn’t modern applications produce delimited files?

Unfortunately for the reader, most paragraphs read this way. I know and understand what the author is trying to say, but for a beginner (which is who this book is designed for) a paragraph structured in this way will undoubtedly stand in the way of his or her learning. I found myself having to read paragraphs like this two or three times to understand them.

What I found odd was an entire chapter dedicated to the Active X Script Task. Even the author admits that the task is only included for backward compatibility. According to BOL:

This feature will be removed in the next version of Microsoft SQL Server. Do not use this feature in new development work, and modify applications that currently use this feature as soon as possible.

So I really, really have to wonder why on earth this would be given any attention at all in a beginner’s book!

Also, what about more information regarding precedent constraints, variables, logging, error handling, parent packages, configuration files, connection managers, data sources, data source views, the expression builder, and the like? I don’t consider any of these items beyond the scope of a beginner’s book, and in fact, if a beginner is not armed with this basic information, their packages are likely not to be very maintainable, scalable, or functional!

My Recommendation?

Although I found this book lacking in many areas, others who have reviewed it think otherwise. Please take a moment and read the reviews at Amazon for more details (mine included).

As data integration projects are very likely to fail or run way over budget and time, it is critical that beginners have a proper understanding of the tool they are about to use. If they are using SSIS as part of a data warehousing project to deliver Business Intelligence (as opposed to writing maintenance plans and MSMQ applications for example), then it is also critical that the beginner is armed with the proper theories and best practices to best position him or her for data integration success. This book does not provide this understanding and reads more like a help file or cookbook.

Closing thoughts

It was difficult for me to write this review. As an author, blogger, and musician though I feel that good constructive feedback/criticism is essential for growth and continuous improvement (sort of like how an ETL developer should approach his or her data integration project each day!). I’ve been the recipient of plenty of negative reviews, and with each, I’ve managed to get better. Lastly, to the great credit of Packt Publishing, whom I contacted before writing this entry here and on Amazon, they told me to go ahead and give my honest assessment.

Other reviews of this book include:

Tags: , , , ,

2 Comments

SSIS Expressions (by Scott Whigham)

In yesterday’s post, I made a comment that SSIS expressions were difficult to get used to. Earlier today, I ran across this video on YouTube. If you’re new to SSIS, or you simply have a problem with expressions, check it out. Here is a blurb:

Expressions are new in SSIS (meaning that they were not in DTS) but they allow you to create extremely flexible packages. Expressions can be used to assign values to variables, help you determine whether to execute a task, and to assign properties. If you ever found yourself writing ActiveX code (VBScript) in DTS to determine which task to execute based on a variable’s value, then you probably can forego the scripting all together in favor of expressions! They are very powerful in SSIS and you’ll want to have a solid understanding of how to write them and how to use them.

I love the comedy around online help. I totally feel for him…

Hope you found this useful!

Tags: ,

No Comments