Archive for April, 2008

Missing Him Already…

I’m saddened by the news that Ken Murphy has passed away. I have learned so much from Ken (although we’ve never met) and I already miss him. My condolences to his family…

Here’s how he felt about the “end” of FoxPro:

The decision to cease further development of VFP is one that I believe Microsoft, and especially, the SQL server division will come to regret. I develop database apps for charities, and it has been my experience that most of these charities eventually move to a SQL server back end. I would suggest that the same is true of many small businesses. VFP is a fantastic language for developing entry level database apps for small to medium businesses. The power and speed of VFP allows people like me to develop tools for these smaller organizations that allow them to grow. Evenutally, they outgrow the VFP back end and typically move to a SQL solution. If SQL were a major league baseball team, VFP would be their farm team. I wonder how many major league teams would succeed if they were to get rid of all their farm teams?

This will be how I remember him.

Tags:

No Comments

Life in Redmond

Imagine this:

“Everyone has an office (a few people share), doors are often closed, and people are heads down working. [...] Sounds are muted, there’s no animation from the employees, it just seems like it could be an accounting firm as much as it could be a software developer. I’m not sure what I was expecting, maybe more people running around with Nerf dart guns, maybe more people excited and jazzed about what they’re working on. A few loud “Yes”s or “Aha”s, following by someone outside their cube, reveling passerbys with the amazing code they’re just written. ”

Source: Steve Jones of SQLServerCentral.com

That’s what things are like, according to Steve Jones of SQLServerCentral.com, for the SQL Server team at Microsoft as they busily prepare the next great version of SQL Server. Things have been delayed already. And as it goes, they may need to change the name to SQL Server 2009. “2008″ already sounds dated.

Tags: ,

1 Comment

ID Lookup Custom Transform (Part 3 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In Part 2, I walked through AcquireConnections and PreExecute. In this post, I’ll discuss ProcessInput. Please read the introduction , part 1, and part 2 if you haven’t already!

ProcessInput

ProcessInput accepts each row from the incoming buffer. This buffer is available via a parameter apply named “buffer” of type PipelineBuffer. The buffer contains each column defined in the column collection (IDTSInputColumnCollection90) of the input.

For ID Lookup, each row from the buffer is examined, and based on properties set by the user, appropriate action is taken.

The first step is to see if I can find an ID in the master lookup table based on key fields in the incoming data file. This functionality is controlled by a series of properties on the component. Normally, this search would be conducted on one or more input columns (lname, fname, and dob for example). The developer, after having done a data profile on the source file, will know exactly what columns will be used in the lookup. The code to build the SQL statement is defined like so:

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
 
  bool nullOutput = false;      // if the input contains a NULL, we want to reject the row     
 
  if (!buffer.EndOfRowset)
  {
    while (buffer.NextRow())
    {
      string strSelectWhere = "";   // to store the where clause for the search
 
      // if the columnIndex is in the input string. Recall that I figured this out in PreExecute
      if (indexMatchValue > -1) 
      {
        // build a where string for the search
        if (buffer.IsNull(indexMatchValue))
          nullOutput = true;      // fail a search on NULL as an UNMATCHED row
        else
          strSelectWhere += " lname='" + buffer.GetString(indexMatchValue);
      }

The block of code starting with line ” if (indexMatchValue > -1)” is repeated several times, with some variation, for many additional properties set by the user. The purpose is to build up a where string (stored in strSelectWhere) for the search to be made on the data table (dtMstr). This data table was filled in the PreExecute method.

Before going any further, I want to see if there are any nulls in any of my search columns. If so, I want to redirect to one of my outputs (specifically, IDLookupNoMatchOut), and move on to the next row:

      if (nullOutput == true)
      {
         buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
         continue;
      }

Once the strSelectWhere statement is built, I try to find a match in the lookup table:

      DataRow[] aDR;
      aDR = dtMstr.Select(strSelectWhere);

I had my doubts about the speed of this approach. I really thought that the Select method, repeated for each row in the buffer, would kill my grand idea. But I was pleasantly surprised to see this perform well beyond my expectations. Someday as time permits, I’ll try benchmarking it.

Based on the results from the Select, the logic flows along the following set of actions:

  • (1) If exactly one match is found in the lookup table, use buffer.DirectRow(int outputID) to output the row to the “IDLookupOut” Output.
  • (2) If more than one match is found, the candidate key in the source file is a duplicate in the master data table. Use buffer.DirectRow(int outputID) to output the row to the “IDLookupDuplicateOut” Output for later handling.
  • If no matches are found, do another Select on a historical lookup table in a different database. The code for this is not shown to keep this posting simple.
    • (3) If exactly one match is found in the historical lookup table, direct the output to both the “IDLookupOut” Output and to IDAddToMasterData Output. The latter will allow me to populate the current master data with the ID found so that we can maintain referential integrity in the database.
    • (4) If more than one match is found, direct the output to the “IDLookupDuplicateOut” Output for later handling.
  • (5) If no matches are found in the local or historical data, direct the row to the “IDLookupNoMatchOut” Output.

For components that conditionally direct inputs to multiple outputs (like a Conditional Split or my ID Lookup component), you need to use buffer.DirectRow to tell the component what output to use.

Here’s the code:

      if (aDR.Length == 1)
      {
        // (1) One record found. Action: Direct row to IDLookupOut
        buffer.SetString(indexIDValue, aDR[0]["id"].ToString());   // the column to put the ID
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupOut"].ID);
      }
      else if (aDR.Length > 1)
      {
        // (2) More than one entry found. Action: Direct row to IDLookupDuplicateOut
        buffer.SetString(indexIDValueDupe, aDR[0]["id"].ToString()); // store dupe ids 
        buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupDuplicateOut"].ID);
      }
      else if (aDR.Length == 0)
      {   
        if (Allow3rdPartyLookup== true)
        {
          // (3) Search the historical database, if single match found Action: 
          //       Direct row to IDLookupOut
          //       and Direct Row to IDAddToMasterData 
          // (4) If duplicates found in 3rd party database, Action: 
          //       Direct row to IDLookupDuplicateOut
          // (5.a) Nothing found. Action: 
          //       Direct row to IDLookupNoMatchOut
        }
        else
        {
          // (5.b) Nothing found. Action: Direct row to IDLookupNoMatchOut
          buffer.DirectRow(ComponentMetaData.OutputCollection["IDLookupNoMatchOut"].ID);
        }
      }
    }
  }
}

Notice my use of buffer.SetString() to plug in the ID once I find it. I also add additional information to the pipeline (which is not shown in the code examples) to identify how the match was derived.

Conclusion

And there you have it! I ended up leaving some of the code out. As I said, this is business logic and proprietary. I did want to show, though, how I went about solving a fairly complex problem using a custom component. Again, I did have the benefit of examining a fully functional set of packages that had implemented all the logic in a Data Flow. This helped immensely, as the logic was plain as day.

So in the end, the new component makes creating new packages easier by only having to drop a single component on the Data Flow. There are a handful of properties to set (around 6 depending on the incoming data) and four Outputs to account for. This component also performs much, much better. My estimates put it at around 600% over the Data Flow approach.

Lastly, while writing these entries, I forced myself to examine the code in a different way (i.e. for public consumption and scrutiny). This made me think harder about the logic and how I used C#. The exercise was quite valuable and I ended up learning a lot in the process. With that said, please feel free to identify any mistakes I might have made, or more importantly, any places where you think improvements can be made. Thanks for reading!

 

Tags: , , , ,

2 Comments

ID Lookup Custom Transform (Part 2 of 3)

In part 1 of this series, I discussed the ProvideComponentProperties method of my custom ID Lookup component in SSIS. In this post, I’ll discuss the AcquireConnections and PreExecute methods. Please read the introduction and part 1 if you haven’t already!

AcquireConnections

The purpose of AcquireConnections is to establish connections to the specified connection managers at design and runtime. Although not totally necessary, it is a best practice and is recommended to connect to all external data sources using this method.

The first step is to prepare a runtime connection. In the ProvideComponentProperties method, I created a new runtime connection using interface IDTSRuntimeConnection90. I added this connection to the RuntimeConnectionCollection of my ComponentMetaData and named it “IDLookupConnection”. The code in ProvideComponentProperties is straightforward and looks like this:

IDTSRuntimeConnectionCollection90 oRuntimeConnectionCollection 
    = ComponentMetaData.RuntimeConnectionCollection;
IDTSRuntimeConnection90 oRuntimeConnection;
oRuntimeConnection = oRuntimeConnectionCollection.New();
oRuntimeConnection.Name = "IDLookupConnection";

As with my other examples, only the essential code is shown above. In my production copy, I use structured error handling and class members instead of literal strings.

Next, I wrote the following code in the AcquireConnection method:

private SqlConnection dbConn;
 
public override void AcquireConnections(object transaction)
{
IDTSRuntimeConnection90 conn = ComponentMetaData.RuntimeConnectionCollection["IDLookupConnection"];
ConnectionManager cm =  DtsConvert.ToConnectionManage(conn.ConnectionManager);
ConnectionManagerAdoNet cmObj = cm.InnerObject as ConnectionManagerAdoNet;
dbConn = cmObj.AcquireConnection(transaction) as SqlConnection;  
 
}

In the above code, I am converting conn (IDTSConnectionManager90) to a managed ConnectionManager (cm). Then, I can use InnerObject to get an AdoNet Connection Manager (cmObj) object. The AcquireConnection method of cmObj returns a SqlConnection object (dbConn) which opens a connection to my SQL Server database. I’ll use dbConn later in PreExecute to create SqlCommands.

Not to be forgotten, I use the ReleaseConnections method (not shown) to release dbConn when the component finishes its work.

PreExecute

The PreExecute method is called once for the life of the component at runtime. It is a good idea to do as much ‘prep’ work in here as possible. This includes applying behavior based on property values, loading data for lookups, setting up ’static’ items, and preparing the component to start accepting input. There is a temptation at times to use ProcessInput for some of these things: resist it! ProcessInput runs once for each buffer.

It is in PreExecute where I think I could do some better optimization. Currently, I am returning about 100,000 values from a master lookup table in about 6 seconds (crude benchmarking). This data is then used in ProcessInput for matching. I believe I can do better in here, but for now, it works and won’t be modified any time in the near future. Of course, if you see something awry, let me know!

The meat of the programming is in this method, but a lot of the logic is repeated (for example, I fetch LineageIDs for several new output IDTSOutputColumn90 columns added to each of my Outputs in ProvideComponentProperties). Here is such an example:

// get references to the input so we can use its buffer metadata
IDTSInput90 input = ComponentMetaData.InputCollection["IDLookupInput"];
 
// get reference to the output collection and its columns
IDTSOutput90 output = ComponentMetaData.OutputCollection["IDLookupOut"];
IDTSOutputColumnCollection90 outputColumns = output.OutputColumnCollection;
 
foreach (IDTSOutputColumn90 column in outputColumns)
{
  // NOTE: MS Documentation is not clear, cannot use output.buffer 
  // for synch transforms, need to use input.buffer instead (which 
  // is the same as what would be there for the output.buffer)
  intLineageID = (int)BufferManager.FindColumnByLineageID(input.Buffer,column.LineageID);
  strColumnName = column.Name.ToString().ToLower();
 
  if (strColumnName == "id_matchvalue") {
    indexMatchValue = intLineageID;
    exit;
  }
}

 

You might have noticed by now that much of what I am doing is more-or-less boilerplate code that is provided by Books Online. The trick is applying the right pieces for the right situation and building a solution that solves a specific problem better than what can be done using out-of-the-box tasks and components.

Despite all the good code examples in BOL, I still found some parts to be documented poorly. Take for example the issue with finding the LineageID of an output column for synchronous components. At first glance, you might think you can use BufferManager.FindColumnByLineageID(output.Buffer, column.LineageID). But for synchronous transforms, an output buffer is not created — it is the same as the input buffer. To find the LineageId, you need to use input.Buffer and not output.Buffer. Not very intuitive at all! But I digress…

In PreExecute, after I match a variety of LineageIDs to custom members of my IDLookup class, I continue by writing the code to fetch data from my lookup table. Actually, in the real implementation, this is also repeated for more than one data source — depending on how the developer set the properties of the component. But for the sake of simplicity, I’ll show one of the blocks of code for the primary lookup (the names have changed to protect the innocent!):

string sqlStr = "SELECT id, account, fname, lname, ssn, dob, ext_id, empid FROM mstr.MasterLookup WHERE status = 'A'";
SqlCommand sql = new SqlCommand(sqlStr, dbConn);
daMstr = new SqlDataAdapter();
daMstr.SelectCommand = sql;
dtMstr = new DataTable();
daMstr.Fill(dtMstr);

And there you have it! I simply fill a SqlDataAdapter (daMstr) with a DataTable (dtMstr) based on a SQL Statement (SqlCommand sql). The query returns several thousand rows from the master lookup table and stores them in daMstrSec, which I have defined as a custom member of my class.

Next, I’ll talk about ProcessInput!

Tags: , , , ,

1 Comment

Book Review: SOA Approach to Integration

I posted a review of the book “SOA Approach to Integration” by Matjaz, B. Juric, Ramesh Loganathan, Dr. P., and G Sarang (published by Packt Publishing) over at Amazon this past weekend. Please check it out if you get the chance. Unlike my last review, this one is more favorable!

I wanted to read more about SOA for two reasons: curiosity and to round-out my knowledge of various integration strategies. Those who know me, know me as a “data guy”. I like to design data models, create databases, normalize things, and sketch integration strategies in UML. Boring. I know.

I suppose this comes directly from my background as a VFP application developer. In the nineties, I developed a dozen or so customized, vertical applications that existed for the most part in departmental islands. Their purpose was to solve business problems, usually at the process level. I soon began writing code to integrate these applications, the fancy term is “Enterprise Application Integration (EAI)”, but I never really called it that. Using Remote Procedure Calls (RPC) and shared objects, I was able to build point-to-point bridges allowing these islands to communicate with one another.

When I had the chance to start developing data warehouses, I jumped. I no longer write applications, instead, I do a lot of data modeling and I write code and design workflows to integrate data from any number of disparate applications spread out across an enterprise. I find this work more than just “satisfying”.

SOA is a different approach to integrating an enterprise. It is like EAI in some ways, but overall, the SOA approach is more advanced and scalable. Up until I read this book, I could not easily draw the line between exposing a few functions in a peer-to-peer api/RPC scenario, to this “Enterprise Service Bus” that coordinates and orchestrates entire business processes off in some far off place using XML and web services.

As you know from my postings and articles, I talk a lot about “Business Processes” in regards to dimensional modeling. This book brought me greater insight into what a “process” is and what it could be. In Dimensional Modeling, we take a bottom-up approach to building an enterprise database. Using conformed dimensions, we start process-by-process to construct a complete data warehouse. Unlike what some detractors and skeptics conclude (are there really any of those still?), we’re not creating new silos or islands, but rather an integrated, highly valuable data warehouse organized by business process, facilitated by the use of conformed dimensions. SOA looks at the business process in much the same way, but while the data warehouse typically gets a hold of a transaction after it occurs, SOA is part of the transaction. They’re two pees in the same pod.

While I agree that SOA is necessary for real-time transactional and document-related (”doc-literal”) integration, I don’t feel that data warehouses are threatened by the emergence of this “technology”. SOA solves a “business logic” problem, where business logic is spread out across an organization. Data warehousing solves reporting, analytical, and data exploration problems. A fully integrated organization will rely on SOA and data warehousing.

To buy this book, click here.

Check out these other reviews as well:

Tech Initiatives
Ken Guest’s online diary
Enterprise Architecture SOA and More

Tags: ,

No Comments

ID Lookup Custom Transform (Part 1 of 3)

As promised, I’ll discuss in this post some of the details of my custom transform component I wrote in C# for SSIS. Unfortunately, I cannot publish the component because it contains business logic and company secrets! I will, however, talk about how I went about it and detail some of its finer and not-so-finer points.

This topic will be presented in multiple parts because of its length. In Part 1, I’ll discuss the ProvideComponentProperties method in depth. Part 2 will focus on AcquireConnections and PreExecute methods. Finally, I’ll look at ProcessInput and give some closing thoughts.

Before reading on, please have a look at my previous post to get some background on the problem.

Setup and Constraints

Looking up ID Numbers in our environment is rather complex due to the number of valid search combinations and the disparity and reliability of the source files. Because all source data comes from many different vendors and affiliates, the structure and content is often provided on a “best effort” basis. Therefore, extra care must be taken when accepting a file for import.

The current environment makes use of staging tables, which I will continue to use. Data without our ID is staged, the ID is added in a Data Flow (or if it is not found, it is flagged as unmatched), and then staged again for later processing. This made implementing the component a breeze. Essentially, I deleted every component in the original Data Flow between the Source and Destination(s) and dropped in the new component.

I also had to focus on recreating the logic in C# that existed in the package, and do so without loosing performance. Gaining a degree of maintainability was not good enough: It needed to perform better too. Thankfully, the team before me went through all the hard work of turning the “business” problem of determining an ID into an easily followed data flow in SSIS. Decomposing the pieces and turning them into C# was infinitely easier as a result.

Settin up the ProvideComponentProperties method

The ProvideComponentProperties method is called when the component is dropped on the Data Flow view. Its purpose is to initialize the component’s ComponentMetaData. The ComponentMetaData contains the custom properties, inputs, and outputs of the component.

To set this up, I’ve done the following in the ProvideComponentProperties method:

// although unlikely that this function would be called more 
// than once, it is a good idea to start out clean
RemoveAllInputsOutputsAndCustomProperties()
 
// define specific properties of ComponentMetaData
ComponentMetaData.ValidateExternalMetadata = false;
ComponentMetaData.UsesDispositions = true;
ComponentMetaData.Name = "IDLookup"

ValidateExternalMetadata, when set to false, tells the component not to perform any validations against external sources at design time. UsesDispositions tells the component to include (true) or not to include (false) error output. I also added additional metadata (Description, ContactInfo, etc..) not shown in the snippet above.

// add custom properties
IDTSCustomProperty90 propAllow3rdPartyLookup = ComponentMetaData.CustomPropertyCollection.New();
 
// set properties on the new, um, property
propAllow3rdPartyLookup.Name = "Allow3rdPartyLookup";
propAllow3rdPartyLookup.Description = "Also search the external database";
propAllow3rdPartyLookup.ExpressionType = DTSCustomPropertyExpressionType.CPET_NOTIFY;

The above chunk of code adds a new property to my component called “Allow3rdPartyLookup”. This property’s value will be used later on to determine if I need to look up the ID in a different database. The ExpressionType property gives me the ability to allow (CPET_NOTIFY) or disallow (CPET_NONE) expressions for the property’s value. I’m not certain if there are any general rules-of-thumb for setting this property: I always seem to allow expressions and have never disallowed them. This provides more flexibility and allows property settings to be dynamic, both good things.

I have added an additional 6 properties (not shown here) that are used to further customize the behavior of the component.

Finally, I set up the inputs and outputs. The component will accept a single input and direct rows to one of five outputs (including the error output if needed). Here is a sample view showing the input and outputs:

ID Lookup

The following code sets up the Input:

IDTSInput90 RuntimeInput = ComponentMetaData.InputCollection.New();
RuntimeInput.Name = "IDLookupInput";
RuntimeInput.ErrorRowDisposition = DTSRowDisposition.RD_FailComponent;

For the Input, the ErrorRowDisposition tells the component how to handle errors. In this case, I want to fail the component.

ID Lookup is synchronous, which means that it processes buffers as it gets them from the upstream component. To put it another way: Its output is processed with its input. An asynchronous component, like a Sort, sends the output at a different time than the input.

Setting up the outputs is rather straightforward. First, create a new IDTSOutput90 object and assign it a name. To be able to feed the input to the output, set the SynchronousInputID of the output to the ID of the input:

// Specify each output
IDTSOutput90 RuntimeOutput = ComponentMetaData.OutputCollection.New();
RuntimeOutput.Name = "IDLookupOut";
RuntimeOutput.SynchronousInputID = RuntimeInput.ID;
RuntimeOutput.ExclusionGroup = 1;
 
// Add a new column to the Output to store the ID
IDTSOutputColumn90 outputColum1 = ComponentMetaData.OutputCollection["IDLookupOut"].OutputColumnCollection.New();
outputColum1.Name = "FoundID";
outputColum1.SetDataTypeProperties(DataType.DT_STR, 12, 0, 0, 1252);

Note: The above is repeated for the remaining 3 outputs: IDAddToMasterData, IDLookupDuplicateOut, and IDLookupNoMatchOut. I’ve not included those here for brevity.

Setting ExclusionGroup to 1 on all non-error Outputs tells the object that I want to send rows — exclusively and at my discretion — to a specific output (or to more than one as you’ll see in Part 3). This is done manually using the DirectRow method in the ProcessInput method.

In my next post, I’ll take you through AcquireConnections and PreExecute methods. In the meantime, your questions, comments, or concerns are welcome!

Tags: , , , ,

2 Comments

ETL Subsystem 10: Surrogate Key Generator

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

When integrating data into a Dimensional Model, you need a mechanism to assign new primary keys to each dimension. These primary keys will be used in your Fact table as foreign keys. You cannot use natural keys because they are likely to repeat — this is especially true if you are maintaining history using SCD Type 2 (more on this in a bit). Subsystem 10 addresses this important need, by specifying the need to generate surrogate keys for all dimensions.

Surrogates

Integer surrogate keys should always be used as primary keys for a dimension. In addition, each dimension will have at least one candidate key comprised of the natural key and row status (active or inactive dates) if the dimension is a TYPE 2 SCD. Here is what I wrote in a previous post:

Every dimension contains a single-part surrogate primary key field. This primary key is used to relate to one or more fact tables in the schema. Natural keys (like an employee id) are kept in the dimension, but are not used in joins. This technique ensures a unique value for each record, simplifies joins, and keeps indexes small.

So the concept is simple: Every time a new row is inserted into any dimension, it gets a new primary surrogate key. We use integers because they’re compact, perfect for sequencing, and under most relational databases we can control their ranges. How you generating this primary key largely depends on your performance needs and RDBMS capabilities (see the SSIS and VFP subsections below).

Special Cases

You can treat Date dimensions a little differently. Instead of making the first row in your Date Dimension start at 1, use the ISO 8601 date standard YYYYMMDD format. This makes sense for a few reasons. First, it is easier to look at a row in a Fact table and have the date available without a join to the Date dimension or one of its role players. Second, a Date Dimension is treated as a Type 1 SCD, meaning, you would never have the same date repeated with one row being active and the other inactive as in TYPE 2 SCDs (hence throwing off the ISO date format). Third, because Fact tables generally contain several dates, using the integer date key (which is an int in the ISO date format) instead of fetching the formatted date from the Date Dimension may have some performance benefits because you can easily format the date later on in the presentation layer. Lastly, you can usually predict with certainty the total number of rows that the Date Dimension will have, and then pre-populate all the dates before you even do your first integration. This removes the Date Dimension from the surrogate key processing all together.

As I mentioned previously, one row in each dimension should have a primary key of zero, which can be used as a symbolic “null”. Remember that Fact tables reference dimensions through foreign keys. You cannot have a null foreign key, so the only viable alternative is to link to a dummy row in the dimension. In dimensional modeling, the acceptable solution is to create a row in each dimension with a primary key of zero. For example, in a Customer Dimension, the zero key might have a customer name of “Unknown” and a DOB of 0 (which is a foreign key that relates to the Date Dimension’s null row). This is a very cool concept because it becomes incredibly easy to produce a query that could give you all the order transactions that did not involve a known customer. Maybe not the best of examples, but try writing a query to give you results of everything that didn’t happen, or wasn’t there, using a relational model!

You can also use -1, -2, and so on to represent other special conditions. For example, a Promotion Dimension may have a zero-key row that represents “Unknown” and a -1 key row that represents “No Promotion”. Perhaps a -2 key could symbolize a “Canceled Promotion”. And so on…

The following sections give some ideas on how to generate surrogate keys in both SSIS and VFP. I’ll discuss using these keys in greater depth when I discuss Subsystem 14: Surrogate Key Management.

SQL Server 2005 Integration Services (SSIS)

I prefer to let the database generate new, unique surrogate keys in all cases. This is usually the safest approach and doesn’t require any real development around the key-generation process.

For SQL Server, select the best int for the job and make it an IDENTITY. Select among the 1 byte Tinyint (0 to 255), 2 byte Smallint (-32,768 to 32,768), 4 byte Int (-2,147,483,648 to 2,147,483,647), and the 8 byte Bigint (-2^63 to 2^63-1) depending on the expected maximum size of the dimension.

It may not be practical or feasible to make the primary key attribute in your dimensions IDENTITY Ints. Performance may also be an issue when assigning and returning these new keys. There are really two alternatives, my favorite first:

Alternative 1: Use a staging table. Using a lookup/reference table in the staging area is an option that always works. The concept is simple: Create lookup tables that correspond to each dimension in the model. Include the IDENTITY surrogate key column, the natural key(s), and active/inactive dates (for Type 2 dimensions). When processing dimensions, insert new rows in this lookup table just before adding them to the dimension. Use the surrogate key generated from the lookup table to insert into the dimension (you can either return the generated key using SCOPE_IDENTITY() or simply do a lookup/join on the natural key and status). When you need the key for the Fact table, you can do another lookup/join on the staging table. This approach takes pressure off of the dimensional model and creates opportunities to increase performance by optimizing the lookup/reference table. I’ll talk more about the details on this approach when I discuss Subsystem 14.

Alternative 2: Generate your own key. Another way to achieve the same result is to figure out the last key in the dimension and increment it manually. This number will then be used to insert into the dimension and again into the Fact. The problem with this approach is that if you do not process Facts immediately after each dimension, one at a time, then you will need to store and lookup the key in a staging table or in memory. Also, this approach assumes that no other process or thread is inserting data into the dimension at the same time. This, of course, would kill the sequential system and violate the primary key constraint on the table.

To explore Alternative 2 further, have a look at this post from Marco Russo of sqljunkies.com. He explains how to use an Execute SQL Task in SSIS that executes the following:

SELECT COALESCE( MAX( ID_Dimension ), 0 ) AS LastID 
    FROM Dimension

The claim is that his method is the best, most scalable solution. I don’t fully agree, as my experiences dictate otherwise. But I believe that each approach has merit and depends a lot on the environment.

Hand Coding with Visual FoxPro (VFP9)

For VFP, use an int data type with AutoIncrement enabled. As with a SQL Server implementation, I prefer using this approach to generating keys manually. Even still, both alternatives mentioned above are also available in FoxPro:

Alternative 1: Use a staging table. Here, as with SQL Server, use a staging table with AutoIncrement enabled on the primary key. As a consequence, you will not enabled AutoIncrement on the dimension. To return the newly added key from the lookup/reference table, you simply use the GETAUTOINCVALUE() function and return it.

Alternative 2: Generate your own key. You can also determine the last key used in the dimension and manually increment it. Personally, this approach offers very little to a VFP developer. Alternative 1 — or simply using the Dimension for generating the surrogate — is by far the best approach.

I like VFP for these types of tasks. Whatever you decide, you don’t really need to worry about returning the generated key. You typically insert and update all your dimensional data first, and then return to the dimension later to lookup keys just before inserting the facts. Nothing beats an old-fashioned SEEK in this scenario, which can be light years faster than building a select with a where clause depending on how crafty you are with Rushmore and Index building.

Compare this:

SELECT PrimaryKey ;
    FROM Dimension ;
    WHERE NaturalKey = 'SomeValue' ;
        AND StatusFlag = True

to this:

SEEK('SomeCandidateKey','Dimension','CandidateIndex')

If you are using Alternative 1, then instead of SEEK, you could build a JOIN between the lookup table and the data being processed for the fact. This set-based approach can be pretty speedy as well.

If you insist on exploring Alternative 2, the translation from Marco’s TSQL code to VFP is as follows:

SELECT NVL(MAX(ID_Dimension),0) AS LastID ;
    FROM Dimension

In my experience, I simply use VFP9’s Autoincrementing Integer field in each Dimension. I don’t bother with a staging lookup table because with VFP, I can just set the order to the natural key and do a SEEK on the dimension. It’s a fast, simple, and consistent approach.

VFP Really Shines

In this series of articles, I have attempted to highlight some of the differences between implementing the 34 Subsystem best practices in both a hand-coded environment (VFP) and through a tool specifically designed for the job (SSIS). In each case, both VFP and SSIS have really been about equal (this is a non-scientific study, of course!). SSIS has advantages that are hard to duplicate in VFP (some of which I will be highlighting later on in this series), but in this case, VFP really shines. While both databases can handle generating keys for us in similar ways, VFP’s SEEK command allows us to use it!

I don’t want to dwell on this any more. I’ll discuss more details on how to use the surrogate keys when I talk about Subsystem 14: Surrogate Key Management.

Tags: , , , , , ,

5 Comments