Posts Tagged Data Integration

Scoping Data Warehouse Initiatives

focus Scoping Data Warehouse InitiativesData warehousing is a complex operation. From start to finish (if there is a finish), project teams are faced with many challenges. In all phases of the lifecycle, there are opportunities for derailment. The best way to mitigate potential issues and stay on time and within budget is to carefully define and manage scope. Managing scope can be an ongoing struggle (especially if requirements are not clearly defined or justified). While this is really a PM101-type of topic, I feel there are some fine points in a DW/BI environment that are not mentioned enough.

Consider the following:

Programs verses projects

I won’t get into a deep PM discussion here, but it is important to point out that data warehousing (or business intelligence, master data management, etc.) initiatives should be thought of as programs and not projects. This mindset will help in scoping.

A program (which might also be called a “project portfolio” in some circles) is basically just a set of related projects. With a program, the emphasis is on organizing, prioritizing, and allocating resources to the right projects. Program scope is more strategic, and answers long-term questions about what type of value the organization hopes to achieve from the initiative.

A project, on the other hand, is much more specific — with a set number of deliverables and goals that have a high immediate impact. The scope at the project level is therefore more tactical in nature: high impact, fast delivery. Be aware that some projects may never be given the green light (for example, if there is a low business impact or if there is a low feasibility rating because of data source or data quality complications).

What I find odd is that organizations still choose to tackle immense data warehousing initiatives in one or two shots, trying to deliver everything at once over a period of 18 or more months. This is the wrong approach (here’s why). Break this large initiative into individual projects and try to deliver functionality every 6 to 8 weeks.

The business process

The best way to break down data warehousing programs into high-impact projects is along business process lines. A business process, as defined here, is:

The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.

Some example of the above: inventory tracking, Internet sales, retail sales, marketing, tax assessment, tax collection, pitching, batting.

In any data warehousing environment, you can expect to have several business processes to model. Each business process you tackle will have elements touching upon different aspects of the data warehouse, including infrastructure, middleware, data modeling, ETL, business logic development, presentation elements, and so on. If you scope each project to the business process, you can deliver complete solutions in the shortest amount of time. (It should be obvious that the very first business process you implement will take the longest, as the team works out the core infrastructure. Most of this infrastructure will be reused by other business processes.)

Avoid scoping to a data source

Do not fall into the trap of scoping to a data source. Scoping to a data source is almost guaranteed to deliver mediocre outcomes. These projects typically include many unfinished or inadequate business processes all delivered at once some time in the distant future and long after the excitement over the initiative has subsided.

While it is true that only one or two data sources might exist in some organizations, it is not true that inventory, customers, sales, procurement, shipping, and other business processes need to be taken on at once. Create a single project for each business process, prioritize based on impact and feasibility, and then badabing badaboom, you deliver. Next.

Along the same lines, do not adjust your scope if the data source is unavailable, uncooperative, or lacking in quality. Instead, bring the fight to the data source (here is where a good, preferable C-Leveled, business sponsor can come in handy) and set things right. This is obviously a project risk, and also an organizational risk. If you are having problems extracting inventory data then maybe its time to put down your data warehousing gloves and get a new inventory system.

Last thoughts

Scoping the data warehouse is a difficult problem. Troubles start early on with the initial idea, it moves on through requirement gathering, and finally into the development phase of the lifecycle. There is not a lot of good advice in this area for data warehousing (if you happen to know of a good source, please send me a link or title). But I do find that if you work towards business processes, think in terms of programs and projects, and avoid the data source trap, scoping decisions will settle into the real needs of the business.

Tags: , , , , , , , , ,

1 Comment

The Three Faces of a Good ETLer

Hiring a “data integration expert” or consultant for your next, greatest, data warehousing project? Don’t take it lightly. ETL personnel are critical to the success or failure of your project.

The following are what I deem to be essential technology-related aspects, or faces, of a good ETL developer and/or architect (herein referred to as an ETLer for lack of creativity). While you need to consider business and industry knowledge, personality, and experience in your team-building process, you should start by checking off the following on your interview sheet:

First Face: the technologist

Programming must come natural to an ETLer. Objects, logical constructs, expression construction, program flow, and the like, must be well understood. The truth is that no matter how much your vendor proclaims that their tool does it all, chances are excellent that some hand coding will be required. On top of that, ETL tools work a lot like procedural programs. Technologists are very good at putting their right foot forward, and will generally think of things to make the ETL flow perform better. They also think about logging, auditing, and exception handling; all important.

Second Face: the theorist

But a solid programming background is not enough. Knowledge of Data Integration theory and best practices are equally important. While I believe in and use Kimball’s methodologies for integrating data into a dimensional data warehouse, other methodologies exist that may be more suitable to your business and integration needs. Following a proven methodology, with slight modifications to suit your environment will get you further, faster. Having little or no theory behind what you’re doing gets you somewhere, slower. Identify your methodology, and then find someone who understands it.

Third Face: the specialist

Knowing the ins and outs of your ETL tool (SSIS, OWB, Datastage, Talend Open Studio, etc.) is essential. I would venture to guess that a solid programmer who has a great understanding of ETL theory will be able to get by using most tools with little learning curve. What I worry about (and you should too) are the nuances in the tooling that can stump even the best. These nuances (SSIS, my tool of *ehem* choice — sorry, I needed to clear my throat, has many of these nuances) can cost you many project hours and force rewrites if blocking issues are encountered. Tool knowledge is also essential to know when it is appropriate to forgo the tool because of I/O issues, or because hierarchical data is better handled elsewhere, or because business logic is best not bundled within a data flow.

About Face

While junior members of your data integration team can be one or two-faced (that came out funny), senior members and architects must have more meat on the bone.

I suppose this is why good ETLers are difficult to come by. The ETLer needs to have a healthy mix of programming talent, an approach discipline, and tool knowledge. Trained DBAs and software developers might have a lot to offer, as might a troop of certified tool jocks and method junkies, but to get your project in on time and within budget, don’t settle.

Tags: , , , , ,

No Comments

10 Commandments of Data Integration

  1. You shall compile and document all requirements and mappings; segregate the work by business process. You may have more than one of these business processes, some of which may come before others.
  2. Do not begin without first conducting a thorough data profile; otherwise, you will be punished for your inequities, as will the generations that come after you.
  3. Do not think commandments one or two are in vain, lest you will become overrun by the dead line, scope creepers, and a great exodus of people from your tribe; if this happens to you, do not swear or curse, for you have been warned.
  4. Remember that latency and timeliness are equal in importance to non-volatility and having a traceable lineage; a staging area may lead you to this promised land.
  5. Honor the rules of data conformance.
  6. Do not kill dirty data: you shall clean them, or take them back to their sources for retribution.
  7. Do not commit the worst data integration transgression of all and ignore data quality, your ignorance will not be forgiven.
  8. Do not be shy about stealing your neighbor’s work, for his trials have led to best practices that you can make equally good use of.
  9. Do not rely solely on business keys; surrogates are your friend and will permit you to engage in slowly changing your dimensions.
  10. You shall covet a proper audit and log system; for on the day of judgment, you will need proof of your compliance.

Tags: , , , , ,

No Comments

ETL Subsystem 31: Paralleling and Pipelining

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

parallel processingOf all the subsystems that I’ve discussed so far, this one resulted in the most research. I had to (a) learn more about how paralleling works and (b) experiment with my environment to better understand it.

Honestly, I’ve taken this subsystem for granted over the years. And for VFP, I’ve done little exploration in this arena. For SSIS, I have tended to adjust the settings I can adjust (more on this below), monitor the results, and tweak my performance as needed. In some environments, this lackluster approach will get you by just fine. If you have very small load windows and performance is critical, then you’ll need to make a better effort.

So what is Paralleling and Pipelining?

Lumped together into a single subsystem, these two performance means are related but different. They’re cousins, I suppose. Running your ETL processes in parallel means that your ETL system is carrying out multiple operations simultaneously. Pipelining your ETL processes means that you can start new operations before the previous ones complete.

Paralleling and Pipelining are quite desirable. And depending on your tool of choice, taking advantage of them can be painless or painful.

How can you apply them?

You can achieve parallel processing by utilizing the CPUs on a single machine, or you can utilize multiple machines. The first option is the easiest to set up and results can be quite good. For a multiple CPU (or core) setup, you are actually running code (programs, algorithms) simultaneously, potentially doubling performance, all on the same box. You can scale out (i.e. scale horizontally) your ETL processes by adding computers (see What is distributed computing? by Kirk Pearson), allowing you to take advantage of the CPUs, RAM, and I/O of each machine. The latter has some significant design implications, but well worth it if your environment needs it.

assembly-linePipelining increases throughput. Unlike parallelism, it will not allow the instruction to run faster, but rather it permits downstream processes to start before the upstream process finishes. A great analogy is an assembly line, where parts are added to the whole as it travels down the line.

Getting parallelism and pipelining to work together is the Holy Grail of ETL performance. While certain performance techniques are available at all phases of data integration (from Extraction and CDC, to surrogate key handling and using partitions for fast loading), none can compare to the gains you can realize with this subsystem.

You should also keep in mind that CPU multitasking is different from parallel processing, and multithreading is different from pipelining. A multitasking process shares CPU resources, giving the illusion of parallelism (although one man’s illusion can be another man’s reality). Multithreaded applications share the same memory, but operate on different engine threads (i.e. a subtask). Multitasking and multithreading, like pipelining, increase throughput and also play an important role in performance tuning. I’ll talk a little more about this below in my section about FoxPro. Otherwise, if anyone is interested, I can try to elaborate in another post.

Where can this best be utilized?

Here are some ideas on where you can focus your efforts:

  • When loading historical data or retrieving data from multiple similar sources, execute the same package for different date ranges, at the same time (in SSIS for example, use multiple Execute Package Tasks or run the same package multiple times together as Jamie Thomson explores here), or you could design your historical load packages to break apart the data into separate threads.
  • Spread out UPDATE statements. This can be real handy if you have a few accumulating snapshot fact tables.
  • Spread out complex algorithms and routines that can operate on a subset of data.
  • Load staging tables while downstream processing loads your dimensional model
  • Do lookups (especially surrogate key lookups) in parallel
  • Distribute your conformed dimensions to other machines, data marts, etc. in parallel

 

SQL Server 2005 Integration Services (SSIS)

As you know, I use SSIS and VFP for ETL (not at the same time or on the same project though). With SSIS, I can quickly create complex routines that can automatically take advantage of multiple processors. The native support for buffers, execution trees, and parallelism makes my job pretty easy (which is why I suppose I’ve taken this subsystem for granted over the years). Simply understand how SSIS works, adjust the settings you need to adjust, monitor your performance, and tweak as needed.

To get a grip, the following resources are invaluable:

 

Hand Coding with Visual FoxPro (VFP9)

While SSIS and SQL Server have built-in mechanisms to manage most of the paralleling and pipelining responsibilities for you, FoxPro does not. You can achieve some very good results using VFP and multithreading, but you have to be extremely creative in how you handle paralleling and pipelining. If you don’t think this is the case, I’d love to hear how paralleling and pipelining can be achieved with VFP!!!

Of course, the VFP community is — and has always been — quite creative. As with most of this sort of thing, Calvin Hsia is near the front of the line. MTmyVFP (True VFP multi-threading) on CodePlex is a creative example using Hsia’s Multithreading class. For more information and a ton of details, check out:

As I’ve stated before, multithreading is not parallelism, nor is it pipelining. But if you utilize MTmyVFP (or similar solution) in your VFP ETL system, you will realize many performance benefits. Lastly, there was a pretty interesting, albiet short, discussion on this issue here.

From here

This post might have come off a bit long-winded, but there were quite a few important points to make. I hope that I’ve been able to distill what I’ve learned and that in the end, it all makes some sense. In my next ETL post, I’ll talk about ETL Subsystem 32: Security.

Tags: , , , , , , , , , ,

No Comments

ETL Subsystem 30: Problem Escalation

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

ETL Subsystem 30: Problem EscalationProblem escalation in a data integration project is much like problem escalation for any deployed application. The primary difference is that most of the escalations are initiated by software and not people.

A typical escalation scenario would start with an end-user or business user report or complaint. In a data integration environment, your data monitoring tools, scheduler, and ETL system will be watching for the exceptional events and states that cannot be handled automatically. These exceptions are then forwarded through your various levels of support (or simply to your data warehouse maintenance group) by email or through some dashboard/support desk application. This process is in addition to calls made by your end users to your help desk.

The ETL Subsytems, remember, are a set of best-practices identified by the Kimball Group for data integration. So it follows that problem escalation and resolution would be an important element. Note that problem escalation is also a major component of your Service Level Agreement (SLA) — A contract between you and your business users which states how you will provide your (DW/BI) service over some period.

The ultimate goal of this Subsystem is to create a highly automated support center that will keep your data integration processes healthy. The support center’s foundation is its escalation plan; in other words, the pathway that an incident takes through your team. This pathway includes stops along the different support levels. Each level has certain capabilities and expertise that can help solve problems as fast as possible.

What are incidents?

An incident, according to ITIL, is “any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service.” They range from end-user complaints to server crashes. Problems, issues, and complaints are all types of incidents.

I like to categorize ETL incidents into 3 broad groups:

  • Data (where quality, latency, or reliability is the primary issue)
  • Process (where one of the ETL components is failing due to some exception)
  • Infrastructure (where the network, hardware, middleware, or any supporting software fails)

After the incident is categorized, a severity level can be assigned.

Severity

I recommend taking the time to develop a good severity matrix. A severity matrix is a nothing more than a table with the following headings: Severity Level and Description, Response Time, and Resolution Time. The matrix will help you determine what incidents are sent where, how responses on the issues should be met, and what the expected turnaround times should be.

Severity Matrix example from shortinfosec.com

The above image is an example of a severity matrix taken from Information Security Short Takes, in an article titled “9 Things to watch out for in an SLA“. A very good read if you want to know more about SLAs!

Severity Level and Description
Usually you will see severity levels ranging from 1 to 4. The rankings depend entirely on your organization, your IT framework (if you use ITIL, for example), and the range of users using the data warehouse. As an example:

  1. Critical: A level reserved for situations when your data warehouse or BI applications are non-functioning
  2. High: Any non-critical issue that prevents one or more people from doing their job
  3. Medium: All other problems not deemed high or critical
  4. Normal: User requests, such as a new installation, that are not deemed to be true incidents or perhaps known issues that will be addressed in some future release
Response Time
Once an incident is reported, what is the expected feedback time? For critical issues, the feedback should be immediate. For Normal requests, the requesting party or parties should be notified as soon as possible and practical.
Resolution Time
This is the expected time it takes to resolve an issue. Critical issues must be resolved quickly and could involve all of your resources, while simple requests would be handled as time and resources permit.

The next step is defining your escalation groups — the people in charge of handling the various incidents.

Escalation Groups

Remember that ITIL and other frameworks already define how your escalation groups should look and interact. But also keep in mind that data integration is much different than typical applications. Your customers are generally high-level analysts and power users, managers who are responsible for P&L, and all those wonderful C-Level executives who expect this heavy Business Intelligence (or SOA, MDM, etc.) investment to run flawlessly all the time.

If you are not operating under an IT framework, and you have some flexibility in how you handle incidents, then consider the following:

  • Create 3 Escalation Groups, or “lines of support”:
    1. Triage - will organize and distribute incidents appropriately (i.e. the help desk)
    2. Analysts and your Data Steward - will be responsible for thinking through and building resolution plans for data and process problems
    3. DBAs, network admins, and the development team - will do the work required to correct the issue

  • Automate almost every part of the incident reporting process. This avoids interaction with Triage, saving precious time. An added benefit is that some issues can be resolved before a manager has to make a call. Some examples:
    • Send an email to the support team if a job fails, a report crashes, or some other process is interrupted
    • Pick up a 3rd-party bug report system that can be installed on the company intranet and allow your users to access and post to this system
    • Infrastructure issues should go directly to the IT department in charge of the component, skipping the need for level 2 support.

  • Do not outsource critical support personnel. Unless the contractors are engaged with your organization and business, they likely won’t care enough to truly “own” the problem (unfortunately, this is the case where I work: the outsourced help just doesn’t “get it” and often lazily handle issues business users deem critical).

From here

As an ETL developer and/or architect, problem escalation might not be on your radar. But it should be. Everything you build should self-report when an exception occurs. That’s the key to automating this Subsystem. You can’t do it afterward (at least not easily), so it must be in the initial planning.

In my next post, I’ll dive into ETL Subsystem 31: Paralleling and Pipelining.

 

Tags: , , , ,

No Comments