This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL”. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand-coded in Visual FoxPro.
Problem escalation in a data integration project is much like problem escalation for any deployed application. The primary difference is that most of the escalations are initiated by software and not people.
In a typical application, an escalation starts with a report or complaint from an end user or business user. In a data integration environment, by contrast, your data monitoring tools, scheduler, and ETL system watch for the exceptional events and states that cannot be handled automatically. These exceptions are then forwarded through your various levels of support (or simply to your data warehouse maintenance group) by email or through some dashboard/support desk application. This process is in addition to calls made by your end users to your help desk.
The ETL Subsystems, remember, are a set of best practices identified by the Kimball Group for data integration, so it follows that problem escalation and resolution would be an important element. Note that problem escalation is also a major component of your Service Level Agreement (SLA): a contract between you and your business users that states how you will provide your DW/BI service over some period.
The ultimate goal of this Subsystem is to create a highly automated support center that will keep your data integration processes healthy. The support center’s foundation is its escalation plan; in other words, the pathway that an incident takes through your team. This pathway includes stops along the different support levels. Each level has certain capabilities and expertise that can help solve problems as fast as possible.
What are incidents?
An incident, according to ITIL, is “any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service.” Incidents range from end-user complaints to server crashes. Problems, issues, and complaints are all types of incidents.
I like to categorize ETL incidents into 3 broad groups:
- Data (where quality, latency, or reliability is the primary issue)
- Process (where one of the ETL components is failing due to some exception)
- Infrastructure (where the network, hardware, middleware, or any supporting software fails)
After the incident is categorized, a severity level can be assigned.
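As a rough illustration, the three groups can be modeled as a small enumeration attached to each incident record. This is only a sketch: the `Incident` fields and the job name below are hypothetical, not part of any Kimball or ITIL specification.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    DATA = "data"                      # quality, latency, or reliability
    PROCESS = "process"                # an ETL component failed with an exception
    INFRASTRUCTURE = "infrastructure"  # network, hardware, middleware, supporting software


@dataclass(frozen=True)
class Incident:
    source: str        # hypothetical: the job or monitoring tool that raised it
    summary: str       # short human-readable description
    category: Category


# Example: a monitoring tool flags a suspicious drop in loaded rows.
incident = Incident(
    source="nightly_sales_load",
    summary="Row count fell 40% versus yesterday",
    category=Category.DATA,
)
print(incident.category.value)  # data
```

Recording the category at the moment the incident is raised means the later steps (severity and routing) do not have to guess it from free text.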
Severity
I recommend taking the time to develop a good severity matrix. A severity matrix is nothing more than a table with the following headings: Severity Level and Description, Response Time, and Resolution Time. The matrix will help you determine which incidents are sent where, how quickly each one should be acknowledged, and what the expected turnaround times should be.
The above image is an example of a severity matrix taken from Information Security Short Takes, in an article titled “9 Things to watch out for in an SLA”. A very good read if you want to know more about SLAs!
- Severity Level and Description
  - Usually you will see severity levels ranging from 1 to 4. The rankings depend entirely on your organization, your IT framework (if you use ITIL, for example), and the range of users using the data warehouse. As an example:
    - Critical: A level reserved for situations when your data warehouse or BI applications are non-functioning
    - High: Any non-critical issue that prevents one or more people from doing their job
    - Medium: All other problems not deemed high or critical
    - Normal: User requests, such as a new installation, that are not deemed to be true incidents or perhaps known issues that will be addressed in some future release
- Response Time
  - Once an incident is reported, what is the expected feedback time? For critical issues, the feedback should be immediate. For Normal requests, the requesting party or parties should be notified as soon as possible and practical.
- Resolution Time
  - This is the expected time it takes to resolve an issue. Critical issues must be resolved quickly and could involve all of your resources, while simple requests would be handled as time and resources permit.
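A severity matrix translates naturally into a lookup table that your monitoring code can consult. The sketch below uses the four example levels from the list above. The response and resolution times are placeholders I made up for illustration; the real values have to come from your own SLA.

```python
from datetime import datetime, timedelta
from enum import IntEnum
from typing import NamedTuple


class Severity(IntEnum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    NORMAL = 4


class Targets(NamedTuple):
    response: timedelta    # how quickly the reporter must hear back
    resolution: timedelta  # how quickly the incident must be resolved


# Placeholder numbers only; substitute the values agreed in your SLA.
SEVERITY_MATRIX = {
    Severity.CRITICAL: Targets(timedelta(minutes=15), timedelta(hours=4)),
    Severity.HIGH: Targets(timedelta(hours=1), timedelta(hours=8)),
    Severity.MEDIUM: Targets(timedelta(hours=4), timedelta(days=3)),
    Severity.NORMAL: Targets(timedelta(days=1), timedelta(days=30)),
}


def deadlines(severity, reported_at):
    """Return (respond_by, resolve_by) for an incident reported at reported_at."""
    targets = SEVERITY_MATRIX[severity]
    return reported_at + targets.response, reported_at + targets.resolution


respond_by, resolve_by = deadlines(Severity.CRITICAL, datetime(2024, 1, 1, 9, 0))
print(respond_by, resolve_by)
```

Keeping the matrix as data rather than scattering the numbers through your jobs means a renegotiated SLA becomes a one-place change.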
The next step is defining your escalation groups — the people in charge of handling the various incidents.
Escalation Groups
Remember that ITIL and other frameworks already define how your escalation groups should look and interact. But also keep in mind that data integration is quite different from a typical application. Your customers are generally high-level analysts and power users, managers who are responsible for P&L, and all those wonderful C-Level executives who expect this heavy Business Intelligence (or SOA, MDM, etc.) investment to run flawlessly all the time.
If you are not operating under an IT framework, and you have some flexibility in how you handle incidents, then consider the following:
- Create 3 Escalation Groups, or “lines of support”:
  - Triage - will organize and distribute incidents appropriately (i.e. the help desk)
  - Analysts and your Data Steward - will be responsible for thinking through and building resolution plans for data and process problems
  - DBAs, network admins, and the development team - will do the work required to correct the issue
- Automate almost every part of the incident reporting process. This avoids interaction with Triage, saving precious time. An added benefit is that some issues can be resolved before a manager has to make a call. Some examples:
  - Send an email to the support team if a job fails, a report crashes, or some other process is interrupted
  - Pick up a 3rd-party bug report system that can be installed on the company intranet and allow your users to access and post to this system
- Infrastructure issues should go directly to the IT department in charge of the component, skipping the need for level 2 support.
- Do not outsource critical support personnel. Unless the contractors are engaged with your organization and business, they likely won’t care enough to truly “own” the problem (unfortunately, this is the case where I work: the outsourced help just doesn’t “get it” and often lazily handles issues that business users deem critical).
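The routing suggestions above can be written down as one small function. Treat this as a sketch of one possible reading of the list: the group names are hypothetical labels, and your own escalation plan may order or split the lines of support differently.

```python
def escalation_path(category, automated):
    """Return the ordered support groups an incident passes through.

    category  -- "data", "process", or "infrastructure"
    automated -- True when software raised the incident, so Triage is skipped
    """
    if category not in ("data", "process", "infrastructure"):
        raise ValueError(f"unknown incident category: {category!r}")

    # Automated reports bypass the help desk entirely.
    path = [] if automated else ["triage"]

    if category == "infrastructure":
        # Goes straight to the owning IT department, skipping level 2.
        path.append("it_department_for_component")
    else:
        # Data and process problems get a resolution plan first, then the fix.
        path.append("analysts_and_data_steward")
        path.append("dbas_admins_and_developers")
    return path


print(escalation_path("process", automated=True))
print(escalation_path("infrastructure", automated=False))
```

A function like this is also a convenient place to document the plan: anyone who wants to know where an incident goes can read it in a dozen lines.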
From here
As an ETL developer and/or architect, problem escalation might not be on your radar. But it should be. Everything you build should self-report when an exception occurs. That’s the key to automating this Subsystem. You can’t do it afterward (at least not easily), so it must be in the initial planning.
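To make "self-report" concrete, here is a minimal sketch of one way to do it in Python: wrap each ETL step so that any exception is turned into an alert before it propagates. The decorator name, the email addresses, and the failing step are all hypothetical, and the `notify` callable is deliberately pluggable so that nothing here assumes a particular mail server or ticketing tool.

```python
import functools
import traceback
from email.message import EmailMessage


def build_alert_email(job_name, error, to_addr="dw-support@example.com"):
    """Format a failure as an email message (addresses are placeholders)."""
    msg = EmailMessage()
    msg["Subject"] = f"[ETL FAILURE] {job_name}: {type(error).__name__}"
    msg["From"] = "etl@example.com"
    msg["To"] = to_addr
    # Include the full traceback so support can start diagnosing immediately.
    msg.set_content(
        "".join(traceback.format_exception(type(error), error, error.__traceback__))
    )
    return msg


def self_reporting(notify):
    """Decorator: report any exception from the step via notify, then re-raise."""
    def decorator(step):
        @functools.wraps(step)
        def wrapper(*args, **kwargs):
            try:
                return step(*args, **kwargs)
            except Exception as error:
                notify(build_alert_email(step.__name__, error))
                raise  # the scheduler still sees the failure
        return wrapper
    return decorator


# For demonstration, alerts are collected in a list instead of being sent.
outbox = []


@self_reporting(outbox.append)
def load_customer_dimension():
    raise ValueError("source file is missing the customer_id column")


try:
    load_customer_dimension()
except ValueError:
    pass

print(outbox[0]["Subject"])
```

In a real deployment `notify` might hand the message to `smtplib.SMTP(...).send_message(msg)` or post it to your support desk application. The point is that the reporting hook is part of the step from day one, rather than bolted on later.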
In my next post, I’ll dive into ETL Subsystem 31: Paralleling and Pipelining.