Posts Tagged BI

Scoping Data Warehouse Initiatives

focus Scoping Data Warehouse InitiativesData warehousing is a complex operation. From start to finish (if there is a finish), project teams are faced with many challenges. In all phases of the lifecycle, there are opportunities for derailment. The best way to mitigate potential issues and stay on time and within budget is to carefully define and manage scope. Managing scope can be an ongoing struggle (especially if requirements are not clearly defined or justified). While this is really a PM101-type of topic, I feel there are some fine points in a DW/BI environment that are not mentioned enough.

Consider the following:

Programs verses projects

I won’t get into a deep PM discussion here, but it is important to point out that data warehousing (or business intelligence, master data management, etc.) initiatives should be thought of as programs and not projects. This mindset will help in scoping.

A program (which might also be called a “project portfolio” in some circles) is basically just a set of related projects. With a program, the emphasis is on organizing, prioritizing, and allocating resources to the right projects. Program scope is more strategic, and answers long-term questions about what type of value the organization hopes to achieve from the initiative.

A project, on the other hand, is much more specific — with a set number of deliverables and goals that have a high immediate impact. The scope at the project level is therefore more tactical in nature: high impact, fast delivery. Be aware that some projects may never be given the green light (for example, if there is a low business impact or if there is a low feasibility rating because of data source or data quality complications).

What I find odd is that organizations still choose to tackle immense data warehousing initiatives in one or two shots, trying to deliver everything at once over a period of 18 or more months. This is the wrong approach (here’s why). Break this large initiative into individual projects and try to deliver functionality every 6 to 8 weeks.

The business process

The best way to break down data warehousing programs into high-impact projects is along business process lines. A business process, as defined here, is:

The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.

Some example of the above: inventory tracking, Internet sales, retail sales, marketing, tax assessment, tax collection, pitching, batting.

In any data warehousing environment, you can expect to have several business processes to model. Each business process you tackle will have elements touching upon different aspects of the data warehouse, including infrastructure, middleware, data modeling, ETL, business logic development, presentation elements, and so on. If you scope each project to the business process, you can deliver complete solutions in the shortest amount of time. (It should be obvious that the very first business process you implement will take the longest, as the team works out the core infrastructure. Most of this infrastructure will be reused by other business processes.)

Avoid scoping to a data source

Do not fall into the trap of scoping to a data source. Scoping to a data source is almost guaranteed to deliver mediocre outcomes. These projects typically include many unfinished or inadequate business processes all delivered at once some time in the distant future and long after the excitement over the initiative has subsided.

While it is true that only one or two data sources might exist in some organizations, it is not true that inventory, customers, sales, procurement, shipping, and other business processes need to be taken on at once. Create a single project for each business process, prioritize based on impact and feasibility, and then badabing badaboom, you deliver. Next.

Along the same lines, do not adjust your scope if the data source is unavailable, uncooperative, or lacking in quality. Instead, bring the fight to the data source (here is where a good, preferable C-Leveled, business sponsor can come in handy) and set things right. This is obviously a project risk, and also an organizational risk. If you are having problems extracting inventory data then maybe its time to put down your data warehousing gloves and get a new inventory system.

Last thoughts

Scoping the data warehouse is a difficult problem. Troubles start early on with the initial idea, it moves on through requirement gathering, and finally into the development phase of the lifecycle. There is not a lot of good advice in this area for data warehousing (if you happen to know of a good source, please send me a link or title). But I do find that if you work towards business processes, think in terms of programs and projects, and avoid the data source trap, scoping decisions will settle into the real needs of the business.

Tags: , , , , , , , , ,

1 Comment

The Three Faces of a Good ETLer

Hiring a “data integration expert” or consultant for your next, greatest, data warehousing project? Don’t take it lightly. ETL personnel are critical to the success or failure of your project.

The following are what I deem to be essential technology-related aspects, or faces, of a good ETL developer and/or architect (herein referred to as an ETLer for lack of creativity). While you need to consider business and industry knowledge, personality, and experience in your team-building process, you should start by checking off the following on your interview sheet:

First Face: the technologist

Programming must come natural to an ETLer. Objects, logical constructs, expression construction, program flow, and the like, must be well understood. The truth is that no matter how much your vendor proclaims that their tool does it all, chances are excellent that some hand coding will be required. On top of that, ETL tools work a lot like procedural programs. Technologists are very good at putting their right foot forward, and will generally think of things to make the ETL flow perform better. They also think about logging, auditing, and exception handling; all important.

Second Face: the theorist

But a solid programming background is not enough. Knowledge of Data Integration theory and best practices are equally important. While I believe in and use Kimball’s methodologies for integrating data into a dimensional data warehouse, other methodologies exist that may be more suitable to your business and integration needs. Following a proven methodology, with slight modifications to suit your environment will get you further, faster. Having little or no theory behind what you’re doing gets you somewhere, slower. Identify your methodology, and then find someone who understands it.

Third Face: the specialist

Knowing the ins and outs of your ETL tool (SSIS, OWB, Datastage, Talend Open Studio, etc.) is essential. I would venture to guess that a solid programmer who has a great understanding of ETL theory will be able to get by using most tools with little learning curve. What I worry about (and you should too) are the nuances in the tooling that can stump even the best. These nuances (SSIS, my tool of *ehem* choice — sorry, I needed to clear my throat, has many of these nuances) can cost you many project hours and force rewrites if blocking issues are encountered. Tool knowledge is also essential to know when it is appropriate to forgo the tool because of I/O issues, or because hierarchical data is better handled elsewhere, or because business logic is best not bundled within a data flow.

About Face

While junior members of your data integration team can be one or two-faced (that came out funny), senior members and architects must have more meat on the bone.

I suppose this is why good ETLers are difficult to come by. The ETLer needs to have a healthy mix of programming talent, an approach discipline, and tool knowledge. Trained DBAs and software developers might have a lot to offer, as might a troop of certified tool jocks and method junkies, but to get your project in on time and within budget, don’t settle.

Tags: , , , , ,

No Comments

Avoid Data Dead Ends and Information Loss

Black Hole Avoid Data Dead Ends and Information LossWhen analyzing data to make a decision, the last thing you want to encounter is a data dead end. You may be digging into some figures only to find that the data you have access to has been aggregated, combined, filtered, interpreted, or otherwise changed (in an unauthorized way) from its original source. And as an analyst, the last thing that you want to discover is that your ETL processes are solely responsible.

In Business Intelligence and decision-support instances, especially reports and dashboards, data alterations are common. Aggregates, summaries, snapshots, and the like are normal and necessary for a bird’s eye view of whatever business process is being examined. But in order to avoid information loss, be certain that the underlying data is intact at the most atomic and granular level. And also be sure analysts can get at this data (no black boxes allowed). You don’t want this information to be tossed into a black hole never to be seen or heard from again.

Atomic and Granular

I like to distinguish atomicity from granularity in the following way: Atomicity refers to non-additive and descriptive elements, usually stored as dimensions or non-additive facts, while granularity refers to measurement data usually stored as facts in a business process dimensional model. You could interchange these definitions under certain circumstances, but I like to draw the line so it is clear what I’m talking about.

Atomicity

Atomic data elements will give you the ability to conduct deeper research. By atomic, I mean that the data element has an exact meaning and does not represent some concatenated value or total. The sum of the parts have greater meaning than their whole, and in the end, allow analysts to cut analysis across different dimensions at a very minute scale.

  • A phone number is better split into country code, area code, and subscriber number
  • A street address into street number, name, type, and direction
  • A person’s name into surname and given name
  • A parcel ID into plat, lot, and map
  • An industry classification into groups and subgroups
  • A date into year, quarter, month, week of year, day, and day of week
  • Et cetera!

Granularity

With granularity, you define the level of detail in a measurement. The more granular, the greater the detail. For a trip to the market, you can define the granularity of your shopping excursion on the item level (each item in the basket), by product (grouping similar items), or perhaps by the entire basket as a whole. The choice is yours. Of course, storing the price of each item is the most granular and will give you the greatest flexibility in your analysis. You can then build your aggregates (by product, entire basket, etc.) from the most granular metrics.

If you decide to load data at larger grains, you are losing information and creating dead ends for your decision-makers. It pays to load data at the finest grain possible.

From here…

Integrating data into the data warehouse at an atomic and granular level gets you pretty far. You are likely already doing this (especially if you are familiar with transaction grain fact tables). But there are other ways you can lose data, and therefore information. In a follow-up to this post, I’ll discuss how evaluations and logic gates can also be a source of information loss.

I’d like to know your thoughts on this subject. Have I missed anything important, or have I marked something important that you feel is inconsequential?

Tags: , , , , ,

No Comments

Maintaining a DW/BI Environment

In posts Version Control and Version Migration I glossed over one of the more complex and challenging aspects of data warehousing: Once the DW/BI environment is in a production environment, how do you maintain and update it?

Some Thoughts

While I do not have a blueprint for you, I do have a few thoughts on the subject.

First, you must consider the Data Warehouse as a living and breathing organism. Not only will it be growing in size as your carefully constructed ETL packages churn, but it will also likely be growing in scope and importance (if not, then you may have to re-think your DW/BI marketing approach and/or find a new sponsor).

Second, you have to realize from the beginning that deploying a data warehouse is both an iterative and incremental process. Iterative in that you will build and rebuild as you get deeper into the project; incremental in that different parts of the warehouse will be constructed and delivered at varying rates. This is in direct contrast to the normal waterfall approach to releasing software applications and systems. It is not realistic, practical, or advised to attempt to deliver a DW/BI project in one shot. You may as well use your bullet for something else!

Passing the Baton

Pass the BatonYour DW/BI team must plan up front for the often complex handoffs between development and maintenance amid the ever-turning wheel of the DW Lifecycle (detailed here).

Those handoffs are critical. If you’re in a large organization (like me now), then you will be literally handing off the maintenance of the project to a complete different group. This group will need documentation and escalation procedures to monitor and respond to various exceptions that may occur in the production environment. While you can’t plan for every exception, you should prepare enough documentation so that all the basics are covered (for example, what should the maintenance team do if a SQL Agent Job reports a failure?). While some things will fall back on the Dev team, many of the maintenance tasks can be handled by a well trained support group.

If you’re in a small organization (like I used to be in), then you are the maintenance team. You have the added responsibility of maintaining a high-priority production system while you continue to build new pieces for the next release. This can be exhausting and stressful so be sure you automate as much of the maintenance part as you can! And be sure to allocate enough “business as usual” time in your schedule each week. Make sure your manager and/or project leader is cognizant of this added pressure and responsibility.

Back to the Start

Once you’ve passed the baton, you can go back to the beginning of the DW Lifecycle and start over. You may improve some processes (iterative development) while adding on new functionality (incremental development).

DW LifecycleFor example, in your team’s first run through the lifecycle, you implemented a single business process dimensional model (retail sales) that allowed you to also produce a series of reports for your sales department. During this first run through, you accomplished quite a bit: You installed and learned how to use the toolsets, you built the dimensional model and ETL architecture, you designed a rudimentary web portal, you created a few useful reports, and you released all the pieces to your users. So far so good.

Now, this first version is out and the business users have some feedback. An issue log and wish list is compiled (perhaps a few new reports, an updated report template, and some user-defined filters for the portal). In addition, you want to also include a new business process, inventory, so you can expand the usage of the date warehouse in your business.

You and your team get back to work. The modeler begins to construct the new dimensional model while the ETL team works on the integration packages. This process takes the longest, and may be interrupted occasionally by the production team reporting some data quality issues or some failed packages.

In a separate thread, your BI application developers are busy with implementing many of the suggestions brought forward by the sales department. When finished, they release version 1.1 of the portal. Immediately, they begin working on some new inventory reports.

This scenario repeats with each cycle for each thread lasting anywhere from 3 weeks to 3 months.

Rosy Picture?

The above scenario really paints a rosy picture of the process. In reality, it doesn’t always work smoothly. But it does work if managed correctly. And it can be quite exciting when each iteration completes and the handover goes according to plan.

I would like to talk more about this subject in future posts. For now I hope my thoughts on the matter have left you with some insight!

 

Tags: , , , , ,

1 Comment

A Data Warehouser’s Vocabulary (Part 2)

This post is part 2 (read part 1) of a series of posts containing a glossary of terms and concepts that I feel has some relevance to the data warehousing and business intelligence world. Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! When finished, these terms will be compiled and made a static page on TmF.

Aggregation
The process of redefining data into a summarization based on some rules or criteria. Aggregation may also encompass de-normalization for data access and retrieval.
Analytical Processing
Producing analysis for management decisions, usually involving trend analysis, drill-down analysis, demographic analysis, profiling, and so on.
Attribute
Any detail that serves to qualify, identify, classify, quantify, or express the state of an entity.
Data Mining
The process of analyzing large amounts of data in search of previously undiscovered business patterns.
Dimension
A denormalized table in a dimensional model with a single part primary key and descriptive attribute columns.
Event
A signal that some activity (usually a business transaction) has occurred.
Fact
Central table of a Star Schema which numeric performance measurements identified by a composite key, each of whose elements is a foreign key drawn from a dimension table.
Heuristic Analysis
Heuristic Analysis is a method to help to solve a problem, commonly informal. It is particularly used for a method that often rapidly leads to a solution that is usually reasonably close to the best possible answer. Heuristics are “rules of thumb”, educated guesses, intuitive judgments or simply common sense.
Online Analytical Processing (OLAP, also MOLAP)
On-line retrieval and analysis of data to reveal business trends and statistics not directly visible in the data directly retrieved from a data warehouse. Also known as multidimensional analysis.
Outrigger
A secondary dimension table attached to a dimension table. An outrigger is not used to normalize a dimension.
Relational OLAP (ROLAP)
“Relational” OLAP, in which the OLAP processes use a relational, normalized model for its source.
Slowly Changing Dimension (SCD)
The tendency for dimension attributes to change gradually or occasionally over time. The techniques for handling these changes include Type 1 (overwrite), Type 2 (keep history), and Type 3 (alternate realities).
Snowflake
A normalized dimension where a flat, single dimension table is deconstructed into a tree structure with potentially many nesting levels. Snowflaking a dimension generally compromises user understandability and browsing performance.
Snowflaking
The (undesirable) act of normalizing a dimensional model.
Star Schema
A generic representation of a dimensional model in a relational database in which a fact table with a composite key is joined to a number of single level dimension tables, each with a single primary key.

Tags: , , ,

1 Comment

Business Processes and the Integrated Enterprise

It’s time to think about business processes.

In a recent post, I defined a business process as “the complete response that a business makes to an event”. Because this is such an important topic for data warehousing, I thought I’d share some additional thoughts.

IntegrationBusiness processes include such activities as accounts receivable, orders, sales, and inventory management. Each process has a specific event (or goal) that defines the process and in many cases allows us to gauge the health of that process. For example, an order is an event within the orders business process. Inventory movement is an event within the inventory management process. And so on.

For a few years now, there has been a significant push — mainly by service oriented (SOA) and data warehousing architects — to get businesses to think more about business processes and not about departments, applications, and technologies. Traditionally, most organizations have structured IT around specific software purchases and departmental needs. Integrating these disparate systems later becomes a significant challenge for business intelligence, performance management, and master data initiatives.

James Gibson, in his research piece “A Research Strategy for Investigating Business Process Management Approaches”, wrote that it’s time to start thinking about process and process processing rather than data and data processing (I had to read that more than once too!). The key is that the business process — which is tied to a specific event — is a driver that can lead all other initiatives along. Actionable insights (typically what you hope to derive from your Business Intelligence and Performance Management initiatives) are only useful if they’re tied to a process that can be improved.

Thinking more about business processes, and developing architectures to support them, leads to a more integrated enterprise

Data Warehousing with dimensional modeling is solely focused on the business process. In fact, you cannot develop a true dimensional model without modeling it around some business event. And it should be clear that a single business event can span multiple source systems and departments. The dimensional model pulls all this together.

On the transactional and operational side of the fence, SOA is the right approach to take. Essentially, SOA provides a standard way to access myriad resources across a network through RPC, Web Services, and APIs (among other techniques). One application can communicate with another in real-time.

Developing an SOA and a Data Warehouse one-process-at-a-time is smart. I will talk more about this in a future posting, but the idea is simple: start with a single business process that will make the most impact and is most feasible. Then, in an iterative way, expand into additional processes. This allows development to quickly turn over key functionality while leaving room to resolve business process volatility issues and political ramblings. If you are lucky enough to be starting both data warehousing and SOA programs simultaneously, it makes most sense for the same business process to be the subject for both!

Master Data Management is about data governance and forms a core part of the integrated enterprise. Through SOA, applications can access master copies of shared entities, such as Customer and Product. Master data might be derived partly from a data warehouse using ETL and partly by operational applications in a transactional environment through SOA. When it is time to embark on an MDM initiative, it makes a lot of sense to start thinking about business processes, conformed dimensions, and how to maintain this critical data.

So imagine for a moment an enterprise with dozens of departments all using different tools and software solutions to manage their day-to-day operations. Through SOA, these applications can all talk to each other so that when a customer checks on an order, the clerk can also see who took the order, where the product currently is in transit, the customer’s order history and much more. The data is not integrated, but the processes are. At the end of the day, when the regional salespeople need their numbers, the data warehouse — which has integrated data arranged around various business events — provides the results quickly giving all subscribers a complete and integrated view of all relevant business processes.

Adopt architectures that facilitate business’ natural orientation towards the business process. Business Intelligence, Performance Management, Business Process Reengineering, and Master Data Management initiatives will benefit tremendously. I’ve been saddled by the department-oriented mentality by business for too long. Better IT/Business alignment in this area will create more opportunities for defining clear business processes which in turn will lead to a better integrated organization.

Tags: , ,

1 Comment

News from the Kimball Group

As you might know, I recently attended the Data Warehouse Lifecycle in Depth course from Kimball University. I blogged about each day of the event and also got an opportunity to speak with Margy and Warren at various times and about various topics. One of which was KU Certification.

“Now, if we can just convince them to create a certification exam…” is what I wrote, somewhat jokingly (because I knew full-well what Warren and Margy thought about a KU certification program).

Well, here is the official response, as written in the most recent design tip newsletter, “Kimball Design Tip #102 Server Configuration Considerations”:

We’ve received several inquiries about Kimball certification. After much consideration, we’ve concluded that it’s not meaningful to bestow certification by charging you to take a multiple choice exam. We do believe quality, in-depth education consistent with our proven methodology is critical to your professional development. Rather than embarking on a certification program, we’re giving Kimball University alumni an alternative opportunity to publicize their completion of our courses.

Which of course leads into my post regarding KU’s new LinkedIN Group!

We’ve launched a Kimball University Alumni group on LinkedIn so you can let everyone in your network and other interested parties know that you’ve attended a KU course. We’re not promoting LinkedIn, but it’s a well-accepted networking tool for industry professionals. Joining our alumni group allows you to promote your alumni status on your profile and connect with other alums (and perhaps potential future employers) in your area.

The Kimball University Alumni group is limited to students who have attended a full length 2- to 4-day Kimball University onsite or public course. Unfortunately, 1-day vendor seminars and other industry events do not qualify. Follow this link to join the Kimball University Alumni group.

So there you have it! If you’ve had the pleasure of attending at least one KU course, then go to LinkedIN and sign up!

Tags: , ,

2 Comments