Posts Tagged DW

Chaos Theory and the Data Warehouse

Have you ever considered the Data Warehouse as a chaotic system? The work of the Data Warehouse team is never complete: new requirements trickle in every day, and user feedback gets more and more sophisticated as time passes. Chaos Theory can help explain this, and in the end, offer us some insight into how we can better plan Data Warehouse development, deployment, and maintenance.

butterfly effect 150x150 Chaos Theory and the Data WarehouseThe Data Warehouse is a process which forms the center of an information supply supply chain, with several inputs and several outputs. Each input and each output is subject to change based on factors such as vendor upgrades, new interfaces, expanded interfaces, and perhaps most importantly end-user (client) evolution. All of these changes happen continuously. As people use the Data Warehouse, they become more inquisitive. They want their output and analysis rolled up or down in different ways. Predicting (i.e. planning) for Data Warehouse change can be as difficult as predicting (and therefore planning for) the weather. This environment of ever-changing needs fits neatly into the confines of Chaos Theory. But what is chaos in this context? What is Chaos Theory exactly?

From the book “Chaos Theory Tamed”, author Garnett P. Williams writes:

Chaos is sustained and disorderly-looking long-term evolution that satisfies certain mathematical criteria and that occurs in a deterministic non-linear system. Chaos theory is the principles and mathematical operations underlining chaos. (pg 9)

Meteorologist Edward Lorenz in the 1960s determined that even the tiniest differences in an initial measurement can have a huge impact on an outcome. In other words, as his butterfly effect posits, a butterfly flapping its wings in Africa can affect weather patterns in North America. Weather is a system which has a highly sensitive dependence on its initial inputs.

The foundation of the Data Warehouse is only as stable as how you control for the tiniest changes to the inputs into the information structure. As weather, it too has a highly sensitive dependence on inputs. One tiny change to a source system can have almost catastrophic effects on the Data Warehouse.

Finding Order

However, despite the chaos, we should be able to find some order. This is what Lorenz and scientists after him tried to do. The first step in this process is understanding that even seemingly random changes are not always as random as they seem. If we can understand that changes to our Data Warehouse are not random, then we can build a better Data Warehouse.

There are a few things you can do to tame the chaos:

  • Be consistent and systematic. The more predictable you and your Data Warehouse team are, the easier it will be handle change. In other words, control any and all variables that you can.
  • Adopt proven analysis and development methodologies that others have had success with. This is not to say that some level of adaptation to your environment, team skills, and situation are not required, but rather, start off with a good foundation and follow along where it makes sense.
  • Keep the team close. Quality and frequent interaction among the people who make and run the DWH is essential.
  • Stay in the groove like an improvisational jazz band. If your data modelers are not in tune with your decision-support analysts who are not in tune with your DBA, then you can’t expect to handle the challenges of chaos.
  • Feedback and evolution are two very important aspects of Data Warehousing. Keep your ear to the wall and try to anticipate changes before they occur. This takes practice, but (back to the improvisational jazz band analogy) practice makes perfect.
  • Keep in step. In the Data Warehouse world, change is natural and will come in waves. More significantly, if changes cannot be implemented quickly, your clients will lose confidence in your ability to keep up.
  • Think and act quickly. The longer you debate, the longer your client must wait. While they wait, they construct workarounds or look elsewhere. If you’re lucky and they do wait for you, their change may become outdated and no longer relevant; an opportunity might have been missed (and you’ve essentially failed them).
  • Don’t be afraid to be wrong. The consequence of acting quickly is that you might get something wrong. Just be agile enough to respond and deliver new change with urgency.

I’ll post more thoughts on this over the next weeks. I’m particularly interested in how users of the Data Warehouse become more and more sophisticated as they use its tools and applications.

Tags: , , , , , ,

No Comments

The Client or the Customer?

The terms client and customer are usually used interchangeably. After all, the dictionary basically says they’re the same thing. But why do I feel more warm and fuzzy using the term client to describe my relationship with my, um, clients?

IT/Business Alignment
This is cliché-ville, but I’m going there. Thinking of the business part of your organization as your client, rather than as your customer, might actually help you better align your objectives and goals with them. There’s a difference with the terms and it all comes down to that warm and fuzzy feeling I mentioned a few seconds ago (more tangibly, I’m talking about how the relationship is perceived by both sides).

As technologists, we’re not often known for our people skills (come on, admit it). We like zeros and ones, placing things in nice neat boxes, and Gantting. While I might be generalizing — ok, stereotyping — I think you know what I mean. Essentially, we need all the help we can get. So here’s my take:

A client is someone who uses the professional advice or services of another, while a customer is a person who purchases goods or services from another. The difference might be subtle, but imagine that your relationship is one built on partnership. In this fantasy, the “customer” is not always right (still in cliché-ville) and they know it. They have come to trust your guidance and advice, and turn to you as the expert. They are your client, and you are providing them a valuable service.

How do you see the relationship with each of your business lines? Do you position yourselves as information partners (clients) or as information servants (customer)?

Tags: , , ,

5 Comments

Scoping Data Warehouse Initiatives

focus Scoping Data Warehouse InitiativesData warehousing is a complex operation. From start to finish (if there is a finish), project teams are faced with many challenges. In all phases of the lifecycle, there are opportunities for derailment. The best way to mitigate potential issues and stay on time and within budget is to carefully define and manage scope. Managing scope can be an ongoing struggle (especially if requirements are not clearly defined or justified). While this is really a PM101-type of topic, I feel there are some fine points in a DW/BI environment that are not mentioned enough.

Consider the following:

Programs verses projects

I won’t get into a deep PM discussion here, but it is important to point out that data warehousing (or business intelligence, master data management, etc.) initiatives should be thought of as programs and not projects. This mindset will help in scoping.

A program (which might also be called a “project portfolio” in some circles) is basically just a set of related projects. With a program, the emphasis is on organizing, prioritizing, and allocating resources to the right projects. Program scope is more strategic, and answers long-term questions about what type of value the organization hopes to achieve from the initiative.

A project, on the other hand, is much more specific — with a set number of deliverables and goals that have a high immediate impact. The scope at the project level is therefore more tactical in nature: high impact, fast delivery. Be aware that some projects may never be given the green light (for example, if there is a low business impact or if there is a low feasibility rating because of data source or data quality complications).

What I find odd is that organizations still choose to tackle immense data warehousing initiatives in one or two shots, trying to deliver everything at once over a period of 18 or more months. This is the wrong approach (here’s why). Break this large initiative into individual projects and try to deliver functionality every 6 to 8 weeks.

The business process

The best way to break down data warehousing programs into high-impact projects is along business process lines. A business process, as defined here, is:

The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.

Some example of the above: inventory tracking, Internet sales, retail sales, marketing, tax assessment, tax collection, pitching, batting.

In any data warehousing environment, you can expect to have several business processes to model. Each business process you tackle will have elements touching upon different aspects of the data warehouse, including infrastructure, middleware, data modeling, ETL, business logic development, presentation elements, and so on. If you scope each project to the business process, you can deliver complete solutions in the shortest amount of time. (It should be obvious that the very first business process you implement will take the longest, as the team works out the core infrastructure. Most of this infrastructure will be reused by other business processes.)

Avoid scoping to a data source

Do not fall into the trap of scoping to a data source. Scoping to a data source is almost guaranteed to deliver mediocre outcomes. These projects typically include many unfinished or inadequate business processes all delivered at once some time in the distant future and long after the excitement over the initiative has subsided.

While it is true that only one or two data sources might exist in some organizations, it is not true that inventory, customers, sales, procurement, shipping, and other business processes need to be taken on at once. Create a single project for each business process, prioritize based on impact and feasibility, and then badabing badaboom, you deliver. Next.

Along the same lines, do not adjust your scope if the data source is unavailable, uncooperative, or lacking in quality. Instead, bring the fight to the data source (here is where a good, preferable C-Leveled, business sponsor can come in handy) and set things right. This is obviously a project risk, and also an organizational risk. If you are having problems extracting inventory data then maybe its time to put down your data warehousing gloves and get a new inventory system.

Last thoughts

Scoping the data warehouse is a difficult problem. Troubles start early on with the initial idea, it moves on through requirement gathering, and finally into the development phase of the lifecycle. There is not a lot of good advice in this area for data warehousing (if you happen to know of a good source, please send me a link or title). But I do find that if you work towards business processes, think in terms of programs and projects, and avoid the data source trap, scoping decisions will settle into the real needs of the business.

Tags: , , , , , , , , ,

1 Comment

Avoid Data Dead Ends and Information Loss

Black Hole Avoid Data Dead Ends and Information LossWhen analyzing data to make a decision, the last thing you want to encounter is a data dead end. You may be digging into some figures only to find that the data you have access to has been aggregated, combined, filtered, interpreted, or otherwise changed (in an unauthorized way) from its original source. And as an analyst, the last thing that you want to discover is that your ETL processes are solely responsible.

In Business Intelligence and decision-support instances, especially reports and dashboards, data alterations are common. Aggregates, summaries, snapshots, and the like are normal and necessary for a bird’s eye view of whatever business process is being examined. But in order to avoid information loss, be certain that the underlying data is intact at the most atomic and granular level. And also be sure analysts can get at this data (no black boxes allowed). You don’t want this information to be tossed into a black hole never to be seen or heard from again.

Atomic and Granular

I like to distinguish atomicity from granularity in the following way: Atomicity refers to non-additive and descriptive elements, usually stored as dimensions or non-additive facts, while granularity refers to measurement data usually stored as facts in a business process dimensional model. You could interchange these definitions under certain circumstances, but I like to draw the line so it is clear what I’m talking about.

Atomicity

Atomic data elements will give you the ability to conduct deeper research. By atomic, I mean that the data element has an exact meaning and does not represent some concatenated value or total. The sum of the parts have greater meaning than their whole, and in the end, allow analysts to cut analysis across different dimensions at a very minute scale.

  • A phone number is better split into country code, area code, and subscriber number
  • A street address into street number, name, type, and direction
  • A person’s name into surname and given name
  • A parcel ID into plat, lot, and map
  • An industry classification into groups and subgroups
  • A date into year, quarter, month, week of year, day, and day of week
  • Et cetera!

Granularity

With granularity, you define the level of detail in a measurement. The more granular, the greater the detail. For a trip to the market, you can define the granularity of your shopping excursion on the item level (each item in the basket), by product (grouping similar items), or perhaps by the entire basket as a whole. The choice is yours. Of course, storing the price of each item is the most granular and will give you the greatest flexibility in your analysis. You can then build your aggregates (by product, entire basket, etc.) from the most granular metrics.

If you decide to load data at larger grains, you are losing information and creating dead ends for your decision-makers. It pays to load data at the finest grain possible.

From here…

Integrating data into the data warehouse at an atomic and granular level gets you pretty far. You are likely already doing this (especially if you are familiar with transaction grain fact tables). But there are other ways you can lose data, and therefore information. In a follow-up to this post, I’ll discuss how evaluations and logic gates can also be a source of information loss.

I’d like to know your thoughts on this subject. Have I missed anything important, or have I marked something important that you feel is inconsequential?

Tags: , , , , ,

No Comments

ETL Subsystem 31: Paralleling and Pipelining

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

intro 150x150 ETL Subsystem 31: Paralleling and PipeliningOf all the subsystems that I’ve discussed so far, this one resulted in the most research. I had to (a) learn more about how paralleling works and (b) experiment with my environment to better understand it.

Honestly, I’ve taken this subsystem for granted over the years. And for VFP, I’ve done little exploration in this arena. For SSIS, I have tended to adjust the settings I can adjust (more on this below), monitor the results, and tweak my performance as needed. In some environments, this lackluster approach will get you by just fine. If you have very small load windows and performance is critical, then you’ll need to make a better effort.

So what is Paralleling and Pipelining?

Lumped together into a single subsystem, these two performance means are related but different. They’re cousins, I suppose. Running your ETL processes in parallel means that your ETL system is carrying out multiple operations simultaneously. Pipelining your ETL processes means that you can start new operations before the previous ones complete.

Paralleling and Pipelining are quite desirable. And depending on your tool of choice, taking advantage of them can be painless or painful.

How can you apply them?

You can achieve parallel processing by utilizing the CPUs on a single machine, or you can utilize multiple machines. The first option is the easiest to set up and results can be quite good. For a multiple CPU (or core) setup, you are actually running code (programs, algorithms) simultaneously, potentially doubling performance, all on the same box. You can scale out (i.e. scale horizontally) your ETL processes by adding computers (see What is distributed computing? by Kirk Pearson), allowing you to take advantage of the CPUs, RAM, and I/O of each machine. The latter has some significant design implications, but well worth it if your environment needs it.

assembly line women.thumbnail ETL Subsystem 31: Paralleling and PipeliningPipelining increases throughput. Unlike parallelism, it will not allow the instruction to run faster, but rather it permits downstream processes to start before the upstream process finishes. A great analogy is an assembly line, where parts are added to the whole as it travels down the line.

Getting parallelism and pipelining to work together is the Holy Grail of ETL performance. While certain performance techniques are available at all phases of data integration (from Extraction and CDC, to surrogate key handling and using partitions for fast loading), none can compare to the gains you can realize with this subsystem.

You should also keep in mind that CPU multitasking is different from parallel processing, and multithreading is different from pipelining. A multitasking process shares CPU resources, giving the illusion of parallelism (although one man’s illusion can be another man’s reality). Multithreaded applications share the same memory, but operate on different engine threads (i.e. a subtask). Multitasking and multithreading, like pipelining, increase throughput and also play an important role in performance tuning. I’ll talk a little more about this below in my section about FoxPro. Otherwise, if anyone is interested, I can try to elaborate in another post.

Where can this best be utilized?

Here are some ideas on where you can focus your efforts:

  • When loading historical data or retrieving data from multiple similar sources, execute the same package for different date ranges, at the same time (in SSIS for example, use multiple Execute Package Tasks or run the same package multiple times together as Jamie Thomson explores here), or you could design your historical load packages to break apart the data into separate threads.
  • Spread out UPDATE statements. This can be real handy if you have a few accumulating snapshot fact tables.
  • Spread out complex algorithms and routines that can operate on a subset of data.
  • Load staging tables while downstream processing loads your dimensional model
  • Do lookups (especially surrogate key lookups) in parallel
  • Distribute your conformed dimensions to other machines, data marts, etc. in parallel

 

SQL Server 2005 Integration Services (SSIS)

As you know, I use SSIS and VFP for ETL (not at the same time or on the same project though). With SSIS, I can quickly create complex routines that can automatically take advantage of multiple processors. The native support for buffers, execution trees, and parallelism makes my job pretty easy (which is why I suppose I’ve taken this subsystem for granted over the years). Simply understand how SSIS works, adjust the settings you need to adjust, monitor your performance, and tweak as needed.

To get a grip, the following resources are invaluable:

 

Hand Coding with Visual FoxPro (VFP9)

While SSIS and SQL Server have built-in mechanisms to manage most of the paralleling and pipelining responsibilities for you, FoxPro does not. You can achieve some very good results using VFP and multithreading, but you have to be extremely creative in how you handle paralleling and pipelining. If you don’t think this is the case, I’d love to hear how paralleling and pipelining can be achieved with VFP!!!

Of course, the VFP community is — and has always been — quite creative. As with most of this sort of thing, Calvin Hsia is near the front of the line. MTmyVFP (True VFP multi-threading) on CodePlex is a creative example using Hsia’s Multithreading class. For more information and a ton of details, check out:

As I’ve stated before, multithreading is not parallelism, nor is it pipelining. But if you utilize MTmyVFP (or similar solution) in your VFP ETL system, you will realize many performance benefits. Lastly, there was a pretty interesting, albiet short, discussion on this issue here.

From here

This post might have come off a bit long-winded, but there were quite a few important points to make. I hope that I’ve been able to distill what I’ve learned and that in the end, it all makes some sense. In my next ETL post, I’ll talk about ETL Subsystem 32: Security.

Tags: , , , , , , , , , ,

No Comments

Maintaining a DW/BI Environment

In posts Version Control and Version Migration I glossed over one of the more complex and challenging aspects of data warehousing: Once the DW/BI environment is in a production environment, how do you maintain and update it?

Some Thoughts

While I do not have a blueprint for you, I do have a few thoughts on the subject.

First, you must consider the Data Warehouse as a living and breathing organism. Not only will it be growing in size as your carefully constructed ETL packages churn, but it will also likely be growing in scope and importance (if not, then you may have to re-think your DW/BI marketing approach and/or find a new sponsor).

Second, you have to realize from the beginning that deploying a data warehouse is both an iterative and incremental process. Iterative in that you will build and rebuild as you get deeper into the project; incremental in that different parts of the warehouse will be constructed and delivered at varying rates. This is in direct contrast to the normal waterfall approach to releasing software applications and systems. It is not realistic, practical, or advised to attempt to deliver a DW/BI project in one shot. You may as well use your bullet for something else!

Passing the Baton

track baton pic.thumbnail Maintaining a DW/BI Environment Your DW/BI team must plan up front for the often complex handoffs between development and maintenance amid the ever-turning wheel of the DW Lifecycle (detailed here).

Those handoffs are critical. If you’re in a large organization (like me now), then you will be literally handing off the maintenance of the project to a complete different group. This group will need documentation and escalation procedures to monitor and respond to various exceptions that may occur in the production environment. While you can’t plan for every exception, you should prepare enough documentation so that all the basics are covered (for example, what should the maintenance team do if a SQL Agent Job reports a failure?). While some things will fall back on the Dev team, many of the maintenance tasks can be handled by a well trained support group.

If you’re in a small organization (like I used to be in), then you are the maintenance team. You have the added responsibility of maintaining a high-priority production system while you continue to build new pieces for the next release. This can be exhausting and stressful so be sure you automate as much of the maintenance part as you can! And be sure to allocate enough “business as usual” time in your schedule each week. Make sure your manager and/or project leader is cognizant of this added pressure and responsibility.

Back to the Start

Once you’ve passed the baton, you can go back to the beginning of the DW Lifecycle and start over. You may improve some processes (iterative development) while adding on new functionality (incremental development).

data flowchart.thumbnail Maintaining a DW/BI Environment For example, in your team’s first run through the lifecycle, you implemented a single business process dimensional model (retail sales) that allowed you to also produce a series of reports for your sales department. During this first run through, you accomplished quite a bit: You installed and learned how to use the toolsets, you built the dimensional model and ETL architecture, you designed a rudimentary web portal, you created a few useful reports, and you released all the pieces to your users. So far so good.

Now, this first version is out and the business users have some feedback. An issue log and wish list is compiled (perhaps a few new reports, an updated report template, and some user-defined filters for the portal). In addition, you want to also include a new business process, inventory, so you can expand the usage of the date warehouse in your business.

You and your team get back to work. The modeler begins to construct the new dimensional model while the ETL team works on the integration packages. This process takes the longest, and may be interrupted occasionally by the production team reporting some data quality issues or some failed packages.

In a separate thread, your BI application developers are busy with implementing many of the suggestions brought forward by the sales department. When finished, they release version 1.1 of the portal. Immediately, they begin working on some new inventory reports.

This scenario repeats with each cycle for each thread lasting anywhere from 3 weeks to 3 months.

Rosy Picture?

The above scenario really paints a rosy picture of the process. In reality, it doesn’t always work smoothly. But it does work if managed correctly. And it can be quite exciting when each iteration completes and the handover goes according to plan.

I would like to talk more about this subject in future posts. For now I hope my thoughts on the matter have left you with some insight!

 

Tags: , , , , ,

1 Comment