Posts Tagged Kimball

Scoping Data Warehouse Initiatives

focus Scoping Data Warehouse InitiativesData warehousing is a complex operation. From start to finish (if there is a finish), project teams are faced with many challenges. In all phases of the lifecycle, there are opportunities for derailment. The best way to mitigate potential issues and stay on time and within budget is to carefully define and manage scope. Managing scope can be an ongoing struggle (especially if requirements are not clearly defined or justified). While this is really a PM101-type of topic, I feel there are some fine points in a DW/BI environment that are not mentioned enough.

Consider the following:

Programs verses projects

I won’t get into a deep PM discussion here, but it is important to point out that data warehousing (or business intelligence, master data management, etc.) initiatives should be thought of as programs and not projects. This mindset will help in scoping.

A program (which might also be called a “project portfolio” in some circles) is basically just a set of related projects. With a program, the emphasis is on organizing, prioritizing, and allocating resources to the right projects. Program scope is more strategic, and answers long-term questions about what type of value the organization hopes to achieve from the initiative.

A project, on the other hand, is much more specific — with a set number of deliverables and goals that have a high immediate impact. The scope at the project level is therefore more tactical in nature: high impact, fast delivery. Be aware that some projects may never be given the green light (for example, if there is a low business impact or if there is a low feasibility rating because of data source or data quality complications).

What I find odd is that organizations still choose to tackle immense data warehousing initiatives in one or two shots, trying to deliver everything at once over a period of 18 or more months. This is the wrong approach (here’s why). Break this large initiative into individual projects and try to deliver functionality every 6 to 8 weeks.

The business process

The best way to break down data warehousing programs into high-impact projects is along business process lines. A business process, as defined here, is:

The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.

Some example of the above: inventory tracking, Internet sales, retail sales, marketing, tax assessment, tax collection, pitching, batting.

In any data warehousing environment, you can expect to have several business processes to model. Each business process you tackle will have elements touching upon different aspects of the data warehouse, including infrastructure, middleware, data modeling, ETL, business logic development, presentation elements, and so on. If you scope each project to the business process, you can deliver complete solutions in the shortest amount of time. (It should be obvious that the very first business process you implement will take the longest, as the team works out the core infrastructure. Most of this infrastructure will be reused by other business processes.)

Avoid scoping to a data source

Do not fall into the trap of scoping to a data source. Scoping to a data source is almost guaranteed to deliver mediocre outcomes. These projects typically include many unfinished or inadequate business processes all delivered at once some time in the distant future and long after the excitement over the initiative has subsided.

While it is true that only one or two data sources might exist in some organizations, it is not true that inventory, customers, sales, procurement, shipping, and other business processes need to be taken on at once. Create a single project for each business process, prioritize based on impact and feasibility, and then badabing badaboom, you deliver. Next.

Along the same lines, do not adjust your scope if the data source is unavailable, uncooperative, or lacking in quality. Instead, bring the fight to the data source (here is where a good, preferable C-Leveled, business sponsor can come in handy) and set things right. This is obviously a project risk, and also an organizational risk. If you are having problems extracting inventory data then maybe its time to put down your data warehousing gloves and get a new inventory system.

Last thoughts

Scoping the data warehouse is a difficult problem. Troubles start early on with the initial idea, it moves on through requirement gathering, and finally into the development phase of the lifecycle. There is not a lot of good advice in this area for data warehousing (if you happen to know of a good source, please send me a link or title). But I do find that if you work towards business processes, think in terms of programs and projects, and avoid the data source trap, scoping decisions will settle into the real needs of the business.

Tags: , , , , , , , , ,

1 Comment

News from the Kimball Group

As you might know, I recently attended the Data Warehouse Lifecycle in Depth course from Kimball University. I blogged about each day of the event and also got an opportunity to speak with Margy and Warren at various times and about various topics. One of which was KU Certification.

“Now, if we can just convince them to create a certification exam…” is what I wrote, somewhat jokingly (because I knew full-well what Warren and Margy thought about a KU certification program).

Well, here is the official response, as written in the most recent design tip newsletter, “Kimball Design Tip #102 Server Configuration Considerations”:

We’ve received several inquiries about Kimball certification. After much consideration, we’ve concluded that it’s not meaningful to bestow certification by charging you to take a multiple choice exam. We do believe quality, in-depth education consistent with our proven methodology is critical to your professional development. Rather than embarking on a certification program, we’re giving Kimball University alumni an alternative opportunity to publicize their completion of our courses.

Which of course leads into my post regarding KU’s new LinkedIN Group!

We’ve launched a Kimball University Alumni group on LinkedIn so you can let everyone in your network and other interested parties know that you’ve attended a KU course. We’re not promoting LinkedIn, but it’s a well-accepted networking tool for industry professionals. Joining our alumni group allows you to promote your alumni status on your profile and connect with other alums (and perhaps potential future employers) in your area.

The Kimball University Alumni group is limited to students who have attended a full length 2- to 4-day Kimball University onsite or public course. Unfortunately, 1-day vendor seminars and other industry events do not qualify. Follow this link to join the Kimball University Alumni group.

So there you have it! If you’ve had the pleasure of attending at least one KU course, then go to LinkedIN and sign up!

Tags: , ,

2 Comments

Mums the word: KU Alumni Group in the works for LinkedIn

It turns out that the Kimball Group is working on a new Kimball University Alumni group for LinkedIn. This is pretty cool because as far as I know, there is no real good way to get in touch with Kimball University “graduates”. The plan is to announce the group officially at some point in the near future (I believe on the next design tip).

According to my “source” (can you tell I’m really trying to sound like a reporter with some really important scoop?), they even have a logo ready to go. As some of you can attest to, the logo can be the hardest thing to settle on!

ETL Subsystems

I’ll be getting back on track with my dissection of the 34 Subsystems of ETL (in which I give my thoughts on each subsystem through SSIS and Visual FoxPro). I took a few weeks off while in Amsterdam, but should be ready to start again. Next on the list is ETL Subsystem 11: Hierarchy Manager. I expect to have it ready by early next week.

LinkedIn

By the way, if you’re on LinkedIn, and you read this blog, feel free to invite me into your network. You likely are a FoxPro user which means I’m going to like you, or you are involved in some way with data warehousing, which means I’m going to like you! Just let me know you’re a reader!

Tags: , , ,

2 Comments

Live from Kimball University: Day 4 (ETL, Addresses, Applications, BI Toolkit?)

Ok, so I’m a day late! I had to catch a train back to Brussels, and considering I hadn’t seen my family since Monday morning, I didn’t spend last night blogging! Besides, the four days were fairly taxing. Whether you think you know everything about dimensional modeling or not, the material was fairly dense and Warren and Margy both moved fast.

If you are serious about data warehousing, and call yourself a data warehousing professional (especially if you are a “dimensional modeler”), then attending this and other sessions is a must. I learned more Warren and Margy directly than I could have from their books. And hearing about their various experiences added an extra dimension (no pun) to the education.

Now, if we can just convince them to create a certification exam

ETL System

The entire morning was focused on ETL Design and Development. The first step in designing the ETL processes is to meet a basic set of requirements. These would include having all business requirements fleshed out; a “solid” dimensional model ready to go; data quality, archiving, lineage, and management infrastructures defined; compliance, latency, and security issues addressed; and finally a good understanding of all application requirements. With these in place, it is possible to create the ETL System.

Basically, the ETL system is in charge of converting source system data to the data warehouse database. Don’t get hung up on the acronym though. Depending on the circumstances, the process may feel more like ELT, or ELTL, or E&T, more T, some L, and more T.

Warren recommends using an ETL vendor tool mainly because the tools handle technical metadata requirements and likely have a lot of other very helpful functionality built in (logging, standard tasks, connection managers, dimension management, surrogate key lookup capability, in-memory pipelining, etc..).

As many of you know, I am an advocate of hand-coding ETL, especially using Visual FoxPro. VFP makes it easy to build in all the things that other ETL tools do with little effort. In addition, VFP is data centric and object-oriented, and if the ETL processes are designed correctly can be totally metadata driven. With that said, ETL tools are getting better and better. If you have a couple of developers in house that are experienced with FoxPro or another data-centric language, then by all means go for it. If not, then using an ETL tool is likely your best bet.

Regardless if you hand-code or buy, Warren defined the following steps that are needed to develop the ETL system:

  1. Create a high-level plan for the target model
  2. Install and learn the ETL tool!
  3. Determine your default strategies
  4. Drill down by target table
  5. Develop the historical load for dimensions
  6. Populate historical facts
  7. Dimension table incremental processing
  8. Fact table incremental processing
  9. Aggregate and OLAP loads
  10. ETL operation and automation

Nothing earth-shattering, but I have seen first-hand how ETL systems can stall and flutter along because one or more steps were ignored or done out of order. The Kimball Group has spent a long time together developing their best-practice and practical techniques so that we can all benefit. This list is just another example; it’s a simple pattern to follow that will likely lead to greater successes.

Name and Address Fun

As part of our ETL discussion, we spoke of data cleansing and conforming. One major challenge that data integrators have is dealing with very dirty data from multiple source systems. One good example is the challenges associated with name and address matching.

Warren showed us a series of slides with scanned address labels delivered (miraculously in some cases if you saw the slides) to his house. His name was butchered in every which way. Some had his name as Thorn. Another had him as a woman. It goes to show that data quality and name and address cleansing is very hard to do. When you consider that some companies spend millions on their mailings, this is a big deal: And they can’t even get it right!

BI Applications

After lunch, we dove right into a discussion on Business Intelligence (BI) Applications. One thing Warren said really stuck out to me, and it is something I wrote in big letters on my copy of the presentation slides: BRANDING.

BI Applications are in fact a way for us — as DW/BI professionals — to brand our services. We create our image through the BI Application layer. Despite all the background and expertise we need to do all other phases of the LifeCycle, the executives, managers, and users will see us through our applications.

I admit, I had an “ah-ha” moment when this came up. Everything from Portal design to reporting templates to our logo (yes, we need a logo too!) is equally as important as cleansing and conforming data, designing a perfect data model, and conducting hours upon ours of requirement-gathering interviews.

I only wish that we had more time to spend on this topic. An entire day would have been great.

BI Toolkit

We need a BI Toolkit. A vendor-neutral, general discussion on what it means to deliver BI Applications to our users. I talked with both Margy and Warren about this and both seemed to agree (as does their publisher). It’s been in their sights for some time now, I suppose. While the other toolkit books have been an invaluable reference for the entire community — this book could literally change the way data warehousing teams and businesses consider BI, BI Applications, and the various possibilities that BI offers.

Personally, I would like this book to take more time delving into user requirements. I don’t want it to tell me why I need one p-value over another, nor do I want it to tell me to use Business Objects because it can support feature X or Y. I need a reference that reminds me that developing a scorecard requires multiple business processes, that building a reporting template to pass around to business users can be a great tool to flesh out design decisions, and that perhaps a “BI Matrix” needs to be developed so that each BI Application can be resolved against each underlying business process dimensional model. Ideally, it would be a book that a business sponsor can read and digest, while providing the Data Warehousing team with enough technical knowledge to appropriately evaluate tools, build strategies, and formulate roll out plans that give the user the biggest and quickest return on their investments.

Lastly…

It’s a bummer that Warren had to race through the deployment and maintenance materials. We really did have a good class though, and a lot of great questions were asked. You can tell that most people in the room really “get it”, and their questions all reflect that. Of course, this led to a few diversions and a tangent or two. We paid for that in the last section. Thankfully, I have a brand new, 2cnd Edition Lifecylce Toolkit book to refer to!

This was a great experience for me, and I met a lot of great people. If you haven’t been to one of these courses, I recommend it no matter the cost.

If you have questions or comments for me, don’t be shy. I will be writing more and more about these topics in future postings, and I’ll also be picking up where I left off with my postings on the ETL Subsytems 1 through 34. So if there is something you would like me to address first, let me know!

Tags: , , , , , ,

8 Comments

Live from Kimball University: Day 2 (SCDs, the Mini, Modeling Process)

I am thoroughly exhausted! Margy followed up day one with another day packed with information.

We went through the various types of Slowly Changing Dimensions (SCD) and even talked about Type 6 SCDs, which I’ll get to in a bit. We then rolled right into mini dimensions and other related concepts. After that, we talked about the modeling process, different types of fact tables, and then finally how to deal with a mature data warehouse.

Slowly Changing Dimensions

Although I’ve posted a few entries (1 2) on the SCD Types, I’ll go over them once again here, but hopefully in a slightly new light. We talked specifically about:

Type 1: Overwrite the Attribute
In short, when an attribute’s value changes, we overwrite the old value with the new value. This technique does not preserve history. While it might be easy to implement, its use must be fully justified! In most every case, we will want to keep some sort of history in our dimensions.
Type 2: Preserve History / Add a New Record
The most common type, this technique preserves the old value by inserting a copy of the current row with the updated attribute values into the Dimension. The old row is marked as archived (typically by using active and inactive dates and a status indicator).
Type 3: Alternate Reality / Add a New Attribute
New attributes are added to the dimension to represent the old and new value. In some cases this is a handy technique because you can preserve the old value along side the new value. This technique is best used when there is a massive, dimension-wide change that affects many rows. For example, if a voter’s representative district changes during redistricting, we might want to see what election results would have been if the old districts were used in the analysis.
Type 6: Advanced Hybrid
This is a hybrid approach that combines the other 3 types (1 + 2 + 3 = 6). According to Margy: “We’re creating new rows to capture change (Type 2), adding attributes to reflect an alternative view of the world (Type 3), which are overwritten for all earlier dimension rows for a given product (Type 1).” Read more about this here and be sure to check out this article by Margy and Ralph for more information. I hope to have some examples posted in the coming weeks as well.

We didn’t talk about inferred dimensions (although Margy did mention the late arriving kind). I suppose the idea of an inferred dimension (or Type 1.5) might be discussed with Warren in the coming days.

The Mini

After all that SCD talk, we talked about dealing with very large dimensions. Especially those that have Type 2 attributes. Margy introduced the concept of the Mini-Dimension, and stated that for extremely large dimensions, using Type 2 may not be a good idea for attributes that might change often. For example, suppose you wanted to store an airline passenger’s ‘age’ in your Passenger dimension. If the passenger flies 7 times over an 8 year period, you could have as many as 7 rows reflecting that change.

Instead, create a mini-dimension with attributes that change frequently and link them to the Fact table using a separate foreign key. It is hard to explain this in any detail without a picture. I promise once I get back from Amsterdam I will prepare a better example for this (with pictures!).

The Modeling Process

I was happy to see a segment dedicated to the overall modeling process. When it first came up, I felt it was out of place (perhaps it could have been discussed before we dove into sketching our case study models?), but in the end, it seemed to fit the flow.

The modeling process includes the following steps: Preparation, a high level design session, detailed model development, Dimensional model review and validation, and final documentation to be handed to the ETL team. It is important to note that this process is iterative.

Next we went over the various participants that need to be involved: the data modeler, business analyst, power users, BI application developer, data steward, source experts, and the ETL team.

During this process, we’ll also start to conduct our first rounds of data profiling. I find this part of the process to be most crucial because it is at this level where you can start to determine the feasibility of implementing a particular business process.

Types of Fact Tables

After a brief case study, Margy took a little more time to go over the different types of Fact table grains. These include the transaction, periodic snapshot, and accumulating snapshot. For more information on the different types, please read my post “Fact Tables“.

Some additional notes on Accumulating Snapshots: They generally represent some sequential, short lived, and predictable process that needs to be expressed by various milestones. These milestones are most often dates. These fact tables often span multiple business processes as well. For example, an order is placed, it is released, shipped, and delivered. These are separate milestones that have some serious significance to the business for analysis.

Lastly…

We finished the day discussing mature BI projects and how to handle existing warehouses that might not be in great shape. Margy went over several important common disorders (ranging from lack of business sponsor to infrastructure issues), some symptoms, and finally she proposed a treatment plan. Every slide in this section was packed with good, solid advice. When I can get my notes together on this, I plan to write about some of my own experiences with these disorders.

Tomorrow Warren will take over. I trust he’ll be as good as Margy!

Tags: , , , , ,

No Comments

Live from Kimball University: Day 1 (planning, requirements, dimensional models)

As you may know, I am currently in Amsterdam attending the Kimball University (organized by Quest for Knowledge / Q4K) training on the Data Warehouse Lifecycle. Margy Ross and Warren Thornthwaite are here and so far so good. Margy is teaching for the first two days, while Warren will take the final two days. I am not disappointed thus far.

Admittedly, much of the material in this course is not new to me. I’ve been exposed to dimensional modeling for some time now, and with a few projects under my belt, I’m starting to feel like I know what I’m talking about (which could be dangerous, I know). I’ve read the books, the articles, the tips, and have even written my own articles and blogged on many subjects. Attending these four day sessions will be primarily to validate and shore-up what I already know and hopefully correct some of my misconceptions and mistakes. With that said, I am getting a lot out of each session so far.

Project and Program Planning

Margy started out with a brief introduction, detailing the roots of the Kimball Group. Its pretty interesting to hear how they started out as 3 2-person companies and essentially merged into one: The Kimball Group. These six people have fundamentally changed the way data modelers and architects think about and approach data warehousing. What sets them apart, I believe, is that their ideas and theories stem from needs of the user.

Tremendous emphasis is placed on getting the requirements right. Margy pointed out that involving the users early and often is a key to success. Oftentimes, she said, users are involved early on and then forgotten once development begins. Three or Four months later, what is delivered might not be exactly what the end user had in mind. As the development team works, they uncover more and more hidden issues that must be resolved. The best person to help resolve many of these issues is the business user — not the data modeler or ETL architect.

In short, planning for a data warehouse must involve the business community. You need a strong and charismatic sponsor who has vision and influence, a data steward who cares about convention and quality, and a business analyst who can straddle the fence between the green grassy fields of IT and the barren wasteland of the business (kidding…).

Margy also developed a list of warning signs that ranged from failing to fill the various shoes I mentioned in the previous paragraph to acknowledging “BI/DW success is tied to user acceptance”. One of her points, not to underestimate the data cleansing workload, is finally starting to become less common. But it still remains one of the main reasons why data warehousing projects fail.

The Bus Architecture

Here’s something you don’t get by reading a book: Margy and the group didn’t quite like Ralph’s name for what we know as the Bus Architecture. I don’t have an electrical engineering background, but Ralph does. AdventureWorks Bus Matrix SampleI suppose he felt that the way business processes and conformed dimensions related together reminded him of how a computer’s bus architecture worked. Essentially, being able to swap in and out components (If you have an electrical engineering background, or are just plain smarter than me on this, then please correct me if I’m off-base!).

I asked Margy what she would have called it, and she didn’t really know. I guess that’s why Ralph won-out. He was the only one with an idea!

For our purposes, the Bus Architecture “provides a standardized master set of conformed dimensions and conformed facts used throughout the data warehouse.” Click the image for a larger picture. This image comes from a SQL Server Mag article “Dimensional Modeling Basics” by Joy Mundy and Warren Thornthwaite.

Dimensional Modeling and Conformed Dimensions

I’ve talked at length in my articles and posts about dimensional modeling. Margy went over the basics today, starting out with a discussion on the differences between the Kimball approach and the Corporate Information Factory (CIF) approach pushed by Inmon. Although the dust between the two camps has settled in recent years, Margy was quite adamant about the benefits of dimensional modeling over the enterprise data warehouse concepts. Margy detailed these, but until I get my notes together, I’ll leave it at that.

Snowflake schema example from http://sql.wikis.com/wc.dll?SQL~snowflakeschemaShe also stressed the reliance on the star schema, and warned of snowflaking designs. Snowflaking occurs mainly when you attempt to normalize a dimension, usually by breaking the dimension up by its hierarchal data. For example, you might be tempted to break the “Store” dimension, which contains city, district, region, and state into separate City, District, Region, and State dimensions. Of course, doing so creates a snowflake and makes the model much more complex than it needs to be.

Conformed dimensions and facts are really the key to a successful dimensional modeling — and by extension data warehousing — project or program. They allow you to conduct drill-across queries, where you query data across business processes, and ultimately conformed dimensions and facts give your end users a consistent version of the data no matter what process they are looking at. Using the Bus Architecture, dimensions that are used across business processes are easy to spot.

Comments and Questions

I am also paying close attention to many of the questions and comments being made during the Q&A moments. The room is roughly composed of about 75% of people who are somewhat new to data warehousing and business intelligence. So it is a great opportunity to hear some of the questions the group has had for Margy.

Some questions: a.) How do you link a promotion to a product to show all the products that did not sell during a promotion? b.) Can you add the time to the date dimension? c.) Why should I conduct interviews during requirement gathering when I know the business well enough myself? d.) When determining the feasibility of implementing a business process, are there other important factors outside of data availability to consider?

I’ll elaborate more on all of these answers (and more) in future postings. In short, the answer key is simply: a.) Use a factless fact between product and promotion; b.) No, use either a Time dimension or store a timestamp in the Fact table; c.) You don’t know as much as you think you know and interviews will form the basis for future documentation; d.) available resources and technology issues can be overcome, perhaps the next biggest factor is dependencies between business processes.

And Lastly…

Almost everyone in attendance is a consultant. I am still getting used to this. I’ve seen and met more consultants in my short time here in Europe (about 5 months so far) than I have my entire life. Margy was also surprised. She did make a very interesting observation though. Consultants, who usually pay their own way through courses and seminars, typically stay away when times are tough. When the economy is good and/or they have plenty of work, consultants attend more conferences. Interesting. I suppose this could mean that the consultant business is doing quite well in this part of Europe.

I couldn’t get everything into this single post. I’ll be sure to post more notes as time goes on. For now, it’s on to tomorrow!

Tags: , , , , ,

1 Comment