Posts Tagged BI

News from the Kimball Group

As you might know, I recently attended the Data Warehouse Lifecycle in Depth course from Kimball University. I blogged about each day of the event and also got an opportunity to speak with Margy and Warren at various times and about various topics. One of which was KU Certification.

“Now, if we can just convince them to create a certification exam…” is what I wrote, somewhat jokingly (because I knew full-well what Warren and Margy thought about a KU certification program).

Well, here is the official response, as written in the most recent design tip newsletter, “Kimball Design Tip #102 Server Configuration Considerations”:

We’ve received several inquiries about Kimball certification. After much consideration, we’ve concluded that it’s not meaningful to bestow certification by charging you to take a multiple choice exam. We do believe quality, in-depth education consistent with our proven methodology is critical to your professional development. Rather than embarking on a certification program, we’re giving Kimball University alumni an alternative opportunity to publicize their completion of our courses.

Which of course leads into my post regarding KU’s new LinkedIN Group!

We’ve launched a Kimball University Alumni group on LinkedIn so you can let everyone in your network and other interested parties know that you’ve attended a KU course. We’re not promoting LinkedIn, but it’s a well-accepted networking tool for industry professionals. Joining our alumni group allows you to promote your alumni status on your profile and connect with other alums (and perhaps potential future employers) in your area.

The Kimball University Alumni group is limited to students who have attended a full length 2- to 4-day Kimball University onsite or public course. Unfortunately, 1-day vendor seminars and other industry events do not qualify. Follow this link to join the Kimball University Alumni group.

So there you have it! If you’ve had the pleasure of attending at least one KU course, then go to LinkedIN and sign up!

Tags: , ,

2 Comments

A Data Warehouser’s Vocabulary (Part 1)

Partly inspired by a post entitled “The most important thing I know about Analytics is that no-one agrees what it means” by James Taylor and partly inspired by the section “Slowly Changing Vocabulary” in the book “Data Warehouse Lifecycle Toolkit 2nd Edition“, I have decided to compile a glossary of terms and concepts that I feel have some relevance to the data warehousing and business intelligence world. I’ll break this list into several postings, and I reserve the right to refine, enhance, clarify, and augment a definition at any time! When finished, I’ll make them a permanent feature of TmF.

With this list, I am not attempting to resolve any debates, nor am I attempting to invalidate or discredit a definition you may be using. These are the definitions I use. Also be aware that certain terms might hold different meanings under different contexts. If I need to use one of those ambiguous terms, I try my best to put a good context around it. For example, when I refer to “Data Mart”, I specifically mean “Atomic Business Process Dimensional Model”. However, there are times when what I mean is to describe a separate (perhaps normalized) database for a specific user or department (i.e. a throw-away sandbox for the big kids).

Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! Also, when I finish this list, and put these all together on a single page, I’ll be sure to include a reference link section as well.

So, without further ado, I give you the first group of many to come (A-Z):

Business Intelligence (BI)
A generic term to describe leveraging the organizations’ internal and external information assets for making better business decisions.
Business Process
The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.
Changed Data Capture (CDC)
Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.
Data Mart
A business process dimensional model.
Data Profiling
Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.
Data Quality
Assurances that the integrated data is consistent, complete, and fit to publish to the business community.
Data Warehouse Database
The largest possible union of queryable presentation data in a DW/BI System.
ETL
A set of processes that prepare source data for a Data Warehouse, adding value and confidence along the way. These processes include extraction, transformations (cleans & conform), and load operations. Note that the order in which ETL processes occur can be varied based on the situation. Some sources refer to the ET or just the E broadly as “Data Acquisition”.
Master Data Management (MDM)
Centralized facilities designed to hold master copies of shared entities, such as Customer and Product.
Metadata
All the information that defines and describes the structures, operations, and contents of a BI/DW system.
Operational Data Store (ODS)
A physical set of tables sitting between the operational systems and the data warehouse, or a specially administered hot partition of the data warehouse itself. The main purpose of an ODS is to provide immediate reporting of operational results if neither the operational system or the data warehouse can provide satisfactory access.
Staging
Physical workspace for data during the ETL process. Some data is temporarily staged, while other data may persist.

Tags: , , , ,

1 Comment

Book Review: Blink

In Blink, Malcolm Gladwell discusses the kind of thinking that occurs in about the same amount of time it takes to blink: Rapid Cognition. The brain has an ability to make excellent or devastating decisions in a fraction of a second. In some cases, if we relied more on this part of our brain for decision-making we might all be better off. In other cases, not thoroughly thinking through a decision could lead to disaster.

blink malcom gladwell Book Review: BlinkI’m posting about this book here because I think that the basic ideas holds true for any type of development project, whether it is a vertical application or a data warehouse. I couldn’t help but relate the material to my work with business intelligence and data warehousing applications in particular.

Malcolm claims that “we have come to confuse information with understanding” (page 264). How many times have you heard the phrases “information overload” or “analysis paralysis”? In fact, that is one of the problems that data warehouses are trying to solve: Get the right data to the right people at the right time to help them make better decisions. A “better” decision might simply mean making a decision with less information, less noise, and less bias. Don’t throw the album at the CEO, give him a photo.

Thinking too long about complex matters often leads to paralysis — or if you’re lucky, some sort of compromise that weaves in and out of your 3-page long pro and con list. Conversely, making snap decisions on simple matters can lead to equally bad results. A good rule of thumb is that if the matter is simple, think about it. Spend time on it. Make a good decision. Do this when you buy a new dishwasher, if you’re deciding what to feed the kids for dinner, or if you need to hire a new employee. For more intricate and complex matters, like asking your girlfriend to marry you, rely on rapid cognition to deliver an appropriate amount of insight without all the noise (the “she doesn’t dig sci-fi, but she does like roller coasters” back and forth in your brain might take years to sort out). Your brain’s rapid cognition center will automatically filter out the noise, make a good decision for you, and in the end, you’ll be happy with yourself moving forward. Do you really need to think long and hard about getting out of the way of a speeding bus? Should you go on an impulse-shopping spree?

Malcolm doesn’t talk about this, but I think that regret comes out of making bad choices after thinking too long and too hard. You end up making some sort of compromise, when you knew exactly what you really wanted in the first two seconds. Think about some of your greatest regrets. Before you made the fatal compromise or decision, the one that you would come to regret, did you think long and hard about it? I have a silly monkey tattoo on my ankle that I thought long and hard about getting. My gut told me no, but here I am more than 10 years later not too happy with my decision.

Blink and Bloomberg Professional

Yesterday I attended a training session for using the Bloomberg Professional data service. Within their training, they talked about their Launchpad desktop application. This got me thinking about Blink and data warehousing.

Launchpad allows you save any number of functions that are specific to your business needs so that you can easily get to them again (sort of like your Internet Favorites or Bookmarks). With thousands of functions, I’m quite sure that Launchpad was well-received. But on a deeper level, Launchpad gives brokers and researchers the ability to make better decisions faster. You now have a filter. Not as fast as a blink of your eye, but close. You can get lost in Bloomberg Professional and spend hours upon hours achieving absolutely nothing. Is this the cost of research? Or is it that we need more filters? As Business Intelligence professionals, application developers, and solution providers, we must start thinking that “Less is More” and start giving our end-users access to new brain functions.

Scorecards, dashboards, and aggregations are great examples of initiatives that are in this arena. But it isn’t so much about the latest tool or interface, it’s about a mindset. It’s about providing decision-makers with just enough information to “thin-slice” a situation or condition and make an approriate decision to gain some business advantage. Malcom talks a great deal about thin-slicing and provides several great examples of how it works.

Recommendation

Tipping Point was a better read and seemed to have a better direction. In fact, I was wondering throughout Blink what Gladwell’s real point was: Should we or shouldn’t we rely on Rapid Cognition? He added an Afterward some time after the book was first published and this actually tied his thoughts together. Without the Afterward, I would have been thoroughly disappointed.

That said, this is a good book that got me thinking (not so rapidly, though!) If you liked Tipping Point, this is another book with a subtle social message (that involves screens, women, and large musical instruments — now doesn’t that make you want to go out and get it?).

More importantly for me, it has given me a bit of a push into a different way of thinking about business problems, data as an asset, and decision support.

You can get Blink here from Amazon.

Tags: , , ,

No Comments

Mums the word: KU Alumni Group in the works for LinkedIn

It turns out that the Kimball Group is working on a new Kimball University Alumni group for LinkedIn. This is pretty cool because as far as I know, there is no real good way to get in touch with Kimball University “graduates”. The plan is to announce the group officially at some point in the near future (I believe on the next design tip).

According to my “source” (can you tell I’m really trying to sound like a reporter with some really important scoop?), they even have a logo ready to go. As some of you can attest to, the logo can be the hardest thing to settle on!

ETL Subsystems

I’ll be getting back on track with my dissection of the 34 Subsystems of ETL (in which I give my thoughts on each subsystem through SSIS and Visual FoxPro). I took a few weeks off while in Amsterdam, but should be ready to start again. Next on the list is ETL Subsystem 11: Hierarchy Manager. I expect to have it ready by early next week.

LinkedIn

By the way, if you’re on LinkedIn, and you read this blog, feel free to invite me into your network. You likely are a FoxPro user which means I’m going to like you, or you are involved in some way with data warehousing, which means I’m going to like you! Just let me know you’re a reader!

Tags: , , ,

2 Comments

Live from Kimball University: Day 4 (ETL, Addresses, Applications, BI Toolkit?)

Ok, so I’m a day late! I had to catch a train back to Brussels, and considering I hadn’t seen my family since Monday morning, I didn’t spend last night blogging! Besides, the four days were fairly taxing. Whether you think you know everything about dimensional modeling or not, the material was fairly dense and Warren and Margy both moved fast.

If you are serious about data warehousing, and call yourself a data warehousing professional (especially if you are a “dimensional modeler”), then attending this and other sessions is a must. I learned more Warren and Margy directly than I could have from their books. And hearing about their various experiences added an extra dimension (no pun) to the education.

Now, if we can just convince them to create a certification exam

ETL System

The entire morning was focused on ETL Design and Development. The first step in designing the ETL processes is to meet a basic set of requirements. These would include having all business requirements fleshed out; a “solid” dimensional model ready to go; data quality, archiving, lineage, and management infrastructures defined; compliance, latency, and security issues addressed; and finally a good understanding of all application requirements. With these in place, it is possible to create the ETL System.

Basically, the ETL system is in charge of converting source system data to the data warehouse database. Don’t get hung up on the acronym though. Depending on the circumstances, the process may feel more like ELT, or ELTL, or E&T, more T, some L, and more T.

Warren recommends using an ETL vendor tool mainly because the tools handle technical metadata requirements and likely have a lot of other very helpful functionality built in (logging, standard tasks, connection managers, dimension management, surrogate key lookup capability, in-memory pipelining, etc..).

As many of you know, I am an advocate of hand-coding ETL, especially using Visual FoxPro. VFP makes it easy to build in all the things that other ETL tools do with little effort. In addition, VFP is data centric and object-oriented, and if the ETL processes are designed correctly can be totally metadata driven. With that said, ETL tools are getting better and better. If you have a couple of developers in house that are experienced with FoxPro or another data-centric language, then by all means go for it. If not, then using an ETL tool is likely your best bet.

Regardless if you hand-code or buy, Warren defined the following steps that are needed to develop the ETL system:

  1. Create a high-level plan for the target model
  2. Install and learn the ETL tool!
  3. Determine your default strategies
  4. Drill down by target table
  5. Develop the historical load for dimensions
  6. Populate historical facts
  7. Dimension table incremental processing
  8. Fact table incremental processing
  9. Aggregate and OLAP loads
  10. ETL operation and automation

Nothing earth-shattering, but I have seen first-hand how ETL systems can stall and flutter along because one or more steps were ignored or done out of order. The Kimball Group has spent a long time together developing their best-practice and practical techniques so that we can all benefit. This list is just another example; it’s a simple pattern to follow that will likely lead to greater successes.

Name and Address Fun

As part of our ETL discussion, we spoke of data cleansing and conforming. One major challenge that data integrators have is dealing with very dirty data from multiple source systems. One good example is the challenges associated with name and address matching.

Warren showed us a series of slides with scanned address labels delivered (miraculously in some cases if you saw the slides) to his house. His name was butchered in every which way. Some had his name as Thorn. Another had him as a woman. It goes to show that data quality and name and address cleansing is very hard to do. When you consider that some companies spend millions on their mailings, this is a big deal: And they can’t even get it right!

BI Applications

After lunch, we dove right into a discussion on Business Intelligence (BI) Applications. One thing Warren said really stuck out to me, and it is something I wrote in big letters on my copy of the presentation slides: BRANDING.

BI Applications are in fact a way for us — as DW/BI professionals — to brand our services. We create our image through the BI Application layer. Despite all the background and expertise we need to do all other phases of the LifeCycle, the executives, managers, and users will see us through our applications.

I admit, I had an “ah-ha” moment when this came up. Everything from Portal design to reporting templates to our logo (yes, we need a logo too!) is equally as important as cleansing and conforming data, designing a perfect data model, and conducting hours upon ours of requirement-gathering interviews.

I only wish that we had more time to spend on this topic. An entire day would have been great.

BI Toolkit

We need a BI Toolkit. A vendor-neutral, general discussion on what it means to deliver BI Applications to our users. I talked with both Margy and Warren about this and both seemed to agree (as does their publisher). It’s been in their sights for some time now, I suppose. While the other toolkit books have been an invaluable reference for the entire community — this book could literally change the way data warehousing teams and businesses consider BI, BI Applications, and the various possibilities that BI offers.

Personally, I would like this book to take more time delving into user requirements. I don’t want it to tell me why I need one p-value over another, nor do I want it to tell me to use Business Objects because it can support feature X or Y. I need a reference that reminds me that developing a scorecard requires multiple business processes, that building a reporting template to pass around to business users can be a great tool to flesh out design decisions, and that perhaps a “BI Matrix” needs to be developed so that each BI Application can be resolved against each underlying business process dimensional model. Ideally, it would be a book that a business sponsor can read and digest, while providing the Data Warehousing team with enough technical knowledge to appropriately evaluate tools, build strategies, and formulate roll out plans that give the user the biggest and quickest return on their investments.

Lastly…

It’s a bummer that Warren had to race through the deployment and maintenance materials. We really did have a good class though, and a lot of great questions were asked. You can tell that most people in the room really “get it”, and their questions all reflect that. Of course, this led to a few diversions and a tangent or two. We paid for that in the last section. Thankfully, I have a brand new, 2cnd Edition Lifecylce Toolkit book to refer to!

This was a great experience for me, and I met a lot of great people. If you haven’t been to one of these courses, I recommend it no matter the cost.

If you have questions or comments for me, don’t be shy. I will be writing more and more about these topics in future postings, and I’ll also be picking up where I left off with my postings on the ETL Subsytems 1 through 34. So if there is something you would like me to address first, let me know!

Tags: , , , , , ,

8 Comments

Live from Kimball University: Day 2 (SCDs, the Mini, Modeling Process)

I am thoroughly exhausted! Margy followed up day one with another day packed with information.

We went through the various types of Slowly Changing Dimensions (SCD) and even talked about Type 6 SCDs, which I’ll get to in a bit. We then rolled right into mini dimensions and other related concepts. After that, we talked about the modeling process, different types of fact tables, and then finally how to deal with a mature data warehouse.

Slowly Changing Dimensions

Although I’ve posted a few entries (1 2) on the SCD Types, I’ll go over them once again here, but hopefully in a slightly new light. We talked specifically about:

Type 1: Overwrite the Attribute
In short, when an attribute’s value changes, we overwrite the old value with the new value. This technique does not preserve history. While it might be easy to implement, its use must be fully justified! In most every case, we will want to keep some sort of history in our dimensions.
Type 2: Preserve History / Add a New Record
The most common type, this technique preserves the old value by inserting a copy of the current row with the updated attribute values into the Dimension. The old row is marked as archived (typically by using active and inactive dates and a status indicator).
Type 3: Alternate Reality / Add a New Attribute
New attributes are added to the dimension to represent the old and new value. In some cases this is a handy technique because you can preserve the old value along side the new value. This technique is best used when there is a massive, dimension-wide change that affects many rows. For example, if a voter’s representative district changes during redistricting, we might want to see what election results would have been if the old districts were used in the analysis.
Type 6: Advanced Hybrid
This is a hybrid approach that combines the other 3 types (1 + 2 + 3 = 6). According to Margy: “We’re creating new rows to capture change (Type 2), adding attributes to reflect an alternative view of the world (Type 3), which are overwritten for all earlier dimension rows for a given product (Type 1).” Read more about this here and be sure to check out this article by Margy and Ralph for more information. I hope to have some examples posted in the coming weeks as well.

We didn’t talk about inferred dimensions (although Margy did mention the late arriving kind). I suppose the idea of an inferred dimension (or Type 1.5) might be discussed with Warren in the coming days.

The Mini

After all that SCD talk, we talked about dealing with very large dimensions. Especially those that have Type 2 attributes. Margy introduced the concept of the Mini-Dimension, and stated that for extremely large dimensions, using Type 2 may not be a good idea for attributes that might change often. For example, suppose you wanted to store an airline passenger’s ‘age’ in your Passenger dimension. If the passenger flies 7 times over an 8 year period, you could have as many as 7 rows reflecting that change.

Instead, create a mini-dimension with attributes that change frequently and link them to the Fact table using a separate foreign key. It is hard to explain this in any detail without a picture. I promise once I get back from Amsterdam I will prepare a better example for this (with pictures!).

The Modeling Process

I was happy to see a segment dedicated to the overall modeling process. When it first came up, I felt it was out of place (perhaps it could have been discussed before we dove into sketching our case study models?), but in the end, it seemed to fit the flow.

The modeling process includes the following steps: Preparation, a high level design session, detailed model development, Dimensional model review and validation, and final documentation to be handed to the ETL team. It is important to note that this process is iterative.

Next we went over the various participants that need to be involved: the data modeler, business analyst, power users, BI application developer, data steward, source experts, and the ETL team.

During this process, we’ll also start to conduct our first rounds of data profiling. I find this part of the process to be most crucial because it is at this level where you can start to determine the feasibility of implementing a particular business process.

Types of Fact Tables

After a brief case study, Margy took a little more time to go over the different types of Fact table grains. These include the transaction, periodic snapshot, and accumulating snapshot. For more information on the different types, please read my post “Fact Tables“.

Some additional notes on Accumulating Snapshots: They generally represent some sequential, short lived, and predictable process that needs to be expressed by various milestones. These milestones are most often dates. These fact tables often span multiple business processes as well. For example, an order is placed, it is released, shipped, and delivered. These are separate milestones that have some serious significance to the business for analysis.

Lastly…

We finished the day discussing mature BI projects and how to handle existing warehouses that might not be in great shape. Margy went over several important common disorders (ranging from lack of business sponsor to infrastructure issues), some symptoms, and finally she proposed a treatment plan. Every slide in this section was packed with good, solid advice. When I can get my notes together on this, I plan to write about some of my own experiences with these disorders.

Tomorrow Warren will take over. I trust he’ll be as good as Margy!

Tags: , , , , ,

No Comments