Tod means Fox

Business Intelligence, Data Warehousing, SQL, Visual FoxPro.

Archive for ‘Education’


Published June 13th, 2008

A Datawarehouser’s Vocabulary (Part 2)

This post is part 2 (read part 1) of a series of posts containing a glossary of terms and concepts that I feel has some relevance to the data warehousing and business intelligence world. Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! When finished, these terms will be compiled and made a static page on TmF.

Aggregation
The process of redefining data into a summarization based on some rules or criteria. Aggregation may also encompass de-normalization for data access and retrieval.
Analytical Processing
Producing analysis for management decisions, usually involving trend analysis, drill-down analysis, demographic analysis, profiling, and so on.
Attribute
Any detail that serves to qualify, identify, classify, quantify, or express the state of an entity.
Data Mining
The process of analyzing large amounts of data in search of previously undiscovered business patterns.
Dimension
A denormalized table in a dimensional model with a single part primary key and descriptive attribute columns.
Event
A signal that some activity (usually a business transaction) has occurred.
Fact
Central table of a Star Schema which numeric performance measurements identified by a composite key, each of whose elements is a foreign key drawn from a dimension table.
Heuristic Analysis
Heuristic Analysis is a method to help to solve a problem, commonly informal. It is particularly used for a method that often rapidly leads to a solution that is usually reasonably close to the best possible answer. Heuristics are “rules of thumb”, educated guesses, intuitive judgments or simply common sense.
Online Analytical Processing (OLAP, also MOLAP)
On-line retrieval and analysis of data to reveal business trends and statistics not directly visible in the data directly retrieved from a data warehouse. Also known as multidimensional analysis.
Outrigger
A secondary dimension table attached to a dimension table. An outrigger is not used to normalize a dimension.
Relational OLAP (ROLAP)
“Relational” OLAP, in which the OLAP processes use a relational, normalized model for its source.
Slowly Changing Dimension (SCD)
The tendency for dimension attributes to change gradually or occasionally over time. The techniques for handling these changes include Type 1 (overwrite), Type 2 (keep history), and Type 3 (alternate realities).
Snowflake
A normalized dimension where a flat, single dimension table is deconstructed into a tree structure with potentially many nesting levels. Snowflaking a dimension generally compromises user understandability and browsing performance.
Snowflaking
The (undesirable) act of normalizing a dimensional model.
Star Schema
A generic representation of a dimensional model in a relational database in which a fact table with a composite key is joined to a number of single level dimension tables, each with a single primary key.

Published June 3rd, 2008

News from the Kimball Group

As you might know, I recently attended the Data Warehouse Lifecycle in Depth course from Kimball University. I blogged about each day of the event and also got an opportunity to speak with Margy and Warren at various times and about various topics. One of which was KU Certification.

“Now, if we can just convince them to create a certification exam…” is what I wrote, somewhat jokingly (because I knew full-well what Warren and Margy thought about a KU certification program).

Well, here is the official response, as written in the most recent design tip newsletter, “Kimball Design Tip #102 Server Configuration Considerations”:

We’ve received several inquiries about Kimball certification. After much consideration, we’ve concluded that it’s not meaningful to bestow certification by charging you to take a multiple choice exam. We do believe quality, in-depth education consistent with our proven methodology is critical to your professional development. Rather than embarking on a certification program, we’re giving Kimball University alumni an alternative opportunity to publicize their completion of our courses.

Which of course leads into my post regarding KU’s new LinkedIN Group!

We’ve launched a Kimball University Alumni group on LinkedIn so you can let everyone in your network and other interested parties know that you’ve attended a KU course. We’re not promoting LinkedIn, but it’s a well-accepted networking tool for industry professionals. Joining our alumni group allows you to promote your alumni status on your profile and connect with other alums (and perhaps potential future employers) in your area.

The Kimball University Alumni group is limited to students who have attended a full length 2- to 4-day Kimball University onsite or public course. Unfortunately, 1-day vendor seminars and other industry events do not qualify. Follow this link to join the Kimball University Alumni group.

So there you have it! If you’ve had the pleasure of attending at least one KU course, then go to LinkedIN and sign up!

Published May 30th, 2008

A Data Warehouser’s Vocabulary (Part 1)

Partly inspired by a post entitled “The most important thing I know about Analytics is that no-one agrees what it means” by James Taylor and partly inspired by the section “Slowly Changing Vocabulary” in the book “Data Warehouse Lifecycle Toolkit 2nd Edition“, I have decided to compile a glossary of terms and concepts that I feel have some relevance to the data warehousing and business intelligence world. I’ll break this list into several postings, and I reserve the right to refine, enhance, clarify, and augment a definition at any time! When finished, I’ll make them a permanent feature of TmF.

With this list, I am not attempting to resolve any debates, nor am I attempting to invalidate or discredit a definition you may be using. These are the definitions I use. Also be aware that certain terms might hold different meanings under different contexts. If I need to use one of those ambiguous terms, I try my best to put a good context around it. For example, when I refer to “Data Mart”, I specifically mean “Atomic Business Process Dimensional Model”. However, there are times when what I mean is to describe a separate (perhaps normalized) database for a specific user or department (i.e. a throw-away sandbox for the big kids).

Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! Also, when I finish this list, and put these all together on a single page, I’ll be sure to include a reference link section as well.

So, without further ado, I give you the first group of many to come (A-Z):

Business Intelligence (BI)
A generic term to describe leveraging the organizations’ internal and external information assets for making better business decisions.
Business Process
The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.
Changed Data Capture (CDC)
Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.
Data Mart
A business process dimensional model.
Data Profiling
Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.
Data Quality
Assurances that the integrated data is consistent, complete, and fit to publish to the business community.
Data Warehouse Database
The largest possible union of queryable presentation data in a DW/BI System.
ETL
A set of processes that prepare source data for a Data Warehouse, adding value and confidence along the way. These processes include extraction, transformations (cleans & conform), and load operations. Note that the order in which ETL processes occur can be varied based on the situation. Some sources refer to the ET or just the E broadly as “Data Acquisition”.
Master Data Management (MDM)
Centralized facilities designed to hold master copies of shared entities, such as Customer and Product.
Metadata
All the information that defines and describes the structures, operations, and contents of a BI/DW system.
Operational Data Store (ODS)
A physical set of tables sitting between the operational systems and the data warehouse, or a specially administered hot partition of the data warehouse itself. The main purpose of an ODS is to provide immediate reporting of operational results if neither the operational system or the data warehouse can provide satisfactory access.
Staging
Physical workspace for data during the ETL process. Some data is temporarily staged, while other data may persist.

Published May 15th, 2008

Mums the word: KU Alumni Group in the works for LinkedIn

It turns out that the Kimball Group is working on a new Kimball University Alumni group for LinkedIn. This is pretty cool because as far as I know, there is no real good way to get in touch with Kimball University “graduates”. The plan is to announce the group officially at some point in the near future (I believe on the next design tip).

According to my “source” (can you tell I’m really trying to sound like a reporter with some really important scoop?), they even have a logo ready to go. As some of you can attest to, the logo can be the hardest thing to settle on!

ETL Subsystems

I’ll be getting back on track with my dissection of the 34 Subsystems of ETL (in which I give my thoughts on each subsystem through SSIS and Visual FoxPro). I took a few weeks off while in Amsterdam, but should be ready to start again. Next on the list is ETL Subsystem 11: Hierarchy Manager. I expect to have it ready by early next week.

LinkedIn

By the way, if you’re on LinkedIn, and you read this blog, feel free to invite me into your network. You likely are a FoxPro user which means I’m going to like you, or you are involved in some way with data warehousing, which means I’m going to like you! Just let me know you’re a reader!

Published May 10th, 2008

Live from Kimball University: Day 4 (ETL, Addresses, Applications, BI Toolkit?)

Ok, so I’m a day late! I had to catch a train back to Brussels, and considering I hadn’t seen my family since Monday morning, I didn’t spend last night blogging! Besides, the four days were fairly taxing. Whether you think you know everything about dimensional modeling or not, the material was fairly dense and Warren and Margy both moved fast.

If you are serious about data warehousing, and call yourself a data warehousing professional (especially if you are a “dimensional modeler”), then attending this and other sessions is a must. I learned more Warren and Margy directly than I could have from their books. And hearing about their various experiences added an extra dimension (no pun) to the education.

Now, if we can just convince them to create a certification exam

ETL System

The entire morning was focused on ETL Design and Development. The first step in designing the ETL processes is to meet a basic set of requirements. These would include having all business requirements fleshed out; a “solid” dimensional model ready to go; data quality, archiving, lineage, and management infrastructures defined; compliance, latency, and security issues addressed; and finally a good understanding of all application requirements. With these in place, it is possible to create the ETL System.

Basically, the ETL system is in charge of converting source system data to the data warehouse database. Don’t get hung up on the acronym though. Depending on the circumstances, the process may feel more like ELT, or ELTL, or E&T, more T, some L, and more T.

Warren recommends using an ETL vendor tool mainly because the tools handle technical metadata requirements and likely have a lot of other very helpful functionality built in (logging, standard tasks, connection managers, dimension management, surrogate key lookup capability, in-memory pipelining, etc..).

As many of you know, I am an advocate of hand-coding ETL, especially using Visual FoxPro. VFP makes it easy to build in all the things that other ETL tools do with little effort. In addition, VFP is data centric and object-oriented, and if the ETL processes are designed correctly can be totally metadata driven. With that said, ETL tools are getting better and better. If you have a couple of developers in house that are experienced with FoxPro or another data-centric language, then by all means go for it. If not, then using an ETL tool is likely your best bet.

Regardless if you hand-code or buy, Warren defined the following steps that are needed to develop the ETL system:

  1. Create a high-level plan for the target model
  2. Install and learn the ETL tool!
  3. Determine your default strategies
  4. Drill down by target table
  5. Develop the historical load for dimensions
  6. Populate historical facts
  7. Dimension table incremental processing
  8. Fact table incremental processing
  9. Aggregate and OLAP loads
  10. ETL operation and automation

Nothing earth-shattering, but I have seen first-hand how ETL systems can stall and flutter along because one or more steps were ignored or done out of order. The Kimball Group has spent a long time together developing their best-practice and practical techniques so that we can all benefit. This list is just another example; it’s a simple pattern to follow that will likely lead to greater successes.

Name and Address Fun

As part of our ETL discussion, we spoke of data cleansing and conforming. One major challenge that data integrators have is dealing with very dirty data from multiple source systems. One good example is the challenges associated with name and address matching.

Warren showed us a series of slides with scanned address labels delivered (miraculously in some cases if you saw the slides) to his house. His name was butchered in every which way. Some had his name as Thorn. Another had him as a woman. It goes to show that data quality and name and address cleansing is very hard to do. When you consider that some companies spend millions on their mailings, this is a big deal: And they can’t even get it right!

BI Applications

After lunch, we dove right into a discussion on Business Intelligence (BI) Applications. One thing Warren said really stuck out to me, and it is something I wrote in big letters on my copy of the presentation slides: BRANDING.

BI Applications are in fact a way for us — as DW/BI professionals — to brand our services. We create our image through the BI Application layer. Despite all the background and expertise we need to do all other phases of the LifeCycle, the executives, managers, and users will see us through our applications.

I admit, I had an “ah-ha” moment when this came up. Everything from Portal design to reporting templates to our logo (yes, we need a logo too!) is equally as important as cleansing and conforming data, designing a perfect data model, and conducting hours upon ours of requirement-gathering interviews.

I only wish that we had more time to spend on this topic. An entire day would have been great.

BI Toolkit

We need a BI Toolkit. A vendor-neutral, general discussion on what it means to deliver BI Applications to our users. I talked with both Margy and Warren about this and both seemed to agree (as does their publisher). It’s been in their sights for some time now, I suppose. While the other toolkit books have been an invaluable reference for the entire community — this book could literally change the way data warehousing teams and businesses consider BI, BI Applications, and the various possibilities that BI offers.

Personally, I would like this book to take more time delving into user requirements. I don’t want it to tell me why I need one p-value over another, nor do I want it to tell me to use Business Objects because it can support feature X or Y. I need a reference that reminds me that developing a scorecard requires multiple business processes, that building a reporting template to pass around to business users can be a great tool to flesh out design decisions, and that perhaps a “BI Matrix” needs to be developed so that each BI Application can be resolved against each underlying business process dimensional model. Ideally, it would be a book that a business sponsor can read and digest, while providing the Data Warehousing team with enough technical knowledge to appropriately evaluate tools, build strategies, and formulate roll out plans that give the user the biggest and quickest return on their investments.

Lastly…

It’s a bummer that Warren had to race through the deployment and maintenance materials. We really did have a good class though, and a lot of great questions were asked. You can tell that most people in the room really “get it”, and their questions all reflect that. Of course, this led to a few diversions and a tangent or two. We paid for that in the last section. Thankfully, I have a brand new, 2cnd Edition Lifecylce Toolkit book to refer to!

This was a great experience for me, and I met a lot of great people. If you haven’t been to one of these courses, I recommend it no matter the cost.

If you have questions or comments for me, don’t be shy. I will be writing more and more about these topics in future postings, and I’ll also be picking up where I left off with my postings on the ETL Subsytems 1 through 34. So if there is something you would like me to address first, let me know!

Published May 8th, 2008

Live from Kimball University: Day 3 (Architecture, Metadata, Real Time BI)

Today’s session was packed with practical design and implementation advice and best practices for the technical architecture design, product selection, and physical design of the data warehouse. Warren has a great wit and produced a lively session out of material that most normal humans would fall asleep to. We all managed to stay awake though — even past our 5 O’Clock end time.

So far, out of the three days, I learned the most today.

Technical Architecture

“Most data warehouse projects don’t take the time up-front to create a viable technical architecture.” That was the bullet on the Power Point slide. Fortunately, I’ve been on projects that did. In one group we had a CORD (Concept of Operations & Requirements Document) together with several other documents detailing the ETL architecture choices and standards. In another, we had a full LPAD (Logical and Physical Architecture Document) which clearly laid out the hardware, metadata, and flow of the expected system.

Designing such a document is unchartered territory for me, even though I’ve penned several (more than I care to recount) technical documents. The framework and requirements of the final data warehouse technical architecture document is quite demanding and just plain hard to do. You must include business requirements, standards, implementation details for the business requirements, ETL service details, Subsystem strategies, data access strategies, metadata maintenance plans, data storage, backup and recovery, deployment, interview summaries, and many more categories.

To create this plan, Warren detailed an 8 step process:

  1. Prepare
  2. Gather architecture-related requirements
  3. Draft document
  4. Develop architectural models
  5. Determine implementation phases
  6. Spec the Subsystems
  7. Develop the final document
  8. Review and finalize the plan

As you can see, preparing this document is not trivial but an integral part of the design process. Most of what is done at this level can be implemented across the entire data warehousing “Program” (not just the “project”). Individual projects might have some specific requirements, so during each iteration of the Lifecycle, it might be necessary to refine or update the document.

Metadata

We talked at length throughout the sessions today about metadata. In our environment, there are really three types we truly care about: business, technical, and process metadata. The following sections will give you an idea of what is expected for each category.

Business Metadata: Describes the contents of the data warehouse in more user-acceptable terms:

  • Data quality
  • Data dictionaries
  • Logical data mappings
  • Business rules

Technical Metadata: Defines the objects and processes of the warehouse:

  • System inventory
  • Source descriptions
  • ETL Job logic
  • Backup and security

Process Metadata: Describes the results of various warehouse operations:

  • ETL operations statistics
  • Audit results
  • Quality results

Metadata management is really about following the 80/20 Principle. Warren even talked about striving for a realistic 10% metadata coverage instead of shooting for the holy grail, entirely integrated metadata repository. There are a lot of reasons for holding back efforts. Perhaps the biggest reason is that it is a maintenance headache with very little or spotty business value. Warren stressed that vendor tools should provide metadata and it’s about time we started demanding better of them.

Presentation Server

The presentation server(s) contains the atomic business process dimensional models. The server(s) acts as the access point for all BI applications and is fed by the ETL system. The dimensional models are complete with atomic-level data, aggregates, and conformed dimensions and facts. The server’s configuration and design is based on the requirements established during planning. The dimensions will contain combinations of SCD types, multi-valued hierarchies, and surrogate keys. Fact tables will be at transaction, accumulating snapshot, and periodic snapshot grains.

All of these things are planned for and conceived before the ETL and DBA groups get their hands on the documentation.

Real Time Data Warehousing

Real-time is defined (at least for this discussion) as any processing which requires access to data in the presentation servers less than 24 hours after the event occurs. Although the buzz of real-time has subsided — in some cases the term “right-time” has taken over — there is still a need to process data quickly. Warren went over a few different solutions, including the design of a “Real Time Layer”, which is a parallel structure to the data warehouse. This structure is in a dimensional form that matches (conforms, I guess you could say) to the dimensional models in the presentation servers.

The real-time layer would contain facts and their related dimensions only for the business events that have occurred since the last reconciliation with the data warehouse. Sounds simple? Not so fast. There are various issues with Type 1 and Type 2 changes, as well as the added complexity in maintaining the layer. But, as Warren stresses, if there is a compelling business need to have real-time access to dimensional data, then this might be a good way to go. Other options include building operational data stores, or building some sort of federated reporting system.

Finally…

We talked a lot about different vendor products. I may talk about these in some future post, but for now I don’t really see much value in it. There have been so many mergers and acquisitions, that it is very difficult to keep track of who is who. Warren even admits that he can’t keep up with some of these marketing departments as they re-brand and re-package many of the same products.

Tomorrow is the last day. I feel pretty good about what I’ve learned so far. More importantly, I feel that the work I have been doing to this point is valid. It’s nice to get reassurance once in a while!