Today’s session was packed with practical design and implementation advice and best practices for the technical architecture design, product selection, and physical design of the data warehouse. Warren has a great wit and produced a lively session out of material that most normal humans would fall asleep to. We all managed to stay awake though — even past our 5 O’Clock end time.

So far, out of the three days, I learned the most today.

Technical Architecture

“Most data warehouse projects don’t take the time up-front to create a viable technical architecture.” That was the bullet on the Power Point slide. Fortunately, I’ve been on projects that did. In one group we had a CORD (Concept of Operations & Requirements Document) together with several other documents detailing the ETL architecture choices and standards. In another, we had a full LPAD (Logical and Physical Architecture Document) which clearly laid out the hardware, metadata, and flow of the expected system.

Designing such a document is unchartered territory for me, even though I’ve penned several (more than I care to recount) technical documents. The framework and requirements of the final data warehouse technical architecture document is quite demanding and just plain hard to do. You must include business requirements, standards, implementation details for the business requirements, ETL service details, Subsystem strategies, data access strategies, metadata maintenance plans, data storage, backup and recovery, deployment, interview summaries, and many more categories.

To create this plan, Warren detailed an 8 step process:

  1. Prepare
  2. Gather architecture-related requirements
  3. Draft document
  4. Develop architectural models
  5. Determine implementation phases
  6. Spec the Subsystems
  7. Develop the final document
  8. Review and finalize the plan

As you can see, preparing this document is not trivial but an integral part of the design process. Most of what is done at this level can be implemented across the entire data warehousing “Program” (not just the “project”). Individual projects might have some specific requirements, so during each iteration of the Lifecycle, it might be necessary to refine or update the document.

Metadata

We talked at length throughout the sessions today about metadata. In our environment, there are really three types we truly care about: business, technical, and process metadata. The following sections will give you an idea of what is expected for each category.

Business Metadata: Describes the contents of the data warehouse in more user-acceptable terms:

  • Data quality
  • Data dictionaries
  • Logical data mappings
  • Business rules

Technical Metadata: Defines the objects and processes of the warehouse:

  • System inventory
  • Source descriptions
  • ETL Job logic
  • Backup and security

Process Metadata: Describes the results of various warehouse operations:

  • ETL operations statistics
  • Audit results
  • Quality results

Metadata management is really about following the 80/20 Principle. Warren even talked about striving for a realistic 10% metadata coverage instead of shooting for the holy grail, entirely integrated metadata repository. There are a lot of reasons for holding back efforts. Perhaps the biggest reason is that it is a maintenance headache with very little or spotty business value. Warren stressed that vendor tools should provide metadata and it’s about time we started demanding better of them.

Presentation Server

The presentation server(s) contains the atomic business process dimensional models. The server(s) acts as the access point for all BI applications and is fed by the ETL system. The dimensional models are complete with atomic-level data, aggregates, and conformed dimensions and facts. The server’s configuration and design is based on the requirements established during planning. The dimensions will contain combinations of SCD types, multi-valued hierarchies, and surrogate keys. Fact tables will be at transaction, accumulating snapshot, and periodic snapshot grains.

All of these things are planned for and conceived before the ETL and DBA groups get their hands on the documentation.

Real Time Data Warehousing

Real-time is defined (at least for this discussion) as any processing which requires access to data in the presentation servers less than 24 hours after the event occurs. Although the buzz of real-time has subsided — in some cases the term “right-time” has taken over — there is still a need to process data quickly. Warren went over a few different solutions, including the design of a “Real Time Layer”, which is a parallel structure to the data warehouse. This structure is in a dimensional form that matches (conforms, I guess you could say) to the dimensional models in the presentation servers.

The real-time layer would contain facts and their related dimensions only for the business events that have occurred since the last reconciliation with the data warehouse. Sounds simple? Not so fast. There are various issues with Type 1 and Type 2 changes, as well as the added complexity in maintaining the layer. But, as Warren stresses, if there is a compelling business need to have real-time access to dimensional data, then this might be a good way to go. Other options include building operational data stores, or building some sort of federated reporting system.

Finally…

We talked a lot about different vendor products. I may talk about these in some future post, but for now I don’t really see much value in it. There have been so many mergers and acquisitions, that it is very difficult to keep track of who is who. Warren even admits that he can’t keep up with some of these marketing departments as they re-brand and re-package many of the same products.

Tomorrow is the last day. I feel pretty good about what I’ve learned so far. More importantly, I feel that the work I have been doing to this point is valid. It’s nice to get reassurance once in a while!