Published May 30th, 2008
A Data Warehouser’s Vocabulary (Part 1)
Partly inspired by a post entitled “The most important thing I know about Analytics is that no-one agrees what it means” by James Taylor and partly inspired by the section “Slowly Changing Vocabulary” in the book “Data Warehouse Lifecycle Toolkit 2nd Edition“, I have decided to compile a glossary of terms and concepts that I feel have some relevance to the data warehousing and business intelligence world. I’ll break this list into several postings, and I reserve the right to refine, enhance, clarify, and augment a definition at any time! When finished, I’ll make them a permanent feature of TmF.
With this list, I am not attempting to resolve any debates, nor am I attempting to invalidate or discredit a definition you may be using. These are the definitions I use. Also be aware that certain terms might hold different meanings under different contexts. If I need to use one of those ambiguous terms, I try my best to put a good context around it. For example, when I refer to “Data Mart”, I specifically mean “Atomic Business Process Dimensional Model”. However, there are times when what I mean is to describe a separate (perhaps normalized) database for a specific user or department (i.e. a throw-away sandbox for the big kids).
Each of these definitions has a citation; I am using the XHTML “cite” tag with each. If you would like to see the source, view the source! Also, when I finish this list, and put these all together on a single page, I’ll be sure to include a reference link section as well.
So, without further ado, I give you the first group of many to come (A-Z):
- Business Intelligence (BI)
- A generic term to describe leveraging the organizations’ internal and external information assets for making better business decisions.
- Business Process
- The complete response that a business makes to an event. A business process entails the execution of a sequence of one or more process steps. It has a clearly defined deliverable or outcome. A Business Process is defined by the business event that triggers the process, the inputs and outputs, all the operational steps required to produce the output, the sequential relationship between the process steps, the business decisions that are part of the event response, and the flow of material and/or information between process steps.
- Changed Data Capture (CDC)
- Changed Data Capture (CDC) is a method of identifying changes made to a source database or file for the purposes of integrating the data into the data warehousing pipeline. CDC reduces data volume and processing needed for the data warehouse.
- Data Mart
- A business process dimensional model.
- Data Profiling
- Data profiling is a method of assessing source data in a systematic and analytical way. The goal of data profiling is to build an exhaustive inventory detailing the content, context, and quality of source data. It entails much more than reviewing a diagram or running a few SQL statements. Data profiling leads to better data integration, which leads to better data quality.
- Data Quality
- Assurances that the integrated data is consistent, complete, and fit to publish to the business community.
- Data Warehouse Database
- The largest possible union of queryable presentation data in a DW/BI System.
- ETL
- A set of processes that prepare source data for a Data Warehouse, adding value and confidence along the way. These processes include extraction, transformations (cleans & conform), and load operations. Note that the order in which ETL processes occur can be varied based on the situation. Some sources refer to the ET or just the E broadly as “Data Acquisition”.
- Master Data Management (MDM)
- Centralized facilities designed to hold master copies of shared entities, such as Customer and Product.
- Metadata
- All the information that defines and describes the structures, operations, and contents of a BI/DW system.
- Operational Data Store (ODS)
- A physical set of tables sitting between the operational systems and the data warehouse, or a specially administered hot partition of the data warehouse itself. The main purpose of an ODS is to provide immediate reporting of operational results if neither the operational system or the data warehouse can provide satisfactory access.
- Staging
- Physical workspace for data during the ETL process. Some data is temporarily staged, while other data may persist.
Snowflakes are usually a sign that a hierarchy has been normalized. This is bad. Don’t fall into the trap! In order to keep the dimensional model as simple as possible, you should avoid snowflaking; however, snowflake designs are perfectly legal under certain circumstances. Carefully examine the reasons though. If you are snowflaking to accommodate a hierarchy, hold your horses. Hierarchies are a natural part of a dimension. In fact, most things in life are categorized and need to be grouped in some way. Remember that one of the primary purposes of delivering a denormalized dimension is to remove almost all complexity from the user’s perspective. This usually means hierarchies.
I'm a Quant Technical Specialist (Data Warehousing and Business Intelligence), with expertise in business analysis, data modeling, and data integration. I have extensive experience developing vertical and integrated desktop, Internet, and BI applications spanning municipal, clinical, and financial industries.
