Posts Tagged Conference

Live from Kimball University: Day 4 (ETL, Addresses, Applications, BI Toolkit?)

Ok, so I’m a day late! I had to catch a train back to Brussels, and considering I hadn’t seen my family since Monday morning, I didn’t spend last night blogging! Besides, the four days were fairly taxing. Whether you think you know everything about dimensional modeling or not, the material was fairly dense and Warren and Margy both moved fast.

If you are serious about data warehousing, and call yourself a data warehousing professional (especially if you are a “dimensional modeler”), then attending this and other sessions is a must. I learned more from Warren and Margy directly than I could have from their books. And hearing about their various experiences added an extra dimension (no pun intended) to the education.

Now, if we can just convince them to create a certification exam…

ETL System

The entire morning was focused on ETL Design and Development. The first step in designing the ETL processes is to meet a basic set of requirements. These would include having all business requirements fleshed out; a “solid” dimensional model ready to go; data quality, archiving, lineage, and management infrastructures defined; compliance, latency, and security issues addressed; and finally a good understanding of all application requirements. With these in place, it is possible to create the ETL System.

Basically, the ETL system is in charge of converting source system data to the data warehouse database. Don’t get hung up on the acronym though. Depending on the circumstances, the process may feel more like ELT, or ELTL, or E&T, more T, some L, and more T.

Warren recommends using an ETL vendor tool, mainly because the tools handle technical metadata requirements and likely have a lot of other very helpful functionality built in (logging, standard tasks, connection managers, dimension management, surrogate key lookup capability, in-memory pipelining, etc.).

As many of you know, I am an advocate of hand-coding ETL, especially using Visual FoxPro. VFP makes it easy to build in all the things that other ETL tools do with little effort. In addition, VFP is data centric and object-oriented, and if the ETL processes are designed correctly, they can be totally metadata driven. With that said, ETL tools are getting better and better. If you have a couple of developers in house that are experienced with FoxPro or another data-centric language, then by all means go for it. If not, then using an ETL tool is likely your best bet.

Regardless of whether you hand-code or buy, Warren defined the following steps that are needed to develop the ETL system:

  1. Create a high-level plan for the target model
  2. Install and learn the ETL tool!
  3. Determine your default strategies
  4. Drill down by target table
  5. Develop the historical load for dimensions
  6. Populate historical facts
  7. Dimension table incremental processing
  8. Fact table incremental processing
  9. Aggregate and OLAP loads
  10. ETL operation and automation

Nothing earth-shattering, but I have seen first-hand how ETL systems can stall and flutter along because one or more steps were ignored or done out of order. The Kimball Group has spent a long time together developing their best-practice and practical techniques so that we can all benefit. This list is just another example; it’s a simple pattern to follow that will likely lead to greater successes.
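
As one illustration of step 8 in the list above (fact table incremental processing), here is a minimal Python sketch of a surrogate key lookup, one of the built-in tool features mentioned earlier. The table layout, the column names, and the -1 "unknown" key are assumptions made for this example, not something taken from the course slides.

```python
# Current dimension rows: natural key -> surrogate key (hypothetical data).
product_key_lookup = {"SKU-1": 101, "SKU-2": 102}
UNKNOWN_PRODUCT_KEY = -1  # assumed placeholder row for unmatched natural keys


def to_fact_row(source_row):
    """Swap the source system's natural key for the warehouse surrogate key."""
    surrogate = product_key_lookup.get(source_row["sku"], UNKNOWN_PRODUCT_KEY)
    return {
        "product_key": surrogate,
        # Turn '2008-04-01' into an integer date key such as 20080401.
        "date_key": int(source_row["sale_date"].replace("-", "")),
        "quantity": source_row["quantity"],
    }


incoming = [
    {"sku": "SKU-1", "sale_date": "2008-04-01", "quantity": 3},
    {"sku": "SKU-9", "sale_date": "2008-04-01", "quantity": 1},  # not in the dimension
]
fact_rows = [to_fact_row(row) for row in incoming]

for row in fact_rows:
    print(row)
```

Routing unmatched keys to a placeholder row is only one possible default strategy (step 3); rejecting the row or creating an inferred dimension member are alternatives.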

Name and Address Fun

As part of our ETL discussion, we spoke of data cleansing and conforming. One major challenge that data integrators have is dealing with very dirty data from multiple source systems. A good example is the challenge of name and address matching.

Warren showed us a series of slides with scanned address labels delivered (miraculously in some cases if you saw the slides) to his house. His name was butchered in every which way. Some had his name as Thorn. Another had him as a woman. It goes to show that data quality and name and address cleansing are very hard to do. When you consider that some companies spend millions on their mailings, this is a big deal, and they can’t even get it right!

BI Applications

After lunch, we dove right into a discussion on Business Intelligence (BI) Applications. One thing Warren said really stuck out to me, and it is something I wrote in big letters on my copy of the presentation slides: BRANDING.

BI Applications are in fact a way for us — as DW/BI professionals — to brand our services. We create our image through the BI Application layer. Despite all the background and expertise we need to do all other phases of the LifeCycle, the executives, managers, and users will see us through our applications.

I admit, I had an “ah-ha” moment when this came up. Everything from Portal design to reporting templates to our logo (yes, we need a logo too!) is as important as cleansing and conforming data, designing a perfect data model, and conducting hours upon hours of requirement-gathering interviews.

I only wish that we had more time to spend on this topic. An entire day would have been great.

BI Toolkit

We need a BI Toolkit: a vendor-neutral, general discussion on what it means to deliver BI Applications to our users. I talked with both Margy and Warren about this and both seemed to agree (as does their publisher). It’s been in their sights for some time now, I suppose. While the other toolkit books have been an invaluable reference for the entire community, this book could literally change the way data warehousing teams and businesses consider BI, BI Applications, and the various possibilities that BI offers.

Personally, I would like this book to take more time delving into user requirements. I don’t want it to tell me why I need one p-value over another, nor do I want it to tell me to use Business Objects because it can support feature X or Y. I need a reference that reminds me that developing a scorecard requires multiple business processes, that building a reporting template to pass around to business users can be a great tool to flesh out design decisions, and that perhaps a “BI Matrix” needs to be developed so that each BI Application can be resolved against each underlying business process dimensional model. Ideally, it would be a book that a business sponsor can read and digest, while providing the Data Warehousing team with enough technical knowledge to appropriately evaluate tools, build strategies, and formulate roll out plans that give the user the biggest and quickest return on their investments.

Lastly…

It’s a bummer that Warren had to race through the deployment and maintenance materials. We really did have a good class though, and a lot of great questions were asked. You can tell that most people in the room really “get it”, and their questions all reflect that. Of course, this led to a few diversions and a tangent or two. We paid for that in the last section. Thankfully, I have a brand new 2nd Edition Lifecycle Toolkit book to refer to!

This was a great experience for me, and I met a lot of great people. If you haven’t been to one of these courses, I recommend it no matter the cost.

If you have questions or comments for me, don’t be shy. I will be writing more and more about these topics in future postings, and I’ll also be picking up where I left off with my postings on the ETL Subsystems 1 through 34. So if there is something you would like me to address first, let me know!


Live from Kimball University: Day 2 (SCDs, the Mini, Modeling Process)

I am thoroughly exhausted! Margy followed up day one with another day packed with information.

We went through the various types of Slowly Changing Dimensions (SCD) and even talked about Type 6 SCDs, which I’ll get to in a bit. We then rolled right into mini dimensions and other related concepts. After that, we talked about the modeling process, different types of fact tables, and then finally how to deal with a mature data warehouse.

Slowly Changing Dimensions

Although I’ve posted a few entries on the SCD Types, I’ll go over them once again here, but hopefully in a slightly new light. We talked specifically about:

Type 1: Overwrite the Attribute
In short, when an attribute’s value changes, we overwrite the old value with the new value. This technique does not preserve history. While it might be easy to implement, its use must be fully justified! In most every case, we will want to keep some sort of history in our dimensions.
Type 2: Preserve History / Add a New Record
The most common type, this technique preserves the old value by inserting a copy of the current row with the updated attribute values into the Dimension. The old row is marked as archived (typically by using active and inactive dates and a status indicator).
Type 3: Alternate Reality / Add a New Attribute
New attributes are added to the dimension to represent the old and new value. In some cases this is a handy technique because you can preserve the old value alongside the new value. This technique is best used when there is a massive, dimension-wide change that affects many rows. For example, if a voter’s representative district changes during redistricting, we might want to see what election results would have been if the old districts were used in the analysis.
Type 6: Advanced Hybrid
This is a hybrid approach that combines the other 3 types (1 + 2 + 3 = 6). According to Margy: “We’re creating new rows to capture change (Type 2), adding attributes to reflect an alternative view of the world (Type 3), which are overwritten for all earlier dimension rows for a given product (Type 1).” Read more about this here and be sure to check out this article by Margy and Ralph for more information. I hope to have some examples posted in the coming weeks as well.
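
To make the Type 1 and Type 2 mechanics above concrete, here is a minimal Python sketch that treats the dimension as an in-memory list of dicts. The column names (customer_key, customer_id, row_start, row_end, is_current), the function names, and the 9999-12-31 end date are assumptions chosen for illustration; they are not from Margy's material.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed "open-ended" end date for current rows


def apply_type1(dim, natural_key, attr, new_value):
    """Type 1: overwrite the attribute on every row for this natural key."""
    for row in dim:
        if row["customer_id"] == natural_key:
            row[attr] = new_value


def apply_type2(dim, natural_key, attr, new_value, change_date):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    current = next(
        r for r in dim if r["customer_id"] == natural_key and r["is_current"]
    )
    if current[attr] == new_value:
        return current  # nothing changed, keep the current row
    current["row_end"] = change_date
    current["is_current"] = False
    new_row = dict(current)  # copy of the old row, then apply the changes
    new_row.update(
        {
            "customer_key": max(r["customer_key"] for r in dim) + 1,
            attr: new_value,
            "row_start": change_date,
            "row_end": HIGH_DATE,
            "is_current": True,
        }
    )
    dim.append(new_row)
    return new_row


customer_dim = [
    {
        "customer_key": 1,
        "customer_id": "C100",
        "name": "Ann Smith",
        "city": "Brussels",
        "row_start": date(2007, 1, 1),
        "row_end": HIGH_DATE,
        "is_current": True,
    }
]

# A move to another city is tracked as history (Type 2) ...
apply_type2(customer_dim, "C100", "city", "Amsterdam", date(2008, 3, 1))
# ... while a corrected spelling of the name simply overwrites (Type 1).
apply_type1(customer_dim, "C100", "name", "Anne Smith")

for row in customer_dim:
    print(row["customer_key"], row["name"], row["city"], row["is_current"])
```

After the two calls the dimension holds two rows for the same customer: the expired Brussels row and the current Amsterdam row, both carrying the corrected name. A real implementation would do this with set-based SQL or an ETL tool component rather than row-by-row Python.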

We didn’t talk about inferred dimensions (although Margy did mention the late arriving kind). I suppose the idea of an inferred dimension (or Type 1.5) might be discussed with Warren in the coming days.

The Mini

After all that SCD talk, we talked about dealing with very large dimensions, especially those that have Type 2 attributes. Margy introduced the concept of the Mini-Dimension, and stated that for extremely large dimensions, using Type 2 may not be a good idea for attributes that might change often. For example, suppose you wanted to store an airline passenger’s ‘age’ in your Passenger dimension. If the passenger flies 7 times over an 8 year period, you could have as many as 7 rows reflecting that change.

Instead, create a mini-dimension with attributes that change frequently and link them to the Fact table using a separate foreign key. It is hard to explain this in any detail without a picture. I promise once I get back from Amsterdam I will prepare a better example for this (with pictures!).
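
Until there is a picture, a small Python sketch may help. It assumes the frequently changing attribute (age) is stored as a band in a separate mini-dimension, and that each fact row carries its own foreign key to that band. The band boundaries, table names, and column names are all made up for this example.

```python
# Hypothetical mini-dimension: one row per age band, built once up front.
AGE_BANDS = [(0, 17), (18, 29), (30, 44), (45, 64), (65, 200)]

age_band_dim = [
    {"age_band_key": key, "min_age": lo, "max_age": hi, "label": f"{lo}-{hi}"}
    for key, (lo, hi) in enumerate(AGE_BANDS, start=1)
]


def lookup_age_band_key(age):
    """Return the mini-dimension key whose band contains the given age."""
    for row in age_band_dim:
        if row["min_age"] <= age <= row["max_age"]:
            return row["age_band_key"]
    raise ValueError(f"no age band for age {age}")


# The passenger dimension keeps one row per passenger; age is NOT stored here.
passenger_dim = [{"passenger_key": 1, "name": "P. Traveler", "birth_year": 1980}]

# Each flight fact carries two foreign keys: passenger and age band.
flights = [(2000, 101), (2004, 102), (2008, 103), (2012, 104)]  # (year, flight_key)
flight_fact = []
for year, flight_key in flights:
    passenger = passenger_dim[0]
    age = year - passenger["birth_year"]
    flight_fact.append(
        {
            "flight_key": flight_key,
            "passenger_key": passenger["passenger_key"],
            "age_band_key": lookup_age_band_key(age),
        }
    )

for fact in flight_fact:
    print(fact)
```

The passenger dimension stays at one row no matter how often the passenger flies; the change in age shows up only as a different age_band_key on later fact rows.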

The Modeling Process

I was happy to see a segment dedicated to the overall modeling process. When it first came up, I felt it was out of place (perhaps it could have been discussed before we dove into sketching our case study models?), but in the end, it seemed to fit the flow.

The modeling process includes the following steps: preparation, a high-level design session, detailed model development, dimensional model review and validation, and final documentation to be handed to the ETL team. It is important to note that this process is iterative.

Next we went over the various participants that need to be involved: the data modeler, business analyst, power users, BI application developer, data steward, source experts, and the ETL team.

During this process, we’ll also start to conduct our first rounds of data profiling. I find this part of the process to be most crucial because it is at this level where you can start to determine the feasibility of implementing a particular business process.

Types of Fact Tables

After a brief case study, Margy took a little more time to go over the different types of Fact table grains. These include the transaction, periodic snapshot, and accumulating snapshot. For more information on the different types, please read my post “Fact Tables”.

Some additional notes on Accumulating Snapshots: They generally represent some sequential, short lived, and predictable process that needs to be expressed by various milestones. These milestones are most often dates. These fact tables often span multiple business processes as well. For example, an order is placed, it is released, shipped, and delivered. These are separate milestones that have some serious significance to the business for analysis.
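
The order example above can be sketched in a few lines of Python. The key point is that an accumulating snapshot row is inserted once and then revisited as each milestone occurs. The column and function names here are assumptions for illustration only.

```python
from datetime import date

MILESTONES = ("order_date", "release_date", "ship_date", "delivery_date")


def new_order_row(order_id, order_date):
    """Insert the row once, when the order is placed; later milestones are unknown."""
    row = {"order_id": order_id, **{m: None for m in MILESTONES}}
    row["order_date"] = order_date
    return row


def record_milestone(row, milestone, when):
    """Revisit the existing row and fill in a milestone date (no new row)."""
    if milestone not in MILESTONES:
        raise KeyError(milestone)
    row[milestone] = when


def days_between(row, start, end):
    """Lag between two milestones, or None if either has not happened yet."""
    if row[start] is None or row[end] is None:
        return None
    return (row[end] - row[start]).days


order = new_order_row("SO-1001", date(2008, 4, 1))
record_milestone(order, "release_date", date(2008, 4, 2))
record_milestone(order, "ship_date", date(2008, 4, 4))

print(days_between(order, "order_date", "ship_date"))      # 3
print(days_between(order, "order_date", "delivery_date"))  # None (not delivered yet)
```

In a real warehouse the milestone columns would normally be foreign keys to a date dimension rather than raw dates, with a placeholder key standing in for milestones that have not happened yet.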

Lastly…

We finished the day discussing mature BI projects and how to handle existing warehouses that might not be in great shape. Margy went over several important common disorders (ranging from lack of business sponsor to infrastructure issues), some symptoms, and finally she proposed a treatment plan. Every slide in this section was packed with good, solid advice. When I can get my notes together on this, I plan to write about some of my own experiences with these disorders.

Tomorrow Warren will take over. I trust he’ll be as good as Margy!


Live from Kimball University: Day 1 (planning, requirements, dimensional models)

As you may know, I am currently in Amsterdam attending the Kimball University (organized by Quest for Knowledge / Q4K) training on the Data Warehouse Lifecycle. Margy Ross and Warren Thornthwaite are here and so far so good. Margy is teaching for the first two days, while Warren will take the final two days. I am not disappointed thus far.

Admittedly, much of the material in this course is not new to me. I’ve been exposed to dimensional modeling for some time now, and with a few projects under my belt, I’m starting to feel like I know what I’m talking about (which could be dangerous, I know). I’ve read the books, the articles, the tips, and have even written my own articles and blogged on many subjects. Attending these four day sessions will be primarily to validate and shore-up what I already know and hopefully correct some of my misconceptions and mistakes. With that said, I am getting a lot out of each session so far.

Project and Program Planning

Margy started out with a brief introduction, detailing the roots of the Kimball Group. It’s pretty interesting to hear how they started out as three 2-person companies and essentially merged into one: The Kimball Group. These six people have fundamentally changed the way data modelers and architects think about and approach data warehousing. What sets them apart, I believe, is that their ideas and theories stem from the needs of the user.

Tremendous emphasis is placed on getting the requirements right. Margy pointed out that involving the users early and often is a key to success. Oftentimes, she said, users are involved early on and then forgotten once development begins. Three or four months later, what is delivered might not be exactly what the end user had in mind. As the development team works, they uncover more and more hidden issues that must be resolved. The best person to help resolve many of these issues is the business user, not the data modeler or ETL architect.

In short, planning for a data warehouse must involve the business community. You need a strong and charismatic sponsor who has vision and influence, a data steward who cares about convention and quality, and a business analyst who can straddle the fence between the green grassy fields of IT and the barren wasteland of the business (kidding…).

Margy also developed a list of warning signs that ranged from failing to fill the various roles I mentioned in the previous paragraph to failing to acknowledge that “BI/DW success is tied to user acceptance”. One of the mistakes on her list, underestimating the data cleansing workload, is finally starting to become less common. But it remains one of the main reasons why data warehousing projects fail.

The Bus Architecture

Here’s something you don’t get by reading a book: Margy and the group didn’t quite like Ralph’s name for what we know as the Bus Architecture. I don’t have an electrical engineering background, but Ralph does. I suppose he felt that the way business processes and conformed dimensions relate to each other reminded him of how a computer’s bus architecture works: essentially, components can be swapped in and out. (If you have an electrical engineering background, or are just plain smarter than me on this, then please correct me if I’m off-base!)

I asked Margy what she would have called it, and she didn’t really know. I guess that’s why Ralph won-out. He was the only one with an idea!

For our purposes, the Bus Architecture “provides a standardized master set of conformed dimensions and conformed facts used throughout the data warehouse.” The AdventureWorks Bus Matrix sample image that accompanied this post comes from a SQL Server Mag article, “Dimensional Modeling Basics”, by Joy Mundy and Warren Thornthwaite.

Dimensional Modeling and Conformed Dimensions

I’ve talked at length in my articles and posts about dimensional modeling. Margy went over the basics today, starting out with a discussion on the differences between the Kimball approach and the Corporate Information Factory (CIF) approach pushed by Inmon. Although the dust between the two camps has settled in recent years, Margy was quite adamant about the benefits of dimensional modeling over the enterprise data warehouse concepts. Margy detailed these, but until I get my notes together, I’ll leave it at that.

She also stressed the reliance on the star schema, and warned against snowflaking designs. Snowflaking occurs mainly when you attempt to normalize a dimension, usually by breaking the dimension up along its hierarchical data. For example, you might be tempted to break the “Store” dimension, which contains city, district, region, and state, into separate City, District, Region, and State dimensions. Of course, doing so creates a snowflake and makes the model much more complex than it needs to be. (The snowflake schema example image came from http://sql.wikis.com/wc.dll?SQL~snowflakeschema.)

Conformed dimensions and facts are really the key to a successful dimensional modeling — and by extension data warehousing — project or program. They allow you to conduct drill-across queries, where you query data across business processes, and ultimately conformed dimensions and facts give your end users a consistent version of the data no matter what process they are looking at. Using the Bus Architecture, dimensions that are used across business processes are easy to spot.
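
A drill-across query can be sketched as follows: each business process is aggregated separately to the same conformed attribute, and the result sets are then merged on that shared row header. This is a minimal Python illustration with made-up data; the table names, the category attribute, and the helper functions are assumptions, not part of the course material.

```python
from collections import defaultdict

# Conformed product dimension shared by both business processes (made-up data).
product_dim = {1: "Beverages", 2: "Beverages", 3: "Snacks"}  # product_key -> category

sales_fact = [(1, 100.0), (2, 50.0), (3, 75.0)]  # (product_key, sales_amount)
inventory_fact = [(1, 20), (2, 5), (3, 40)]      # (product_key, units_on_hand)


def total_by_category(fact_rows):
    """Aggregate one fact table to the grain of the conformed attribute."""
    totals = defaultdict(float)
    for product_key, measure in fact_rows:
        totals[product_dim[product_key]] += measure
    return dict(totals)


def drill_across(*named_results):
    """Merge separately aggregated result sets on the shared row header."""
    merged = defaultdict(dict)
    for name, result in named_results:
        for category, value in result.items():
            merged[category][name] = value
    return dict(merged)


report = drill_across(
    ("sales_amount", total_by_category(sales_fact)),
    ("units_on_hand", total_by_category(inventory_fact)),
)
for category in sorted(report):
    print(category, report[category])
```

The merge only lines up because both fact tables use the same product keys and the same category values. Without a conformed dimension, the two result sets would have no reliable header to join on.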

Comments and Questions

I am also paying close attention to many of the questions and comments being made during the Q&A moments. Roughly 75% of the people in the room are somewhat new to data warehousing and business intelligence. So it is a great opportunity to hear some of the questions the group has had for Margy.

Some questions: a.) How do you link a promotion to a product to show all the products that did not sell during a promotion? b.) Can you add the time to the date dimension? c.) Why should I conduct interviews during requirement gathering when I know the business well enough myself? d.) When determining the feasibility of implementing a business process, are there other important factors outside of data availability to consider?

I’ll elaborate more on all of these answers (and more) in future postings. In short, the answer key is simply: a.) Use a factless fact between product and promotion; b.) No, use either a Time dimension or store a timestamp in the Fact table; c.) You don’t know as much as you think you know and interviews will form the basis for future documentation; d.) available resources and technology issues can be overcome, perhaps the next biggest factor is dependencies between business processes.
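
Answer (a) is worth a quick illustration. A factless fact table records which products were covered by a promotion, with no measures at all; the products that did not sell are the covered products minus the ones that appear in the sales fact. Here is a minimal Python sketch, where the keys and table names are hypothetical:

```python
# Factless "coverage" fact: which products were on which promotion (no measures).
promotion_coverage_fact = {
    ("PROMO-SPRING", "P1"),
    ("PROMO-SPRING", "P2"),
    ("PROMO-SPRING", "P3"),
}

# Ordinary sales fact rows: (promotion_key, product_key, quantity_sold).
sales_fact = [
    ("PROMO-SPRING", "P1", 12),
    ("PROMO-SPRING", "P3", 4),
]


def products_not_sold(promotion_key):
    """Products covered by the promotion minus products that actually sold."""
    covered = {p for promo, p in promotion_coverage_fact if promo == promotion_key}
    sold = {p for promo, p, _qty in sales_fact if promo == promotion_key}
    return sorted(covered - sold)


print(products_not_sold("PROMO-SPRING"))  # ['P2']
```

In SQL the same set difference would typically be written with NOT EXISTS or EXCEPT between the coverage table and the sales fact table.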

And Lastly…

Almost everyone in attendance is a consultant. I am still getting used to this. I’ve seen and met more consultants in my short time here in Europe (about 5 months so far) than I have in my entire life. Margy was also surprised. She did make a very interesting observation though. Consultants, who usually pay their own way through courses and seminars, typically stay away when times are tough. When the economy is good and/or they have plenty of work, consultants attend more conferences. Interesting. I suppose this could mean that the consulting business is doing quite well in this part of Europe.

I couldn’t get everything into this single post. I’ll be sure to post more notes as time goes on. For now, it’s on to tomorrow!


Data Warehouse Lifecycle in Depth

I’m starting to get excited about next week. I’m leaving Brussels for Amsterdam at about 5pm on Monday to attend a four day course at Kimball University. Subject: The Data Warehouse Lifecycle in Depth. Instructors: Margy Ross & Warren Thornthwaite. I’m looking forward to diving into the DW Lifecycle with industry experts: those who literally wrote the book. Ralph Kimball won’t be there but I trust that Margy and Warren will give us a great week of sessions.

What’s really, really cool is that I’ll be getting the new edition of “The Data Warehouse Lifecycle Toolkit”:

What better way to start 2008 than with a major new data warehousing book! Joy, Warren, Margy, Bob and I have been working on the second edition of The Data Warehouse Lifecycle Toolkit for most of the last year. Virtually every paragraph of the first edition was changed, and more than 60% of the book is new material. The happy looks on our faces are both pride in our new baby, as well as relief from a long writing and editing stint.

I plan on blogging from the event, so if anyone is interested in getting some insight on any of the sessions, please let me know. I’ll be sure to take especially good notes!


Trends in Business Intelligence

I had the pleasure of attending an excellent TDWI Benelux Chapter meeting in Antwerpen, Belgium last Thursday. Presenting were Steve Hoberman and Davy Nys. Steve discussed the latest trends in business intelligence, including the drivers that influence them. Davy’s presentation, which I’ll talk about in a future posting, discussed the place for open source technologies in a BI environment.

Steve Hoberman

I have been reading Steve’s work for a few years now. I first discovered him in print when I picked up his book “Data Modeler’s Workbench: Tools and Techniques for Analysis and Design”. Unfortunately (for me), I lent the book to a former colleague some time ago, and will likely not get it back! It looks like I’ll need to re-order. You can keep up with Steve at The Data Administrator’s Newsletter website, where Steve is a columnist. His writings are excellent. He offers a great deal of insight and experience into data modeling and related topics.

So what are the trends?

Steve identified five drivers and five trends. I don’t have access to his PowerPoint slides, so I will do my best to relay what I heard based on my notes and memory. First, the drivers, which include:

  • Cheaper storage and computing;
  • businesses are more BI savvy and demand more of IT;
  • increased pressure for business and IT to perform well;
  • struggling global economy; and
  • scarce IT resources.

These drivers have contributed to the following five trends:

  • The merging of data warehouses and operational data stores;
  • increased integration efforts;
  • excitement around unstructured data;
  • pressure to deliver solutions correctly the first time; and
  • the need for BI personnel to wear multiple hats.

Certainly, Steve explained the above with more elegance and grace. Hopefully, I haven’t butchered his thoughts too much.

The point though, is that BI is heading in a particular direction. I’ve had a sense of this for the past couple of years:

  • I have witnessed firsthand the merging of DW and ODS.
  • I have seen a rise in integration efforts — including application and data.
  • Unstructured data and text mining are certainly talked about now as integral to future BI initiatives.
  • Because business users are now more involved in BI projects (it used to be that IT had to struggle to obtain business sponsorship), the ideas come faster than the ability of IT to implement them, meaning that IT must strive to get them right the first time.
  • Lastly, I do wear multiple hats: DBA, developer, integration architect, researcher, business analyst, project manager, etc.

I would be interested in hearing your thoughts on this subject. Feel free to write me at Tod at Blackstone Providence dot com, or simply just make a comment to this post. You can also reach Steve at his website, or visit TDWI for more details on events that might be coming in your area.


Live from FoxForward: The Last Day

Yesterday was the last day of the conference, and I’m sorry it’s ended. I learned a lot and met a lot of great Fox developers (who are people too, by the way). As the sessions were winding down, many of us hung out outside the conference rooms chatting about all things Fox (and a little about cars). I spoke for a second time yesterday (materials here), and was able to catch a few other sessions. Here are some of my notes:

GDIPlusX library

Bo spoke on the GDIPlusX Library. After a brief overview and some instructions on how to download and get the System object in memory, he took a moment to discuss how system.prg works. What a great piece of architecture. If you haven’t checked out the GDIPlusX Library, you should at least download it and review the system.prg file (and then all the others too). GDIPlusX looks a lot like the .NET implementation on purpose, and works the same too.

Of course, at the moment, my head is spinning on all the other implementations for this type of architecture (Math, IO, etc.).

Now here’s a great tip. Declared DLLs take precedence over function calls. Therefore, if you give your own function the same name as a declared DLL function, your code won’t run; the library’s code will. Bo showed us how he and his team developed a technique to take advantage of this behavior. When you need a DLL function, call a VFP function with the same name that DECLAREs the DLL and then calls it (passing in all relevant parameters). Now the DLL function is in memory, and the next time you need it, VFP will use it instead of the VFP function. This means that you only DECLARE your DLLs when they are actually needed. It also means that CLEAR DLLS ALL commands won’t kill your app if you need a DLL after that call (by the way, this is something I lost sleep over a while ago when some 3rd party function issued this command). I think Bo mentioned that this technique was first used by Rick Strahl, but on a smaller scale.

Bo also talked about how the GDIPlusX help is coming along. It looks like we’ll have more help for this library than we’ll ever need. Bo and his team are doing an amazing job with it.

A great resource for GDIPlusX, other than the official CodePlex/VFPX site, is Cesar Chalom’s Blog. Cesar has a lot of examples and is very active on the project.

How to OGLE Your Application

Alan Stevens taught us about OGLE: “Optimized, Generic, Layer, Extensions”. The goals of OGLE apps are to be future proof, data source agnostic, location independent, and .NET accessible. OGLE relies on physical n-Tier development techniques, where XML is passed among layers. Using XML in this way helps applications to be future proof. After the session, I talked with Alan about the architecture and he reiterated the importance of using XML to pass data between layers. We were also reminded that VFP is a state-of-the-art XML parser, faster and more reliable than other available technologies. Therefore, using XML to pass data back and forth is a natural fit for OGLE.

The presentation was a perfect complement to the whitepaper Alan created. The slides helped shape out the logic in the paper. Now I’ve got my head around it. He walked through lots of code examples, demonstrating the basics of the architecture. I haven’t used COM as a middle tier before, so I was glued to this session. I felt that Alan gave a great overview of the process and now I feel like I’m up to speed. He showed us how to set up COMATTRIB attributes, which specify type library attributes and values for PEMs. This applies to OLEPUBLIC classes only. You can do things like insert descriptions and read-only attributes. These are exposed and can be read by other technologies.

OGLEing apps seems like a smart design move. The separation and isolation of layers in each tier are more dramatic and defined. Although I missed Alan’s test-driven-development (TDD) session, I got the feeling that OGLE is quite friendly to TDD as well. Whether you’re using FoxUnit (Alan loves this thing) or just writing stubs to test apps, the architecture makes it simple.

I’ll be sure to recap the entire event after my wife and I get out of the Tennessee mountains on Tuesday. I’ll also be at AFUG Tuesday night.

More on FoxForward over at Kevin’s blog: http://cully.biz/. Alan said he might post too, so check out http://netcave.org/.
