Archive for category Data Architecture

Maintaining a DW/BI Environment

In posts Version Control and Version Migration I glossed over one of the more complex and challenging aspects of data warehousing: Once the DW/BI environment is in a production environment, how do you maintain and update it?

Some Thoughts

While I do not have a blueprint for you, I do have a few thoughts on the subject.

First, you must consider the Data Warehouse as a living and breathing organism. Not only will it be growing in size as your carefully constructed ETL packages churn, but it will also likely be growing in scope and importance (if not, then you may have to re-think your DW/BI marketing approach and/or find a new sponsor).

Second, you have to realize from the beginning that deploying a data warehouse is both an iterative and incremental process. Iterative in that you will build and rebuild as you get deeper into the project; incremental in that different parts of the warehouse will be constructed and delivered at varying rates. This is in direct contrast to the normal waterfall approach to releasing software applications and systems. It is not realistic, practical, or advised to attempt to deliver a DW/BI project in one shot. You may as well use your bullet for something else!

Passing the Baton

track baton pic.thumbnail Maintaining a DW/BI Environment Your DW/BI team must plan up front for the often complex handoffs between development and maintenance amid the ever-turning wheel of the DW Lifecycle (detailed here).

Those handoffs are critical. If you’re in a large organization (like me now), then you will be literally handing off the maintenance of the project to a complete different group. This group will need documentation and escalation procedures to monitor and respond to various exceptions that may occur in the production environment. While you can’t plan for every exception, you should prepare enough documentation so that all the basics are covered (for example, what should the maintenance team do if a SQL Agent Job reports a failure?). While some things will fall back on the Dev team, many of the maintenance tasks can be handled by a well trained support group.

If you’re in a small organization (like I used to be in), then you are the maintenance team. You have the added responsibility of maintaining a high-priority production system while you continue to build new pieces for the next release. This can be exhausting and stressful so be sure you automate as much of the maintenance part as you can! And be sure to allocate enough “business as usual” time in your schedule each week. Make sure your manager and/or project leader is cognizant of this added pressure and responsibility.

Back to the Start

Once you’ve passed the baton, you can go back to the beginning of the DW Lifecycle and start over. You may improve some processes (iterative development) while adding on new functionality (incremental development).

data flowchart.thumbnail Maintaining a DW/BI Environment For example, in your team’s first run through the lifecycle, you implemented a single business process dimensional model (retail sales) that allowed you to also produce a series of reports for your sales department. During this first run through, you accomplished quite a bit: You installed and learned how to use the toolsets, you built the dimensional model and ETL architecture, you designed a rudimentary web portal, you created a few useful reports, and you released all the pieces to your users. So far so good.

Now, this first version is out and the business users have some feedback. An issue log and wish list is compiled (perhaps a few new reports, an updated report template, and some user-defined filters for the portal). In addition, you want to also include a new business process, inventory, so you can expand the usage of the date warehouse in your business.

You and your team get back to work. The modeler begins to construct the new dimensional model while the ETL team works on the integration packages. This process takes the longest, and may be interrupted occasionally by the production team reporting some data quality issues or some failed packages.

In a separate thread, your BI application developers are busy with implementing many of the suggestions brought forward by the sales department. When finished, they release version 1.1 of the portal. Immediately, they begin working on some new inventory reports.

This scenario repeats with each cycle for each thread lasting anywhere from 3 weeks to 3 months.

Rosy Picture?

The above scenario really paints a rosy picture of the process. In reality, it doesn’t always work smoothly. But it does work if managed correctly. And it can be quite exciting when each iteration completes and the handover goes according to plan.

I would like to talk more about this subject in future posts. For now I hope my thoughts on the matter have left you with some insight!

 

Tags: , , , , ,

1 Comment

Data Quality (For the sake of the children!)

Here is a somewhat humorous example of how data quality can impact more than just your analysis or reporting. This example is good for me in particular because for so many years I have been dealing with municipal data quality, tax assessment, and billing. It alarms me how little data quality is discussed, and alarms me more when business talks about data quality because they think they need to, yet they don’t fully understand what it takes to implement and maintain good data stewardship. Anyway, perhaps a video like this can at least raise some eyebrows:

Tags: , ,

No Comments

ETL Subsystem 28: Sorting

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

Before you can aggregate, summarize, look up, or merge data, you will need it ordered in some way. Sorting, which changes the physical order of the underlying data, is seldom necessary. Instead, indexing, which changes the logical order of the underlying data is often all you need. The Kimball Group seems to use Sort interchangeably with Order, so I won’t deviate in this post. This may be because some databases vendors do so as well.

If you’re loading data from a mainframe, then the utility programs DFSORT or SORT will be of some use. Other utilities exist depending on the vendor, but these are the main ones. The idea is to sort the raw physical data before it hits the ETL pipeline. Doing so will save processing time. Otherwise, you can simply load raw data into your staging area and then use standard SQL to order your data.

Sometimes sorting is implicit, so be careful. In SQL Server, sorting occurs when you use ORDER BY, GROUP BY, SELECT DISTINCT, and UNION. You should, for example, avoid the SELECT DISTINCT clause whenever possible, as the additional sorting can hinder performance. Many programmers and (ehem) trained DBAs automatically add the distinct clause to their queries. Be certain that the distinct clause is necessary; otherwise performance will take a hit.

Beware that sometimes too much sorting can be a problem. Here is a general rule: Only sort when you need to! Sorting is an expensive operation and I’ve seen it overused and abused.

Clusters and Heaps

A clustered index determines the physical order of data in a table. That is, data is inserted into the table in order. Pages are linked for faster sequential access. This is in contrast to a heap table, in which data is added to the end of the table and the data pages are not linked.

Clustered indexes can make a big impact in the staging area. Consider a surrogate key lookup dimension that contains 4 columns: surrogate key, business key, valid_from, and valid_to. A clustered index on business key, valid_from, and valid_to will return the surrogate key upon lookup faster than if your lookup dimension was a heap. This is mainly because “queries that return a range of values using operators such as BETWEEN, >, >=, <, and <=" are well suited for the clustered index structure.

This works well during surrogate key lookups where we typically want the surrogate key of some business key on a particular date. Of course INSERTs and UPDATEs take longer on clustered tables because the physical structure of the table needs to be maintained. So be sure to balance speed with need. In my experience: Maintaining a clustered lookup dimension is almost always well worth the effort, especially when the original dimension is big and wide.

Best Practices

Here are some best practices that might help you keep your sorting operations under control:

  • Create one tempdb data file for each physical processor on the machine (more)
  • Put the tempdb files on their own drive (more)
  • Make sure your data types are appropriate for the column (storing numbers in a varchar field is not very efficient)
  • Keep the columns you sort on as narrow as possible (here is where good data profiling will help)
  • Integer sorting is much faster than character sorting, so consider using integers instead of character data whenever possible (more)

 

SQL Server 2005 Integration Services (SSIS)

As I have stated, sorting is often overused. The Sort component that ships with SSIS is good for small data sets that fit entirely into memory. Performance will degrade as soon as there is overflow. One way around the problem:

  1. In the Source Adapter, set the data access mode to “The results of an SQL statement”
  2. Use and ORDER BY clause in the SQL Statement
  3. Open the “Advanced Editor” of the Source component
  4. On the Input and Output Properties tab, click the Source Output node and set the property IsSorted to True.
  5. Under Output Columns, set the SortKey for each column involved in the index (start at 1, use a negative number to denote a descending index order)

 

Hand Coding with Visual FoxPro (VFP9)

Physically sorting a table in FoxPro is not necessary. The physical order of the table is irrelevant because of how VFP uses its indexes. You can test this easily by creating a table with a few million rows containing a randomly generated integer value and another column with some other data (perhaps system time or something). Using the SORT command (or SELECT INTO with an ORDER BY) create a second table in the order of the integer column. Without creating an index, attempt to retrieve a row from the table based on some integer value. Now, create an index on the column in both tables. You should notice no performance differences.

So while you’ll need to understand indexes and rushmore in VFP, you won’t need to bother with physical sorts.

From Here

In the next post, I’ll discuss lineage and dependency analysis. If you work in the financial sector (like me) then this topic will be of critical interest to you. If you don’t, I suspect you still care about lineage and dependencies!

Tags: , , , ,

2 Comments

ETL Subsystem 26: Version Migration

This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.

migrating birds ETL Subsystem 26: Version MigrationDepending on the lifecycle in place at your business, you will likely need to move your data warehousing project from the development environment into a production environment. I am currently using an alpha / beta / gamma system (with mixed emotions), where alpha is our development environment, beta is our integration, and gamma is production. I have also worked with systems that combined alpha and beta and had a special user acceptance environment instead.

Whatever you do, you’ll need to consider migrating from one environment to another.

Some things to consider:

  • Environments: How similar are your environments (both hardware and software)? Are there major differences between the development servers and the production servers? Ideally, all environments should resemble each other (integration, UAT, and production environments should really be identical). You don’t want to develop on a 64-bit Windows Server 2008 and deploy to a 32-bit Windows Server 2003 machine!
  • Security: Have you considered the different security and access requirements that exist between development and production environments? Will you, as the data warehousing team, have access to manage server jobs, file transfers, and file shares?
  • Version Control: Have you set up your version control in such a way that if an installation fails, you can revert back to a previous version? This is tricky, so be careful to set aside enough time for planning. You also need to keep in mind potential compliance issues, as it may be necessary to “pull out” an older version to see how data might have been manipulated differently than the current installation.
  • Resources: Who are the people in charge of the migration? Are they available when you need them?
  • Documentation: Does your organization require installation documents, verification plans, quality assurance sign-offs, user acceptance sign-offs, and back-out plans?
  • Protocol: What type of red tape must you cut through to get your data warehouse online? Who needs to sign-off on the deployment? What forms must be filled out? Is there an official change request procedure? ITIL and other systems can be good in theory but difficult to work with in practice. Take the time to learn what you need to do to get your system online.
  • Packaging: How will you package your software and database scripts? Will you use an installation program or will you have someone manually copy files? Will scripts be run individually? Will you use a deployment tool such as MSBuild?
  • Verification: Have you thought about how you can test the system in a way that flexes all new and updated features, while providing adequate regression to be sure you didn’t break something that was once working? Have you done a proper impact analysis on all new or changed components?
  • Data: Will you be migrating data changes as well? How will you get these data changes from one environment to the next? Has everything been scripted and tested thoroughly?

 

SQL Server 2005 Integration Services (SSIS)

SSIS makes deployment of packages from one environment to another relatively easy. You merely create a single package and supply different configurations for each environment. I am currently using XML configurations, but you might do better with database configurations, environment variables, or using a table in SQL Server. The basic premise is simple: Create variables in your package for things like file locations and connection strings. Using the Package Configurations dialog (SSIS -> Package Configurations…), enable configurations, click Add and then follow the Wizard. Check the SQL Server Development Center for details.

Remember that you can always set a configurable item (e.g. variables, connections, tasks) from the command line or in the SQL Agent Job. For the Agent Jobs, open the Job Step properties and from the General page, select “Set values”. When entering your values to configure, use the following syntax:

\Package\container.property

Example:

\Package.Variable[ServerName].value

For more insight on setting up jobs you can see my post “SSIS Packages and setting up Jobs“.

SSIS comes with a Deployment Utility that will bundle all the files necessary for deployment of the package into a single folder. Copy this folder to the destination server and run the Package Installation Wizard to install your packages. While this is a good idea, and does remove the possibility of human error during the installation process, I do not use the Deployment Utility. Call me old-school, but I prefer to do this bundling and installing myself through the dtexec command line utility — I like to see what is going on and it gives me an opportunity to catch surprises before they become trouble.

Hand Coding with Visual FoxPro (VFP9)

You can also use FoxPro to mimic the same kind of functionality you get in SSIS using package configurations. This is no different than using an ini file, a “setup” table, registry entries, etc. I would warn against using free tables if your deployment team does not have access to FoxPro. You would be better served using XML or some other plain text formatted file.

You can also automate much of your builds by utilizing VFPBuild. Find the related CodePlex page here.

From Here

As a developer, you are responsible for ensuring that the right version of your software makes it through the migration path with little or no hassle. It can be challenging enough dealing with hardware and software compatibility issues. Do your best to know the routine, package appropriately, and plan ahead!

Next post, I’ll talk about Work Flow Monitoring…

 

Tags: , , , ,

8 Comments