This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
Before I can discuss this subsystem, I need to discuss bridge tables. This is a topic I haven’t covered much until now. So to help me illustrate how bridge tables are used in a dimensional model, I’ll present a case for a real-estate property transaction.
Sometimes the defined grain of a fact has multiple dimension rows that characterize it. These scenarios are referred to as many-to-many or multi-valued dimensions. A patient encounter at a hospital might have multiple procedures. Multiple bank accounts might belong to a single person. In the case that follows, I’ll show how multi-valued dimensions help support property sales (i.e. a real estate transaction).
I would like to make one important point before I begin: If you can thwart the need for bridge tables by redefining the grain of your fact table to be more atomic, then you should do so. Bridge tables add complexity to the model, ETL process, and the end-user application layer. With that said, multi-values bridge tables in many environments are unavoidable and absolutely necessary.
Case Study: A Real Estate Transaction
In the case of a property transaction (aka a property “sale”), you are potentially dealing with several many-to-many relationships. First of all, you may not be buying alone (one-to-many buyers). You may also not be buying your house from a single person (one-to-many sellers). Your house may be on two parcels of land or no land at all if you are buying an apartment or condo (none-to-many parcels). A barn or in-law apartment may be included or perhaps you are just buying land to build your dream home later (none-to-many buildings). These and other complexities make the case for bridge tables to represent the many-to-many relationships inherent in a complex transaction such as this.
When you buy land or a house (in the US, at least!) you must “record” the sale with the clerk of the municipality in which the property exists. This recording becomes part of the chain of ownership and tax record for the property. This recording can accommodate multiple grantors (the people or organizations selling the property), grantees (those who are buying the property), and properties (multiple parcels, land lines, lots, buildings, outbuildings, units, etc). This recording is given a book and page and filed away in a database. A hard copy is typically stored in a physical book in the clerks’ vault as well. It’s pretty fun (for some people who are into history) to go to your clerk’s office and peruse the vault. Anyway…
What follows is a simplified ER model from an OLTP system I designed several years ago. This model represents the complexities of the above case in a transaction environment. It has been simplified to only show the more important business entities:
Note that in the above OLTP system, table Real represents land and building values from tax billing software. These figures have been aggregated into a single row (i.e. total land value and total building value, number of buildings, total acres, etc.). For the above model, knowing the exact building or land line is not important. For the Data Warehouse, though, we want and need this atomic detail for some detailed analysis. The Deed table represents the actual recording (book, page, date, time, etc.). The bridge table ND_link contains grantor and grantee information. NR_link represents ownership history between the names and entities in Namefile and the aggregate property information in table Real.
The business process dimensional model could therefore be modeled as follows:

As you can see, there are bridge tables between the SaleFact and DimBuilding, SaleFact and DimLand, and SaleFact and DimPerson dimensions. I typically name these tables by including the word “Group” in its name (LandGroup, BuildingGroup, etc.) but you could use the word “Bridge” or any other term that is consistent throughout your models. These bridges define the many-to-many relationships that are normal for a Real Estate transaction. Because a property sale is a natural business event, and cannot be logically defined to be any more atomic, multi-value bridge tables are a must.
In the above schema, a particularly astute and observant modeler will notice the many-to-many relationship between the keys in the fact and bridge. FoxPro, which is where I put together these examples, will not allow a many-to-many join between two keys. So, I Photoshopped the graphic in order to show you the logical representation! In FoxPro, and in many other database systems, you will need another table in between the Group and the Fact with a single attribute primary key. This would represent the physical implementation of the above design. You would therefore need two physical tables to represent the many-to-many relationship. I try my best to avoid this. In FoxPro, for example, I would simply not create the relation at all! Or in some cases, the bridge group would be unique for each event, making the foreign key in the fact table always unique. Being unique, it can be defined as a candidate key and then happily joined to the bridge table (more on this in the section below on “special considerations”).
Bridge Table Characteristics
Bridge tables are not just janitors or pianos (well, they do have all they keys!). Most bridge tables also contain attributes. For the above case, I include two attributes in each bridge: weight and primary indicator.
Weights are used to allocate metrics in the fact. Weights always add up to 1 across the group. If you need to allocate the sale price across all buildings involved in the sale, you would use the weight. Weights are calculated as part of ETL process and are an important part of the mechanisms of Subsystem 15. Sometimes, it is perfectly OK to divide 1 by the number of group records to get the weight (applying any fraction to the first row perhaps). But more often than not, there is a more complex algorithm to determining the weight. For land, you might base the weight on lot size or price code. For buildings, you may base the weight on the replacement cost of the building, or you might use square footage. All of these are defined by the business and applied during ETL.
I use primary indicators because some towns only care about the primary grantor, grantee, building, or parcel for analysis. One row in the bridge table will be identified as ‘primary’. During ETL, the primary row is identified by using a sequence number (number 1 = primary) or by determining the highest valued property. Sometimes, its arbitrary, in that you just mark the first one encountered as primary. These decisions are part of your business requirements and must be carefully implemented during data transformation.
Other Considerations
Guess what? Your bridge table is really a type of fact table. In the above schema, LandGroup is a sort of fact with some semi-additive metrics that represent the relationship between a parcel of land and a sale.
In my example, a sale is a static event. The buildings, land, and people involved will change over time, but the groupings used at the time of the sale will remain the same. For this reason, there is no need to make the group tables Type2 dimensions. If we ever need to change a value in one of these bridge tables, then it is likely to correct a data problem and not to represent a changing dimension. Additionally, in this scenario, it would be a safe assumption — especially because there are weighting factors and primary indicators involved in each bridge — that each grouping is unique to the fact. For modeling purposes, you could avoid the many-to-many relationship issue discussed in the previous paragraphs by making the bridge keys in the fact table candidate keys. There would be a unique set of group keys for these bridge tables. But be cautioned: The volume of property sales is small in any given year for any given municipality. You’ll never see a million transactions in a year. This makes the candidate key approach attractive. If you are dealing with patient diagnosis across multiple hospitals then your bridge tables will become too large. In these cases, you’ll need to consider re-using groups. The consequence of reusing groups is that the key in the fact table can no longer be a candidate.
To be clear: Bridge tables can also be a Type 2 SCD. If this is needed, and tracking bridge table changes over time is important, then you will need to combine bridge building with your SCD logic. You will need activity dates (from and to), and a current row indicator.
As I said at the top: If you can thwart the need for bridge tables by redefining the grain of your fact table to be more atomic, then you should do so. Bridge tables are not for the faint of heart!
SQL Server 2005 Integration Services (SSIS)
Bridge table building is largely a modeling and business rules problem. There is nothing in SSIS that will prohibit you from implementing even the most complex bridge table design. Most difficulties will come when managing the keys and implementing weighting factors. For the keys, you can use the same techniques described in my posts about Subsystem 10 and Subsystem 14. For weightings, I’ve implemented weighting algorithms using both stored procedures (Execute SQL Task) and VB (Script Component) when appropriate.
Hand Coding with Visual FoxPro (VFP9)
The above case is taken from an actual implementation I created in Visual FoxPro. I must say, it was rather simple once I ironed out the nuances and had my weighting algorithms in place. Developing the bridge table builder subsystem was a simple matter of inserting and linking the fact with the multi-valued entity and applying the weight and primary indicator logic. The resulting analytics that can be derived now far outshines that which could be done with the original OLTP model. Not to mention, the dimensional model has atomic detail (not like the aggregate Real table shown in the first schema) which allows for more slicing and dicing. But I digress!
From Here
Next post, I’ll discuss Late Arriving Data (Subsystem 16). That’s a big topic, so I’ll distill it as much as possible to keep it brief!
#1 by andrea at March 5th, 2009
| Quote
hi,Tod
thx for ur article!
btw, could u tell me how what the ‘xxgroup’ table works.
i saw there is many-to-many relationship between fact table and landgroup.
could u tell the difference that if the land table directly link to fact table, also is many-to-many.
#2 by Tod McKenna at March 6th, 2009
| Quote
Hi Andrea,
In the above example, a single “sale” could involve none, one, or many parcels of land. The only way to represent this relationship (with the grain I have defined) is using the bridge table (landgroup).
The M2M relationship is logical and not necessarily physical. It would depend on your RDBMS and your analysis needs.
How it works? Well, each group that I create has a surrogate primary key. I repeat this key in the physical table (eg landgroup) for each group member. The actual primary key of the bridge is formed by taking the group key + the land key (in the case of landgroup). But in the Fact table, I store just the group key. Note that I also have a group for “None” and for single group members too.
Let me know if I’ve answered your question!
#3 by andrea at March 6th, 2009
| Quote
Hi,Tod
Thank u very much.
I’v got it. But now i’ll talk away from the point.
I notice that u said the m2m relationship is logical.
Is it still a star-schema, or snowflake-schema?
With this m2m modeling , any reporting tool can building semantic-layer or doing analysis directly, or need a special treatment first?
#4 by Tod McKenna at March 24th, 2009
| Quote
Hi Andrea, Sorry I missed your response until now!
Bridge tables can be used in a lot of different ways. Another common use is to replace recursive pointers for raged hierarchies. Here, I’m representing m2m relationships among the different data elements. You can also use bridge tables to show relationships between dimensions (although these could likely be classified as factless facts)
To test if the design is a still star schema, and not a snowflake, I simply look at the level of normalization present. Snowflakes are a product of too much normalization. Having bridge tables in your design does not turn your star into a snowflake. If you have any outrigger dimensions, then you have a snowlfake. Bridge tables, also called “helper tables”, do not change the design.
All good reporting tools can handle m2m relationships through bridge tables as far as I know.
#5 by mark at May 12th, 2009
| Quote
I’m all for reusing bridge tables rather than reconstructing them each time a snapshot fact record is created, but how do you ‘inventory’ bridge tables? Some may contain 2, 3, or 50 records within the group, and uniquely identifying these groups for reuse is proving to be tricky. Any ideas?
#6 by Tod McKenna at May 13th, 2009
| Quote
I suppose it depends on the nature of the group. Some groups have obvious group identifiers. For example, a sales team might share orders. The team could be given some unique identifier. If the makeup of the team changes, a new identifier would be created. The team might be involved in thousands of orders, making reuse a good idea.
Other groups are not so obvious (landGroup and buildingGroup in my examples above are a good example). The approach that I use in this case is simple: I assign a group key during ETL that uniquely represents the group.
You also have to consider weighting. If group members are weighted participants in the fact event (i.e. a row in the fact table), then reusing groups is not appropriate. If you do not use weights, then assign a group primary key during ETL that will be unique to each group. You need this anyway when you represent the group in the fact table.
#7 by mark at May 13th, 2009
| Quote
Thanks for the quick response Tod.
What you say makes sense.
The business case I have is for insurance, where for any given policy you might have multiple groups each representing parties that have a stake in selling the policy (i.e., multiple brokers, sales offices, sales reps each having a piece of the premium pie).
For simplicity I’ll stick with just one group, the Sales Rep group. Let’s say there’s 3 Sales Reps that split the premium, and that these Reps may come and go over time, both in makeup and in % split. Rep splits are common in the industry, if you think of it like a mentor-student relationship. Initially Rep A has %60, Rep B has 40%. Rep A leaves and is replaced by Rep C, and now the split for the same policy is Rep B 50% and Rep C 50%. I already have an inventory for Reps (the Rep_Dimension) but I would have an inventory for Rep groups, because for most policies the usual suspects are the same (i.e., Rep B and Rep C might pair up 50/50 split weight for 100+ policies). I’d rather pull up the combination of Rep B, Rep C and their 50/50 split rather than reconstruct this grouping each time (1 grouping instead of 100 groupings).
The underlying question is, what to use as an AK so that you call pull up a group in order to reuse that group? If this is not feasible, then we’re looking at recreating the group each time the fact table is built, and in this case of 3 Reps, 2 group (bridge) records would be created with new SK’s for each 1 fact record. Sometimes it’s a 2-way split, sometimes it might be more (3, 4 way splits are not uncommon).
I had thought about employing some sort of CRC-calculation to fingerprint the record groups uniquely, similar to how you would use CRC to uniquely id a single record, but not exactly sure how this would work or if there were a more elegant solution.
#8 by Tod McKenna at May 16th, 2009
| Quote
In the case of the Sales Rep group, I think you’re spot on. If the same two reps combine for a 50/50 share a number of times, it makes sense to reuse the group.
On the surface, the CRC approach seems to have some complexity that might make it too much hassle to be a good option. I’m not sure, for example, how you can represent n rows with varying percentages easily in a CRC.
I would approach the problem in this way:
If the grain of the fact table is “one sale rep group per sale”, and that a sale rep group contains 1 or more individual salespeople, then I need to represent this sales group using a single key. I always use surrogates, so in my ETL I would (a) identify the group members and their weights, and (b) do a surrogate key lookup on the bridge table for this exact makeup.
This is where it gets interesting because you need to match on more than one row. With some clever SQL, or at least with some sort of loop construct, you can get this done. You should be able to find the group and fetch the group key (which would be the same for each row in the group) and use that in the fact table.
Another way to attack a problem like this is to think about the grain of the fact table. It might be possible to redefine the grain so that the weighting is factored into each metric in the fact. “One row per sales rep” would allow you to roll up the fact rows by order number or similar degenerate dimension attribute. But I don’t think this is any more elegant than what you and I have already mentioned.
#9 by mark at May 19th, 2009
| Quote
Tod;
Thanks. The fact table I inherited is trying to do too many things for all people. It’s used for detail, aggregate, etc. In trying to be all things it’s not doing them very well. The rep split is just one of the dimensions – there are 3 others, resulting in 27 fact tbl rows (3 Brokers, 3 Offices, 3 Reps) , yet some metrics are at a higher level and so they’re fractionalized to make this all, well, fit.
I ended up determining the max nbr of Rep splits, and plan to build a Rep Group Control table to maintain the inventory of groups. I’ll use as the key RepSK1, RepSK2, Rep3Sk… RepSplit%1, RepSplit%2, RepSplit%3… up to a max of 10 splits. This way we can reuse groups, some of which may apply to over 100+ policies. It’s not elegant but it works. Whenever a Rep SK changes, we will still need to rebuild the group, but it’s still better than recreating new groups each snapshot regardless of a change or not.
I agree that the grain of the fact tbl needs some work as well…. or split it into a separate fact table.
That’s what happens when you let developers lead the design…..
Thanks