![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqSKQcyR7716MvrG4pAoE7vl4zCvZTo6pn45HwY2TpCtqeUitSIw0e4FZydQVMWDKZ8oNVMYYW04_z4K2PJJbLjLxmii_UXn6Ln7BKY_9eY7Gg8EDeXjVrqDlPz2hq0fzhb0ANd8TYKcPE/s320/Simple+Einstein.jpg)
Some might argue that the fact table will be thinner with a numeric foreign key rather than the 13 byte ISBN. That is true, however I think it's risky to remove the natural key (ISBN) from the fact table as it makes reconciliation with ISBN difficult in the event of a bug in the synthetic key allocation. Also, the fact and dimension tables might like indexes on these keys, and with natural and synthetic keys, there might be a need for twice the number of indexes. Overall, I would argue that natural keys offer higher performance.
I have seen datamarts mystified by inappropriate use of surrogate keys. To illustrate, imagine the titles dimension constructed with a synthetic surrogate key and the ISBN no longer unique. This might be argued by a few to be superior as it would support historical integrity. (Hmm, like historical integrity is necessarily superior.) Think about a slowly changing ISBN! Someone enters a new title as "Kate the Cat", when it should have been "Katie the Cat". After a few weeks, it gets noticed and is corrected. However the SCD datamart is too clever and says that it was "Kate the Cat" for those first few weeks, so it must remain so, for historical integrity! From now on, users of this dimension must combine "Kate the Cat" with "Katie the Cat" to get sales information on this book. This can be frustrating for users, who know only one book called "Katie the Cat".
I have also seen datamarts where the dimension tables have blown out by more than 20 times (1 Million customers turns into a 20 Million row customer dimension) because all attributes are maintained with SCD logic. So each customer record had been updated an average of 20 times. Sometimes these changes are irrelevant to the business regarding SCD, such as a field containing the number of Customer Service calls. In this case, every time the customer calls up, there will be a new record in the customer dimension, with just the call count incremented by 1. For what purpose?
I have also seen date dimensions with surrogate keys! This also puzzles me. Does the developer believe that attributes of a date will change and that reports should show those dates with historical integrity? To my simple mind, that just seems to be a quick way to make a datamart/DW more complex for no benefit. Unless, there is benefit in increasing future work loads as maintenance and reconciliation become bigger chores.
I would add, that if you really do identify a need for SCD I would encourage you to only maintain SCD on the identified attributes. Do not apply SCD logic to all attributes blindly.
Please don't see this blog as anti Kimball. I truly believe that he has brought a lot of rigour and maturity to data warehousing. However, to follow his dimensional modelling blindly is not pragmatic and does not help organisations trying to build business intelligence.
No comments:
Post a Comment