Wednesday, November 30, 2016

Merge into a temporal table

SQL Server 2016 supports Temporal Tables (system-versioned tables, which give you Type 2, or historical integrity, behaviour). Unfortunately, in this first release you cannot use a change datetime value from the source table; history rows are stamped with the system time at which the update occurs. So, if you want to use your source system change datetimes, you won't be able to use Temporal Tables. Fortunately, the MERGE statement is well suited to maintaining the history yourself.
Here is a sample MERGE statement for a self-managed Temporal Table.
Note, the transaction and TABLOCKX are optional. I like using the table lock when there is a possibility of lock escalation; two concurrent lock escalations on the same table are a near-guaranteed deadlock. The TABLOCKX ensures the MERGE does not start until it can take a full table lock, so there is no chance of deadlock. This comes at the cost of a little concurrency.
This statement performs two inserts. The first (the outer INSERT over the MERGE output) inserts the newer version of modified records; the second (inside the MERGE) inserts new, not-matched records.
Please treat this statement as a sample. You may want to change the logic for EffectiveFromDate if you know the "real" time the record became active (CreatedDate, for example).

     BEGIN TRANSACTION; -- optional; keeps the TABLOCKX held for the whole statement

     SELECT TOP 0 * FROM MySchema.MyTable WITH (TABLOCKX); -- take an exclusive table lock up front

     INSERT INTO MySchema.MyTable (IsCurrent, EffectiveFromDate, EffectiveToDate,
                                   MyKey1, MyKey2, MyAttribute1, DatetimeChanged)
     SELECT 1               AS IsCurrent
           ,DatetimeChanged AS EffectiveFromDate
           ,'9999-12-31'    AS EffectiveToDate
           ,MyKey1, MyKey2, MyAttribute1, DatetimeChanged
       FROM
         (MERGE MySchema.MyTable Dst
             USING LifeSTG.MySchema.MyTable Src
                ON Dst.MyKey1 = Src.MyKey1
               AND Dst.MyKey2 = Src.MyKey2
               AND Dst.IsCurrent = 1
             WHEN NOT MATCHED
                 THEN --record from source doesn't exist in Destination
                      --INSERT the record into Destination
                      INSERT (IsCurrent, EffectiveFromDate, EffectiveToDate,
                              MyKey1, MyKey2, MyAttribute1, DatetimeChanged)
                      VALUES (1, Src.DatetimeChanged, '9999-12-31',
                              Src.MyKey1, Src.MyKey2, Src.MyAttribute1, Src.DatetimeChanged)
             WHEN MATCHED --record from source exists in Destination, so check if any of the attributes have changed
                 --AND Dst.DatetimeChanged <> Src.DatetimeChanged --using DatetimeChanged, checks on other columns not required
                 AND CHECKSUM(Dst.MyAttribute1)   -- add further attribute columns to both CHECKSUMs as needed
                  <> CHECKSUM(Src.MyAttribute1)
                 THEN --Update the matching record as no longer current because the attributes have changed
                      UPDATE SET Dst.IsCurrent = 0
                                ,Dst.EffectiveToDate = Src.DatetimeChanged
             OUTPUT Src.MyKey1, Src.MyKey2, Src.MyAttribute1, Src.DatetimeChanged
                   ,$action AS MergeAction
         ) MRG
      WHERE MRG.MergeAction = 'UPDATE';

     COMMIT TRANSACTION;
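
Once the table is maintained this way, point-in-time queries become simple range predicates on the effective dates. A minimal sketch (the key values and @AsAt date are illustrative only):

```sql
-- What did this business key look like as at a given point in time?
DECLARE @AsAt datetime2 = '2016-06-30';

SELECT *
  FROM MySchema.MyTable
 WHERE MyKey1 = 42                      -- illustrative key values
   AND MyKey2 = 7
   AND EffectiveFromDate <= @AsAt
   AND EffectiveToDate   >  @AsAt;      -- current rows carry EffectiveToDate = '9999-12-31'
```

Omitting the @AsAt predicates and filtering on IsCurrent=1 gives you the current view of the table.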

Thursday, September 1, 2016

Query Plans and Parameter Sniffing

It's not uncommon (and often quite annoying) for an SQL query to "flip". This is where the optimiser decides to use a different query plan even though the query code has not changed. This can be due to many things, most commonly changes in statistics (for example, the table is now so large that an index is preferable to a table scan), but it can also be due to "parameter sniffing". This is where the optimiser "sniffs" the parameter values of the execution that invokes the compile and uses those values (along with data statistics) to determine an appropriate query plan. Since query plans are cached, this can mean that the first execution of a procedure generates a plan that is inappropriate for subsequent executions with different parameters.

I don't want to cause anxiety; it is generally a good thing that the optimiser dynamically takes parameters, data statistics, physical resources etc. into account in determining the best query plan. However, consider a procedure that has two parameters, @AccountNumber and @AccountType. When the procedure is first called, it may use the @AccountNumber and @AccountType values to determine appropriate query plans. If AccountType='X' is very rare in the database, the optimiser might, quite rightly, choose a query plan using an index on AccountType. However, this index would be inappropriate when @AccountType='A', which is the value for 99% of the data records. This is problematic, as the cached query plan could be favouring either parameter value, randomly determined by the "first" execution. Here are a few techniques to help with this issue.
  1. Keep the base procedure, but have it call one of two other procedures depending on @AccountType value. Each of these sub procedures will be compiled on appropriate values.
  2. Use WITH RECOMPILE on the procedure, so that it recompiles with every execution.
  3. Declare a new parameter in the procedure @AccountType2, which is set to @AccountType within the procedure and @AccountType2 is the one referenced in the query. This way, when the optimiser "sniffs" the @AccountType2 parameter it will always be the same initial value. Note, the sniffing happens when the proc is called, not when the statement is about to be executed.
  4. Part of the solution might be to disable automatic updating of statistics and to update statistics manually (followed by performance and plan flipping tests). Note, there is a lot more to statistics, and this is probably only appropriate for a tight OLTP environment. Disabling automatic updating can negatively impact performance of ad hoc queries that use recently inserted key ranges. Also note: updated statistics will trigger recompilation of dependent query plans.
  5. Use the OPTIMIZE FOR hint. However, this is often not appropriate, as there is no single value that will continue to be appropriate in the future. Note, since SQL Server 2008 you can use OPTIMIZE FOR UNKNOWN, which is similar to point 3 above.
  6. Use the KEEPFIXED PLAN hint, which inhibits recompiles due to optimality issues.
  7. Use a plan guide. Technically, this solution works, but it does add to operational overhead and the plan guide may not be appropriate in the future.
  8. Enable Trace Flag 4136, which will disable parameter sniffing. Since parameter sniffing is generally a performance benefit, this is a bit drastic.
  9. Force the use of a particular join and/or index. This is frowned upon for good reason.
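To make technique 3 concrete, here is a sketch of the parameter-shadowing pattern (the procedure, table, and column names are hypothetical). The optimiser cannot sniff a local variable, so the plan is built from average density statistics, much like OPTIMIZE FOR UNKNOWN:

```sql
CREATE PROCEDURE dbo.GetAccounts        -- hypothetical procedure
    @AccountType char(1)
AS
BEGIN
    -- Shadow the parameter; the optimiser cannot sniff the local copy,
    -- so the plan does not favour whichever value happened to compile first.
    DECLARE @AccountType2 char(1) = @AccountType;

    SELECT AccountNumber, AccountType
      FROM dbo.Account                  -- hypothetical table
     WHERE AccountType = @AccountType2;
END
```

The trade-off is that you also lose the benefit of sniffing: the plan is tuned for the "average" value rather than any particular one.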
Note, SET options can affect plan creation and reuse. This means that if you want Management Studio to use the same plan as the application, you will need to ensure your SET options are the same as the application's. SET options that affect reuse include ANSI_NULL_DFLT_ON, ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, ARITHABORT, CONCAT_NULL_YIELDS_NULL, DATEFIRST, DATEFORMAT, FORCEPLAN, LANGUAGE, NO_BROWSETABLE, NUMERIC_ROUNDABORT, and QUOTED_IDENTIFIER. The SET options used in the compile can be found in the query_plan under the StatementSetOptions tag.

Again, plan flipping is not something that should cause you anxiety. However, it is good practice for database developers to recognise plan flipping possibilities and use the techniques above to stop it causing performance issues. Most optimiser-generated plans are very good, often better than what an above-average developer would choose by hand. I encourage developers to make use of point 1 above where there are distinct "classes" of parameter values.

Here is a useful query to see the query plans for a particular procedure. Click on the query_plan to see a graphical plan view. Note, this query is looking for procedures only; you can also use p.objtype='Adhoc' for ad hoc queries that are compiled as "temporary procs". You can view the parameter values used for compilation of a query plan from its query_plan: simply copy out the query_plan XML and search for ParameterList.
   SELECT p.*, st.text, qp.query_plan
      FROM sys.dm_exec_cached_plans p
            CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) AS st
            CROSS APPLY sys.dm_exec_query_plan(p.plan_handle) AS qp
      WHERE st.text LIKE '%MyProcedureName%'
        AND p.objtype = 'Proc'
Here is a useful query to see the performance of multiple query plans.
   SELECT * FROM sys.dm_exec_query_stats
      WHERE plan_handle IN
                  (SELECT p.plan_handle FROM sys.dm_exec_cached_plans p
                        CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) AS st
                        WHERE st.text LIKE '%MyProcedureName%'
                          AND p.objtype = 'Proc')
Here is a query to list all the procedures that have more than one query plan.
   SELECT db_name(st.dbid) AS DbName, st.text, COUNT(DISTINCT p.plan_handle) AS PlanCount
      FROM sys.dm_exec_cached_plans p
            CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) AS st
            CROSS APPLY sys.dm_exec_query_plan(p.plan_handle) AS qp
      WHERE p.objtype = 'Proc'
      GROUP BY db_name(st.dbid), st.text
      HAVING COUNT(DISTINCT p.plan_handle) > 1
      ORDER BY 1, 2

Tuesday, July 19, 2016

Guiding Principles of Data Warehousing

I know Kimball and Inmon have written vast texts on data warehousing; so much, in fact, that you can find literature to support or criticise almost any design. On this page I want to give the 6 Guiding Principles that I use when designing a bespoke data warehouse. A corollary to my principles is that no particular design feature is inherently bad; it can only be deemed appropriate or inappropriate in a particular implementation.

1. No Loss of Data
All information taken from source systems should be held without losing information. This might sound simple, but there can be subtle losses of data through omitting columns, trimming columns, correcting data, denormalising, transformation, cleansing, etc. The test: any question that the source system can answer (within the tables extracted) should also be answerable from the DW, with exactly the same answer.

2. Keep it Simple
Minimise physical layers of transformation that make it difficult to determine data at source. A common question will be “why does this record have that value in the DW?” The simpler, and less physical, the transformation, the easier it is to reconcile DW data with source systems. By making reconciliation easier, we will be able to build trust that the DW is valid and can be relied on to give a “correct” answer. If there is a bug in the ETL, it will be easier to uncover in a simple DW. Once a view of the data is materialised in the DW, it becomes more difficult to ascertain where the data came from and which insert and update statements have acted on the data.

3. Ease of Enhancement
This is closely related to the second principle of keeping it simple. The simpler the architecture and the more minimal the physical transformation, the easier it will be for future developers to enhance the DW to add or extend data sources or perform new analysis. DWs are invariably never finished, as changes in business activity, regulatory requirements, enterprise systems, competitive action, supplier action, etc. will drive new DW information and analytical requirements. The best indicator of ease of enhancement is how much development effort the DW took to create, since enhancements will be a proportion of the initial development effort. A DW that took 10 developer-years to create will take roughly 10 times the effort to enhance compared with a DW that took 1 developer-year, even if both DWs contain the same information.

4. Historical Integrity
If the DW users require it, historical integrity should be maintained so the DW can report on what the values "were" at the time of the transactions (facts). While it is important to maintain this information, it can be expensive, so it should only be maintained on tables that require it and only on columns of these tables that are worthy of it.

5. High Performance and Room for Scale
The DW should be able to perform well with the anticipated volumes. There is no excuse for slow queries when both the queries and the anticipated table sizes were known in advance. Of course, as the DW grows in unanticipated ways and new queries/analyses are written, further optimisation will be required. The initial design should take into consideration the size of the tables and how they will be joined with related tables. This will lead to some frequently joined large tables having similar clustered indexes, and appropriate normalisation, partitioning, etc. These considerations are often very easy to implement at the beginning, but become more time consuming once the DW is implemented and built upon.

6. Appropriate Design
Just like building architecture, there is no one design that suits all. The design should be appropriate for the purpose. We won’t follow others’ design guidelines blindly.

Saturday, July 16, 2016

Don't Do Time, Do The Distance

Every now and again I take this blog off-road with a reminder that there is more to life than work. This is one of those times.

My friends all know that I enjoy exercising - everyday. The reason I enjoy it is that I do the distance, not the time. So my big tip for anyone wanting to enjoy fitness and mental health is to do the distance, not the time.

There is a subtle distinction between time and distance. Doing the time, is exercising for a preset time, whereas doing the distance is completing a preset distance (or task). You might end up getting the same exercise, but the difference in enjoyment and satisfaction is huge.

By doing the distance, you are thinking about achieving something. Whereas doing the time, you just wait for the time to pass and it drags on. Just as the least enjoyable jobs are where you follow the clock, exercise by time is a drag. The most enjoyable jobs and exercises are ones where you achieve something, semi-regardless of the time.

Also, while you are working to achieve the distance, you will tend to run/swim/pedal/skip etc. more efficiently. This comes naturally as you aim to get to the finish with the least effort. You will try out different techniques and learn more efficient styles, all the while getting there in a more graceful way each day. This is a positive feedback loop where you get the benefit of getting faster and stronger while finishing with less effort.

Arthur Lydiard understood this all those years ago when he had his Olympic running hopefuls repeatedly run over a preset course in the Auckland hills. In my seaside village there are hundreds of swimmers who take to a 1.2km ocean course every day. They love it, and a big part of it is that they are achieving their goal every day. As each day passes, they find it easier and more graceful than the day before. Even if your preference is going to the gym, try setting the activities and reps in advance, rather than staying there for x minutes.

So, get out there and do the distance by walking home, to the next station, or to your favourite café. Each day you will get healthier and stronger with a sense of grace and achievement.

Monday, July 4, 2016

Relative Dates, Today, Yesterday, This Week, Last Week etc

A common requirement in Business Intelligence is for reports on relative dates, such as

  • Today,
  • Yesterday
  • Same Day Last Year
  • This Week
  • Last Week
  • This Period
  • Last Period
  • This Year
  • Last Year
These are dates relative to today on the fiscal calendar. It is possible to create these members on the Time Intelligence dimension. However, since they are so widely used, and often in combination with other Time Intelligence members, there is better performance and better functionality if these members are in the time dimension.

Putting them in the time dimension is relatively easy, all you need is an existing time dimension that has Date as a level of granularity and the ability to use a view (or the SSAS dsv) to create additional date attributes.

In your Date view, simply 
  1. Add a column for RelativeDay, another column for RelativeWeek, etc. Each column will hold the "relative" value. For example,

        CASE
          WHEN CONVERT(DATE, GETDATE()) = MyDateTable.TheDate THEN 'Today'
          WHEN CONVERT(DATE, GETDATE()-1) = MyDateTable.TheDate THEN 'Yesterday'
          ELSE 'Other Days'
        END AS RelativeDay
  2. Add additional WHEN clauses for Day Before Yesterday, Same Day Last Year, Last Year, etc. if you want them.
  3. Add an attribute for each relative date into the Date dimension. One for RelativeDay, one for RelativeWeek, one for RelativePeriod etc.
  4. Then, all you need to do is ensure that the Date dimension is updated early every morning. You are probably updating your dimensions many times per day. By default dimension attributes are flexible. These attributes need to be flexible as they are changing, literally every day.
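For the RelativeWeek column, a similar CASE works. Note that DATEDIFF(WEEK, ...) counts week boundaries using a fixed Sunday-based boundary (it ignores the session's DATEFIRST setting), so adjust the logic if your fiscal week starts on another day. A sketch, with MyDateTable.TheDate as the assumed date column:

```sql
CASE
  WHEN DATEDIFF(WEEK, MyDateTable.TheDate, GETDATE()) = 0 THEN 'This Week'
  WHEN DATEDIFF(WEEK, MyDateTable.TheDate, GETDATE()) = 1 THEN 'Last Week'
  ELSE 'Other Weeks'
END AS RelativeWeek
```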
From then on, cube users and reports can use these relative dates. For example, a classic use would be to have a report that filters on This Week and Last Week, and shows 8-14 columns, one for each day in the last 1-2 weeks. Whenever a user requests this report the default showing will be for the last 8-14 days, without the user having to select the actual date.

Adding these relative dates will not increase the size of your cube. The date dimension will be marginally bigger. These relative dates are easier to use than Time Intelligence and perform much better. Especially better when you want relative dates and Time Intelligence.

Saturday, April 30, 2016

I Don't Know

These are 3 words I like to hear.

It means the person has the confidence to say there is something they don't know. Surely, everybody has things they don't know? However, in my industry (computer software), it appears that there are many that find "I don't know" (IDK) as tough to verbalise as "I love you". Why do technicians have trouble saying IDK? Is it insecurity or a fear of appearing ignorant?

Making up answers is unhelpful. It can lead to poor decisions. By hearing IDK, it gives you the chance to research the correct answer before taking action. A simple analogy would be asking a colleague if it's raining outside. If your colleague can't say IDK and gives you their best guess, they might be right. But if they're wrong you will get wet or take rain gear unnecessarily. The more useful response would be IDK, or "IDK - but it was sunny when I was out an hour ago".

I am happy to seek information from people who can say IDK. People who make up answers are the ones we avoid asking in future. I found early on in my career that saying IDK can also reduce tension. You are saying that it's OK not to know. I typically follow up with "but I want to know and will find out for you".

It takes a self-assured person to say 'I don't know', so let's all be self-assured and more helpful.


Wednesday, March 16, 2016

Memory is Oxygen for Large Database Systems

I am frequently invited in to assist with database (OLTP, DW and OLAP) performance, and often the issue is IO throughput related. So, the solution tends to involve tuning queries (and indexes) to reduce IO or improving the IO capacity of the SAN (or both). Of course, in all of these IO related performance issues, RAM can be a great elixir. Databases love memory. They are designed to use RAM by holding popular pages, compiled plans and work tables in memory. So, if you have a large database with a performance issue; very quick, and relatively cheap relief might be gained simply by adding RAM.

SQL Server is particularly good at exploiting available RAM - and (just as important) reducing memory demands if the operating system is paging. With the default SQL install options, all you need to do is give RAM to the OS and SQL will use it to hold more data and plans in memory. Sometimes this has a compounding benefit since, by reducing IO demands, the SAN is no longer a bottleneck and the remaining IO executes much faster.

Multidimensional OLAP databases also like to have lots of RAM - even more than the combined OLAP database size! See earlier blog. Unfortunately, SQL Server OLAP is not as memory-efficient as the relational engine, but it will hold data (and aggregations) in memory, improving query and processing performance. SQL Tabular is an exception, as it requires the entire cube to reside in memory; it will simply not work without sufficient memory.

RAM, by and large, has no licencing cost with SQL Server, which can be a big consideration. Adding CPUs will generally incur significant OS and SQL licensing costs. I said, by and large, since with SQL 2012 and SQL 2014, there are, unfortunately, limits on SQL Server Standard Edition memory.
  • SQL 2008 Standard Edition has no memory limit
  • SQL 2012 Standard Edition has 64GB memory limit
  • SQL 2014 Standard Edition has 128GB memory limit
  • (SQL Server Enterprise Edition versions support their OS limits)
So, if you have a performance issue with a large database, and can't wait for a database tuning expert; before adding CPUs, Fusion IO cards etc.; try giving your system more RAM.
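
If you do add RAM, check that SQL Server is actually allowed to use it. One illustrative way to inspect and raise the cap (the 57344 figure is just an example for a 64GB box, leaving roughly 8GB for the OS):

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Show the current cap (value is in MB)
EXEC sp_configure 'max server memory (MB)';

-- Example only: cap SQL at 56GB on a 64GB server, leaving headroom for the OS
EXEC sp_configure 'max server memory (MB)', 57344;
RECONFIGURE;
```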

Monday, January 25, 2016

Basket Analysis Made Easy

This article is aimed at technologists who hear about Basket Analysis and need a simple explanation with examples. Basket Analysis used to be the preserve of high cost specialist providers. Now we have OLAP technology and hardware that makes Basket Analysis available at commodity prices. Everyone involved in retail analytics should understand Basket Analysis. Essentially, Basket Analysis supports analysis of sales by the combinations of items purchased.

To elaborate, I will provide two classic examples of Basket Analysis.

1. Menu Trim
The business would like to remove a couple of low value items from the menu (product list). A smaller menu generally supports lower operational cost and inventory investment. Candidate items might be the lowest by sales value or gross profit contribution. We observe that the lowest value items are Frozen Yoghurt and Fruit Salad, which we can find with a simple slice of items and their sales/gross profit. However, this does not take into account whether customers come in to buy the product on its own, or whether they are buying lots of other products too. We can see from the Basket Analysis query below that Frozen Yoghurt customers buy lots of other products with their Frozen Yoghurt. Basket Analysis can sum all the products the customer buys with Frozen Yoghurt, which we call Affected Net Sales. This suggests that if we remove Fruit Salad from the menu we only risk the $15,414 of sales, but if we remove Frozen Yoghurt, we risk $1,899,151 of sales. We say the sales are "at risk" because the customers might be coming in especially for the Frozen Yoghurt, and purchasing other products while there. Without Frozen Yoghurt these customers might go elsewhere for their entire basket.
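
As a sketch of how Affected Net Sales can be computed in SQL (the Item, SalesLine, and TransactionHeader tables and their columns are hypothetical): for each item, sum the total value of every basket that contains that item, counting each basket once.

```sql
SELECT i.ItemName
      ,SUM(th.TransactionTotal) AS AffectedNetSales
  FROM dbo.Item i
  CROSS JOIN dbo.TransactionHeader th
 WHERE EXISTS (SELECT 1
                 FROM dbo.SalesLine sl
                WHERE sl.TransactionId = th.TransactionId
                  AND sl.ItemId       = i.ItemId)    -- the basket contains this item
 GROUP BY i.ItemName
 ORDER BY AffectedNetSales DESC;
```

The EXISTS avoids double counting a basket when the same item appears on more than one line of the transaction.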

2. Attracting Large Value Transactions
Your business might have a strategy to increase sales by promoting menu items that tend to be involved in large value transactions. Basket Analysis can help you do this by identifying menu items (product items) involved in large transactions. In the example above, Avocado Toast has a high transaction average of $105, while the average transaction for all items is $45. So, the business strategy might be to discount and/or promote Avocado Toast to attract these high value customers. The marketing geniuses will work out a way to attract them, which might be by discounting Avocado Toast, offering a discounted side item, promoting quality/health features, inter-company coupons, etc. Once the campaign starts, Basket Analysis can very succinctly report on its success. One of the key indicators will be the transaction average of Avocado Toast transactions during the campaign. The sales of Avocado Toast might increase two-fold, but if the transaction average is halved, you are probably only attracting customers who purchase Avocado Toast alone. That is, you are not attracting customers with large transaction values. In the chart below, we can see that the Avocado Toast promotion has been successful in increasing Avocado Toast transactions; however, these additional transactions have been small value transactions. The marketing geniuses might need to try a new strategy!
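
The item transaction average can be computed in the same shape as the Affected Net Sales query; a sketch against the same hypothetical schema, averaging the total of each distinct transaction that contains the item:

```sql
SELECT i.ItemName
      ,AVG(th.TransactionTotal) AS ItemTransactionAverage
  FROM dbo.Item i
  CROSS JOIN dbo.TransactionHeader th
 WHERE EXISTS (SELECT 1
                 FROM dbo.SalesLine sl
                WHERE sl.TransactionId = th.TransactionId
                  AND sl.ItemId       = i.ItemId)
 GROUP BY i.ItemName;
```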

I hope this has shown how Basket Analysis can help businesses create better strategies and measure how well those strategies are working. These are just two simple examples of an infinite number of possibilities, limited only by your creativity. Basket Analysis has become a commodity feature in retail analysis and is becoming an essential tool for successful retail strategies.

Thursday, August 27, 2015

Supporting Multiple First Day of the Week Calendars in a Cube

A not uncommon requirement is to support weeks with different start dates. For example, the Finance team might want to see weeks as Tuesday through Monday, while store manager A might prefer to see weeks from Monday through Sunday, and store manager B might prefer Saturday through Friday. Do not create new dimensions for this requirement; that is unnecessary and would lead to degraded query performance.
It is quite simple to support multiple week starting dates, without complicating your cube or degrading performance. You should already have a Date dimension, with one member per calendar day. That's all you need. Now, just add attributes to each day denoting the weekno and dayofweek for each of the, up to 7, week start days. Just to be clear, the number of records in your Date table/view will not change, just new columns.
In your existing date dimension table (or view, in my projects) add up to 7 new columns: WeekStartingMonday, WeekStartingTuesday, WeekStartingWednesday, etc., which will contain the WeekId for the week defined by that start day. For the first (Monday through Sunday) week in the year, those days will all have a WeekStartingMonday key of 1. For the first (Tuesday through Monday) week in the year, those days will all have a WeekStartingTuesday key of 1. To put it another way, each attribute will have a number between 1 and 53 denoting the week number. These 7 attributes will be subtly different, as WeekStartingTuesday will have Tuesday as the first day of the new week, etc. You could give these weeks captions as well, but it is not necessary at this stage. Note, if you want to have full weeks at the beginning and end of each year, you will need to add another 7 attributes: WeekStartingMondayYear, WeekStartingTuesdayYear, etc. These attributes will have the year that the day/week is in (i.e. a number such as 2015, 2016, etc.). Then you can ensure that each year has exactly 52 or 53 full weeks.
In the table/view for your date dimension, add in another 7 new columns for WeekStartingMondayDay, WeekStartingTuesdayDay, etc, which will have the DayOfWeekId for the day as it behaves in that week start. For example, WeekStartingMondayDay will always be 1 for Monday, 2 for Tuesday; WeekStartingTuesdayDay will always be 1 for Tuesday, 2 for Wednesday etc.
In your Date dimension, add the new 14 (or 21 if you include year) attributes. For each of the day attributes keep the WeekStartingMondayDay number 1-7 as the key, but provide the name as the dayofweek name (Monday, Tuesday etc.) Eg, the key might be 3 and the name might be 'Friday' for a WeekStartingWednesdayDay member. In SSAS specify the OrderBy attribute =Key. That way, when the attribute is browsed, SSAS will sequence days by the key, not the name. So, WeekStartingWednesdayDay will have Wednesday (key=1) as the first day, Thursday (key=2) as the second day etc. Your week attributes (WeekStartingMonday, WeekStartingTuesday etc) will be OK with the number 1-53 as the key and name. Note, ensure that OrderBy attribute =Key for the weeks also, otherwise, SSAS might perform a character sequencing (by caption) of your weeks.
Create 7 hierarchies, one for each Week Start day, with Week as the first level and day as the lower level. You may also want to add in Year at the top level of the hierarchy.
Now, your users can choose between one of 7 hierarchies for weeks, depending on which day they would like the week to start.
Users can also cross join one of these 7 hierarchies with Dates, then, they will see a heading with the date and the day of week underneath, and broken by weeks.
This solution will perform well in your cube. Since these attributes are simply attributes of existing members, the measure groups will not change in size with their addition. Also, these attributes will perform well as they are physical attributes, not calculations.
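
For the view columns themselves, the day-of-week numbers can be computed independently of the server's DATEFIRST setting. A sketch for the first two of the seven columns (TheDate and dbo.MyDateTable are assumed names; the +5 and +4 offsets shift Monday and Tuesday, respectively, to day 1):

```sql
SELECT TheDate
      -- Monday = 1 ... Sunday = 7, regardless of the session's DATEFIRST
      ,(DATEPART(WEEKDAY, TheDate) + @@DATEFIRST + 5) % 7 + 1 AS WeekStartingMondayDay
      -- Tuesday = 1 ... Monday = 7
      ,(DATEPART(WEEKDAY, TheDate) + @@DATEFIRST + 4) % 7 + 1 AS WeekStartingTuesdayDay
      -- ...and so on for the remaining five week starts
  FROM dbo.MyDateTable;
```

Adding @@DATEFIRST cancels out the session dependency of DATEPART(WEEKDAY, ...), so the view returns the same numbers on any server.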

Wednesday, June 24, 2015

SQL Server Advanced Scan, aka Merry-go-round or Scan Piggybacking

I think SQL Server 2005 was the first SQL Server version with Advanced Scan. I have been excited by this feature, comforted by the thought my databases are performing better, but also a little unsure of its efficacy. According to the documentation, when one query is running a table scan, subsequent queries that require a table scan will "piggyback" on the first scan, avoiding additional scan IO. Apparently, any number of queries can all piggyback on the same scan. Once the first one completes, the others carry on from the beginning, as they may have joined in after the first scan started. So there should never be more than one scan of any one table at a point in time.

What a great feature! However, there are no Profiler traces for Advanced Scan, so how do you know it is actually working? I've just run a performance test of Advanced Scan with 6 concurrent queries that all scan the same table (with slightly different predicates). I ran another test with one query running solo.

I was impressed with the results.
The table I used is about 10 times the size of the available memory, so SQL had no chance of caching the table. The solo scan ran as fast as the read-ahead engines could provide the data. The multiple read engines (from the solo scan) kept the SAN storage busy with an IO queue count constantly greater than 50, and reached a steady state of about 100MB/second. I ran this a couple of times to warm up the system. Note, this single query was running with MAXDOP=1. It would not run faster with MAXDOP=0, as the SAN storage was already working at maximum throughput under the workload of the multiple read-ahead engines.

I then ran 6 queries concurrently that all required a scan of the same table, but with slightly different predicates. These 6 scans all completed in a similar time as the first solo scan! That told me that Advanced Scan was working, but where could I see it? I could not find any evidence in Profiler (apart from the duration) that Advanced Scan was operating. In the queries I had SET STATISTICS IO ON. The statistics had the evidence that Advanced Scan was functioning. All 7 queries had similar logical reads, physical reads, CPU time, and elapsed time. However, the 6 concurrent queries had much smaller read-ahead reads. The total for the 6 concurrent queries was equivalent to the one solo scan.

It's unfortunate, that Profiler does not have an Advanced Scan event (or flag on scan). However, this test does demonstrate that the feature works and that it can be quantified.

Note: Advanced Scan is only available in Enterprise Edition. I initially tried to monitor this feature on Developer Edition, but it gave mixed results. Also note, the test above had Max Degree of Parallelism set to 1; I received similar results with MAXDOP set to 0.
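
To repeat the observation yourself, the evidence is in the per-query IO statistics rather than in Profiler. A sketch (the table and column names are illustrative):

```sql
SET STATISTICS IO ON;

-- Run the same scan-heavy query in several concurrent sessions
SELECT COUNT(*)
  FROM dbo.BigTable          -- illustrative large table
 WHERE SomeColumn LIKE '%x%';

-- The Messages tab reports per-table figures such as logical reads,
-- physical reads, and read-ahead reads. With Advanced Scan, the
-- piggybacking sessions show far fewer read-ahead reads than the solo scan.
```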

Tuesday, June 16, 2015

Natural Keys are Healthy

It seems that I often attract suggestions from developers who have recently been reading Ralph Kimball. This week, it was a suggestion to add surrogate keys to all the dimension tables because this "would make the datamart more extensible and scalable"! I had to argue my preference to only use synthetic (surrogate) keys when there is a need. If the natural business key works and there is no need for SCD (slowly changing dimensions, or Type 2, as Kimball would say), I would keep the design simple and use the existing natural key. That key is already maintained by the source system, and business users are typically familiar with it. In this case the main dimension table contains book titles and has a natural key of ISBN. I think that since the ISBN is an International Standard Book Number, and its attributes don't need SCD support, it would make a perfectly good primary key of the titles dimension table. Also, the fact tables would have ISBN as a foreign key, just like the source system. This requires much less ETL logic and makes reconciliation much easier. There is no need to translate surrogate keys when determining which facts are related to ISBN=x. Why would you want to replace ISBN with a synthetic number, which is meaningless to everyone? I struggle with applications that are more complex than they need to be. I think Einstein said something along the lines of "Everything Should Be Made as Simple as Possible, But Not Simpler".

Some might argue that the fact table will be thinner with a numeric foreign key rather than the 13 byte ISBN. That is true, however I think it's risky to remove the natural key (ISBN) from the fact table as it makes reconciliation with ISBN difficult in the event of a bug in the synthetic key allocation. Also, the fact and dimension tables might like indexes on these keys, and with natural and synthetic keys, there might be a need for twice the number of indexes. Overall, I would argue that natural keys offer higher performance.

I have seen datamarts mystified by inappropriate use of surrogate keys. To illustrate, imagine the titles dimension constructed with a synthetic surrogate key and the ISBN no longer unique. This might be argued by a few to be superior as it would support historical integrity. (Hmm, like historical integrity is necessarily superior.) Think about a slowly changing ISBN! Someone enters a new title as "Kate the Cat", when it should have been "Katie the Cat". After a few weeks, it gets noticed and is corrected. However the SCD datamart is too clever and says that it was "Kate the Cat" for those first few weeks, so it must remain so, for historical integrity! From now on, users of this dimension must combine "Kate the Cat" with "Katie the Cat" to get sales information on this book. This can be frustrating for users, who know only one book called "Katie the Cat".

I have also seen datamarts where the dimension tables have blown out by more than 20 times (1 Million customers turns into a 20 Million row customer dimension) because all attributes are maintained with SCD logic. So each customer record had been updated an average of 20 times. Sometimes these changes are irrelevant to the business regarding SCD, such as a field containing the number of Customer Service calls. In this case, every time the customer calls up, there will be a new record in the customer dimension, with just the call count incremented by 1. For what purpose?

I have also seen date dimensions with surrogate keys! This also puzzles me. Does the developer believe that attributes of a date will change and that reports should show those dates with historical integrity? To my simple mind, that just seems to be a quick way to make a datamart/DW more complex for no benefit. Unless, there is benefit in increasing future work loads as maintenance and reconciliation become bigger chores.

I would add, that if you really do identify a need for SCD I would encourage you to only maintain SCD on the identified attributes. Do not apply SCD logic to all attributes blindly.

Please don't see this blog as anti-Kimball. I truly believe that he has brought a lot of rigour and maturity to data warehousing. However, following his dimensional modelling blindly is not pragmatic and does not help organisations trying to build business intelligence.