Monday, January 19, 2015

Basket Analysis IV - Menu Trim

This is the third post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

Retail stores are always trying to balance range of stocked items with inventory efficiency. A common task is to perform a "menu trim" where low profit contribution items are removed from the menu. A simplistic approach would be to remove the items with the lowest associated sales value or gross profit. However, this can result in trimming items that, while not profitable in themselves, are attracting customers and have high "associated" sales and profit contribution.
Consider the menu items above that are ordered by item sales. The Berry Smoothie is the item with the lowest sales value and gross profit. A simplistic view would be to say let's trim Berry Smoothie off the menu as it only represents $795.23 of sales. However, from our basket analysis query, we can see that the Berry Smoothie has a high Associated Net Sales and Associated Gross Profit. If we took Berry Smoothie off the menu, we might be risking $19,400 of associated sales. This is the total value of all transactions that have included Berry Smoothie. It might be that customers are coming into the store primarily for the Berry Smoothie and getting other items with the smoothie. If we remove Berry Smoothie from the menu these customers might not patronize our store.
Basket Analysis does not answer these questions directly, but it does give us information to make a better informed decision. In the case above, we might consider trimming the Vegeburger from the menu, as it has the lowest Associated Net Sales and lowest Associated Gross Profit. It appears that customers who come in to buy vegeburgers, purchase that item alone. So if that item was trimmed from the menu, it would appear to put less sales at risk.

Friday, December 19, 2014

Basket Analysis III: Promoting Sales Items Associated with Heavy Transactions

This is the third post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

One common retail strategy is to promote a relatively low value item that is associated with "heavy" baskets. That is to identify sales items that tend to be in baskets that have many other profitable items. Sometimes these items are a little quirky and customers go to that particular retailer for the quirky item, and while they are there, they purchase many other items. The rational is that if we can attract more customers to purchase the quirky item, which as little profit contribution, they will purchase many other, profitable, items. So the promotion on the quirky item can be discounted with low, or even negative, gross profit. Basket Analysis technology can help identify these items, and it can also accurately measure the performance of the campaign. 

Here is a graph showing some of the key performance indicators around a promotion of Green Smoothies. This particular promotion increased, by a factor of 5, the sales of Green Smoothies, but these increased sales did not pull through, as hoped, the additional sales. See the green line near the bottom of the chart - it shows the sales value of Green Smoothies, and you can see a clear increase in their sales during the promotion period. Now look at the Average Txn value including Green Smoothies (blue bar) during the promotion period. It drops from about $20/transaction to about $16/transaction. Also, notice the purple line showing the Green Smoothie affected sales. See how it has increased, but only by the value of the increased Green Smoothie sales. This line shows the total value of all transactions that have at least one Green Smoothie. This suggests that promotion shoppers are coming in to purchase the Green Smoothies, but they are only purchasing Green Smoothies. It is valuable to get an early appreciation of the effectiveness of this campaign, so that it can be halted or modified early.

Note, traditional sales reporting tools could easily report on sales values of Green Smoothies, but it is only with Basket Analysis tools that we can get metrics like Green Smoothie affected sales and Average transaction value including Green Smoothies.

For the next Basket Analysis post see Basket Analysis IV.

Monday, December 15, 2014

Basket Analysis II: Would you like a drink with that?

This is the second post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

A common question a customer receives when ordering food is "Would you like a drink with that?" This is an example of on-selling, where the customer has come in with an order, and the salesperson, cashier, etc. looks for an opportunity to increase the size (and profit contribution) of the transaction, and potential satisfaction of the customer, by inviting the customer to add another item to the transaction.  The activity can be much more subtle, such as "We have some hot-from-the-oven blueberry muffins today - would you like one with your coffee?" The intention is the same, and the activity usually involves some staff training and encouragement. Naturally, some training sessions are better than others and some supervisors, motivations, etc are better than others at encouraging staff to invite these on-sell opportunities. Basket Analysis is a great tool to help monitor and analyse the effectiveness of staff, training, and promotional activities. For example, a program could be developed to encourage staff to offer oven-hot blueberry muffins. Without Basket Analysis tools, a sales report might display the ratio of blueberry muffins per 1000 transactions to compare various staff, training programs etc. However, this report will be inappropriately affected by a customer purchasing a dozen blueberry muffins in one transaction. Basket Analysis will provide metrics such as % of hot drink transactions that contain blueberry muffins.

The charts below report on the effectiveness of the training campaign.  The top graph shows that all 6 cashiers are handing similar numbers of transactions with hot drinks.  The training campaign was conducted during week 29. Notice how most cashiers increased their attach rate of muffins by about 30%.  Elizabeth did not change her attach rate (maybe she missed the training) and Barry's attach rate increased, but not as much as the other cashiers.

For the next Basket Analysis post see Basket Analysis III.

Sunday, December 14, 2014

High Performing Big Table Loads in SSIS

If you are trying to load a very large SQL Server table with fast load, you might be thwarted by not having the rows per batch/commit that you think you have.

Here is a rather vanilla fast table load destination using SSIS data flow task. Notice how I have Rows per batch=200,000 and Maximum insert commit size=200,000. You might expect that the fast load will insert about 200,000 rows per batch and transaction. You would be wrong, as it is also gated by the buffer size of the data flow task. To determine how many rows are being inserted per batch/transaction, simply use SQL Profiler and trace the INSERT BULK commands. They are simply SQL:BatchCompleted events. To see the number of rows, look at the RowCounts column in the trace. It will probably be a little less than 10,000.
Firstly why is 10,000 rows per commit slow? If you have no indexes on your large table, the chances are that the loading is very fast. However, if there are non clustered indexes, the loading is slowed down by sorting the batch records into each index sequence and applying them to the indexes. Applying the keys to the index could touch up to 10,000 pages. Then the updates are committed, requiring writing out to disk all updates. Increasing the commit size can make updating indexes more efficient. Note, the performance does not get forever faster as the commit size increases. There is a memory cost with the large commit sizes and sorting larger groups of records increases at a greater then linear rate. To determine the best size for your task, run and monitor several sizes and plot their speeds. Choose the buffer size (batch/commit size) that maximizes load speed. Note, the optimal speed for a small table, will not be the same once the table becomes very large with indexes larger than db cache memory.

Also note, if you are loading a table from scratch, it can be more efficient to load the table with just a clustered index, then create the indexes in parallel after loading. Of course, partitioned tables will give you more flexibility in tuning your index build and loading.

Anyway, to increase the commit size, you will need to increase the Data Flow buffer size. See below, the DefaultBufferMaxRows is 10,000 and DefaultBufferSize is 10,485,760 bytes. Increasing these configuration settings to get larger commits.

Monday, December 8, 2014

Basket Analysis I: What is Basket Analysis

This is planned to be the first blog in a series on Basket Analysis.  In this blog, I will try and answer the question, What is Basket Analysis?

Basket Analysis technology offers an exciting and innovative way of looking at retail data. Simple business intelligence will provide retailers with all sorts of analysis by individual products, including sales by cashiers, customers, profit contribution, waste, discounts, labour, time of day, weather,etc. All this analysis is essential for most retailers in today's market. However, more and more retailers are using the behaviour of shoppers to look for opportunities to improve their operation.

Basket Analysis is about analysing which sales items are bought together (in the one transaction, or different transactions for the same customer).   A Basket Analysis solution will enable you to query and report on which products are bought together and what the profit contribution of all the "baskets" (transactions) that involve a combination of products that are in, or out of, the transaction.

Basket Analysis is not a set of metrics or KPIs from which you can run your business. It is a data solution that allows you to ask questions about the product mix of transactions. Perhaps the most simple question is "Tell me information on the total transaction value by individual product items".   For every sales item, show me the
  • Item Average Price
  • Item Sales Quantity
  • Item Sales Value
  • Item Cost of Goods
  • Item Gross Profit %
  • Item Average Discount
  • Transaction Sales (total value of transactions including this item)
  • Transaction Count (number of transactions including this item)
  • Transaction Average Value
  • Transaction Average Quantity (average items/transactions)
  • Transaction Average Discount
  • Transaction Cost of Goods
  • Transaction Gross Profit
  • Transaction Gross Profit Percent
  • Transaction Value Multiple (total transaction sales/item sales)
Here is an illustration report of that very simple question that might be asked of a Basket Analysis solution.  It is simple, because the analysis is by individual items.  A more sophisticated query would be on particular combinations (existence, or non-existence) of items.  That is for subsequent blogs in this series.

From this simple report there are already many interesting insights available.  For example
  • Items that have a low sales item value but high transaction value (Transaction Value Multiple) are items that whilst they contribute very little profit themselves maybe bringing in customers who purchase more profitable items.
  • Items with a very low Transaction Value Multiplier, close to 1, are bought alone and probably do not induce customers to buy other products.
  • This simple report for one sales item, such as Cheese Cake, by waiter/cashier can give you insights as to which staff/outlets are good at on selling.
  • Sales items with the highest Transaction Gross Profit are ones that are likely to be important to your business, even though they may not represent well in top individual sales items.
For the next Basket Analysis post see Basket Analysis II.

Thursday, November 20, 2014

Yikes - A read query is blocking everything!

This is something I saw in an OLTP database with an application (.NET) read query scanning a multi-billion row table while holding a full table S lock. The query thought it was using the default isolation level of read committed. Almost every other query in the system was locked in a chain behind this read query as all update queries on the table (arguably the most important table in this OLTP application) also prevented any other read queries from jumping the queue and had them waiting for a lock too.

By default, SQL Server read queries will use a locking isolation level of READ COMMITTED. Most developers just assume this and, understandably, do not bother to set isolation level. With READ COMMITTED, a read query will only lock the current record while scanning many rows. It has an "intent" lock on the table, which is compatible with all other "intent" locks on the table. However, if you are using connection pooling, which most applications do nowadays, it is possible that your query has picked up an isolation level of REPEATABLE READ from a recycled connection. This is quite serious. Now your query won't release locks as it scans through rows, and once it has read a few thousand rows, it will stop taking row/page locks and lock the entire table.  This is called Lock Escalation, as you have taken so many locks on one object, SQL Server says, "that's enough you may as well have one lock on the entire table". Now it gets interesting. Any INSERT, UPDATE, or DELETE query will have to wait on your query, since they are incompatible with your S (share) lock on the table. What happens next is even more interesting (in a nefarious kind of way). All subsequent read queries (except NOLOCK read queries) will wait for a lock too.  They are compatible with the full read lock but they are incompatible with the IX locks of the insert, update, delete queries, so they must wait in queue. I call this the "no queue jumping" rule.

How exactly does your query get a REPEATABLE READ isolation level when it didn't ask for it? The answer is connection pooling. An application program can change the isolation level of the connection and leave the changed isolation on the released connection. The next thread to pick up the connection will have the isolation level left behind. Many people have raised this issue before and there has been a Connect issue supported by many developers.
Unfortunately, I feel, Microsoft's response has been to close the "Bug" with an "as By Design" flag!

So what can we do to avoid queries escalating to full table locks when they assume isolation of READ COMMITTED?
  1. The cleanest solution would be for the connection pool to pass on used connections with their default settings. This is unlikely to happen in the near future as indicated above by Microsoft. 
  2. Yould could disable connection pooling. This is quite heavy handed and you will pay for it in a slower application as new connections generally take 100's of milliseconds to create.
  3. In your application that uses connection pooling, ensure that all connections are reset to the default isolation level before release. Note, sp_reset_connection does not do this. Also note, there might be times when a piece of code does not follow this standard. Unfortunately, the non standard code won't be affected.  It will be the innocent code that takes on the connection that initiates the locking issue.  
  4. In your application that uses connection pooling, ensure that all new connections are set to the desired isolation level. Ie, don't rely on a new connection having the default settings. This is probably a good practice, as all code that follows this standard will be OK.
  5. Use WITH (NOLOCK) table hint on your read queries. This will certainly avoid the lock escalation. However, there can be interesting data integrity issues in reading with no locks, such as reading the same row twice. I think there is a place for NOLOCK, but I wouldn't advise the use NOLOCK throughout your application. The problem is the use of REPEATABLE READ not READ COMMITTED.
  6. Use WITH (READCOMMITTED) table hint on your read queries. This will override REPEATABLE READ isolation level on the connection for this table. So the problem above is avoided, but it is a bit of a task changing all the queries to use this table hint. Note, query hints have a bad rap, which is understandable when they override the optimizer's choice of access path. I see no problem with query hints when they are requesting a specific isolation level.
  7. Use READ COMMITTED SNAPSHOT ISOLATION on your database. This is not a solution as the read query is on a connection that has a REPEATABLE READ isolation. As the name suggests, READ COMMITTED SNAPSHOT ISOLATION only works on connections that are using READ COMMITTED
  8. Use TRANSACTION COMMITTED SNAPSHOT ISOLATION on your database. This is not a solution either, as it does not override the REPEATABLE READ isolation level of the read query.
  9. Wait for upgrade to SQL Server 2014.  It looks like the isolation levels are reset on pooled connections.  However, you should check, as this information has simply be posted in an msdn forum, albeit from a SQL team member. 
How can I determine if REPEATABLE READ is being used by my application?
  • A SQL Server Profiler trace on ExistingConnection events will provide the isolation level of all active connections. Just look for set transaction isolation level and see what follows. If any of your current connections have REPEATABLE READ you should be concerned. ExistingConnection will only display currently active connections. To see if isolation level is changed, you will need to trace SQL:BatchCompleted and SP:StmtCompleted, and Prepare SQL. Looking for the text '%set%transaction%isolation%level%repeatable%read%'.
  • Another useful Profiler trace is to trace Lock:Escalation. I believe that any table lock escalation in your OLTP environment should be of some concern to you, particularly if you have locking or deadlock issues. However, if you have too many to look at, you could filter in/out by Type=5 (Object), objectId (table) and/or Mode=3 (S-share).
How lucky are we? There are two ways of viewing the locking incident.

  1. Unfortunate that it happened and get a quick fix asap for this particular query.
  2. Fortunate that this relatively small incident occurred, which enabled us to uncover something quite serious (full table locks in an OLTP environment). The issue, has given us an opportunity to discover the dynamics of connection pooling and helping us find an application wide fix to prevent its future occurrence. Also, if there is a future locking incident, we have more knowledge to diagnose that too. I see this as analogous to someone having high blood pressure and going to the doctor for a fix.  The high blood pressure could be seen as annoying and simply needing some pills to lower the blood pressure, or it could be an opportunity of determining the cause of the high blood pressure (be it diet, stress, etc.) and rectifying the underlying cause.

Wednesday, November 19, 2014

Relative Dates - Today, Yesterday, LastWeek etc.

A useful little piece of functionality I often add to OLAP cubes is what I call Relative Dates.  In the Date dimension I add attributes CurrentDay, CurrentWeek, CurrentPeriod, CurrentYear etc.  These attributes are simply functions within the Dates dimension view, which gets processed early every morning. For example CurrentDay could be something like
when d.MyDate=convert(date,getdate()) then 'Current Day'
when d.MyDate=convert(date,dateadd(dd,-364,getdate())) then 'Current Day LY'
when d.MyDate=convert(date,dateadd(dd,-1,getdate())) then 'Yesterday'
when d.MyDate=convert(date,dateadd(dd,-365,getdate())) then 'Yesterday LY'
else 'Other days'
end CurrentDay,

Note, Current Day Last Year is -364 days, which will ensure that it will be the same day of week. It's the same as saying -52 weeks.

Note, you only need to process the Date dimension, the large fact tables do not need to be reprocessed, as their Date dimension keys are not changing. It's just the, flexible, attributes of the Date dimension that are changing daily.

Note, it is not practical to create a hierarchy of these attributes as they don't fall within oneanother. For example, Today and Yesterday might be in the same fiscal week, or they might fall in consecutive weeks.

The logic for CurrentWeek etc is a bit more complicated as your case statement will be checking if the FiscalWeek of getdate() is the same as the FiscalWeek of MyDate. Eg.
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,getdate())) then 'Current Week'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-52,getdate()))) then 'Current Week LY'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-1,getdate()))) then 'Last Week'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-53,getdate()))) then 'Last Week LY'
else 'Other weeks'
end CurrentWeek,

You can do the same sort of logic for Periods, Quarters, and Years. Note, this functionality does not replace Time Intelligence. If anything it adds to it as Time Intelligence will go across the Relative Dates attributes, or, more likely, you can use the Relative Dates attributes to filter dates and use a regular date hierarchy on rows/columns with Time Intelligence. The advantage of the CurrentDay, CurrentWeek, CurrentPeriod, and CurrentYear attributes is that reports can be relative in time. Ie, the filter can be a multi-select of this period and last period, showing weeks on columns, which will have 5-9 weeks.  So the report will always display the last 5-9 weeks, without the user having to select the current period etc. These filters will work in Reporting Services, Excel, PerformancePoint, or any other cube browser/reporting tool.

Wednesday, September 3, 2014

SQL Functions and Plan Caching

I feel that I have become typecast as a Business Intelligence specialist.  However I often employ my SQL skills and I still enjoy the challenge of solving pure SQL performance issues.  Here's an issue, which presented itself as a table valued Function in an OLTP environment that was taking a lot of compilation energy (6.7 seconds) every execution, while the execution was only 0.15 seconds.  It was executing about 1/second at peak.

Now, I am a great fan of functions, but I know that they are not compiled like stored procedures, so I am weary of using them for complex SQL queries.  This query was very complex.  I've got the function's code below, but it is based on a much more complex view, which joins 5 tables.  There is a picture of the function execution above.  See how the execution time is only 125ms, while the parse and compile time is 6638ms!  The obvious solution is to convert this function to a stored procedure, which will enable SQL to store the plan.

However, this would involve changes to .NET code, which would involve a lot of testing and change control.  I found another solution, which enabled us to fix the Function without having any change in .NET code, permitting a quick fix to production.  The fix was to alter the function so that it has multiple lines!  This way SQL Server will treat the 'multistatement table-valued function' as a stored procedure and enable it to be cached.  You can see from the execution results below that their is no parse and compile time, and the execution time is the same 125ms.
How does it do this?  It appears that SQL Server will treat the multistatement table-valued function as a stored procedure.  We can see this from the plan cache query below.

Note, SQL Server is able to cache Adhoc queries, so that your execution of a function with a particular parameter can be cached.  Don't let this cloud your analysis, unless your OLTP environment is going to  execute the function with exactly the same Adhoc text.  You will see these 'Adhoc' cached plans above with objtype='Adhoc'

Note, there is documentation that explains this behaviour, although you might have missed it.  See and read the text near 'multistatement table-valued functions'.

What does this tell us?  Firstly, it reminds us that compile time is not always trivial.  Secondly, it reminds us that we should treat Functions as different to Procedures, and only use Functions where they are appropriate.  I find them most useful where they are doing something like string manipulation and optimisation is not appropriate.

Old function (with 6.8 second compile time every execution)
create function MyFunction 
(@Parameter1 int) 
SELECT t2.*,t1.MyDate
from (select max(t4.TheDate) MyDate, 
      from vBranchCount as t3 
      left outer join  (SELECT *
                        FROM vBranchCount
                        WHERE vBranchCount.Status = 'Applied') AS t4 
ON t3.TranCode = t4.TranCode 
                AND t3.BranchNo = t4.BranchNo 
                AND t3.Cash1 = t3.Cash1 
                AND t3.Cash2 = t4.Cash2 
                AND t3.Cash3 = t4.Cash3 
                AND t3.Cash4 = t4.Cash4 
                AND t4.TheDate < t3.TheDate
      where t3.StockCountPlaceId = @Parameter1
      group by t3.BranchNo,t3.TranCode,t3.Cash1,t3.Cash2,t3.Cash3, t3.Cash4) as t1
inner join vBranchCount as t2 
  on t1.BranchNo = t2.BranchNo
 and t1.TranCode = t2.TranCode
 and t1.MyDate = t2.CashDate
 and t1.Cash1 = t2.Cash1
 and t1.Cash2 = t2.Cash2
 and t1.Cash3 = t2.Cash3
 and t1.Cash4 = t2.Cash4  )

New Function with 6.8 second compile time only on first execution.
create function MyNewFunction 
(@Parameter1 int) 
(  TeamID bigint,
           Rain bit,
           WitchHat varchar(50),
           JoyStormID bigint,
           Happyness int,
           TeamName varchar(250),
           CashIn money,
           CashOut float,
           TheDate datetime,
           JoyClerkID bigint,
           JoyClerkName varchar(150),
           TranCode bigint,
           Bills float,
           CashDate datetime,
           BranchNo bigint,
           Temperature varchar(50),
           StockCountPlaceId bigint,
           Applicability int,
           Readyness int,
           SiteID bigint,
           BillsToChange int,
           Cash1 varchar(50),
           Cash2 varchar(50),
           Cash3 varchar(50),
           Cash4 varchar(50),
           JoyUnit1 int,
           JoyUnit2 int,
           JoyUnit3 int,
           JoyUnit4 int,
           TranSystemId bigint,
           Handle1 varchar(35),
           Handle2 varchar(35),
           Consumer varchar(50),
           JoyConsumerId bigint,
           Rainbow varchar(50),
           StormCode varchar(50),
           ChangeDate datetime)
SELECT t2.*,t1.MyDate
from (select max(t4.TheDate) MyDate, 
 from vBranchCount as t3 
 left outer join  (SELECT *
FROM vBranchCount
WHERE vBranchCount.Status = 'Applied') AS t4 
ON t3.TranCode = t4.TranCode 
AND t3.BranchNo = t4.BranchNo 
AND t3.Cash1 = t3.Cash1 
AND t3.Cash2 = t4.Cash2 
AND t3.Cash3 = t4.Cash3 
AND t3.Cash4 = t4.Cash4 
AND t4.TheDate < t3.TheDate
 where t3.StockCountPlaceId = @Parameter1
 group by t3.BranchNo,t3.TranCode,t3.Cash1,t3.Cash2,t3.Cash3, t3.Cash4) as t1
inner join vBranchCount as t2 
 on t1.BranchNo = t2.BranchNo
 and t1.TranCode = t2.TranCode
 and t1.MyDate = t2.CashDate
 and t1.Cash1 = t2.Cash1
 and t1.Cash2 = t2.Cash2
 and t1.Cash3 = t2.Cash3
 and t1.Cash4 = t2.Cash4   


Query plan cache

SELECT usecounts, cacheobjtype, objtype, text 
FROM sys.dm_exec_cached_plans 
CROSS APPLY sys.dm_exec_sql_text(plan_handle) 
WHERE (text like '%MyFunction%' or text like '%MyNewFunction%')
and objtype <> 'Adhoc'
and cacheobjtype='Compiled Plan'
ORDER BY usecounts DESC;

Wednesday, August 27, 2014

Database IO Throughput

I've written on this before, but I feel I need to say it again in a different way as so many people fail to grasp the significance of IO throughput capacity.  If you see messages in the SQL log along the lines of the following, then you have a very very serious IO throughput issue. I didn't make these logs up.  They are real logs!
The best analogy I can draw is if you can imagine your company has 1000 staff and you have a building of 20 floors with 4 lifts that can adequately house all staff.  Your 1000 staff tend to drift in between 9 and 10AM and the 4 lifts can move the 1000 staff to their desired floors in the hour.  Infrequently, staff will queue for a lift as the first lift will be full, but it's rarely longer than one lift ride for the wait.  During the day, the lift demands are lower and there are no queues.

Now, over time, your company grows to 2500 staff, and you've added another 30 floors to your building and the lifts are extended to cater for all 50 floors.  The CEO is relieved that the building capacity has been increased to house the new staff and that the lifts work across all 50 floors.  However, no one has thought about the increased throughput required by the lifts.  Between 9 and 10AM 2500 staff now drift into the office and want a lift to their floors, however the 4 lifts are only capable of moving about 1200 staff/hour.  The queues for the lifts build up and are not fully drained until after 11AM.  The company is now in trouble with disgruntled staff and find it needs to ask staff to come in earlier or later.  Also, there is the hidden issue of the fire exit stairways also having throughput limits.  With 1000 staff, the building could be evacuated in 15 minutes, now with 2500 staff (and the same stairways) it will take over 40 minutes.

This is analogous to a SQL Server database growing over time, and the SAN (playing the role of the lifts) is extended to hold more TeraBytes, but its throughput capacity (measured in MegaBytes/second of reading and writing) is not increased.  The IO queues start to build up and IO times degrade.  The SQL Log picture above is an example where hundreds of IO every minute are taking longer than 15 seconds to complete!  This is an unbelievably dire situation that can cause SQL Server to shutdown. 

I am not advising that you monitor your SQL Server log for long running IOs, although if you see them, you should act immediately.  I advise you to monitor IO throughput, average latency, and queue length continually, using Perfmon or related tool.  You can detect throughput capacity when MB/second tends to flatline while IO latency and queue lengths are waxing and waning.  Unfortunately, Perfmon, is not a complete tool, but it does have all the base information and there are many third party tools, or you can do what I do and create your own by having Perfmon send records to a common SQL Server database and cube.  You only need a few counters and only record every 60 seconds, or so.  If the Perfmon database is on another server, there will be very little load on your production servers, essentially no IO.  Your Perfmon database could hold performance data for all your production servers, so that you have all the performance information you need in one place.  Also, when you have a performance issue you have historical information of this issue and trends earlier in time.  Companies that don't continually record these metrics are flying somewhat blind, or, at least, flying with blinkers on.  Here's an earlier blog on creating a Perfmon cube, although, unfortunately, the online demonstration is no longer available.  

This posting is not intended to provide answers to the IO throughput issue.  The solution can be many and varied.  The solution will probably involve speeding up the storage system, but it might be to make the database more IO efficient.  The most important point I can make is that you should be monitoring and be able to recognise this issue.  Also, don't let a storage person tell you that you have hit the maximum possible throughput.  There are always ways to increase the throughput performance of your storage, as there are always ways to improve your database performance.

Monday, June 30, 2014

Signs your Data Warehouse is Heading for the Boneyard

I have come across too many data warehouse projects that have obvious signs of failing with project sponsors in denial.  Here are a few signs that you might look out for and take seriously.

  1. You've spent one million dollars and haven't got one million dollars of value. Data warehouse (DW) projects are never really finished, as there are continual changes in business practices, business software, competition, business strategy, regulation, etc. DWs need to be flexible so that they can continually be updated/enhanced to adapt to these changes in the business. If you've spent one million dollars and not got value from that, you are probably developing something that is inflexible. Typically, I would be encouraging, even very large companies not to spend that much money without showing benefit. The vision for a DW might be very large, but start with the trickiest, largest, and/or most crucial data. The DW should manage that data and provide value to the business before more work is commissioned.
  2. Your project is 3 months overdue and the development manager is telling you that more resources are required. There are probably fundamental design issues that will continue to cause issues. Such a long development cycle will probably lead to an inflexible DW.
  3. Your user community do not trust the output. The DW is producing reports, dashboards, alerts, etc., but the user community do not trust the data and aren't using it to run the business.
  4. The same question can be asked to different parts of the warehouse with different answers. One transaction detail should, ideally, be stored in only one place in the DW. If it is stored in mulitple places, the design is probably not optimal, and if they are providing different answers, there is definitely a problem.
  5. It's almost impossible to reconcile the data warehouse with your OLTP systems. The ETL (extract, transform and load) processes are so complex and involve so much custom code that no one can reconcile DW facts with OLTP systems. The ETL processes are probably too complex.
  6. Your developers are telling you that they are spending a significant amount of time developing functionality that will make the data warehouse more flexible. This is all too common in my view. Developers like to develop, and they can create large generic applications to schedule jobs, "clean data", manage file transfers etc. Most generic DW software is already available in Microsoft's SQL Server and related technologies. The role of the developers is to use this software to create a bespoke application that is as simple as possible. Try and stick to existing vendor software. They will maintain it and provide future releases. You will always have the option to move platforms in the future having spent much less money.
  7. The data warehouse requires a full-time person to manage the jobs and exceptions. DW schedules should be reliable enough to manage themselves, simply providing alerts for late data etc. Intervention should only be required where there is a defect in the DW and a diagnosis and change is required. Even when there is a catastrophic failure, the DW should start up and simply continue from where it was.
  8. Some of the development team have left before value is delivered. DW projects should be delivered in small releases that provide value. If the release cycle is so long that significant team members are leaving, there is probably an underlying design issue.
  9. There are performance issues that are so severe that it is looking difficult to process one day's data in a day. This is a bright red flag.  DW will only want to grow with more feed systems and more data. Such a large performance issue should be seen as a huge question mark over the DW.
  10. The development team are telling you that they need to spend a lot of time up front to develop software that's going to make the DW more flexible! It's almost a certainty that their software will make the DW more inflexible. Think of it this way, if it takes 50 dev days to create a DW, it will take a proportion of that 50 dev days to make a change.  If the first deliverable takes 1000 dev days, enhancements will tend to be a proportion of 1000 days.

Monday, January 13, 2014

Determining a wayward query plan while it's executing

Occasionally, I get production performance issues of a long running SQL query, that is consuming lots of system resources and is running for much longer than usual.

You can see the long running query simply by using sp_who2, or some variant, that displays active SQL queries.  So you now have the SPID of the running query that is taking a long time.  I would typically then want to get the SQL command.

You can easily get the SQL for this query by using dbcc inputbuffer(SPID).  I might then use the sql query to get a show plan to see how it should execute.  You can quite easily do this by cutting and pasting the SQL into a new window and requesting a show plan.  However, sometimes the query plan that is running is not the same as the query plan that SQL Server tells you it would use now.  This can be for a variety of reasons including new statistics, parameter sniffing and different connection properties.

So what you really want to get is the plan that is currently executing.  You can obtain this from the sys.dm_exec_query_plan table.  First of all you will need the plan handle, which you can get from sys.dm_exec_requests using the SPID as the session_id.

Use the plan_handle to get the plan from sys.dm_exec_query_plan.  The query plan will return as an xml document.  Simply click on the xml in the query results window for SQLEM to open the query plan in a new window.

Now you have the query plan that is actually executing and you can compare that to the query plan you expected SQL to use.  From here you can tune the query to use a more appropriate query plan.

Note, I would encourage you to use all other options before using index or join hints to get the query plan you desire.  This is because once you override the optimiser with your index or join hint, it will always use that hint even when your table structures or data changes so much that it is obviously silly.  Optimiser hints such as FASTFIRSTROW, redundant predicates etc are OK with me, as they don't override the optimiser, they just tell the optimiser that you want a plan that will get the first row fast.

Saturday, August 10, 2013

Sargable Predicates are usually best practice

Every SQL query writer should know that sargable predicates are generally much faster than non-sargable predicates.  Essentially a sargable predicate can be used by the relational engine to navigate an index.  For example, the predicate Where IpCountryID=@i 

is sargable as the relational engine can use the variable @i to navigate an index on IpCountryID.  Similarly
Where IpCountryID=@i-1

is also sargable as the relational engine can add 1 to @i and use that to navigate an index on IpCountryID.  On the other hand
Where IpCountryID+1=@i
is not sargable as the relational engine cannot navigate an index while there is an arithmetic expression on IPCountryID.  These equivalent sargable and non-sargable predicates and their plans can be seen above.  The sargable predicate query uses an index seek to get the particular record, while the non-sargable predicate uses the same index but it scans the entire index checking if IPCountryID+1=@i.  An example of how the sargability of predicates is important is where you are joining a table to itself where the key has an arithmetic function.  For example, getting the previous record as in the following example
SELECT t1.*,t2.value
FROM MyTable t1
LEFT JOIN MyTable t2
Notice how the predicate is sargable on the left side but not the right.  This means that the relational engine can navigate to t2 using an index.  If the predicate was (logically equivalent) the relational engine could not navidate to t2 using an index.  

I found an interesting situation while tuning a query on a couple of nested views.  It's not easy to reproduce, and to reproduce it I need to include other joins and logic, which will make it hard for you to comprehend.  The essence of the first view was to restrict the records to the last rolling 53 weeks.  It had the following predicate
WHERE TransactionDate >= DATEADD(WW,-53,GETDATE())
This predicate is sargable, which should be a good thing.  However, there are several views on top of this base view that are asking for month date ranges. So they have predicates with hard coded date ranges (on top of the first view) ie
WHERE TransactionDate between '20160101' and '20160131'
This predicate is also sargable, which should also be a good thing.  However, the issue I uncovered was that the optimiser now had a choice of TransactionDate>=DATEADD(WW,-53,GETDATE()) and TransactionDate>='20160101' as the lower bound to the index seek.  Sometimes it was choosing TransactionDate>=DATEADD(WW,-53,GETDATE()) as the lower bound, which is much slower than TransactionDate>='20160101', since 20160101 is a higher date (and there are no transactions for 2016 yet).  The solution to these poor performing queries was to change the first view to make the sargable predicate in the base query non-sargable.  It became 
WHERE DATEADD(WW,+53,TransactionDate) >= GETDATE()
Now the optimizer has no choice, it will use '20160101' and '20160131' as seek predicates on the index and the query runs orders of magnitude faster.

Please don't takeaway the feeling that non-sargable predicates are good.  The takeaway is that it is worthwhile looking into the plans of long running queries, understanding why the optimizer chose that path and help it choose a faster path, if one exists.

Friday, August 9, 2013


I might begin by apologizing for such a long gap since my last blog.  I've had family issues that have taken a lot of pain, time and energy.   This blog is a reminder to everyone that as important as your career and interesting as the technology is, you need to spend more time with the people you love and doing the things you enjoy. For much of the last 20 years I haven't always done this and I've had a couple of big reminders recently.  One of the things I enjoy is unicycling, and last month I had the opportunity of competing in the North American Unicycle Championships in Pennsylvania.  So much fun, and so many interesting people.  That's me, the big guy on the right in white, early on in the marathon race, where I won my age group.

The local newspaper did a little article on me when I returned to Sydney.

So remember to stop work everyday and do something you enjoy and spend more time with people you love.  The technical challenges will still be there tomorrow.

Next blog, I promise, will be a technical one.  It will be about tuning a query by making a predicate non-sargeable!  Yes, I hear you, that's against all the best practices, but there is a good reason for it in this particular case.

Thursday, April 4, 2013

Avoiding a Common Deadlock

A common cause of deadlocks is concurrent identical update/delete queries.  For example, a query such as
DELETE FROM MyTable WHERE MyCode=@code
When two queries start executing this query with different codes, they have a good chance of deadlocking.  This can be via lock escalation, where query A requests an escalation to a table lock, which will wait on the query B.  As soon as the query B requests a lock escalation, or any lock that query A has, there is a deadlock.  You can have a deadlock without table escalation, by the two concurrent queries requesting page locks.  Eventually query A might ask for a page that query B has and query B is waiting on query A.
A very simple solution to this kind of deadlock is to take a Table lock at the beginning of the query.  That way whoever gets in first will finish before the other query starts, removing the possibility of a deadlock on that one table.  This can be as simple as
Taking a table lock will reduce parallelism.  No other queries (except NOLOCK) will be able to read the table while the TABLOCKX is held.   However, the reduction in parallelism is generally less painful that experiencing deadlocks.  Also, queries that DELETE/UPDATE many rows in a single query will necessarily lock large parts of the table (often the whole table if lock escalation is triggered), so taking an explicit table lock for the query generally is not too onerous.
Note, another trick that I sometimes use in stored procedures that are looping through many records with possible updates is to take an explicit lock on the table at the beginning of a transaction.  For example, the following code will take an explicit table lock on MyTable, which it holds until the transaction is committed.
-       - - put your code in here to perform updates/inserts/deletes on the table(s)
A couple of notes for deadlocks and locking
  • The amount of locking and deadlocks exponentially increase as transaction time increase.  So, anything you can do to reduce transaction time will reduce contention.
  • It is common practice for middle tier applications (eg .NET) to start transactions and call the database server multiple times within the transaction.  This is OK, but it is even safer (and reduced transaction time) if you encapsulate the logic in a stored procedure, and let the stored procedure begin/commit the transaction.
  • SELECT queries can be called without locks (NOLOCK).  DELETE/UPDATE/INSERT queries cannot be called without locking.
  • Read queries (DELETE/UPDATE/INSERT queries also read queries for this point) generally only lock the current record/page by using CURSOR STABILITY lock isolation.  It is possible for them to use more onerous strategies, such as, REPEATABLE READ, in which case they do not release read locks until the end of the transaction.  This can lead to contention and deadlocks with concurrent update queries. Sometimes these REPEATABLE READ read queries will benefit from an explicit table lock.
  • Lock escalation is generally your enemy in avoiding deadlocks.  If you see that a query is regularly escalating to a table lock, it is probably better for you to explicitly take a table lock at the beginning of the transaction/query as I outlined above.
  • Transactions can be started and committed while a stored procedure holds position on a cursor.  This way, you can commit and release locks every n records while processing rows in a cursor.
  • Use TRY CATCH in your procedures to handle exceptions such as deadlocks.

Wednesday, March 20, 2013

Pitfalls of Big Data

I have just assisted in another Big Data project review.  There appear to be a set of common pitfalls that new Big Data developers fall into.   In this blog I hope to shed light on a few pitfalls that are quite easy to avoid, but difficult to remedy. 

  • Insisting on flat file input sources, then simply load these into flat database tables.  I understand that some source systems are remote and that flat files are the appropriate vehicle to transport data to your warehouse.  However, if the source system is a relational system within your operations centre, then it will be much easier and more reliable to simply read the data straight from the source system.  It is more reliable, as databases will ensure data typing and referential integrity.  It will be faster as it avoids all the IO involved with writing the flat file and reading it again.  There is also the maintenance issue of adding/removing columns from flat files.  Typically both consumers and providers need to be synchronised for such a change, whereas a database can accommodate this change without the need for consumer and providers synchronising their change.
  • Measuring your Big Data project by the number of tera-, peta- or exa- bytes. This is an almost irrelevant figure.  Just like the amount of money companies spend on R and D won't tell you which one will come up with the next big idea. Big Data projects should be measured by the beneficial value of the information they provide.  When specifying storage requirements, amount of storage always comes after performance specifications (MB/second and ms/IO, for example).
  • Engaging the services company with the best Big Data sales person. Consultancy and software firms have discovered that Big Data projects can be very large and profitable.  Consequently, their best sales persons are employed to win, and they aim for the biggest projects. The big project is not always in the interest of the client, and the best salespersons are not always followed by the best developers. 
  • Technology is not your most important decision. The technology you choose is important, but it's not the overriding factor.  Also, as a corollary, only choose the technology, as you need it.  For example, if you are planning on taking 12 months to load the data, leave selection of the visualisation technology until after you have started loading.  This will enable you to make a better choice on the software (you will have some good sample data) and the software market will be a few months more mature.  
  • Purchasing production hardware before you need it.  If your project isn't going live for 12 months, there is no need to purchase production software until close to that time.  At that time you will have a better idea of the specifications required, and the hardware will be a little more advanced and/or cheaper.
  • Taking Kimball or Inmon literally and make all dimension attributes Type II.   Some naive Big Data developers worship Kimball and measure their technical prowess by how many Type II dimensions and attributes they have.  So much so that not only do they make your DW performance suffer, they remove natural keys making querying on current values impossible for end users.  It's OK to have Type II dimensions, but only where they benefit the business users.  Almost always, when you have Type II dimensions, you should also keep your business (natural) keys in the fact tables.  For example, a transaction record might have a surrogate key for Customer, which points to the customer record, as it was at that time.    If the Customer attributes change, later transactions will point to a different Customer record with a higher surrogate key.  Now, when the business selects transactions for customers that belong to "Harry", they wont see transactions for customers that belong to Harry unless Harry was the customer owner at the time of the transaction.  This can be very confusing and annoying to business users.
  • Reload entire tables that only get inserted to from source systems.  This is quite obvious even to naive developers and consequently an uncommon pitfall.
  • Reject data from the business because it doesn't conform to a Big Data rule.  For example, reject transactions because they have an invalid ItemId column.  This is a kind of "Holier than thou" issue.  The Big Data custodians want to keep their warehouse clean so they reject bad data.  The problem is that this record was for a sale, it has an amount and is part of the gross sales.  By rejecting it, the warehouse has got an invalid view of the business.  A valid view would be to include the transaction, albeit, with an "unknown" sales item.  It might be that at a later time the sales item table comes through with that missing ItemId, which the warehouse should accommodate without rejection or prejudice.
  • Force your business users to using one querying/visualisation technology.  The Big Data project will have selected specific software for collection, loading, storing, cubing, data mining etc.  However, the business users should be free to select the technology of their preference for querying and visualising this information.  Nowadays, structured data is well suited to OLAP cubes, and most warehouses exploit OLAP cubes.   There are hundreds of software applications that can consume cubes, and the business users should be free to use the ones they prefer.  It doesn't matter to the warehouse how many different cube browser technologies there are.  It might matter how many concurrent queries there are, but that's a different matter.

Thursday, January 24, 2013

MDX Running totals on a nested ordered set

I like MDX puzzles, and just got a good one today. The requirement was to have running totals over an ordered set, but the set was only ordered at the lowest level, and the running totals were just within the parent member. Ie, a list of countries and cities, where cities are ordered descending by [net sales] within country, and [profit] has a running total for the cities within that country.

As I have said before, anything is possible with MDX, and here is the solution for ordering cities within countries and having running totals of profit of cities within countries. I haven't ordered the countries here, but they could be independently ordered.

Essentially, what the query does is order the cities (along with city.All). For the running total it takes the members from the City.All member to the current member and aggregates them. Simple really.

set OrderedCities as
,[Measures].[Net Sales], DESC)  
member [Measures].[CityRank] as rank(([Geography].[Country].CurrentMember,[Geography].[City].currentmember),OrderedCities)
member [Measures].[CountryRank] as rank(([Geography].[Country].CurrentMember,[Geography].[City].[All]),OrderedCities)

member [Measures].[ProfitRunningTotal] as
Topcount(OrderedCities-TopCount(OrderedCities, ([Measures].[CountryRank]))

{[Measures].[Net Sales]
} on columns,

OrderedCities  on rows

from [MyCube]