Thursday, November 20, 2014

Yikes - A read query is blocking everything!

This is something I saw in an OLTP database: a .NET application read query was scanning a multi-billion row table while holding a full table S lock. The application assumed it was using the default isolation level of READ COMMITTED. Almost every other query in the system was blocked in a chain behind this read query: the update queries on the table (arguably the most important table in this OLTP application) had to wait for the S lock, and their waiting requests in turn prevented other read queries from jumping the queue, leaving those readers waiting for locks too.

By default, SQL Server read queries use the READ COMMITTED locking isolation level. Most developers just assume this and, understandably, do not bother to set the isolation level. Under READ COMMITTED, a read query only locks the current record while scanning many rows. It also holds an "intent" lock on the table, which is compatible with all other "intent" locks on the table. However, if you are using connection pooling, as most applications do nowadays, it is possible that your query has picked up an isolation level of REPEATABLE READ from a recycled connection. This is quite serious. Now your query won't release locks as it scans through rows, and once it has taken a few thousand locks, it will stop taking row/page locks and lock the entire table. This is called lock escalation: you have taken so many locks on one object that SQL Server says, "that's enough, you may as well have one lock on the entire table". Now it gets interesting. Any INSERT, UPDATE, or DELETE query will have to wait on your query, since its IX lock is incompatible with your S (share) lock on the table. What happens next is even more interesting (in a nefarious kind of way). All subsequent read queries (except NOLOCK read queries) will wait for a lock too. Their IS locks are compatible with the table S lock, but they are incompatible with the queued IX locks of the insert, update and delete queries, so they must wait in the queue. I call this the "no queue jumping" rule.
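As a minimal sketch of the blocking chain across three sessions (the table and column names here are hypothetical):

```sql
-- Session 1: a connection that has silently inherited REPEATABLE READ
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT COUNT(*) FROM dbo.BigTable;  -- holds S locks as it scans; escalates to a table S lock

-- Session 2: a writer; its IX table lock request is incompatible with S, so it waits
UPDATE dbo.BigTable SET SomeCol = 1 WHERE Id = 42;

-- Session 3: an ordinary READ COMMITTED reader; its IS request is compatible with S,
-- but it queues behind session 2's waiting IX request -- the "no queue jumping" rule
SELECT SomeCol FROM dbo.BigTable WHERE Id = 42;
```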

How exactly does your query get a REPEATABLE READ isolation level when it didn't ask for it? The answer is connection pooling. An application program can change the isolation level of the connection and leave the changed isolation on the released connection. The next thread to pick up the connection will have the isolation level left behind. Many people have raised this issue before and there has been a Connect issue supported by many developers.
Unfortunately, I feel, Microsoft's response has been to close the bug as "By Design"!
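You can see this for yourself on any session: sp_reset_connection restores many settings when a pooled connection is reused, but the isolation level is not one of them. A quick way to check what a connection is actually running under:

```sql
-- Shows the session's effective SET options, including the isolation level
DBCC USEROPTIONS;
-- If a previous user of this pooled connection ran
--   SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- the "isolation level" row will still read "repeatable read" here.
```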

So what can we do to avoid queries escalating to full table locks when they assume isolation of READ COMMITTED?
  1. The cleanest solution would be for the connection pool to pass on used connections with their default settings. Given Microsoft's response above, this is unlikely to happen in the near future.
  2. You could disable connection pooling. This is quite heavy handed and you will pay for it with a slower application, as new connections generally take hundreds of milliseconds to create.
  3. In your application that uses connection pooling, ensure that all connections are reset to the default isolation level before release. Note that sp_reset_connection does not do this. Also note that there may be times when a piece of code does not follow this standard. Unfortunately, the non-standard code won't be the code affected; it will be the innocent code that picks up the connection afterwards that initiates the locking issue.
  4. In your application that uses connection pooling, ensure that all new connections are set to the desired isolation level. That is, don't rely on a new connection having the default settings. This is probably a good practice, as all code that follows this standard will be OK.
  5. Use the WITH (NOLOCK) table hint on your read queries. This will certainly avoid the lock escalation. However, there can be interesting data integrity issues when reading with no locks, such as reading the same row twice. I think there is a place for NOLOCK, but I wouldn't advise using NOLOCK throughout your application. The problem is the use of REPEATABLE READ, not READ COMMITTED.
  6. Use WITH (READCOMMITTED) table hint on your read queries. This will override REPEATABLE READ isolation level on the connection for this table. So the problem above is avoided, but it is a bit of a task changing all the queries to use this table hint. Note, query hints have a bad rap, which is understandable when they override the optimizer's choice of access path. I see no problem with query hints when they are requesting a specific isolation level.
  7. Use READ COMMITTED SNAPSHOT ISOLATION on your database. This is not a solution, as the read query is on a connection that has REPEATABLE READ isolation. As the name suggests, READ COMMITTED SNAPSHOT ISOLATION only affects connections that are using READ COMMITTED.
  8. Enable SNAPSHOT isolation (ALLOW_SNAPSHOT_ISOLATION) on your database. This is not a solution either, as snapshot isolation must be explicitly requested by the connection, so it does not override the REPEATABLE READ isolation level of the read query.
  9. Wait for an upgrade to SQL Server 2014. It looks like isolation levels are reset on pooled connections. However, you should check, as this information has simply been posted in an MSDN forum, albeit by a SQL team member.
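Options 4 and 6 above can be sketched as follows (the table and column names are hypothetical):

```sql
-- Option 4: don't trust a pooled connection's defaults; set the level
-- explicitly at the start of each logical connection
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Option 6: or override a leftover REPEATABLE READ per table with a hint
SELECT SomeCol
FROM dbo.BigTable WITH (READCOMMITTED)
WHERE Id = 42;
```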
How can I determine if REPEATABLE READ is being used by my application?
  • A SQL Server Profiler trace on ExistingConnection events will provide the isolation level of all active connections. Just look for SET TRANSACTION ISOLATION LEVEL and see what follows. If any of your current connections have REPEATABLE READ, you should be concerned. ExistingConnection will only display currently active connections. To see when the isolation level is changed, you will need to trace SQL:BatchCompleted, SP:StmtCompleted and Prepare SQL events, looking for the text '%set%transaction%isolation%level%repeatable%read%'.
  • Another useful Profiler trace is to trace Lock:Escalation. I believe that any table lock escalation in your OLTP environment should be of some concern to you, particularly if you have locking or deadlock issues. However, if you have too many to look at, you could filter in/out by Type=5 (Object), objectId (table) and/or Mode=3 (S-share).
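If you are on SQL Server 2005 or later, you can also check the current state without a trace, via the sessions DMV:

```sql
-- transaction_isolation_level: 1 = Read Uncommitted, 2 = Read Committed,
-- 3 = Repeatable Read, 4 = Serializable, 5 = Snapshot
SELECT session_id, login_name, program_name, transaction_isolation_level
FROM sys.dm_exec_sessions
WHERE transaction_isolation_level = 3;  -- sessions sitting at REPEATABLE READ
```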
How lucky are we? There are two ways of viewing the locking incident.

  1. Unfortunate that it happened; apply a quick fix ASAP for this particular query.
  2. Fortunate that this relatively small incident occurred, as it enabled us to uncover something quite serious (full table locks in an OLTP environment). The issue has given us an opportunity to discover the dynamics of connection pooling and has helped us find an application-wide fix to prevent its recurrence. Also, if there is a future locking incident, we have more knowledge to diagnose that too. I see this as analogous to someone with high blood pressure going to the doctor for a fix. The high blood pressure could be seen as an annoyance that simply needs some pills, or it could be an opportunity to determine the underlying cause (be it diet, stress, etc.) and rectify it.

1 comment:

Richard Lees said...

Looks like this might be fixed in SQL 2014.