Tuesday, November 22, 2011

Fix Query Performance by Creating Statistics

SQL Server's optimiser will automatically create statistics on individual columns that are used in predicates.  See picture of database properties to the right.  However, it will not create statistics on column combinations unless they are composite columns in an index.  This is where you might be able to help the optimiser by creating statistics on column combinations.  This technique is most appropriate where the columns in your predicates have a relationship.  For example, my InternetLogtable (1.2 Million rows) contains a row for each internet request.  It has columns ClientAgentId and ClientHostId, where ClientAgent is the name of the browser, and ClientHost is the IP address of the client machine.  As you can imagine, most requests from one ClientHost will tend to have the same ClientAgent.  However, SQL Server's optimiser does not know this.  It has statistics on ClientAgentId and ClientHostID so if when it determines the cost of the following queries it can quite accurately determine how many rows will be returned.  For the first query it estimates 332 rows will be returned.  For the second query it estimates 110 rows will be returned.  In both cases, the estimate is very accurate.  (There do happen to be indexes on each of the individual predicate columns, although even if there weren't indexes, it would automatically create statistics and get similar estimates.) 
select * from internetlogtable with (nolock)
       where ClientAgentId=72383
select * from internetlogtable with (nolock)
      where ClientHostID=417408
However if we combine both predicates, as in the following query

select * from internetlogtable with (nolock)
where ClientAgentId=72383
    and ClientHostID=417408
The optimiser assumes that ClientAgentId and ClientHostId are independent and thus there will be very few, if any, records satisfying both predicates.  Essentially it is guessing that the number of rows returned will be 332/1200000 * 110/1200000.  It rounds the number up to 1.   Here is the query plan without statistics.  Notice how the Estimated Number of Rows is 1.  Actually, when it says 1, it typically is estimating <1 records, but it always rounds up to 1.

Now, what would happen if we create statistics on the two predicate columns?
create statistics StatsAgentIdHostId
  on internetlogtable (ClientHostID,ClientAgentID)

SQL Server's optimiser now has data on the combinations of ClientHostId and ClientAgentId and finds that this particular combination will return about 110 rows.  Here is the query plan.  Notice that the Estimated Number of Rows is now 110.96!  This is much more accurate.

In the query example I have used here, the query plans are different, although not  dramatically different.  However, if this table was joined with a few others, some of which very large, the query plan can change dramatically and make the difference between a sub second query and a several minute query.

Understanding these subtleties of the SQL Server query optimiser will help you tune queries in an efficient manner.  If you were not aware of this technique to create bespoke statistics, you might solve this issue with a new index on the two columns, which would solve this query performance but at a great cost to overall performance.

By the way, here is an illustration of what statistics SQL Server keeps.  There is a lot of information.  For example, there are even as many key ranges as can fit in an 8KB page with the number of distinct keys, equal keys and rows in the range.  Notice the All density column in the second set of statistics.  It is telling us that a single value for ClientHostId, will on average return 0.0001507841 of the total rows.  It also tells us that ClientHostId and ClientAgentId will, on average return 0.0001226091 of the rows.  Not much less!  Ie, if you have the ClientHostId, the ClientAgentId does not filter the rows much more.

I encourage you to get to know the SQL optimiser better.  The more you know about the optimiser, the more you will be able to tune your queries.  And the way to get to know the optimiser is through the query show plan.

No comments: