Thursday, November 26, 2009

When to use clustered indexes

I often found experienced database designers, new to SQL Server, will come with preconceived ideas about the optimiser and storage engine. One of these ideas is that clustered indexes should be used sparingly, since they come at a cost with limited benefit. With SQL Server, this is by and large untrue, and as a general rule, most tables will be better off with one of the indexes being clustered.

SQL Server (and Sybase) have historically had a structurally different way of maintaining clustered indexes. They actually use the index leaf pages to hold the data columns, which enables them to save space, reduce the number of pages touched in an index path, and use them to very efficiently sequence scan.

There is a useful performance analysis of clustered versus non-clustered PK index on This paper looks primarily at the cost and throughput of inserts, updates and deletes. By and large, the clustered indexes performed better. There is one significant exception, where there are high volume concurrent inserts, with keys in close proximity. In this situation, concurrent can compete for locks and latches with other active transactions with similar keys. By the way, this does not necessarily mean that you shouldn't use a clustered index for the table, rather, it might mean your clustered index should be on a key that won't create hot spots. For example tbTransaction might have contention with TransactionID clustered, but a clustered index on AccountID would be less contentious and might even be better serving queries.

One area that this paper doesn't look at, and in my view, is one of the most important reasons for clustering, is queries that join very large tables. If you are joining two or more very large tables, the optimiser will like to have the opportunity to Merge Join the larger tables. For example, a query that was to join tbTransaction with tbTransactionCharges and many other smaller tables is likely to benefit from having both large tables clustered in the same sequence. This could be TransactionId, or, perhaps more likely, AccountID. If queries on these two tables were often filtered by particular accounts (or tbAccount columns) then having all 3 tables clustered by AccountID (and also TransactionID or TransactionDate for the transaction tables) is likely to provide very good join performance.

We are very fortunate with relational databases, in that we can change the index structure without having to rewrite any application queries. However, I would encourage database designers to analyse in some detail the activity on the biggest tables, and design appropriate clustered indexes early on. Clustereed indexes are much more difficult to change at a later date than non clustered indexes. Also, for those smaller tables, make one of the indexes (typically the natural primary key) clustered.

For a real-time demonstration of a database application continuously processing about 300 records/sec see The Perfmon application is close to real-time and is processing about 300 perfmon records all day, every day, on commodity 32 bit (yes I really should upgrade) hardware.

No comments: