The best analogy I can draw: imagine your company has 1000 staff in a 20-floor building with 4 lifts, which can adequately house everyone. Your 1000 staff tend to drift in between 9 and 10 AM, and the 4 lifts can move them all to their desired floors within the hour. Occasionally staff will queue because the first lift is full, but the wait is rarely longer than one lift ride. During the day, demand on the lifts is lower and there are no queues.
Now, over time, your company grows to 2500 staff. You add another 30 floors to your building and extend the lifts to serve all 50 floors. The CEO is relieved that the building's capacity has been increased to house the new staff and that the lifts work across all 50 floors. However, no one has thought about the increased throughput the lifts now need. Between 9 and 10 AM, 2500 staff drift into the office and want a lift to their floor, but the 4 lifts can only move about 1200 staff per hour. The queues for the lifts build up and are not fully drained until after 11 AM. The company is now in trouble, with disgruntled staff, and finds it needs to ask people to come in earlier or later. There is also the hidden issue that the fire-exit stairways have throughput limits too: with 1000 staff the building could be evacuated in 15 minutes; with 2500 staff (and the same stairways) it will take over 40 minutes.
This is analogous to a SQL Server database growing over time: the SAN (playing the role of the lifts) is extended to hold more terabytes, but its throughput capacity (measured in megabytes per second of reading and writing) is not increased. The IO queues start to build up and IO times degrade. The SQL log picture above is an example where hundreds of IOs every minute are taking longer than 15 seconds to complete! This is an unbelievably dire situation that can cause SQL Server to shut down.
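Those long-IO warnings have a recognisable shape in the SQL Server error log. As a rough illustration, here is a short Python sketch that scans error-log lines for the "I/O requests taking longer than 15 seconds" message. The sample log line and file path are invented, and the regular expression is a simplification of the real message format, so treat it as a starting point rather than a robust parser.

```python
import re

# Simplified pattern for SQL Server's long-IO warning, which reads like:
# "SQL Server has encountered 217 occurrence(s) of I/O requests taking
#  longer than 15 seconds to complete on file [...] in database [...]"
LONG_IO = re.compile(
    r"encountered (\d+) occurrence\(s\) of I/O requests taking longer "
    r"than (\d+) seconds to complete on file \[(.+?)\]"
)

def scan_errorlog(lines):
    """Return (occurrences, seconds, file) for each long-IO warning found."""
    hits = []
    for line in lines:
        m = LONG_IO.search(line)
        if m:
            hits.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return hits

# Invented sample line in the style of the real warning.
sample = [
    "2023-04-01 09:12:03.44 spid8s SQL Server has encountered 217 "
    "occurrence(s) of I/O requests taking longer than 15 seconds to "
    "complete on file [E:\\SQLData\\MyDb_log.ldf] in database [MyDb].",
]
print(scan_errorlog(sample))
```

Even a crude scan like this, run against the error log, gives you an alarm for the dire case; the real goal, as below, is to catch the problem long before these messages appear.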
I am not advising that you monitor your SQL Server log for long-running IOs, although if you see them, you should act immediately. I advise you to monitor IO throughput, average latency, and queue length continually, using Perfmon or a related tool. You can detect a throughput capacity limit when MB/second tends to flatline while IO latency and queue lengths wax and wane. Unfortunately, Perfmon is not a complete tool, but it does have all the base information. There are many third-party tools, or you can do what I do and create your own by having Perfmon send records to a common SQL Server database and cube. You only need a few counters, recorded every 60 seconds or so. If the Perfmon database is on another server, there will be very little load on your production servers, and essentially no IO. Your Perfmon database could hold performance data for all your production servers, so that all the performance information you need is in one place. Also, when you have a performance issue, you have historical information on it and on earlier trends. Companies that don't continually record these metrics are flying somewhat blind, or, at least, flying with blinkers on. Here's an earlier blog on creating a Perfmon cube, although, unfortunately, the online demonstration is no longer available.
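To make the flatline pattern concrete, here is a minimal Python sketch of the detection idea: throughput (MB/second) pinned near a ceiling while average latency climbs. The thresholds and sample numbers are invented for illustration, not tuned recommendations; a real monitor would feed in per-minute values collected from Perfmon counters such as Disk Bytes/sec and Avg. Disk sec/Read.

```python
from statistics import mean

def flatline_with_rising_latency(samples, flat_tol=0.05, lat_rise=2.0):
    """Heuristic saturation check over (mb_per_sec, avg_latency_ms) samples.

    Throughput is 'flat' if its whole range stays within flat_tol of its
    mean, while latency in the second half of the window averages at least
    lat_rise times the first half. Thresholds are illustrative only.
    """
    mb = [s[0] for s in samples]
    lat = [s[1] for s in samples]
    flat = (max(mb) - min(mb)) <= flat_tol * mean(mb)
    half = len(lat) // 2
    rising = mean(lat[half:]) >= lat_rise * mean(lat[:half])
    return flat and rising

# Ten one-minute samples: MB/s pinned near 200 while latency climbs.
saturated = [(200, 8), (199, 10), (201, 15), (200, 22), (199, 35),
             (200, 60), (201, 90), (200, 140), (199, 210), (200, 320)]
print(flatline_with_rising_latency(saturated))  # flagged as saturated
```

The key design point is that neither signal alone is enough: flat throughput with flat latency may just be a steady workload, and rising latency with rising throughput may just be a busier morning. It is the combination, pinned throughput with degrading latency, that points at a capacity ceiling.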
This posting is not intended to provide answers to the IO throughput issue; the solutions can be many and varied. The fix will probably involve speeding up the storage system, but it might be to make the database more IO efficient. The most important point I can make is that you should be monitoring and able to recognise this issue. Also, don't let a storage person tell you that you have hit the maximum possible throughput. There are always ways to increase the throughput performance of your storage, just as there are always ways to improve your database performance.