Taking the Guesswork out of SQL Server Performance Profiling Part 2
18 May 2007
Wait Event tracing can be a very powerful diagnostic technique. It can show you, for example, exactly who is taking a particular exclusive latch (LATCH_EX) and why. Even just seeing the sequence of wait events can be interesting and revealing. In this article I demonstrate some more powerful, and slightly more complex, examples of where time can go, and where to look for it. We will examine SQL and Windows OS scheduling latency, and concurrency for scheduler resources, using both regular and parallel queries.
NOTE:
If you need some further reference and background on wait event-based performance profiling please refer to my previous article and the references at the end of that article. In particular, two ex-members of Microsoft's SQL Server Customer Advisory Team, Tom Davidson and Gert Drapers, have been presenting and writing on this subject since 2003. For research purposes I still maintain my own external stored procedures on www.sqlinternals.com; I encourage you to experiment with them, and learn how SQL statements are actually executed and how the underlying mechanisms work.
As mentioned briefly in the previous article, the user mode scheduler (UMS) takes care of scheduling 'work' or a 'task' on a worker thread. This can be a real OS thread (sometimes called a system thread) or, when running in fiber mode (lightweight pooling enabled), a fiber or user mode thread.
I want to make it clear that 'the scheduler' is not a separately working entity or thread controlling everything; it's just a code path executed by the task that's currently running in the context of that scheduler. So 'the scheduler' is a logical concept, not a physical thing.
Normally there is one scheduler per CPU (or core), and the scheduler attempts to spread the workload between worker threads. These worker threads can run some code, start waiting when a waitable event occurs, resume work, or start another task running on another thread on the same scheduler and go to sleep until the resource wait is finished.
By setting the 'affinity mask' option, each scheduler can be pinned to its own CPU. Setting an affinity mask can be beneficial for reuse of CPU caches, TLBs (the (page) address translation look-aside buffers) and other CPU- or cache-related resources. On my dual core test box it's interesting to observe how the Windows OS scheduler decides where it runs a SQL Server thread. If I don't set the affinity mask, then the Windows OS scheduler evenly spreads the 'CPU workload' across the two cores. If I set the affinity mask to use both CPUs (cores in this case) then Windows will pin each scheduler's workload to one CPU. For a short running query I cannot measure a noticeable difference, but for high-end workloads I know that it can be beneficial.
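As a sketch, binding the schedulers to the first two CPUs on a 2-core box could look like this (the mask is a bitmap, so 3 = binary 11 = CPU 0 and CPU 1; 'affinity mask' is an advanced option):

```sql
-- 'affinity mask' is an advanced option, so expose it first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Mask 3 = binary 11 = pin SQL Server's schedulers to CPU 0 and CPU 1.
EXEC sp_configure 'affinity mask', 3;
RECONFIGURE;
```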
To demonstrate the impact of SQL Server and OS scheduler decisions, I want to show some query examples and statistics, and try to explain how to interpret the data.
In the first example a 100% CPU bound query, called killCPU, is run against a 10000 row table (t1). The scripts required to run this query are included with the code download for this article (link):
Here is the killCPU query:
use mytest
go
exec begin_schedstats
exec begin_waitstats
SELECT COUNT(*)
FROM t1 a
INNER LOOP JOIN t1 b
ON b.c1 = a.c1 option (maxdop 1)
exec end_waitstats
exec end_schedstats
For the sake of the tests, MAXDOP is set to 1 to force the query not to use a parallel query plan. I measured the results on my test box, an AMD Athlon64 X2 (dual core) running at 2.0GHz, with SQL Server 2005 SP2. This is the output:
Warning: The join order has been enforced because a local join hint is used.
-----------
10000
(1 row(s) affected)
wait_type                     waits      wait_time  signal_wait_time
----------------------------- ---------- ---------- ----------------

session_id CPU_time tot_sched_time elapsed_time PIO  writes LIO
---------- -------- -------------- ------------ ---- ------ -------
56         31281    31548          31574        0    20     291744

sched_id preemptive_switches context_switches idle_switches yield_count
-------- ------------------- ---------------- ------------- -----------
0        1                   22               59            157
1        0                   98               15            7861
The first set of columns of the above output (starting with the wait_type column) displays the differences between the beginning and end snapshots of sys.dm_os_wait_stats. As you can see, no rows are returned, indicating that no wait events were detected in this case.
The second set of columns, starting with session_id, displays the difference between the beginning and end snapshots from sys.dm_exec_sessions. From this output we can see:
291744 LIOs (logical I/Os) – the query performs a nested loop through all rows in t1. This results, according to sys.dm_exec_sessions, in almost 300,000 logical reads from cache. Logical I/O is purely CPU and memory bound, which corresponds to the 100% CPU utilization observed when running this query
31.3 seconds were spent on the CPU
31.5 seconds was the total scheduled time which, according to Books Online, represents the time the requests for this session were scheduled for execution
31.6 seconds was the total elapsed time (this time includes any wait time)
This means that while this query is running it consumes one full CPU (or core) for the length of the query: in this case 31 seconds.
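The begin_/end_ stored procedures used above are simply snapshot-and-diff wrappers around these DMVs. A minimal, hand-rolled version of the same idea (the names and output layout here are mine, not those of the procedures used in this article) might look like:

```sql
-- Snapshot sys.dm_os_wait_stats before the query under test...
SELECT * INTO #waits_before FROM sys.dm_os_wait_stats;

-- ... run the query to be measured here ...

-- ... then report only the wait types whose counters moved.
SELECT b.wait_type,
       b.waiting_tasks_count - a.waiting_tasks_count AS waits,
       b.wait_time_ms        - a.wait_time_ms        AS wait_time,
       b.signal_wait_time_ms - a.signal_wait_time_ms AS signal_wait_time
FROM sys.dm_os_wait_stats b
JOIN #waits_before a ON a.wait_type = b.wait_type
WHERE b.wait_time_ms > a.wait_time_ms;
```

The same snapshot/diff technique works for sys.dm_exec_sessions and sys.dm_os_schedulers, keyed on session_id and scheduler_id respectively.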
Interestingly, this same query runs in about 20 seconds on SQL Server 2000 SP4, for the same number of logical I/Os. This seems to indicate that doing the same amount of work for this query is more expensive on SQL Server 2005. Since we are comparing CPU time with CPU time, and no waits are involved, wait event analysis can't help us here. It is possible to investigate this issue further using CPU profiling tools, but that is beyond the scope of this article.
The final set of columns, starting with sched_id, shows the differences between the start and end snapshots of sys.dm_os_schedulers. It displays one row of data per regular scheduler; each row shows the preemptive switch, context switch and idle switch counts, and the count of scheduler yields.
On my dual core machine, we see two rows, indicating two schedulers (one scheduler per CPU core). The query ran on scheduler '1', since it apparently did 7861 scheduler yields. This yielding enables other tasks that are bound to this scheduler to get work done too.
All over the SQL Server engine code are yield calls; this yielding is necessary to prevent CPU bound processes from monopolizing the CPU. For example, if the killCPU query was allowed to run on the CPU without yielding and checking for other SQL Server work, it would run for 30 seconds without allowing any other work within the SQL Server instance. This is a side effect of the user mode scheduler of SQL Server, since the operating system itself doesn't care about SQL Server's scheduling; it does its own scheduling and it can suspend a SQL Server worker thread at any time. By taking a quick look at the SQL Server 2005 disassembly with a debugger, I found at least 325 locations where scheduler yield was called. SQL Server has a built-in 4 millisecond time quantum, which a task can use up before yielding.
When a task yields the scheduler, the scheduler code checks whether there are other runnable tasks on the runnable queue, and it also checks for pending I/O completions and timer expirations. The scheduler can then switch to another task or return to the original task. This behavior can be seen in the context_switches column. In this case, since only one query was running on the system, switching hardly ever happened and all yields quickly resulted in resuming execution of the killCPU query. You can tell this from the fact that there are no scheduler wait events registered. If there was other work for this scheduler, you would start to see waits for the scheduler (we will see this effect later).
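The counters discussed here can also be watched directly. For instance, the following query (a sketch, restricted to the regular visible schedulers) shows the yield and switch counters per scheduler:

```sql
SELECT scheduler_id,
       preemptive_switches_count,
       context_switches_count,
       idle_switches_count,
       yield_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';  -- only the regular, per-CPU schedulers
```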
For further information on the workings of the scheduler, I can recommend Sameer Tejani's Scheduling chapter in 'SQL Server 2005 Practical Troubleshooting'.
This covers the case of a single killCPU query running on an otherwise idle machine. However, what happens when we have two of these killer queries running at the same time? Two things can happen:
Each query gets its own scheduler, and thus its own CPU
The two killCPU queries can end up on the same scheduler
The former is preferable, to use an understatement. Fortunately the SQL Server 2005 scheduler is pretty smart nowadays. It maintains load average statistics per scheduler, and it decides on a task-by-task basis where each task will run. However, once a task is on a scheduler, it will not move, even if it turns out to be a bad decision to run it on the chosen scheduler.
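The load statistic the scheduler uses for these routing decisions is exposed in sys.dm_os_schedulers; a quick look (the column selection is mine) shows how busy each scheduler currently is:

```sql
SELECT scheduler_id,
       load_factor,           -- internal 'load average' used when routing tasks
       current_tasks_count,
       runnable_tasks_count,
       active_workers_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';
```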
With SQL Server 2000 the situation is worse: each connection is bound to a specific scheduler (the scheduler being chosen on a round robin basis). If you happen to come into the office and start your work, and your application connection is assigned to the same scheduler as a "bulk load from hell" that your co-worker started just before you…well, you'd better get some coffee (or restart your application, and hope it connects to another scheduler).
Let's start with an example of the 'good' case: two killCPU queries, one per scheduler, one per CPU.
Following are the results from the _exec_sessions and _os_schedulers views. For brevity and clarity I only present the statistics for one of the two queries:
session_id CPU_time tot_sched_time elapsed_time PIO writes LIO
---------- -------------------- -------------- ------------ ------------ -------
56 31390 31564 31572 0 19 291735
sched_id preemptive_switches context_switches idle_switches yield_count
----------- ------------------- ---------------- ------------- ------------
0 0 95 16 7545
1 0 164 14 7921
I observed the following when running these queries:
The elapsed and CPU times are still about 31 seconds.
Both tasks do roughly the same number of scheduler yields, as expected
You could see wait events for sos_scheduler_yield, but hardly ever above 0 ms
Perfmon and task manager report that both the CPUs were 100% busy.
Now the bad case, two killCPU queries on one scheduler and on one CPU.
As I noted earlier, SQL 2005 is pretty smart and it's pretty hard to get this to happen; just start up the queries in sequence, and wait and see where they are scheduled. On SQL Server 2000 it's much easier to achieve this: sessions are bound to schedulers round robin, so consecutive connections land on scheduler 0, scheduler 1, and so on.
Interestingly enough, what you will see in task manager depends on the CPU affinity setting in sp_configure. If you leave the affinity mask option at the default setting of '0', then the Windows scheduler will evenly divide the workload across two CPU's, and task manager will show each CPU 50% busy.
However, if you use an affinity mask of '3' (in this 2 CPU / core case) then each scheduler with its threads will be bound to one CPU. This results in a task manager picture where you will see one idle CPU and one CPU 100% busy.
In any event, the end result of the test doesn't depend on the affinity; in both cases the killCPU queries were bound to the same SQL Server scheduler and they had to context switch between each other.
Here are the results from the os_wait_stats, exec_sessions and os_schedulers views for one of the two killCPU sessions:
wait_type waits wait_time signal_wait_time
------------------------- ------ -------------------- ----------------
SOS_SCHEDULER_YIELD 15078 58766 58766
WRITELOG 4 15 15
session_id CPU_time tot_sched_time elapsed_time PIO writes LIO
---------- -------------------- -------------- ------------ ---- -------- ------
56 31219 31463 60915 6 27 291897
sched_id preemptive_switches context_switches idle_switches yield_count
----------- -------------------- -------------------- ------------- ------------
0 1 45 93 260
1 0 14631 16 15264
From these results we can see that:
We waited 58.8 seconds for sos_scheduler_yields
The elapsed time is now almost doubled to 60.9 seconds
CPU time for the query is still 31 seconds.
Both killCPU queries were on scheduler '1'.
It is important to reiterate that SQL Server doesn't maintain wait event statistics on a per session basis. So, the fact that we measured almost 60 seconds of wait does not mean that this session waited 60 seconds. In reality, we cannot tell which session was waiting for how long – just that there were 60 seconds of wait time overall.
Of course, in this simple experiment there were only these two queries running on the server, so we can safely deduce that each session waited about the same time: 30 seconds. Every time a session yielded the CPU, it resumed the running of the other query. So, we can conclude that the elapsed time of this query is made up of the 31 seconds of CPU time and the 30 seconds of wait time.
Although this is not a pretty sight, at least you can measure what is happening! The phenomenon observed in these results is called SOS scheduler pressure. It wasn't possible to measure this in SQL Server 2000 at all. By using DBCC sqlperf(umsstats) you could see that a runnable queue existed, but you could not tell how bad it was, or how much of your valuable response time you were losing waiting for busy schedulers (or rather, for worker threads on those schedulers).
Although SQL 2005 wait event statistics are useful, SQL Server maintains these statistics per scheduler and exposes them accumulated for the whole server. Therefore they lose their value if you need to look at the work done by an individual session (unless you are doing isolated testing), and they also cannot be used to examine individual tasks.
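SQL Server 2005 does at least expose the waits that are in progress right now on a per-task basis, via sys.dm_os_waiting_tasks. Polling it (the session_id below is just an example) gives a rough, sampled view of what an individual session is currently waiting on:

```sql
SELECT session_id,
       exec_context_id,
       wait_type,
       wait_duration_ms,
       blocking_session_id,
       resource_description
FROM sys.dm_os_waiting_tasks
WHERE session_id = 56;  -- example spid; substitute the session under test
```

This is sampling rather than accounting, so it cannot replace true cumulative per-session wait statistics.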
WARNING!
These tools are implemented using an external stored procedure (a dll) and are invasive: they modify 'in memory' SQL Server code. This is totally unsupported in a production environment, and neither the author nor his employer can be held responsible for any damage, nor can they provide support. Having said this, the tools have been tested in a developer environment, and I've used the 2000 tools in heavily loaded benchmark environments without damaging any data (to my knowledge), or causing significant slowdown of the application. Check the readme.txt on how to install and use the tools.
In the following section I will demonstrate how the granular data that my tools provide can be used to illustrate the mechanics of time consumption in SQL Server. This sort of data is especially useful when running multiple queries, or queries that utilize sub tasks: i.e. parallel queries.
I use the xp_sqlstats 'L' to load the external stored procedure, and I use either readwait from the command prompt, or the xp_waitstats stored procedure to show per session (and per execution context (EC)) wait statistics.
Here's what xp_waitstats shows for our single killCPU query:
Spid EC ResourceDescription Time(ms) Count SignalTime AvgTime Percentage
----- ---- --------------------- ----------- ------ ---------- ------- --------
58 0 Elapsed time 31186 0 0 0 n/a
58 0 CPU 31063 5 0 6212.6 100 %
58 0 SOS_SCHEDULER_YIELD 0 7726 0 0 0 %
58 0 Unaccounted for 123 0 0 0 0 %
And here are the stats for one session running two killCPU queries concurrently on one scheduler:
Spid EC ResourceDescription Time(ms) Count SignalTime AvgTime Percentage
----- ---- --------------------- ----------- ------- ---------- ------- ----------
57 0 Elapsed time 60678 0 0 0 n/a
57 0 CPU 31390 5 0 6278 52 %
57 0 SOS_SCHEDULER_YIELD 29171 7787 29171 3.74611 48 %
57 0 Unaccounted for 117 0 0 0 0 %
Here we can see directly why our session takes 60 seconds, instead of deducing it from server wide statistics: next to spending 31 seconds on the CPU, we lose 30 seconds waiting on the scheduler (i.e. due to SQLOS scheduler pressure).
Now let's introduce a new 'challenge': external CPU pressure. To introduce this 'external pressure', I use a Windows ResourceKit tool called CPUstress. With CPUstress, you can simulate CPU load, and give the threads applying the load different scheduler priorities or different 'busy' loads.
For the next example, I set CPUstress to use two threads at normal priority (just like SQL Server), one per CPU, each thread using maximum CPU cycles. Now both CPUs are 100% busy, so the SQL Server threads need to compete with threads external to SQL Server. Since this pressure is external to SQL Server, it cannot be measured by SQL Server.
This is what xp_waitstats now shows for our single killCPU query:
51 0 Elapsed time 66216 0 0 0 n/a
51 0 CPU 33656 1 0 33656 51 %
51 0 SOS_SCHEDULER_YIELD 0 8364 0 0 0 %
51 0 Unaccounted for 32560 0 0 0 49 %
The total elapsed time is 66 seconds – 33 seconds on the CPU, and another 33 seconds "Unaccounted for". I didn't see any waits happening: 'my' CPU cycles and response time are lost to an external consumer, and this is shown as 'Unaccounted for' time.
This is also reflected in the sys.dm_exec_sessions statistics. The tot_sched_time is now almost equal to elapsed_time, but the CPU_time is still about 33 seconds:
session_id CPU_time tot_sched_time elapsed_time PIO writes LIO
---------- -------------------- -------------- ------------ ---- ------ ---------
52 32500 62914 62927 0 29 291840
SQL Server doesn't know better: it thinks we are on the scheduler, and we were, in fact: our thread was waiting on the Windows OS run queue, waiting to be scheduled by Windows. The CPU time is still the 33 seconds because the CPU time shown by SQL Server is acquired directly from Windows (through the GetThreadTimes API call). Unfortunately the Windows OS doesn't maintain a counter for OS scheduling waits, unlike many UNIXes out there. So you can only deduce it from statistics such as those shown in the example. In real life, I would keep an eye on the run queue through perfmon, to see if the run queue stays high for prolonged periods.
Memory pressure is another way of losing measurable time. If your system is running out of memory and starts paging, you cannot see this from the SQL Server statistics; it's simply out of SQL Server's view, and the waitstats tools also report this as 'unaccounted for' time.
Another way to deduce that you are experiencing SQL Server scheduler pressure is to observe increasing signal wait times. The signal wait time is the time a task waited between the completion of the wait event and the moment it got on the scheduler again to continue its work. The busier a scheduler gets, the higher the signal wait times get. Unfortunately, during my tests I also observed that the busier the system, the higher all wait times get. This is (very likely) due to the fact that the maintenance of start, signal and stop times is done in the SQL Server code itself, so if a system is very busy it can take longer to process the end of a wait event.
This means that the same physical I/O timed at 10ms at the OS level can take 16ms in a PAGEIOLATCH_SH, for instance, because it takes time to get scheduled on the CPU, and you can only hope that your thread is not preemptively pushed off the CPU by a higher priority thread that is ready to run.
This is something to be aware of: on very CPU bound systems, sometimes the waits are not what they seem.
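A rough, server-wide indicator of this kind of pressure (a sketch; reasonable thresholds vary per system and workload) is the share of signal wait time in the total wait time:

```sql
-- High signal_wait_pct suggests tasks spend a long time waiting to get
-- back on a scheduler after their wait event completes.
SELECT SUM(signal_wait_time_ms) AS signal_wait_ms,
       SUM(wait_time_ms)        AS total_wait_ms,
       100.0 * SUM(signal_wait_time_ms)
             / NULLIF(SUM(wait_time_ms), 0) AS signal_wait_pct
FROM sys.dm_os_wait_stats;
```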
Let's take a look at an example. The scripts required to follow through are provided in the code download:
Here are the statistics collected using xp_waitstats, while the two scripts were running:
Spid EC ResourceDescription Time(ms) Count SignalTime AvgTime Percent
----- --- -------------------- ----------- ------- ---------- -------- -------
54 0 Elapsed time 101622 0 0 0 n/a
54 0 CPU 40531 4 0 10132.75 40 %
54 0 PAGEIOLATCH_SH 60390 8314 15 7.263652 59 %
54 0 SOS_SCHEDULER_YIELD 0 9785 0 0 0 %
54 0 ASYNC_NETWORK_IO 46 1 0 46 0 %
54 0 Unaccounted for 655 0 0 0 1 %
----- --- -------------------- ----------- -------- ---------- ------- -------
55 0 Elapsed time 80853 0 0 0 n/a
55 0 CPU 6219 3 0 2073 8 %
55 0 SOS_SCHEDULER_YIELD 0 1594 0 0 0 %
55 0 WAITFOR 74453 19 0 3918.57 92 %
55 0 Unaccounted for 181 0 0 0 0 %
As you can see, the session with spid=54 spends 40% of its time on the CPU and (almost) 60% waiting for I/Os. The other session, spid=55, spends 8% of its time on the CPU and 92% sleeping, waiting for a WAITFOR timeout to occur. You can also see yielding going on, but since both measurements were taken separately on an idle system, the yielding doesn't end up in waits but in continuing with the session itself. You can see that there is no (or hardly any) signal time.
Now, if I start a whole bunch of those scripts (12 'waitfor' scripts and 6 'random i/o' scripts), you will see that the signal times for the PAGEIOLATCH waits and the wait times for SOS_SCHEDULER_YIELD are suddenly way up (I only show 4 sessions for clarity):
Spid EC ResourceDescription Time(ms) Count SignalTime AvgTime Percent
----- ---- -------------------- ----------- -------- ---------- -------- -------
58 0 Elapsed time 107880 0 0 0 n/a
58 0 CPU 7234 3 0 2411.33 7 %
58 0 SOS_SCHEDULER_YIELD 14640 1977 14640 7.40515 14 %
58 0 WAITFOR 84828 19 46 4464.63 79 %
58 0 Unaccounted for 1178 0 0 0 1 %
59 0 Elapsed time 104190 0 0 0 n/a
59 0 CPU 6953 3 0 2317.66 7 %
59 0 SOS_SCHEDULER_YIELD 14218 1894 14218 7.50686 14 %
59 0 WAITFOR 82078 19 46 4319.89 79 %
59 0 Unaccounted for 941 0 0 0 1 %
65 0 Elapsed time 274599 0 0 0 n/a
65 0 CPU 39563 4 0 9890.75 14 %
65 0 SOS_SCHEDULER_YIELD 27515 12037 27484 2.28586 10 %
65 0 PAGEIOLATCH_SH 184750 8760 2296 21.0901 67 %
65 0 ASYNC_NETWORK_IO 156 5 31 31.2 0 %
65 0 Unaccounted for 22615 0 0 0 8 %
66 0 Elapsed time 271361 0 0 0 n/a
66 0 CPU 46656 4 0 11664 17 %
66 0 SOS_SCHEDULER_YIELD 29125 14472 29093 2.01250 11 %
66 0 PAGEIOLATCH_SH 168968 8732 2390 19.3504 62 %
66 0 ASYNC_NETWORK_IO 31 1 0 31 0 %
66 0 Unaccounted for 26581 0 0 0 10 %
69 0 Elapsed time 266655 0 0 0 n/a
69 0 CPU 39110 4 0 9777.5 15 %
69 0 PAGEIOLATCH_SH 181031 8688 2171 20.8369 68 %
69 0 ASYNC_NETWORK_IO 156 2 78 78 0 %
69 0 SOS_SCHEDULER_YIELD 24031 11886 23968 2.02179 9 %
69 0 Unaccounted for 22327 0 0 0 8 %
From this we learn that a serious workload can increase the overall signal wait times. So increasing signal wait times can be a sign of scheduler pressure: more or faster CPUs might help, as might better balancing or timing of heavy CPU-bound queries.
A different subject, but also related to scheduler usage, is parallel query. When you have multiple CPU's and thus multiple schedulers, it can happen that queries are parallelized by the optimizer. The next example shows a 'create clustered index' on SQL Server 2005 SP2, as seen from xp_waitstats:
56 0 Elapsed time 160535 0 0 0 n/a
56 0 CPU 516 4 0 129 0 %
56 0 SOS_SCHEDULER_YIELD 0 88 0 0 0 %
56 0 PAGEIOLATCH_SH 6781 1569 0 4.32186 4 %
56 0 CXPACKET 152984 3 0 50994.6 95 %
56 0 PAGEIOLATCH_EX 31 8 0 3.875 0 %
56 0 PAGELATCH_SH 0 1 0 0 0 %
56 0 WRITELOG 0 1 0 0 0 %
56 0 Unaccounted for 223 0 0 0 0 %
56 1 Elapsed time 153008 0 0 0 n/a
56 1 CPU 15282 5 0 3056.4 10 %
56 1 SOS_SCHEDULER_YIELD 15 2025 15 0.00740 0 %
56 1 PAGEIOLATCH_SH 20968 962 140 21.7962 14 %
56 1 PAGELATCH_UP 250 52 0 4.80769 0 %
56 1 IO_COMPLETION 112843 11889 15 9.49137 74 %
56 1 SLEEP_TASK 15 1 0 15 0 %
56 1 PAGEIOLATCH_UP 2984 28 0 106.571 2 %
56 1 LATCH_EX 312 5 0 62.4 0 %
56 1 CMEMTHREAD 0 3 0 0 0 %
56 1 CXPACKET 0 1 0 0 0 %
56 1 LATCH_SH 234 1 0 234 0 %
56 1 Unaccounted for 105 0 0 0 0 %
56 2 Elapsed time 151231 0 0 0 n/a
56 2 CPU 14422 5 0 2884.4 10 %
56 2 LATCH_EX 734 11 0 66.7272 0 %
56 2 SOS_SCHEDULER_YIELD 31 2027 31 0.01529 0 %
56 2 PAGEIOLATCH_SH 19890 897 125 22.1739 13 %
56 2 IO_COMPLETION 111265 11830 46 9.40532 74 %
56 2 PAGELATCH_UP 109 42 0 2.59523 0 %
56 2 SLEEP_TASK 562 36 0 15.6111 0 %
56 2 PAGEIOLATCH_UP 3671 31 0 118.419 2 %
56 2 QUERY_EXECUTION_INDE 15 1 0 15 0 %
56 2 CMEMTHREAD 0 2 0 0 0 %
56 2 CXPACKET 0 1 0 0 0 %
56 2 Unaccounted for 532 0 0 0 0 %
(35 row(s) affected)
For one session id, different 'EC' values appear. These are the different execution context ids for this one session, which means the query was parallelized. In this case ECID 1 ran on a different scheduler than ECID 2. ECID 0 spent most of its time in a CXPACKET wait: waiting for the sub tasks to finish their work. You can nicely see the distribution of wait time per sub task: 10% CPU time, around 14% waiting for physical reads, and 74% of the time waiting for IO_COMPLETION. By some tracing I found that this is related to sort/merge traffic from the build of the clustered index.
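The execution contexts of a parallel query can also be observed through the standard DMVs. For instance (the session_id is an example), sys.dm_os_tasks shows which scheduler each ECID landed on:

```sql
SELECT session_id,
       exec_context_id,
       scheduler_id,
       task_state,
       context_switches_count
FROM sys.dm_os_tasks
WHERE session_id = 56   -- example spid of the parallel query
ORDER BY exec_context_id;
```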
Hopefully, the examples presented here demonstrate why I feel that, while SQL 2005 wait event statistics are useful and a good step forward, it would be much better if SQL Server collected those statistics per session and per subtask.
The way these statistics are represented now means that they are only useful for isolated testing, or for getting a global impression of how the SQL Server instance is spending its time.