CMT Utilization
From Siwiki
UltraSPARC T1 Utilization
With the introduction of the Sun Fire T2000/T1000 servers, built on the UltraSPARC T1 processor, Sun has taken a radically different approach to building scalable servers. The UltraSPARC T1 processor is best thought of as a system on a chip. To understand the performance of any system, we need to start by understanding its CPU utilization. Let us see how software and hardware thread scheduling is done on UltraSPARC T1, why conventional tools like mpstat don't show the complete picture, and what CPU utilization really means for the T1 processor. While thinking about this issue, I wrote "corestat", a new tool to monitor the core utilization of the T1 processor, and I will discuss its use too.
Let us start with an overview of the basic concepts, which explains why CPU utilization needs to be addressed separately for UltraSPARC T1's CMT architecture.
CMT and UltraSPARC T1 at a glance
The UltraSPARC T1 processor combines Chip Multiprocessing with Chip Multithreading. The processor consists of eight cores with four hardware threads per core. Each core has one integer pipeline, which the four threads of that core share. There are two types of shared resources on the processor. The four threads of a core share that core's Level 1 (L1) instruction and data caches as well as its Translation Lookaside Buffer (TLB), and all the cores share the on-chip Level 2 (L2) cache. The L2 cache is a 12-way set associative unified (instruction and data combined) cache.
Thread scheduling on UltraSPARC T1
The Solaris Operating System kernel treats each hardware thread of a core as a separate CPU, which makes the T1 processor look like a 32-CPU system. In reality it is a single physical processor with 32 virtual processors. Conventional tools like mpstat and prtdiag report 32 CPUs on T1. Solaris schedules software threads onto these virtual processors (hardware threads) much as it would on a conventional SMP system. At any given time one software thread is mapped to one hardware thread, and it stays on that hardware thread until its time quantum expires or it is pre-empted by a higher priority software thread.
The hardware scheduler decides which of the hardware threads sharing a core uses the pipeline. Every cycle the hardware thread scheduler switches threads within a core, so that each hardware thread gets to run at least every fourth cycle. There are two specific situations in which a hardware thread can run for more than one cycle out of four consecutive cycles: when another hardware thread in the same core becomes idle or gets stalled.
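The per-cycle thread selection described above can be sketched as a toy model. This is only an illustration: the simple round-robin policy and the function name `schedule` are assumptions, since the text does not describe the actual selection logic of the hardware.

```python
def schedule(ready_by_cycle):
    """Pick one hardware thread per cycle, round-robin over the ready threads.

    ready_by_cycle: list of sets; each set holds the thread ids (0-3) of one
    core that are neither idle nor stalled in that cycle.
    Returns the chosen thread id for each cycle (None if no thread is ready).
    """
    chosen = []
    last = -1  # thread that used the pipeline most recently
    for ready in ready_by_cycle:
        pick = None
        # Scan the four threads, starting just after the last one that ran.
        for offset in range(1, 5):
            candidate = (last + offset) % 4
            if candidate in ready:
                pick = candidate
                break
        chosen.append(pick)
        if pick is not None:
            last = pick
    return chosen


# All four threads ready: each one runs every fourth cycle.
print(schedule([{0, 1, 2, 3}] * 8))  # [0, 1, 2, 3, 0, 1, 2, 3]

# Threads 1 and 3 idle or stalled: threads 0 and 2 run every second cycle.
print(schedule([{0, 2}] * 8))        # [0, 2, 0, 2, 0, 2, 0, 2]
```

In this model, the slots of threads that are not ready are simply handed to the next ready thread, which is why the remaining threads run more often than once in four cycles.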
What does an “idle” hardware thread mean on UltraSPARC T1
Conventionally the kernel considers a processor idle when there is no runnable thread in the system that can be scheduled on it. On previous generation SPARC processors, an idle state meant that the processor's pipeline remained unused. On a CMT processor like T1, if there are not enough runnable threads in the system, one or more hardware threads in a core remain idle.
The main differences in behavior between an idle virtual processor (hardware thread) of T1 and an idle CPU in a conventional SMP system are:
- A hardware thread becoming idle doesn't mean the entire core becomes idle. The processor core still continues to execute instructions on behalf of the other threads in the core.
- The Solaris kernel has been optimized for the T1 processor so that when a hardware thread becomes idle, it is parked. A parked thread is taken out of the mix of threads available in a core for scheduling. Its time slice is allocated to the next runnable thread from the same core.
- An idle (parked) thread doesn't consume any cycles on UltraSPARC T1. On a non-CMT SPARC processor based system, an idle processor executes an idle loop in the kernel.
- A hardware thread becoming idle doesn't necessarily reduce core utilization. It also doesn't slow down the other threads sharing the same core. Core utilization depends on how efficiently the running threads can execute their instructions.
- mpstat on Sun Fire T2000 reports an idle thread in the same way as it reports an idle CPU in a conventional SMP system.
- On a conventional system the idleness of a processor is inherently linked to the idleness of the system. On UltraSPARC T1 one or more hardware threads can be idle while the processor is still executing instructions at reasonable capacity. These two aspects are not directly related.
- Only when all four threads of a core become idle does that core become idle and its utilization drop to zero.
What does a “stalled” thread mean on UltraSPARC T1
On a T1 processor, when a thread stalls on a long latency instruction (such as a load that misses in the cache), it is taken out of the mix of schedulable threads, allowing the next ready-to-run thread from the same core to use its time slice. As on conventional processors, a stalled thread on T1 is reported as busy by mpstat. On conventional processors a stalled thread (e.g. one waiting on a cache miss) occupies the pipeline and hence results in low system utilization. On T1 the core can still be utilized by other runnable threads that are not stalled.
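To see why a thread that mpstat reports as busy may contribute little to core utilization, here is a minimal simulation sketch of one core. The workload shape (issue `run_len` instructions, then stall for `stall_len` cycles), the round-robin selection and the function name `simulate_core` are all assumptions made for illustration, not a description of the real hardware.

```python
def simulate_core(n_busy_threads, run_len, stall_len, cycles=10_000):
    """Return (mpstat_busy_pct, core_util_pct) for one 4-thread core.

    Each busy hardware thread repeats: issue `run_len` instructions, then
    stall for `stall_len` cycles (for example, waiting on a cache miss).
    The remaining hardware threads of the core are idle (parked).
    """
    remaining = [run_len] * n_busy_threads  # instructions left before the next stall
    stalled_until = [0] * n_busy_threads    # first cycle in which each thread is ready again
    issued = 0
    last = -1
    for cycle in range(cycles):
        # Give the pipeline to the next thread that is not stalled, if any.
        for offset in range(1, n_busy_threads + 1):
            t = (last + offset) % n_busy_threads
            if stalled_until[t] <= cycle:
                issued += 1
                last = t
                remaining[t] -= 1
                if remaining[t] == 0:
                    remaining[t] = run_len
                    stalled_until[t] = cycle + 1 + stall_len
                break
    # mpstat-style view: a hardware thread with a software thread on it is
    # "busy" whether or not it is stalled.
    mpstat_busy_pct = 100.0 * n_busy_threads / 4
    # Core view: fraction of cycles in which the pipeline issued an instruction.
    core_util_pct = 100.0 * issued / cycles
    return mpstat_busy_pct, core_util_pct


# Four threads that stall 9 cycles after every instruction:
print(simulate_core(4, run_len=1, stall_len=9))  # (100.0, 40.0)

# One thread that never stalls, three threads parked:
print(simulate_core(1, run_len=1, stall_len=0))  # (25.0, 100.0)
```

The first case looks fully busy to mpstat while the core runs at 40% of its peak; the second looks 75% idle to mpstat while the pipeline is saturated.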
Understanding processor utilization
On a T1 processor, a thread being idle and a core being idle are two different things and hence need to be understood separately. Here are some commonly asked questions in this regard:
We already have vmstat and mpstat, so why do we need to think about anything else?
On UltraSPARC T1, Solaris tools like mpstat only report the state of each hardware thread and don't show the core utilization. Conventionally, if a processor is not idle it is considered busy. A stalled processor is also conventionally considered busy, because on non-CMT processors the pipeline of a stalled processor is not available to other runnable threads in the system. However, on a T1 processor a stalled thread doesn't mean a stalled pipeline. On a T1 processor, vmstat and mpstat output should really be interpreted as a report of pipeline occupancy by software threads. On non-CMT processors, the idle time reported by mpstat or vmstat can be used to decide whether to add more load to the system. On a CMT processor like T1, we also need to look at the core utilization before making the same decision.
How can we understand core utilization on UltraSPARC T1 if mpstat doesn't show it?
Core utilization of a T1 corresponds to the number of instructions executed by that core. Solaris provides cpustat, a tool to monitor system behavior using hardware performance counters. The T1 processor has two hardware performance counters per thread (there are no core-specific counters). One of the counters always reports the instruction count, and the other can be programmed to measure other events such as cache misses and TLB misses. A typical cpustat command looks like:
cpustat -c pic0=L2_dmiss_ld,pic1=Instr_cnt 1
This reports the L2 cache data misses and the instructions executed in user mode, at 1 second intervals, for all the enabled threads.
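As a rough sketch of how such output could be consumed by a script, the following parser extracts the two counters per virtual CPU. It assumes the usual cpustat column layout of `time cpu event pic0 pic1` with `tick` as the event name for interval samples; the sample text uses made-up numbers for illustration only, and `parse_cpustat` is a hypothetical helper name.

```python
def parse_cpustat(text):
    """Yield (time_s, cpu_id, pic0, pic1) tuples from cpustat-style output."""
    for line in text.splitlines():
        fields = line.split()
        # Keep only interval sample lines: time cpu "tick" pic0 pic1.
        # The header line and any other event lines are skipped.
        if len(fields) == 5 and fields[2] == "tick":
            yield float(fields[0]), int(fields[1]), int(fields[3]), int(fields[4])


# Made-up sample in the assumed layout (pic0 = L2_dmiss_ld, pic1 = Instr_cnt).
sample = """\
   time cpu event      pic0      pic1
  1.008   0  tick      1200    350000
  1.008   1  tick       800    410000
"""

for time_s, cpu_id, l2_misses, instructions in parse_cpustat(sample):
    print(cpu_id, l2_misses, instructions)
```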
Corestat
Corestat is a new tool for online monitoring of core utilization. It reports utilization for all the available cores by aggregating the instructions executed by all the threads in each core. It is a Perl script that forks the cpustat command at run time and then aggregates the instruction counts to derive the core utilization. A T1 core can execute at most 1 instruction per cycle, so the maximum instruction rate of a core equals the processor frequency, and core utilization is the measured instruction rate expressed as a percentage of that maximum.
Corestat can be downloaded from here.
Corestat can be used in two modes:
- Online monitoring, which requires root privileges. This is the default mode of operation. The default reporting interval is 10 sec and the assumed frequency is 1200 MHz.
- Post-processing of already sampled cpustat data to report core utilization.
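The aggregation that corestat performs can be approximated as follows. This is a minimal sketch, not the actual corestat code: the function name `core_utilization` is hypothetical, and it assumes that virtual CPU ids are numbered so that `cpu_id // 4` gives the core id. The defaults mirror the 10 sec interval and 1200 MHz frequency mentioned above.

```python
from collections import defaultdict

THREADS_PER_CORE = 4  # four hardware threads per T1 core


def core_utilization(instr_by_cpu, interval_s=10, freq_mhz=1200):
    """Return {core_id: utilization_pct} from per-virtual-CPU instruction counts.

    instr_by_cpu: {cpu_id: instructions executed during the interval}.
    Assumption: virtual CPUs 0-3 belong to core 0, 4-7 to core 1, and so on.
    """
    # Peak of 1 instruction per cycle, so the maximum is cycles per interval.
    max_instr = freq_mhz * 1_000_000 * interval_s
    per_core = defaultdict(int)
    for cpu_id, instr in instr_by_cpu.items():
        per_core[cpu_id // THREADS_PER_CORE] += instr
    return {core: 100.0 * total / max_instr
            for core, total in sorted(per_core.items())}


# Illustrative counts for a 10 sec interval at 1200 MHz (12e9 cycles per core).
counts = {
    0: 1_200_000_000, 1: 1_200_000_000, 2: 1_200_000_000, 3: 1_200_000_000,
    4: 6_000_000_000,
}
print(core_utilization(counts))  # {0: 40.0, 1: 50.0}
```

Note that core 1 reaches 50% with a single active thread, while core 0 needs all four threads to reach 40%.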
Usage :
$ corestat
corestat : Permission denied. Needs root privilege...
Usage : corestat [-v] [-f <infile>] [-i <interval>] [-r <freq>]
-v : Report version number
-f infile : Filename containing sampled cpustat data
-i interval : Reporting interval in sec (default = 10 sec)
-r freq : CPU frequency in MHz (default = 1200 MHz)
# corestat
Core Utilization
CoreId   %Usr   %Sys  %Total
------  -----  -----  ------
0       16.23  18.56   34.80
1       26.09  13.42   39.52
2       28.97  11.47   40.44
3       28.63  11.74   40.38
4       29.18  12.95   42.13
5       29.25  11.31   40.56
6       29.10  15.96   45.06
7       23.97  12.55   36.51
------  -----  -----  ------
Avg     26.43  13.50   39.92
mpstat data for the same period from the same system looks like :
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 2 0 4191 7150 6955 1392 93 374 573 14 1433 78 22 0 0
1 2 0 179 11081 10956 1180 132 302 1092 13 1043 79 21 0 0
2 1 0 159 9524 9388 1085 141 261 1249 14 897 79 21 0 0
3 0 0 3710 10540 10466 621 231 116 1753 2 215 70 29 0 0
4 5 0 28 355 1 2485 284 456 447 30 2263 77 23 0 0
5 5 0 25 350 1 2541 280 534 445 26 2315 78 22 0 0
6 3 0 26 331 0 2501 267 545 450 28 2319 78 22 0 0
7 2 0 30 292 1 2390 232 534 475 23 2244 77 22 0 0
8 4 0 22 265 1 2188 220 499 429 26 2118 75 25 0 0
9 2 0 28 319 1 2348 258 513 440 26 2161 76 24 0 0
10 4 0 23 308 0 2384 259 514 430 22 2220 76 24 0 0
11 4 0 27 292 0 2366 237 518 438 30 2209 77 23 0 0
12 11 0 31 314 0 2446 253 530 458 27 2290 78 22 0 0
13 4 0 31 273 1 2334 223 523 428 25 2261 79 21 0 0
14 12 0 29 298 1 2405 247 521 435 25 2286 78 22 0 0
15 4 0 32 330 1 2445 272 526 450 24 2248 77 22 0 0
16 5 0 28 271 0 2311 219 528 406 29 2188 76 23 0 0
17 4 0 23 309 1 2387 253 537 442 25 2234 78 22 0 0
18 3 0 25 312 1 2412 257 534 449 26 2216 78 22 0 0
19 3 0 29 321 1 2479 262 545 462 31 2287 78 22 0 0
20 14 0 29 347 0 2474 289 541 457 24 2253 78 22 0 0
21 4 0 29 315 1 2406 259 534 469 24 2240 77 22 0 0
22 4 0 27 290 1 2406 243 531 480 25 2258 77 22 0 0
23 4 0 27 286 1 2344 235 531 445 26 2240 77 22 0 0
24 3 0 30 279 0 2292 228 518 442 22 2160 77 23 0 0
25 3 0 26 275 1 2340 227 538 448 25 2224 76 23 0 0
26 4 0 22 294 1 2349 247 529 479 26 2197 77 23 0 0
27 4 0 27 324 1 2459 270 544 476 25 2256 77 23 0 0
28 4 0 25 300 1 2426 249 549 461 27 2253 77 23 0 0
29 5 0 27 323 1 2463 269 541 447 23 2277 77 22 0 0
30 2 0 27 289 1 2386 239 535 463 26 2222 77 23 0 0
31 3 0 29 363 1 2528 304 525 446 26 2251 76 23 0 0
Here we can see that each core is executing at roughly 35% to 45% of its maximum capacity (about 40% on average). Interestingly, mpstat output for the same period shows that all the virtual CPUs are 100% busy. Together these show that, in this particular case, even 100% busy threads cannot utilize any of the cores to its maximum capacity because of stalls.
From corestat data we can get an idea of the absolute capacity of the core that is available for more work or performance. A higher percentage of core usage means that the core is getting saturated and has less head room for processing more load. It also means that the pipeline is being used more efficiently. However, lower core utilization doesn't simply mean there is room for more load: all the virtual CPUs can be 100% busy while core utilization is still low.
How to use core utilization data along with conventional stat tools
Core utilization (as reported by corestat above) and mpstat or vmstat output need to be used together to make decisions about system utilization.
Here is an explanation of a few commonly observed scenarios:
Vmstat reports 75% idle and core utilization is only 20% :
Since vmstat reports a large idle time and core usage is also low, there is head room for applying more load. Any performance gain from increasing the load will depend on the characteristics of the application.
Vmstat reports 100% busy and core utilization is 50% :
Since vmstat reports all threads as 100% busy, there is really no head room to schedule any more software threads. Hence the system is at its peak load. Low (i.e. 50%) core utilization indicates that the application is using only 50% of each core's capacity and the cores are not saturated.
Vmstat reports 75% idle but core utilization is 50% :
Since core utilization (50%) is higher than the thread utilization reported by vmstat (25% busy), this is an indication that the processor can be saturated with fewer software threads than there are hardware threads. It is also an indication of a low CPI application. In this case, scalability will be limited by core saturation, and adding more load after a certain point will not yield any more performance.
As with any other system, on Sun Fire T2000 more threads become busy and core utilization goes up as the load increases. Since thread saturation (i.e. virtual CPU saturation) and core saturation are two different aspects of system utilization, we need to monitor both simultaneously in order to determine whether an application is likely to saturate a core while using fewer threads than are available. In that case, applying additional load on the system will not deliver any more throughput. On the other hand, if all the threads are saturated but core utilization still shows head room, the application has stalls and is a high CPI application. Application level tuning, partitioning of resources using processor sets (psrset(1M)) or binding of LWPs (pbind(1M)) are some techniques that could improve performance in such cases.
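The scenarios above can be summarized as a small decision helper. This is only a sketch: the function name `interpret`, the returned labels and the 90% saturation threshold are assumptions chosen for illustration, not values taken from corestat or Solaris.

```python
def interpret(vmstat_idle_pct, core_util_pct, saturated_pct=90.0):
    """Classify a (vmstat idle %, core utilization %) pair into a scenario."""
    thread_busy_pct = 100.0 - vmstat_idle_pct
    if core_util_pct >= saturated_pct:
        # Cores are near their peak instruction rate.
        return "cores saturated: more load will not add throughput"
    if thread_busy_pct >= saturated_pct:
        # All virtual CPUs busy, but the pipelines are not full.
        return "threads saturated: peak load, high CPI application with stalls"
    if core_util_pct > thread_busy_pct:
        # Cores fill up faster than threads do.
        return "low CPI application: core saturation will limit scaling"
    return "head room: more load can be applied"


print(interpret(75, 20))  # head room: more load can be applied
print(interpret(0, 50))   # threads saturated: peak load, high CPI application with stalls
print(interpret(75, 50))  # low CPI application: core saturation will limit scaling
```

The three calls correspond to the three scenarios described earlier; a real decision would also take the characteristics of the application into account.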
