DTrace Topics Overhead
From Siwiki
Contents |
[edit] DTrace Topics: Overhead
This article is about the overheads incurred when using DTrace, and is part of the DTrace Topics collection. A general understanding of DTrace is assumed knowledge, which can be studied from the DTrace Topics: Intro section.
DTrace is a dynamic troubleshooting and analysis tool first introduced in the Solaris 10 and OpenSolaris operating systems.
Completion
Difficulty
Audience All DTrace users
[edit] Overheads
This article discusses the overheads incurred when using DTrace. Understanding overheads can be important for:
- DTrace in production - we want to use DTrace to measure performance issues, not create them
- Measuring times - understanding how much DTrace overheads will skew the nanosecond timestamps that it measures
DTrace has been designed to operate with low overheads when enabled, and zero overheads when not. DTrace also features saftey measures, including an automatic abort if it detects it is degrading system performance.
Despite these measures, DTrace is a powerful tool that can noticably degrade application performance if used in certain ways. By learning how these overheads happen, you can avoid them or take their effects into account.
[edit] truss
The Solaris truss command is a nifty tool to trace process system calls (and much more):
$ truss date
execve("/usr/bin/date", 0x08047D94, 0x08047D9C) argc = 1
resolvepath("/usr/bin/date", "/usr/bin/date", 1023) = 13
resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
xstat(2, "/usr/bin/date", 0x08047B88) = 0
open("/var/ld/ld.config", O_RDONLY) = 3
fxstat(2, 3, 0x08047AC0) = 0
mmap(0x00000000, 148, PROT_READ, MAP_SHARED, 3, 0) = 0xD2FD0000
close(3) = 0
sysconfig(_CONFIG_PAGESIZE) = 4096
xstat(2, "/lib/libc.so.1", 0x08047440) = 0
[...]
While DTrace has similar features, DTrace isn't truss.
truss works by sending control messages to a process's /proc/.../lwpctl file, to stop lwp execution as it enters and exits system calls. This is effectively setting sychronous breakpoints so that the target process context-switches off the CPU for every syscall event, and truss context-switches on to read process details (twice for every syscall). All that extra context-switching is very bad for performance, and scales with the number of syscalls performed.
[edit] DTrace
You might have noticed by now that the output of DTrace can be shuffled slightly (check these timestamps carefully):
# dtrace -n 'syscall::getpid:entry { trace(timestamp); }'
dtrace: description 'syscall::getpid:entry ' matched 1 probe
CPU ID FUNCTION:NAME
0 74351 getpid:entry 134180685407613
0 74351 getpid:entry 134180686354892
0 74351 getpid:entry 134180689017893
0 74351 getpid:entry 134180707138284
0 74351 getpid:entry 134180711908959
1 74351 getpid:entry 134180681371879
1 74351 getpid:entry 134180682762836
1 74351 getpid:entry 134180686545405
1 74351 getpid:entry 134180707596696
1 74351 getpid:entry 134180710348800
^C
When the CPU column changes from 0 -> 1, the timestamp jumps backwards slightly. This is normal, and easily rectified by post sorting output.
DTrace operates in the kernel and places traced output in per-CPU kernel buffers as the events happen (synchronously). The overheads to do this are minimum, since we aren't context-switching to a user-land thread, and it uses a per-CPU buffer to prevent multi CPU conflicts. Then, at a (configurable) rate of 1 Hertz, the user-land /usr/sbin/dtrace consumer reads the contents of the CPU buffers one by one, printing them out. Hense the jump in time as the CPU column changes.
The /usr/sbin/dtrace is printing the events asynchronously at a comfortable rate of 1 Hertz; unlike truss, which prints events synchronously and will step on the CPU as fast as the application make system calls.
[edit] Not Enabled
The not enabled overhead of DTrace is either zero or near zero.
[edit] Dynamic Probes
Thousands of dynamic probes available with DTrace, from providers such as fbt and pid.
DTrace's dynamic probes have zero overhead when not enabled. How is this possible? Adding any feature must come at some cost, even if it only a few CPU instructions (mov, cmp, je). Well, it turns out that this is true. Lets see how:
No DTrace,
# mdb -k Loading modules: [ unix genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs random ip hook neti sctp arp usba s1394 nca ufs lofs audiosup md cpc crypto nfs fcp fctl logindmux ptm sppp ipc ] > bdev_strategy::dis bdev_strategy: pushq %rbp bdev_strategy+1: movq %rsp,%rbp bdev_strategy+4: subq $0x10,%rsp bdev_strategy+8: pushq %r12 [...] bdev_strategy+0x79: popq %r12 bdev_strategy+0x7b: leave bdev_strategy+0x7c: ret
DTracing fbt::bdev_strategy:entry,
> bdev_strategy::dis bdev_strategy: int $0x3 bdev_strategy+1: movq %rsp,%rbp bdev_strategy+4: subq $0x10,%rsp bdev_strategy+8: pushq %r12 [...] bdev_strategy+0x79: popq %r12 bdev_strategy+0x7b: leave bdev_strategy+0x7c: ret
The above show the bdev_strategy kernel function disassembeled (AMD64), both without and with DTrace enabled. Understanding the CPU instructions is not important for now, the outputs have been included so that you can see the following:
- Without DTrace - the function runs as normal
- With DTrace enabled - the start of the funcion has been modified
So without DTrace, functions run normally; only when DTrace is enabled do we see something different - DTrace patches the function to call a DTrace routine when the function is executed.
[edit] Static Probes
A few hundred static probes are available from DTrace, and more can be added to application code. Static providers include sdt and io.
The overhead of static probes is near zero when not enabled. To understand this further, there is a static probe in bdev_strategy called io:::start for us to examine:
No DTrace,
> bdev_strategy::dis [...] bdev_strategy+0x54: nop bdev_strategy+0x55: nop bdev_strategy+0x56: nop bdev_strategy+0x57: nop bdev_strategy+0x58: nop [...]
DTracing io:::start,
> bdev_strategy::dis [...] bdev_strategy+0x54: nop bdev_strategy+0x55: lock nop bdev_strategy+0x57: nop bdev_strategy+0x58: nop [...]
When not enabled, static probes exist in code as a few nop instructions (no-operation - a dummy CPU instruction that does nothing, fast). When enabled, these nop instructions are changed to call DTrace.
The not-enabled overheads of static probes is that of adding a few nop instructons, which usually causes a negligible performance impact. Check the code path they exist in, and try not to add static probes to tight loops.
[edit] Enabled
When DTrace is enabled, CPU overheads are incurred as the probes fire. This section describes the degree of these overheads, and provides some percentage values as approximations (your overheads may vary).
In general, a probe event rate of less than 1000 per second is likely to have negligible impact.
[edit] Fixed Rate
The profile provider can be used to sample at a fixed rate, eg 1000 Hertz. The CPU cost at 1000 Hertz or less will be very low, close to 0% CPU (in testing, 0.1% CPU). Since the rate is fixed, we know beforehand what the impact would be, making such scripts the safest to use in performance sensitive environments such as production servers.
Fixed-rate sampling at 1001 Hertz,
# dtrace -n 'profile:::profile-1001 { @[execname] = count(); }'
dtrace: description 'profile:::profile-1001 ' matched 1 probe
^C
dtwm 1
rpc.rstatd 2
dtrace 3
gnome-terminal 26
ssh 82
Xorg 112
sched 1703
Of course, sampling introduces the possibility of sampling errors. A key feature of DTrace is the ability to trace events, not just sample them.
[edit] Demand Rated
When tracing events with DTrace, the impact depends on the rate of probe events per second, which depends on the frequency of the event being traced. Rates less than 1000 per second should cause little performance impact. Some examples:
- Tracing “slow” disk events may cost less than 0.2% CPU.
- Tracing process creation would expect even fewer events, costing closer to 0.0% CPU.
- Tracing very rapid application function events can cost over 10% CPU,
- eg, every function entry and return from Xorg's text segment, at over 6000 events per second.
Tracing Xorg's function entries and returns,
# dtrace -n 'pid$target:a.out::entry,pid$target:a.out::return' -p `pgrep Xorg` dtrace: description 'pid$target:a.out::entry,pid$target:a.out::return' matched 8478 probes CPU ID FUNCTION:NAME 0 40548 SmartScheduleTimer:entry 0 58313 SmartScheduleTimer:return 0 40006 WakeupHandler:entry 0 38946 xf86Wakeup:entry 0 39273 xf86VTSwitchPending:entry 0 57040 xf86VTSwitchPending:return 0 56713 xf86Wakeup:return 0 40004 NoopDDA:entry 0 57769 NoopDDA:return 0 57771 WakeupHandler:return 0 40530 GetTimeInMillis:entry 0 58295 GetTimeInMillis:return 0 41350 AccessXRepeatKeyExpire:entry 0 59115 AccessXRepeatKeyExpire:return 0 40530 GetTimeInMillis:entry 0 58295 GetTimeInMillis:return 0 58211 WaitForSomething:return 0 40446 WaitForSomething:entry 0 40530 GetTimeInMillis:entry 0 58295 GetTimeInMillis:return [...fast scrolling output...]
Tip: Fast scrolling output consumes CPU. Do you really want that much output anyway? Try summarizing data with DTrace instead, which will reduce the performance overhead.
