In my blogpost When the oracle wait interface isn’t enough I showed how a simple asynchronous direct path scan of a table was spending more than 99% of it’s time on CPU, and that perf showed me that 68% (of the total elapsed time) was spent on a spinlock unlock in the linux kernel which was called by io_submit().
This led to some very helpful comments from Tanel Poder. This blogpost is a materialisation of his comments, and tests to show the difference.
First take a look at what I gathered from ‘perf’ in the first article:
# Samples: 501 of event 'cpu-clock' # Event count (approx.): 501 # # Overhead Command Shared Object Symbol # ........ ....... .................. ................................... # 52.50% oracle [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore | --- _raw_spin_unlock_irqrestore mptspi_qcmd scsi_dispatch_cmd scsi_request_fn __blk_run_queue queue_unplugged blk_flush_plug_list blk_finish_plug | |--99.24%-- do_io_submit | sys_io_submit | system_call_fastpath | io_submit | skgfqio | ksfd_skgfqio | ksfdgo
This shows 52.5% of the time of profiling a “select count(*) from t2” on the server process was spending it’s time on unlocking a spinlock.
This was in the previous blogpost, and tanel commented the following:
I would be suspicious of any of the “irqrestore” functions shown as the main CPU cycle consumer – as right after enabling interrupts again on a CPU may be just the first chance for the profiler interrupt to kick in and do the RIP and stack backtrace read. This is highly dependent on the hardware (how new CPUs) and OS version + VM version + whether the VM allows the guest OS to use hardware performance counters directly.
Let’s reiterate what I was doing: I was profiling the execution using Linux’ in-kernel perf functionality, but, because of the lack of access of the kernel’s performance registers because I was running on VMWare Fusion (desktop virtualisation), I was using perf in the following way: perf record -e cpu-clock.
These are a partial list of perf’s triggering events:
List of pre-defined events (to be used in -e): cpu-cycles OR cycles [Hardware event] instructions [Hardware event] cache-references [Hardware event] cache-misses [Hardware event] branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] ref-cycles [Hardware event] cpu-clock [Software event] ...etc...
If no specific event is specified, perf tries to use ‘cpu-cycles’, which has the indication [Hardware event], which means the kernel’s performance registers are used to gather information. If this is not possible (because virtualisation disables access to the performance registers), the software event ‘cpu-clock’ can be used. This is what I used in the previous article.
However, cpu-clock is a software event. And this event (cpu-clock) is depended on the timer interrupt. And the function we see we spent most time on (_raw_spin_unlock_irqrestore) is the re-enabling of IRQ’s for this process when this spinlock is unlocked. So this _could_ mean we did not spend our time on this function, but can not tell, because the timing source was disabled.
However, there was another helpful comment from Tanel:
VMWare Fusion 5.x should already allow some CPU perf counters to be accessed directly in the VM guest. It requires a new enough CPU though (it works in my late 2011 MBP, but not in the 2009 MBP). There’s a setting under “advanced options” under “processors & memory” -> “You can use code profiling applications such as VTune or OProfile to optimize or debug software that runs inside a virtual machine.”
Indeed, there is such a function, and let’s enable it and try again in EXACTLY the same way, but now using the ‘cpu-cycles’ method (which is default).
# Samples: 669 of event 'cycles' # Event count (approx.): 288603593 # # Overhead Command Shared Object Symbol # ........ ....... ................. ....................................... # 11.31% oracle oracle [.] sxorchk | --- sxorchk | |--98.50%-- kcbhxoro | kcbhvbo | kcbzvb | kcbldrget | kcbgtcr | ktrget3 | ktrget2 | kdst_fetch | kdstf00000010000kmP | kdsttgr | qertbFetch | qergsFetch | opifch2 | kpoal8 | opiodr | ttcpip | opitsk | opiino | opiodr | opidrv | sou2o | opimai_real | ssthrdmain | main | __libc_start_main | --1.50%-- kcbhvbo kcbzvb
This is radically different! All of a sudden the top function is not a spinlock in the kernel any more, but an Oracle function!
Let’s look at the top 5 locations where time is spend with exactly the same case, but with -e cycles (the default) and -e cpu-clock (non-default/software timer):
# perf record -g -p 2527 ^C # perf report -n -g none ... # Samples: 580 of event 'cycles' # Event count (approx.): 256237297 # # Overhead Samples Command Shared Object Symbol # ........ ........... ....... .................. ....................................... # 17.47% 100 oracle oracle [.] sxorchk 7.99% 47 oracle oracle [.] kdstf00000010000kmP 6.01% 35 oracle oracle [.] kcbhvbo 3.25% 19 oracle oracle [.] kdst_fetch 3.01% 17 oracle [kernel.kallsyms] [k] __wake_up_bit
And now the same execution, but with the software timer:
# perf record -g -p 2527 -e cpu-clock ^C # perf report -n -g none ... # Samples: 422 of event 'cpu-clock' # Event count (approx.): 422 # # Overhead Samples Command Shared Object Symbol # ........ ........... ....... ................. ................................ # 78.67% 332 oracle [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 4.03% 17 oracle oracle [.] sxorchk 2.13% 9 oracle oracle [.] kcbhvbo 1.90% 8 oracle oracle [.] kdstf00000010000kmP 0.95% 4 oracle oracle [.] qeaeCn1Serial
This reveals some information: it seems that when profiling with the software timer, the “_raw_spin_unlock_irqrestore” function “eats” a lot of samples, which are “stolen” from the functions where they are spent:
sxorchk has 100 samples with the hardware timer, and 17 with the software timer.
kcbhvbo has 35 samples with the hardware timer, and has 9 with the software timer.
kdstf00000010000kmP has 47 samples with the hardware timer, and has 8 with the software timer.
So, general conclusion is that it’s important to understand what you are measuring, and if that method has implication on what you are measuring.
Conclusion specific to perf: do not use cpu-clock if you can use the hardware event.