When the Oracle wait interface isn’t enough

Oracle has done a great job with the wait interface. It gives us the opportunity to profile the time spent in Oracle processes, by keeping track of CPU time and waits (which is time spent not running on CPU). With every new version Oracle has enhanced the wait interface by making the waits more detailed. Tuning typically means trying to get rid of as many waits as possible.

But what if your execution is optimised to the point that there are (almost) no waits left? Before you think this is theoretical: it is possible, especially with Oracle adaptive direct path reads (which are IOs that bypass the Oracle buffer cache), visible by the wait “direct path read”. Of course I am talking about the omission of waits, which happens with adaptive direct path reads if your system is able to provide the requested results fast enough. There isn’t a wait because if the IO request result is returned fast enough, the process doesn’t have to wait. Whilst this sounds very obvious, the “traditional” Oracle IO requests (visible with the waits “db file sequential read” and “db file scattered read”) always generate a wait, no matter how fast the IO requests were.
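
If you want to generate such a trace yourself, extended SQL trace (event 10046 at level 8, which includes the wait events) can be switched on for the session. A minimal sketch, assuming a test user and a test table t2 (both are placeholders for illustration):

# trace one session with extended SQL trace; level 8 includes the wait events
# "testuser/testpass" and table t2 are assumptions for illustration
sqlplus testuser/testpass <<'EOF'
alter session set events '10046 trace name context forever, level 8';
select count(*) from t2;
alter session set events '10046 trace name context off';
-- on 11g and later, the trace file name is visible in v$diag_info
select value from v$diag_info where name = 'Default Trace File';
EOF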

Here is a trace excerpt from a full table scan where the IO was fast enough to generate only a few waits:

PARSE #140145843472584:c=0,e=28,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1385308947947766
EXEC #140145843472584:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1385308947947823
WAIT #140145843472584: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308947947871
WAIT #140145843472584: nam='asynch descriptor resize' ela= 1 outstanding #aio=0 current aio limit=1562 new aio limit=1592 obj#=75579 tim=1385308947947969
WAIT #140145843472584: nam='direct path read' ela= 428 file number=5 first dba=28418 block cnt=126 obj#=75579 tim=1385308947989097
FETCH #140145843472584:c=161976,e=174323,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1385308948122218
WAIT #140145843472584: nam='SQL*Net message from client' ela= 249 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122600
FETCH #140145843472584:c=0,e=2,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,plh=3321871023,tim=1385308948122689
WAIT #140145843472584: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122709
WAIT #140145843472584: nam='SQL*Net message from client' ela= 210 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122938
CLOSE #140145843472584:c=0,e=15,dep=0,type=3,tim=1385308948555460

The most interesting part of the raw trace file is between the EXEC line and the first FETCH line. There is first a ‘SQL*Net message to client’ wait, then an ‘asynch descriptor resize’ wait, and then a single ‘direct path read’ wait. That is a single wait line for doing IO, while the fetch line shows that 20941 blocks were read by doing physical IO (p=20941). The fetch line also shows that most of the elapsed time (e) was spent running on CPU (c). This means that details about how those 20941 blocks were read are (except for the single ‘direct path read’ wait) not available.

But what if you want to understand more about what the process is doing here? Except for a few wait lines, all the processing detail that waits normally give is gone. More or less only the PARSE/EXEC/FETCH lines are left, with the first FETCH line containing more than 99% of all the time.

The answer to that on Linux is perf. Perf is a profiler that is embedded in the Linux kernel (since 2.6.32). I’ve written more about perf before; use the search field on this blog to find articles on how to set up and use it. Now let’s see what is happening in this situation: what is Oracle doing to execute the above mentioned SQL (select count(*) from t2)?
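
For reference, the rough workflow is: look up the OS process id of the database session, attach perf to it while the SQL runs, and generate a report afterwards. A sketch (the session id 42, the 60 second recording window and the <spid> placeholder are assumptions for illustration):

# find the OS process id (spid) of the database session to profile (sid 42 is an example value)
sqlplus -s / as sysdba <<'EOF'
select p.spid from v$process p, v$session s where p.addr = s.paddr and s.sid = 42;
EOF

# record a cpu-clock profile with call stacks (-g) of that process for 60 seconds
perf record -g -e cpu-clock -p <spid> -- sleep 60

# display the recorded profile
perf report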

I ran perf on the session above in this way (‘perf record -g -e cpu-clock -p PID’), and the result (with ‘perf report’) is shown below:

    67.58%   oracle  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
             |
             --- _raw_spin_unlock_irqrestore
                |          
                |--99.19%-- mptspi_qcmd
                |          scsi_dispatch_cmd
                |          scsi_request_fn
                |          __blk_run_queue
                |          queue_unplugged
                |          blk_flush_plug_list
                |          blk_finish_plug
                |          generic_file_read_iter
                |          generic_file_aio_read
                |          aio_rw_vect_retry
                |          aio_run_iocb
                |          io_submit_one
                |          do_io_submit
                |          sys_io_submit
                |          system_call_fastpath
                |          io_submit
                |          
                 --0.81%-- __wake_up
                           fsnotify_add_notify_event
                           inotify_handle_event
                           send_to_group
                           fsnotify
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __write_nocancel

     4.40%   oracle  oracle             [.] sxorchk

What is shown here is that for 68% of the time the process ran on CPU, it was spending its time in kernel mode ([k]), in a function called _raw_spin_unlock_irqrestore. This function was reached in two different ways, but 99% of the time it came from mptspi_qcmd, which is the device specific kernel driver. What is even more remarkable is that when we follow the backtrace up (by reading down), we see the process was in fact issuing IOs (the io_submit system call)!
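
As a side note: if you want to confirm from another angle that the process really is issuing asynchronous IO, the system calls can be watched directly with strace. A sketch (again <spid> is a placeholder; bear in mind strace adds considerable overhead, so only do this on a test system):

# count all system calls of the Oracle process for a while, then stop with ctrl-c
strace -c -p <spid>

# or show only the asynchronous IO system calls as they are issued
strace -e trace=io_submit,io_getevents -p <spid>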

This means that instead of spending time waiting for IOs to finish, this system is spending time spinning on a spinlock (comparable to a latch in Oracle) while issuing commands to a SCSI device.

The next function in which the Oracle process spent time is an Oracle function (visible by [.], which means a userland function) called sxorchk. This function performs an XOR check on blocks (governed by the db_block_checksum parameter).
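
To see how block checksumming is configured on your own system, a small sketch (the value is of course system dependent):

# show the current block checksum setting
sqlplus -s / as sysdba <<'EOF'
show parameter db_block_checksum
EOF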

As a summary: does this mean the Oracle wait interface is useless? Of course not. But if the wait interface simply does not provide enough information, like when 99% of the time is only visible as CPU time, you need to step to another layer and investigate there. Perf opens up the CPU time, and is able to tell you how that CPU time is composed.

10 comments
  1. Frits, is this in a VM?
    And which Linux kernel version/CPU HW architecture?

    I would be suspicious of any of the “irqrestore” functions shown as the main CPU cycle consumer – the moment right after interrupts are enabled again on a CPU may simply be the first chance for the profiler interrupt to kick in and do the RIP and stack backtrace read. This is highly dependent on the hardware (how new the CPUs are) and the OS version + VM version + whether the VM allows the guest OS to use hardware performance counters directly. The perf counter interrupts can apparently be non-maskable (NMI) nowadays – but it could well be that the more complex operation of stack backtrace walking is delayed to a soft interrupt task – and won’t get done before interrupts are enabled again for the CPU.

    This doesn’t affect what you said about the CPU time being spent in the io_submit() calls and all the SCSI stuff under it – but I wouldn’t be surprised if you were actually *not* spinning for the spinlock all that time (it’s a spinlock release operation, after all) and the _raw_spin_unlock_irqrestore() call just happens to be the one that enables interrupts, causing the pending stack backtrace read to finally kick in right when the code that had disabled the interrupts is finishing its execution. In the cases where I’ve actually detected spinning, I tend to see _raw_spin_lock_irqsave(), do_raw_spin_lock() and delay_tsc().

    This is highly platform, CPU ver, VM, OS ver dependent.

    • This is a VM.
      It’s VMware Fusion Professional version 5.0.3.
      The VM is Oracle Linux 6u3 with kernel 2.6.39-300.26.1 UEK.

      VMware Fusion does not allow the hardware performance counters to be used directly (I cannot use the default perf setting, the hardware PMU event UNHALTED_CORE_CYCLES, but have to use the cpu-clock software event).

      Your analysis makes sense. I’ve read the kernel source of the call (on lxr.no) and couldn’t figure out why it would spin that much. Indeed, if there was visible contention, it would make much more sense to be in the lock phase of the spinlock rather than in the unlock phase.

    • savvinov said:

      Hi Frits,

      thanks for a great blog!

      So while doing async I/O, Oracle is in fact burning CPU? I.e. it’s not just showing as CPU time because of a hole in Oracle instrumentation (“it’s not something OWI knows about so let’s assume it’s CPU”), it’s really spending CPU cycles on that spinlock?

      And if we add enough similar async I/O workload we could even CPU-starve the system?

      Can you please confirm that my understanding is correct?

      Best regards,
      Nikolay

  2. Btw, VMWare Fusion 5.x should already allow some CPU perf counters to be accessed directly in the VM guest. It requires a new enough CPU though (it works in my late 2011 MBP, but not in the 2009 MBP). There’s a setting under “advanced options” under “processors & memory” ->

    “You can use code profiling applications such as VTune or OProfile to optimize or debug software that runs inside a virtual machine.”

    • You are absolutely right!

      Under advanced options I’ve got a setting to expose VT-x/EPT to the VM (to virtualise inside the VM), and a setting “enable code profiling applications in this virtual machine”, which says “enables running modern code profiling applications by providing support for CPU performance monitoring counters inside this virtual machine”.

      • Even though I didn’t reproduce the issue you are having, I just ran a few parallel direct path read scans, and indeed with perf’s hardware cpu-cycles method I saw the spinning in action (do_raw_spin_lock() + delay_tsc()) – and when measuring the same workload with the software interrupt method (cpu-clock), the mentioned functions disappeared and were replaced by _raw_spin_unlock_irqrestore(), as you saw.

        So, at least in my workload, the spinning was actually happening – and happening in the “lock()” function – but perf’s software-interrupt-based profiling method just wasn’t able to catch it.

  3. This reminds me of an issue I once saw on a big Itanium box running Linux, where I guessed something like this must have been happening (I was using OProfile back then), but now it’s more evident.

  4. Jeremy said:

    Just saw the fulltime script that Craig Shallahamer published, citing some collaboration with you. Must be related to this – looks very cool!
