When the Oracle wait interface isn’t enough, part 2: understanding measurements.

In my blogpost When the oracle wait interface isn’t enough I showed how a simple asynchronous direct path scan of a table was spending more than 99% of it’s time on CPU, and that perf showed me that 68% (of the total elapsed time) was spent on a spinlock unlock in the linux kernel which was called by io_submit().

This led to some very helpful comments from Tanel Poder. This blogpost is a materialisation of his comments, and tests to show the difference.

First take a look at what I gathered from ‘perf’ in the first article:

# Samples: 501  of event 'cpu-clock'
# Event count (approx.): 501
#
# Overhead  Command       Shared Object                               Symbol
# ........  .......  ..................  ...................................
#
    52.50%   oracle  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore    
             |
             --- _raw_spin_unlock_irqrestore
                 mptspi_qcmd
                 scsi_dispatch_cmd
                 scsi_request_fn
                 __blk_run_queue
                 queue_unplugged
                 blk_flush_plug_list
                 blk_finish_plug
                |          
                |--99.24%-- do_io_submit
                |          sys_io_submit
                |          system_call_fastpath
                |          io_submit
                |          skgfqio
                |          ksfd_skgfqio
                |          ksfdgo

This shows 52.5% of the time of profiling a “select count(*) from t2″ on the server process was spending it’s time on unlocking a spinlock.

This was in the previous blogpost, and tanel commented the following:

I would be suspicious of any of the “irqrestore” functions shown as the main CPU cycle consumer – as right after enabling interrupts again on a CPU may be just the first chance for the profiler interrupt to kick in and do the RIP and stack backtrace read. This is highly dependent on the hardware (how new CPUs) and OS version + VM version + whether the VM allows the guest OS to use hardware performance counters directly.

Let’s reiterate what I was doing: I was profiling the execution using Linux’ in-kernel perf functionality, but, because of the lack of access of the kernel’s performance registers because I was running on VMWare Fusion (desktop virtualisation), I was using perf in the following way: perf record -e cpu-clock.

These are a partial list of perf’s triggering events:

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  ...etc...

If no specific event is specified, perf tries to use ‘cpu-cycles’, which has the indication [Hardware event], which means the kernel’s performance registers are used to gather information. If this is not possible (because virtualisation disables access to the performance registers), the software event ‘cpu-clock’ can be used. This is what I used in the previous article.

However, cpu-clock is a software event. And this event (cpu-clock) is depended on the timer interrupt. And the function we see we spent most time on (_raw_spin_unlock_irqrestore) is the re-enabling of IRQ’s for this process when this spinlock is unlocked. So this _could_ mean we did not spend our time on this function, but can not tell, because the timing source was disabled.

However, there was another helpful comment from Tanel:

VMWare Fusion 5.x should already allow some CPU perf counters to be accessed directly in the VM guest. It requires a new enough CPU though (it works in my late 2011 MBP, but not in the 2009 MBP). There’s a setting under “advanced options” under “processors & memory” -> “You can use code profiling applications such as VTune or OProfile to optimize or debug software that runs inside a virtual machine.”

Indeed, there is such a function, and let’s enable it and try again in EXACTLY the same way, but now using the ‘cpu-cycles’ method (which is default).

# Samples: 669  of event 'cycles'
# Event count (approx.): 288603593
#
# Overhead  Command      Shared Object                                   Symbol
# ........  .......  .................  .......................................
#
    11.31%   oracle  oracle             [.] sxorchk                            
             |
             --- sxorchk
                |          
                |--98.50%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch
                |          kdstf00000010000kmP
                |          kdsttgr
                |          qertbFetch
                |          qergsFetch
                |          opifch2
                |          kpoal8
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                |          
                 --1.50%-- kcbhvbo
                           kcbzvb

This is radically different! All of a sudden the top function is not a spinlock in the kernel any more, but an Oracle function!

Let’s look at the top 5 locations where time is spend with exactly the same case, but with -e cycles (the default) and -e cpu-clock (non-default/software timer):

# perf record -g -p 2527 
^C
# perf report -n -g none
...
# Samples: 580  of event 'cycles'
# Event count (approx.): 256237297
#
# Overhead      Samples  Command       Shared Object                                   Symbol
# ........  ...........  .......  ..................  .......................................
#
    17.47%          100   oracle  oracle              [.] sxorchk                            
     7.99%           47   oracle  oracle              [.] kdstf00000010000kmP                
     6.01%           35   oracle  oracle              [.] kcbhvbo                            
     3.25%           19   oracle  oracle              [.] kdst_fetch                         
     3.01%           17   oracle  [kernel.kallsyms]   [k] __wake_up_bit        

And now the same execution, but with the software timer:

# perf record -g -p 2527 -e cpu-clock
^C
# perf report -n -g none
...
# Samples: 422  of event 'cpu-clock'
# Event count (approx.): 422
#
# Overhead      Samples  Command      Shared Object                            Symbol
# ........  ...........  .......  .................  ................................
#
    78.67%          332   oracle  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore 
     4.03%           17   oracle  oracle             [.] sxorchk                     
     2.13%            9   oracle  oracle             [.] kcbhvbo                     
     1.90%            8   oracle  oracle             [.] kdstf00000010000kmP         
     0.95%            4   oracle  oracle             [.] qeaeCn1Serial                 

This reveals some information: it seems that when profiling with the software timer, the “_raw_spin_unlock_irqrestore” function “eats” a lot of samples, which are “stolen” from the functions where they are spent:
sxorchk has 100 samples with the hardware timer, and 17 with the software timer.
kcbhvbo has 35 samples with the hardware timer, and has 9 with the software timer.
kdstf00000010000kmP has 47 samples with the hardware timer, and has 8 with the software timer.

So, general conclusion is that it’s important to understand what you are measuring, and if that method has implication on what you are measuring.
Conclusion specific to perf: do not use cpu-clock if you can use the hardware event.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 2,134 other followers

%d bloggers like this: