Archive

Tag Archives: perf

This is a small blogpost on using ‘perf’. I got an error message when I tried to run ‘perf top’ systemwide:

# perf top
Too many events are opened.
Try again after reducing the number of events

What actually is the case here, is actually described in the perf wiki:

Open file limits
The design of the perf_event kernel interface which is used by the perf tool, is such that it uses one file descriptor per event per-thread or per-cpu.
On a 16-way system, when you do:
perf stat -e cycles sleep 1
You are effectively creating 16 events, and thus consuming 16 file descriptors.

The point for this blogpost is perf (in Oracle Linux 7.1) says ‘too many events’, and hidden away in the perf wiki the true reason for the message is made clear: perf opens up a file descriptor per cpu thread, which means that if you are on a big system you might get this message if the open files (file descriptors actually) limit is set lower than the number of cpu threads.

You can see the current set limits using ‘ulimit -a’:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 189909832
max locked memory       (kbytes, -l) 21878354152
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

If you are root, you can simply set the ‘open files’ limit higher than the amount of cpu threads, and perf will work:

# ulimit -n 10240
Advertisement

In my blogpost When the oracle wait interface isn’t enough I showed how a simple asynchronous direct path scan of a table was spending more than 99% of it’s time on CPU, and that perf showed me that 68% (of the total elapsed time) was spent on a spinlock unlock in the linux kernel which was called by io_submit().

This led to some very helpful comments from Tanel Poder. This blogpost is a materialisation of his comments, and tests to show the difference.

First take a look at what I gathered from ‘perf’ in the first article:

# Samples: 501  of event 'cpu-clock'
# Event count (approx.): 501
#
# Overhead  Command       Shared Object                               Symbol
# ........  .......  ..................  ...................................
#
    52.50%   oracle  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore    
             |
             --- _raw_spin_unlock_irqrestore
                 mptspi_qcmd
                 scsi_dispatch_cmd
                 scsi_request_fn
                 __blk_run_queue
                 queue_unplugged
                 blk_flush_plug_list
                 blk_finish_plug
                |          
                |--99.24%-- do_io_submit
                |          sys_io_submit
                |          system_call_fastpath
                |          io_submit
                |          skgfqio
                |          ksfd_skgfqio
                |          ksfdgo

This shows 52.5% of the time of profiling a “select count(*) from t2” on the server process was spending it’s time on unlocking a spinlock.

This was in the previous blogpost, and tanel commented the following:

I would be suspicious of any of the “irqrestore” functions shown as the main CPU cycle consumer – as right after enabling interrupts again on a CPU may be just the first chance for the profiler interrupt to kick in and do the RIP and stack backtrace read. This is highly dependent on the hardware (how new CPUs) and OS version + VM version + whether the VM allows the guest OS to use hardware performance counters directly.

Let’s reiterate what I was doing: I was profiling the execution using Linux’ in-kernel perf functionality, but, because of the lack of access of the kernel’s performance registers because I was running on VMWare Fusion (desktop virtualisation), I was using perf in the following way: perf record -e cpu-clock.

These are a partial list of perf’s triggering events:

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  ...etc...

If no specific event is specified, perf tries to use ‘cpu-cycles’, which has the indication [Hardware event], which means the kernel’s performance registers are used to gather information. If this is not possible (because virtualisation disables access to the performance registers), the software event ‘cpu-clock’ can be used. This is what I used in the previous article.

However, cpu-clock is a software event. And this event (cpu-clock) is depended on the timer interrupt. And the function we see we spent most time on (_raw_spin_unlock_irqrestore) is the re-enabling of IRQ’s for this process when this spinlock is unlocked. So this _could_ mean we did not spend our time on this function, but can not tell, because the timing source was disabled.

However, there was another helpful comment from Tanel:

VMWare Fusion 5.x should already allow some CPU perf counters to be accessed directly in the VM guest. It requires a new enough CPU though (it works in my late 2011 MBP, but not in the 2009 MBP). There’s a setting under “advanced options” under “processors & memory” -> “You can use code profiling applications such as VTune or OProfile to optimize or debug software that runs inside a virtual machine.”

Indeed, there is such a function, and let’s enable it and try again in EXACTLY the same way, but now using the ‘cpu-cycles’ method (which is default).

# Samples: 669  of event 'cycles'
# Event count (approx.): 288603593
#
# Overhead  Command      Shared Object                                   Symbol
# ........  .......  .................  .......................................
#
    11.31%   oracle  oracle             [.] sxorchk                            
             |
             --- sxorchk
                |          
                |--98.50%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch
                |          kdstf00000010000kmP
                |          kdsttgr
                |          qertbFetch
                |          qergsFetch
                |          opifch2
                |          kpoal8
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                |          
                 --1.50%-- kcbhvbo
                           kcbzvb

This is radically different! All of a sudden the top function is not a spinlock in the kernel any more, but an Oracle function!

Let’s look at the top 5 locations where time is spend with exactly the same case, but with -e cycles (the default) and -e cpu-clock (non-default/software timer):

# perf record -g -p 2527 
^C
# perf report -n -g none
...
# Samples: 580  of event 'cycles'
# Event count (approx.): 256237297
#
# Overhead      Samples  Command       Shared Object                                   Symbol
# ........  ...........  .......  ..................  .......................................
#
    17.47%          100   oracle  oracle              [.] sxorchk                            
     7.99%           47   oracle  oracle              [.] kdstf00000010000kmP                
     6.01%           35   oracle  oracle              [.] kcbhvbo                            
     3.25%           19   oracle  oracle              [.] kdst_fetch                         
     3.01%           17   oracle  [kernel.kallsyms]   [k] __wake_up_bit        

And now the same execution, but with the software timer:

# perf record -g -p 2527 -e cpu-clock
^C
# perf report -n -g none
...
# Samples: 422  of event 'cpu-clock'
# Event count (approx.): 422
#
# Overhead      Samples  Command      Shared Object                            Symbol
# ........  ...........  .......  .................  ................................
#
    78.67%          332   oracle  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore 
     4.03%           17   oracle  oracle             [.] sxorchk                     
     2.13%            9   oracle  oracle             [.] kcbhvbo                     
     1.90%            8   oracle  oracle             [.] kdstf00000010000kmP         
     0.95%            4   oracle  oracle             [.] qeaeCn1Serial                 

This reveals some information: it seems that when profiling with the software timer, the “_raw_spin_unlock_irqrestore” function “eats” a lot of samples, which are “stolen” from the functions where they are spent:
sxorchk has 100 samples with the hardware timer, and 17 with the software timer.
kcbhvbo has 35 samples with the hardware timer, and has 9 with the software timer.
kdstf00000010000kmP has 47 samples with the hardware timer, and has 8 with the software timer.

So, general conclusion is that it’s important to understand what you are measuring, and if that method has implication on what you are measuring.
Conclusion specific to perf: do not use cpu-clock if you can use the hardware event.

%d bloggers like this: