
In my blogpost When the oracle wait interface isn’t enough I showed how a simple asynchronous direct path scan of a table was spending more than 99% of its time on CPU, and that perf showed me that 68% (of the total elapsed time) was spent on a spinlock unlock in the Linux kernel, which was called by io_submit().

This led to some very helpful comments from Tanel Poder. This blogpost works out his comments and adds tests to show the difference.

First take a look at what I gathered from ‘perf’ in the first article:

# Samples: 501  of event 'cpu-clock'
# Event count (approx.): 501
#
# Overhead  Command       Shared Object                               Symbol
# ........  .......  ..................  ...................................
#
    52.50%   oracle  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore    
             |
             --- _raw_spin_unlock_irqrestore
                 mptspi_qcmd
                 scsi_dispatch_cmd
                 scsi_request_fn
                 __blk_run_queue
                 queue_unplugged
                 blk_flush_plug_list
                 blk_finish_plug
                |          
                |--99.24%-- do_io_submit
                |          sys_io_submit
                |          system_call_fastpath
                |          io_submit
                |          skgfqio
                |          ksfd_skgfqio
                |          ksfdgo

This shows that 52.5% of the profiled time of a “select count(*) from t2” was spent by the server process on unlocking a spinlock in the kernel.

This was in the previous blogpost, and Tanel commented the following:

I would be suspicious of any of the “irqrestore” functions shown as the main CPU cycle consumer – as right after enabling interrupts again on a CPU may be just the first chance for the profiler interrupt to kick in and do the RIP and stack backtrace read. This is highly dependent on the hardware (how new CPUs) and OS version + VM version + whether the VM allows the guest OS to use hardware performance counters directly.

Let’s reiterate what I was doing: I was profiling the execution using Linux’ in-kernel perf functionality, but, because I was running on VMWare Fusion (desktop virtualisation), the CPU’s performance registers were not accessible, so I was using perf in the following way: perf record -e cpu-clock.

This is a partial list of perf’s triggering events:

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  ...etc...

If no specific event is specified, perf tries to use ‘cpu-cycles’, which has the indication [Hardware event], which means the CPU’s performance registers are used to gather information. If this is not possible (because virtualisation blocks access to the performance registers), the software event ‘cpu-clock’ can be used. This is what I used in the previous article.
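
As a quick check whether the hardware counters are actually usable (a sketch, not part of the original test), you can let perf count the cycles event system-wide for a second; if the counter shows up as ‘<not supported>’ or ‘not counted’, only software events such as cpu-clock are available:

# count the hardware 'cycles' event on all CPUs for one second
perf stat -e cycles -a sleep 1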

However, cpu-clock is a software event, and this event depends on the timer interrupt. The function we appear to spend most of our time in (_raw_spin_unlock_irqrestore) is the re-enabling of interrupts for this process when the spinlock is unlocked. So this _could_ mean we did not really spend our time in this function, but we cannot tell, because the timing source (the interrupt) was disabled up to that point.

However, there was another helpful comment from Tanel:

VMWare Fusion 5.x should already allow some CPU perf counters to be accessed directly in the VM guest. It requires a new enough CPU though (it works in my late 2011 MBP, but not in the 2009 MBP). There’s a setting under “advanced options” under “processors & memory” -> “You can use code profiling applications such as VTune or OProfile to optimize or debug software that runs inside a virtual machine.”

Indeed, there is such a setting. Let’s enable it and try again in EXACTLY the same way, but now using the ‘cpu-cycles’ event (which is the default).

# Samples: 669  of event 'cycles'
# Event count (approx.): 288603593
#
# Overhead  Command      Shared Object                                   Symbol
# ........  .......  .................  .......................................
#
    11.31%   oracle  oracle             [.] sxorchk                            
             |
             --- sxorchk
                |          
                |--98.50%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch
                |          kdstf00000010000kmP
                |          kdsttgr
                |          qertbFetch
                |          qergsFetch
                |          opifch2
                |          kpoal8
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                |          
                 --1.50%-- kcbhvbo
                           kcbzvb

This is radically different! All of a sudden the top function is not a spinlock in the kernel any more, but an Oracle function!

Let’s look at the top 5 locations where time is spent in exactly the same case, but once with -e cycles (the default) and once with -e cpu-clock (non-default/software timer):

# perf record -g -p 2527 
^C
# perf report -n -g none
...
# Samples: 580  of event 'cycles'
# Event count (approx.): 256237297
#
# Overhead      Samples  Command       Shared Object                                   Symbol
# ........  ...........  .......  ..................  .......................................
#
    17.47%          100   oracle  oracle              [.] sxorchk                            
     7.99%           47   oracle  oracle              [.] kdstf00000010000kmP                
     6.01%           35   oracle  oracle              [.] kcbhvbo                            
     3.25%           19   oracle  oracle              [.] kdst_fetch                         
     3.01%           17   oracle  [kernel.kallsyms]   [k] __wake_up_bit        

And now the same execution, but with the software timer:

# perf record -g -p 2527 -e cpu-clock
^C
# perf report -n -g none
...
# Samples: 422  of event 'cpu-clock'
# Event count (approx.): 422
#
# Overhead      Samples  Command      Shared Object                            Symbol
# ........  ...........  .......  .................  ................................
#
    78.67%          332   oracle  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore 
     4.03%           17   oracle  oracle             [.] sxorchk                     
     2.13%            9   oracle  oracle             [.] kcbhvbo                     
     1.90%            8   oracle  oracle             [.] kdstf00000010000kmP         
     0.95%            4   oracle  oracle             [.] qeaeCn1Serial                 

This reveals some information: it seems that when profiling with the software timer, the “_raw_spin_unlock_irqrestore” function “eats” a lot of samples, which are “stolen” from the functions where the time is actually spent:

- sxorchk has 100 samples with the hardware timer, and 17 with the software timer.
- kcbhvbo has 35 samples with the hardware timer, and 9 with the software timer.
- kdstf00000010000kmP has 47 samples with the hardware timer, and 8 with the software timer.

So, the general conclusion is that it’s important to understand what you are measuring, and whether the measurement method has implications for what you are measuring.
The conclusion specific to perf: do not use cpu-clock if you can use the hardware event.

This is the fourth post in a series of postings on how to get measurements out of the cell server, which is the storage layer of the Oracle Exadata database machine. Up until now, I have looked at the measurement of the kind of IOs Exadata receives, the latencies of the IOs as done by the cell server, and the mechanism Exadata uses to overcome overloaded CPUs on the cell layer.

This post is about the statistics on the disk devices on the operating system, which the cell server also collects and uses. The disk statistics are ideal to combine with the IO latency statistics.

This is how a dump of the collected statistics (which is called “devio_stats”) is invoked on the cell server, using cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('devio_stats',0)"; 

This will output the name of the thread-log file, in which the “devio_stats” dump has been made.
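
If you want to run this from the shell instead of an interactive cellcli session, something like the following should work (a sketch; the exact quoting may need adjusting for your shell, and the trace file name in the output will differ per cell and per dump):

cellcli -e "alter cell events=\"immediate cellsrv.cellsrv_dump('devio_stats',0)\"" | grep "has been written to"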

This is a quick peek at the statistics this dump provides (first 10 lines):

[IOSTAT] Dump IO device stats for the last 1800 seconds
2013-10-28 04:57:39.679590*: Dump sequence #34:
[IOSTAT] Device - /dev/sda
ServiceTime Latency AverageRQ numReads numWrites DMWG numDmwgPeers numDmwgPeersFl trigerConfine avgSrvcTimeDmwg avgSrvcTimeDmwgFl
0.000000 0.000000 10 0 6 0 0 0 0 0.000000 0.000000
0.111111 0.111111 15 7 38 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 4 8 0 0 0 0 0.000000 0.000000
0.000000 0.000000 31 0 23 0 0 0 0 0.000000 0.000000
0.000000 0.000000 8 0 1 0 0 0 0 0.000000 0.000000
0.058824 0.058824 25 0 17 0 0 0 0 0.000000 0.000000
etc.

These are the devices for which the cell server keeps statistics:

grep \/dev\/ /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_85.trc
[IOSTAT] Device - /dev/sda
[IOSTAT] Device - /dev/sda3
[IOSTAT] Device - /dev/sdb
[IOSTAT] Device - /dev/sdb3
[IOSTAT] Device - /dev/sdc
[IOSTAT] Device - /dev/sde
[IOSTAT] Device - /dev/sdd
[IOSTAT] Device - /dev/sdf
[IOSTAT] Device - /dev/sdg
[IOSTAT] Device - /dev/sdh
[IOSTAT] Device - /dev/sdi
[IOSTAT] Device - /dev/sdj
[IOSTAT] Device - /dev/sdk
[IOSTAT] Device - /dev/sdl
[IOSTAT] Device - /dev/sdm
[IOSTAT] Device - /dev/sdn
[IOSTAT] Device - /dev/sdo
[IOSTAT] Device - /dev/sdp
[IOSTAT] Device - /dev/sdq
[IOSTAT] Device - /dev/sdr
[IOSTAT] Device - /dev/sds
[IOSTAT] Device - /dev/sdt
[IOSTAT] Device - /dev/sdu

What is of interest here is that if the cell disk is allocated inside a partition instead of the whole disk, the cell server keeps statistics on both the entire device (/dev/sda, /dev/sdb) and the partition (/dev/sda3, /dev/sdb3). Also, the statistics are kept on both the rotating disks and the flash disks, as you would expect.

When looking in the “devio_stats” dump, there are a few other things worth noticing. The lines with statistics do not have a timestamp or any other time indicator; they contain only the statistics. The lines are displayed per device, with the newest line on top. The dump indicates it contains the IO device statistics which the cell keeps for the last 1800 seconds (30 minutes). However, if you count the number of lines which (apparently) are kept by the cell server, the count is 599, not 1800. If you divide the time by the number of samples, it appears the cell takes a device statistics snapshot every 3 seconds. The cell server picks up the disk statistics from /proc/diskstats. Also, mind that the cell measures the differences between two points in time, which means the numbers are averages over a period of 3 seconds.
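
To illustrate where these numbers come from (a rough sketch of the principle, not what cellsrv actually does), you can take two samples of /proc/diskstats 3 seconds apart and compute the differences yourself; for /dev/sda, field 4 is the number of reads completed and field 8 the number of writes completed:

# sample /proc/diskstats twice, 3 seconds apart, and show the delta for sda
awk '$3=="sda" { print $4, $8 }' /proc/diskstats > /tmp/diskstats.1
sleep 3
awk '$3=="sda" { print $4, $8 }' /proc/diskstats > /tmp/diskstats.2
paste /tmp/diskstats.1 /tmp/diskstats.2 | awk '{ print "reads:", $3-$1, "writes:", $4-$2 }'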

Two other things are listed in the statistics. The first is ‘trigerConfine’ (which probably should be “triggerConfine”), a mechanism for Oracle to manage underperforming disks.
The other is “DMWG”. At this moment all I am aware of is that DMWG means “Disk Media Working Group”, and that it works with the concept of peers.

To get a better understanding of what the difference is between the ServiceTime and Latency columns, see this excellent writeup on IO statistics from Bart Sjerps. ServiceTime corresponds to iostat’s svctm (or “storage wait”, as Bart calls it), and Latency to await (or “host wait”).
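
To see these two values on the operating system level yourself (assuming the sysstat package is installed), the extended iostat output shows them per device; svctm roughly maps to ServiceTime and await to Latency:

# extended device statistics, refreshed every 3 seconds
iostat -x 3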

Exadata is about doing IO. I think if there’s one thing people know about Exadata, that’s it. Exadata potentially brings (part of the) processing closer to the storage media, which will be rotating disks for most (Exadata) users, and optionally flash.

But with Exadata, you either do normal, alias regular, IO, which will probably be single block IO, or multiblock IO, which hopefully gets offloaded. The single block reads are hopefully coming from the flash cache, which can be seen by looking at v$sysstat/v$sesstat at the statistic “cell flash cache read hits”, not directly by looking at the IO related views. To understand the composition of the response time of a smartscan, there is even less instrumentation in the database (for background, look at this blogpost, which shows that the smartscan wait does not detail any of the steps done in a smartscan). In other words: if you experience performance differences on Exadata, and the waits point towards IO, there’s not much analysis which can be done to dig deeper.
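
As an illustration (a minimal sketch; run it as a user that can read the v$ views on the database node), this is how the flash cache hit statistic mentioned above can be read on the database side:

sqlplus -S / as sysdba <<'EOF'
select name, value
from   v$sysstat
where  name = 'cell flash cache read hits';
EOF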

Luckily, the Exadata storage server provides a very helpful dump which details IO latencies of what the cell considers celldisks (which are both flash and rotating disks). The dump provides:

- IO size by number of reads and writes
- IO size versus latency for reads and writes
- IO size versus pending IO count for reads and writes
- IO size versus pending IO sizes for reads and writes

This is how this dump is executed (in the cellcli of course):

alter cell events="immediate cellsrv.cellsrv_dump('iolstats',0)";

As with the other dumps, cellcli provides the name of the trace file to which the requested dump has been written. If we look inside this trace file, this is what an IO latencies dump looks like:

IO length (bytes):          Num read IOs:       Num write IOs:
[    512 -    1023)                212184               104402
[   1024 -    2047)                     0               138812
[   2048 -    4095)                     0               166282
[   4096 -    8191)                    35               134095
[   8192 -   16383)                498831               466674
[  16384 -   32767)                  2006                73433
[  32768 -   65535)                    91                15072
[  65536 -  131071)                   303                 4769
[ 131072 -  262143)                   297                 6376
[ 262144 -  524287)                  1160                  230
[ 524288 - 1048575)                  2278                   36
[1048576 - 2097151)                   459                   21

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [       32 | [       64 | [      128 | [      256 | [      512 | [     1024 | [     2048 | [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 | [   524288 |
                     ||        63) |       127) |       255) |       511) |      1023) |      2047) |      4095) |      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |   1048575) |
---------------------||------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
[     512,     1023) ||      31075 |      14592 |      69575 |      55370 |       7744 |        385 |        725 |       6489 |       7044 |      11663 |       4030 |       1770 |       1310 |        408 |          4 |
[    4096,     8191) ||          0 |          6 |          5 |          6 |          0 |          0 |          0 |          0 |          7 |          8 |          3 |          0 |          0 |          0 |          0 |
[    8192,    16383) ||         66 |        101 |       3189 |       6347 |        717 |       1826 |      23168 |     124246 |     191169 |      79157 |      37032 |      18508 |      12778 |        526 |          1 |
[   16384,    32767) ||         22 |         46 |         22 |       1403 |         90 |         46 |         57 |         65 |         77 |        124 |         39 |          5 |          7 |          3 |          0 |
...

What struck me as odd is that the name of the celldisk (CD_02_enkcel01 here) is printed below the first table (the IO lengths) about this celldisk(!)

In my previous post we saw a command to reset statistics (a cell events command). There is a command to reset the statistics for this specific dump (‘iolstats’) too (to be executed in the cellcli of course):

alter cell events = "immediate cellsrv.cellsrv_resetstats(iolstats)";

Next, I executed a smartscan and dumped the statistics again:

IO length (bytes):          Num read IOs:       Num write IOs:
[   4096 -    8191)                     0                   24
[ 524288 - 1048575)                     8                    0
[1048576 - 2097151)                   208                    0

Average IO-latency distribution stats for CDisk CD_02_enkcel01

Number of Reads iosize-latency distribution
IO len(B)\IO lat(us) || [     4096 | [     8192 | [    16384 | [    32768 | [    65536 | [   131072 | [   262144 |
                     ||      8191) |     16383) |     32767) |     65535) |    131071) |    262143) |    524287) |
---------------------||------------|------------|------------|------------|------------|------------|------------|
[  524288,  1048575) ||          0 |          0 |          3 |          1 |          0 |          2 |          2 |
[ 1048576,  2097151) ||          1 |          3 |         15 |         22 |         89 |         59 |         19 |

As can be seen, the statistics have been reset (on the local cell/storage server!). This makes diagnosing the physical IO subsystem of Exadata possible!

When you are administering one or more Exadatas, you probably have multiple databases running on different database or “computing” nodes. In order to understand what kind of IO you are doing, you can look at the statistics of your database in the data dictionary, and see what that instance (or instances, in case of RAC) has been doing. When using Exadata there is a near 100% chance you are using either normal or high ASM redundancy, of which most people know the “write amplification” impact: the write statistics in the Oracle data dictionary do not reflect the additional writes needed to satisfy normal (#IO times 2) or high (#IO times 3) redundancy. This means there might be a difference between the IOs you measure or think your database is doing, and what is actually done at the storage level.

But what if you want to know what is happening at the storage level, so at the level of the cell, or actually “cellsrv”, which is the process which makes IO flow to your databases? One option is to run “iostat -x”, but that gives a list that is quite hard to read (too many disk devices), and it doesn’t show you what the reason for the IO was: a redo write? A controlfile read? Archivelog? This would especially be great if you want to understand what is happening when your IO behaves differently than you expect, and you’ve ruled out IORM.

Well, it is possible to get an IO overview (cumulative since startup)! Every storage server keeps a table of IO reasons. This table can be dumped into a trace file on the cell. To generate a dump with an overview of what kind of IOs are done, use “cellcli” locally on a cell, and enter the following command:

alter cell events="immediate cellsrv.cellsrv_dump('ioreasons',0)";

This doesn’t generate anything useful as output on the command line, except for the name of the thread-logfile where we can find the contents of the dump we requested:

Dump sequence #18 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc
Cell enkcel01 successfully altered

As an aid for finding your dump in the thread-logfile: search (“/” when you use “less”) for the dump sequence, in the above example (sequence #18): /sequence\ #18
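
Alternatively (a sketch; adjust the number of lines after the match to the size of your dump), grep can extract the section directly; the trace file name is the one echoed above:

grep -A 40 "sequence #18" /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc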

This is an example from a cell in the Enkitec lab, which I used for this example:

Cache::dumpReasons           I/O Reason Table
2013-10-23 08:11:06.869047*: Dump sequence #18:
Cache::dumpReasons Reason                  Reads Writes
Cache::dumpReasons ------------------------------------
Cache::dumpReasons UNKNOWN                436784 162942
Cache::dumpReasons RedoLog Write               0  80329
Cache::dumpReasons RedoLog Read              873      0
Cache::dumpReasons ControlFile Read       399993      0
Cache::dumpReasons ControlFile Write           0 473234
Cache::dumpReasons ASM DiskHeader IO        4326      4
Cache::dumpReasons BufferCache Read        27184      0
Cache::dumpReasons DataHeader Read          2627      0
Cache::dumpReasons DataHeader Write            0   1280
Cache::dumpReasons Datafile SeqRead           45      0
Cache::dumpReasons Datafile SeqWrite           0    373
Cache::dumpReasons HighPriority Checkpoint Write      0   6146
Cache::dumpReasons DBWR Aged Write             0    560
Cache::dumpReasons ReuseBlock Write            0    150
Cache::dumpReasons Selftune Checkpoint Write      0 116800
Cache::dumpReasons RequestLit Write            0     25
Cache::dumpReasons Archivelog IO               0    255
Cache::dumpReasons TrackingFile IO          2586   2698
Cache::dumpReasons ASM Relocate IO             0    200
Cache::dumpReasons ASM Replacement IO          0     91
Cache::dumpReasons ASM CacheCleanup IO         0   4514
Cache::dumpReasons ASM UserFile Relocate       0   2461
Cache::dumpReasons ASM Redo IO                 0  10610
Cache::dumpReasons ASM Cache IO             1953      0
Cache::dumpReasons ASM PST IO                  0     44
Cache::dumpReasons ASM Heartbeat IO           26 162984
Cache::dumpReasons ASM BlockFormat IO          0   3704
Cache::dumpReasons ASM StaleFile IO            0    675
Cache::dumpReasons OSD Header IO               0    315
Cache::dumpReasons Smart scan              11840      0

Please mind the numbers here are numbers of IOs; they do not say anything about the size of the IOs. Also please mind these are the numbers of a single cell; you probably have 3, 7 or 14 cells.

In my opinion this IO summary can be of much value during IO performance investigations, but also during proofs of concept.

If the cell has been running for a while, these numbers may grow very big. In order to establish an easy baseline, the IO reason numbers can be reset, so you can start off your test or proof-of-concept run and measure what has actually happened on the cell layer! In order to reset the IO reason table, enter the following command in the cellcli:

alter cell events = "immediate cellsrv.cellsrv_resetstats(ioreasons)"; 

This will reset the IO reasons table in the cell.
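
A possible workflow for a test or proof of concept (a sketch, using the commands shown above from the shell) is to reset the counters, run the workload, and dump the reasons table afterwards:

cellcli -e "alter cell events = \"immediate cellsrv.cellsrv_resetstats(ioreasons)\""
# run the workload or proof-of-concept here
cellcli -e "alter cell events = \"immediate cellsrv.cellsrv_dump('ioreasons',0)\""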

PS1: Thanks to Nikolay Kovachev for pointing out the ‘ioreasons’ resetstats parameter. Indeed ‘all’ is way too blunt.
PS2: The IO numbers seem to be the number of IO requests the cell has gotten from its clients (ASM and the database) for data, not for metadata. During a smartscan, metadata flows between the database and the cell server before data is actually served.

Exadata gets its performance by letting the storage (the Exadata storage server) participate in query processing, which means part of the processing is done as close as possible to where the data is stored. The participation of the storage servers in query processing means that the storage grid can process a smart scan request massively in parallel (depending on the number of storage servers participating).

However, this also means additional CPU is used on the storage layer. Because there is no real limit on how many queries can use smartscans (and/or hybrid columnar compression, in other words: processing) on the available storage servers, a storage server can get overloaded, which could hurt performance. To overcome this problem, Oracle introduced the ‘passthrough’ functionality in the storage server. In the Exadata book, it is explained that this functionality came with storage server version 11.2.2.3.0 and Oracle database version 11.2.0.3 with Exadata bundle patch 7. It also explains that the ‘passthrough’ functionality means that the storage server deliberately starts sending non-storage-processed data during the smartscan. So when this happens, you still do a smartscan (!), but your foreground process or parallel query slave gets much more data and needs to do more processing. The database-side statistic to know this is happening is “cell physical IO bytes sent directly to DB node to balance CPU usage”, which is available at the instance level in v$sysstat and at the session level in v$sesstat.
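
To check on the database side whether passthrough has been used (a minimal sketch; the statistic name is the one mentioned above):

sqlplus -S / as sysdba <<'EOF'
select name, value
from   v$sysstat
where  name = 'cell physical IO bytes sent directly to DB node to balance CPU usage';
EOF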

But how does this work on the storage server?

On the storage server, the passthrough mode properties are governed by a few “underbar” or “undocumented” parameters. In order to get the current values on the Exadata storage server, the following command can be used in the “cellcli”:

alter cell events="immediate cellsrv.cellsrv_dump('cellparams',0)";

The cell will echo the thread-logfile in which the output of this dump is put:

Dump sequence #1 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc
Cell enkcel01 successfully altered

Now load this tracefile (readonly) in your favourite text manipulation tool (I use ‘less’ because less is more).

The “underbar” parameters which are of interest are the following parameters:

_cell_mpp_cpu_freq = 2
_cell_mpp_threshold = 90
_cell_mpp_max_pushback = 50

The “MPP” function is responsible for the passthrough functionality. I can’t find anywhere what “MPP” means; my guess is “Measurement (of) Performance (for) Passthrough”. These parameters govern how it works.

_cell_mpp_cpu_freq seems to be the frequency at which the MPP code measures the host CPU: “2” seems to mean every 200ms.
_cell_mpp_threshold seems to be the CPU usage threshold after which the passthrough functionality kicks in.
_cell_mpp_max_pushback seems to be the maximum percentage of blocks (unsure what the exact granularity is) which are sent to the database in passthrough mode.
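
Because the cellparams dump is quite long, a simple grep on the trace file echoed by the dump command (the file name below is the one from this example; yours will differ) pulls out just these parameters:

grep _cell_mpp /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc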

In order to get a good understanding of what MPP does, there is an MPP specific dump which can be very beneficial to diagnose MPP related matters. This dump is available on the storage server, which means in the cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('mpp_stats',0)";

The cell will once again echo the thread-logfile in which the output of this dump is put:

Dump sequence #8 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
Cell enkcel01 successfully altered

Now peek in the tracefile!

Trace file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
ORACLE_HOME = /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.32-400.11.1.el5uek
Version:        #1 SMP Thu Nov 22 03:29:09 PST 2012
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.2.1_LINUX.X64_130109

*** 2013-10-21 10:54:05.994
UserThread: LWPID: 16114 userId: 22 kernelId: 22 pthreadID: 0x4d1e5940
2013-10-21 14:36:04.910675*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 0 out of 4232 total blocks.  Percent = 0.000000%
2013-10-21 14:36:04.910675*: Dump sequence #3:
[MPP] Current cell cpu utilization: 7
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 8 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
...
[MPP] Mon Oct 21 14:05:57 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0

So, what do we see here? We see a cell tracefile which is in the well-known Oracle trace format. This means it starts off with a header, in this case one specific to the cell server.

Then we see a timestamp with three asterisks in front of it. The time in this timestamp (10:54:05.994) is roughly three and a half hours earlier than the timestamp of the next messages, which is 14:36:04.910. The line with the three asterisks carries the creation timestamp of the tracefile, which is when the thread we are using was created. The next line was also created just after that time, and lists the LWPID, userId, etc.

The line with [MPP] is created because of the mpp_stats dump. The timestamp has a single asterisk after it, which means the time is an approximation. The line contains important information: during the approximately 30 minutes covered by this dump, 4232 blocks were processed by this cell, and 0 blocks were “executed” in “passthru” mode.

Next, the measurements which were taken every 200ms in the past are printed, to give an exact overview of the measured CPU busyness and the rate at which “pushback”, alias “passthru”, was applied.

To see what this means, let’s generate CPU busyness on the cell, and see if we can get the storage server to invoke “passthru”. There is a simple trick to let a process take 100% of a CPU thread with common Linux shell tools: ‘yes > /dev/null &’. The storage server which I use has 16 CPU threads, so I start 16 of these processes, to effectively monopolise every CPU thread (see the sketch below).
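
This is a sketch of how to start (and afterwards stop) one such process per CPU thread on the storage server; obviously, only do this on a test system:

# start one 'yes' process per CPU thread
for i in $(seq 1 $(grep -c ^processor /proc/cpuinfo)); do yes > /dev/null & done
# when the test is done, stop them again
killall yes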

Next, I started a (sufficiently large; 7GB) scan on a table via sqlplus, and then dumped ‘mpp_stats’ using the method described in this blog.

2013-10-21 16:30:27.977624*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 2728 out of 13287 total blocks.  Percent = 20.531347%
2013-10-21 16:30:27.977624*: Dump sequence #10:
[MPP] Current cell cpu utilization: 8
[MPP] Mon Oct 21 16:30:27 2013 [Cell CPU History] 8 [Pushback Rate] 0
...
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 95 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 95 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 99 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 98 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 98 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 99 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 100 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 0
...

This shows it all! The header shows that during the last 30 minutes, this storage server sent 2728 blocks out of the total of 13287 blocks via passthrough mode. Further, in the lines which contain the historical measurements, the “pushback rate” can be seen climbing up to 50%, because the CPU usage was above 90%.

Please mind the techniques I’ve described here are done on one storage server, while a normal Exadata setup has 3 (8th/quarter rack), 7 (half rack) or 14 (full rack) storage servers.
