
This is yet another blogpost on Oracle’s direct path read feature which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table are already in the buffercache.
- Less than 25% of the blocks in the buffercache are dirty.

Also (thanks to Freek d'Hooge, who pointed me to an article from Tanel Poder) you can change the optimizer statistics to influence the segment size used for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the "code path", so during execution.

So let's take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM).

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we take the small table threshold and multiply it by five, we get 5055. This means that the size of table T2 is more than enough, so it should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I've created a relatively big table with a (normal) index on it in the same database. The index is created on a single column, called 'id'. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan is a scan which just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which read adjacent allocated blocks in order).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation	      | Name	   | Rows  | Bytes | Cost (%CPU)| Time	   |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	   |	 1 |	13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       | 	   |	 1 |	13 |		|	   |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |	34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

If we look at the small table threshold times five (5055), this index is much bigger than that. Also, this is bigger than table T2. Let's execute select count(id) from bigtable, and look at the statistic 'index fast full scans (direct read)':

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn't been a direct path read! This means that this read has been done in the "traditional way". This is a bit… counterintuitive. I've traced the session, and indeed it's doing traditional multiblock reads, visible as scattered read waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter "_very_large_object_threshold". The information found on the internet reports this value is in megabytes. A quick stroll through a number of different databases (all on 11.2.0.3) shows this parameter is quite probably statically set at "500".

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.

This is interesting, because it looks like this parameter does the same for full scans on index segments as "_small_table_threshold" does for full scans on table segments: the size of the segment to be scanned needs to be bigger than five times the parameter value.

There are also differences: the small table threshold is set in blocks, while the very large object threshold is (apparently) set in megabytes. Also, the small table threshold is set by default at 2% of the size of the buffercache (so it scales up with bigger caches), whereas the very large object threshold seems to be fixed at 500. If my finding is correct, it means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It's unknown to me whether the 50% limit for blocks in the cache and the 25% limit for dirty blocks apply here too.

When an Oracle process starts executing a query and needs to do a full segment scan, it needs to decide whether it's going to use 'blockmode' or 'offloaded mode'. Blockmode is the normal way of working on non-Exadata Oracle databases, where blocks are read from disk and processed by the Oracle foreground process, either "cached" (read from disk and put in the database buffercache) or "direct" (read from disk and put in the process' PGA). In offloaded mode, part of the execution is done by the cell server.

The code layer where the Oracle database process initiates the offloading is 'kcfis'; an educated guess is Kernel Cache File Intelligent Storage. Does a "normal", non-Exadata database ever use the 'kcfis' layer? My first guess would be 'no', but we all know guessing takes you nowhere (right?). Let's see if a "normal" database uses the 'kcfis' functions on a Linux x64 (OL 6.3) system with Oracle 11.2.0.3 64 bit using ASM.

The only way I am aware of to profile Oracle kernel functions is using 'gdb' and breaking on functions in the Oracle executable:
(the process id shown below ought to be the process id of the Oracle database foreground process of the session you are going to execute the query in. Do not, I repeat: do NOT do this with other processes, especially ones that do important tasks!)

# gdb -p 42
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) rbreak ^kcfis.*
Breakpoint 1 at 0x2204094
<function, no debug info> kcfis_get_sched_delay;
Breakpoint 2 at 0x220410a
<function, no debug info> kcfis_capability_tab_get;
Breakpoint 3 at 0x2204150
<function, no debug info> kcfis_can_session_migrate;
Breakpoint 4 at 0x2204190
<function, no debug info> kcfis_fob_storage_attr;
Breakpoint 5 at 0x22041d0
<function, no debug info> kcfis_init_resource_limit;
Breakpoint 6 at 0x22041f2
<function, no debug info> kcfis_tablespace_is_on_sage;
...
(gdb) c
Continuing.

Okay, we've got the debugger set up; now let's execute a simple query (doing a full scan) to see if kcfis is touched on a "normal" server or not!

TS@v11203 > select count(*) from t2;

I am on OS-X (Apple) using iTerm, and I see the tab of the gdb session turning red: gdb generated some output on the screen!

(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) 

So, we hit a kcfis function! Let me state upfront that I do not have source code of any kind, so my statements about what a function does are actually guesses. Anyway: from the name of the function (kcfis_tablespace_is_on_sage) it looks like a function in the kcfis layer which determines if a tablespace is on an Exadata storage server (Exadata's internal name is/was 'sage'). Nice, so we hit the layer that determines whether the tablespace is on Exadata.

Let’s continue the profiling with gdb:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000220410a in kcfis_capability_tab_get ()
(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) c
Continuing.

We see another call to the function 'kcfis_tablespace_is_on_sage', and a call to 'kcfis_capability_tab_get'. The latter function probably probes the table (but it could be the tablespace) to get its capabilities. This could be the function which checks the requirements for hybrid columnar compression; I am not sure about that.

At what point during the processing of the full segment scan does the kcfis_tablespace_is_on_sage occur? One way of investigating this is profiling some functions we know a (direct path) full scan executes, and seeing where the kcfis_tablespace_is_on_sage kicks in. When the buffer cache is flushed prior to executing a full scan, and the SQL is made unique so it has to be parsed, the following sequence of events happens:

- Parse
- Execute
- A 'SQL*Net message to client' wait
- A 'db file sequential read' wait (for reading the segment header)
- Potentially an 'asynch descriptor resize' wait
- The full scan is done asynchronously (potentially revealing some 'direct path read' waits)

So if we profile on start and end of a wait (kslwtbctx and kslwtectx), a single block read (pread64), AIO (io_submit and io_getevents_0_4) and of course kcfis_tablespace_is_on_sage, we should be able to see that:

Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, pread64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 6, io_getevents_0_4 (ctx=0x7fb42f475000, min_nr=2, nr=128, events=0x7fffb4c66768, timeout=0x7fffb4c67770) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

So what do we see here?
Row 1: Here the wait for 'SQL*Net message from client' ends.
Rows 2-3: This is the 'SQL*Net message to client' wait.
Rows 5-7: This is the begin wait, the pread64 for reading the segment header, and the end wait.
Rows 8-9: The session probes for Exadata.
Rows 10-11: Start and end of a wait, 'asynch descriptor resize' (verified by the trace file).
Row 12 onwards: Here the full scan takes off.

So… a query starts (I've made it unique by adding a dummy hint, so it's parsed again), the SQL*Net roundtrip occurs, the segment header is read, then the process checks whether it's on Exadata, which it isn't here, and starts an asynchronous full scan.

So let's repeat this check on a database which is on Exadata! In order to do so, we need to be aware several calls are not done on Exadata: pread64, io_submit and io_getevents_0_4, because we do not do local IO, but issue the IOs over infiniband. So I set breakpoints on kslwtbctx, kslwtectx and kcfis_tablespace_is_on_sage. This is how that looks:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000905cf62 in kslwtectx ()
Breakpoint 3, 0x0000000002230366 in kcfis_tablespace_is_on_sage ()

Mmmh, this is different. The 'SQL*Net message from client' wait ends, and before any other wait occurs, the existence of Exadata is checked. This is different from the non-Exadata case. Let's take a look at the backtrace of the break on kcfis_tablespace_is_on_sage:

(gdb) bt
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#3  0x0000000001a6111d in kkmfcblo ()
#4  0x000000000922f76d in kkmpfcbk ()
#5  0x000000000942e538 in qcsprfro ()
#6  0x000000000942de29 in qcsprfro_tree ()
#7  0x000000000942de6e in qcsprfro_tree ()
#8  0x0000000002dd80c5 in qcspafq ()
#9  0x0000000002dd51d9 in qcspqbDescendents ()
#10 0x0000000002dd91e4 in qcspqb ()
#11 0x0000000001a6b2be in kkmdrv ()
#12 0x0000000002584c76 in opiSem ()
#13 0x000000000258ac8b in opiDeferredSem ()
#14 0x000000000257dc32 in opitca ()
#15 0x0000000001ec3d7d in kksFullTypeCheck ()
#16 0x00000000092a7256 in rpiswu2 ()
#17 0x0000000001eca977 in kksLoadChild ()
#18 0x0000000009298448 in kxsGetRuntimeLock ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
#23 0x0000000001b82a46 in kpooprx ()
#24 0x0000000001b80d2c in kpoal8 ()
#25 0x00000000091fb8b8 in opiodr ()
#26 0x000000000939e696 in ttcpip ()
#27 0x000000000180f011 in opitsk ()
#28 0x0000000001813c0a in opiino ()
#29 0x00000000091fb8b8 in opiodr ()
#30 0x000000000180af4c in opidrv ()
#31 0x0000000001e0a77b in sou2o ()
#32 0x0000000000a0cc05 in opimai_real ()
#33 0x0000000001e106ec in ssthrdmain ()
#34 0x0000000000a0cb71 in main ()

What is interesting to see is line 23, backtrace layer number 21: kksParseCursor. So the detection of the storage servers actually already happens during parsing, not at the point where a full segment scan starts and the process needs to decide whether to do a smartscan or not.

With Exadata version 11.2.3.2.0 came the Unbreakable Linux Kernel for Exadata, which had been the stock EL5 Red Hat kernel (2.6.18) prior to this version. With the unbreakable kernel came the opportunity to run the perf utility. This utility can show which functions are active inside an executable when there's a symbol table. And the Oracle database executable has a symbol table! One reason to do this is to get a more granular overview of what the Oracle database is doing than the wait interface gives, especially a more detailed overview of the time that is visible in the wait interface as 'on cpu'.

Right after the Exadata upgrade, I ran a simple query (which probably doesn't reflect any real customer case) to get an idea. I had been running 'select count(*) from bigtable' on Exadata before, and saw most of it being CPU, and a minor part of it being 'cell smart table scan'. Now with perf I have the opportunity to get more details on where the CPU time is spent!

These are the top 5 functions from that session:

    31.50%     476423   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    30.20%     456774   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
     7.48%     113083   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     6.96%     105301   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.94%      44475   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

So the top function used during the processing of the SQL is a function called 'sxorchk'. Sadly, Oracle doesn't provide information about this. Anyway, this is executing an XOR function to check the information gotten from storage. The name of this function probably stands for System XOR CHecK.
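To give an idea of what an XOR-style check does, here is a tiny C sketch. This is purely an illustration of the principle, NOT Oracle's actual sxorchk implementation (which is undocumented); the buffer contents and checksum width are made-up assumptions:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: fold a buffer onto itself with XOR. The result is
   compared with a checksum stored alongside the data; a mismatch means the
   data got damaged somewhere between writing and reading. Not Oracle's code. */
uint16_t xor_fold(const void *buf, size_t len)
{
    const uint16_t *p = buf;
    uint16_t acc = 0;
    for (size_t i = 0; i < len / sizeof(uint16_t); i++)
        acc ^= p[i];
    return acc;
}

int main(void)
{
    uint16_t block[4096];
    for (size_t i = 0; i < 4096; i++)
        block[i] = 0x7878;                             /* fake block content */
    uint16_t stored = xor_fold(block, sizeof(block));  /* checksum at write time */
    block[50] ^= 1;                                    /* simulate corruption */
    printf("check %s\n", xor_fold(block, sizeof(block)) == stored ? "passed" : "FAILED");
    return 0;
}

Such a check has to touch every byte it verifies, which fits with sxorchk showing up as the top CPU consumer during this scan.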

Alright, once we know this, we can look into the parameters of the instance I am executing the SQL in, which are taken from the default Exadata parameter template (!!):

SYS@t11203 AS SYSDBA> show parameter db_block

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_block_buffers		     integer	 0
db_block_checking		     string	 FALSE
db_block_checksum		     string	 TYPICAL
db_block_size			     integer	 8192

So, this should be related to db_block_checksum, because db_block_checking is off (false). Well, since this is a test instance, let's turn it off and redo the scan:

    49.35%     480911   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
    11.01%     107299   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     6.56%      63885   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     3.97%      38712   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     3.49%      33970   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdst_fetch

The sxorchk function is gone now! This is also reflected in the response time: the time with db_block_checksum set to typical is 00:01:02.44, and with db_block_checksum set to false it is 00:00:41.68 on my system. That is a difference of roughly 20 seconds, or, as we can see from the first perf function list, 31.5% of the time. That is a significant amount of time!

When I discussed this with Tuomas Pystynen, he asked me a very valid question: if this is a smartscan, the database does not get blocks, it gets result-sets. So blocks cannot be checked on the database layer. Is this offloaded to the storage/cell server?

Well, if it is offloaded to the storage, it will not be reflected in a function on the database layer. In other words, something is XOR'ed by the database foreground process, and this behaviour is switched on and off with db_block_checksum! The logical conclusion from these facts would be that blocks are involved in some way…

There is a way to know what is actually happening: looking at the backtrace of the sxorchk function! Let's do that!

    31.50%   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.76%-- kcbhxoro
                |          kcbhvbo
                |          |          
                |          |--100.00%-- kcbhvb_corruptcheck
                |          |          kcbl_sage_block_check
                |          |          kcfis_oss_block_verify
                |          |          kcfis_validate_translation
                |          |          kcfis_process_completed_req
                |          |          kcfis_process_completed_buffer
                |          |          kcfis_process_reaped_io
                |          |          kcfis_read
                |          |          kcbl_predpush_get
                |          |          kcbldrget
                |          |          kcbgtcr
                |          |          ktrget3
                |          |          ktrget2
                |          |          kdst_fetch

What we see here is that the function is called from the kcfis (Kernel Cache File Intelligent Storage is my guess) layer, from a function called 'kcfis_oss_block_verify'. In other words: a block, which is actually the result set sent by the cell server, is validated/checked. That is understandable, but the usage of the parameter 'db_block_checksum' for setting this is misleading, to put it in a nice way.

The next question I asked myself is: how about a normal/non-Exadata database? Well, I can mimic a non-Exadata database by setting the parameter 'cell_offload_processing' to false to disable smartscans!

This is how the top 5 functions look with db_block_checksum set to true, without smartscan:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    18.53%     353741   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.05%     191847   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.35%     102161   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.73%      52103   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

And this is how the top 5 functions look with db_block_checksum set to false, without smartscan:

    36.51%     706798   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.47%     202645   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.58%     107941   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     3.57%      69044   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     2.38%      46036   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] ktrgcm

If we get the backtrace of the sxorchk function without smartscan enabled:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.79%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch

We can see the sxorchk function is called from kcbldrget (the direct path read get function), with more or less the same helper function to prepare for the XOR function, and no other layers in between (like kcfis in the backtrace with smartscan enabled).

My conclusion is: db_block_checksum is a parameter which enables checking of the data the database gets from "storage", which apparently is done via an XOR function. On a regular/non-Exadata system this checking is done for blocks, which is what the parameter name suggests, but on Exadata with a smartscan the checking is still done, regardless of the fact that no blocks but result sets are gotten from "storage".

The checking takes a significant portion of time during the processing of my (very simple!) query. Probably other functions can make the overall processing more CPU intensive, which means the relative portion of time spent on checking gets smaller.

With Exadata/smartscan the relative amount of time spent on sxorchk with my simple SQL is 32%, without smartscan the relative amount of time drops to 21%. This is still a significant amount of time (and it is the function the most time is spent in, in both cases!).

The final conclusion is that you should think about the setting of this parameter if you are doing a lot of physical IO, and set it according to the needs of the database.

One addition: this is done on a V2 half rack version of Exadata (Nehalem CPU), so anyone with X2-[28] or X3-[28] is welcome to profile a ‘select count(*) from bigtable’ and post it here to see if there’s any difference. The database version is 11.2.0.3 BP15.

In my previous post I touched on the topic of a "new" codepath (codepath means "way of utilising the operating system") for full segment scans. Full segment scan means scanning a segment (I use "segment" because it could be a partition of a table or index), which is commonly visible in an explain plan (but not restricted to these) as the rowsources "TABLE ACCESS FULL", "INDEX FAST FULL SCAN" and "BITMAP FULL SCAN".

Look at my presentation About multiblock reads to see how and when direct path reads kick in, and what the difference between buffered multiblock reads and direct path reads is. Most notably, Oracle has released very little information about asynchronous direct path reads.

This post is about the implementation of direct path reads in the Oracle executable in 11.2.0.3 and the OS implementation on Linux.
For completeness: this is Oracle Linux 6.3 X64 with kernel 2.6.39-300.17.3.el6uek.x86_64. Database 11.2.0.3 64 bit, no PSU/CPU.
The database uses two ASM disks and the clusterware in single machine mode.
The system is running in a VM on my Macbook OSX 10.8.2 with SSD, VMWare Professional 5.0.2.

The database action which is traced with gdb is a simple "select count(*) from t2", where t2 is a table without any indexes or constraints, and big enough to make the database engine choose a direct path read. Oracle can choose a number of new, largely undocumented mechanisms for optimising specifically direct path reads. The mechanisms observed in this blogpost are the true asynchronous reads, which means the foreground process can issue several IO read requests and reap them when they are finished, and the scaling up of the number of simultaneous IO requests.

1. Observing the IO requests: getting the picture straight

First we restore the setup from my previous post, which means a normal database process (using sqlplus), and a root session with gdb attached to this process using “gdb -p “

Now set breakpoints for the operating system IO calls (with the intention to see io_submit() and io_getevents()) to show the IO calls for the oracle foreground process, and let the oracle process continue processing by using “c” in the debugger:

(gdb) rbreak ^io_.*
Breakpoint 1 at 0xa08c20
 io_prep_pwritev;
Note: breakpoint 1 also set at pc 0xa08c20.
...
Breakpoint 45 at 0x7f2e71b1dc0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 1-45, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now issue the full segment scan in the Oracle session, and look at the output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()

First thing that comes to mind is: why are there TWO calls for io_submit (io_submit@plt () and io_submit () from /lib64/libaio.so.1) and only 1 call for io_getevents (io_getevents@plt ())?

This is where 'plt' needs an introduction. 'plt' means procedure linkage table. This is a construction for functions used by an executable that uses dynamically linked shared libraries, which is exactly what the 'oracle' executable uses. If you look at the address of the @plt call, you see it is in the oracle executable (see /proc/maps), and io_submit () from /lib64/libaio.so.1 is, as the line indicates, from the dynamically linked library libaio.so.1 (which can also be seen from /proc//maps).

Back to io_getevents. We see only the io_getevents@plt call, which means that either the executable fakes the system call (?), or we are missing something. This led me to investigate the libaio library itself. This can be done using 'nm -D' (nm is an operating system utility that lists symbols):

# nm -D /lib64/libaio.so.1
0000000000000000 A LIBAIO_0.1
0000000000000000 A LIBAIO_0.4
0000003f38200710 T io_cancel
0000003f38200670 T io_cancel
0000003f38200690 T io_destroy
0000003f382006a0 T io_getevents
0000003f38200620 T io_getevents
0000003f38200570 T io_queue_init
0000003f38200590 T io_queue_release
0000003f382005b0 T io_queue_run
0000003f382005a0 T io_queue_wait
0000003f382006e0 T io_queue_wait
0000003f38200680 T io_setup
0000003f38200660 T io_submit

Aha! There are TWO io_getevents here! Now THAT is interesting!

Let’s see which io_getevents call or calls are used by the oracle executable:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) rbreak ^io_get.*
Breakpoint 46 at 0x3f382006a0
 io_getevents;
...
Breakpoint 53 at 0xa09030
 io_getevents@plt;

And list the breakpoints:

(gdb) info break
Num     Type           Disp Enb Address            What
46      breakpoint     keep y   0x0000003f382006a0 io_getevents
47      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
48      breakpoint     keep y   0x0000003f382006a0 io_getevents
49      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
50      breakpoint     keep y   0x0000003f382006a0 io_getevents
51      breakpoint     keep y   0x0000003f382006a0 io_getevents
52      breakpoint     keep y   0x0000003f382006a0 io_getevents
53      breakpoint     keep y   0x0000000000a09030 io_getevents@plt

So, all the io_getevents breakpoints are for address 0x0000003f382006a0! If we look back at the libaio contents, we see the io_getevents functions have addresses 0x0000003f382006a0 and 0x0000003f38200620.

Let’s make gdb break and continue at the second io_getevents address too!

(gdb) break *0x0000003f38200620
Breakpoint 54 at 0x3f38200620
(gdb) commands
Type commands for breakpoint(s) 54, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now run the full segment scan in the sqlplus session again, and see what gdb shows:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1

Okay! We got the system call back!

So, we now have a set of breakpoints in gdb which is able to show us the asynchronous operating system calls Oracle uses in a scan, which are io_submit and io_getevents. Oracle uses more calls for managing asynchronous IO, but for clarity I focus on io_submit and io_getevents.
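As an aside: the reason nm -D showed two io_getevents entries is symbol versioning. libaio exports both an old (LIBAIO_0.1) and a current (LIBAIO_0.4) version of the function, and the io_getevents_0_4 name that shows up in the gdb output once libaio debug information is available is simply the source-level name of the current (0.4) version. If you want to see both versions resolve to the two different addresses, a quick check (hypothetical, not part of the original investigation; compile with -ldl) using the GNU dlvsym extension looks like this:

#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *h = dlopen("libaio.so.1", RTLD_NOW);
    if (!h) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    /* libaio defines the same function name twice, under two symbol versions. */
    printf("io_getevents@LIBAIO_0.1: %p\n", dlvsym(h, "io_getevents", "LIBAIO_0.1"));
    printf("io_getevents@LIBAIO_0.4: %p\n", dlvsym(h, "io_getevents", "LIBAIO_0.4"));
    dlclose(h);
    return 0;
}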

2. Add wait events to direct path read profile

At this point I can add two breakpoints for the wait instrumentation (kslwtbctx (enter wait) and kslwtectx (end wait)) using:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 55 at 0x8f9a652
 kslwtbctx;
Breakpoint 56 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 55-56, one per line.
End with a line saying just "end".
>silent
>f
>c
>end

If I rerun the full scan, I can see there are no waits for the IO related calls, just io_submit, then io_getevents, which I can tell because I don't see kslwtbctx and kslwtectx during the IO related calls. I see them AFTER the IO is done, which is normal, expected behavior. Please mind the cause of this is that this is a VM where the disks are on an SSD, and most probably the disks (which are files with VMWare Fusion) are in the cache of my MacBook, which makes IO very fast.

As introduced in the previous post, the last function call seen is the entering of a wait using 'kslwtbctx'; this is the wait 'SQL*Net message from client', which is the foreground waiting for user input to process:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

Now see what happens if I slow the IO’s down using cgroups.

I slowed down both my (ASM-)disks to 1 IOPS, and ran the scan in sqlplus again. This is a snippet of the gdb output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008fa1334 in kslwtectx ()

Actually, the above is the typical pattern I see: an IO submitted via io_submit, four io_getevents calls, registering of a wait event, then another io_getevents call, and the ending of the wait event, provided IO isn't too fast. (In real life it all depends on the speed of the IO beneath the operating system, and mostly it is a mixture of the last and the previous gdb output.)

The pattern can have multiple forms, for example, when the scan starts, after the initialisation of the asynchronous IO dependencies (like an IO context on the OS and the IO slots in Oracle), the first calls are two io_submit calls (in order to get the minimal number of asynchronous IO’s (2) in flight).

Apparently, the four io_getevents calls after io_submit are non-blocking. This is my assumption, simply because I see these calls scrolling over my screen until the wait event is registered, and the scrolling stops at another io_getevents call. But can we be sure?

3. Digging deeper with debuginfo

In order to look deeper into the io_getevents calls, and specifically the arguments of the io_getevents function, the debuginfo package of libaio needs to be installed (the debuginfo packages are available at http://oss.oracle.com/ol6/debuginfo). After that, the gdb session needs to be restarted in order to pick up the debug information.

This is how the gdb output looks with the libaio debuginfo installed:

#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 io_submit (ctx=0x7ff6ceb2c000, nr=1, iocbs=0x7fff2a4e09e0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e9048, timeout=0x7fff2a4ea050) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ec128, timeout=0x7fff2a4ed130) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e8e48, timeout=0x7fff2a4e9e50) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ebf28, timeout=0x7fff2a4ecf30) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=1, nr=128, events=0x7fff2a4e8e38, timeout=0x7fff2a4e9e40) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008fa1334 in kslwtectx ()

Now we see a LOT more interesting information!

First of all, all the IO calls involved are done in the same "aio_context": ctx=0x7ff6ceb2c000. This means that any asynchronous IO that is done during this query reports its readiness here. This also means IOs can be reaped out of order(!).

Next is min_nr. The non-instrumented io_getevents calls have min_nr set to 2, the io_getevents call which is instrumented has min_nr set to 1.

nr is the maximum number of IO’s that can be reaped by this call in this aio_context. As far as I know, the number of slots cannot be higher than 32, which means that if I am right, there will not be more than 32 requests.

The last io_getevents argument is timeout, which is really interesting for understanding and verifying the behavior I described. This value is a pointer to a struct which holds the timeout specification. In order to actually know the timeout value, we need to print the contents of the struct.

This is where gdb, once the debuginfo is available, can help. Let’s modify the breakpoints the following way:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) break *0x0000003f38200620
Breakpoint 38 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 38, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 39 at 0x8f9a652
 kslwtbctx;
Breakpoint 40 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 39-40, one per line.
End with a line saying just "end".
>silent
>f
>end
(gdb) rbreak ^io_.*
Breakpoint 41 at 0x3f38200570: file io_queue_init.c, line 28.
int io_queue_init(int, io_context_t *);
...
Breakpoint 74 at 0x7f549fd44c0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 41-74, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Notice the "print *timeout" with breakpoint 38 (*0x0000003f38200620), a.k.a. the "correct" io_getevents call.
Using “print *timeout” we can let gdb follow the pointer, and print the contents of the struct. Let’s see if my assumption about the non-instrumented io_getevents is correct:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  io_submit (ctx=0x7f54a1956000, nr=1, iocbs=0x7fff78f059b0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0dfd8, timeout=0x7fff78f0efe0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$21 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f110b8, timeout=0x7fff78f120c0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0ddd8, timeout=0x7fff78f0ede0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f10eb8, timeout=0x7fff78f11ec0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=1, nr=128, events=0x7fff78f0ddc8, timeout=0x7fff78f0edd0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 600, tv_nsec = 0}

Yes! What I saw on my screen in gdb is now proven by printing the timeout struct!! The non-instrumented io_getevents calls just peek in the completion queue (ctx) four times to see if min_nr (3 in this case) IO requests are ready, then the process registers a wait event and issues another io_getevents call, looking in the same completion queue (ctx), but now for only 1 ready IO request, and with a timeout of 600 seconds.
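To make these calls a bit more tangible outside of Oracle, here is a minimal, self-contained libaio sketch (compile with -laio) that mimics the pattern observed above: submit a couple of reads, peek at the completion queue with a zero timeout, and fall back to a blocking io_getevents with a long timeout. The file name, request sizes and the exact min_nr/timeout values are illustrative assumptions only; this is obviously not Oracle's code:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <libaio.h>

int main(void)
{
    /* Hypothetical file; Oracle would be reading its datafiles or ASM disks here. */
    int fd = open("/path/to/some/file", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(32, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    /* O_DIRECT requires aligned buffers. */
    void *buf1, *buf2;
    if (posix_memalign(&buf1, 4096, 1048576) || posix_memalign(&buf2, 4096, 1048576))
        return 1;

    struct iocb cb1, cb2, *cbs[2] = { &cb1, &cb2 };
    io_prep_pread(&cb1, fd, buf1, 1048576, 0);
    io_prep_pread(&cb2, fd, buf2, 1048576, 1048576);
    io_submit(ctx, 2, cbs);                       /* two IOs in flight */

    struct io_event events[128];
    struct timespec zero = { 0, 0 };              /* non-blocking peek */
    struct timespec longwait = { 600, 0 };        /* blocking wait, like the instrumented call */

    /* Peek: returns immediately, even if fewer than min_nr (2) requests are done. */
    int reaped = io_getevents(ctx, 2, 128, events, &zero);
    if (reaped < 2) {
        /* Not enough completions yet: now really wait, for at least 1 request,
           comparable to what Oracle does inside the 'direct path read' wait. */
        int more = io_getevents(ctx, 1, 128, events, &longwait);
        if (more > 0)
            reaped = (reaped > 0 ? reaped : 0) + more;
    }
    printf("reaped %d IO requests\n", reaped);

    io_destroy(ctx);
    close(fd);
    return 0;
}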

4. Conclusion

The reason for this post is to show the essence of how direct path reads work. There is much more to be said on this subject, especially about the adaptive or "auto tune" mechanism, which scales up the number of asynchronous IOs in flight.

Recently I was discussing some IO related waits with some friends. The wait I was discussing was ‘kfk: async disk IO’. This wait was always visible in Oracle version 11.2.0.1 and seems to be gone in version 11.2.0.2 and above. Here is the result of some investigation into that.

First: the wait is not gone with version 11.2.0.2 and above, which is very simple to prove (this is a database version 11.2.0.3):

SYS@v11203 AS SYSDBA> select name, parameter1, parameter2, parameter3, wait_class from v$event_name where name like 'kfk: async disk IO';

NAME		     PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
-------------------- ---------- ---------- ---------- ----------------------------------------------------------------
kfk: async disk IO   count	intr	   timeout    System I/O

What is interesting, is that the wait class is ‘System I/O’. I don’t know the official definition of the wait class ‘System I/O’, but it tells me that it is something a background process is waiting for, not my foreground process. But I could be wrong…

Let's look at an excerpt of a tracefile of an asynchronous full segment scan in Oracle version 11.2.0.1. This host is running in a VM on OL6.3 X64:

PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356620409310181 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=120,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310180
EXEC #3:c=0,e=63,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310297
WAIT #3: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1356620409310351
WAIT #3: nam='Disk file operations I/O' ela= 9 FileOperation=2 fileno=5 filetype=2 obj#=-1 tim=1356620409311203
WAIT #3: nam='Disk file operations I/O' ela= 238 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409312218
WAIT #3: nam='kfk: async disk IO' ela= 14 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409313703
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409314449
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315169
WAIT #3: nam='kfk: async disk IO' ela= 5 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315850
WAIT #3: nam='Disk file operations I/O' ela= 42 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409316451
WAIT #3: nam='kfk: async disk IO' ela= 33 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409317903
WAIT #3: nam='kfk: async disk IO' ela= 403 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409320529
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409321950
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409323627
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409325424
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409327121
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409329153
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409330861
WAIT #3: nam='kfk: async disk IO' ela= 8 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409332534
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409334179

What is quite clear to see is that the database is doing some stuff which is instrumented in the wait interface ('Disk file operations I/O'), but the event 'kfk: async disk IO' occurs a lot. It is NOT taking an enormous amount of time (ela < 10 (microseconds) most of the time).

Let's dig a little deeper. Most events are instrumentation for calls to the operating system to do something. Let's take the Linux 'strace' utility, and list the waits together with the system calls. This is done with 'strace -e write=all -p PID'; here is a fragment:

io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
io_getevents(139892671860736, 2, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 122880, 0}}, {0, 0}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 107) = 107
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 35  33 20 63 6f 75 6e 74 3d  ' ela= 5 3 count= |
 | 00030  31 20 69 6e 74 72 3d 30  20 74 69 6d 65 6f 75 74  1 intr=0  timeout |
 | 00040  3d 34 32 39 34 39 36 37  32 39 35 20 6f 62 6a 23  =4294967 295 obj# |
 | 00050  3d 37 33 34 32 36 20 74  69 6d 3d 31 33 35 36 36  =73426 t im=13566 |
 | 00060  32 31 31 31 30 32 35 38  30 31 34                 21110258 014      |
write(9, "\n", 1)                       = 1
 | 00000  0a                                                .                 |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b585420, 0x7f3b4b585420, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 106) = 106
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 36  20 63 6f 75 6e 74 3d 31  ' ela= 6  count=1 |
 | 00030  20 69 6e 74 72 3d 30 20  74 69 6d 65 6f 75 74 3d   intr=0  timeout= |
 | 00040  34 32 39 34 39 36 37 32  39 35 20 6f 62 6a 23 3d  42949672 95 obj#= |
 | 00050  37 33 34 32 36 20 74 69  6d 3d 31 33 35 36 36 32  73426 ti m=135662 |
 | 00060  31 31 31 30 32 35 39 37  35 39                    11102597 59       |
write(9, "\n", 1)                       = 1
 | 00000  0a                                                .                 |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 106) = 106
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 39  20 63 6f 75 6e 74 3d 31  ' ela= 9  count=1 |
 | 00030  20 69 6e 74 72 3d 30 20  74 69 6d 65 6f 75 74 3d   intr=0  timeout= |
 | 00040  34 32 39 34 39 36 37 32  39 35 20 6f 62 6a 23 3d  42949672 95 obj#= |
 | 00050  37 33 34 32 36 20 74 69  6d 3d 31 33 35 36 36 32  73426 ti m=135662 |
 | 00060  31 31 31 30 32 36 32 31  37 38                    11102621 78       |

What is interesting here is that the event 'kfk: async disk IO' is reported after both the io_getevents() and the io_submit() call. (With the io_submit call an IO request or IO requests are dispatched to the operating system; with the io_getevents call the operating system (AIO) completion queue is examined to see if any requests are ready.) So we can conclude that 'System I/O' is not appropriate (by my definition): this is something my foreground process apparently is waiting for. We can also conclude that this event is related to IO. But what does it mean?

In order to dig deeper, we need to know what the database is doing internally. The most obvious way is to read the source. But it's not possible to get the Oracle database source if you're not working for Oracle (and probably have explicitly been granted access). There are a number of ways to see more. I wrote about 'perf', but the gdb (GNU debugger) can also be used.

In order to debug a running Oracle database process, you can attach to the process with the debugger (as root) with 'gdb -p PID'. An equivalent of using strace and looking for io_submit() and io_getevents() calls is:

# gdb -p 14122
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
...
0x0000003f38a0e530 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) 

Now enter the following commands (if you go into gdb yourself, it becomes apparent which are the commands):

(gdb) set pagination off
(gdb) break io_submit
Breakpoint 1 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) break 'io_getevents@plt'
Breakpoint 2 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) continue
Continuing.

And run the full segment scan again:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()

Now we need to know the function call which Oracle uses for registering waits; for this we can use the tracefile generated by SQL trace earlier. The first line in the tracefile of our current session, if we issue the full scan again, will be the ending of the wait 'SQL*Net message from client'. The call which Oracle uses for writing this line is write(). Let's clean up the current breakpoints, and set a breakpoint for the write call (stop the debugger's current action using CTRL-c):

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break write
Breakpoint 3 at 0x3f38a0e4c0: file ../sysdeps/unix/syscall-template.S, line 82. (3 locations)
(gdb) c
Continuing.

(‘c’ is an abbreviation of ‘continue’)
And run the scan again.
Because of the breakpoint set, gdb will break execution when the write() call is issued:

Breakpoint 3, write () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

And generate a backtrace (a list of all the functions which were called to reach the write() call), with the 'backtrace' command ('bt'):

(gdb) bt
#0  write () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000086014dd in sdbgrfuwf_write_file ()
#2  0x0000000008601426 in sdbgrfwf_write_file ()
#3  0x00000000085caafb in dbgtfdFileWrite ()
#4  0x00000000085ca685 in dbgtfdFileAccessCbk ()
#5  0x00000000085c9d6d in dbgtfPutStr ()
#6  0x0000000001a8d93a in dbktWriteTimestamp ()
#7  0x0000000001a8d6b5 in __PGOSF61_dbktWriteNfy ()
#8  0x00000000085caa06 in dbgtfdFileWrite ()
#9  0x00000000085ca685 in dbgtfdFileAccessCbk ()
#10 0x00000000085c9d6d in dbgtfPutStr ()
#11 0x0000000001a8e291 in dbktPri ()
#12 0x00000000009f143e in ksdwrf ()
#13 0x00000000009f1d8f in ksdwrfn ()
#14 0x0000000005abf0e1 in kxstTraceWait ()
#15 0x000000000821e88d in kslwtectx ()
#16 0x00000000083c2a79 in __PGOSF24_opikndf2 ()
#17 0x000000000143b790 in opitsk ()
#18 0x00000000014406da in opiino ()
#19 0x00000000083c54fd in opiodr ()
#20 0x0000000001437b60 in opidrv ()
#21 0x00000000018aac97 in sou2o ()
#22 0x00000000009d3ef1 in opimai_real ()
#23 0x00000000018aff26 in ssthrdmain ()
#24 0x00000000009d3e5d in main ()

What is interesting to see here is the function at number 15, kslwtectx(). We know this write reports the ending of the wait 'SQL*Net message from client'. This function is 'Kernel Service Layer WaiT End ConTeXt' (educated guess; I've had a little help from Tanel Poder to find the Oracle wait interface functions). The function which marks the start of a wait is called 'kslwtbctx', and the function which ends a wait is called 'kslwtectx'. Let's put a break and continue on io_submit, io_getevents, kslwtbctx and kslwtectx:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break 'io_getevents@plt'
Breakpoint 5 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtbctx
Breakpoint 6 at 0x8217fda
(gdb) commands
Type commands for breakpoint(s) 6, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtectx
Breakpoint 7 at 0x821e3d8
(gdb) commands
Type commands for breakpoint(s) 7, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now run the full scan again; this is what is seen in the debugger:

#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()

This goes on, until the IO calls end, and the foreground does some other stuff at the end, ending with the beginning of waiting on ‘SQL*Net message from client':

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()

Okay, back to ‘kfk: async disk IO’. We see the ‘io_getevents()’ call which ALWAYS has a kslwtbctx() call before, and always a kslwtectx() call after the io_getevents() call itself. Please mind we are on 11.2.0.1. This strongly suggests ‘kfk: async disk IO’ is the instrumentation of the io_getevents() call. Or io_getevents() is part of this wait event.

Let’s make the io_getevents() call take longer, and see if that is reflected in the wait, to prove the hypothesis made above. This can be done using the ‘catch’ function of the debugger:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) catch syscall io_getevents
Catchpoint 8 (syscall 'io_getevents' [208])
(gdb) commands
Type commands for breakpoint(s) 8, one per line.
End with a line saying just "end".
>silent
>shell sleep 0.01
>c
>end
(gdb) c
Continuing.

Please mind it's 'io_getevents' here instead of 'io_getevents@plt', because the catch function looks the syscall up in a header file, instead of in the symbol table of the executable.

Now run the full scan again, with SQL trace turned on at level 8 to show the waits. Previously we saw waits with an elapsed time mostly less than 10 microseconds. Here is a snippet from the trace:

PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356626115353145 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=203,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115353116
EXEC #3:c=1000,e=80,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115355176
WAIT #3: nam='SQL*Net message to client' ela= 30 driver id=1413697536 #bytes=1 p3=0 obj#=73426 tim=1356626115355832
WAIT #3: nam='kfk: async disk IO' ela= 26404 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115384177
WAIT #3: nam='kfk: async disk IO' ela= 25050 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115411677
WAIT #3: nam='kfk: async disk IO' ela= 24885 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115437990
WAIT #3: nam='kfk: async disk IO' ela= 25002 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115464244

*** 2012-12-27 17:35:15.492
WAIT #3: nam='kfk: async disk IO' ela= 25192 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115492891
WAIT #3: nam='kfk: async disk IO' ela= 25175 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115521419
WAIT #3: nam='kfk: async disk IO' ela= 24804 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115549124
WAIT #3: nam='kfk: async disk IO' ela= 24980 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115576955
WAIT #3: nam='kfk: async disk IO' ela= 24988 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115604879

A little explanation here: the debugger triggers the syscall catcher on entry to and exit from the syscall, so the two 'shell sleep 0.01' invocations add roughly 20 milliseconds to every io_getevents() call. The fact that the times are somewhat more than 20ms, instead of a few microseconds, gives a fair indication the wait is related. We can also look in the debugger:

(gdb) info break
Num     Type           Disp Enb Address            What
8       catchpoint     keep y                      syscall "io_getevents" 
	catchpoint already hit 174 times
        silent
        shell sleep 0.01
        c

Yes, it has been hit 174 times. So 'kfk: async disk IO' is the instrumentation of the io_getevents() call, or of a larger set of actions of which io_getevents() is a part.

Okay, now that we know that, let’s switch to Oracle version 11.2.0.3…
This is what I find in the tracefile when doing EXACTLY the same (full scan):

PARSING IN CURSOR #140382932318408 len=23 dep=0 uid=84 oct=3 lid=84 tim=1356626776003874 hv=1020534364 ad='7f491298' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #140382932318408:c=0,e=41,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003873
EXEC #140382932318408:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003962
WAIT #140382932318408: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356626776004009
WAIT #140382932318408: nam='direct path read' ela= 340 file number=5 first dba=28418 block cnt=126 obj#=75579 tim=1356626776045733
FETCH #140382932318408:c=164975,e=188719,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356626776192752
STAT #140382932318408 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=20944 pr=20941 pw=0 time=188708 us)'
STAT #140382932318408 id=2 cnt=1000000 pid=1 pos=1 obj=75579 op='TABLE ACCESS FULL T2 (cr=20944 pr=20941 pw=0 time=68726 us cost=5738 siz=0 card=1000000)'

Okay. The first thing to notice is that the 'kfk: async disk IO' events are gone. We see the PARSE line, the SQL*Net message, a single 'direct path read' wait, and then the FETCH line. This is absolutely different behavior from version 11.2.0.1!
What is happening? I think the only way to understand more about this is to run the debugger again with breakpoints on io_submit, io_getevents, kslwtbctx and kslwtectx:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

That's odd! There are no calls to kslwtbctx and kslwtectx during the IO processing! So, despite 'kfk: async disk IO' still being a wait event, it doesn't seem to instrument io_getevents() anymore. What if we make io_getevents() take longer (I've used the same number as the earlier catching of the syscall, 0.01 second):

PARSE #140382932318408:c=0,e=168,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973832834
EXEC #140382932318408:c=0,e=68,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973833385
WAIT #140382932318408: nam='SQL*Net message to client' ela= 671 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356627973835367
WAIT #140382932318408: nam='asynch descriptor resize' ela= 680 outstanding #aio=0 current aio limit=152 new aio limit=182 obj#=75579 tim=1356627973837361

*** 2012-12-27 18:06:16.440
FETCH #140382932318408:c=185972,e=2604507,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356627976440376

No, the time in the FETCH line increases (elapsed: e=2604507, CPU: c=185972; there's 2418535 microseconds of time which is not spent on CPU), but there is only a little time instrumented by wait events. A reasonable conclusion is that the 'kfk: async disk IO' wait event is different in this version (11.2.0.3). I've also tried to slow down the io_submit() call the same way, with the same result: the time increases in the FETCH line, but the database does not write a wait line for it.

What if I slow down the disk? This can be done using cgroups, as described in this post. I've set both my ASM disks to 1 IO per second. That should result in waits! This is the resulting trace output:

PARSE #140382932318408:c=0,e=20,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964787836
EXEC #140382932318408:c=0,e=64,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964788256
WAIT #140382932318408: nam='SQL*Net message to client' ela= 703 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356633964790122
WAIT #140382932318408: nam='asynch descriptor resize' ela= 699 outstanding #aio=0 current aio limit=238 new aio limit=264 obj#=75579 tim=1356633964792014
WAIT #140382932318408: nam='direct path read' ela= 497938 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1356633965295483

*** 2012-12-27 19:46:05.795
WAIT #140382932318408: nam='direct path read' ela= 495498 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1356633965795382

*** 2012-12-27 19:46:06.295
WAIT #140382932318408: nam='direct path read' ela= 495890 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1356633966295208
WAIT #140382932318408: nam='direct path read' ela= 495889 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1356633966795127

So, if I slow the IO down, I get ‘direct path read’ wait events. How does that look when I use the break/continue technique on this configuration?

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

The interesting part is the io_getevents() calls: there are consistently 4 calls before kslwtbctx is called, and another one or two calls are done before kslwtectx is called and wait time is registered (see above). Interesting, but no 'kfk: async disk IO'.

Now let’s look at the same output when Oracle 11.2.0.1 is throttled:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()

This looks remarkably the same as the 11.2.0.3 version!

Conclusion

It seems the database issues a number of io_getevents() calls after asynchronous IO calls are submitted with io_submit(), which return immediately (also known as 'non-blocking'). They just peek at the completion queue; if an IO is available in the completion queue, the process processes that IO. The number of non-blocking calls is 3 with version 11.2.0.1 and 4 with 11.2.0.3 (please mind the 'slot' mechanism which controls the number of simultaneous AIO requests is probably in play here too). If these calls do not return any IO request, the wait interface is called (kslwtbctx()), and a blocking io_getevents() call is issued (which waits for at least 1 request to be returned), for which the wait time is visible as the event 'direct path read'.
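
A way to cross-check this pattern without a debugger is to trace only the AIO related system calls of the foreground process. This is a sketch, not something used in the tests above; the PID is a placeholder, and please mind the libaio wrapper can satisfy some of the non-blocking io_getevents() calls from the completion ring in userspace, so not every io_getevents@plt call necessarily shows up as a system call:

$ strace -T -e trace=io_submit,io_getevents -p <pid of the oracle foreground process>

The -T flag prints the time spent in every system call: the non-blocking io_getevents() calls should show near-zero times, while the blocking call issued under 'direct path read' takes visibly longer.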

It also seems the first non-blocking io_getevents() call is instrumented by the ‘kfk: async disk IO’ event with Oracle 11.2.0.1, for reasons unknown to me. This instrumentation is not present in version 11.2.0.3.

Why?

I guess the first question which comes to mind when reading this title is 'Why?'. For a database, but I guess for any IO dependent application, we want IO's to be faster, not throttle them, in other words make them slower. Well, the 'why' is: if you want to investigate IO's, you sometimes want them to slow down, so it's easier to see them. Also, (not so) recent improvements in the Oracle database made great progress in being able to use the available bandwidth by doing IO in parallel, which could strip away much of the ability to see them in Oracle's popular SQL trace.

Virtualisation

I use VMWare Fusion on my MacBook and use Linux in the VM's to run the Oracle database. Desktop virtualisation like VMWare Fusion (and Virtualbox and VMWare Workstation; I think all desktop virtualisation products) uses the operating system's IO subsystem. This introduces a funny effect: if you stress the IO subsystem in the VM and measure throughput, it looks like the disk or disks are getting faster and faster every run. The reason for this effect is that the blocks in the file, which is the disk from the perspective of the VM, are getting touched (read and/or written) more and more, and thus are increasingly better candidates for caching from the perspective of the underlying operating system.

I think that if you combine the ‘disk getting faster’ effect with the need to investigate IO’s, you understand that it can be beneficial to throttle IO’s in certain cases.

Cgroups

The mechanism which can be used to control and throttle resources is 'cgroups'. Cgroups, which is an abbreviation of 'control groups', is a Linux kernel feature whose function is to limit, account and isolate resource usage (see this wikipedia article). Cgroups have been part of the Linux kernel since version 2.6.24. This means there are no cgroups in Redhat and Oracle Linux version 5, but there are in version 6.

The idea behind cgroups is to have control over resources in a system, which becomes more and more important with today’s systems getting bigger. Cgroups have been created to function from single processes to complete (virtualised) systems.

Simple setup and usage

(please mind all commands are either executed as root (indicated by ‘#’), or as a regular user (oracle in my case, indicated by ‘$’))

First, we need to make sure the ‘blkio’ controller is available:

# grep blkio /proc/mounts || { mkdir -p /cgroup/blkio ; mount -t cgroup -o blkio none /cgroup/blkio ; }

Next, we create a cgroup called ‘iothrottle':

# cgcreate -g blkio:/iothrottle

In order to throttle IO on the device, we need to find the major and minor number of the block device. If you use ASM, you can list the PATH field in the V$ASM_DISK view, and generate a long listing of it on linux:

$ ls -ls /dev/oracleasm/disk1 
0 brw-rw----. 1 oracle dba 8, 16 Dec 15 13:22 /dev/oracleasm/disk1

This shows that the major and minor number of the block device are: 8 16.

The next step is to use the 'read_iops_device' configuration option of the blkio controller to apply throttling to the 'iothrottle' cgroup. The 'read_iops_device' configuration option uses the following format: major_number:minor_number nr_IO_per_second (major:minor, space, maximum number of read IO's per second).

# cgset -r blkio.throttle.read_iops_device="8:16 10" iothrottle

Okay, we now have a cgroup called 'iothrottle' set up, and we have used the 'read_iops_device' option of the 'blkio' controller. Please mind there are no processes assigned to the cgroup yet. The next steps are to use an IO generation and measurement tool to first measure uncapped IO performance, then assign the process to the 'iothrottle' cgroup, and rerun the performance measurement.

For the IO tests I use 'fio'. This tool gives you the opportunity to investigate the performance of your system's IO subsystem and IO devices. This is my fio.run file:

$ cat fio.run 
[global]
ioengine=libaio
bs=8k
direct=1
numjobs=4
rw=read

[simple]
size=1g
filename=/dev/oracleasm/disk1

Now run it! Please mind I’ve snipped a large part of the output, because fio gives a great deal of output, which is extremely interesting, but not really relevant to this blog:

$ fio --section=simple fio.run 
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
fio 1.57
Starting 4 processes
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
Jobs: 2 (f=2): [_R_R] [100.0% done] [123.3M/0K /s] [15.4K/0  iops] [eta 00m:00s]
...

So, we did an average of 15.4K read IOPS. Now let's put the process which runs fio in the 'iothrottle' cgroup!
Get the PID of the shell from which we run 'fio' (processes started from this shell inherit its cgroup):

$ echo $$
5994

And assign the ‘iothrottle’ cgroup to it:

# echo 5994 > /cgroup/blkio/iothrottle/tasks

You can see if your process is assigned a cgroup by reading the ‘cgroup’ file in ‘proc':

$ cat /proc/self/cgroup 
1:blkio:/iothrottle

Okay, we are assigned the ‘iothrottle’ cgroup! Now rerun the ‘simple’ fio benchmark:

$ fio --section=simple fio.run 
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
fio 1.57
Starting 4 processes
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
Jobs: 4 (f=4): [RRRR] [0.3% done] [81K/0K /s] [9 /0  iops] [eta 14h:37m:42s]

To be honest, I cancelled this fio run after a little while, because the time it takes to run is very long (approximately 14 hours and 30 minutes, as can be seen above).
I think this example shows the cgroup 'iothrottle' in action very clearly!

Cgroup usage in reality

I can't imagine anybody wants to echo all the process IDs into the cgroup's 'tasks' file in order to get these processes in a certain cgroup. With my DBA background, I would love to have control over an entire (all processes belonging to a) database. Setting up cgroups as done above (manually) means you have to set it up every time the server reboots. Luckily, there is a way to automate cgroup creation and assignment!

cgconfig service
In order to create cgroups, there is a service called ‘cgconfig’, which reads the file /etc/cgconfig.conf. In order to get the ‘iothrottle’ cgroup, and the disk throttling configuration created automatically, use this configuration:

mount {
	blkio = /cgroup/blkio;
}

group iothrottle {
	blkio {
		blkio.throttle.read_iops_device="8:16 10";
	}
}

In order to use this configuration, restart the cgconfig service using 'service cgconfig restart'. Optionally, you can enable automatic starting of this service on startup using 'chkconfig --level 2345 cgconfig on' (optionally check when this service is started with 'chkconfig --list cgconfig'). Now the cgroup is created.
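
To verify the group and the throttle setting have really been created by the cgconfig service, something like this can be used (a sketch; lscgroup is part of the same libcgroup package that provides cgcreate and cgset, and the value shown is the one configured above):

# lscgroup | grep iothrottle
blkio:/iothrottle
# cat /cgroup/blkio/iothrottle/blkio.throttle.read_iops_device
8:16 10

But how do we assign processes to it?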

cgred service
This is what the cgred service is for. This daemon uses a simple configuration file: /etc/cgrules.conf. Once configured and active, this service assigns cgroups to users, groups or processes. For the purpose of limiting IO from an Oracle database, I created this simple line:

oracle			blkio			/iothrottle

Now the cgred service can be started using 'service cgred restart'. Optionally, you can enable automatic starting of this service using 'chkconfig --level 2345 cgred on'.
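
To see the rule actually taking effect, the cgroup of one of the database processes can be checked after the database has been (re)started. This is a sketch; the pgrep expression is just one way to find a background process:

$ cat /proc/$(pgrep -u oracle -f ora_pmon)/cgroup
1:blkio:/iothrottle

Please mind cgred classifies processes as they are started; processes that were already running before the rule was put in place can be moved by hand with cgclassify (for example 'cgclassify -g blkio:/iothrottle <pid>') or by restarting the database.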

Summary

The purpose of this blogpost was to introduce cgroups, and let you understand why I chose the IO throttling functionality. Next it showed how to set up cgroups manually, and a simple test to prove it works, with enough information to let you repeat the test for yourself. The last part showed how to automate cgroup creation and assignment.

A word of caution is in order. Cgroups is a fairly new feature at the time of writing, which means it could break or not work as expected. So use at your own risk! In my limited tests it worked like a charm.

Recently I patched an 11.2.0.2 grid infrastructure to a higher version. After the patching I started the grid infrastructure on that host, and ASM was unable to start. Looking in the alert.log file of the ASM instance it turned out that upon starting ASM, even before the contents of the pfile/spfile were displayed, the ASM instance crashed with an ORA-00600 error:

Sat Oct 13 14:35:07 2012
NOTE: No asm libraries found in the system
* instance_number obtained from CSS = 1, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 1 
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 30960 free: 30960)
DFLT Huge Pages allocation successful (allocated: 67107111)
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
CELL communication is configured to use the following interface(s) for this instance
    1.1.1.1 
CELL interconnect IPC version: Oracle RDS/IP (generic)
IPC Vendor 1 Protocol 3
  Version 4.1
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on. 
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support  enabled
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_20405.trc  (incident=66599):
ORA-00600: internal error code, arguments: [kmgs_component_init_3], [60], [65], [17], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_66599/+ASM1_ora_20405_i66599.trc
Sweep [inc][66599]: completed
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

A quick google search didn't result in any hits, nor did searching the MOS knowledge base and bug database.

I decided to start the ASM instance manually, and got an interesting message on screen just prior to the instance crashing:

ORA-32004: obsolete or deprecated parameter(s) specified for ASM instance

Upon investigation of the contents of the spfile, it turned out the parameter 'remote_os_authent' was set for the ASM instance. Removing this parameter returned normal, expected behaviour; in other words: the ASM instance did not crash.

I tried to replay/redo this failure in a VM on my laptop (using Oracle Grid Infrastructure 11.2.0.3 in non-clustered mode), but was UNABLE to crash ASM by setting 'remote_os_authent' to either true or false. So it could either be a 'cluster feature' or a feature specific to version 11.2.0.2.

The reason for this blogpost is to have a little bit of documentation on the OERI [kmgs_component_init_3], so if anybody encounters this problem, there is at least some clue to be found, or even a resolution.

PS: kmgs_component_init is in the VOS layer (the operating system dependent layer).

When you use Oracle ASM (Automatic Storage Management) for your database, the permissions on the block devices on the operating system layer which are used by ASM need to be changed. To be more precise, the owner and group need to be set to ‘oracle’ and ‘dba’ (Oracle documentation) in my case.

I used to do this in a very lazy way, using a simple '/bin/chown oracle.dba /dev/sdb' in /etc/rc.local. This worked for me with RHEL/OL version 5. This has changed with RHEL/OL 6, because the System V startup system has been replaced by 'upstart'. Also, in OL6 the disk devices change ownership back if you set it by hand to oracle.dba.

Oracle has its own solution for generating block device files especially for ASM, called ASMLib. This generally works well, but what exactly happens to generate the devices, and how to troubleshoot the generation of the devices, is (as far as I know) not documented nor generally known. I've been bitten a few times on my test systems, which is the reason I don't like ASMLib and do not feel comfortable with it.

The Linux system has functionality which equals the block device generation of ASMLib, called udev. Udev isn't exactly over-documented. Also, there seems to be a difference between the tools (scsi_id and rule file syntax) in RHEL/OL 5 and RHEL/OL 6. But there is much written about it. And it's very simple, once you know what to do.

I have a separate (virtual) disk which I want to use with ASM. After startup, the block device of this disk is called /dev/sdb. In order to let udev identify this disk, the Linux utility 'scsi_id' can be used to fetch the unique SCSI identifier of the disk. This is of importance, because the correct disk must be specified to be used with ASM. This sounds a bit dumb (of course we need the correct disk), but Linux names the devices in the order in which they are seen by the kernel. This means it's possible that /dev/sdb is another disk if someone decides to add a disk to the system and reboots the server. (Please mind most servers do not change very much, nor have I seen this happening.)

And there's another complication: VMWare doesn't provide a unique device number by default. In order to let VMWare provide Linux with a unique number, 'disk.EnableUUID = "TRUE"' must be added to the virtual machine configuration file ("vmx file"). After a reboot, 'scsi_id' is able to provide a serial!

So at this point, you should have a disk or disks to be used with ASM. First grab the unique serial of the disk (this is the RHEL6/OL6 syntax of the scsi_id command!):

# scsi_id -g -u -d /dev/sdb
36000c29127c3ae0670242b058e863393

Next, we need to make a udev rule file. In this file, we specify this SCSI id (which we fetched with scsi_id), and specify how we want that device created:

# cd /etc/udev/rules.d/
# cat 99-oracle-asmdevices.rules
KERNEL=="sd*", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="36000c29127c3ae0670242b058e863393", NAME+="oracleasm/disk1", OWNER="oracle", GROUP="dba", MODE="0660"

(There is no ’99-oracle-asmdevices.rules’ file by default, you need to create it yourself)

That's all. When udev finds a "sd*" device in subsystem "block" with the serial "36000c29127c3ae0670242b058e863393", it creates the device file '/dev/oracleasm/disk1', owned by oracle, etc.
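
To make udev pick up the new rule without rebooting, the rules can be reloaded and the block devices re-triggered. This is a sketch using the RHEL6/OL6 udevadm syntax; 'udevadm test' only simulates the rule processing, which makes it handy to see whether the rule matches:

# udevadm control --reload-rules
# udevadm trigger --subsystem-match=block
# udevadm test /sys/block/sdb 2>&1 | grep oracleasm
# ls -l /dev/oracleasm/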

This is also convenient because the disk string of ASM is very easy to set now: ‘/dev/oracleasm/*’.

Please mind that if the block devices have the correct permissions, it doesn't matter to ASM if the device file names are swapped. The ASM instance scans the device headers for its contents.
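
If you want to check which ASM disk is behind a device name, the header can be dumped with kfed. This is a sketch, assuming kfed is available in the grid infrastructure home; the kfdhdb.dskname and kfdhdb.grpname fields in the output show the ASM disk and diskgroup name:

$ kfed read /dev/oracleasm/disk1 | grep -E 'dskname|grpname'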

(Version edited after comments -> rman backup as copy)
(Version edited to include delete leftover datafile from rman)

Recently I was asked to rename a tablespace. The environment was Oracle version 11.2.0.3 (both database and clusterware/ASM).

This is the test case I built to understand how that works:
(I couldn't find a clean, straightforward description of how to do that, which is why I blog it here)

I created an empty tablespace 'test1' for test purposes:

SYS@v11203 AS SYSDBA> create bigfile tablespace test1 datafile size 10m;

(I use bigfile tablespaces only with ASM. Adding datafiles is such labour-intensive work; bigfile tablespaces eliminate that, when autoextend is correctly set)

A tablespace can be easily renamed with the alter tablespace rename command:

SYS@v11203 AS SYSDBA> alter tablespace test1 rename to test2;

This changes the Oracle data dictionary to reflect a new name. This doesn’t touch the underlying datafile:

SYS@v11203 AS SYSDBA> select * from dba_data_files;

FILE_NAME
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   FILE_ID TABLESPACE_NAME		       BYTES	 BLOCKS STATUS	  RELATIVE_FNO AUT   MAXBYTES  MAXBLOCKS INCREMENT_BY USER_BYTES USER_BLOCKS ONLINE_
---------- ------------------------------ ---------- ---------- --------- ------------ --- ---------- ---------- ------------ ---------- ----------- -------
+DATA/v11203/datafile/system.260.785186841
	 1 SYSTEM			   734003200	  89600 AVAILABLE	     1 YES 3.4360E+10	 4194302	 1280  732954624       89472 SYSTEM

+DATA/v11203/datafile/sysaux.261.785186847
	 2 SYSAUX			   629145600	  76800 AVAILABLE	     2 YES 3.4360E+10	 4194302	 1280  628097024       76672 ONLINE

+DATA/v11203/datafile/undotbs1.262.785186849
	 3 UNDOTBS1			   477102080	  58240 AVAILABLE	     3 YES 3.4360E+10	 4194302	  640  476053504       58112 ONLINE

+DATA/v11203/datafile/users.264.785186857
	 4 USERS			     5242880	    640 AVAILABLE	     4 YES 3.4360E+10	 4194302	  160	 4194304	 512 ONLINE

+DATA/v11203/datafile/ts.266.785323903
	 5 TS				   419430400	  51200 AVAILABLE	  1024 YES 3.5184E+13 4294967293	12800  418381824       51072 ONLINE

+DATA/v11203/datafile/iops.267.785326883
	 6 IOPS 			  5452595200	 665600 AVAILABLE	  1024 YES 3.5184E+13 4294967293	12800 5451546624      665472 ONLINE

+DATA/v11203/datafile/test1.268.789380535
	 7 TEST2			    10485760	   1280 AVAILABLE	  1024 NO	    0	       0	    0	 9437184	1152 ONLINE

To rename the datafile in ASM, offline the tablespace, copy the datafile using RMAN, rename the datafile in the Oracle data dictionary, and online the tablespace again:

Offline the tablespace:

SYS@v11203 AS SYSDBA> alter tablespace test2 offline;

Copy the datafile using RMAN:

RMAN> copy datafile '+DATA/v11203/datafile/test1.268.789380535' to '+DATA';

Starting backup at 23-JUL-12
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=28 device type=DISK
channel ORA_DISK_1: starting datafile copy
input datafile file number=00007 name=+DATA/v11203/datafile/test1.268.789380535
output file name=+DATA/v11203/datafile/test2.269.789380645 tag=TAG20120723T082404 RECID=1 STAMP=789380644
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:01
Finished backup at 23-JUL-12

Rename the datafile in the Oracle data dictionary:

SYS@v11203 AS SYSDBA> alter database rename file '+DATA/v11203/datafile/test1.268.789380535' to '+DATA/v11203/datafile/test2.269.789380645';

Please mind the 'old' filename is at 'input datafile', and the 'new' filename is at 'output file name' in the RMAN output.

Next, and finally: online the tablespace:

SYS@v11203 AS SYSDBA> alter tablespace test2 online;

(the old datafile is gone)

Update:
The RMAN copy command and data dictionary update could also be done with RMAN backup as copy and switch datafile:

Status of the database after renaming:

SYS@v11203 AS SYSDBA> select file_id, file_name, tablespace_name from dba_data_files;

   FILE_ID FILE_NAME					      TABLESPACE_NAME
---------- -------------------------------------------------- ------------------------------
	 1 +DATA/v11203/datafile/system.260.785186841	      SYSTEM
	 2 +DATA/v11203/datafile/sysaux.261.785186847	      SYSAUX
	 3 +DATA/v11203/datafile/undotbs1.262.785186849       UNDOTBS1
	 4 +DATA/v11203/datafile/users.264.785186857	      USERS
	 5 +DATA/v11203/datafile/ts.266.785323903	      TS
	 6 +DATA/v11203/datafile/iops.267.785326883	      IOPS
	 7 +DATA/v11203/datafile/test1.269.789411511	      TEST2

Offline the tablespace (my database is in NOARCHIVELOG mode; an online backup (as copy) can only be done in ARCHIVELOG mode, and when done that way, the datafile needs recovery afterwards):

RMAN> sql "alter tablespace test2 offline";

sql statement: alter tablespace test2 offline

Back up the datafile as a copy. It is very convenient to use the file number (file_id); this makes it much simpler:

RMAN> backup as copy datafile 7 format '+DATA';

Starting backup at 23-JUL-12
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile copy
input datafile file number=00007 name=+DATA/v11203/datafile/test1.269.789411511
output file name=+DATA/v11203/datafile/test2.268.789411665 tag=TAG20120723T170105 RECID=2 STAMP=789411665
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:01
Finished backup at 23-JUL-12

Now switch to the backup copy:

RMAN> switch datafile 7 to copy;

datafile 7 switched to datafile copy "+DATA/v11203/datafile/test2.268.789411665"

And online the tablespace again:

RMAN> sql "alter tablespace test2 online";

sql statement: alter tablespace test2 online

Please mind this leaves the old datafile in place, so it needs to be removed explicitly:

RMAN> delete noprompt copy of datafile 7;

released channel: ORA_DISK_1
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=44 device type=DISK
List of Datafile Copies
=======================

Key     File S Completion Time Ckp SCN    Ckp Time       
------- ---- - --------------- ---------- ---------------
5       7    A 23-JUL-12       1779771    23-JUL-12      
        Name: +DATA/v11203/datafile/test1.268.789415137

deleted datafile copy
datafile copy file name=+DATA/v11203/datafile/test1.268.789415137 RECID=5 STAMP=789415216
Deleted 1 objects

An alternative is to do this with asmcmd, but it’s far less elegant:

$ asmcmd rm -f +DATA/v11203/datafile/test1.269.789411511

Some of you might have experimented with, or used, Oprofile, ltrace/strace or Systemtap, and wished you could look further into the userland process, like dtrace does, because dtrace allows you to profile and look into a process's function calls.

If you've used oprofile/strace/systemtap/etc., you probably found out you can see all the kernel functions which are processed, but these tools do not get into userspace, or only with very limited information. Perhaps the only tool which is different is gdb, which enables you to see userspace function names, but gdb is not a profiler, it's a debugger. And it works best with (I assume it is made for) debugging symbols in the executable. Most (all, I presume) commercial executables, like the Oracle database executable, do not have debugging symbols.

Some time ago, a tweet from Kevin Closson caught my attention, which showed a profile of a Linux system running Oracle, which showed the functions inside Oracle. This totally shocked me! That is something I want to be able to do on Linux too!!

It was done using the perf utility. The perf utility is delivered with the kernel (rpm) package on EL Linuxes, and since I am using Oracle Linux, I can tell you it's delivered with the 'UEK' kernels (the 2.6.32+ ones). I've eventually used the latest version of the UEK kernel (2.6.39-200.24.1.el6uek.x86_64), which you get when you subscribe to the ol6_UEK_latest yum channel. Newer versions of the kernel have more perf functions.

The 'perf top' sub-utility shows the kernel and userland functions which are active (by sampling active processes). It can be used in a variety of ways; some of them are: system-wide, per processor or per process. This allows you to look beyond the Oracle wait interface. As Tanel Poder often states in his performance classes: the Oracle wait interface is a good starting point for performance investigation, but sometimes you need to look deeper. You always need to use the appropriate layer to look at when investigating issues.

Note when using virtualization: virtualization does not allow direct access to the performance monitoring hardware. So if you try to use perf on that (or oprofile for that matter), you get no output. You can get around that with perf by using a software event. I use the software event cpu-clock. This means I invoke perf with the option '-e cpu-clock'.
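
A quick sketch of what that looks like on the command line (the process ID is a placeholder; 'perf list' shows which events are available, and on a VM the hardware events are typically missing while the software events are not):

$ perf list | grep -i clock
$ perf top -e cpu-clock
$ perf top -e cpu-clock -p <pid>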

For example: this statement uses a lot of CPU, and might (depending on Oracle version, operating system type and version) never return an answer. But it is perfect as a statement that consumes approximately 1 CPU and keeps doing that for some time:

select count(*) from dba_extents, dba_extents, dba_extents

A quick glimpse at the statistics profile using sql_monitor reveals:

Global Stats
=========================================
| Elapsed |   Cpu   |  Other   | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets  |
=========================================
|     109 |     103 |     6.23 |   9755 |
=========================================

Of course I could look in the sql monitoring plan details, which would reveal what row sources are active at the moment, but let's see what 'perf top -e cpu-clock -p 12727' shows:

   PerfTop:     108 irqs/sec  kernel: 0.9%  exact:  0.0% [1000Hz cpu-clock],  (target_pid: 12727)
--------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ ____________________________________________________

             1414.00 58.8% kxhrHash                  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              389.00 16.2% qerhjSplitProbe           /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              263.00 10.9% rworofprVeryFastUnpackRow /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              122.00  5.1% smbget                    /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               93.00  3.9% sorgetqbf                 /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               86.00  3.6% qersoFetch                /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               31.00  1.3% qerhjInnerProbeHashTable  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

This shows that 58.8% of the samples taken of this process are in the function 'kxhrHash'. So it's the hashing which this process is busy with. Another important and interesting aspect is: there is very little time spent in kernel mode (kernel: 0.9%). But…the information about where the time is spent could also have been read in the sql monitoring report: it reports CPU and waits per row source.

Where perf fulfills something which could not be done differently is when you have multiple processes and cannot set up perf for every distinct process, or when you want to understand system-wide performance. To mimic a system doing OLTP, I've set up Kevin Closson's SLOB in PIO mode, and use perf top -e cpu-clock without specifying a process ID, so I measure the entire system. I get the following profile:

  PerfTop:     984 irqs/sec  kernel:92.4%  exact:  0.0% [1000Hz cpu-clock],  (all, 1 CPU)
--------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                           DSO
             _______ _____ __________________________________ ______________________________________________________________________

            13102.00 88.3% _raw_spin_unlock_irqrestore        [kernel.kallsyms]                                                     
              772.00  5.2% copy_page                          [kernel.kallsyms]                                                     
              298.00  2.0% __do_softirq                       [kernel.kallsyms]                                                     
              114.00  0.8% scsi_request_fn                    [kernel.kallsyms]                                                     
               29.00  0.2% kcbgtcr                            /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle                  
               25.00  0.2% kmem_cache_alloc                   [kernel.kallsyms]                                                     
               25.00  0.2% finish_task_switch                 [kernel.kallsyms]                                                     
               22.00  0.1% kcbzib                             /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle                  
               17.00  0.1% sxorchk                            /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

Conclusion: the majority of the processes spend time in kernel calls (kernel.kallsyms). This is not too surprising: I've set up SLOB in 'PIO mode', which means I've reduced the Oracle database buffer cache as much as possible. With Oracle version 11.2.0.3, the smallest buffer cache I can create is 4MB. Next I've let 10 readers do index range scans (that is one of the functions of SLOB: swamping your system with single block (8KB) IO's). The majority of the processes are spending time in the kernel function _raw_spin_unlock_irqrestore. This is what a system which is swamped with physical IO looks like with perf.

This all changes when I revert SLOB to LIO mode, and issue the same test (SLOB run with 10 readers):

   PerfTop:    1004 irqs/sec  kernel: 3.1%  exact:  0.0% [1000Hz cpu-clock],  (all, 1 CPU)
---------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________________________________________

              478.00 23.9% kcbgtcr                     /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              154.00  7.7% qetlbr                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              149.00  7.4% kcbz_fp_buf                 /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              133.00  6.6% kdsgrp                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
              104.00  5.2% kafger                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               94.00  4.7% kcbrls                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               60.00  3.0% ktrgcm                      /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle
               60.00  3.0% ktrget3                     /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle

Ah, we now spend only 3.1% in kernel mode, and the majority of the time is spent in kcbgtcr (quite probably meaning 'kernel cache buffer get consistent read')! I've modified the buffer cache to be able to contain all blocks (also known as SLOB 'LIO mode'), so once they are read, only a logical IO is needed to fetch the block.
