Archive

Oracle

This is yet another blogpost on Oracle’s direct path read feature which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table is already in the buffercache.
- Less than 25% of the blocks in the buffercache are directy.

Also (thanks to Freek d’Hooge who pointed me to an article from Tanel Poder) you can change the optimizer statistics to change the segment size for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the “code path”, so during execution.

So let’s take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM)

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we take small table threshold times and multiply it by five, we get 5055. This means that the size of table T2 is more than enough so should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I’ve created a relatively big table and created a (normal) index on it in the same database. The index is created on a single column, called ‘id’. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan is a scan which just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which reads in the order of allocated adjacent blocks).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation	      | Name	   | Rows  | Bytes | Cost (%CPU)| Time	   |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	   |	 1 |	13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       | 	   |	 1 |	13 |		|	   |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |	34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

If we look at number of small table threshold times five (5055), this index is much bigger than that. Also, this is bigger than table T2. Let’s execute select count(id) from bigtable, and look at the statistic ‘index fast full scans (direct read)’:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn’t been a direct path read! This means that this read has been done in the “traditional way”. This is a bit…counter intuitive. I’ve traced the session, and indeed it’s doing the traditional multiblock reads via the scattered read waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter “_very_large_object_threshold”. The information found on the internet reports this value is in megabytes. A quick stroll through a number of different database (all on 11.2.0.3) shows this parameter is quite probably statically set at “500″.

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.

This interesting, because it looks like this parameter does the same for full scans on index segments as “_small_table_threshold” does for full scans on table segments: the size of the segment to be scanned needs to be bigger than five times.

There are also differences: small table threshold is set in blocks, (apparently) very large object threshold is set in megabytes. Also, small table threshold is set by default at 2% of the size of the buffercache (so it scales up with bigger caches), very large object threshold seems to be fixed at 500. If my finding is correct, then it means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It’s unknown to me if the 50% limit for blocks in the cache and the 25% limit for dirty blocks is subject to this too.

When an Oracle process starts executing a query and needs to do a full segment scan, it needs to make a decision if it’s going to use ‘blockmode’, which is the normal way of working on non-Exadata Oracle databases, where blocks are read from disk and processed by the Oracle foreground process, either “cached” (read from disk and put in the database buffercache) or “direct” (read from disk and put in the process’ PGA), or ‘offloaded mode’, where part of the execution is done by the cell server.

The code layer where the Oracle database process initiates the offloading is ‘kcfis’; an educated guess is Kernel Cache File Intelligent Storage. Does a “normal” alias non-Exadata database ever use the ‘kcfis’ layer? My first guess would be ‘no’, but we all know guessing takes you nowhere (right?). Let’s see if a “normal” database uses the ‘kcfis’ functions on a Linux x64 (OL 6.3) system with Oracle 11.2.0.3 64 bit using ASM.

The only way to profile kernel functions that I am aware of is using ‘gdb’ and breaking on functions in the Oracle executable:
(the process id shown below ought to be the process id of an oracle database process id you are going to execute in. Do not, I repeat: not do this with other processes, especially the ones that do important tasks!)

# gdb -p 42
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) rbreak ^kcfis.*
Breakpoint 1 at 0x2204094
<function, no debug info> kcfis_get_sched_delay;
Breakpoint 2 at 0x220410a
<function, no debug info> kcfis_capability_tab_get;
Breakpoint 3 at 0x2204150
<function, no debug info> kcfis_can_session_migrate;
Breakpoint 4 at 0x2204190
<function, no debug info> kcfis_fob_storage_attr;
Breakpoint 5 at 0x22041d0
<function, no debug info> kcfis_init_resource_limit;
Breakpoint 6 at 0x22041f2
<function, no debug info> kcfis_tablespace_is_on_sage;
...
(gdb) c
Continuing.

Okay, we got the debugger set, now let’s execute a simple query (doing a full scan) to see if kcfis is touched on a “normal” server or not!

TS@v11203 > select count(*) from t2;

I am on OS-X (Apple) using iTerm, and I see the tab of the gdb session turning red: gdb generated some output on the screen!

(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) 

So, we hit a kcfis function! Let me put upfront that I do not have sourcecode of any kind, so my statements about what a function does are actually guesses. Anyway: from the name of the function (kcfis_tablespace_is_on_sage) it looks like a function in the kcfis layer which determines if a tablespace is on an Exadata storage server (exadata’s internal name is/was ‘sage’). Nice. so we hit the layer for determination if the tablespace is on Exadata.

Let’s continue the profiling with gdb:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000220410a in kcfis_capability_tab_get ()
(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) c
Continuing.

We see another call to the function ‘kcfis_tablespace_is_on_sage’, and a call to ‘kcfis_capability_tab_get’. The last function probably tries to probe the table (but could be tablespace) to get the capabilities. This could be the function which checks the requirements for hybrid columnar compression, I am not sure about that.

At what point during the processing of the full segment scan does the kcfis_tablespace_is_on_sage occur? One way of investigating this, is profiling some functions we know a (direct path) full scan does, and see where the kcfis_tablespace_is_on_sage kicks in. When the buffer cache is flushed prior to executing a full scan, and the SQL is made unique, so it has to be parsed, the following sequence of events happens:

- Parse
- Execute
- A ‘SQL*Net message to client’ wait
- A ‘db file sequential read’ wait (for reading the segment header)
- Potentially a ‘asynch descriptor resize’ wait
- The full scan is done asynchronously (potentially revealing some ‘direct path read’ waits)

So if we profile on start and end of a wait (kslwtbctx and kslwtectx), a single block read (pread64), AIO (io_submit and io_getevents_0_4) and of course kcfis_tablespace_is_on_sage, we should be able to see that:

Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, pread64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 6, io_getevents_0_4 (ctx=0x7fb42f475000, min_nr=2, nr=128, events=0x7fffb4c66768, timeout=0x7fffb4c67770) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

So what do we see here?
Row 1 : Here the wait ending for ‘SQL*Net message from client’.
Row 2-3 : This is the ‘SQL*Net message to client’ wait.
Row 5-6-7 : This is begin wait, pread64 for reading the segment header and the end wait.
Row 8&9 : The session probes for Exadata.
Row 10-11 : Start and end of a wait, ‘asynch descriptor resize’ (verified by trace file).
Row 12- : Here the full scan takes off.

So…a query starts (I’ve made it unique by adding a dummy hint, so it’s parsed again) the SQL*Net roundtrip occurs, the segment header is read, then the process looks if it’s on Exadata, which it isn’t here, and starts an asynchronous full scan.

So let’s repeat this check on a database which is on Exadata! In order to do so, we need to be aware several calls are not done on exadata: pread64, io_submit and io_getevents_0_4, because we do not do local IO, but issue them over infiniband. So I breaked on kslwtbctx, kslwtectx, kcfis_tablespace_is_on_sage. This is how that looks like:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000905cf62 in kslwtectx ()
Breakpoint 3, 0x0000000002230366 in kcfis_tablespace_is_on_sage ()

Mmmh, this is different. The ‘SQL*Net message from client’ wait ends, and before any other wait occurs, the existence of exadata is checked. This is different from the non-exadata case. Let’s take a look at the backtrace of the break on kcfis_tablespace_is_on_sage:

(gdb) bt
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#3  0x0000000001a6111d in kkmfcblo ()
#4  0x000000000922f76d in kkmpfcbk ()
#5  0x000000000942e538 in qcsprfro ()
#6  0x000000000942de29 in qcsprfro_tree ()
#7  0x000000000942de6e in qcsprfro_tree ()
#8  0x0000000002dd80c5 in qcspafq ()
#9  0x0000000002dd51d9 in qcspqbDescendents ()
#10 0x0000000002dd91e4 in qcspqb ()
#11 0x0000000001a6b2be in kkmdrv ()
#12 0x0000000002584c76 in opiSem ()
#13 0x000000000258ac8b in opiDeferredSem ()
#14 0x000000000257dc32 in opitca ()
#15 0x0000000001ec3d7d in kksFullTypeCheck ()
#16 0x00000000092a7256 in rpiswu2 ()
#17 0x0000000001eca977 in kksLoadChild ()
#18 0x0000000009298448 in kxsGetRuntimeLock ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
#23 0x0000000001b82a46 in kpooprx ()
#24 0x0000000001b80d2c in kpoal8 ()
#25 0x00000000091fb8b8 in opiodr ()
#26 0x000000000939e696 in ttcpip ()
#27 0x000000000180f011 in opitsk ()
#28 0x0000000001813c0a in opiino ()
#29 0x00000000091fb8b8 in opiodr ()
#30 0x000000000180af4c in opidrv ()
#31 0x0000000001e0a77b in sou2o ()
#32 0x0000000000a0cc05 in opimai_real ()
#33 0x0000000001e106ec in ssthrdmain ()
#34 0x0000000000a0cb71 in main ()

What is interesting to see, is line 23, backtrace layer number 21: kksParseCursor. So actually during parsing the detection of storage servers happens already, not when it actually starts a full segment scan needs to make a decision to do a smartscan or not.

This is just a very small post on how to watch the progress of the “CopyBack” state of a freshly inserted disk in an Exadata “Computing” (database) node. A disk failed in the (LSI Hardware) RAID5 set, and the hotspare disk was automatically used. The failed disk was replaced, and we are now awaiting the intermediate “CopyBack” phase.

The current state of the disks is visible using the following command:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0Firmware state: Copyback
Device Firmware Level: 0D70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: 0B70

But what is the progress? Some googling came up with (almost) the answer (http://trac.camsentry.com/wordpress/tag/ldinfo/), I modified it a tiny bit to make it refresh:

while $(true); do /opt/MegaRAID/MegaCli/MegaCli64 adpeventlog getlatest 200 -f ~/adpeventlog.txt a0; awk '/^Time/{TIME=$0};/Seconds/{SECS=$5}/^Event Desc/{printf("%25.25s %5.5s %s\n",TIME,SECS,$0);TIME=" ";SECS=""}' ~/adpeventlog.txt|grep -v fan|tac; sleep 5; done

This wil refresh every 5 seconds and show the progress of the CopyBack, and the state changes after it once it gets to that.

With Exadata version 11.2.3.2.0 came the Unbreakable Linux Kernel for Exadata, which had been the stock EL5 redhat kernel prior to this version (2.6.18). With the unbreakable kernel came the opportunity to run the perf utility. This utility has the opportunity to see which functions are active inside an executable when there’s a symbol table. And the oracle database executable has a symbol table! One reason to do this, is to get a more granular overview of what the Oracle database is doing than the wait interface, especially to get a more detailed overview of what the database is doing in what is visible in the wait interface as ‘on cpu’.

Right after the Exadata upgrade, I ran a simple query (which probably doesn’t reflect any real customer case) to get an idea. Previously I have been running ‘select count(*) from bigtable’ on Exadata before, and saw most of it being CPU, and a minor part of it being ‘cell smart table scan’. Now with perf I have the opportunity to get more details on what is the time spend on CPU!

These are the top 5 functions from that session:

    31.50%     476423   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    30.20%     456774   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
     7.48%     113083   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     6.96%     105301   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.94%      44475   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

So the top function used during the processing of the SQL is a function called ‘sxorchk’. Sadly, Oracle doesn’t provide information about this. Anyway, this is executing a XOR function for the reason of checking information gotten from storage. Probably the name of this function is System XOR CHecK.

Alright, once we know this we can look into the parameters of the instance I am executing the SQL, which is taken from the default Exadata parameter template (!!):

SYS@t11203 AS SYSDBA> show parameter db_block

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_block_buffers		     integer	 0
db_block_checking		     string	 FALSE
db_block_checksum		     string	 TYPICAL
db_block_size			     integer	 8192

So, this should be related to db_block_checksum, db_block_checking is off/false. Well, since this is a test instance, let’s turn it off, and redo the scan:

    49.35%     480911   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
    11.01%     107299   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     6.56%      63885   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     3.97%      38712   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     3.49%      33970   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdst_fetch

The sxorchk function is gone now! This is also reflected in the responsetime: the time with db_block_checksum set to typical is: 00:01:02.44, and with db_block_checksum set to false is: 00:00:41.68 on my system. That is a difference of roughly 20 seconds, or, as we can see from the first perf-function list: 31.5% of the time. That is a significant amount of time!

When I discussed this with Tuomas Pystynen, he asked me a very valid question: if this is a smartscan, the database does not get blocks, it gets result-sets. So blocks cannot be checked on the database layer. Is this offloaded to the storage/cell server?

Well, if it is offloaded to the storage, it will not be reflected in a function on the database layer. In other words, something is XOR’ed by the database foreground process, which is set (and can be reset) with db_block_checksum! Logical conclusion on these facts would be blocks are involved in some way…

There is way to know what is actually happening: looking at the backtrace of sxorchk function! Let’s do that!

    31.50%   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.76%-- kcbhxoro
                |          kcbhvbo
                |          |          
                |          |--100.00%-- kcbhvb_corruptcheck
                |          |          kcbl_sage_block_check
                |          |          kcfis_oss_block_verify
                |          |          kcfis_validate_translation
                |          |          kcfis_process_completed_req
                |          |          kcfis_process_completed_buffer
                |          |          kcfis_process_reaped_io
                |          |          kcfis_read
                |          |          kcbl_predpush_get
                |          |          kcbldrget
                |          |          kcbgtcr
                |          |          ktrget3
                |          |          ktrget2
                |          |          kdst_fetch

What we see here is the function is called from the kcfis (Kernel Cache File Intelligent Storage is my guess) layer from a function called ‘kcfis_oss_block_verify’, in other words: a block, which is actually the resultset which is send from the cell server, is validated/checked. That is understandable, but the usage of the parameter ‘db_block_checksum’ for setting this is misleading, to put it in a nice way.

Next question which I asked myself is: how about a normal/non-Exadata database? Well, I can mimic a non-Exadata database by setting the parameter ‘cell_offload_processing’ to false to disable smartscans!

This is how the top-5 functions look like with db_block_checksum set to true without smartscan:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    18.53%     353741   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.05%     191847   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.35%     102161   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.73%      52103   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

And this is how the top-5 functions look like with db_block_checksum set to false: without smartscan:

    36.51%     706798   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.47%     202645   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.58%     107941   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     3.57%      69044   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     2.38%      46036   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] ktrgcm

If we get the backtrace of the sxorchk function without smartscan enabled:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.79%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch

We can see the sxorchk function is called from kcbldrget (the direct path load function), with more or less the same helper function to prepare for XOR function, and no other layers (like kcfis in the backtrace with smartscan enabled).

My conclusion is: db_block_checksum is a parameter which enables checking of the data it has gotten from “storage”, which apparently is done via a XOR function. This checking is done for blocks on a regular/non-Exadata system, which the parameter suggests, but on Exadata with a smartscan the checking still is done, regardless the fact that no blocks but result-sets are gotten from “storage”.

The checking takes a significant portion of time during processing of my (very simple!) query. Probably other functions can make the overall processing more CPU intensive, which means the relative portion of time spend on checking gets less.

With Exadata/smartscan the relative amount of time spend on sxorchk with my simple SQL is 32%, without smartscan the relative amount of time drops to 21%. This is still a significant amount of time (and the function the most time is spend in, in both cases!).

Final conclusion is you should think about the setting of this parameter if you are doing much physical IO, and set it according to the needs of the database.

One addition: this is done on a V2 half rack version of Exadata (Nehalem CPU), so anyone with X2-[28] or X3-[28] is welcome to profile a ‘select count(*) from bigtable’ and post it here to see if there’s any difference. The database version is 11.2.0.3 BP15.

Recently I’ve spoken at the RMOUG training days 2013 in Denver (the mile high city). It was a first time for me to speak for the RMOUG and being in Denver. Thanks to the “sequestration” (federal budget cuts) the lines piled at immigration at Minneapolis (Minneapolis and Saint Paul, the twin cities) airport, and because my plane left more than one hour to late and my layover time was one hour and fifteen minutes, I tried to rebook my flight from Minneapolis to Denver. But, this flight turned out to be delayed too. This meant I was able to get on this flight!

At Denver airport Tim Gorman volunteered to fetch me at the airport and bring me to my Hotel. Tim did fetch multiple people, which meant we gathered at a bar where Debra Lilley, Mogens Norgaard, Claes (the tank commander), Martin Widlake amongst others where already waiting.

The RMOUG training days where held in the Denver conference centre, which have a very iconic blue bear at the front:

foto

This is the same conference centre where Collaborate 2013 will be held!

I delivered 3 talks: About multiblock reads (which I will be presenting at collaborate too, but named “How Oracle secretly changed multiblock reads”), Exadata OLTP (a study on processing single block reads on Exadata configurations and non-Exadata using Kevin Closson’s SLOB (Silly Little Oracle Benchmark) and a live Oracle function call tracing hacking session. The idea for the last presentation was conceived just prior to the conference, when I learned some people didn’t make it to Denver, and there where session slots to be filled. All three of these presentations will be presented at theOracle Usergroup Norway VĂ¥rseminar 2013 too!

After the conference we went skiing for a few days in Breckenridge with a group of friends, organised by Kellyn Pot’vin and Tim Gorman. Thank you Kellyn and Tim for all the work at both RMOUG and with this famous ‘Faux Table’ event, I really appreciate all the effort you put in these events! I also want to mention much other people who put effort in making things happen at the Faux table, for cooking, washing, driving, etc.

A few weeks further down the line was the annual HotSOS conference in Dallas. I travelled with Toon Koppelaars. Once again we found a long line for immigration, but eventually we where through immigration quick enough to fetch a beer at a bar near the gate of the connecting flight where each and every seat was equipped with an iPad, which you had to use to order something. Remarkable. At the conference I delivered my ‘About multiblock reads’ presentation. I was very pleased to be invited to speak at a conference which is dedicated to Oracle and performance. Once again this was a terrific meet-up with a huge amount of friends. This is an impression from the game night: Karl Arao from Enkitec tuning a pile of wood blocks (Jenga), overlooked by Gwen Shapira from Pythian and Mark Farnham.foto-1
There was a big number of dutch people at this year’s Hotsos conference: Toon Koppelaars, with whom I travelled, Marco Gralike, Gerwin Hendriksen and Jacco Landlust.

i would like to thank my employer VX Company for their support to make this possible.

In order to look how Exadata smartscans are different, first let’s have a peek the Oracle full segment/multiblock read evolution as short as possible:

a) Traditional multiblock reads, visible via the event ‘db file scattered read’
The essence is: Multiple adjacent blocks are read from disk, and put in the buffercache. Because every read is sequentially processed, IO latency is a performance penalty for every physical read. This works roughly this way: get a set of adjacent blocks from the segment header, fetch these blocks from disk, process these blocks, then get the next set of adjacent blocks, fetch these blocks from disk, process these blocks, etc.

b) Direct path multiblock reads, visible via the event ‘direct path read’
The essence is: Multiple IOs are done asynchronously, one or more IOs are reaped and processed, after which the number of IOs is brought back to the number of IOs the process want to keep in flight. Blocks are read to the process’ PGA (which means the IO result is not shared with other processes). Because of the asynchronous way of issuing multiple requests, the process does not suffer from the IO latency penalty of every single IO. This works roughly this way: get a set of adjacent blocks from the segment header, issue an asynchronous IO request for these, get a next set of adjacent blocks from the segment header, issue another asynchronous IO request, process one or more of the IO requests which are ready, issue IO requests for the number requests reaped, process one or more of the IO requests ready, etc. During processing, Oracle measures CPU and IO times, and can decide to add one or more concurrent requests to the two IO’s which it tries to keep in flight.

Back to smartscans

It’s not very hard to understand that direct path multiblock reads can perform much better than traditional multiblock reads. Probably at this time you think: yes, I know, but what does this have to do with Exadata? This is all information about regular processing! Well, Exadata uses the regular Oracle database executable. This means that part of the codepath of smartscans is shared with the normal/non-Exadata Oracle database. Obviously, there is a part that is unique to Exadata.

This is best viewed with a backtrace of the call to submit an IO request. This a full backtrace of the submit of an IO request of a full table scan on Linux to a database on ASM:

io_submit
skgfqio
ksfd_skgfqio
ksfdgo
ksfdaio
kfk_ufs_async_io
kfk_submit_ufs_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

Of course the functions used internally in the executable are not documented. But it’s very useful to look at them to gain a better understanding of what is happening. First look at the function at line line 24, kcbgtcr (Kernel Cache Buffers GeT Consistent Read). This is the function to perform a logical IO. One line up on line number 23 is the function kcbldrget (Kernel Cache Buffers direct path LoaDeR GET). This function indicates that the execution did choose the direct path read code path. In fact, the kcbl prefixed functions are believed to belong to Oracle direct path read codepath. Then roughly the ksfd, kfio, kfk, ksfd, and lastly skgfqio is executed, which performs the actual submit of an IO using io_submit().

Now let’s look how the equivalent submit of an IO request looks like on Exadata with smartscan turned off:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_issue_read
ossdisk_read
oss_read
kfk_submit_one_oss_io
kfk_submit_oss_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

First locate the kcbgtcr function, which is on line 31 (forget about trying to find logic in the line numbers; backtraces are read from bottom to top, whilst the numbering logic is from top to bottom). One line up on number 30 is the function kcbldrget again. Okay, that looks the same. If we read the backtrace up, it’s easy to spot the same layers, in fact the same functions: ksfd, kfio up to the kfk layer.

In the kfk layer there is a slight difference, which is understandable: on line 7 of the non-Exadata backtrace we see the function kfk_submit_ufs_io, while on Exadata the same function is kfk_submit_oss_io on line 14. I think this deserves a little Explanation. Exadata is the marketing name of the database machine, which internally was called ‘Sage’ in Oracle, and this name still surfaces sometimes, like in Exadata naming, or in patch descriptions. Quite probably OSS means ‘Oracle Sage Software’. So, this means that the process is aware it needs to read something from an Exadata storage server, and chooses a function that is meant to set that up.

The next two functions (kfk_ufs_async_io on line 6 and kfk_submit_one_oss_io on line 13) probably do logically the same, but are different because the infrastructure is different.

The next function up in both backtraces is where it gets really interesting, because now the code has to do something entirely different: on the non-Exadata system the ksfd layer is entered again, in order to get to the function skgfqio, which submit’s the IO request using the io_submit call. On the Exadata system, we see a call which I have not encountered outside of Exadata: oss_read. With the knowledge gained above, we can tell this quite probably is an Exadata specific call, which is inside an entire layer: oss. From the function names we can guess it prepares the IO request, and then issues it. Once the oss layer is crossed, we enter another layer: skgxp. The skgxp (System Kernel Generic inter-process Communication (Xfer?) Protocol) layer is NOT unique to Exadata, it’s the communication layer which is used by an Oracle RAC database for inter-process communication. So it seems like Oracle re-used the knowledge gained with RAC inter-process communication for the communication with the (Exadata) storage server.

Can we now please get to smartscans?

Yes, we didn’t encounter a smartscan yet, I did show a backtrace of a regular direct path read on Exadata. The reason is to show the difference between a traditional system (meaning disks which are presented to the system as local disks) and an Exadata system, which has to use infiniband and has to fetch the information it needs from Exadata storage servers. This is how a smartscan read call backtrace looks like:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_cread
oss_cread
kcfis_read
kcbl_predpush_get
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010010kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

The first thing which did struck me is the number of functions did decrease. On the other hand, this does not say much (you can make functions as long or as short as you wish). Also, a smartscan is done using a number of steps, of which some are already been done, which are not visible from this backtrace. After the submit of a smartscan there is another number of steps; of course reaping the submitted scan requests, but also verification of the reaped request.

It’s now interesting to see how much different the codepath looks like. The kcbgtcr function is still present, at line 15. One line up there’s the kcbldrget function, which reveals the process chose the direct read path codepath during execution. If we go up one line we see a function in the kcbl layer, which is Exadata specific as far as I know: kcbl_predpush_get. Again: all has been setup for doing a smartscan prior to the point where the process enters the point of this backtrace: submitting a request to an Exadata storage server. This means the process has the information needed (what information to ask from what storage server) prior to arriving at this point.

One layer up is a call to the kcfis (Kernel Cache File Intelligent Storage is my guess) layer, then approximately the same calls in the oss layer, but there’s a difference: it’s oss_cread and ossdisk_cread instead of oss_read and ossdisk_read, which make the call an Exadata request, instead of a request for database blocks. After the oss layer, there’s the skgxp layer again, which are exactly the same calls for both the non-smartscan and smartscan.

Conclusion

The purpose of this blogpost is to show the differences between submitting an IO request on a traditional system, on an Exadata system with smartscans turned off, and with smartscans turned on.

There is a lot of ground to cover on this. It’s not doable to cover this all in one blogpost. That’s the reason I try to pick a specific part, and work from that. Any comments are welcome.

From the backtraces it’s quite good visible regular IO’s are processed and done the same way on non-Exadata and Exadata, on Exadata the request is submitted via the skgxp layer to a storage server, instead of an IO request done via io_submit. This makes it very probable that single block IO’s are done the same way too, which means no ‘magic’ performance enhancement is possible, because it’s processed the same way on Exadata as on non-Exadata, the only difference is the IO request is done differently.

If we look at the comparison between smartscan and non-smartscan requests, it becomes prevalent there is something different happening. But the basic processing is the same, with which I mean the process is doing exactly the same as non-Exadata processes. From the smartscan backtrace it becomes visible that a process has to travel through the direct path code layer (kcbl), in order to get a smartscan, because otherwise it would be impossible to issue kcbl_predpush_get, which is the call for issuing a smartscan.

Disclaimer: I am not an Oracle employee. This information has all been obtained by profiling execution (using perf/nm/gdb). I have never seen any Oracle database and Exadata source code. There is a chance some assumptions are wrong.

In my previous post I touched the topic of a “new” codepath (codepath means “way of utilising the operating system”) for of full segment scans. Full segment scan means scanning a segment (I use “segment” because it could be a partition of a table or index) which is visible commonly visible in an explain plan (but not restricted to) as the rowsources “TABLE ACCESS FULL”, “FAST FULL INDEX SCAN” and “BITMAP FULL SCAN”.

Look at my presentation About multiblock reads to see how and when direct path reads kick in, and what the difference between the two is. Most notably, Oracle has released very little information about asynchronous direct path reads.

This post is about the implementation of direct path reads in the Oracle executable in 11.2.0.3 and the OS implementation on Linux.
For completeness: this is Oracle Linux 6.3 X64 with kernel 2.6.39-300.17.3.el6uek.x86_64. Database 11.2.0.3 64 bit, no PSU/CPU.
The database uses two ASM disks and the clusterware in single machine mode.
The system is running in a VM on my Macbook OSX 10.8.2 with SSD, VMWare Professional 5.0.2.

The database action which is traced with gdb is a simple “select count(*) from t2″, where t2 is a table without any indexes or constraints, and big enough to make the database engine choose a direct path read. Oracle can choose a number of new, largely undocumented mechanisms for optimising specifically direct path reads. The mechanisms observed in this blogpost are the true asychronous reads, which means the foreground process can issue several IO read requests, and reap them if they are finished, and increasing the number of simultaneous IO requests.

1. Observing the IO requests: getting the picture straight

First we restore the setup from my previous post, which means a normal database process (using sqlplus), and a root session with gdb attached to this process using “gdb -p “

Now set breakpoints for the operating system IO calls (with the intention to see io_submit() and io_getevents()) to show the IO calls for the oracle foreground process, and let the oracle process continue processing by using “c” in the debugger:

(gdb) rbreak ^io_.*
Breakpoint 1 at 0xa08c20
 io_prep_pwritev;
Note: breakpoint 1 also set at pc 0xa08c20.
...
Breakpoint 45 at 0x7f2e71b1dc0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 1-45, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

.
Now issue the full segment scan in the Oracle session, and look at the output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()

.
First thing that comes to mind is: why the TWO calls for the io_submit (io_submit@plt () and io_submit () from /lib64/libaio.so.1) and only 1 call for io_getevents (io_getevents@plt ())?

This is where ‘plt’ needs an introduction. ‘plt’ means procedure linkage table. This is a construction for functions used by an executable that uses dynamically linked shared libraries, which is exactly what the ‘oracle’ executable uses. If you look at the address of the @plt call, you see it is in the oracle executable (see /proc/maps), and io_submit () from /lib/libaio.so.1 is, as the line indicates from the dynamically linked library libaio.so.1 (which also can be seen from /proc//maps).

Back to io_getevents. We see only the io_getevents@plt call, which means that either the executable fakes the system call (?), or that we are missing something. This lead me to investigate the libaio library itself. This can be done using ‘nm -D’ (nm is an executable):

# nm -D /lib64/libaio.so.1
0000000000000000 A LIBAIO_0.1
0000000000000000 A LIBAIO_0.4
0000003f38200710 T io_cancel
0000003f38200670 T io_cancel
0000003f38200690 T io_destroy
0000003f382006a0 T io_getevents
0000003f38200620 T io_getevents
0000003f38200570 T io_queue_init
0000003f38200590 T io_queue_release
0000003f382005b0 T io_queue_run
0000003f382005a0 T io_queue_wait
0000003f382006e0 T io_queue_wait
0000003f38200680 T io_setup
0000003f38200660 T io_submit

.
Aha! there are TWO io_getevents here! Now THAT is interesting!

Let’s see which io_getevents call or calls are used by the oracle executable:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) rbreak ^io_get.*
Breakpoint 46 at 0x3f382006a0
 io_getevents;
...
Breakpoint 53 at 0xa09030
 io_getevents@plt;

.
And list the breakpoints:

(gdb) info break
Num     Type           Disp Enb Address            What
46      breakpoint     keep y   0x0000003f382006a0 io_getevents
47      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
48      breakpoint     keep y   0x0000003f382006a0 io_getevents
49      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
50      breakpoint     keep y   0x0000003f382006a0 io_getevents
51      breakpoint     keep y   0x0000003f382006a0 io_getevents
52      breakpoint     keep y   0x0000003f382006a0 io_getevents
53      breakpoint     keep y   0x0000000000a09030 io_getevents@plt

.
So, all the io_getevents breakpoints are for address 0x0000003f382006a0! If we look back at the libaio contents, we see addresses for the io_getevents functions have addresses 0x0000003f382006a0 and 0x0000003f38200620.

Let’s make gdb break and continue at the second io_getevents address too!

(gdb) break *0x0000003f38200620
Breakpoint 54 at 0x3f38200620
(gdb) commands
Type commands for breakpoint(s) 54, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

.
Now run the full segment scan in the sqlplus session again, and see what gdb shows:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1

.
Okay! We got the system call back!

So, we now got a breakpoint set in gdb which is able show us the asynchronous operating system calls Oracle uses in a scan, which are io_submit and io_getevents. Oracle uses more calls for managing asynchronous IO, but for clarity I focus on io_submit and io_getevents.

2. Add wait events to direct path read profile

At this point I can add two breakpoints for the wait instrumentation (kslwtbctx (enter wait) and kslwtectx (end wait)) using:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 55 at 0x8f9a652
 kslwtbctx;
Breakpoint 56 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 55-56, one per line.
End with a line saying just "end".
>silent
>f
>c
>end

.
If I rerun the full scan again, I can see there are no waits for the IO related calls, just io_submit, then io_getevents, which I can tell because I don’t see kslwtbctx and kslwtectx during the IO related calls. I see them AFTER the IO is done, which is normal, expected behavior. Please mind the cause of this is this is a VM where the disks are on an SSD, but most probably the disks (which is a file with VMWare Fusion) are in the cache of my MacBook, which makes IO very fast.

As introduced in the previous post, the last function call seen is entering of a wait using ‘kslwtbctx’, which is the event for ‘SQL*Net message from client’, which is the foreground waiting for user input to process:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

.
Now see what happens if I slow the IO’s down using cgroups.

I slowed down both my (ASM-)disks to 1 IOPS, and ran the scan in sqlplus again. This is a snippet of the gdb output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008fa1334 in kslwtectx ()

.
Actually, above is the typical pattern I see: a submitted IO via io_submit, four io_getevents calls, registering a wait event, then another io_getevents call, and ending of the wait event, given IO isn’t too fast. (in real life it all depends on the speed of the IO beneath the operating system, and mostly is a mixture of the last and the previous gdb output)

The pattern can have multiple forms, for example, when the scan starts, after the initialisation of the asynchronous IO dependencies (like an IO context on the OS and the IO slots in Oracle), the first calls are two io_submit calls (in order to get the minimal number of asynchronous IO’s (2) in flight).

Apparently, the four io_getevents calls after io_submit are non-blocking. This is my assumption, simply because I see these calls scrolling over my screen until the wait event is registered, and the scrolling stops at another io_getevents call. But can we be sure?

3. Digging deeper with debuginfo

In order to look deeper into the io_getevents calls, the arguments of the io_getevents function specifically, the debuginfo package of libaio needs to be installed (the debuginfo packages are available at http://oss.oracle.com/ol6/debuginfo). After that, the gdb session needs to be restarted in order to pickup the debug information.

This is how the gdb output looks like with the libaio debuginfo installed:

#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 io_submit (ctx=0x7ff6ceb2c000, nr=1, iocbs=0x7fff2a4e09e0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e9048, timeout=0x7fff2a4ea050) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ec128, timeout=0x7fff2a4ed130) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e8e48, timeout=0x7fff2a4e9e50) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ebf28, timeout=0x7fff2a4ecf30) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=1, nr=128, events=0x7fff2a4e8e38, timeout=0x7fff2a4e9e40) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008fa1334 in kslwtectx ()

.
Now we see a LOT of more interesting information!

First of all, all the IO calls involved are done in the same “aio_context”: ctx=0x7ff6ceb2c000. This means that any asynchronous IO that is done during this query reports it’s readiness here. This means IO’s can be reaped out of order(!).

Next is min_nr. The non-instrumented io_getevents calls have min_nr set to 2, the io_getevents call which is instrumented has min_nr set to 1.

nr is the maximum number of IO’s that can be reaped by this call in this aio_context. As far as I know, the number of slots cannot be higher than 32, which means that if I am right, there will not be more than 32 requests.

The last io_getevents argument is timeout, which is really interesting to understand and verify the behavior I described. This value is a pointer to a struct which holds the timeout specification. In order to actually know the timeout value, we need to print the contents of the struct.

This is where gdb, once the debuginfo is available, can help. Let’s modify the breakpoints the following way:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) break *0x0000003f38200620
Breakpoint 38 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 38, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 39 at 0x8f9a652
 kslwtbctx;
Breakpoint 40 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 39-40, one per line.
End with a line saying just "end".
>silent
>f
>end
(gdb) rbreak ^io_.*
Breakpoint 41 at 0x3f38200570: file io_queue_init.c, line 28.
int io_queue_init(int, io_context_t *);
...
Breakpoint 74 at 0x7f549fd44c0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 41-74, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

.
Notice the “print *timeout” with breakpoint 38 (*0x0000003f38200620) alias the “correct” io_getevents call.
Using “print *timeout” we can let gdb follow the pointer, and print the contents of the struct. Let’s see if my assumption about the non-instrumented io_getevents is correct:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  io_submit (ctx=0x7f54a1956000, nr=1, iocbs=0x7fff78f059b0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0dfd8, timeout=0x7fff78f0efe0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$21 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f110b8, timeout=0x7fff78f120c0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0ddd8, timeout=0x7fff78f0ede0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f10eb8, timeout=0x7fff78f11ec0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=1, nr=128, events=0x7fff78f0ddc8, timeout=0x7fff78f0edd0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 600, tv_nsec = 0}

.
Yes! What I saw on my screen in gdb is now proven with the printing of the timeout struct!! The non-instrumented io_getevents calls just peak in the completion queue (ctx) to see if min_nr (3 in this case) IO requests are ready for four times, then register a wait event, and issue another io_getevents, look in the same completion queue (ctx), but only for 1 ready IO request, and now wait for 600 seconds.

4. Conclusion

The reason for this post is to show the essence of how the direct path reads work. There is much more to be said on this subject, especially with the adaptive or “auto tune” mechanism, which scales up the number of asynchronous IO’s in flight.

Recently I was discussing some IO related waits with some friends. The wait I was discussing was ‘kfk: async disk IO’. This wait was always visible in Oracle version 11.2.0.1 and seems to be gone in version 11.2.0.2 and above. Here is the result of some investigation into that.

First: the wait is not gone with version 11.2.0.2 and above, which is very simple to prove (this is a database version 11.2.0.3):

SYS@v11203 AS SYSDBA> select name, parameter1, parameter2, parameter3, wait_class from v$event_name where name like 'kfk: async disk IO';

NAME		     PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
-------------------- ---------- ---------- ---------- ----------------------------------------------------------------
kfk: async disk IO   count	intr	   timeout    System I/O

What is interesting, is that the wait class is ‘System I/O’. I don’t know the official definition of the wait class ‘System I/O’, but it tells me that it is something a background process is waiting for, not my foreground process. But I could be wrong…

Let’s look at an excerpt of a tracefile of a asynchronous, full segment scan in Oracle version 11.2.0.1. This host is running in a VM on OL6.3 X64:

PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356620409310181 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=120,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310180
EXEC #3:c=0,e=63,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310297
WAIT #3: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1356620409310351
WAIT #3: nam='Disk file operations I/O' ela= 9 FileOperation=2 fileno=5 filetype=2 obj#=-1 tim=1356620409311203
WAIT #3: nam='Disk file operations I/O' ela= 238 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409312218
WAIT #3: nam='kfk: async disk IO' ela= 14 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409313703
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409314449
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315169
WAIT #3: nam='kfk: async disk IO' ela= 5 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315850
WAIT #3: nam='Disk file operations I/O' ela= 42 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409316451
WAIT #3: nam='kfk: async disk IO' ela= 33 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409317903
WAIT #3: nam='kfk: async disk IO' ela= 403 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409320529
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409321950
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409323627
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409325424
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409327121
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409329153
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409330861
WAIT #3: nam='kfk: async disk IO' ela= 8 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409332534
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409334179

What is quit clear to see, is the database is doing some stuff which is instrumented in the wait interface (‘Disk file operations I/O’), but the event ‘kfk: async disk IO’ is much occurring. It is NOT taking an enormous amount of time (ela < 10 (microseconds) most of the time).

Let's dig a little deeper. Most events are instrumentation for calls to the operating system to do something. Let's take the linux 'strace' utility, and list the waits together with the system calls. This is done with the 'strace -e write=all -p PID’, here is a fragment:

io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
io_getevents(139892671860736, 2, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 122880, 0}}, {0, 0}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 107) = 107
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 35  33 20 63 6f 75 6e 74 3d  ' ela= 5 3 count= |
 | 00030  31 20 69 6e 74 72 3d 30  20 74 69 6d 65 6f 75 74  1 intr=0  timeout |
 | 00040  3d 34 32 39 34 39 36 37  32 39 35 20 6f 62 6a 23  =4294967 295 obj# |
 | 00050  3d 37 33 34 32 36 20 74  69 6d 3d 31 33 35 36 36  =73426 t im=13566 |
 | 00060  32 31 31 31 30 32 35 38  30 31 34                 21110258 014      |
write(9, "\n", 1)                       = 1
 | 00000  0a                                                .                 |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b585420, 0x7f3b4b585420, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 106) = 106
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 36  20 63 6f 75 6e 74 3d 31  ' ela= 6  count=1 |
 | 00030  20 69 6e 74 72 3d 30 20  74 69 6d 65 6f 75 74 3d   intr=0  timeout= |
 | 00040  34 32 39 34 39 36 37 32  39 35 20 6f 62 6a 23 3d  42949672 95 obj#= |
 | 00050  37 33 34 32 36 20 74 69  6d 3d 31 33 35 36 36 32  73426 ti m=135662 |
 | 00060  31 31 31 30 32 35 39 37  35 39                    11102597 59       |
write(9, "\n", 1)                       = 1
 | 00000  0a                                                .                 |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 106) = 106
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 39  20 63 6f 75 6e 74 3d 31  ' ela= 9  count=1 |
 | 00030  20 69 6e 74 72 3d 30 20  74 69 6d 65 6f 75 74 3d   intr=0  timeout= |
 | 00040  34 32 39 34 39 36 37 32  39 35 20 6f 62 6a 23 3d  42949672 95 obj#= |
 | 00050  37 33 34 32 36 20 74 69  6d 3d 31 33 35 36 36 32  73426 ti m=135662 |
 | 00060  31 31 31 30 32 36 32 31  37 38                    11102621 78       |

What is interesting here, is that the event ‘kfk: async disk IO’ is reported after both the io_getevents() and the io_submit() call. (with the io_submit call an IO request or IO requests are dispatched to the operating system, with the io_getevents call the operating system (AIO) completion queue is examined to see if any requests are ready). So we can conclude that ‘System I/O’ is not appropriate (by my definition), this is something my foreground process apparently is waiting for. We can also conclude that this event is related to IO. But what does it mean?

In order to dig deeper, we need to know what the database is doing internally. The most obvious way is to read the source. But, it’s not possible to get the Oracle database source if your not working for Oracle (and probably explicitly have been granted access). There are a number of ways to see more. I wrote about ‘perf’, but also the gdb (gnu debugger) can be used.

In order to debug a running Oracle database process, you can attach to a process with the debugger (as root) with ‘gdb -p PID’. An equivalent of using trace and looking for io_submit() and io_getevents() calls is:

# gdb -p 14122
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
...
0x0000003f38a0e530 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) 

Now enter the following commands (if you go into gdb yourself, it becomes apparent which are the commands):

(gdb) set pagination off
(gdb) break io_submit
Breakpoint 1 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) break 'io_getevents@plt'
Breakpoint 2 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) continue
Continuing.

And run the full segment scan again:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()

Now we need to know the function call which Oracle uses for the registering waits, for which we used the tracefile generated by SQL trace earlier. The first line in the tracefile of our current session if we issue the full scan again will be the ending of the wait ‘SQL*Net message from client’. The call which Oracle uses for writing this line is write(). Let’s clean up the current breakpoints, and set a breakpoint for the write call (stop the debugger’s current action using CTRL-c):

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break write
Breakpoint 3 at 0x3f38a0e4c0: file ../sysdeps/unix/syscall-template.S, line 82. (3 locations)
(gdb) c
Continuing.

(‘c’ is an abbreviation of ‘continue’)
And run the scan again.
Because of the breakpoint set, gdb will break execution when the write() call is issued:

Breakpoint 3, write () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

And generate a backtrace (a list of all the functions which where called to reach the write() call), with the ‘backtrace’ command (‘bt’):

(gdb) bt
#0  write () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000086014dd in sdbgrfuwf_write_file ()
#2  0x0000000008601426 in sdbgrfwf_write_file ()
#3  0x00000000085caafb in dbgtfdFileWrite ()
#4  0x00000000085ca685 in dbgtfdFileAccessCbk ()
#5  0x00000000085c9d6d in dbgtfPutStr ()
#6  0x0000000001a8d93a in dbktWriteTimestamp ()
#7  0x0000000001a8d6b5 in __PGOSF61_dbktWriteNfy ()
#8  0x00000000085caa06 in dbgtfdFileWrite ()
#9  0x00000000085ca685 in dbgtfdFileAccessCbk ()
#10 0x00000000085c9d6d in dbgtfPutStr ()
#11 0x0000000001a8e291 in dbktPri ()
#12 0x00000000009f143e in ksdwrf ()
#13 0x00000000009f1d8f in ksdwrfn ()
#14 0x0000000005abf0e1 in kxstTraceWait ()
#15 0x000000000821e88d in kslwtectx ()
#16 0x00000000083c2a79 in __PGOSF24_opikndf2 ()
#17 0x000000000143b790 in opitsk ()
#18 0x00000000014406da in opiino ()
#19 0x00000000083c54fd in opiodr ()
#20 0x0000000001437b60 in opidrv ()
#21 0x00000000018aac97 in sou2o ()
#22 0x00000000009d3ef1 in opimai_real ()
#23 0x00000000018aff26 in ssthrdmain ()
#24 0x00000000009d3e5d in main ()

What is interesting to see here, is the function at number 15, kslwtectx(). We know this write reports the ending of waiting at ‘SQL*Net message from client’. This function is ‘Kernel Service Layer WaiT End ConTeXt’ (educated guess; I’ve had a little help from Tanel Poder to find the Oracle wait interface functions). The function which marks the start of a wait is called ‘kslwtbctx’, and the function which ends a wait is called ‘kslwtectx’. Let’s put a break and continue on io_submit, io_getevents, kslwtbctx and kslwtectx:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break 'io_getevents@plt'
Breakpoint 5 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtbctx
Breakpoint 6 at 0x8217fda
(gdb) commands
Type commands for breakpoint(s) 6, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtectx
Breakpoint 7 at 0x821e3d8
(gdb) commands
Type commands for breakpoint(s) 7, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now run the full scan again; this is what is seen in the debugger:

#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()

This goes on, until the IO calls end, and the foreground does some other stuff at the end, ending with the beginning of waiting on ‘SQL*Net message from client’:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()

Okay, back to ‘kfk: async disk IO’. We see the ‘io_getevents()’ call which ALWAYS has a kslwtbctx() call before, and always a kslwtectx() call after the io_getevents() call itself. Please mind we are on 11.2.0.1. This strongly suggests ‘kfk: async disk IO’ is the instrumentation of the io_getevents() call. Or io_getevents() is part of this wait event.

Let’s make the io_getevents() call take longer, and see if that is reflected in the wait, to prove the hypothesis made above. This can be done using the ‘catch’ function of the debugger:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) catch syscall io_getevents
Catchpoint 8 (syscall 'io_getevents' [208])
(gdb) commands
Type commands for breakpoint(s) 8, one per line.
End with a line saying just "end".
>silent
>shell sleep 0.01
>c
>end
(gdb) c
Continuing.

Please mind it’s ‘io_getevents’ here instead of ‘io_getevents@plt’, because the catch function looks the syscall up in an header file, instead of the symbol table in the executable.

Now run the full scan again, with SQL trace turned on at level 8 to show the waits. Previously we saw waits with an elapsed time mostly lesser than 10ms. here is a snippet from the trace:

PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356626115353145 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=203,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115353116
EXEC #3:c=1000,e=80,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115355176
WAIT #3: nam='SQL*Net message to client' ela= 30 driver id=1413697536 #bytes=1 p3=0 obj#=73426 tim=1356626115355832
WAIT #3: nam='kfk: async disk IO' ela= 26404 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115384177
WAIT #3: nam='kfk: async disk IO' ela= 25050 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115411677
WAIT #3: nam='kfk: async disk IO' ela= 24885 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115437990
WAIT #3: nam='kfk: async disk IO' ela= 25002 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115464244

*** 2012-12-27 17:35:15.492
WAIT #3: nam='kfk: async disk IO' ela= 25192 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115492891
WAIT #3: nam='kfk: async disk IO' ela= 25175 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115521419
WAIT #3: nam='kfk: async disk IO' ela= 24804 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115549124
WAIT #3: nam='kfk: async disk IO' ela= 24980 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115576955
WAIT #3: nam='kfk: async disk IO' ela= 24988 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115604879

A little explanation here: the debugger triggers the syscall catcher on entrance and leaving of the syscall. The times being somewhat more than 20ms, instead of a few microseconds gives a fair indication the wait is related. We can also look in the debugger:

(gdb) info break
Num     Type           Disp Enb Address            What
8       catchpoint     keep y                      syscall "io_getevents" 
	catchpoint already hit 174 times
        silent
        shell sleep 0.01
        c

Yes, it has been hit 174 times. So ‘kfk: async disk IO’ is the instrumentation of the ‘io_getevents()’ call, or more actions of which the io_getevents() is part of.

Okay, now that we know that, let’s switch to Oracle version 11.2.0.3…
This is what I find on doing EXACTLY the same (full scan) in the tracefile:

PARSING IN CURSOR #140382932318408 len=23 dep=0 uid=84 oct=3 lid=84 tim=1356626776003874 hv=1020534364 ad='7f491298' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #140382932318408:c=0,e=41,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003873
EXEC #140382932318408:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003962
WAIT #140382932318408: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356626776004009
WAIT #140382932318408: nam='direct path read' ela= 340 file number=5 first dba=28418 block cnt=126 obj#=75579 tim=1356626776045733
FETCH #140382932318408:c=164975,e=188719,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356626776192752
STAT #140382932318408 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=20944 pr=20941 pw=0 time=188708 us)'
STAT #140382932318408 id=2 cnt=1000000 pid=1 pos=1 obj=75579 op='TABLE ACCESS FULL T2 (cr=20944 pr=20941 pw=0 time=68726 us cost=5738 siz=0 card=1000000)'

Okay. First thing to notice is the ‘kfk: async disk IO’ events are gone. We see the PARSE line, the SQL Net message, and a single ‘direct path read’ wait, and then the FETCH line. This is absolutely different behavior than version 11.2.0.1!
What is happening? I think the only way to understand more about this, is running the debugger again with notification of io_submit, io_getevents, kslwtbctx and kslwtectx:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

That’s odd! There are no calls to the code location’s kslwtbctx and kslwtectx during IO processing! So, despite ‘kfk: async disk IO’ still being a wait event, it doesn’t seem to be instrumenting io_getevents() anymore. What if we make io_getevents() take longer (I’ve used the same number as the earlier catching of the syscall, 0.01 second):

PARSE #140382932318408:c=0,e=168,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973832834
EXEC #140382932318408:c=0,e=68,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973833385
WAIT #140382932318408: nam='SQL*Net message to client' ela= 671 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356627973835367
WAIT #140382932318408: nam='asynch descriptor resize' ela= 680 outstanding #aio=0 current aio limit=152 new aio limit=182 obj#=75579 tim=1356627973837361

*** 2012-12-27 18:06:16.440
FETCH #140382932318408:c=185972,e=2604507,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356627976440376

No, the time in the FETCH line increases (elapsed; e=2604507, cpu; c=185972; there’s 2418535 microseconds of time which is not spend on CPU), but there is only little time instrumented by wait events. A reasonable conclusion is the ‘kfk: disk async IO’ wait event is different in this version (11.2.0.3). I’ve also tried to slow down the io_submit() call the same way, with the same result: the time increases in the FETCH line, but there’s nothing the database writes a wait line for.

What if I slow down the disk? This can be done using cgroups, as described in this post. I’ve set my both my ASM disks to 1 IO per second. That should result in waits! This is the resulting trace output:

PARSE #140382932318408:c=0,e=20,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964787836
EXEC #140382932318408:c=0,e=64,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964788256
WAIT #140382932318408: nam='SQL*Net message to client' ela= 703 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356633964790122
WAIT #140382932318408: nam='asynch descriptor resize' ela= 699 outstanding #aio=0 current aio limit=238 new aio limit=264 obj#=75579 tim=1356633964792014
WAIT #140382932318408: nam='direct path read' ela= 497938 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1356633965295483

*** 2012-12-27 19:46:05.795
WAIT #140382932318408: nam='direct path read' ela= 495498 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1356633965795382

*** 2012-12-27 19:46:06.295
WAIT #140382932318408: nam='direct path read' ela= 495890 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1356633966295208
WAIT #140382932318408: nam='direct path read' ela= 495889 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1356633966795127

So, if I slow the IO down, I get ‘direct path read’ wait events. How does that look when I use the break/continue technique on this configuration?

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

The interesting part are the io_getevents() calls, which consistently are 4 calls before kslwtbctx is called, and another one or two calls are done before kslwtectx is called, and wait time is registered (see above). Interesting, but no ‘kfk: async disk IO’.

Now let’s look at the same output when Oracle 11.2.0.1 is throttled:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()

This looks remarkably the same as the 11.2.0.3 version!

Conclusion

It seems the database issues a number of io_getevents() calls after asynchronous IO calls are submitted with io_submit(), which do return immediately (also known as ‘non-blocking’). They just peak at the completion queue, if an IO is available in the completion queue, the process processes that IO. The number of non-blocking calls is 3 with version 11.2.0.1 and 4 with 11.2.0.3 (please mind the ‘slot’ mechanism which controls the number of simultaneous AIO requests is probably in play here too). If these calls do not return any IO request, the wait interface is called (kslwtbctx()), and a blocking io_getevents() call is issued (which waits for at least 1 request to be returned too), for which the wait time is visible as event ‘direct path read’.

It also seems the first non-blocking io_getevents() call is instrumented by the ‘kfk: async disk IO’ event with Oracle 11.2.0.1, for reasons unknown to me. This instrumentation is not present in version 11.2.0.3.

Why?

I guess the first question which comes to mind when reading this title is ‘Why’? For a database, but I guess for any IO depended application, we want IO’s to be faster, not throttle them, alias make them slower. Well, the ‘why’ is: if you want to investigate IO’s, you sometimes want them to slow down, so it’s easier to see them. Also, (not so) recent improvements in the Oracle database made great progress in being able to use the available bandwidth by doing IO in parallel, which could strip away much of the ability to see them in Oracle’s popular SQL trace.

Virtualisation

I use VMWare Fusion on my MacBook and use Linux in the VM’s to run the Oracle datatabase. Desktop virtualisation, like VMWare Fusion (and Virtualbox and VMWare Workstation, I think all desktop virtualisation products) use the operating system IO subsystem. This introduces a funny effect: if you stress the IO subsystem in the VM, and measure throughput, it looks like the disk or disks are getting faster and faster every run. The reason for this effect is the blocks in the file, which is the disk from the perspective of the VM, are getting touched (read and/or written) more and more, thus are increasingly better candidates for caching from the perspective of the underlying operating system.

I think that if you combine the ‘disk getting faster’ effect with the need to investigate IO’s, you understand that it can be beneficial to throttle IO’s in certain cases.

Cgroups

The mechanism which can be used to control and throttle resources is ‘cgroups’. Cgroups is a Linux kernel feature, which is an abbreviation of ‘control groups’, and has the function to limit, account and isolate resource usage (see this wikipedia article. Cgroups are a Linux kernel feature since kernel version 2.6.24. This means there is no cgroups in Redhat and Oracle linux version 5, but there is version 6.

The idea behind cgroups is to have control over resources in a system, which becomes more and more important with today’s systems getting bigger. Cgroups have been created to function from single processes to complete (virtualised) systems.

Simple setup and usage

(please mind all commands are either executed as root (indicated by ‘#’), or as a regular user (oracle in my case, indicated by ‘$’))

First, we need to make sure the ‘blkio’ controller is available:

# grep blkio /proc/mounts || mkdir -p /cgroup/blkio ; mount -t cgroup -o blkio none /cgroup/blkio

Next, we create a cgroup called ‘iothrottle’:

# cgcreate -g blkio:/iothrottle

In order to throttle IO on the device, we need to find the major and minor number of the block device. If you use ASM, you can list the PATH field in the V$ASM_DISK view, and generate a long listing of it on linux:

$ ls -ls /dev/oracleasm/disk1 
0 brw-rw----. 1 oracle dba 8, 16 Dec 15 13:22 /dev/oracleasm/disk1

This shows that the major and minor number of the block device are: 8 16.

The next step is to use the ‘read_iops_device’ configuration option of the blkio controller to apply throttling to the ‘iothrottle’ cgroup. The ‘read_ops_device’ configuration option uses the following format: major_number:minor_number nr_IO_per_second (major:minor, space, maximum number of read IO’s per second)

cgset -r blkio.throttle.read_iops_device="8:16 10" iothrottle

Okay, we now have a cgroup called ‘iothrottle’ setup, and used the ‘read_iops_device’ option of the ‘blkio’ controller. Please mind there are no processes assigned to the cgroup yet. The next steps are to use an IO generation and measurement tool to first measure uncapped IO performance, then assign the process to the ‘iothrottle’ cgroup, and rerun performance measurement.

For the IO tests I use ‘fio’. This tool gives you the opportunity to investigate your system’s IO subsystem and IO devices performance. This is my fio.run file:

$ cat fio.run 
[global]
ioengine=libaio
bs=8k
direct=1
numjobs=4
rw=read

[simple]
size=1g
filename=/dev/oracleasm/disk1

Now run it! Please mind I’ve snipped a large part of the output, because fio gives a great deal of output, which is extremely interesting, but not really relevant to this blog:

$ fio --section=simple fio.run 
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
fio 1.57
Starting 4 processes
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
Jobs: 2 (f=2): [_R_R] [100.0% done] [123.3M/0K /s] [15.4K/0  iops] [eta 00m:00s]
...

So, we did an average of 15.4K read IOPS. Now lets put the process which runs fio in the ‘iothrottle’ cgroup!
Get PID of the process we just used with ‘fio’:

$ echo $$
5994

And assign the ‘iothrottle’ cgroup to it:

# echo 5994 > /cgroup/blkio/iothrottle/tasks

You can see if your process is assigned a cgroup by reading the ‘cgroup’ file in ‘proc’:

$ cat /proc/self/cgroup 
1:blkio:/iothrottle

Okay, we are assigned the ‘iothrottle’ cgroup! Now rerun the ‘simple’ fio benchmark:

$ fio --section=simple fio.run 
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
simple: (g=0): rw=read, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=1
fio 1.57
Starting 4 processes
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
fio: only root may flush block devices. Cache flush bypassed!
Jobs: 4 (f=4): [RRRR] [0.3% done] [81K/0K /s] [9 /0  iops] [eta 14h:37m:42s]

To be honest, I cancelled this fio run after a little while, because the time it takes to run is very long (approximately 14 hours and 30 minutes, as can seen above).
I think this example shows the cgroup ‘iothrottle’ in action very clear!

Cgroup usage in reality

I can’t imagine anybody want to echo all the process ID’s in the cgroup’s ‘tasks’ file in order to get these processes in a certain cgroup. With my DBA background, I would love to have control over an entire (all processes belonging to a) database. Setting up cgroups like done above (manually) means you have to set it up every time the server reboots. Luckily, there is a way to automate cgroup creation and assigning!

cgconfig service
In order to create cgroups, there is a service called ‘cgconfig’, which reads the file /etc/cgconfig.conf. In order to get the ‘iothrottle’ cgroup, and the disk throttling configuration created automatically, use this configuration:

mount {
	blkio = /cgroup/blkio;
}

group iothrottle {
	blkio {
		blkio.throttle.read_iops_device="8:16 10";
	}
}

In order to use this configuration restart the cgconfig service using ‘service cgconfig restart’. Optionally, you can enable automatic starting of this service on startup using ‘chkconfig –level 2345 cgconfig on’ (optionally check when this service is started with ‘chkconfig –list cgconfig’). Now the cgroup is created. But now do we assign processes to it?

cgred service
This is where the cgred service is for. This daemon uses a simple configuration file: /etc/cgrules.conf. Once configured and active, this service assigns cgroups to users, groups or processes. For the purpose of limiting IO from an Oracle database, I created this simple line:

oracle			blkio			/iothrottle

Now the cgred service can be started using ‘service cgred restart’. Optionally, you can enable automatic starting of this service using ‘chkconfig –level 2345 cgred on’.

Summary

The purpose of this blogpost was to introduce cgroups, and let you understand why I choose the IO throttling functionality. Next it showed how to setup cgroups manually, and a simple test to prove it works, with enough information to let you repeat the test for yourself. The last part showed how to automate cgroup creation and assignment.

A word of caution is on it’s place. It’s a fairly new feature at the time of writing, which means it could break or does not work as expected. So use at your own risk! In my limited tests it worked like a charm.

Profiling PL/SQL is an Oracle feature since Oracle 9i (2001). Yet I haven’t seen anyone actually use the profiler (besides myself, which is the reason for this post). On the other hand, I have seen shiploads of people guess how the time profile of their PL/SQL looks like. I also have seen people instrument their code, which ranges from bad (to little instrumentation, to much instrumentation, wrong timescale (measuring things that take lesser than milliseconds with a granularity of seconds for example), to good (use correct granularity, use Method-R’s ILO).

Profiling and instrumentation go hand-in-hand, they absolutely don’t rule each other out. Instrumentation points you to the chunk(s) of code which takes the most time, and should be always on. Profiling takes timing to the source code line level, and is the logical next step to do when you’ve been pointed to a piece of code that takes a long time. Profiling should be off by default.

I know the basic form of profiling has been described by a number of websites, including the Oracle documentation and Tim’s Oracle-base. To be honest, I haven’t got very much to add. (Just like all the other websites which seems to have taken a copy of oracle-base)

The description of dbms_profiler in this article is about using dbms_profiler on an Oracle 11.2 database. If the profiler package is not loaded (check with ‘desc dbms_profiler’ with a DBA granted user), load it as SYSDBA with ‘@?/rdbms/admin/profload.sql’. The only thing you have to do, is to add a few tables and a sequence in the schema which is executing the profiler package. The most simple way to do do this, is to log on to the database server, and log on with sqlplus with this owner, and execute:

@?/rdbms/admin/proftab.sql

The next thing to do, is add start and stop commands for the profiler in the piece of ‘trouble code’. The code can be either SQL or PL/SQL. In PL/SQL add:

dbms_profiler.start_profiler ( 'profiler run ' || to_char ( sysdate, 'yyyymmddhh24mi' ) ); --start
dbms_profiler.stop_profiler; --stop

In SQL add ‘exec’ in front of calling dbms_profiler.

There’s a gotcha you really need to know: the profiler records the source line number. What you can do, and most scripts found on the internet do, is join the profiler output with the ALL_SOURCE view to see what happens at that line. Anonymous PL/SQL blocks are not recorded in ALL_SOURCE, so when you profile anonymous PL/SQL blocks, you get ‘anonymous’ in the unit_owner and unit_name fields.

Let’s look at a complete example:
a) Please mind this is an extremely simple example! I use the SYS user for the sake of simplicity. Don’t do this in official environments!
b) I’ve checked the existence of dbms_profiler. In fact, I needed to load the profiler package in this database.
c) I’ve ran the proftab script to generate the needed tables for the profiler.
d) This is the script I’ve used:

-- script: gtt_and_collection.sql
-- setup
set serveroutput on;
create global temporary table gtt ( col1 number, col2 varchar2(10) ) on commit preserve rows;
create or replace procedure insert_example is
        nr_to_loop number:=1000000;
        nr number:=0;
        type col_type is table of gtt%ROWTYPE;
        col col_type:=col_type();
begin
        --dbms_profiler.start_profiler('profiler run '||to_char(sysdate,'yyyymmddhh24mi'));
        dbms_output.put_line('Start: '||to_char(sysdate,'hh24:mi:sssss'));
        for nr in 1..nr_to_loop loop

                /* gtt */
                insert into gtt (col1, col2) values ( nr, 'AABBCCDDEE' );

                /* collection */
                col.extend;
                col(nr).col1 := nr;
                col(nr).col2 := 'AABBCCDDEE';

        end loop;
        commit;
        dbms_output.put_line('Stop: '||to_char(sysdate,'hh24:mi:sssss'));
        --dbms_profiler.stop_profiler;
end;
/
-- execute test
exec insert_example;
-- clean up
--drop procedure insert_example;
truncate table gtt;
drop table gtt;

Let’s run it:

SQL> @gtt_and_collection

Table created.


Procedure created.

Start: 06:57:25074
Stop: 06:58:25110

PL/SQL procedure successfully completed.


Procedure dropped.


Table truncated.

Okay, the procedure ran approximately 1 minute. Let suppose this takes too long. Now what are you going to do? Can this be sped up? Is the time taken by insert into the global temporary table? Some would argue this is a temporary table, so it probably exists in memory. Or is it the collection which eats up the time? Or maybe even the commit after the loop?

Well, enable the profiler by removing the ‘–’ in front of the dbms_profiler calls in the procedure, and run it again! (Please mind flashback database is an incredibly handy feature to redo tests with absolutely the same database. Downside is it means you need to stop the database and mount the database to flashback it. This means other users or developers have downtime)

When it runs with the profiler, nothing extraordinary is seen, it just runs. You have to make a report of the profiler data yourself. This is the script I use:

col owner format a10
col name format a20
col line format 9999 
col source format a40 word_wrap
col occurances format 9999999
col tot_time_s format 999,999,990.999
col run_comment format a40
col pct format 999
select runid, run_date, run_comment from plsql_profiler_runs;
select u.unit_owner OWNER, u.unit_name NAME, d.line# LINE, (select text from all_source where type in ('PACKAGE BODY','FUNCTION','PROCEDURE','TRIGGER') and name = u.unit_name and line=d.line# and owner=u.unit_owner and type=u.unit_type) SOURCE, d.total_occur OCCURANCES, (d.total_time/1000000000) TOT_TIME_S, d.total_time/r.run_total_time*100 "PCT"
from plsql_profiler_runs r, plsql_profiler_units u, plsql_profiler_data d
where r.runid=&&runid
and r.runid=u.runid
and r.runid=d.runid
and d.unit_number=u.unit_number
and (d.total_time/1000000000) > 1 --more than 1 second
order by tot_time_s
/
select (run_total_time/1000000000) tot_time_s from plsql_profiler_runs where runid=&runid;
undefine runid

The most notable change with scripts found on the internet is this script orders the output by time spend, which is why I use it, and you probably too. If you profile a real-life piece of code, there is probably more lines of code involved than fit on your screen, so having the output in the order of the source code doesn’t make sense to me.

What is the output?

     RUNID RUN_DATE  RUN_COMMENT
---------- --------- ----------------------------------------
	 1 26-NOV-12 profiler run 201211260638
	 2 26-NOV-12 profiler run 201211260643

Enter value for runid: 2
old   3: where r.runid=&amp;runid
new   3: where r.runid=2

OWNER	   NAME 		 LINE SOURCE				       OCCURANCES	TOT_TIME_S  PCT
---------- -------------------- ----- ---------------------------------------- ---------- ---------------- ----
SYS	   INSERT_EXAMPLE	   15 col.extend;				  1000000	     2.386    6
SYS	   INSERT_EXAMPLE	   12 insert into gtt (col1, col2) values (	  1000000	    32.836   88
				      nr, 'AABBCCDDEE' );

old   1: select (run_total_time/1000000000) tot_time_s from plsql_profiler_runs where runid=&amp;runid
new   1: select (run_total_time/1000000000) tot_time_s from plsql_profiler_runs where runid=2

      TOT_TIME_S
----------------
	  37.130

Well, this makes it all obvious, doesn’t it? Only two lines of code have spend more than 1 second in total, which are the two shown. The collection management (adding a row) is responsible for 6% of the time, and the insert is responsible for 88%

Follow

Get every new post delivered to your Inbox.

Join 1,047 other followers

%d bloggers like this: