This is a small note describing how Oracle implemented the situation which is covered by the 'db file parallel read' wait event. This event occurs when Oracle knows it must read multiple blocks which are not adjacent (thus from random locations, possibly in different files), and cannot continue processing with the result of a single block. In other words: if Oracle could process something after reading a single block, it would read a single block, visible as the wait 'db file sequential read'.

This is how it shows up if you enable sql trace:

WAIT #139658359011504: nam='db file parallel read' ela= 69997140 files=1 blocks=70 requests=70 obj#=76227 tim=1373200106669612

What this shows is Oracle issuing a request for 70 blocks. This has an interesting implication for monitoring the time spent on the event 'db file parallel read': if you don't know the number of blocks for which an IO request was issued, it's impossible to say anything meaningful about the time. So just monitoring or looking at the cumulative time spent in 'db file parallel read' doesn't tell you anything about IO latency, it only tells you where the Oracle process spent its time.
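Since 'ela' covers the whole batch of requests, a per-request latency has to be derived from the trace line itself. A quick sketch in plain Python (the parsing approach is mine; the numbers are taken from the trace line above):

```python
import re

# A 'db file parallel read' wait covers a whole batch of IO requests:
# 'ela' is the total elapsed time in microseconds, 'requests' the number
# of IO requests issued. The average latency per request is ela/requests.
wait_line = ("WAIT #139658359011504: nam='db file parallel read' ela= 69997140 "
             "files=1 blocks=70 requests=70 obj#=76227 tim=1373200106669612")

ela = int(re.search(r"ela= (\d+)", wait_line).group(1))
requests = int(re.search(r"requests=(\d+)", wait_line).group(1))

avg_us = ela / requests
print(f"{ela} us over {requests} requests = {avg_us:.0f} us per request")
```

In this trace that works out to roughly a second per request, which suggests severely throttled IO in the test environment rather than normal disk latency.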

How did Oracle implement this? This is obviously port specific (which means the implementation will be different on different operating systems). My test environment is Oracle Linux 6u3 X64, Oracle 11.2.0.3 64 bit using ASM.

This is how the requests are asynchronously submitted to the operating system:

Breakpoint 2, io_submit (ctx=0x7f04c0c8d000, nr=70, iocbs=0x7fff86d965f0) at io_submit.c:23

So all the IO requests are submitted in one go!

The submission of the IO requests is not covered by a wait event, which makes sense, because the io_submit call is (supposed to be) non-blocking.

Next Oracle waits for ALL the IOs to finish, covered by the ‘db file parallel read’ wait event:

Breakpoint 13, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, io_getevents_0_4 (ctx=0x7f04c0c8d000, min_nr=70, nr=128, events=0x7fff86d9c798, timeout=0x7fff86d9d7a0) at io_getevents.c:46
Breakpoint 14, 0x0000000008fa1334 in kslwtectx ()

We see kslwtbctx, which indicates the start of a wait event, then an io_getevents call:

‘ctx’ is the IO context. This is how Linux keeps track of groups of asynchronous IO requests.
‘min_nr’ is the minimal number of requests that must be ready for this call to return; the call waits for this until ‘timeout’ is reached. io_getevents will just peek if ‘timeout’ is set to zero.
‘nr’ is the maximum number of requests that io_getevents will “fetch”.
‘events’ is a struct (table like structure) that holds the information about the iocb’s (IO control blocks) of the requests.
‘timeout’ is a struct that sets the timeout of this call. For Oracle IO I see timeout being 600 (seconds) most of the time.

The last line shows kslwtectx, indicating that the wait has ended.

This is part 2 of a number of blogposts about huge Oracle database IO’s.
If you landed on this blogpost and did not read part 1, please read part 1 here.

In part 1 I showed how database IOs of a full table scan could be bigger than 1MB by increasing the db_file_multiblock_read_count parameter to a number beyond 1MB expressed in Oracle blocks. These bigger IOs only happen with direct path reads, not with buffered multiblock reads.

But how much bigger can these IOs be? In part 1 I showed Oracle IOs of 1020 blocks. Is that the limit? To investigate this, I needed bigger extents: table T2 in part 1 had a maximum extent size of 1024 blocks, which meant 1020 blocks was the biggest IO possible from that table.

For the sake of this investigation I created a much bigger table to get larger extents:

 EXTENT_ID     BLOCKS
---------- ----------
	 0	    8
...
       263	 8192
       264	 8192
       265	 8192

266 rows selected.

So with my new, big table I got extents of up to 8192 blocks. Let’s do the same test on this table as in the previous post: set db_file_multiblock_read_count to 2000, enable SQL trace at level 8, and throttle IOPS to get waits!

The IOs against the “small” extents of 8 blocks and 128 blocks are done exactly as in the previous post. The IOs against the extents with a size of 1024 blocks are the same too, simply because reading 1020 blocks in one IO is as efficient as it gets! But let’s take a look at the point where the extents become 8192 blocks:

WAIT #140281084141152: nam='direct path read' ela= 7996378 file number=5 first dba=174468 block cnt=1020 obj#=76227 tim=1373104660882677
WAIT #140281084141152: nam='direct path read' ela= 7995820 file number=5 first dba=175489 block cnt=1023 obj#=76227 tim=1373104668882345
WAIT #140281084141152: nam='direct path read' ela= 7996472 file number=5 first dba=176520 block cnt=632 obj#=76227 tim=1373104676882677
WAIT #140281084141152: nam='direct path read' ela= 7998049 file number=5 first dba=177152 block cnt=1024 obj#=76227 tim=1373104684883512
WAIT #140281084141152: nam='direct path read' ela= 7995472 file number=5 first dba=178176 block cnt=1024 obj#=76227 tim=1373104692882932
WAIT #140281084141152: nam='direct path read' ela= 7993677 file number=5 first dba=179200 block cnt=1024 obj#=76227 tim=1373104700880106
WAIT #140281084141152: nam='direct path read' ela= 7996969 file number=5 first dba=180224 block cnt=1024 obj#=76227 tim=1373104708880891
WAIT #140281084141152: nam='direct path read' ela= 5998630 file number=5 first dba=181248 block cnt=1024 obj#=76227 tim=1373104714882889
WAIT #140281084141152: nam='direct path read' ela= 9996459 file number=5 first dba=182272 block cnt=1024 obj#=76227 tim=1373104724882545

I’ve included one IO of 1020 blocks, after which Oracle issues an IO request of 1023 blocks, then 632 blocks, and then it issues a number of requests for 1024 blocks.

It seems that the maximum number of blocks in a multiblock IO request done by Oracle “unbuffered”, or “direct path”, is 1024 blocks, which is 8MB! Please mind I have tested this ONLY with Oracle version 11.2.0.3 64 bits on Linux X64 on ASM, with a database and segment blocksize of 8kB.
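As a sanity check on the arithmetic (8 kB blocks, as stated above):

```python
# Observed maximum direct path read: 1024 blocks of 8 kB = 8 MB.
max_blocks = 1024
block_size = 8 * 1024            # database/segment blocksize used in this test
max_io_bytes = max_blocks * block_size
print(max_io_bytes // (1024 * 1024), "MB")
```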

There’s another peculiarity I want to point out. Let’s take a look at the dba and block cnt only, together with block_id and blocks (number of blocks in the extent, taken from dba_extents):

                                  BLOCK_ID	  BLOCKS         EXTENT_ID
dba=174468 block cnt=1020           174464          1024               197
dba=175489 block cnt=1023           175488          1024               198
dba=176520 block cnt=632            176512          8192               199
dba=177152 block cnt=1024
dba=178176 block cnt=1024
dba=179200 block cnt=1024
dba=180224 block cnt=1024
dba=181248 block cnt=1024
dba=182272 block cnt=1024
dba=183296 block cnt=1024 

There are two anomalies here. The first one is at dba 175489: extent id 198 seems to have only one BMB, so Oracle can read 1023 blocks of this extent, instead of the 1020 blocks read from the other extents of 1024 blocks.

The second anomaly is much more interesting: extent id 199 starts at block id 176512 and consists of 8192 blocks. The first IO starts at block id 176520. This means there are 176520-176512=8 blocks not read, which are probably BMB’s; that is in line with my expectation. What is odd is that the first IO issued for data blocks in this 8192-block extent is “only” 632 blocks!

After the oddly sized IO request at the beginning of the extent (632), the other IO requests are the full 1024 blocks, which probably is the maximum size of direct path Oracle IO requests.

I wondered about the oddly sized request at the beginning. I looked at the dba’s of the requests, and noticed the dba’s of the 1024 blocks sized requests looked vaguely familiar. Then I spotted the familiarity and a possible reason for the “small” request at the beginning of the extent:

SYS@v11203 AS SYSDBA> select mod(176520,1024) from dual;

MOD(176520,1024)
----------------
	     392

SYS@v11203 AS SYSDBA> select mod(177152,1024) from dual;

MOD(177152,1024)
----------------
	       0

SYS@v11203 AS SYSDBA> select mod(178176,1024) from dual;

MOD(178176,1024)
----------------
	       0

ONLY when the direct path read requests can potentially read “the full” 1024 blocks in one IO does Oracle first issue an IO to align the subsequent requests to a block id that is a multiple of 1024, after which it issues IO requests of 1024 blocks.
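The alignment can be verified with the numbers from the trace: the odd first IO is exactly the number of blocks needed to reach the next 1024-block boundary. A small Python check (dba and cnt taken from the trace above):

```python
# Extent 199 starts at block id 176512; the first 8 blocks are BMBs,
# so the first data block read starts at dba 176520.
first_dba = 176520
first_cnt = 632                  # the oddly sized first IO from the trace

# Blocks needed to land the next request on a multiple of 1024:
to_boundary = 1024 - (first_dba % 1024)
print(to_boundary)                       # matches the 632-block first IO
print((first_dba + first_cnt) % 1024)    # 0: the next request starts aligned
```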

I don’t know why the alignment takes place. I can imagine it lines up IOs with ASM, thus avoiding a single Oracle IO request touching more than one ASM allocation unit. But I am open to other people’s ideas and suggestions about potential reasons!

Stay tuned for further updates about extra huge database IOs!

It’s been a while since I presented the first incarnation of my ‘about multiblock reads’ presentation. When I did this at the UKOUG TEBS conference in Birmingham in 2011, Christian Antognini chaired my presentation. After my presentation Christian showed me it’s possible to set the parameter ‘db_file_multiblock_read_count’ higher than 1MB/db_block_size (which is 128 if your blocksize is 8kB), and you could benefit from it if your hardware is sufficient. In fact, Christian showed me AWR reports (could also be statspack reports, not sure) which showed the benefit.

My understanding of the parameter db_file_multiblock_read_count at the time was:

The maximum value is the operating system’s maximum I/O size expressed as Oracle blocks ((max I/O size)/DB_BLOCK_SIZE). If you set this parameter to a value greater than the maximum, Oracle uses the maximum.

This is taken from the Reference guide; the text is the same for 10.2, 11.2, and even 12.1.
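With the commonly assumed 1 MB maximum OS I/O size and an 8 kB blocksize, the documented formula works out as follows:

```python
# Documented maximum for db_file_multiblock_read_count:
# (maximum OS I/O size) / DB_BLOCK_SIZE
max_os_io = 1024 * 1024          # 1 MB, the assumed OS maximum I/O size
db_block_size = 8 * 1024         # 8 kB blocksize
print(max_os_io // db_block_size)   # the documented ceiling
```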

This is interesting, because the statement from the Oracle reference guide conflicts with Christian’s statement and measurements!

My test environment: Oracle 11.2.0.3 on Linux OL6u3 X64 with ASM, on my Macbook with SSD.

I’ve got a table created in a bigfile ASSM tablespace, system allocation type. This means objects get extents allocated which are increased progressively in size in steps:

 EXTENT_ID     BLOCKS
---------- ----------
	 0	    8
...
	15	    8
	16	  128
...
	78	  128
	79	 1024
...
	91	 1024

First let’s take a look at a full scan done “buffered”, i.e. with blocks read into the buffer cache, which is visible via the “db file scattered read” wait event. This is the “traditional” multiblock read.

I’ve flushed the buffer cache, set db_file_multiblock_read_count to 128, set event 10949 at level 1 and enabled SQL trace at level 8.
Event 10949 is needed to get a buffered multiblock read of a table which the Oracle database deems ‘big’ (to avoid non-buffered, alias direct path, reads).
Another thing: both ‘db file sequential read’ and ‘db file scattered read’ always show up (even if the physical read is very fast), so there’s no need to change anything (slow down I/O) to be sure we get these waits.

The statement executed is ‘select count(*) from t2′.

The first section, with extents of 8 blocks, shows the typical alternation of 8 and 7 blocks being read (after the initial smaller sizes), which is because of the space bitmap block in every other extent:

WAIT #140150651756760: nam='db file sequential read' ela= 272 file#=5 block#=23938 blocks=1 obj#=75579 tim=1372970940124856
WAIT #140150651756760: nam='db file scattered read' ela= 2162 file#=5 block#=23939 blocks=5 obj#=75579 tim=1372970940127442
WAIT #140150651756760: nam='db file scattered read' ela= 725 file#=5 block#=23944 blocks=8 obj#=75579 tim=1372970940128478
WAIT #140150651756760: nam='db file scattered read' ela= 144 file#=5 block#=23953 blocks=7 obj#=75579 tim=1372970940128908
WAIT #140150651756760: nam='db file scattered read' ela= 253 file#=5 block#=23960 blocks=8 obj#=75579 tim=1372970940129636

Then the extents of 128 blocks, which show 126 blocks being read because of 2 space bitmap blocks at the beginning of every extent of 128 blocks:

WAIT #140150651756760: nam='db file scattered read' ela= 3741 file#=5 block#=24066 blocks=126 obj#=75579 tim=1372970940139996
WAIT #140150651756760: nam='db file scattered read' ela= 955 file#=5 block#=24194 blocks=126 obj#=75579 tim=1372970940142894
WAIT #140150651756760: nam='db file scattered read' ela= 2916 file#=5 block#=24322 blocks=126 obj#=75579 tim=1372970940148287
WAIT #140150651756760: nam='db file scattered read' ela= 2111 file#=5 block#=24450 blocks=126 obj#=75579 tim=1372970940152527
WAIT #140150651756760: nam='db file scattered read' ela= 2952 file#=5 block#=24578 blocks=126 obj#=75579 tim=1372970940157492

Then the extents of 1024 blocks, which start off with 4 space bitmap blocks at the beginning of every extent:

WAIT #140150651756760: nam='db file scattered read' ela= 3079 file#=5 block#=36228 blocks=128 obj#=75579 tim=1372970941036955
WAIT #140150651756760: nam='db file scattered read' ela= 2737 file#=5 block#=36356 blocks=128 obj#=75579 tim=1372970941048199
WAIT #140150651756760: nam='db file scattered read' ela= 2531 file#=5 block#=36484 blocks=128 obj#=75579 tim=1372970941059473
WAIT #140150651756760: nam='db file scattered read' ela= 2912 file#=5 block#=36612 blocks=128 obj#=75579 tim=1372970941071015
WAIT #140150651756760: nam='db file scattered read' ela= 2768 file#=5 block#=36740 blocks=128 obj#=75579 tim=1372970941082467
WAIT #140150651756760: nam='db file scattered read' ela= 287 file#=5 block#=36868 blocks=128 obj#=75579 tim=1372970941091378
WAIT #140150651756760: nam='db file scattered read' ela= 2831 file#=5 block#=36996 blocks=128 obj#=75579 tim=1372970941102969
WAIT #140150651756760: nam='db file scattered read' ela= 536 file#=5 block#=37124 blocks=124 obj#=75579 tim=1372970941111811

We see 7 lines of 128 blocks read (7*128=896) and one line of 124 blocks (896+124=1020), which is 4 blocks short of the extent size of 1024; those 4 blocks are the space bitmap blocks.
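The block arithmetic of such an extent can be checked quickly:

```python
# Buffered scan of a 1024-block extent with db_file_multiblock_read_count=128:
# seven reads of 128 blocks plus one read of 124 blocks.
extent_blocks = 1024
reads = [128] * 7 + [124]
blocks_read = sum(reads)
bmb_blocks = extent_blocks - blocks_read
print(blocks_read, bmb_blocks)   # 1020 blocks read, 4 left for the BMBs
```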

Okay, nothing new here.

If I raise the parameter db_file_multiblock_read_count to 2000, and reissue the same ‘select count(*) from t2′, I get exactly the same waits. In other words: nothing different.

So the conclusion is: when using buffered reads, the statement about 1MB being the limit of the size of a single multiblock read indeed is true.

Now let’s move on to direct path reads. In short: when a table is deemed ‘big’ by the database, meaning the number of blocks occupied by the table is more than five times _small_table_threshold (which is 2% of the buffer cache by default), the database foreground processes will not read the table blocks into the buffer cache, but into the process’ PGA, to avoid discarding useful blocks from the cache and to avoid the overhead of concurrency control (latching). There are more criteria, but those are outside the scope of this blogpost.
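The size criterion alone can be sketched as a tiny function (the function name is mine, and the additional cache-content criteria mentioned above are deliberately left out):

```python
def direct_path_candidate(segment_blocks: int, small_table_threshold: int) -> bool:
    """Size criterion only: a segment qualifies for direct path reads when it
    occupies more than 5 * _small_table_threshold blocks. The real decision
    also considers how much of the segment is cached and how much is dirty."""
    return segment_blocks > 5 * small_table_threshold

# As an example: 1011 and 21504 are the _small_table_threshold and the
# size of table T2 from my test environment, shown later in this archive.
print(direct_path_candidate(21504, 1011))
```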

First let’s trace a session doing direct path reads. This is easier than the buffered situation above, because I can do the same as above but skip setting event 10949. On the other hand it’s harder: in order to see all the IOs using tracing (which means ‘get a wait for every IO’), I need to reduce the number of IOs (I use 1 IOPS), because otherwise some IOs will not get a wait and thus not show up in the tracefile (look at this article to learn how to reduce IOPS with Linux, and see my About multiblock reads presentation to learn why some waits for direct path read IOs might not show up).

So, here are snippets from the tracefile of a ‘select count(*) from t2′, with db_file_multiblock_read_count set to 128 and SQL trace at level 8:

First we see the reads in the small extents (8 blocks) from BMB to BMB, not stopping at the extent border:

WAIT #140402935204600: nam='direct path read' ela= 996117 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1373055770533891
WAIT #140402935204600: nam='direct path read' ela= 997014 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1373055771531222
WAIT #140402935204600: nam='direct path read' ela= 1000275 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1373055772531716
WAIT #140402935204600: nam='direct path read' ela= 999111 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1373055773531249
WAIT #140402935204600: nam='direct path read' ela= 999628 file number=5 first dba=24001 block cnt=15 obj#=75579 tim=1373055774531093

Then we see the reads of 126 blocks, of the extents of 128 blocks:

WAIT #140402935204600: nam='direct path read' ela= 1004881 file number=5 first dba=24066 block cnt=126 obj#=75579 tim=1373055778541646
WAIT #140402935204600: nam='direct path read' ela= 989042 file number=5 first dba=24194 block cnt=126 obj#=75579 tim=1373055779531980
WAIT #140402935204600: nam='direct path read' ela= 1002835 file number=5 first dba=24322 block cnt=126 obj#=75579 tim=1373055780541509
WAIT #140402935204600: nam='direct path read' ela= 989545 file number=5 first dba=24450 block cnt=126 obj#=75579 tim=1373055781531621
WAIT #140402935204600: nam='direct path read' ela= 1007730 file number=5 first dba=24578 block cnt=126 obj#=75579 tim=1373055782539986

And then the reads of the extents of the size of 1024 blocks:

WAIT #140402935204600: nam='direct path read' ela= 2000086 file number=5 first dba=36228 block cnt=124 obj#=75579 tim=1373055875542615
WAIT #140402935204600: nam='direct path read' ela= 1995550 file number=5 first dba=36480 block cnt=128 obj#=75579 tim=1373055877539193
WAIT #140402935204600: nam='direct path read' ela= 1999327 file number=5 first dba=36736 block cnt=128 obj#=75579 tim=1373055879539584
WAIT #140402935204600: nam='direct path read' ela= 1998883 file number=5 first dba=36992 block cnt=128 obj#=75579 tim=1373055881539584
WAIT #140402935204600: nam='direct path read' ela= 1998753 file number=5 first dba=37252 block cnt=124 obj#=75579 tim=1373055883539416

If you take a close look, you’ll see IO’s are still missing (just add cnt to dba and see if it fits the next dba; the exception is BMB’s, which are not read either, so they leave a gap). Nevertheless: this is as we expect.
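That check can be automated; the dba/cnt pairs below are taken from the 1024-block-extent trace above:

```python
# Each read should start where the previous one ended (dba + cnt);
# any difference is a gap: BMBs that are skipped, plus IOs that were
# too fast to produce a wait and are therefore missing from the trace.
reads = [(36228, 124), (36480, 128), (36736, 128), (36992, 128), (37252, 124)]

gaps = []
for (dba, cnt), (next_dba, _) in zip(reads, reads[1:]):
    gap = next_dba - (dba + cnt)
    gaps.append(gap)
    print(f"read at dba {dba} ends at {dba + cnt}, next starts at {next_dba}: gap {gap}")
```

The 128-block gaps are consistent with whole reads missing from the trace; the 132-block gap additionally covers the skipped BMBs.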

Now let’s do the same, but set db_file_multiblock_read_count to 2000! In order to reduce length: the IO’s of the 8 blocks and 128 blocks extents are identical. Let’s look at the 1024 blocks extent IOs:

WAIT #140550163651400: nam='direct path read' ela= 7862301 file number=5 first dba=32132 block cnt=1020 obj#=75579 tim=1373057289713781
WAIT #140550163651400: nam='direct path read' ela= 7996018 file number=5 first dba=33156 block cnt=1020 obj#=75579 tim=1373057297713687
WAIT #140550163651400: nam='direct path read' ela= 7996272 file number=5 first dba=34180 block cnt=1020 obj#=75579 tim=1373057305713868
WAIT #140550163651400: nam='direct path read' ela= 7995719 file number=5 first dba=35204 block cnt=1020 obj#=75579 tim=1373057313713547
WAIT #140550163651400: nam='direct path read' ela= 7996310 file number=5 first dba=36228 block cnt=1020 obj#=75579 tim=1373057321713593

Wow! These are HUGE IOs! The entire 1024-block extent is read in one (Oracle) IO, excluding the 4 BMB’s.

So this means Christian was right, and the Oracle documentation is wrong about the maximum size of db_file_multiblock_read_count for direct path reads.

There’s much more to be said about huge IOs, but I’ll leave that for another blogpost.

This is yet another blogpost on Oracle’s direct path read feature which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table are already in the buffercache.
- Less than 25% of the blocks in the buffercache are dirty.

Also (thanks to Freek d’Hooge who pointed me to an article from Tanel Poder) you can change the optimizer statistics to change the segment size for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the “code path”, so during execution.

So let’s take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM):

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we take the small table threshold and multiply it by five, we get 5055. This means the size of table T2 is more than enough, so it should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I’ve created a relatively big table and a (normal) index on it in the same database. The index is created on a single column, called ‘id’. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which read allocated adjacent blocks in order).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation	      | Name	   | Rows  | Bytes | Cost (%CPU)| Time	   |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	   |	 1 |	13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       | 	   |	 1 |	13 |		|	   |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |	34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

If we compare this to five times the small table threshold (5055), this index is much bigger than that. It is also bigger than table T2. Let’s execute select count(id) from bigtable, and look at the statistic ‘index fast full scans (direct read)':

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn’t been a direct path read! This means the read has been done in the “traditional way”. This is a bit… counter-intuitive. I’ve traced the session, and indeed it’s doing traditional multiblock reads, visible via the scattered read waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter “_very_large_object_threshold”. The information found on the internet reports this value is in megabytes. A quick stroll through a number of different databases (all on 11.2.0.3) shows this parameter is quite probably statically set at “500”.

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.

This is interesting, because it looks like this parameter does the same for full scans on index segments as “_small_table_threshold” does for full scans on table segments: the size of the segment to be scanned needs to be bigger than five times this value.

There are also differences: small table threshold is set in blocks, while (apparently) very large object threshold is set in megabytes. Also, small table threshold is by default set at 2% of the size of the buffercache (so it scales up with bigger caches), while very large object threshold seems to be fixed at 500. If my finding is correct, this means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It’s unknown to me whether the 50% limit for blocks in the cache and the 25% limit for dirty blocks apply here too.
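If my finding is correct, the index criterion can be sketched analogously to the table criterion (the function name and defaults are mine, based only on the observations above):

```python
def index_direct_path_candidate(segment_blocks: int,
                                block_size_kb: int = 8,
                                very_large_object_threshold_mb: int = 500) -> bool:
    """Apparent size criterion for direct path index fast full scans:
    the segment must be bigger than 5 * _very_large_object_threshold (MB)."""
    segment_mb = segment_blocks * block_size_kb / 1024
    return segment_mb > 5 * very_large_object_threshold_mb

# I_BIGTABLE: 72704 blocks * 8 kB = 568 MB, below the default 2500 MB limit:
print(index_direct_path_candidate(72704))
# Lowering _very_large_object_threshold to 100 gives a 500 MB limit:
print(index_direct_path_candidate(72704, very_large_object_threshold_mb=100))
```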

When an Oracle process starts executing a query and needs to do a full segment scan, it needs to decide whether to use ‘blockmode’, the normal way of working on non-Exadata Oracle databases, where blocks are read from disk and processed by the Oracle foreground process, either “cached” (read from disk into the database buffercache) or “direct” (read from disk into the process’ PGA), or ‘offloaded mode’, where part of the execution is done by the cell server.

The code layer where the Oracle database process initiates the offloading is ‘kcfis'; an educated guess is Kernel Cache File Intelligent Storage. Does a “normal” alias non-Exadata database ever use the ‘kcfis’ layer? My first guess would be ‘no’, but we all know guessing takes you nowhere (right?). Let’s see if a “normal” database uses the ‘kcfis’ functions on a Linux x64 (OL 6.3) system with Oracle 11.2.0.3 64 bit using ASM.

The only way to profile kernel functions that I am aware of is using ‘gdb’ and breaking on functions in the Oracle executable:
(the process id shown below ought to be the process id of the Oracle database process in which you are going to execute the query. Do not, I repeat: do not do this with other processes, especially ones that perform important tasks!)

# gdb -p 42
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) rbreak ^kcfis.*
Breakpoint 1 at 0x2204094
<function, no debug info> kcfis_get_sched_delay;
Breakpoint 2 at 0x220410a
<function, no debug info> kcfis_capability_tab_get;
Breakpoint 3 at 0x2204150
<function, no debug info> kcfis_can_session_migrate;
Breakpoint 4 at 0x2204190
<function, no debug info> kcfis_fob_storage_attr;
Breakpoint 5 at 0x22041d0
<function, no debug info> kcfis_init_resource_limit;
Breakpoint 6 at 0x22041f2
<function, no debug info> kcfis_tablespace_is_on_sage;
...
(gdb) c
Continuing.

Okay, we got the debugger set, now let’s execute a simple query (doing a full scan) to see if kcfis is touched on a “normal” server or not!

TS@v11203 > select count(*) from t2;

I am on OS-X (Apple) using iTerm, and I see the tab of the gdb session turning red: gdb generated some output on the screen!

(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) 

So, we hit a kcfis function! Let me state upfront that I do not have source code of any kind, so my statements about what a function does are educated guesses. Anyway: from its name, kcfis_tablespace_is_on_sage looks like a function in the kcfis layer which determines if a tablespace is on an Exadata storage server (Exadata’s internal name is/was ‘sage’). Nice, so we hit the layer that determines whether the tablespace is on Exadata.

Let’s continue the profiling with gdb:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000220410a in kcfis_capability_tab_get ()
(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) c
Continuing.

We see another call to the function ‘kcfis_tablespace_is_on_sage’, and a call to ‘kcfis_capability_tab_get’. The latter function probably probes the table (but could be the tablespace) to get its capabilities. This could be the function which checks the requirements for hybrid columnar compression, but I am not sure about that.

At what point during the processing of the full segment scan does the kcfis_tablespace_is_on_sage occur? One way of investigating this, is profiling some functions we know a (direct path) full scan does, and see where the kcfis_tablespace_is_on_sage kicks in. When the buffer cache is flushed prior to executing a full scan, and the SQL is made unique, so it has to be parsed, the following sequence of events happens:

- Parse
- Execute
- A ‘SQL*Net message to client’ wait
- A ‘db file sequential read’ wait (for reading the segment header)
- Potentially an ‘asynch descriptor resize’ wait
- The full scan is done asynchronously (potentially revealing some ‘direct path read’ waits)

So if we profile on start and end of a wait (kslwtbctx and kslwtectx), a single block read (pread64), AIO (io_submit and io_getevents_0_4) and of course kcfis_tablespace_is_on_sage, we should be able to see that:

Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, pread64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 6, io_getevents_0_4 (ctx=0x7fb42f475000, min_nr=2, nr=128, events=0x7fffb4c66768, timeout=0x7fffb4c67770) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

So what do we see here?
Row 1: the wait ending for ‘SQL*Net message from client’.
Rows 2-3: the ‘SQL*Net message to client’ wait.
Rows 5-7: the begin of a wait, pread64 for reading the segment header, and the end of the wait.
Rows 8-9: the session probes for Exadata.
Rows 10-11: start and end of the ‘asynch descriptor resize’ wait (verified by the trace file).
Row 12 onwards: the full scan takes off.

So…a query starts (I’ve made it unique by adding a dummy hint, so it’s parsed again) the SQL*Net roundtrip occurs, the segment header is read, then the process looks if it’s on Exadata, which it isn’t here, and starts an asynchronous full scan.

So let’s repeat this check on a database which is on Exadata! In order to do so, we need to be aware that several calls are not done on Exadata: pread64, io_submit and io_getevents_0_4, because we do not do local IO, but issue it over infiniband. So I set breakpoints on kslwtbctx, kslwtectx and kcfis_tablespace_is_on_sage only. This is how that looks:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000905cf62 in kslwtectx ()
Breakpoint 3, 0x0000000002230366 in kcfis_tablespace_is_on_sage ()

Mmmh, this is different. The ‘SQL*Net message from client’ wait ends, and before any other wait occurs, the existence of exadata is checked. This is different from the non-exadata case. Let’s take a look at the backtrace of the break on kcfis_tablespace_is_on_sage:

(gdb) bt
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#3  0x0000000001a6111d in kkmfcblo ()
#4  0x000000000922f76d in kkmpfcbk ()
#5  0x000000000942e538 in qcsprfro ()
#6  0x000000000942de29 in qcsprfro_tree ()
#7  0x000000000942de6e in qcsprfro_tree ()
#8  0x0000000002dd80c5 in qcspafq ()
#9  0x0000000002dd51d9 in qcspqbDescendents ()
#10 0x0000000002dd91e4 in qcspqb ()
#11 0x0000000001a6b2be in kkmdrv ()
#12 0x0000000002584c76 in opiSem ()
#13 0x000000000258ac8b in opiDeferredSem ()
#14 0x000000000257dc32 in opitca ()
#15 0x0000000001ec3d7d in kksFullTypeCheck ()
#16 0x00000000092a7256 in rpiswu2 ()
#17 0x0000000001eca977 in kksLoadChild ()
#18 0x0000000009298448 in kxsGetRuntimeLock ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
#23 0x0000000001b82a46 in kpooprx ()
#24 0x0000000001b80d2c in kpoal8 ()
#25 0x00000000091fb8b8 in opiodr ()
#26 0x000000000939e696 in ttcpip ()
#27 0x000000000180f011 in opitsk ()
#28 0x0000000001813c0a in opiino ()
#29 0x00000000091fb8b8 in opiodr ()
#30 0x000000000180af4c in opidrv ()
#31 0x0000000001e0a77b in sou2o ()
#32 0x0000000000a0cc05 in opimai_real ()
#33 0x0000000001e106ec in ssthrdmain ()
#34 0x0000000000a0cb71 in main ()

What is interesting to see is backtrace frame #21: kksParseCursor. So the detection of storage servers already happens during parsing, not at the point where a full segment scan starts and the decision for a smartscan has to be made.

With Exadata version 11.2.3.2.0 came the Unbreakable Linux Kernel for Exadata; prior to this version it had been the stock EL5 Red Hat kernel (2.6.18). With the unbreakable kernel came the opportunity to run the perf utility. This utility can show which functions are active inside an executable, provided there is a symbol table. And the oracle database executable has a symbol table! One reason to do this is to get a more granular overview of what the Oracle database is doing than the wait interface provides, especially a more detailed overview of the time that is visible in the wait interface as ‘on CPU’.

Right after the Exadata upgrade, I ran a simple query (which probably doesn’t reflect any real customer case) to get an idea. I have run ‘select count(*) from bigtable’ on Exadata before, and saw most of the time being CPU, and a minor part being ‘cell smart table scan’. Now with perf I have the opportunity to get more detail on where the time on CPU is spent!

These are the top 5 functions from that session:

    31.50%     476423   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    30.20%     456774   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
     7.48%     113083   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     6.96%     105301   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.94%      44475   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

So the top function used during the processing of the SQL is a function called ‘sxorchk’. Sadly, Oracle doesn’t provide information about it. It executes an XOR function to check information received from storage; the name probably stands for System XOR CHecK.

Alright, once we know this we can look at the parameters of the instance in which I am executing the SQL, which are taken from the default Exadata parameter template (!!):

SYS@t11203 AS SYSDBA> show parameter db_block

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_block_buffers		     integer	 0
db_block_checking		     string	 FALSE
db_block_checksum		     string	 TYPICAL
db_block_size			     integer	 8192

So, this should be related to db_block_checksum, since db_block_checking is off/false. Well, since this is a test instance, let’s turn it off and redo the scan:

    49.35%     480911   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
    11.01%     107299   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     6.56%      63885   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     3.97%      38712   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     3.49%      33970   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdst_fetch

The sxorchk function is gone now! This is also reflected in the response time: with db_block_checksum set to typical it is 00:01:02.44, and with db_block_checksum set to false it is 00:00:41.68 on my system. That is a difference of roughly 20 seconds, or, as we can see from the first perf function list: 31.5% of the time. That is a significant amount of time!

When I discussed this with Tuomas Pystynen, he asked me a very valid question: if this is a smartscan, the database does not get blocks, it gets result-sets. So blocks cannot be checked on the database layer. Is this offloaded to the storage/cell server?

Well, if it were offloaded to the storage, it would not be reflected in a function on the database layer. In other words: something is XOR’ed by the database foreground process, controlled by db_block_checksum (which can be set and reset)! The logical conclusion from these facts would be that blocks are involved in some way…

There is a way to know what is actually happening: looking at the backtrace of the sxorchk function! Let’s do that!

    31.50%   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.76%-- kcbhxoro
                |          kcbhvbo
                |          |          
                |          |--100.00%-- kcbhvb_corruptcheck
                |          |          kcbl_sage_block_check
                |          |          kcfis_oss_block_verify
                |          |          kcfis_validate_translation
                |          |          kcfis_process_completed_req
                |          |          kcfis_process_completed_buffer
                |          |          kcfis_process_reaped_io
                |          |          kcfis_read
                |          |          kcbl_predpush_get
                |          |          kcbldrget
                |          |          kcbgtcr
                |          |          ktrget3
                |          |          ktrget2
                |          |          kdst_fetch

What we see here is that the function is called from the kcfis (Kernel Cache File Intelligent Storage is my guess) layer via a function called ‘kcfis_oss_block_verify’; in other words: a block, which actually is the result-set sent by the cell server, is validated/checked. That is understandable, but using the parameter ‘db_block_checksum’ to control this is misleading, to put it nicely.

Next question which I asked myself is: how about a normal/non-Exadata database? Well, I can mimic a non-Exadata database by setting the parameter ‘cell_offload_processing’ to false to disable smartscans!

This is how the top-5 functions look with db_block_checksum set to true, without smartscan:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    18.53%     353741   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.05%     191847   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.35%     102161   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.73%      52103   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

And this is how the top-5 functions look with db_block_checksum set to false, without smartscan:

    36.51%     706798   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.47%     202645   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.58%     107941   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     3.57%      69044   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     2.38%      46036   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] ktrgcm

If we get the backtrace of the sxorchk function without smartscan enabled:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.79%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch

We can see the sxorchk function is called from kcbldrget (the direct path read function), with more or less the same helper functions preparing for the XOR check, and no other layers (like kcfis in the backtrace with smartscan enabled).

My conclusion is: db_block_checksum is a parameter which enables checking of the data received from “storage”, which apparently is done via an XOR function. On a regular/non-Exadata system this checking is done on blocks, as the parameter name suggests, but on Exadata with a smartscan the checking is still done, regardless of the fact that result-sets, not blocks, are received from “storage”.

The checking takes a significant portion of time during processing of my (very simple!) query. Other functions can probably make the overall processing more CPU intensive, which would make the relative portion of time spent on checking smaller.

With Exadata/smartscan the relative amount of time spent on sxorchk with my simple SQL is 32%; without smartscan it drops to 21%. That is still a significant amount of time (and in both cases it is the function the most time is spent in!).

The final conclusion is that you should think about the setting of this parameter if you are doing a lot of physical IO, and set it according to the needs of the database.

One addition: this is done on a V2 half rack version of Exadata (Nehalem CPU), so anyone with X2-[28] or X3-[28] is welcome to profile a ‘select count(*) from bigtable’ and post it here to see if there’s any difference. The database version is 11.2.0.3 BP15.

In my previous post I touched on the topic of a “new” codepath (codepath meaning “way of utilising the operating system”) for full segment scans. A full segment scan means scanning a segment (I use “segment” because it could be a partition of a table or index), which is commonly visible in an explain plan (but not restricted to these) as the rowsources “TABLE ACCESS FULL”, “FAST FULL INDEX SCAN” and “BITMAP FULL SCAN”.

Look at my presentation About multiblock reads to see how and when direct path reads kick in, and what the difference between the two is. Most notably, Oracle has released very little information about asynchronous direct path reads.

This post is about the implementation of direct path reads in the Oracle executable in 11.2.0.3 and the OS implementation on Linux.
For completeness: this is Oracle Linux 6.3 X64 with kernel 2.6.39-300.17.3.el6uek.x86_64. Database 11.2.0.3 64 bit, no PSU/CPU.
The database uses two ASM disks and the clusterware in single machine mode.
The system is running in a VM on my Macbook OSX 10.8.2 with SSD, VMWare Professional 5.0.2.

The database action which is traced with gdb is a simple “select count(*) from t2”, where t2 is a table without any indexes or constraints, big enough to make the database engine choose a direct path read. Oracle can choose a number of new, largely undocumented mechanisms for optimising direct path reads specifically. The mechanism observed in this blogpost is true asynchronous reads, which means the foreground process can issue several IO read requests and reap them once they are finished, increasing the number of simultaneous IO requests.

1. Observing the IO requests: getting the picture straight

First we restore the setup from my previous post, which means a normal database process (using sqlplus), and a root session with gdb attached to this process using “gdb -p “

Now set breakpoints for the operating system IO calls (with the intention to see io_submit() and io_getevents()) to show the IO calls for the oracle foreground process, and let the oracle process continue processing by using “c” in the debugger:

(gdb) rbreak ^io_.*
Breakpoint 1 at 0xa08c20
 io_prep_pwritev;
Note: breakpoint 1 also set at pc 0xa08c20.
...
Breakpoint 45 at 0x7f2e71b1dc0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 1-45, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now issue the full segment scan in the Oracle session, and look at the output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()

First thing that comes to mind is: why TWO calls for io_submit (io_submit@plt () and io_submit () from /lib64/libaio.so.1), and only one call for io_getevents (io_getevents@plt ())?

This is where ‘plt’ needs an introduction. ‘plt’ means procedure linkage table. It is a construction for functions used by an executable that uses dynamically linked shared libraries, which is exactly what the ‘oracle’ executable is. If you look at the address of the @plt call, you see it is in the oracle executable (see /proc/<pid>/maps), and io_submit () from /lib64/libaio.so.1 is, as the line indicates, from the dynamically linked library libaio.so.1 (which can also be seen in /proc/<pid>/maps).

Back to io_getevents. We see only the io_getevents@plt call, which means that either the executable fakes the system call (?), or we are missing something. This led me to investigate the libaio library itself. This can be done using ‘nm -D’ (nm lists the symbols in an object file):

# nm -D /lib64/libaio.so.1
0000000000000000 A LIBAIO_0.1
0000000000000000 A LIBAIO_0.4
0000003f38200710 T io_cancel
0000003f38200670 T io_cancel
0000003f38200690 T io_destroy
0000003f382006a0 T io_getevents
0000003f38200620 T io_getevents
0000003f38200570 T io_queue_init
0000003f38200590 T io_queue_release
0000003f382005b0 T io_queue_run
0000003f382005a0 T io_queue_wait
0000003f382006e0 T io_queue_wait
0000003f38200680 T io_setup
0000003f38200660 T io_submit

Aha! There are TWO io_getevents entries here! Now THAT is interesting!

Let’s see which io_getevents call or calls are used by the oracle executable:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) rbreak ^io_get.*
Breakpoint 46 at 0x3f382006a0
 io_getevents;
...
Breakpoint 53 at 0xa09030
 io_getevents@plt;

And list the breakpoints:

(gdb) info break
Num     Type           Disp Enb Address            What
46      breakpoint     keep y   0x0000003f382006a0 io_getevents
47      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
48      breakpoint     keep y   0x0000003f382006a0 io_getevents
49      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
50      breakpoint     keep y   0x0000003f382006a0 io_getevents
51      breakpoint     keep y   0x0000003f382006a0 io_getevents
52      breakpoint     keep y   0x0000003f382006a0 io_getevents
53      breakpoint     keep y   0x0000000000a09030 io_getevents@plt

So, all the io_getevents breakpoints are set on address 0x0000003f382006a0! If we look back at the libaio contents, we see the io_getevents functions have addresses 0x0000003f382006a0 and 0x0000003f38200620.

Let’s make gdb break and continue at the second io_getevents address too!

(gdb) break *0x0000003f38200620
Breakpoint 54 at 0x3f38200620
(gdb) commands
Type commands for breakpoint(s) 54, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now run the full segment scan in the sqlplus session again, and see what gdb shows:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1

Okay! We got the system call back!

So, we now have a set of breakpoints in gdb which is able to show us the asynchronous operating system calls Oracle uses in a scan: io_submit and io_getevents. Oracle uses more calls for managing asynchronous IO, but for clarity I focus on io_submit and io_getevents.

2. Add wait events to direct path read profile

At this point I can add two breakpoints for the wait instrumentation (kslwtbctx (enter wait) and kslwtectx (end wait)) using:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 55 at 0x8f9a652
 kslwtbctx;
Breakpoint 56 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 55-56, one per line.
End with a line saying just "end".
>silent
>f
>c
>end

If I rerun the full scan, I can see there are no waits covering the IO related calls, just io_submit, then io_getevents, which I can tell because I don’t see kslwtbctx and kslwtectx around the IO related calls. I see them AFTER the IO is done, which is normal, expected behavior. Please mind the cause of this is that this is a VM where the disks are on an SSD; most probably the disk contents (each disk is a file with VMWare Fusion) are in the cache of my MacBook, which makes IO very fast.

As introduced in the previous post, the last function call seen is the entering of a wait using ‘kslwtbctx’, which is for the event ‘SQL*Net message from client’: the foreground waiting for user input to process:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

Now let’s see what happens if I slow the IOs down using cgroups.

I slowed down both my (ASM-)disks to 1 IOPS, and ran the scan in sqlplus again. This is a snippet of the gdb output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008fa1334 in kslwtectx ()

Actually, the above is the typical pattern I see: a submitted IO via io_submit, four io_getevents calls, the registration of a wait event, then another io_getevents call, and the ending of the wait event, provided the IO isn’t too fast. (In real life it all depends on the speed of the IO beneath the operating system, and mostly it is a mixture of this and the previous gdb output.)

The pattern can take multiple forms. For example, when the scan starts, after the initialisation of the asynchronous IO dependencies (like an IO context on the OS and the IO slots in Oracle), the first calls are two io_submit calls (in order to get the minimal number of asynchronous IOs (2) in flight).

Apparently, the four io_getevents calls after io_submit are non-blocking. This is my assumption, simply because I see these calls scrolling over my screen until the wait event is registered, and the scrolling stops at another io_getevents call. But can we be sure?

3. Digging deeper with debuginfo

In order to look deeper into the io_getevents calls, specifically into the arguments of the io_getevents function, the debuginfo package of libaio needs to be installed (the debuginfo packages are available at http://oss.oracle.com/ol6/debuginfo). After that, the gdb session needs to be restarted in order to pick up the debug information.

This is how the gdb output looks with the libaio debuginfo installed:

#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 io_submit (ctx=0x7ff6ceb2c000, nr=1, iocbs=0x7fff2a4e09e0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e9048, timeout=0x7fff2a4ea050) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ec128, timeout=0x7fff2a4ed130) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e8e48, timeout=0x7fff2a4e9e50) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ebf28, timeout=0x7fff2a4ecf30) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=1, nr=128, events=0x7fff2a4e8e38, timeout=0x7fff2a4e9e40) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008fa1334 in kslwtectx ()

Now we see a LOT more interesting information!

First of all, all the IO calls involved are done in the same “aio_context”: ctx=0x7ff6ceb2c000. This means that any asynchronous IO that is done during this query reports its readiness here, and thus that IOs can be reaped out of order(!).

Next is min_nr. The non-instrumented io_getevents calls have min_nr set to 2, the io_getevents call which is instrumented has min_nr set to 1.

nr is the maximum number of IO’s that can be reaped by this call in this aio_context. As far as I know, the number of slots cannot be higher than 32, which means that if I am right, there will not be more than 32 requests.

The last io_getevents argument is timeout, which is really interesting for understanding and verifying the behavior I described. This value is a pointer to a struct which holds the timeout specification. In order to actually know the timeout value, we need to print the contents of the struct.

This is where gdb, once the debuginfo is available, can help. Let’s modify the breakpoints in the following way:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) break *0x0000003f38200620
Breakpoint 38 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 38, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 39 at 0x8f9a652
 kslwtbctx;
Breakpoint 40 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 39-40, one per line.
End with a line saying just "end".
>silent
>f
>end
(gdb) rbreak ^io_.*
Breakpoint 41 at 0x3f38200570: file io_queue_init.c, line 28.
int io_queue_init(int, io_context_t *);
...
Breakpoint 74 at 0x7f549fd44c0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 41-74, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Notice the “print *timeout” with breakpoint 38 (*0x0000003f38200620), a.k.a. the “correct” io_getevents call.
Using “print *timeout” we let gdb follow the pointer and print the contents of the struct. Let’s see if my assumption about the non-instrumented io_getevents calls is correct:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  io_submit (ctx=0x7f54a1956000, nr=1, iocbs=0x7fff78f059b0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0dfd8, timeout=0x7fff78f0efe0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$21 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f110b8, timeout=0x7fff78f120c0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0ddd8, timeout=0x7fff78f0ede0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f10eb8, timeout=0x7fff78f11ec0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=1, nr=128, events=0x7fff78f0ddc8, timeout=0x7fff78f0edd0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 600, tv_nsec = 0}

Yes! What I saw on my screen in gdb is now proven by printing the timeout struct!! The non-instrumented io_getevents calls just peek in the completion queue (ctx) four times to see if min_nr (3 in this case) IO requests are ready, then a wait event is registered and another io_getevents call is issued, looking in the same completion queue (ctx), but now for just 1 ready IO request, with a timeout of 600 seconds.

4. Conclusion

The reason for this post is to show the essence of how direct path reads work. There is much more to be said on this subject, especially about the adaptive or “auto tune” mechanism, which scales up the number of asynchronous IOs in flight.
