This is yet another blogpost on Oracle’s direct path read feature which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table are already in the buffercache.
- Less than 25% of the blocks in the buffercache are dirty.
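The three conditions above can be sketched as a small Python function. This is only an illustration of the rule as I understand it from my tests, not Oracle’s actual code: the function name, the parameter names and the idea of passing the cached and dirty shares as fractions are all made up for this sketch.

```python
def is_direct_path_candidate(segment_blocks, small_table_threshold,
                             cached_fraction, dirty_fraction):
    """Return True when all three observed conditions hold (illustrative only)."""
    # Condition 1: segment is bigger than 5 * _small_table_threshold (in blocks).
    big_enough = segment_blocks > 5 * small_table_threshold
    # Condition 2: less than 50% of the segment's blocks are already cached.
    mostly_uncached = cached_fraction < 0.50
    # Condition 3: less than 25% of the blocks are dirty.
    mostly_clean = dirty_fraction < 0.25
    return big_enough and mostly_uncached and mostly_clean


# Block count and threshold are the values from the lab environment in this
# post; the 0.0 fractions are an assumption (nothing cached, nothing dirty).
print(is_direct_path_candidate(21504, 1011, 0.0, 0.0))
```

With a freshly flushed buffercache only the size condition matters, which is the situation tested in the rest of this post.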

Also (thanks to Freek d’Hooge who pointed me to an article from Tanel Poder) you can change the optimizer statistics to change the segment size for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the “code path”, so during execution.

So let’s take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM)

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we take the small table threshold and multiply it by five, we get 5055. At 21504 blocks, table T2 is well above that, so it should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I’ve created a relatively big table and created a (normal) index on it in the same database. The index is created on a single column, called ‘id’. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan is a scan which just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which reads in the order of allocated adjacent blocks).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |     1 |    13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       |            |     1 |    13 |            |          |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |    34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

Five times the small table threshold is 5055 blocks, and this index is much bigger than that. It is also bigger than table T2. Let’s execute select count(id) from bigtable, and look at the statistic ‘index fast full scans (direct read)’:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn’t been a direct path read! This means that this read has been done in the “traditional way”. This is a bit…counterintuitive. I’ve traced the session, and indeed it’s doing the traditional multiblock reads via the scattered read waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter “_very_large_object_threshold”. The information found on the internet reports this value is in megabytes. A quick stroll through a number of different databases (all on 11.2.0.3) shows this parameter is quite probably statically set at “500”.

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.

This is interesting, because it looks like this parameter does the same for full scans on index segments as “_small_table_threshold” does for full scans on table segments: the size of the segment to be scanned needs to be bigger than five times the threshold.

There are also differences: small table threshold is set in blocks, (apparently) very large object threshold is set in megabytes. Also, small table threshold is set by default at 2% of the size of the buffercache (so it scales up with bigger caches), very large object threshold seems to be fixed at 500. If my finding is correct, then it means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It’s unknown to me if the 50% limit for blocks in the cache and the 25% limit for dirty blocks is subject to this too.
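To make the arithmetic explicit, here is a minimal Python sketch of my hypothesis for index segments. It assumes an 8K block size (which is what the 568M figure implies) and that the rule really is “bigger than five times _very_large_object_threshold, in megabytes”; the function names are made up and this is not Oracle’s code.

```python
BLOCK_SIZE = 8192  # assumption: 8K database blocks


def segment_size_mb(blocks, block_size=BLOCK_SIZE):
    """Segment size in megabytes for a given number of blocks."""
    return blocks * block_size / (1024 * 1024)


def index_direct_path_candidate(blocks, very_large_object_threshold_mb=500,
                                multiplier=5):
    """Hypothesis: the index must exceed multiplier * threshold (in MB)."""
    return segment_size_mb(blocks) > multiplier * very_large_object_threshold_mb


print(segment_size_mb(72704))                   # size of I_BIGTABLE in MB
print(index_direct_path_candidate(72704))       # default threshold of 500
print(index_direct_path_candidate(72704, 100))  # threshold lowered to 100
```

Under this hypothesis the index is below the limit of 2500M with the default setting and above the limit of 500M with the parameter set to 100, which matches what I observed.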

When an Oracle process starts executing a query and needs to do a full segment scan, it has to decide between two modes. The first is ‘blockmode’, the normal way of working on non-Exadata Oracle databases, where blocks are read from disk and processed by the Oracle foreground process, either “cached” (read from disk and put in the database buffercache) or “direct” (read from disk and put in the process’ PGA). The second is ‘offloaded mode’, where part of the execution is done by the cell server.

The code layer where the Oracle database process initiates the offloading is ‘kcfis’; an educated guess is Kernel Cache File Intelligent Storage. Does a “normal” (that is, non-Exadata) database ever use the ‘kcfis’ layer? My first guess would be ‘no’, but we all know guessing takes you nowhere (right?). Let’s see if a “normal” database uses the ‘kcfis’ functions on a Linux x64 (OL 6.3) system with Oracle 11.2.0.3 64 bit using ASM.

The only way to profile Oracle kernel functions that I am aware of is using ‘gdb’ and breaking on functions in the Oracle executable:
(the process id shown below ought to be the process id of the Oracle database process you are going to execute the query in. Do not, I repeat: do NOT do this with other processes, especially the ones that do important tasks!)

# gdb -p 42
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) rbreak ^kcfis.*
Breakpoint 1 at 0x2204094
<function, no debug info> kcfis_get_sched_delay;
Breakpoint 2 at 0x220410a
<function, no debug info> kcfis_capability_tab_get;
Breakpoint 3 at 0x2204150
<function, no debug info> kcfis_can_session_migrate;
Breakpoint 4 at 0x2204190
<function, no debug info> kcfis_fob_storage_attr;
Breakpoint 5 at 0x22041d0
<function, no debug info> kcfis_init_resource_limit;
Breakpoint 6 at 0x22041f2
<function, no debug info> kcfis_tablespace_is_on_sage;
...
(gdb) c
Continuing.

Okay, we got the debugger set, now let’s execute a simple query (doing a full scan) to see if kcfis is touched on a “normal” server or not!

TS@v11203 > select count(*) from t2;

I am on OS-X (Apple) using iTerm, and I see the tab of the gdb session turning red: gdb generated some output on the screen!

(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) 

So, we hit a kcfis function! Let me put upfront that I do not have source code of any kind, so my statements about what a function does are actually guesses. Anyway: from the name of the function (kcfis_tablespace_is_on_sage) it looks like a function in the kcfis layer which determines if a tablespace is on an Exadata storage server (Exadata’s internal name is/was ‘sage’). Nice. So we hit the layer that determines if the tablespace is on Exadata.

Let’s continue the profiling with gdb:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000220410a in kcfis_capability_tab_get ()
(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) c
Continuing.

We see another call to the function ‘kcfis_tablespace_is_on_sage’, and a call to ‘kcfis_capability_tab_get’. The last function probably tries to probe the table (but could be tablespace) to get the capabilities. This could be the function which checks the requirements for hybrid columnar compression, I am not sure about that.

At what point during the processing of the full segment scan does the kcfis_tablespace_is_on_sage occur? One way of investigating this, is profiling some functions we know a (direct path) full scan does, and see where the kcfis_tablespace_is_on_sage kicks in. When the buffer cache is flushed prior to executing a full scan, and the SQL is made unique, so it has to be parsed, the following sequence of events happens:

- Parse
- Execute
- A ‘SQL*Net message to client’ wait
- A ‘db file sequential read’ wait (for reading the segment header)
- Potentially a ‘asynch descriptor resize’ wait
- The full scan is done asynchronously (potentially revealing some ‘direct path read’ waits)

So if we profile on start and end of a wait (kslwtbctx and kslwtectx), a single block read (pread64), AIO (io_submit and io_getevents_0_4) and of course kcfis_tablespace_is_on_sage, we should be able to see that:

Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, pread64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 6, io_getevents_0_4 (ctx=0x7fb42f475000, min_nr=2, nr=128, events=0x7fffb4c66768, timeout=0x7fffb4c67770) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

So what do we see here?
Row 1 : Here the wait ending for ‘SQL*Net message from client’.
Row 2-3 : This is the ‘SQL*Net message to client’ wait.
Row 4-7 : This is the begin wait (row 4), the pread64 for reading the segment header (rows 5 and 6) and the end wait (row 7).
Row 8&9 : The session probes for Exadata.
Row 10-11 : Start and end of a wait, ‘asynch descriptor resize’ (verified by trace file).
Row 12- : Here the full scan takes off.

So…a query starts (I’ve made it unique by adding a dummy hint, so it’s parsed again) the SQL*Net roundtrip occurs, the segment header is read, then the process looks if it’s on Exadata, which it isn’t here, and starts an asynchronous full scan.

So let’s repeat this check on a database which is on Exadata! In order to do so, we need to be aware that several calls are not done on Exadata: pread64, io_submit and io_getevents_0_4, because we do not do local IO, but issue the IO requests over infiniband. So I set breakpoints on kslwtbctx, kslwtectx and kcfis_tablespace_is_on_sage. This is what that looks like:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000905cf62 in kslwtectx ()
Breakpoint 3, 0x0000000002230366 in kcfis_tablespace_is_on_sage ()

Mmmh, this is different. The ‘SQL*Net message from client’ wait ends, and before any other wait occurs, the existence of exadata is checked. This is different from the non-exadata case. Let’s take a look at the backtrace of the break on kcfis_tablespace_is_on_sage:

(gdb) bt
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#3  0x0000000001a6111d in kkmfcblo ()
#4  0x000000000922f76d in kkmpfcbk ()
#5  0x000000000942e538 in qcsprfro ()
#6  0x000000000942de29 in qcsprfro_tree ()
#7  0x000000000942de6e in qcsprfro_tree ()
#8  0x0000000002dd80c5 in qcspafq ()
#9  0x0000000002dd51d9 in qcspqbDescendents ()
#10 0x0000000002dd91e4 in qcspqb ()
#11 0x0000000001a6b2be in kkmdrv ()
#12 0x0000000002584c76 in opiSem ()
#13 0x000000000258ac8b in opiDeferredSem ()
#14 0x000000000257dc32 in opitca ()
#15 0x0000000001ec3d7d in kksFullTypeCheck ()
#16 0x00000000092a7256 in rpiswu2 ()
#17 0x0000000001eca977 in kksLoadChild ()
#18 0x0000000009298448 in kxsGetRuntimeLock ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
#23 0x0000000001b82a46 in kpooprx ()
#24 0x0000000001b80d2c in kpoal8 ()
#25 0x00000000091fb8b8 in opiodr ()
#26 0x000000000939e696 in ttcpip ()
#27 0x000000000180f011 in opitsk ()
#28 0x0000000001813c0a in opiino ()
#29 0x00000000091fb8b8 in opiodr ()
#30 0x000000000180af4c in opidrv ()
#31 0x0000000001e0a77b in sou2o ()
#32 0x0000000000a0cc05 in opimai_real ()
#33 0x0000000001e106ec in ssthrdmain ()
#34 0x0000000000a0cb71 in main ()

What is interesting to see is line 23, backtrace frame number 21: kksParseCursor. So the detection of storage servers actually happens during parsing already, not at the moment the process starts a full segment scan and needs to decide whether or not to do a smartscan.
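The same check can be done mechanically. This is a minimal Python sketch that parses gdb ‘bt’ output and looks up the frame number of a given function; the helper names are my own, and the backtrace below is only an excerpt of the one above (frames 0 to 2 and 19 to 22), so treat it as an illustration.

```python
import re

# Matches frame lines such as "#21 0x0000000009254e6a in kksParseCursor ()".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\w+) \(\)")

BACKTRACE_EXCERPT = """\
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
"""


def parse_backtrace(text):
    """Return a list of (frame_number, function_name) tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def frame_of(frames, function):
    """Return the frame number of the first frame with this name, or None."""
    for number, name in frames:
        if name == function:
            return number
    return None


frames = parse_backtrace(BACKTRACE_EXCERPT)
called_during_parse = frame_of(frames, "kksParseCursor") is not None
print(frame_of(frames, "kksParseCursor"), called_during_parse)
```

If kksParseCursor is anywhere below the kcfis call in the stack, the storage check was reached from the parse code path.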

This is just a very small post on how to watch the progress of the “CopyBack” state of a freshly inserted disk in an Exadata “Computing” (database) node. A disk failed in the (LSI Hardware) RAID5 set, and the hotspare disk was automatically used. The failed disk was replaced, and we are now awaiting the intermediate “CopyBack” phase.

The current state of the disks is visible using the following command:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Copyback
Device Firmware Level: 0D70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: 0B70

But what is the progress? Some googling came up with (almost) the answer (http://trac.camsentry.com/wordpress/tag/ldinfo/), I modified it a tiny bit to make it refresh:

while $(true); do /opt/MegaRAID/MegaCli/MegaCli64 adpeventlog getlatest 200 -f ~/adpeventlog.txt a0; awk '/^Time/{TIME=$0};/Seconds/{SECS=$5}/^Event Desc/{printf("%25.25s %5.5s %s\n",TIME,SECS,$0);TIME=" ";SECS=""}' ~/adpeventlog.txt|grep -v fan|tac; sleep 5; done

This will refresh every 5 seconds and show the progress of the CopyBack, and the state changes that follow once the CopyBack has finished.

With Exadata version 11.2.3.2.0 came the Unbreakable Linux Kernel for Exadata; prior to this version, Exadata ran the stock EL5 Red Hat kernel (2.6.18). With the unbreakable kernel came the opportunity to run the perf utility. This utility can show which functions are active inside an executable when there’s a symbol table. And the oracle database executable has a symbol table! One reason to do this is to get a more granular overview of what the Oracle database is doing than the wait interface provides, especially for the time that shows up in the wait interface as ‘on cpu’.

Right after the Exadata upgrade, I ran a simple query (which probably doesn’t reflect any real customer case) to get an idea. I had run ‘select count(*) from bigtable’ on Exadata before, and saw most of the time being CPU, and a minor part of it being ‘cell smart table scan’. Now with perf I have the opportunity to get more details on where the time spent on CPU goes!

These are the top 5 functions from that session:

    31.50%     476423   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    30.20%     456774   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
     7.48%     113083   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     6.96%     105301   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.94%      44475   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

So the top function used during the processing of the SQL is a function called ‘sxorchk’. Sadly, Oracle doesn’t provide information about this. Anyway, this is executing a XOR function for the reason of checking information gotten from storage. Probably the name of this function is System XOR CHecK.

Alright, once we know this we can look into the parameters of the instance I am executing the SQL, which is taken from the default Exadata parameter template (!!):

SYS@t11203 AS SYSDBA> show parameter db_block

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_block_buffers		     integer	 0
db_block_checking		     string	 FALSE
db_block_checksum		     string	 TYPICAL
db_block_size			     integer	 8192

So, this should be related to db_block_checksum, db_block_checking is off/false. Well, since this is a test instance, let’s turn it off, and redo the scan:

    49.35%     480911   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
    11.01%     107299   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     6.56%      63885   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     3.97%      38712   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     3.49%      33970   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdst_fetch

The sxorchk function is gone now! This is also reflected in the responsetime: the time with db_block_checksum set to typical is: 00:01:02.44, and with db_block_checksum set to false is: 00:00:41.68 on my system. That is a difference of roughly 20 seconds, or, as we can see from the first perf-function list: 31.5% of the time. That is a significant amount of time!
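The arithmetic behind that comparison, as a minimal Python sketch. The two elapsed times are the ones measured above; the helper name is made up for this example.

```python
def elapsed_seconds(hms):
    """Convert a SQL*Plus style elapsed time (HH:MM:SS.ss) to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


with_checksum = elapsed_seconds("00:01:02.44")     # db_block_checksum=typical
without_checksum = elapsed_seconds("00:00:41.68")  # db_block_checksum=false

saved = with_checksum - without_checksum
share = saved / with_checksum

print(f"{saved:.2f} seconds saved, {share:.1%} of the original elapsed time")
```

The share of elapsed time that disappears comes out at about a third, which is in the same ballpark as the 31.5% that perf attributed to sxorchk.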

When I discussed this with Tuomas Pystynen, he asked me a very valid question: if this is a smartscan, the database does not get blocks, it gets result-sets. So blocks cannot be checked on the database layer. Is this offloaded to the storage/cell server?

Well, if it is offloaded to the storage, it will not be reflected in a function on the database layer. In other words, something is XOR’ed by the database foreground process, which is set (and can be reset) with db_block_checksum! Logical conclusion on these facts would be blocks are involved in some way…

There is a way to know what is actually happening: looking at the backtrace of the sxorchk function! Let’s do that!

    31.50%   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.76%-- kcbhxoro
                |          kcbhvbo
                |          |          
                |          |--100.00%-- kcbhvb_corruptcheck
                |          |          kcbl_sage_block_check
                |          |          kcfis_oss_block_verify
                |          |          kcfis_validate_translation
                |          |          kcfis_process_completed_req
                |          |          kcfis_process_completed_buffer
                |          |          kcfis_process_reaped_io
                |          |          kcfis_read
                |          |          kcbl_predpush_get
                |          |          kcbldrget
                |          |          kcbgtcr
                |          |          ktrget3
                |          |          ktrget2
                |          |          kdst_fetch

What we see here is the function is called from the kcfis (Kernel Cache File Intelligent Storage is my guess) layer from a function called ‘kcfis_oss_block_verify’, in other words: a block, which is actually the resultset which is sent from the cell server, is validated/checked. That is understandable, but the usage of the parameter ‘db_block_checksum’ for setting this is misleading, to put it in a nice way.

Next question which I asked myself is: how about a normal/non-Exadata database? Well, I can mimic a non-Exadata database by setting the parameter ‘cell_offload_processing’ to false to disable smartscans!

This is what the top-5 functions look like with db_block_checksum set to typical, without smartscan:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    18.53%     353741   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.05%     191847   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.35%     102161   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.73%      52103   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

And this is what the top-5 functions look like with db_block_checksum set to false, without smartscan:

    36.51%     706798   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.47%     202645   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.58%     107941   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     3.57%      69044   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     2.38%      46036   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] ktrgcm

If we get the backtrace of the sxorchk function without smartscan enabled:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.79%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch

We can see the sxorchk function is called from kcbldrget (the direct path load function), with more or less the same helper function to prepare for XOR function, and no other layers (like kcfis in the backtrace with smartscan enabled).
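To see exactly which functions differ between the two sxorchk call stacks, here is a small Python sketch that compares them. The two lists are the call chains from the perf output above (innermost function first); the helper name is my own.

```python
# Call chain of sxorchk with smartscan enabled (innermost first).
SMARTSCAN = """
sxorchk kcbhxoro kcbhvbo kcbhvb_corruptcheck kcbl_sage_block_check
kcfis_oss_block_verify kcfis_validate_translation
kcfis_process_completed_req kcfis_process_completed_buffer
kcfis_process_reaped_io kcfis_read kcbl_predpush_get
kcbldrget kcbgtcr ktrget3 ktrget2 kdst_fetch
""".split()

# Call chain of sxorchk with cell_offload_processing set to false.
REGULAR = """
sxorchk kcbhxoro kcbhvbo kcbzvb
kcbldrget kcbgtcr ktrget3 ktrget2 kdst_fetch
""".split()


def only_in(stack, other):
    """Functions that occur in stack but not in other, in stack order."""
    other_functions = set(other)
    return [name for name in stack if name not in other_functions]


smartscan_only = only_in(SMARTSCAN, REGULAR)
regular_only = only_in(REGULAR, SMARTSCAN)
kcfis_functions = [name for name in smartscan_only if name.startswith("kcfis_")]

print("only with smartscan:   ", smartscan_only)
print("only without smartscan:", regular_only)
print("kcfis functions:       ", len(kcfis_functions))
```

Both chains start at kdst_fetch and end in kcbhvbo, kcbhxoro and sxorchk; the difference is the kcfis layer (plus a few kcbl/kcbh helpers) with smartscan, versus the single kcbzvb function without it.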

My conclusion is: db_block_checksum is a parameter which enables checking of the data it has gotten from “storage”, which apparently is done via a XOR function. This checking is done for blocks on a regular/non-Exadata system, which the parameter suggests, but on Exadata with a smartscan the checking still is done, regardless of the fact that no blocks but result-sets are gotten from “storage”.

The checking takes a significant portion of time during processing of my (very simple!) query. Probably other functions can make the overall processing more CPU intensive, which means the relative portion of time spent on checking gets less.

With Exadata/smartscan the relative amount of time spent on sxorchk with my simple SQL is 32%, without smartscan the relative amount of time drops to 21%. This is still a significant amount of time (and the function the most time is spent in, in both cases!).

Final conclusion is you should think about the setting of this parameter if you are doing much physical IO, and set it according to the needs of the database.

One addition: this is done on a V2 half rack version of Exadata (Nehalem CPU), so anyone with X2-[28] or X3-[28] is welcome to profile a ‘select count(*) from bigtable’ and post it here to see if there’s any difference. The database version is 11.2.0.3 BP15.

Recently I spoke at the RMOUG training days 2013 in Denver (the mile high city). It was the first time for me to speak for the RMOUG and to be in Denver. Thanks to the “sequestration” (federal budget cuts) the lines piled up at immigration at Minneapolis (Minneapolis and Saint Paul, the twin cities) airport. Because my plane left more than one hour too late and my layover time was one hour and fifteen minutes, I tried to rebook my flight from Minneapolis to Denver. But this flight turned out to be delayed too, which meant I was able to get on it after all!

At Denver airport Tim Gorman volunteered to fetch me at the airport and bring me to my hotel. Tim did fetch multiple people, which meant we gathered at a bar where Debra Lilley, Mogens Norgaard, Claes (the tank commander) and Martin Widlake, amongst others, were already waiting.

The RMOUG training days were held in the Denver conference centre, which has a very iconic blue bear at the front:

[photo: the blue bear in front of the Denver conference centre]

This is the same conference centre where Collaborate 2013 will be held!

I delivered 3 talks: About multiblock reads (which I will be presenting at Collaborate too, but named “How Oracle secretly changed multiblock reads”), Exadata OLTP (a study on processing single block reads on Exadata configurations and non-Exadata using Kevin Closson’s SLOB (Silly Little Oracle Benchmark)) and a live Oracle function call tracing hacking session. The idea for the last presentation was conceived just prior to the conference, when I learned some people didn’t make it to Denver, and there were session slots to be filled. All three of these presentations will be presented at the Oracle Usergroup Norway Vårseminar 2013 too!

After the conference we went skiing for a few days in Breckenridge with a group of friends, organised by Kellyn Pot’vin and Tim Gorman. Thank you Kellyn and Tim for all the work at both RMOUG and with this famous ‘Faux Table’ event, I really appreciate all the effort you put into these events! I also want to mention the many other people who put effort into making things happen at the Faux table, for cooking, washing, driving, etc.

A few weeks further down the line was the annual HotSOS conference in Dallas. I travelled with Toon Koppelaars. Once again we found a long line for immigration, but eventually we were through immigration quickly enough to fetch a beer at a bar near the gate of the connecting flight, where each and every seat was equipped with an iPad, which you had to use to order something. Remarkable. At the conference I delivered my ‘About multiblock reads’ presentation. I was very pleased to be invited to speak at a conference which is dedicated to Oracle and performance. Once again this was a terrific meet-up with a huge number of friends. This is an impression from the game night: Karl Arao from Enkitec tuning a pile of wood blocks (Jenga), overlooked by Gwen Shapira from Pythian and Mark Farnham.
[photo]
There was a large number of Dutch people at this year’s Hotsos conference: Toon Koppelaars, with whom I travelled, Marco Gralike, Gerwin Hendriksen and Jacco Landlust.

I would like to thank my employer VX Company for their support to make this possible.

In order to see how Exadata smartscans are different, let’s first have a peek at the evolution of Oracle full segment/multiblock reads, as short as possible:

a) Traditional multiblock reads, visible via the event ‘db file scattered read’
The essence is: Multiple adjacent blocks are read from disk, and put in the buffercache. Because every read is sequentially processed, IO latency is a performance penalty for every physical read. This works roughly this way: get a set of adjacent blocks from the segment header, fetch these blocks from disk, process these blocks, then get the next set of adjacent blocks, fetch these blocks from disk, process these blocks, etc.

b) Direct path multiblock reads, visible via the event ‘direct path read’
The essence is: Multiple IOs are done asynchronously, one or more IOs are reaped and processed, after which the number of IOs is brought back to the number of IOs the process wants to keep in flight. Blocks are read to the process’ PGA (which means the IO result is not shared with other processes). Because of the asynchronous way of issuing multiple requests, the process does not suffer from the IO latency penalty of every single IO. This works roughly this way: get a set of adjacent blocks from the segment header, issue an asynchronous IO request for these, get a next set of adjacent blocks from the segment header, issue another asynchronous IO request, process one or more of the IO requests which are ready, issue IO requests for the number of requests reaped, process one or more of the IO requests ready, etc. During processing, Oracle measures CPU and IO times, and can decide to add one or more concurrent requests to the two IOs which it tries to keep in flight.
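The difference between a) and b) can be made tangible with a toy model. The Python sketch below is NOT how Oracle implements either read path; it is a simplified simulation under stated assumptions: every IO request has the same fixed latency, processing a request costs a fixed amount of CPU time, the storage can serve all outstanding requests in parallel, and the number of IOs in flight stays constant (so the adaptive increase described above is left out). All names and numbers are made up for the illustration.

```python
import heapq


def sync_scan_time(n_requests, io_latency, cpu_per_request):
    """a) Traditional reads: fetch, wait, process, one request at a time."""
    return n_requests * (io_latency + cpu_per_request)


def async_scan_time(n_requests, io_latency, cpu_per_request, in_flight=2):
    """b) Direct path reads: keep `in_flight` requests outstanding."""
    now = 0.0
    issued = min(in_flight, n_requests)
    # Min-heap with the completion times of the outstanding requests.
    pending = [float(io_latency)] * issued
    heapq.heapify(pending)
    while pending:
        done_at = heapq.heappop(pending)  # reap the request that finishes first
        now = max(now, done_at)           # only wait if it is not ready yet
        now += cpu_per_request            # process the reaped request
        if issued < n_requests:           # bring the number in flight back up
            heapq.heappush(pending, now + io_latency)
            issued += 1
    return now


# Hypothetical numbers: 100 requests, 5 ms IO latency, 1 ms CPU per request.
print("sync :", sync_scan_time(100, 5, 1))
print("async:", async_scan_time(100, 5, 1, in_flight=2))
```

With one request in flight the model degenerates to the synchronous case; with two or more, the IO latency of one request overlaps with the processing of another, which is where the gain comes from.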

Back to smartscans

It’s not very hard to understand that direct path multiblock reads can perform much better than traditional multiblock reads. Probably at this time you think: yes, I know, but what does this have to do with Exadata? This is all information about regular processing! Well, Exadata uses the regular Oracle database executable. This means that part of the codepath of smartscans is shared with the normal/non-Exadata Oracle database. Obviously, there is a part that is unique to Exadata.

This is best viewed with a backtrace of the call to submit an IO request. This is a full backtrace of the submit of an IO request of a full table scan on Linux to a database on ASM:

io_submit
skgfqio
ksfd_skgfqio
ksfdgo
ksfdaio
kfk_ufs_async_io
kfk_submit_ufs_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

Of course the functions used internally in the executable are not documented. But it’s very useful to look at them to gain a better understanding of what is happening. First look at the function on line 24, kcbgtcr (Kernel Cache Buffers GeT Consistent Read). This is the function to perform a logical IO. One line up, on line 23, is the function kcbldrget (Kernel Cache Buffers direct path LoaDeR GET). This function indicates that the execution did choose the direct path read code path. In fact, the kcbl prefixed functions are believed to belong to the Oracle direct path read codepath. Then, roughly, the ksfd, kfio, kfk and again ksfd layers are crossed, and lastly skgfqio is executed, which performs the actual submit of an IO using io_submit().

Now let’s look at what the equivalent submit of an IO request looks like on Exadata with smartscan turned off:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_issue_read
ossdisk_read
oss_read
kfk_submit_one_oss_io
kfk_submit_oss_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

First locate the kcbgtcr function, which is on line 31 (forget about trying to find logic in the line numbers; backtraces are read from bottom to top, whilst the numbering runs from top to bottom). One line up, on line 30, is the function kcbldrget again. Okay, that looks the same. If we read the backtrace upwards, it’s easy to spot the same layers, in fact the same functions: ksfd, kfio, up to the kfk layer.

In the kfk layer there is a slight difference, which is understandable: on line 7 of the non-Exadata backtrace we see the function kfk_submit_ufs_io, while on Exadata the equivalent function is kfk_submit_oss_io on line 14. I think this deserves a little explanation. Exadata is the marketing name of the database machine, which internally was called ‘Sage’ in Oracle, and this name still surfaces sometimes, like in Exadata naming, or in patch descriptions. Quite probably OSS means ‘Oracle Sage Software’. So, this means that the process is aware it needs to read something from an Exadata storage server, and chooses a function that is meant to set that up.

The next two functions (kfk_ufs_async_io on line 6 and kfk_submit_one_oss_io on line 13) probably do logically the same thing, but are different because the infrastructure is different.

The next function up in both backtraces is where it gets really interesting, because now the code has to do something entirely different: on the non-Exadata system the ksfd layer is entered again, in order to get to the function skgfqio, which submits the IO request using the io_submit call. On the Exadata system, we see a call which I have not encountered outside of Exadata: oss_read. With the knowledge gained above, we can tell this quite probably is an Exadata specific call, which is part of an entire layer: oss. From the function names we can guess it prepares the IO request, and then issues it. Once the oss layer is crossed, we enter another layer: skgxp. The skgxp (System Kernel Generic inter-process Communication (Xfer?) Protocol) layer is NOT unique to Exadata; it’s the communication layer which is used by an Oracle RAC database for inter-process communication. So it seems like Oracle re-used the knowledge gained with RAC inter-process communication for the communication with the (Exadata) storage server.
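The comparison of the two backtraces can also be done mechanically. The sketch below prints the frames two backtraces have in common, starting at the bottom (main) and stopping at the first frame where they differ. The function name shared_frames is a hypothetical helper, and it assumes each backtrace is stored in a file with one function name per line, top of the stack first.

```shell
# Print the frames two backtraces share, counted from the bottom of the
# stack upwards, stopping at the first frame where they differ.
# $1, $2: files with one function name per line, top of the stack first.
shared_frames() {
  awk '
    NR == FNR { a[FNR] = $0; na = FNR; next }   # first file
              { b[FNR] = $0; nb = FNR }         # second file
    END {
      i = na; j = nb
      while (i >= 1 && j >= 1 && a[i] == b[j]) {
        print a[i]
        i--; j--
      }
    }
  ' "$1" "$2"
}
```

Run against the non-Exadata and the non-smartscan Exadata backtraces, the last frame this prints marks the point below which both systems execute the same functions.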

Can we now please get to smartscans?

Yes, we haven’t encountered a smartscan yet; what I showed was a backtrace of a regular direct path read on Exadata. The reason is to show the difference between a traditional system (meaning disks which are presented to the system as local disks) and an Exadata system, which has to use InfiniBand and has to fetch the information it needs from Exadata storage servers. This is what a smartscan read call backtrace looks like:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_cread
oss_cread
kcfis_read
kcbl_predpush_get
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010010kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

The first thing that struck me is that the number of functions decreased. On the other hand, this does not say much (you can make functions as long or as short as you wish). Also, a smartscan is done in a number of steps, some of which have already been done and are not visible in this backtrace. After the submit of a smartscan there are a number of further steps: reaping the submitted scan requests, of course, but also verification of the reaped requests.

It’s now interesting to see how different the codepath looks. The kcbgtcr function is still present, at line 15. One line up there’s the kcbldrget function, which reveals the process chose the direct path read codepath during execution. If we go up one line we see a function in the kcbl layer, which is Exadata specific as far as I know: kcbl_predpush_get. Again: everything has been set up for doing a smartscan before the process reaches the point of this backtrace, which is submitting a request to an Exadata storage server. This means the process has the information needed (what information to ask from what storage server) prior to arriving at this point.

One layer up is a call to the kcfis (Kernel Cache File Intelligent Storage is my guess) layer, then approximately the same calls in the oss layer, but there’s a difference: it’s oss_cread and ossdisk_cread instead of oss_read and ossdisk_read, which makes the call an Exadata request instead of a request for database blocks. After the oss layer, there’s the skgxp layer again, where the calls are exactly the same for both the non-smartscan and the smartscan case.
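The three backtraces can be told apart by a single marker function each. As a rough sketch, the hypothetical helper classify_io_path below reads a backtrace from stdin and reports which of the three IO paths it belongs to. The choice of markers (kcbl_predpush_get, oss_read, io_submit) is taken from the backtraces in this post and is an assumption, not a documented rule.

```shell
# Classify a backtrace (one function name per line on stdin) by the
# marker functions seen in the three backtraces in this post.
classify_io_path() {
  local frames
  frames=$(cat)
  if printf '%s\n' "$frames" | grep -qx 'kcbl_predpush_get'; then
    echo "smartscan"
  elif printf '%s\n' "$frames" | grep -qx 'oss_read'; then
    echo "exadata-non-smartscan"
  elif printf '%s\n' "$frames" | grep -qx 'io_submit'; then
    echo "traditional"
  else
    echo "unknown"
  fi
}
```

The smartscan check comes first because it is the most specific marker; the skgxp functions are deliberately not used, since they appear in both Exadata backtraces.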

Conclusion

The purpose of this blogpost is to show the differences between submitting an IO request on a traditional system, on an Exadata system with smartscans turned off, and with smartscans turned on.

There is a lot of ground to cover on this. It’s not feasible to cover it all in one blogpost, which is why I try to pick a specific part and work from that. Any comments are welcome.

From the backtraces it’s clearly visible that regular IOs are processed the same way on non-Exadata and Exadata; the difference is that on Exadata the request is submitted via the skgxp layer to a storage server, instead of via io_submit. This makes it very probable that single block IOs are done the same way too, which means no ‘magic’ performance enhancement is possible: the processing is the same on Exadata as on non-Exadata, and only the way the IO request is issued differs.

If we look at the comparison between smartscan and non-smartscan requests, it becomes apparent that something different is happening. But the basic processing is the same, by which I mean the process is doing exactly the same as non-Exadata processes. From the smartscan backtrace it becomes visible that a process has to travel through the direct path code layer (kcbl) in order to get a smartscan, because otherwise it would be impossible to call kcbl_predpush_get, which is the call for issuing a smartscan.

Disclaimer: I am not an Oracle employee. This information has all been obtained by profiling execution (using perf/nm/gdb). I have never seen any Oracle database or Exadata source code. There is a chance some assumptions are wrong.
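For readers who want to reproduce this kind of backtrace, here is a minimal sketch using gdb. It is not necessarily how the backtraces in this post were captured. The helper names capture_io_submit_bt and frames_from_bt are hypothetical, the PID must be that of a test session’s server process, and attaching gdb stops the process, so this should only be tried on a lab system.

```shell
# Attach gdb to a process, stop at the next io_submit call and print the
# backtrace. $1: PID of the process to attach to (a lab system only!).
capture_io_submit_bt() {
  gdb -p "$1" -batch -ex 'break io_submit' -ex 'continue' -ex 'bt' 2>/dev/null
}

# Reduce gdb "bt" output to bare function names, one per line.
# Handles both "#N  0xADDR in func (...)" and "#N  func (...)" frames;
# lines that are not frames are ignored.
frames_from_bt() {
  awk '/^#[0-9]+/ {
    if ($2 ~ /^0x/ && $3 == "in") print $4
    else print $2
  }'
}
```

A possible usage is capture_io_submit_bt 12345 | frames_from_bt, where 12345 stands in for the PID of the session being traced. On Exadata the breakpoint would have to be set on a different function, for example sskgxp_sndmsg, which is the top frame of the Exadata backtraces shown here.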

Recently I was asked to look at a virtual (Linux) system which needed to be moved to a new datacenter. If you want to determine if you are on VMware, you can use either lspci or dmidecode. A little searching on the internet revealed it’s reasonably easy to determine the version of VMware ESX using the BIOS Information:

# dmidecode must be run as root.
case $( dmidecode | grep -A4 "BIOS Information" | grep Address | awk '{ print $2 }' ) in
  "0xE8480" ) echo "ESX 2.5" ;;
  "0xE7C70" ) echo "ESX 3.0" ;;
  "0xE7910" ) echo "ESX 3.5" ;;
  # NOTE: this address duplicates the ESX 3.5 entry above, so this branch is
  # never reached; check the sources below for the ESX 4 address.
  "0xE7910" ) echo "ESX 4"   ;;
  "0xEA550" ) echo "ESX 4U1" ;;
  "0xEA2E0" ) echo "ESX 4.1" ;;
  "0xE72C0" ) echo "ESX 5"   ;;
  "0xEA0C0" ) echo "ESX 5.1" ;;
  * ) echo "Unknown version: "
      dmidecode | grep -A4 "BIOS Information"
      ;;
esac
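Before looking up the ESX version it helps to confirm the guest is running on VMware at all. A minimal sketch of the lspci/dmidecode check mentioned above follows. The helper name mentions_vmware is hypothetical, and the check simply looks for the string "VMware" in whatever output is piped in.

```shell
# Succeed if the text on stdin (for example lspci or dmidecode output)
# mentions VMware anywhere, ignoring case.
mentions_vmware() {
  grep -qi 'vmware'
}

# Possible usage (dmidecode needs root):
#   lspci | mentions_vmware && echo "lspci reports VMware devices"
#   dmidecode -s system-manufacturer | mentions_vmware && echo "VMware BIOS"
```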

Sources:

http://virtwo.blogspot.com/2010/10/which-esx-version-am-i-running-on.html

http://dag.wieers.com/blog/detecting-vmware-esx-from-the-guest-os
