Archive

Exadata

This is yet another blogpost on Oracle’s direct path read feature which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table is already in the buffercache.
- Less than 25% of the blocks in the buffercache are directy.

Also (thanks to Freek d’Hooge who pointed me to an article from Tanel Poder) you can change the optimizer statistics to change the segment size for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the “code path”, so during execution.

So let’s take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM)

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we take small table threshold times and multiply it by five, we get 5055. This means that the size of table T2 is more than enough so should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I’ve created a relatively big table and created a (normal) index on it in the same database. The index is created on a single column, called ‘id’. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan is a scan which just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which reads in the order of allocated adjacent blocks).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation	      | Name	   | Rows  | Bytes | Cost (%CPU)| Time	   |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	   |	 1 |	13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       | 	   |	 1 |	13 |		|	   |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |	34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

If we look at number of small table threshold times five (5055), this index is much bigger than that. Also, this is bigger than table T2. Let’s execute select count(id) from bigtable, and look at the statistic ‘index fast full scans (direct read)’:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn’t been a direct path read! This means that this read has been done in the “traditional way”. This is a bit…counter intuitive. I’ve traced the session, and indeed it’s doing the traditional multiblock reads via the scattered read waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter “_very_large_object_threshold”. The information found on the internet reports this value is in megabytes. A quick stroll through a number of different database (all on 11.2.0.3) shows this parameter is quite probably statically set at “500″.

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.

This interesting, because it looks like this parameter does the same for full scans on index segments as “_small_table_threshold” does for full scans on table segments: the size of the segment to be scanned needs to be bigger than five times.

There are also differences: small table threshold is set in blocks, (apparently) very large object threshold is set in megabytes. Also, small table threshold is set by default at 2% of the size of the buffercache (so it scales up with bigger caches), very large object threshold seems to be fixed at 500. If my finding is correct, then it means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It’s unknown to me if the 50% limit for blocks in the cache and the 25% limit for dirty blocks is subject to this too.

When an Oracle process starts executing a query and needs to do a full segment scan, it needs to make a decision if it’s going to use ‘blockmode’, which is the normal way of working on non-Exadata Oracle databases, where blocks are read from disk and processed by the Oracle foreground process, either “cached” (read from disk and put in the database buffercache) or “direct” (read from disk and put in the process’ PGA), or ‘offloaded mode’, where part of the execution is done by the cell server.

The code layer where the Oracle database process initiates the offloading is ‘kcfis’; an educated guess is Kernel Cache File Intelligent Storage. Does a “normal” alias non-Exadata database ever use the ‘kcfis’ layer? My first guess would be ‘no’, but we all know guessing takes you nowhere (right?). Let’s see if a “normal” database uses the ‘kcfis’ functions on a Linux x64 (OL 6.3) system with Oracle 11.2.0.3 64 bit using ASM.

The only way to profile kernel functions that I am aware of is using ‘gdb’ and breaking on functions in the Oracle executable:
(the process id shown below ought to be the process id of an oracle database process id you are going to execute in. Do not, I repeat: not do this with other processes, especially the ones that do important tasks!)

# gdb -p 42
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) rbreak ^kcfis.*
Breakpoint 1 at 0x2204094
<function, no debug info> kcfis_get_sched_delay;
Breakpoint 2 at 0x220410a
<function, no debug info> kcfis_capability_tab_get;
Breakpoint 3 at 0x2204150
<function, no debug info> kcfis_can_session_migrate;
Breakpoint 4 at 0x2204190
<function, no debug info> kcfis_fob_storage_attr;
Breakpoint 5 at 0x22041d0
<function, no debug info> kcfis_init_resource_limit;
Breakpoint 6 at 0x22041f2
<function, no debug info> kcfis_tablespace_is_on_sage;
...
(gdb) c
Continuing.

Okay, we got the debugger set, now let’s execute a simple query (doing a full scan) to see if kcfis is touched on a “normal” server or not!

TS@v11203 > select count(*) from t2;

I am on OS-X (Apple) using iTerm, and I see the tab of the gdb session turning red: gdb generated some output on the screen!

(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) 

So, we hit a kcfis function! Let me put upfront that I do not have sourcecode of any kind, so my statements about what a function does are actually guesses. Anyway: from the name of the function (kcfis_tablespace_is_on_sage) it looks like a function in the kcfis layer which determines if a tablespace is on an Exadata storage server (exadata’s internal name is/was ‘sage’). Nice. so we hit the layer for determination if the tablespace is on Exadata.

Let’s continue the profiling with gdb:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000220410a in kcfis_capability_tab_get ()
(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) c
Continuing.

We see another call to the function ‘kcfis_tablespace_is_on_sage’, and a call to ‘kcfis_capability_tab_get’. The last function probably tries to probe the table (but could be tablespace) to get the capabilities. This could be the function which checks the requirements for hybrid columnar compression, I am not sure about that.

At what point during the processing of the full segment scan does the kcfis_tablespace_is_on_sage occur? One way of investigating this, is profiling some functions we know a (direct path) full scan does, and see where the kcfis_tablespace_is_on_sage kicks in. When the buffer cache is flushed prior to executing a full scan, and the SQL is made unique, so it has to be parsed, the following sequence of events happens:

- Parse
- Execute
- A ‘SQL*Net message to client’ wait
- A ‘db file sequential read’ wait (for reading the segment header)
- Potentially a ‘asynch descriptor resize’ wait
- The full scan is done asynchronously (potentially revealing some ‘direct path read’ waits)

So if we profile on start and end of a wait (kslwtbctx and kslwtectx), a single block read (pread64), AIO (io_submit and io_getevents_0_4) and of course kcfis_tablespace_is_on_sage, we should be able to see that:

Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, pread64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 6, io_getevents_0_4 (ctx=0x7fb42f475000, min_nr=2, nr=128, events=0x7fffb4c66768, timeout=0x7fffb4c67770) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

So what do we see here?
Row 1 : Here the wait ending for ‘SQL*Net message from client’.
Row 2-3 : This is the ‘SQL*Net message to client’ wait.
Row 5-6-7 : This is begin wait, pread64 for reading the segment header and the end wait.
Row 8&9 : The session probes for Exadata.
Row 10-11 : Start and end of a wait, ‘asynch descriptor resize’ (verified by trace file).
Row 12- : Here the full scan takes off.

So…a query starts (I’ve made it unique by adding a dummy hint, so it’s parsed again) the SQL*Net roundtrip occurs, the segment header is read, then the process looks if it’s on Exadata, which it isn’t here, and starts an asynchronous full scan.

So let’s repeat this check on a database which is on Exadata! In order to do so, we need to be aware several calls are not done on exadata: pread64, io_submit and io_getevents_0_4, because we do not do local IO, but issue them over infiniband. So I breaked on kslwtbctx, kslwtectx, kcfis_tablespace_is_on_sage. This is how that looks like:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000905cf62 in kslwtectx ()
Breakpoint 3, 0x0000000002230366 in kcfis_tablespace_is_on_sage ()

Mmmh, this is different. The ‘SQL*Net message from client’ wait ends, and before any other wait occurs, the existence of exadata is checked. This is different from the non-exadata case. Let’s take a look at the backtrace of the break on kcfis_tablespace_is_on_sage:

(gdb) bt
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#3  0x0000000001a6111d in kkmfcblo ()
#4  0x000000000922f76d in kkmpfcbk ()
#5  0x000000000942e538 in qcsprfro ()
#6  0x000000000942de29 in qcsprfro_tree ()
#7  0x000000000942de6e in qcsprfro_tree ()
#8  0x0000000002dd80c5 in qcspafq ()
#9  0x0000000002dd51d9 in qcspqbDescendents ()
#10 0x0000000002dd91e4 in qcspqb ()
#11 0x0000000001a6b2be in kkmdrv ()
#12 0x0000000002584c76 in opiSem ()
#13 0x000000000258ac8b in opiDeferredSem ()
#14 0x000000000257dc32 in opitca ()
#15 0x0000000001ec3d7d in kksFullTypeCheck ()
#16 0x00000000092a7256 in rpiswu2 ()
#17 0x0000000001eca977 in kksLoadChild ()
#18 0x0000000009298448 in kxsGetRuntimeLock ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
#23 0x0000000001b82a46 in kpooprx ()
#24 0x0000000001b80d2c in kpoal8 ()
#25 0x00000000091fb8b8 in opiodr ()
#26 0x000000000939e696 in ttcpip ()
#27 0x000000000180f011 in opitsk ()
#28 0x0000000001813c0a in opiino ()
#29 0x00000000091fb8b8 in opiodr ()
#30 0x000000000180af4c in opidrv ()
#31 0x0000000001e0a77b in sou2o ()
#32 0x0000000000a0cc05 in opimai_real ()
#33 0x0000000001e106ec in ssthrdmain ()
#34 0x0000000000a0cb71 in main ()

What is interesting to see, is line 23, backtrace layer number 21: kksParseCursor. So actually during parsing the detection of storage servers happens already, not when it actually starts a full segment scan needs to make a decision to do a smartscan or not.

This is just a very small post on how to watch the progress of the “CopyBack” state of a freshly inserted disk in an Exadata “Computing” (database) node. A disk failed in the (LSI Hardware) RAID5 set, and the hotspare disk was automatically used. The failed disk was replaced, and we are now awaiting the intermediate “CopyBack” phase.

The current state of the disks is visible using the following command:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0Firmware state: Copyback
Device Firmware Level: 0D70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: 0B70

But what is the progress? Some googling came up with (almost) the answer (http://trac.camsentry.com/wordpress/tag/ldinfo/), I modified it a tiny bit to make it refresh:

while $(true); do /opt/MegaRAID/MegaCli/MegaCli64 adpeventlog getlatest 200 -f ~/adpeventlog.txt a0; awk '/^Time/{TIME=$0};/Seconds/{SECS=$5}/^Event Desc/{printf("%25.25s %5.5s %s\n",TIME,SECS,$0);TIME=" ";SECS=""}' ~/adpeventlog.txt|grep -v fan|tac; sleep 5; done

This wil refresh every 5 seconds and show the progress of the CopyBack, and the state changes after it once it gets to that.

With Exadata version 11.2.3.2.0 came the Unbreakable Linux Kernel for Exadata, which had been the stock EL5 redhat kernel prior to this version (2.6.18). With the unbreakable kernel came the opportunity to run the perf utility. This utility has the opportunity to see which functions are active inside an executable when there’s a symbol table. And the oracle database executable has a symbol table! One reason to do this, is to get a more granular overview of what the Oracle database is doing than the wait interface, especially to get a more detailed overview of what the database is doing in what is visible in the wait interface as ‘on cpu’.

Right after the Exadata upgrade, I ran a simple query (which probably doesn’t reflect any real customer case) to get an idea. Previously I have been running ‘select count(*) from bigtable’ on Exadata before, and saw most of it being CPU, and a minor part of it being ‘cell smart table scan’. Now with perf I have the opportunity to get more details on what is the time spend on CPU!

These are the top 5 functions from that session:

    31.50%     476423   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    30.20%     456774   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
     7.48%     113083   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     6.96%     105301   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.94%      44475   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

So the top function used during the processing of the SQL is a function called ‘sxorchk’. Sadly, Oracle doesn’t provide information about this. Anyway, this is executing a XOR function for the reason of checking information gotten from storage. Probably the name of this function is System XOR CHecK.

Alright, once we know this we can look into the parameters of the instance I am executing the SQL, which is taken from the default Exadata parameter template (!!):

SYS@t11203 AS SYSDBA> show parameter db_block

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_block_buffers		     integer	 0
db_block_checking		     string	 FALSE
db_block_checksum		     string	 TYPICAL
db_block_size			     integer	 8192

So, this should be related to db_block_checksum, db_block_checking is off/false. Well, since this is a test instance, let’s turn it off, and redo the scan:

    49.35%     480911   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010010kmP
    11.01%     107299   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     6.56%      63885   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     3.97%      38712   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     3.49%      33970   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdst_fetch

The sxorchk function is gone now! This is also reflected in the responsetime: the time with db_block_checksum set to typical is: 00:01:02.44, and with db_block_checksum set to false is: 00:00:41.68 on my system. That is a difference of roughly 20 seconds, or, as we can see from the first perf-function list: 31.5% of the time. That is a significant amount of time!

When I discussed this with Tuomas Pystynen, he asked me a very valid question: if this is a smartscan, the database does not get blocks, it gets result-sets. So blocks cannot be checked on the database layer. Is this offloaded to the storage/cell server?

Well, if it is offloaded to the storage, it will not be reflected in a function on the database layer. In other words, something is XOR’ed by the database foreground process, which is set (and can be reset) with db_block_checksum! Logical conclusion on these facts would be blocks are involved in some way…

There is way to know what is actually happening: looking at the backtrace of sxorchk function! Let’s do that!

    31.50%   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.76%-- kcbhxoro
                |          kcbhvbo
                |          |          
                |          |--100.00%-- kcbhvb_corruptcheck
                |          |          kcbl_sage_block_check
                |          |          kcfis_oss_block_verify
                |          |          kcfis_validate_translation
                |          |          kcfis_process_completed_req
                |          |          kcfis_process_completed_buffer
                |          |          kcfis_process_reaped_io
                |          |          kcfis_read
                |          |          kcbl_predpush_get
                |          |          kcbldrget
                |          |          kcbgtcr
                |          |          ktrget3
                |          |          ktrget2
                |          |          kdst_fetch

What we see here is the function is called from the kcfis (Kernel Cache File Intelligent Storage is my guess) layer from a function called ‘kcfis_oss_block_verify’, in other words: a block, which is actually the resultset which is send from the cell server, is validated/checked. That is understandable, but the usage of the parameter ‘db_block_checksum’ for setting this is misleading, to put it in a nice way.

Next question which I asked myself is: how about a normal/non-Exadata database? Well, I can mimic a non-Exadata database by setting the parameter ‘cell_offload_processing’ to false to disable smartscans!

This is how the top-5 functions look like with db_block_checksum set to true without smartscan:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
    18.53%     353741   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.05%     191847   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.35%     102161   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     2.73%      52103   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo

And this is how the top-5 functions look like with db_block_checksum set to false: without smartscan:

    36.51%     706798   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kdstf00000010000kmP
    10.47%     202645   oracle  [kernel]                                                     [k] __default_send_IPI_dest_field
     5.58%     107941   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] qeaeCn1Serial
     3.57%      69044   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] kcbhvbo
     2.38%      46036   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] ktrgcm

If we get the backtrace of the sxorchk function without smartscan enabled:

    20.83%     397620   oracle  /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/oracle         [.] sxorchk
                |          
                |--99.79%-- kcbhxoro
                |          kcbhvbo
                |          kcbzvb
                |          kcbldrget
                |          kcbgtcr
                |          ktrget3
                |          ktrget2
                |          kdst_fetch

We can see the sxorchk function is called from kcbldrget (the direct path load function), with more or less the same helper function to prepare for XOR function, and no other layers (like kcfis in the backtrace with smartscan enabled).

My conclusion is: db_block_checksum is a parameter which enables checking of the data it has gotten from “storage”, which apparently is done via a XOR function. This checking is done for blocks on a regular/non-Exadata system, which the parameter suggests, but on Exadata with a smartscan the checking still is done, regardless the fact that no blocks but result-sets are gotten from “storage”.

The checking takes a significant portion of time during processing of my (very simple!) query. Probably other functions can make the overall processing more CPU intensive, which means the relative portion of time spend on checking gets less.

With Exadata/smartscan the relative amount of time spend on sxorchk with my simple SQL is 32%, without smartscan the relative amount of time drops to 21%. This is still a significant amount of time (and the function the most time is spend in, in both cases!).

Final conclusion is you should think about the setting of this parameter if you are doing much physical IO, and set it according to the needs of the database.

One addition: this is done on a V2 half rack version of Exadata (Nehalem CPU), so anyone with X2-[28] or X3-[28] is welcome to profile a ‘select count(*) from bigtable’ and post it here to see if there’s any difference. The database version is 11.2.0.3 BP15.

In order to look how Exadata smartscans are different, first let’s have a peek the Oracle full segment/multiblock read evolution as short as possible:

a) Traditional multiblock reads, visible via the event ‘db file scattered read’
The essence is: Multiple adjacent blocks are read from disk, and put in the buffercache. Because every read is sequentially processed, IO latency is a performance penalty for every physical read. This works roughly this way: get a set of adjacent blocks from the segment header, fetch these blocks from disk, process these blocks, then get the next set of adjacent blocks, fetch these blocks from disk, process these blocks, etc.

b) Direct path multiblock reads, visible via the event ‘direct path read’
The essence is: Multiple IOs are done asynchronously, one or more IOs are reaped and processed, after which the number of IOs is brought back to the number of IOs the process want to keep in flight. Blocks are read to the process’ PGA (which means the IO result is not shared with other processes). Because of the asynchronous way of issuing multiple requests, the process does not suffer from the IO latency penalty of every single IO. This works roughly this way: get a set of adjacent blocks from the segment header, issue an asynchronous IO request for these, get a next set of adjacent blocks from the segment header, issue another asynchronous IO request, process one or more of the IO requests which are ready, issue IO requests for the number requests reaped, process one or more of the IO requests ready, etc. During processing, Oracle measures CPU and IO times, and can decide to add one or more concurrent requests to the two IO’s which it tries to keep in flight.

Back to smartscans

It’s not very hard to understand that direct path multiblock reads can perform much better than traditional multiblock reads. Probably at this time you think: yes, I know, but what does this have to do with Exadata? This is all information about regular processing! Well, Exadata uses the regular Oracle database executable. This means that part of the codepath of smartscans is shared with the normal/non-Exadata Oracle database. Obviously, there is a part that is unique to Exadata.

This is best viewed with a backtrace of the call to submit an IO request. This a full backtrace of the submit of an IO request of a full table scan on Linux to a database on ASM:

io_submit
skgfqio
ksfd_skgfqio
ksfdgo
ksfdaio
kfk_ufs_async_io
kfk_submit_ufs_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

Of course the functions used internally in the executable are not documented. But it’s very useful to look at them to gain a better understanding of what is happening. First look at the function at line line 24, kcbgtcr (Kernel Cache Buffers GeT Consistent Read). This is the function to perform a logical IO. One line up on line number 23 is the function kcbldrget (Kernel Cache Buffers direct path LoaDeR GET). This function indicates that the execution did choose the direct path read code path. In fact, the kcbl prefixed functions are believed to belong to Oracle direct path read codepath. Then roughly the ksfd, kfio, kfk, ksfd, and lastly skgfqio is executed, which performs the actual submit of an IO using io_submit().

Now let’s look how the equivalent submit of an IO request looks like on Exadata with smartscan turned off:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_issue_read
ossdisk_read
oss_read
kfk_submit_one_oss_io
kfk_submit_oss_io
kfk_submit_io
kfk_io1
kfkRequest
kfk_transitIO
kfioSubmitIO
kfioRequestPriv
kfioRequest
ksfd_kfioRequest
ksfd_osmgo
ksfdgo
ksfdaio
kcflbi
kcbldio
kcblrs
kcblgt
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010000kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

First locate the kcbgtcr function, which is on line 31 (forget about trying to find logic in the line numbers; backtraces are read from bottom to top, whilst the numbering logic is from top to bottom). One line up on number 30 is the function kcbldrget again. Okay, that looks the same. If we read the backtrace up, it’s easy to spot the same layers, in fact the same functions: ksfd, kfio up to the kfk layer.

In the kfk layer there is a slight difference, which is understandable: on line 7 of the non-Exadata backtrace we see the function kfk_submit_ufs_io, while on Exadata the same function is kfk_submit_oss_io on line 14. I think this deserves a little Explanation. Exadata is the marketing name of the database machine, which internally was called ‘Sage’ in Oracle, and this name still surfaces sometimes, like in Exadata naming, or in patch descriptions. Quite probably OSS means ‘Oracle Sage Software’. So, this means that the process is aware it needs to read something from an Exadata storage server, and chooses a function that is meant to set that up.

The next two functions (kfk_ufs_async_io on line 6 and kfk_submit_one_oss_io on line 13) probably do logically the same, but are different because the infrastructure is different.

The next function up in both backtraces is where it gets really interesting, because now the code has to do something entirely different: on the non-Exadata system the ksfd layer is entered again, in order to get to the function skgfqio, which submit’s the IO request using the io_submit call. On the Exadata system, we see a call which I have not encountered outside of Exadata: oss_read. With the knowledge gained above, we can tell this quite probably is an Exadata specific call, which is inside an entire layer: oss. From the function names we can guess it prepares the IO request, and then issues it. Once the oss layer is crossed, we enter another layer: skgxp. The skgxp (System Kernel Generic inter-process Communication (Xfer?) Protocol) layer is NOT unique to Exadata, it’s the communication layer which is used by an Oracle RAC database for inter-process communication. So it seems like Oracle re-used the knowledge gained with RAC inter-process communication for the communication with the (Exadata) storage server.

Can we now please get to smartscans?

Yes, we didn’t encounter a smartscan yet, I did show a backtrace of a regular direct path read on Exadata. The reason is to show the difference between a traditional system (meaning disks which are presented to the system as local disks) and an Exadata system, which has to use infiniband and has to fetch the information it needs from Exadata storage servers. This is how a smartscan read call backtrace looks like:

sskgxp_sndmsg
skgxpfragsnd
skgxp_send_next_fragment
skgxpxmit
skgxpivsnd
skgxpivrpc
skgxpvrpc
ossnet_issue_vrpc
ossnet_queue_vrpc
ossdisk_cread
oss_cread
kcfis_read
kcbl_predpush_get
kcbldrget
kcbgtcr
ktrget3
ktrget2
kdst_fetch
kdstf00000010010kmP
kdsttgr
qertbFetch
qergsFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain
main

The first thing which did struck me is the number of functions did decrease. On the other hand, this does not say much (you can make functions as long or as short as you wish). Also, a smartscan is done using a number of steps, of which some are already been done, which are not visible from this backtrace. After the submit of a smartscan there is another number of steps; of course reaping the submitted scan requests, but also verification of the reaped request.

It’s now interesting to see how much different the codepath looks like. The kcbgtcr function is still present, at line 15. One line up there’s the kcbldrget function, which reveals the process chose the direct read path codepath during execution. If we go up one line we see a function in the kcbl layer, which is Exadata specific as far as I know: kcbl_predpush_get. Again: all has been setup for doing a smartscan prior to the point where the process enters the point of this backtrace: submitting a request to an Exadata storage server. This means the process has the information needed (what information to ask from what storage server) prior to arriving at this point.

One layer up is a call to the kcfis (Kernel Cache File Intelligent Storage is my guess) layer, then approximately the same calls in the oss layer, but there’s a difference: it’s oss_cread and ossdisk_cread instead of oss_read and ossdisk_read, which make the call an Exadata request, instead of a request for database blocks. After the oss layer, there’s the skgxp layer again, which are exactly the same calls for both the non-smartscan and smartscan.

Conclusion

The purpose of this blogpost is to show the differences between submitting an IO request on a traditional system, on an Exadata system with smartscans turned off, and with smartscans turned on.

There is a lot of ground to cover on this. It’s not doable to cover this all in one blogpost. That’s the reason I try to pick a specific part, and work from that. Any comments are welcome.

From the backtraces it’s quite good visible regular IO’s are processed and done the same way on non-Exadata and Exadata, on Exadata the request is submitted via the skgxp layer to a storage server, instead of an IO request done via io_submit. This makes it very probable that single block IO’s are done the same way too, which means no ‘magic’ performance enhancement is possible, because it’s processed the same way on Exadata as on non-Exadata, the only difference is the IO request is done differently.

If we look at the comparison between smartscan and non-smartscan requests, it becomes prevalent there is something different happening. But the basic processing is the same, with which I mean the process is doing exactly the same as non-Exadata processes. From the smartscan backtrace it becomes visible that a process has to travel through the direct path code layer (kcbl), in order to get a smartscan, because otherwise it would be impossible to issue kcbl_predpush_get, which is the call for issuing a smartscan.

Disclaimer: I am not an Oracle employee. This information has all been obtained by profiling execution (using perf/nm/gdb). I have never seen any Oracle database and Exadata source code. There is a chance some assumptions are wrong.

Recently I was discussing some IO related waits with some friends. The wait I was discussing was ‘kfk: async disk IO’. This wait was always visible in Oracle version 11.2.0.1 and seems to be gone in version 11.2.0.2 and above. Here is the result of some investigation into that.

First: the wait is not gone with version 11.2.0.2 and above, which is very simple to prove (this is a database version 11.2.0.3):

SYS@v11203 AS SYSDBA> select name, parameter1, parameter2, parameter3, wait_class from v$event_name where name like 'kfk: async disk IO';

NAME		     PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
-------------------- ---------- ---------- ---------- ----------------------------------------------------------------
kfk: async disk IO   count	intr	   timeout    System I/O

What is interesting, is that the wait class is ‘System I/O’. I don’t know the official definition of the wait class ‘System I/O’, but it tells me that it is something a background process is waiting for, not my foreground process. But I could be wrong…

Let’s look at an excerpt of a tracefile of a asynchronous, full segment scan in Oracle version 11.2.0.1. This host is running in a VM on OL6.3 X64:

PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356620409310181 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=120,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310180
EXEC #3:c=0,e=63,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310297
WAIT #3: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1356620409310351
WAIT #3: nam='Disk file operations I/O' ela= 9 FileOperation=2 fileno=5 filetype=2 obj#=-1 tim=1356620409311203
WAIT #3: nam='Disk file operations I/O' ela= 238 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409312218
WAIT #3: nam='kfk: async disk IO' ela= 14 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409313703
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409314449
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315169
WAIT #3: nam='kfk: async disk IO' ela= 5 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315850
WAIT #3: nam='Disk file operations I/O' ela= 42 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409316451
WAIT #3: nam='kfk: async disk IO' ela= 33 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409317903
WAIT #3: nam='kfk: async disk IO' ela= 403 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409320529
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409321950
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409323627
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409325424
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409327121
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409329153
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409330861
WAIT #3: nam='kfk: async disk IO' ela= 8 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409332534
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409334179

What is quit clear to see, is the database is doing some stuff which is instrumented in the wait interface (‘Disk file operations I/O’), but the event ‘kfk: async disk IO’ is much occurring. It is NOT taking an enormous amount of time (ela < 10 (microseconds) most of the time).

Let's dig a little deeper. Most events are instrumentation for calls to the operating system to do something. Let's take the linux 'strace' utility, and list the waits together with the system calls. This is done with the 'strace -e write=all -p PID’, here is a fragment:

io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
io_getevents(139892671860736, 2, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 122880, 0}}, {0, 0}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 107) = 107
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 35  33 20 63 6f 75 6e 74 3d  ' ela= 5 3 count= |
 | 00030  31 20 69 6e 74 72 3d 30  20 74 69 6d 65 6f 75 74  1 intr=0  timeout |
 | 00040  3d 34 32 39 34 39 36 37  32 39 35 20 6f 62 6a 23  =4294967 295 obj# |
 | 00050  3d 37 33 34 32 36 20 74  69 6d 3d 31 33 35 36 36  =73426 t im=13566 |
 | 00060  32 31 31 31 30 32 35 38  30 31 34                 21110258 014      |
write(9, "\n", 1)                       = 1
 | 00000  0a                                                .                 |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b585420, 0x7f3b4b585420, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 106) = 106
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 36  20 63 6f 75 6e 74 3d 31  ' ela= 6  count=1 |
 | 00030  20 69 6e 74 72 3d 30 20  74 69 6d 65 6f 75 74 3d   intr=0  timeout= |
 | 00040  34 32 39 34 39 36 37 32  39 35 20 6f 62 6a 23 3d  42949672 95 obj#= |
 | 00050  37 33 34 32 36 20 74 69  6d 3d 31 33 35 36 36 32  73426 ti m=135662 |
 | 00060  31 31 31 30 32 35 39 37  35 39                    11102597 59       |
write(9, "\n", 1)                       = 1
 | 00000  0a                                                .                 |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
write(9, "WAIT #3: nam='kfk: async disk IO"..., 106) = 106
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam='kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 39  20 63 6f 75 6e 74 3d 31  ' ela= 9  count=1 |
 | 00030  20 69 6e 74 72 3d 30 20  74 69 6d 65 6f 75 74 3d   intr=0  timeout= |
 | 00040  34 32 39 34 39 36 37 32  39 35 20 6f 62 6a 23 3d  42949672 95 obj#= |
 | 00050  37 33 34 32 36 20 74 69  6d 3d 31 33 35 36 36 32  73426 ti m=135662 |
 | 00060  31 31 31 30 32 36 32 31  37 38                    11102621 78       |

What is interesting here, is that the event ‘kfk: async disk IO’ is reported after both the io_getevents() and the io_submit() call. (with the io_submit call an IO request or IO requests are dispatched to the operating system, with the io_getevents call the operating system (AIO) completion queue is examined to see if any requests are ready). So we can conclude that ‘System I/O’ is not appropriate (by my definition), this is something my foreground process apparently is waiting for. We can also conclude that this event is related to IO. But what does it mean?

In order to dig deeper, we need to know what the database is doing internally. The most obvious way is to read the source. But, it’s not possible to get the Oracle database source if your not working for Oracle (and probably explicitly have been granted access). There are a number of ways to see more. I wrote about ‘perf’, but also the gdb (gnu debugger) can be used.

In order to debug a running Oracle database process, you can attach to a process with the debugger (as root) with ‘gdb -p PID’. An equivalent of using trace and looking for io_submit() and io_getevents() calls is:

# gdb -p 14122
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
...
0x0000003f38a0e530 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) 

Now enter the following commands (if you go into gdb yourself, it becomes apparent which are the commands):

(gdb) set pagination off
(gdb) break io_submit
Breakpoint 1 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) break 'io_getevents@plt'
Breakpoint 2 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) continue
Continuing.

And run the full segment scan again:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()

Now we need to know the function call which Oracle uses for the registering waits, for which we used the tracefile generated by SQL trace earlier. The first line in the tracefile of our current session if we issue the full scan again will be the ending of the wait ‘SQL*Net message from client’. The call which Oracle uses for writing this line is write(). Let’s clean up the current breakpoints, and set a breakpoint for the write call (stop the debugger’s current action using CTRL-c):

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break write
Breakpoint 3 at 0x3f38a0e4c0: file ../sysdeps/unix/syscall-template.S, line 82. (3 locations)
(gdb) c
Continuing.

(‘c’ is an abbreviation of ‘continue’)
And run the scan again.
Because of the breakpoint set, gdb will break execution when the write() call is issued:

Breakpoint 3, write () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

And generate a backtrace (a list of all the functions which where called to reach the write() call), with the ‘backtrace’ command (‘bt’):

(gdb) bt
#0  write () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000086014dd in sdbgrfuwf_write_file ()
#2  0x0000000008601426 in sdbgrfwf_write_file ()
#3  0x00000000085caafb in dbgtfdFileWrite ()
#4  0x00000000085ca685 in dbgtfdFileAccessCbk ()
#5  0x00000000085c9d6d in dbgtfPutStr ()
#6  0x0000000001a8d93a in dbktWriteTimestamp ()
#7  0x0000000001a8d6b5 in __PGOSF61_dbktWriteNfy ()
#8  0x00000000085caa06 in dbgtfdFileWrite ()
#9  0x00000000085ca685 in dbgtfdFileAccessCbk ()
#10 0x00000000085c9d6d in dbgtfPutStr ()
#11 0x0000000001a8e291 in dbktPri ()
#12 0x00000000009f143e in ksdwrf ()
#13 0x00000000009f1d8f in ksdwrfn ()
#14 0x0000000005abf0e1 in kxstTraceWait ()
#15 0x000000000821e88d in kslwtectx ()
#16 0x00000000083c2a79 in __PGOSF24_opikndf2 ()
#17 0x000000000143b790 in opitsk ()
#18 0x00000000014406da in opiino ()
#19 0x00000000083c54fd in opiodr ()
#20 0x0000000001437b60 in opidrv ()
#21 0x00000000018aac97 in sou2o ()
#22 0x00000000009d3ef1 in opimai_real ()
#23 0x00000000018aff26 in ssthrdmain ()
#24 0x00000000009d3e5d in main ()

What is interesting to see here, is the function at number 15, kslwtectx(). We know this write reports the ending of waiting at ‘SQL*Net message from client’. This function is ‘Kernel Service Layer WaiT End ConTeXt’ (educated guess; I’ve had a little help from Tanel Poder to find the Oracle wait interface functions). The function which marks the start of a wait is called ‘kslwtbctx’, and the function which ends a wait is called ‘kslwtectx’. Let’s put a break and continue on io_submit, io_getevents, kslwtbctx and kslwtectx:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break 'io_getevents@plt'
Breakpoint 5 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtbctx
Breakpoint 6 at 0x8217fda
(gdb) commands
Type commands for breakpoint(s) 6, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtectx
Breakpoint 7 at 0x821e3d8
(gdb) commands
Type commands for breakpoint(s) 7, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now run the full scan again; this is what is seen in the debugger:

#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()

This goes on, until the IO calls end, and the foreground does some other stuff at the end, ending with the beginning of waiting on ‘SQL*Net message from client’:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()

Okay, back to ‘kfk: async disk IO’. We see the ‘io_getevents()’ call which ALWAYS has a kslwtbctx() call before, and always a kslwtectx() call after the io_getevents() call itself. Please mind we are on 11.2.0.1. This strongly suggests ‘kfk: async disk IO’ is the instrumentation of the io_getevents() call. Or io_getevents() is part of this wait event.

Let’s make the io_getevents() call take longer, and see if that is reflected in the wait, to prove the hypothesis made above. This can be done using the ‘catch’ function of the debugger:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) catch syscall io_getevents
Catchpoint 8 (syscall 'io_getevents' [208])
(gdb) commands
Type commands for breakpoint(s) 8, one per line.
End with a line saying just "end".
>silent
>shell sleep 0.01
>c
>end
(gdb) c
Continuing.

Please mind it’s ‘io_getevents’ here instead of ‘io_getevents@plt’, because the catch function looks the syscall up in an header file, instead of the symbol table in the executable.

Now run the full scan again, with SQL trace turned on at level 8 to show the waits. Previously we saw waits with an elapsed time mostly lesser than 10ms. here is a snippet from the trace:

PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356626115353145 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=203,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115353116
EXEC #3:c=1000,e=80,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115355176
WAIT #3: nam='SQL*Net message to client' ela= 30 driver id=1413697536 #bytes=1 p3=0 obj#=73426 tim=1356626115355832
WAIT #3: nam='kfk: async disk IO' ela= 26404 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115384177
WAIT #3: nam='kfk: async disk IO' ela= 25050 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115411677
WAIT #3: nam='kfk: async disk IO' ela= 24885 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115437990
WAIT #3: nam='kfk: async disk IO' ela= 25002 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115464244

*** 2012-12-27 17:35:15.492
WAIT #3: nam='kfk: async disk IO' ela= 25192 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115492891
WAIT #3: nam='kfk: async disk IO' ela= 25175 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115521419
WAIT #3: nam='kfk: async disk IO' ela= 24804 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115549124
WAIT #3: nam='kfk: async disk IO' ela= 24980 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115576955
WAIT #3: nam='kfk: async disk IO' ela= 24988 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115604879

A little explanation here: the debugger triggers the syscall catcher on entrance and leaving of the syscall. The times being somewhat more than 20ms, instead of a few microseconds gives a fair indication the wait is related. We can also look in the debugger:

(gdb) info break
Num     Type           Disp Enb Address            What
8       catchpoint     keep y                      syscall "io_getevents" 
	catchpoint already hit 174 times
        silent
        shell sleep 0.01
        c

Yes, it has been hit 174 times. So ‘kfk: async disk IO’ is the instrumentation of the ‘io_getevents()’ call, or more actions of which the io_getevents() is part of.

Okay, now that we know that, let’s switch to Oracle version 11.2.0.3…
This is what I find on doing EXACTLY the same (full scan) in the tracefile:

PARSING IN CURSOR #140382932318408 len=23 dep=0 uid=84 oct=3 lid=84 tim=1356626776003874 hv=1020534364 ad='7f491298' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #140382932318408:c=0,e=41,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003873
EXEC #140382932318408:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003962
WAIT #140382932318408: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356626776004009
WAIT #140382932318408: nam='direct path read' ela= 340 file number=5 first dba=28418 block cnt=126 obj#=75579 tim=1356626776045733
FETCH #140382932318408:c=164975,e=188719,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356626776192752
STAT #140382932318408 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=20944 pr=20941 pw=0 time=188708 us)'
STAT #140382932318408 id=2 cnt=1000000 pid=1 pos=1 obj=75579 op='TABLE ACCESS FULL T2 (cr=20944 pr=20941 pw=0 time=68726 us cost=5738 siz=0 card=1000000)'

Okay. First thing to notice is the ‘kfk: async disk IO’ events are gone. We see the PARSE line, the SQL Net message, and a single ‘direct path read’ wait, and then the FETCH line. This is absolutely different behavior than version 11.2.0.1!
What is happening? I think the only way to understand more about this, is running the debugger again with notification of io_submit, io_getevents, kslwtbctx and kslwtectx:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

That’s odd! There are no calls to the code location’s kslwtbctx and kslwtectx during IO processing! So, despite ‘kfk: async disk IO’ still being a wait event, it doesn’t seem to be instrumenting io_getevents() anymore. What if we make io_getevents() take longer (I’ve used the same number as the earlier catching of the syscall, 0.01 second):

PARSE #140382932318408:c=0,e=168,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973832834
EXEC #140382932318408:c=0,e=68,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973833385
WAIT #140382932318408: nam='SQL*Net message to client' ela= 671 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356627973835367
WAIT #140382932318408: nam='asynch descriptor resize' ela= 680 outstanding #aio=0 current aio limit=152 new aio limit=182 obj#=75579 tim=1356627973837361

*** 2012-12-27 18:06:16.440
FETCH #140382932318408:c=185972,e=2604507,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356627976440376

No, the time in the FETCH line increases (elapsed; e=2604507, cpu; c=185972; there’s 2418535 microseconds of time which is not spend on CPU), but there is only little time instrumented by wait events. A reasonable conclusion is the ‘kfk: disk async IO’ wait event is different in this version (11.2.0.3). I’ve also tried to slow down the io_submit() call the same way, with the same result: the time increases in the FETCH line, but there’s nothing the database writes a wait line for.

What if I slow down the disk? This can be done using cgroups, as described in this post. I’ve set my both my ASM disks to 1 IO per second. That should result in waits! This is the resulting trace output:

PARSE #140382932318408:c=0,e=20,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964787836
EXEC #140382932318408:c=0,e=64,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964788256
WAIT #140382932318408: nam='SQL*Net message to client' ela= 703 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356633964790122
WAIT #140382932318408: nam='asynch descriptor resize' ela= 699 outstanding #aio=0 current aio limit=238 new aio limit=264 obj#=75579 tim=1356633964792014
WAIT #140382932318408: nam='direct path read' ela= 497938 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1356633965295483

*** 2012-12-27 19:46:05.795
WAIT #140382932318408: nam='direct path read' ela= 495498 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1356633965795382

*** 2012-12-27 19:46:06.295
WAIT #140382932318408: nam='direct path read' ela= 495890 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1356633966295208
WAIT #140382932318408: nam='direct path read' ela= 495889 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1356633966795127

So, if I slow the IO down, I get ‘direct path read’ wait events. How does that look when I use the break/continue technique on this configuration?

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

The interesting part are the io_getevents() calls, which consistently are 4 calls before kslwtbctx is called, and another one or two calls are done before kslwtectx is called, and wait time is registered (see above). Interesting, but no ‘kfk: async disk IO’.

Now let’s look at the same output when Oracle 11.2.0.1 is throttled:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()

This looks remarkably the same as the 11.2.0.3 version!

Conclusion

It seems the database issues a number of io_getevents() calls after asynchronous IO calls are submitted with io_submit(), which do return immediately (also known as ‘non-blocking’). They just peak at the completion queue, if an IO is available in the completion queue, the process processes that IO. The number of non-blocking calls is 3 with version 11.2.0.1 and 4 with 11.2.0.3 (please mind the ‘slot’ mechanism which controls the number of simultaneous AIO requests is probably in play here too). If these calls do not return any IO request, the wait interface is called (kslwtbctx()), and a blocking io_getevents() call is issued (which waits for at least 1 request to be returned too), for which the wait time is visible as event ‘direct path read’.

It also seems the first non-blocking io_getevents() call is instrumented by the ‘kfk: async disk IO’ event with Oracle 11.2.0.1, for reasons unknown to me. This instrumentation is not present in version 11.2.0.3.

Recently we upgraded an Exadata to the currently latest version, 11.2.3.2.0. The Exadata software itself consists of an image for the storage servers (the storage servers are essentially re-imaged), and a set of updates for the database/computing nodes, including: firmware for ILOM (lights out adapter), BIOS, LSI RAID adapter, Infiniband adapter, linux kernel, drivers, mandatory packages, to name some.

One of the exceptional things this upgrade does, is remove the hot-spare out of the RAID set on the database/compute nodes. This is documented in MOS note: 1468877.1, as ‘known issue 5: hotspare removed for compute nodes’. For some sites, this actually can be a good thing, if they are really tight on disk space on the compute nodes of Exadata. I must say that we have not encountered this situation. What this means, is that the actual HDD configuration on the compute node is left to the customer, instead of having one mandatory configuration (having 3 disks in a RAID-5 configuration, and one hot-spare).

So if you decide to use the former hot-spare disk as an active part of the RAID configuration, you are effectively trading availability for diskspace. Please mind the RAID set itself already provides redundancy, even without the hot-spare!

On the other hand, I think in most configurations, it makes sense to convert the disk back to being hot-spare.

This is done in the following way:

a) Get an overview of the current disk configuration:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0B70

This shows the disk in slot number 3 being left “unconfigured”, but in “good” state (of a disk has gone bad because of errors, it will be removed from the RAID set, and will show up as “unconfigured(bad)”!)
This is the state the upgrade to 11.2.3.2.0 leaves your system.

New let’s make the disk hot spare again!

b) Get the inclosure id:
/opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a0 | grep ID
Device ID : 252

This means we know the enclosure id (252) and the slot number (3), which is the information needed for the MegaCli utility to revert the unconfigured disk to hot-spare again!

c) Revert the unconfigured disk back to hot-spare
/opt/MegaRAID/MegaCli/MegaCli64 -PdHsp -set -EnclAffinity -PhysDrv[252:3] -a0
Adapter: 0: Set Physical Drive at EnclId-252 SlotId-3 as Hot Spare Success.
Exit Code: 0x00

d) Check the disk configuration again:
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Hotspare, Spun Up
Device Firmware Level: 0B70

It appears that having the disk been removed from the RAID set by the update to 11.2.3.2.0 generates an ASR message. At least it did at our site, despite this being an undocumented bug (7161023,’ASR generating false errors in relation to disks’) which is marked resolved in 11.2.3.1.0 (?). Most sites I encounter have ASR setup, but not having all messages send additionally to local, onsite monitoring. I want to stress it’s very important to have the ASR messages sent to your own monitoring too!

Oracle Support does not list all the specifications from an ASR message it has gotten. Instead, a Service Request is made with enough information for Oracle itself (!!). In our case, the exact error message was NOT specified, only ‘compute server hard disk predictive failure’ and the node name.

Where do you look on an Exadata for that information? The first logical point is the ASR daemon. I didn’t spend too much time on it, but it seems that it’s more a proxy for messages than a database. I wasn’t able to find useful information about the systems which where using this daemon.

What are the sources for ASR with an Exadata? These are:

Computing node:
- “compmon daemon” / Linux level monitoring
- ILOM
Storage node:
- “cell daemon” / Linux level monitoring
- ILOM

For the computing node, it’s quite easy to see if there are any detected failed devices from the viewpoint of the ILOM:
(please mind ipmitool -I open only works on the local system)
# ipmitool -I open sunoem cli "show /SP/logs/event/list Severity==(Major,Critical,Down)"
Connected. Use ^D to exit.
-> show /SP/logs/event/list Severity==(Major,Critical,Down)

ID Date/Time Class Type Severity
----- ------------------------ -------- -------- --------

-> Session closed
Disconnected

This shows no messages with the severity Major, Critical or Down are in the eventlog in the ILOM. Please mind that the logons to the ILOM have severity “Minor”. These are in most system the vast majority of the messages, which are not of interest for this investigation. If you want to know if something has failed, there even a simpler command:
# ipmitool -I open sunoem cli "show faulty"

For the “compmon daemon”, grep the processlist for “compmon”:
# ps -ef | grep compmon
root 12812 1 0 Oct22 ? 00:00:11 /usr/bin/perl -w /opt/oracle.cellos/compmon/exadata_mon_hw_asr.pl -server

The most important part here is the directory: /opt/oracle/cellos/compmon
If you navigate to that directory, you will see a number of “state files”: asrs.state, traps.state and disks.state.
The disks.state lists the disk status as listed with a) with the firmware state.
The most important file for the ASR message investigation is the traps.state file. This file lists traps it has sent to ASR. In our case:
1 ; Mon Oct 22 14:39:10 2012 ; 86425886-b359-4587-8d46-f31ff2ecb135 ; Physicaldisk : Make Model: is at status predictive failure. Raised fault id: HALRT-02008 ; Physical disk should be replaced. Exadata Compute Server: Disk Serial Number:
Yes, this is pasted correctly, it misses Physicaldisk, Make Model and Disk Serial Number information. This has not been omitted for safety, it just is not listed.
So, the failure which was sent was HALRT-02008 in our case.

For completeness, the ILOM layer can be investigated identically to the description of the ILOM handling on the computing layer. The Linux layer messages can be investigated with: # cellcli -e list alerthistory
32 2012-10-17T02:00:27+02:00 info "HDD disk controller battery on disk contoller at adapter 0 is going into a learn cycle. This is a normal maintenance activity that occurs quarterly and runs for approximately 1 to 12 hours. The disk controller cache might go into WriteThrough caching mode during the learn cycle. Disk write throughput might be temporarily lower during this time. The message is informational only, no action is required."
33 2012-10-22T11:43:21+02:00 info "Factory defaults restored for Adapter 0"
34 2012-10-22T11:43:23+02:00 info "Factory defaults restored for Adapter 0"

If you are administering an Oracle Exadata database machine, which base operating system image (the operating system version with which it system came) is Linux and version 11.2.3.1.0 (current version is viewable with the command ‘imageinfo’, which needs root account privileges) or higher, and multiple users are accessing the system with password authentication, this blogpost might be an interesting read. Also, if you have witnessed temporary lockout of the oracle user, or other users: this blogpost describes the reason and a potential resolution.

I administer several Exadata database machines, which are not all delivered at the same time, so the base image version is different. Because I also administer the Linux operating system on the computing nodes in the Exadata database machines, I noticed the Linux settings slightly differ among the different Exadata database machine computing nodes. There is a positive side to this: apparently the team that maintains the base image does not only renew the packages on the image, but also get feedback about the settings, and change things to improve something. There is also a negative side to this: these changes are not documented anywhere (that I am aware of), so getting a new system always is a bit exciting, because things might have been changed. Or not…

I speak about ‘base image’ deliberately. After a system is delivered with a certain ‘base image’ (collection of kernel, executables, os-scripts and settings), the kernel, executables and os-scripts are renewed with an upgrade, but the settings remain the same. This blogpost is about a PAM (pluggable authentication modules) setting, which I encountered on base-image 11.2.3.1.0.

I witnessed the ‘oracle’ account being locked out temporarily on a system. The reason was a series of unsuccesful logon attempts. This could be something which complies with somebodies security standards (but who’s?). I think the ‘oracle’ account on an Oracle database system being locked out (albeit temporarily) is highly undesirable on most systems. As I’ve described earlier: this is a setting which the Exadata image engineering team decided to do with pluggable authentication modules.

Meet pam_tally2.so…
In the directory /etc/pam.d there are two files which configure the temporary lockout when a series of unsuccesful logon attempts have been made: ‘login’ and ‘sshd’. The actual line responsible for the lockout is:

auth       required     pam_tally2.so deny=5 onerr=fail lock_time=600

In most environments where no actual security compliancy rules are known, I disable this behaviour by commenting it with a hash (‘#’) sign as the first character on the line. Of course you could read the manpage of pam_tally2 (search on the internet for ‘man pam_tally2′) and configure it to your linking or to comply with your security rules.

Recently I patched an 11.2.0.2 grid infrastructure to an higher version. After the patching I started the grid infrastructure on that host, and ASM was unable to start. Looking in the alert.log file of the ASM instance it turned out that upon starting ASM, even before the contents of the pfile/spfile was displayed, the ASM crashed with the ORA-00600 error:

Sat Oct 13 14:35:07 2012
NOTE: No asm libraries found in the system
* instance_number obtained from CSS = 1, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 1 
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 30960 free: 30960)
DFLT Huge Pages allocation successful (allocated: 67107111)
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
CELL communication is configured to use the following interface(s) for this instance
    1.1.1.1 
CELL interconnect IPC version: Oracle RDS/IP (generic)
IPC Vendor 1 Protocol 3
  Version 4.1
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/11.2.0/grid/dbs/arch
Autotune of undo retention is turned on. 
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NOTE: Volume support  enabled
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_20405.trc  (incident=66599):
ORA-00600: internal error code, arguments: [kmgs_component_init_3], [60], [65], [17], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_66599/+ASM1_ora_20405_i66599.trc
Sweep [inc][66599]: completed
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

A quick google didn't result in any hits, nor searching the MOS knowledge base and bug database.

I decided to start the ASM instance manually, and got an interesting message on screen just prior to the instance crashing:

ORA-32004: obsolete or deprecated parameter(s) specified for ASM instance

Upon investigation of the contents of the spfile, it turned out the parameter 'remote_os_authent' was set for the ASM instance. Removing this parameter returned normal, expect behaviour, in other words: not crashing the ASM instance.

I tried to replay/redo this failure in a VM on my laptop (using Oracle Grid Infrastructure 11.2.0.3 in non-clustered mode), but was UNABLE to crash ASM by setting 'remote_os_authent' to both true and false. So it could either be a 'cluster feature' or a feature specific to version 11.2.0.2.

The reason for this blogpost is to have a little bit of documentation on the OERI [kmgs_component_init_3], so if anybody encounters this problem, there is at least some clue to be found, or even a resolution.

ps. kmgs_component_init is in the VOS layer (operating system depended layer)

This post really is about using LVM (Logical Volume Manager; an abstraction layer for disk devices) snapshots. A snapshot is a frozen image of a logical volume, which simply means “filesystem”. It’s not really “frozen”, LVM2 snapshots are read/write by default. But you can freeze a filesystem in time with a LVM snapshot.

The background of this really is Exadata (computing node) and upgrading, but has nothing unique to Exadata. So don’t let this bother you. But the idea of using LVM snapshots popped up when dealing with Exadata computing nodes and upgrades.

First of all: LVM is in development, which mean different Linux versions have different LVM options available to them. I am using the Exadata X2 Linux version: RHEL/OL 5u7 x86_64. I guess OL6 has more and more advanced features inside LVM, but with X2, OL5u7 is what I have to use. So the steps in this blogpost are done with this version. Any comments are welcome!

Second: if you want to experiment with this: most people allocate all space in the volume group to logical volumes upfront. A snapshot is a COW (Copy on Write) copy of a logical volume. This means a snapshot starts off with zero extra bytes (source and snapshot are equal), and grows as the source gets modified. This means you need to have free/available space in the volume group to facilitate the snapshot.

Then there is another caveat: the ‘/boot’ filesystem cannot be in LVM, so is a normal partition on most systems (also on Exadata). This means snapshots do not help if you want a backup of that filesystem. You need to use another trick.

Okay, here we go: you have a large modification upcoming and want to be able to restore your system to this moment in time.

1. Backup /boot filesystem

[root@localhost ~]# df /boot 
Filesystem           1K-blocks  Used Available Use%  
Mounted on /dev/sda1    248895 24037    212008  11% /boot
[root@localhost ~]# umount /boot
[root@localhost ~]# dd if=/dev/sda1 of=dev_sda1_backup
[root@localhost ~]# mount /boot

2. Create snapshot of logical volumes
In this example it’s only the root filesystem (which is a bit special, because this filesystem is set in grub.conf for the bootloader, and in /etc/fstab).

[root@localhost ~]# lvdisplay -v /dev/vg00/lv_root 
Using logical volume(s) on command line 
/dev/hdc: open failed: No medium found 
--- Logical volume --- 
LV Name /dev/vg00/lv_root 
VG Name vg00 
LV UUID wutwln-ffdB-QRlg-1LgL-XqKB-glvn-OpCowW 
LV Write Access read/write 
LV Status available # open 1 
LV Size 3.91 GB 
Current LE 125 
Segments 1 
Allocation inherit 
Read ahead sectors auto 
- currently set to 256 
Block device 253:0 

lvdisplay displays the properties of the logical volume to be snapshotted, I use this to spot the number of LE’s.

[root@localhost ~]# lvcreate -l 125 -s /dev/vg00/lv_root -n lv_root_snap 
/dev/hdc: open failed: No medium found 
Logical volume "lv_root_snap" created

The system now has been backed up in a way that we can revert to this situation.

Next would be to do the O/S upgrade, software upgrade or whatever. It goes horribly wrong, and you need to restore the system to the previous situation.

1. Rename current logical volume
In this case, I rename my lv_root logical volume to lv_root_old:

[root@localhost ~]# lvrename /dev/vg00/lv_root /dev/vg00/lv_root_old
/dev/hdc: open failed: No medium found 
Renamed "lv_root" to "lv_root_old" in volume group "vg00"

The logical volume (which we currently use) has been renamed.

2. Create new lv_root
This is the logical volume we are going to use to put the snapshot in.

[root@localhost ~]# lvcreate -l 125 -n lv_root /dev/vg00
Logical volume "lv_root" created

3. Populate the new lv_root with the snapshot contents

[root@localhost ~]# dd if=/dev/vg00/lv_root_snap of=/dev/vg00/lv_root 	
8192000+0 records in 
8192000+0 records out 
4194304000 bytes (4.2 GB) copied, 93.0858 seconds, 45.1 MB/s

4. Restore the /boot filesystem

[root@localhost ~]# umount /boot
[root@localhost ~]# dd if=dev_sda1_backup of=/dev/sda1
[root@localhost ~]# mount /boot

5. Reboot

[root@localhost ~]# shutdown -r now

This sequence of events enabled me to restore my system to a post-modification situation. Of course you should test this very thoroughly for your own situation, but this offers an elegant way, which has little external dependencies.

Post restore/cleaning up:
When the system is reverted to its old situation, we are left with a logical volume and a snapshot which probably are not of use anymore. These can be cleaned up the following way:

[root@localhost ~]# lvremove /dev/vg00/lv_root_old 
Do you really want to remove active logical volume lv_root_snap? [y/n]: y
Logical volume "lv_root_snap" successfully removed 
Do you really want to remove active logical volume lv_root_old? [y/n]: y 
Logical volume "lv_root_old" successfully removed
Follow

Get every new post delivered to your Inbox.

Join 1,047 other followers

%d bloggers like this: