Archive

Oracle EE

For some time now, I am using gdb to trace the inner working of the Oracle database. The reason for using gdb instead of systemtap or Oracle’s dtrace is the lack of user-level tracing with Linux. I am using this on Linux because most of my work is happening on Linux.

In order to see the same information with gdb on the system calls of Oracle as strace, there’s the Oracle debug info repository. This requires a bit of explanation. When strace is used on a process doing IO that Oracle executes asynchronous, the IO calls as seen with strace look something like this:

io_submit(140425370206208, 1, {{0x7fb7516c4bc0, 0, 0, 0, 257}}) = 1
io_getevents(140425370206208,1,128,{{0x7fb7516c45e8,0x7fb7516c45e8,106496,0}}, {600, 0}) = 1

This reveals exactly how Oracle used these calls. In case you wonder how to read these calls: Linux (as well as any other Unix like operating system) provides man pages (manual pages) for not only for the command line tools, but also on system calls, c library functions and device and special files, among others. So if you wonder what the io_submit line means, type ‘man io_submit’, or to be 100% sure you look in the manual pages of the system calls, type ‘man 2 io_submit’ to specify you want section 2: system calls.

When I use gdb, and break on io_submit and io_getevents, I get this information:

Breakpoint 1, 0x00007fa883926660 in io_submit () from /lib64/libaio.so.1
Breakpoint 1, 0x00007fa883926660 in io_submit () from /lib64/libaio.so.1
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()

I think everybody can spot that I got less information now. In fact, I now know the calls have happened, and that’s all, there is no additional information. In order to get part of the information back that was visible with strace, use the debuginfo package of libaio. The debug info package must match 100% the version of the package it is meant to provide debug symbols about, because it provides debug information about the executable or library based on physical code locations.

In order to get information on these specific calls (libaio calls), the libaio-debuginfo package can be installed. Once done, we get a great deal of information which resembles strace:

Breakpoint 1, io_submit (ctx=0x7ff8b626c000, nr=1, iocbs=0x7fffa5c31a80) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 2, io_getevents_0_4 (ctx=0x7ff8b626c000, min_nr=2, nr=128, events=0x7fffa5c37b68, timeout=0x7fffa5c38b70) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

This shows all the arguments which are used by the process which is traced with gdb. Please mind that gdb breaks on entering the call, so it doesn’t give a return code. And the return code of io_getevents() is what returns the number of IO’s which are ready, so that information is still not visible, but is visible with strace, which does provides the return code.

How about the Oracle user land calls? I use breaking on kslwtbctx() and kslwtectx() a lot, which indicate the starting (kslwtbctx()) and stopping (kslwtectx()) of a wait event. When doing so, this is how it looks like:

Breakpoint 1, 0x00007f40a05c3660 in io_submit () from /lib64/libaio.so.1
Breakpoint 1, 0x00007f40a05c3660 in io_submit () from /lib64/libaio.so.1
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 4, 0x0000000007cf47b6 in kslwtbctx ()
Breakpoint 2, 0x000000000082d7d8 in io_getevents@plt ()
Breakpoint 5, 0x0000000007cfb4f2 in kslwtectx ()

Here we see the libaio functions again, together with the Oracle wait event functions. When using these calls this way, we can safely say that there are some calls done outside of a wait, and one call is done inside of a wait. Because this measurement is done on a well known piece of Oracle code (well known to me: executing a full table scan via direct path), I just know the wait is ‘direct path read’. But what if you do not know? Wouldn’t it be nice to know which wait is called here?

The simplest way to get more information on Oracle function calls is to get the debug information for the Oracle database. However, since that makes references to the source code, that will probably never happen. So, does that mean this is all we can get? No.

In order to get more information out of a function call, we need to dive a little deeper into the internals of Linux x86_64. When a function is called, the arguments are passed on via processor registers. This is implementation specific, and differs between 32-bit and 64-bit. An overview of how that works is summarised in this table. The important line is: “The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX, R8, and R9, while XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for floating point arguments. For system calls, R10 is used instead of RCX.”

So. This means that if I look at the CPU registers when breaking on a function, there might be something usable. I say “something usable” deliberately, because the Oracle function calls are not publicly documented (I think/hope they are inside Oracle development). I’ve done some investigation, and it turns out that at the END of a wait event, there are a few functions which are called which have some information stored in a CPU register which is useful:
a) First the function kslwtectx() is called to mark the ending of a wait event.
b) Then a function called kslwtrk_enter_wait_int is called, which stores the time the took in the register R13.
c) Next a function called kskthewt is called, which stores the number of the wait event (V$EVENT_NAME.EVENT#) in RSI.

If we combine that information in a little gdb macro, it looks like this:

break kslwtbctx
  commands
    silent
    printf "kslwtbctx\n"
    c
  end
break kslwtectx
  commands
    silent
    printf "kslwtectx -- "
    c
  end
break kslwtrk_enter_wait_int
  commands
    silent
    set $time=$r13
    c
  end
break kskthewt
  commands
    silent
    printf "wait: %d, time: %d\n", $rsi, $time
    c
  end

Put this in a text file, and once attached to a process to trace with gdb, load it using ‘source ‘.
Here is how it looks like when you put it on a process (I’ve put it on the checkpoint process):

kslwtbctx
kslwtectx -- wait: 7, time: 2999054
kslwtbctx
kslwtectx -- wait: 81, time: 1979
kslwtbctx
kslwtectx -- wait: 81, time: 1050
kslwtbctx
kslwtectx -- wait: 81, time: 1216
kslwtbctx
kslwtectx -- wait: 81, time: 2031
kslwtbctx
kslwtectx -- wait: 83, time: 10443

If you want to learn more about this stuff, don’t forget I will be doing a hands-on session on using gdb as a pre-conference training during Collaborate 2014 in Las Vegas.

When you are administering an Exadata or more Exadata’s, you probably have multiple databases running on different database or “computing” nodes. In order to understand what kind of IO you are doing, you can look inside the statistics of your database, and look in the data dictionary what that instance or instances (in case of RAC) have been doing. When using Exadata there is a near 100% chance you are using either normal redundancy or high redundancy, of which most people know the impact of the “write amplification” of both normal and high redundancy of ASM (the write statistics in the Oracle data dictionary do not reflect the additional writes needed to satisfy normal (#IO times 2) or high (#IO times 3) redundancy). This means there might be difference in IOs between what you measure or think for your database is doing, and actually is done at the storage level.

But what if you want to know what is happening on the storage level, so on the level of the cell or actually “cellsrv”, which is the process which makes IO flow to your databases? One option is to run “iostat -x”, but that gives a list that is quite hard readable (too much disk devices); and: it doesn’t show you what the reason for the IO was: redo write? controlfile read? Archivelog? This would especially be great if you want to understand what is happening if your IO behaves different than you expect, and you’ve ruled out IORM.

Well, it is possible to get an IO overview (cumulative since startup)! Every storage server keeps a table of IO reasons. This table can be dumped into a trace file on the cell; to generate a dump with an overview of what kind of IOs are done; use “cellcli” locally on a cell, and enter the following command:

alter cell events="immediate cellsrv.cellsrv_dump('ioreasons',0)";

This doesn’t generate anything useful as output on the command line, except for the name of the thread-logfile where we can find the contents of the dump we requested:

Dump sequence #18 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_14.trc
Cell enkcel01 successfully altered

As an aid for searching your dump in thread-logfile: search (“/” when you use “less” for it), enter the following (using the above example, with sequence #18): “/sequence\ #18″, without ‘”‘.

This is an example from a cell in the Enkitec lab, which I used for this example:

Cache::dumpReasons           I/O Reason Table
2013-10-23 08:11:06.869047*: Dump sequence #18:
Cache::dumpReasons Reason                  Reads Writes
Cache::dumpReasons ------------------------------------
Cache::dumpReasons UNKNOWN                436784 162942
Cache::dumpReasons RedoLog Write               0  80329
Cache::dumpReasons RedoLog Read              873      0
Cache::dumpReasons ControlFile Read       399993      0
Cache::dumpReasons ControlFile Write           0 473234
Cache::dumpReasons ASM DiskHeader IO        4326      4
Cache::dumpReasons BufferCache Read        27184      0
Cache::dumpReasons DataHeader Read          2627      0
Cache::dumpReasons DataHeader Write            0   1280
Cache::dumpReasons Datafile SeqRead           45      0
Cache::dumpReasons Datafile SeqWrite           0    373
Cache::dumpReasons HighPriority Checkpoint Write      0   6146
Cache::dumpReasons DBWR Aged Write             0    560
Cache::dumpReasons ReuseBlock Write            0    150
Cache::dumpReasons Selftune Checkpoint Write      0 116800
Cache::dumpReasons RequestLit Write            0     25
Cache::dumpReasons Archivelog IO               0    255
Cache::dumpReasons TrackingFile IO          2586   2698
Cache::dumpReasons ASM Relocate IO             0    200
Cache::dumpReasons ASM Replacement IO          0     91
Cache::dumpReasons ASM CacheCleanup IO         0   4514
Cache::dumpReasons ASM UserFile Relocate       0   2461
Cache::dumpReasons ASM Redo IO                 0  10610
Cache::dumpReasons ASM Cache IO             1953      0
Cache::dumpReasons ASM PST IO                  0     44
Cache::dumpReasons ASM Heartbeat IO           26 162984
Cache::dumpReasons ASM BlockFormat IO          0   3704
Cache::dumpReasons ASM StaleFile IO            0    675
Cache::dumpReasons OSD Header IO               0    315
Cache::dumpReasons Smart scan              11840      0

Please mind the numbers here are IOs, it doesn’t say anything about the size of the IOs. Also please mind these are numbers of a single cell, you probably have 3, 7 or 14 cells.

In my opinion this IO summary can be of much value during IO performance investigations, but also during proofs of concept.

If the cell has been running for a while, these number may grow very big. In order to make an easy baseline, the IO reason numbers can be reset, so you can start off your test or proof-of-concept run and measure what actually has happened on the cell layer! In order to reset the IO reason table, enter the following command in the cellcli:

alter cell events = "immediate cellsrv.cellsrv_resetstats(ioreasons)"; 

This will reset the IO reasons table in the cell.

PS1: Thanks to Nikolay Kovachev for pointing out the ‘ioreasons’ resetstats parameter. Indeed ‘all’ is way too blunt.
PS2: The IO numbers seem to be the number IO requests the cell has gotten from it’s clients (ASM and database) for data, not for metadata. During a smartscan metadata flows in between the database and the cell server before data is actually served.

Exadata gets its performance by letting the storage (the exadata storage server) participate in query processing, which means part of the processing is done as close as possible to where the data is stored. The participation of the storage server in query processing means that a storage grid can massively parallel (depending on the amount of storage servers participating) process a smart scan request.

However, this also means additional CPU is used on the storage layer. Because there is no real limit on how many queries can use smartscans (and/or hybrid columnar compression, in other words: processing) on the available storage servers, this means a storage server can get overloaded, which could hurt performance. To overcome this problem, Oracle introduced the ‘passthrough’ functionality in the storage server. In the exadata book, it is explained that this functionality came with storage server version 11.2.2.3.0 and Oracle database version 11.2.0.3 and Exadata bundle patch 7. It also explains that the ‘passthrough’ functionality means that the storage server deliberately starts sending non-storage processed data during the smartscan. So when this happens, you still do a smartscan (!), but your foreground process or parallel query slave gets much more data, and needs to process more. The database-side statistic to know this is happening is “cell physical IO bytes sent directly to DB node to balance CPU usage” which is at the database level in v$sysstat and on the session level in v$sesstat.

But how does this work on the storage server?

On the storage server, the passthrough mode properties are governed by a few “underbar” or “undocumented” parameters in the storage server. In order to get the (current) values of the Exadata storage server, the following command can be used on the “cellcli”:

alter cell events="immediate cellsrv.cellsrv_dump('cellparams',0)";

The cell will echo the thread-logfile in which the output of this dump is put:

Dump sequence #1 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_87.trc
Cell enkcel01 successfully altered

Now load this tracefile (readonly) in your favourite text manipulation tool (I use ‘less’ because less is more).

The “underbar” parameters which are of interest are the following parameters:

_cell_mpp_cpu_freq = 2
_cell_mpp_threshold = 90
_cell_mpp_max_pushback = 50

The “MPP” function is responsible for the passthrough functionality. I can’t find anywhere what “MPP” means, my guess is “Measurement (of) Performance (for) Passthrough”. These parameters govern how it works.

_cell_mpp_cpu_freq seems to be the frequency at which the MPP code measures the host CPU, “2″ means per “200ms”.
_cell_mpp_threshold seems to be the CPU usage threshold after which the passthrough functionality kicks in.
_cell_mpp_max_pushback seems to be the maximum percentage of blocks (unsure what the exact granularity is) which are sent to the database in passthrough mode.

In order to get a good understanding of what MPP does, there is a MPP specific dump which could be very beneficial to diagnose MPP related matters. This dump is available on the storage server, which means in the cellcli:

alter cell events="immediate cellsrv.cellsrv_dump('mpp_stats',0)";

The cell will once again echo the thread-logfile in which the output of this dump is put:

Dump sequence #8 has been written to /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
Cell enkcel01 successfully altered

Now peek in the tracefile!

Trace file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/enkcel01/trace/svtrc_15737_22.trc
ORACLE_HOME = /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.32-400.11.1.el5uek
Version:        #1 SMP Thu Nov 22 03:29:09 PST 2012
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.2.1_LINUX.X64_130109

*** 2013-10-21 10:54:05.994
UserThread: LWPID: 16114 userId: 22 kernelId: 22 pthreadID: 0x4d1e5940
2013-10-21 14:36:04.910675*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 0 out of 4232 total blocks.  Percent = 0.000000%
2013-10-21 14:36:04.910675*: Dump sequence #3:
[MPP] Current cell cpu utilization: 7
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 8 [Pushback Rate] 0
[MPP] Mon Oct 21 14:36:04 2013 [Cell CPU History] 7 [Pushback Rate] 0
...
[MPP] Mon Oct 21 14:05:57 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0
[MPP] Mon Oct 21 14:05:56 2013 [Cell CPU History] 1 [Pushback Rate] 0

So, what do we see here? We see a cell tracefile which is in the well-known Oracle trace format. This means it starts off with a header which is specific to the cell server.

Then we see a timestamp with three asterisks in front of it. The time in the timestamp is (10:54:05.994), which is is roughly 3 hours and 30 minutes earlier than the timestamp with the next messages, which is 14:36:04.910. The line with the three asterisks is the creation timestamp of the tracefile, which is when the thread which we are using was created. The next line is also created just after that time, which lists the LWPID, userId, etc.

The line with [MPP] is created because of the mpp_stats dump. The timestamp has a single asterisk after it, which means the time is an approximation. The line tells important information: during the approximate 30 minutes in this dump, 4232 blocks where processed by this cell, and 0 blocks where “executed” in “passthru” mode.

Next, the measurements which where taken every 200ms of the past are printed to get an exact overview of measured CPU business and the rate at which “pushback” alias “passthru” was applied.

To see what this means, let’s generate CPU business on the cell, and see if we can get the storage server to invoke “passthru”. There is a simple trick to let a fake process take 100% on it’s CPU thread with common linux shell tools: ‘yes > /dev/null &’. The storage server which I use has 16 CPU threads, so I start 16 of these processes, to effectively have a process that can monopolise every CPU thread.

Next, I started a (sufficiently large; 7GB) scan on a table via sqlplus, and then dumped ‘mpp_stats’ using the method described in this blog.

2013-10-21 16:30:27.977624*: [MPP] Number of blocks executed in passthru mode because of high CPU utilization: 2728 out of 13287 total blocks.  Percent = 20.531347%
2013-10-21 16:30:27.977624*: Dump sequence #10:
[MPP] Current cell cpu utilization: 8
[MPP] Mon Oct 21 16:30:27 2013 [Cell CPU History] 8 [Pushback Rate] 0
...
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 96 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:13 2013 [Cell CPU History] 95 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 95 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:12 2013 [Cell CPU History] 96 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 95 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 98 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:11 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 99 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 100 [Pushback Rate] 50
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 98 [Pushback Rate] 45
[MPP] Mon Oct 21 16:30:10 2013 [Cell CPU History] 97 [Pushback Rate] 40
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 98 [Pushback Rate] 35
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 30
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 25
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 100 [Pushback Rate] 20
[MPP] Mon Oct 21 16:30:09 2013 [Cell CPU History] 99 [Pushback Rate] 15
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 100 [Pushback Rate] 10
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 5
[MPP] Mon Oct 21 16:30:08 2013 [Cell CPU History] 99 [Pushback Rate] 0
...

This shows it all! The header shows that during the last 30 minutes, this storage server sended 2728 blocks out of the total of 13287 blocks via passthrough mode. Further, in the lines which contain the historical measurements, the “pushback rate” can be seen climbing up to 50%, because the CPU usage was above 90%.

Please mind the techniques I’ve described here are done on one storage server, while a normal Exadata setup has 3 (8th/quarter rack), 7 (half rack) or 14 (full rack) storage servers.

This post is about database writer (dbwr, mostly seen as dbw0 nowadays) IO.
The testenvironment in which I made the measurements in this post: Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM. The test environment is a (VMWare Fusion) VM, with 2 CPU’s.

It might be a good idea to read my previous blog about logwriter IO.

The number of database writers is depended on the number of CPU’s visible to the instance (when not explicitly set with the DB_WRITER_PROCESSES parameter), and seems mostly to be CEIL(CPU_COUNT/8). There might be other things which could influence the number (NUMA comes to mind). In my case, I’ve got 2 CPU’s visible, which means I got one database writer (dbw0).

Let’s start the investigation! I enable a 10046 level 8 trace on the database writer. When you look at an idle Oracle database, you probably see the database writer process behaving the same as described in the first part of the log writer IO blog, which is sleeping at the wait event ‘rdbms ipc message’. This is the database writer waiting for work:

*** 2013-08-30 12:42:00.291
WAIT #0: nam='rdbms ipc message' ela= 3001013 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859320291675

*** 2013-08-30 12:42:03.292
WAIT #0: nam='rdbms ipc message' ela= 3000870 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859323292672

*** 2013-08-30 12:42:06.293
WAIT #0: nam='rdbms ipc message' ela= 3000806 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859326293581

There’s a description about this wait in the previously mentioned log writer blog. The short version is: ‘rdbms ipc message’ means: waiting to be posted for work.

It helps to make a little table and insert one or a few rows into it, and then commit. This dirties blocks in the buffercache. After that, issue ‘alter database flush buffer_cache’ to make Oracle write all the dirty blocks currently in the buffer cache. This is what is shown in the database writer trace file:

WAIT #0: nam='rdbms ipc message' ela= 328100 timeout=300 p2=0 p3=0 obj#=-1 tim=1377868410713860
WAIT #0: nam='db file async I/O submit' ela= 1 requests=14 interrupt=0 timeout=0 obj#=-1 tim=1377868410715372
WAIT #0: nam='db file parallel write' ela= 26 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1377868410715427

The first line is an ‘rdbms ipc message’ with an elapsed time (ela=) shorter than 3 seconds, because we triggered the database writer with the ‘alter database flush buffer_cache’. The database writer is notified to do something, and stops waiting.

Then we see two waits: ‘db file async I/O submit’ and ‘db file parallel write’. ‘db file parallel write’ is a wait I suspected, Oracle must write these dirty blocks, I/O costs time, and it’s logical to have a wait event for that. I did see the wait event ‘db file async I/O submit’ in some AWR reports, but haven’t encountered it at the top end of the waits, which means I haven’t looked/needed to look at it yet…

Well, let’s dive in one layer deeper, and combine strace with a 10046 level 8 tracefile using ‘strace -e write -e all -p PID’

semtimedop(851971, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
write(18, "\n*** 2013-08-30 15:32:44.063\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 35 3a 33 32 3a 34 34  2e 30 36 33 0a           15:32:44 .063.    |
lseek(18, 0, SEEK_CUR)                  = 682279
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 32  38 31 32 31 31 20 74 69   ela= 12 81211 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 36 39 35  36 34 30 36 33 36 30 31  13778695 64063601 |
write(19, "J?2-u4~UW1\n", 11)           = 11
 | 00000  4a 3f 32 2d 75 34 7e 55  57 31 0a                 J?2-u4~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
io_submit(139931766530048, 24, {{0x7f44671a7a50, 0, 1, 0, 256}, {0x7f44671a7ca8, 0, 1, 0, 257}, {0x7f44671a7f00, 0, 1, 0, 256}, {0x7f44671a8158, 0, 1, 0, 256}, {0x7f44671a83b0, 0, 1, 0, 256}, {0x7f44671a8608, 0, 1, 0, 257}, {0x7f44671a8860, 0, 1, 0, 257}, {0x7f44671a8ab8, 0, 1, 0, 257}, {0x7f44671a8d10, 0, 1, 0, 257}, {0x7f44671a8f68, 0, 1, 0, 257}, {0x7f44671a91c0, 0, 1, 0, 257}, {0x7f44671a9418, 0, 1, 0, 257}, {0x7f44671a9670, 0, 1, 0, 257}, {0x7f446719e1f8, 0, 1, 0, 256}, {0x7f44671a77f8, 0, 1, 0, 257}, {0x7f44671a75a0, 0, 1, 0, 257}, {0x7f44671a7348, 0, 1, 0, 256}, {0x7f44671a70f0, 0, 1, 0, 256}, {0x7f44671a6e98, 0, 1, 0, 256}, {0x7f44671a6c40, 0, 1, 0, 256}, {0x7f44671a69e8, 0, 1, 0, 256}, {0x7f44671a6790, 0, 1, 0, 256}, {0x7f44671a6538, 0, 1, 0, 256}, {0x7f44671a62e0, 0, 1, 0, 256}}) = 24
lseek(18, 0, SEEK_CUR)                  = 682376
write(18, "WAIT #0: nam='db file async I/O "..., 109) = 109
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 61 73  79 6e 63 20 49 2f 4f 20   file as ync I/O  |
 | 00020  73 75 62 6d 69 74 27 20  65 6c 61 3d 20 31 20 72  submit'  ela= 1 r |
 | 00030  65 71 75 65 73 74 73 3d  32 34 20 69 6e 74 65 72  equests= 24 inter |
 | 00040  72 75 70 74 3d 30 20 74  69 6d 65 6f 75 74 3d 30  rupt=0 t imeout=0 |
 | 00050  20 6f 62 6a 23 3d 2d 31  20 74 69 6d 3d 31 33 37   obj#=-1  tim=137 |
 | 00060  37 38 36 39 35 36 34 30  36 36 31 36 37           78695640 66167    |
write(19, "J?JV~1j1\n", 9)              = 9
 | 00000  4a 3f 4a 56 7e 31 6a 31  0a                       J?JV~1j1 .        |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
io_getevents(139931766530048, 2, 128, {{0x7f44671a7a50, 0x7f44671a7a50, 8192, 0}, {0x7f44671a7ca8, 0x7f44671a7ca8, 8192, 0}, {0x7f44671a7f00, 0x7f44671a7f00, 8192, 0}, {0x7f44671a8158, 0x7f44671a8158, 8192, 0}, {0x7f44671a83b0, 0x7f44671a83b0, 8192, 0}, {0x7f44671a8608, 0x7f44671a8608, 8192, 0}, {0x7f44671a8860, 0x7f44671a8860, 8192, 0}, {0x7f44671a8ab8, 0x7f44671a8ab8, 8192, 0}, {0x7f44671a8d10, 0x7f44671a8d10, 8192, 0}, {0x7f44671a8f68, 0x7f44671a8f68, 8192, 0}, {0x7f44671a91c0, 0x7f44671a91c0, 8192, 0}, {0x7f44671a9418, 0x7f44671a9418, 8192, 0}, {0x7f44671a9670, 0x7f44671a9670, 8192, 0}, {0x7f446719e1f8, 0x7f446719e1f8, 8192, 0}, {0x7f44671a77f8, 0x7f44671a77f8, 8192, 0}, {0x7f44671a75a0, 0x7f44671a75a0, 8192, 0}, {0x7f44671a7348, 0x7f44671a7348, 16384, 0}, {0x7f44671a70f0, 0x7f44671a70f0, 8192, 0}, {0x7f44671a6e98, 0x7f44671a6e98, 8192, 0}, {0x7f44671a6c40, 0x7f44671a6c40, 16384, 0}, {0x7f44671a69e8, 0x7f44671a69e8, 8192, 0}, {0x7f44671a6790, 0x7f44671a6790, 8192, 0}, {0x7f44671a6538, 0x7f44671a6538, 8192, 0}, {0x7f44671a62e0, 0x7f44671a62e0, 8192, 0}}, {600, 0}) = 24
write(18, "WAIT #0: nam='db file parallel w"..., 117) = 117
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 35 33 20 72  rite' el a= 153 r |
 | 00030  65 71 75 65 73 74 73 3d  32 20 69 6e 74 65 72 72  equests= 2 interr |
 | 00040  75 70 74 3d 30 20 74 69  6d 65 6f 75 74 3d 32 31  upt=0 ti meout=21 |
 | 00050  34 37 34 38 33 36 34 37  20 6f 62 6a 23 3d 2d 31  47483647  obj#=-1 |
 | 00060  20 74 69 6d 3d 31 33 37  37 38 36 39 35 36 34 30   tim=137 78695640 |
 | 00070  36 36 35 35 30                                    66550             |

I started this snippet from strace with the semtimedop() system call. If you look at the extended sql trace (10046 level 8) output, you see that the databasewriter stopped sleeping on the semaphore after 1.28 seconds. Then some writes are done (to the .trc and .trm files), and a times() system call to fetch usermode and kernelmode times, after which an important system call is shown: io_submit(). io_submit() submits 24 IOs here.
After the io_submit() system call, the process writes a wait event to the trace: ‘db file async I/O submit’, with an elapsed time (ela=) of 1 microsecond.

This looks like Oracle instrumented the submitting of IOs for the database writer! This is different from the logwriter and normal database foreground processes, where the io_submit() system call is not instrumented!

This wait event also shows the number of IOs (requests=24) which are submitted by the io_submit() call.

Then we see the call to reap the writes, io_getevents(). The io_getevents() system call needed at least 2 IOs ready to stop waiting (!). But instead, it was able to reap all 24 submitted IOs (you could count all the iocb’s in the io_getevents() call, or look at the returncode of the call, which is 24). The wait event of the reap is ‘db file parallel write’, as you can see from the write() system call. This is where it gets truly interesting, because when you look at the information in the wait line, you see requests=2, which probably has a relationship with the min_nr parameter in the io_getevents() call. We also see the io_getevents() call has a timeout of 600 seconds. This seems to be a general timeout for asynchronous IO calls done by Oracle on linux. What stands out for me is that the wait line does not contain any information about how many requests it did reap; two I/O’s ready was enough for the system call to succeed.

Now let’s switch to synchronous IO. I probably have mentioned this a couple of times, but if you don’t do asynchronous IO, you are doing it seriously wrong. Many of the newer IO functions in Oracle have been purpose-built for asychronous IO; direct path reads is a famous example, for which asynchronous IO gives you the opportunity to overcome the IO latency penalty of every distinct IO.

This is how writing by the database writer looks like with the combination extended SQL trace and strace when disk_asynch_io is set to false, when the databasewriter is triggered by ‘alter system flush buffer_cache’:

semtimedop(983043, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
write(18, "\n*** 2013-08-30 16:34:19.162\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 36 3a 33 34 3a 31 39  2e 31 36 32 0a           16:34:19 .162.    |
lseek(18, 0, SEEK_CUR)                  = 9915
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 36  35 34 35 38 36 20 74 69   ela= 16 54586 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 37 33 32  35 39 31 36 32 37 36 32  13778732 59162762 |
write(19, "J?e7K6~UW1\n", 11)           = 11
 | 00000  4a 3f 65 37 4b 36 7e 55  57 31 0a                 J?e7K6~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
pwrite(256, "\6\242\0\0\207 \f\0\214\255O\0\0\0\n\6\236\316\0\0\1\0\0\0%1\1\0X\231O\0"..., 8192, 6221258752) = 8192
pwrite(257, "\2\242\0\0\326\22\300\0\214\255O\0\0\0\t\4\10\207\0\0\1\0\16\0\371\4\0\0005\2\32\32"..., 8192, 3414867968) = 8192
pwrite(256, "\2\242\0\0*\2\300\0\205\255O\0\0\0\1\4\236W\0\0\10\0\33\0=\6\0\0\305\2\v\v"..., 8192, 1359298560) = 8192
pwrite(256, "\2\242\0\0d\1\300\0\207\255O\0\0\0\1\4\321e\0\0\4\0\33\0#\5\0\0\35\3\20\20"..., 8192, 1358725120) = 8192
pwrite(257, "&\242\0\0\360\0\300\0\206\255O\0\0\0\1\4[\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1368260608) = 8192
pwrite(257, "&\242\0\0\260\0\300\0\210\255O\0\0\0\1\4\320\351\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367736320) = 8192
pwrite(257, "&\242\0\0\200\0\300\0\214\255O\0\0\0\2\4.)\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367343104) = 8192
pwrite(257, "\6\242\0\0\6\t@\0\210\255O\0\0\0\1\6Z\37\0\0\1\0\0\0N&\1\0\21\254O\0"..., 8192, 749780992) = 8192
pwrite(256, "\6\242\0\0\231\2@\0\206\255O\0\0\0\1\6q\207\0\0\1\0\0\0J\0\0\0\253\254O\0"..., 8192, 734208000) = 8192
lseek(18, 0, SEEK_CUR)                  = 10012
write(18, "WAIT #0: nam='db file parallel w"..., 107) = 107
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 32 34 20 72 65  rite' el a= 24 re |
 | 00030  71 75 65 73 74 73 3d 39  20 69 6e 74 65 72 72 75  quests=9  interru |
 | 00040  70 74 3d 30 20 74 69 6d  65 6f 75 74 3d 30 20 6f  pt=0 tim eout=0 o |
 | 00050  62 6a 23 3d 2d 31 20 74  69 6d 3d 31 33 37 37 38  bj#=-1 t im=13778 |
 | 00060  37 33 32 35 39 31 38 33  32 32 37                 73259183 227      |
write(19, "J?ny4~1h1\n", 10)            = 10
 | 00000  4a 3f 6e 79 34 7e 31 68  31 0a                    J?ny4~1h 1.       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(18, "WAIT #0: nam='db file parallel w"..., 106) = 106
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 20 72 65 71  rite' el a= 1 req |
 | 00030  75 65 73 74 73 3d 39 20  69 6e 74 65 72 72 75 70  uests=9  interrup |
 | 00040  74 3d 30 20 74 69 6d 65  6f 75 74 3d 30 20 6f 62  t=0 time out=0 ob |
 | 00050  6a 23 3d 2d 31 20 74 69  6d 3d 31 33 37 37 38 37  j#=-1 ti m=137787 |
 | 00060  33 32 35 39 31 38 33 34  32 30                    32591834 20       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semctl(983043, 16, SETVAL, 0x1)         = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semtimedop(983043, {{14, -1, 0}}, 1, {1, 320000000}) = -1 EAGAIN (Resource temporarily unavailable)

Mmmh...I see a number of very suspicious things. I started this snippet with the sleep on the semaphore, after which the process writes the wait event information about this. The next couple of system calls (9) are pwrite() system calls, which is the database writer writing dirty blocks to the datafiles. As you can see, this is happening in a sequential way (one by one). After the writes, the process writes the wait event information. We got no 'db file async I/O submit wait, because there is no asynchronous IO, it's directly the wait 'db file parallel write'. The number of requests in the wait is 9, which is the same as the number of write() system calls.

When I look further (still in the first wait line!) I see ela=24. This is very suspicious...can I do 9 write calls to the O/S in 24 microseconds? But wait, it gets more weird!

I see yet another wait event 'db file parallel write' after the first one without any pwrite()'s happening in between. That is strange. My guess is this wait represents the same 9 pwrite() requests, and it was timed at 1 microsecond (ela=1). I've verified this waitevent in this situation (two times a wait event for the same event), this happens at least with every 'alter system flush buffer_cache'. Also an elapsed time of 1 microsecond is very suspicious.

Let's dig one layer deeper and look at the start and end of the wait event timing. In order to do this, I use gdb and the following commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) command
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) command
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 4 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

After gdb continues, I issue a ‘alter system flush buffer_cache’, and look at the gdb output:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

I’ve included the two semaphore waits. It easy to spot the pwrite64() calls are not timed. That is the reason the elapsed time in the ‘db file parallel write’ wait is so short…they both do not time any actual IO calls! Unless someone can prove me wrong, it looks like the synchronous IO implementation of database writer of wait events is seriously broken.

But, as I’ve said: you really should have a platform which has the ability to do asynchronous IO.

Let’s switch to asynchronous IO again, and look at it at the same layer. Here are the gdb commands I’ve used:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660: file io_submit.c, line 23.
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 5 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should see the semtimedop() system calls with accompanying kslwt[be]ctx stepping over the screen every 3 seconds. Insert a few rows into a dummy table and commit, and issue an ‘alter system flush buffer_cache’, and look what happens:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

I included a semaphore wait at the beginning and at the end of this snippet.
The next remarkable thing is the io_submit() system call does not seem to be timed by a wait event, because it is not encosed in a kslwtbctx() and kslwtectx() combination. That’s a pity, because the wait event ‘db file async I/O submit’ strongly hints at being the timing of submitting IO to the operating system.
After this, we see kslwtbctx() io_getevents_0_4() kslwtectx(). This is Oracle timing the reap of the writes submitted earlier.

To be absolutely sure there isn’t some kind of magic happening with io_submit(), let’s make this call last longer.
In order to do so, you can change the commands with the break on io_submit():

(gdb) commands 4
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end
(gdb) 

The inclusion of ‘shell sleep 1′ means that when the debugger breaks execution of the database writer when encountering io_submit(), it will execute ‘sleep 1′ (wait/sleep 1 second), after which the ‘c’ (continue) command is executed by the debugger, which continues execution of the database writer process. The result is that for the database process, the io_submit() system call now will take at least 1 second.

However, when done so, a 10046/8 trace (still) looks like this on my test system:

*** 2013-09-06 11:12:16.337
WAIT #0: nam='db file async I/O submit' ela= 470 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458736337487
WAIT #0: nam='db file parallel write' ela= 1386 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458736340152

*** 2013-09-06 11:12:17.344
WAIT #0: nam='db file async I/O submit' ela= 545 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458737344837
WAIT #0: nam='db file parallel write' ela= 1554 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458737346981

What we see is that after making the io_submit() call take at least 1 second, the event ‘db file async I/O submit’ still is timed at approximately 0.5 milliseconds. So at this point it’s safe to say ‘db file async I/O submit’ is NOT the timing of submitting IO requests to the operating system.

Now take a look at the event ‘db file parallel write’ again in the 10046/8 output. We’ve already established that “requests” with the event ‘db file parallel write’ does not reflect the number of IO requests it did reap, but instead is the minimum number of requests that this io_getevents() call should reap from the operating system completion queue before the system call is ready. The actual number of IO requests it did reap is not shown. Another oddity is timeout. The number is recognisable as 2^31-1 by any person working in IT. If the measure of timeout is in hundredths of a second (according to the Oracle reference guide version 11.2), it means the write times out in 2147483 seconds. That should mean a write times out after almost 25 days. That seems a bit lengthy to me.

The nice thing is we can actually see the timeout of the operating system call, io_getevents(). This is extremely easy with strace (the timeout is the last argument showed enclosed in accolades). Another way to do so is asking the debugger to print the struct:

(gdb) commands 5
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) c
Continuing.

This is the result:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$1 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

So, actually the timeout of io_getevents() is set to 600 seconds ($1 = {tv_sec = 600, tv_nsec = 0}), and not 2147483 seconds, as the documentation suggests. Of course this conclusion is about the system call the database writer executes, there is a possibility that the Oracle code retries the call, but I strongly doubt that.

Up to this point the database writer only needed to flush a few blocks to disk. Because of the low number of blocks to write, and because the system on which I perform my tests is a VM for which the IOs are likely to be cached, it means the latency of these IOs is extremely low. What if we limit the number of IOs the database writer can do? This blogpost describes how that can be done on linux.

At this point there are a few million rows in my test table (I tried generating more dirty blocks for writing using ‘insert into t select 1+(level-1) from dual connect by level < 100000;' for example). I did setup IO throttling using cgroups, and did put the database writer in the task-list to be throttled. After that, I started deleting all rows from my dummy table t using 'delete from t'.

Please mind the database writer still has event 10046/8 active, so I can follow what the database writer is doing at the Oracle level. Also I modified the breaks in gdb a bit so I can detect the number of IOs io_getevents() was able to reap: I use the 'finish' command, which waits until the current function ends, and prints out the return code. io_getevents() returns the number of IOs it was able to reap. The downside of using 'finish' is the execution is stopped, and must be manually continued. Nevertheless is this the result:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=153, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=38, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$12 = {tv_sec = 600, tv_nsec = 0}
c
0x0000000002cfa74e in skgfrliopo ()
Value returned is $13 = 38
(gdb) c
Continuing.

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$14 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $15 = 0
(gdb) c
Continuing.

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$16 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $17 = 18
(gdb) 

At the top we see kslwtectx(), and then io_submit() submitting 153 IO requests. After that, we see kslwtbctx()-kslwtectx() for the event ‘db file async I/O submit’.
Then we see the start of a wait (kslwtbctx()) for ‘db file parallel write’, and io_getevents(). This reap request must reap at least 38 IO requests, with a timeout set to 600 seconds. Because IOPS for both read and write are aggressively throttled at 1 IOPS, this request only reaps the minimal number of IO requests: 38.

I was able to see this because of the ‘finish’ command in gdb. It returns me to the calling function (skgfrliopo()), which is an Oracle function.

After continuing the process, we see something unexpected (to me at least): the wait is ended (kslwtectx()), and Oracle tries to reap the left over IOs using two non-blocking calls (the so-called ‘snacking’). Once both calls are done, either new IOs are submitted (which starts over the process I am describing here), or Oracle registers another ‘db file parallel write’ event, and issues a blocking io_getevents() call for a number IOs (which is lower than the total number of outstainding requests) to reap. If the number of active IO requests for this process grows, the number of io_getevents() calls in ‘db file parallel write’ can increase to two io_getevents() calls.

This is how it looks like when the number of active requests gets higher:

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=685, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=114, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 0, tv_nsec = 0}

Summary
The database writer related events are kind of strange, and what is visible is absolutely not obvious.

The naming ‘db file parallel write’ is only true for asynchronous IO, with synchronous IO, the writes are done sequentially.

The database writer has a unique event named ‘db file async I/O submit’, which, despite its naming does NOT time submitting asychronous IOs to the operating system. This event is now shown with sychronous IO, obviously.

The ‘db file parallel write’ event does not time the write calls with synchronous IO, and seems to be printed two times for a set of write IOs. I consider this wait implementation severely broken.

The event ‘db file async I/O submit’ shows the accurate number of IO requests which are submitted to the operating system using io_submit().
The event ‘db file parallel write’ shows the minimal number of IO requests the process needs to reap using io_getevents(), which can be a lower number than submitted with io_submit(). The timeout which is shown in this event does not seem to relate to anything. The actual (operating system) timeout is 600 seconds, not 25 days.

The event ‘db file parallel write’ waits for a lower number of IO requests then are submitted, the database writer tries to reap any leftover IOs using up to two non-blocking io_getevents(), after which it’s possible new IO requests are introduced.

At this moment my conclusion is ‘db file parallel write’ is an “indication of database writer business”, but absolutely nothing more than that, certainly not a measurement of IO latency.

This post is about log writer (lgwr) IO.
It’s good to point out the environment on which I do my testing:
Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM.

In order to look at what the logwriter is doing, a 10046 trace of the lgwr at level 8 gives an overview.
A way of doing so is using oradebug. Be very careful about using oradebug on production environments, it can/may cause the instance to crash.

This is how I did it:

SYS@v11203 AS SYSDBA> oradebug setospid 2491
Oracle pid: 11, Unix process pid: 2491, image: oracle@ol63-oracle.local (LGWR)
SYS@v11203 AS SYSDBA> oradebug unlimit
Statement processed.
SYS@v11203 AS SYSDBA> oradebug event 10046 trace name context forever, level 8
Statement processed.

Of course 2491 is the Linux process id of the log writer, as is visible with “image”.

When you look at the tracefile, the logwriter trace of a non-used database will look like this:

*** 2013-08-29 21:10:45.593
WAIT #0: nam='rdbms ipc message' ela= 3000575 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803445592696

*** 2013-08-29 21:10:48.595
WAIT #0: nam='rdbms ipc message' ela= 3000566 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803448594695

*** 2013-08-29 21:10:51.597
WAIT #0: nam='rdbms ipc message' ela= 3000812 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803451596813

*** 2013-08-29 21:10:54.599
WAIT #0: nam='rdbms ipc message' ela= 3000733 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803454598806

*** 2013-08-29 21:10:57.601
WAIT #0: nam='rdbms ipc message' ela= 3000460 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803457600717

*** 2013-08-29 21:11:00.603
WAIT #0: nam='rdbms ipc message' ela= 3000526 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803460602699

*** 2013-08-29 21:11:03.605
WAIT #0: nam='rdbms ipc message' ela= 3000640 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803463604715

*** 2013-08-29 21:11:04.218
WAIT #0: nam='rdbms ipc message' ela= 612031 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803464218054

What we see here is the logwriter spending 3 seconds (ela= 3000812) on the event ‘rdbms ipc message’, simply because the timeout (timeout column obviously) was set so (timeout=300; timeout is set in centiseconds, Oracle old “fine” measurement of time). To understand what this means from the perspective of the operating system (Linux), there’s a need “trick”: use strace with verbose output of writing. That way the extended SQL trace will “document” the system calls. This is how this looks like:

[root@ol63-oracle ~]# strace -e write=all -e all -p 2491
Process 2491 attached - interrupt to quit
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
write(32, "\n*** 2013-08-29 21:35:50.740\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 30  2e 37 34 30 0a           21:35:50 .740.    |
lseek(32, 0, SEEK_CUR)                  = 118213
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 32 30  38 37 34 35 32 20 74 69   ela= 20 87452 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 30 37 34 30 33 37 33  13778049 50740373 |
write(33, "J?Jqz7~UW1\n", 11)           = 11
 | 00000  4a 3f 4a 71 7a 37 7e 55  57 31 0a                 J?Jqz7~U W1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
semtimedop(327683, {{15, -1, 0}}, 1, {0, 910000000}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
write(32, "\n*** 2013-08-29 21:35:51.651\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 31  2e 36 35 31 0a           21:35:51 .651.    |
lseek(32, 0, SEEK_CUR)                  = 118339
write(32, "WAIT #0: nam='rdbms ipc message'"..., 94) = 94
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 39 31  30 36 37 37 20 74 69 6d   ela= 91 0677 tim |
 | 00030  65 6f 75 74 3d 39 31 20  70 32 3d 30 20 70 33 3d  eout=91  p2=0 p3= |
 | 00040  30 20 6f 62 6a 23 3d 2d  31 20 74 69 6d 3d 31 33  0 obj#=- 1 tim=13 |
 | 00050  37 37 38 30 34 39 35 31  36 35 31 37 32 34        77804951 651724   |
write(33, "J?HRU3~UU1\n", 11)           = 11
 | 00000  4a 3f 48 52 55 33 7e 55  55 31 0a                 J?HRU3~U U1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
write(32, "\n*** 2013-08-29 21:35:54.652\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 34  2e 36 35 32 0a           21:35:54 .652.    |
lseek(32, 0, SEEK_CUR)                  = 118463
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 33 30  30 30 33 33 32 20 74 69   ela= 30 00332 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 34 36 35 32 38 32 36  13778049 54652826 |
write(33, "J?MZSB~UW1\n", 11)           = 11
 | 00000  4a 3f 4d 5a 53 42 7e 55  57 31 0a                 J?MZSB~U W1.      |
lseek(32, 0, SEEK_CUR)                  = 118559
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(33, "3?uG~01\n", 8)               = 8
 | 00000  33 3f 75 47 7e 30 31 0a                           3?uG~01.          |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037

First thing to notice (which is not relevant to this text about logwriter IOs, but nice to know): any time there’s a timestamp in the Oracle trace file (like *** 2013-08-29 21:38:58.721), Oracle did issue a times() syscall prior to that. If you want to know what that does, just issue “man 2 times” on the linux prompt. As a summary: this syscall fetches the CPU time spend in user mode and system/kernel mode, for the current process and for its children.

Back to where this post is about: we see rdbms ipc message waits. This is the instrumentation of the time sleeping (actually) on a semaphore, with timeout mostly set to 3 seconds:

semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

Use the manpage of semtimedop to understand all, but the last struct ({3, 0}) is a struct timespec. The first number is seconds, the second number is nanoseconds. But in other words: if you see the logwriter spending its time on ‘rdbms ipc message’ it’s waiting for work. The semaphore construction is made so the process can asked (by another process) to do something, and can do that immediately.

I guess most people reading this blog, and coming this far, are aware what the function of the logwriter is (writing the log buffer to disk), and the logwriter does that if it finds something to write, or if some other process asks the logwriter to do so. The process which asks the logwriter to write waits for a response from the logwriter on the event ‘log file sync’.

If you use the extended SQL trace with strace “trick” you see this:

io_submit(140612782833664, 1, {{0x7fe2f5369450, 0, 1, 0, 257}}) = 1
io_getevents(140612782833664, 1, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 1024, 0}}, {600, 0}) = 1
lseek(26, 0, SEEK_CUR)                  = 5190
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 30 37 20  write' e la= 107  |
 | 00030  66 69 6c 65 73 3d 31 20  62 6c 6f 63 6b 73 3d 32  files=1  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 31 20 6f 62 6a 23   request s=1 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 37 30  =-1 tim= 13778070 |
 | 00060  32 34 39 35 31 33 31 38                           24951318          |

What we see is an asynchronously submitted IO request (one IO request; io_submit), and it’s reaped after that (io_getevents), after which the logwriter writes a wait event to the tracefile indicating this has happened. The logwriter log file parallel write events statistics tell use it wrote to 1 file (equalling one IO, as can be seen from the IO system calls), with a length of 2 blocks. As we know, the log block size is and has been 512 bytes on Linux, which can be seen in the io_getevents call (1024 equals 2*512). The elapsed time (ela) is 107 microseconds.

The log blocksize is and has been 512 bytes on Linux, but can also be 4kB with more recent versions of Oracle, for newer SSD devices.

It would be logical to think in this example the IO time is 107 microseconds. But hold that thought, I’ll come to that. Let’s add a logfile member to every logfile, and switch to a new logfile to make writing to both logfile members happening. Let’s look at a logwriter writing something now:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2
lseek(26, 0, SEEK_CUR)                  = 101469
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 31 37 20  write' e la= 117  |
 | 00030  66 69 6c 65 73 3d 32 20  62 6c 6f 63 6b 73 3d 32  files=2  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 32 20 6f 62 6a 23   request s=2 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 38 36  =-1 tim= 13778086 |
 | 00060  33 32 30 36 37 31 32 35                           32067125          |

We see here that Oracle submits two IOs at the same time, using one call, and reaps both of them (read the manpage of io_getevents, but the minimal number of IOs is set to 2). So indeed log file parallel write writes in parallel. The number of blocks is a bit puzzling, in total there are two (log-) blocks written (blocks=2), the logwriter IO to a single online redologfile had blocks set to 2, which meant it wrote 1024 bytes (2 512 blocks), here blocks=2 seems to mean the logwriter wrote 2 blocks in total.

How would Oracle write when asynchronous IO is not enabled? Well, let’s set disk_asynch_io to false, and redo enabling the extended sql trace, and strace it:

pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 1415701504) = 512
pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 6560015360) = 512
lseek(25, 0, SEEK_CUR)                  = 4600
write(25, "WAIT #0: nam='log file parallel "..., 103) = 103
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 35 20 66  write' e la= 15 f |
 | 00030  69 6c 65 73 3d 32 20 62  6c 6f 63 6b 73 3d 32 20  iles=2 b locks=2  |
 | 00040  72 65 71 75 65 73 74 73  3d 32 20 6f 62 6a 23 3d  requests =2 obj#= |
 | 00050  2d 31 20 74 69 6d 3d 31  33 37 37 38 30 39 33 39  -1 tim=1 37780939 |
 | 00060  37 38 38 31 38 31 39                              7881819           |

Ah! So when there’s no asynchronous IO, Oracle issues two pwrite() calls sequentially. Not really parallel!

Now think about what the wait means in both situations. In the first situation, with asynchronous IO (which should be your setting, if you have a modern system) the timing is about two IOs, and it seems to indicate the time both IOs took. Since they have been issued in parallel, if you have enough storage bandwidth left, this seems to indicate the time of the slowest IO. In the second situation, the IOs are issued serially, so the wait time seems to indicate the total latency time of the two IOs. This means you have to divide the wait time by the number of requests to get the average IO time of the IOs. Again: if the IOs are done synchronously instead of asynchronously.

But is this true?

Let’s dig a little deeper using gdb! First synchronous IO. Attach to the logwriter using gdb (gdb -p PID), and issue the following debugger commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 3 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) 

Now insert something into a table, and commit.

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Hey! Do you see what I see? First I see logwriter nicely sleeping like a little baby. With that I mean: I see kslwtbctx() to indicate the starting of the timing of a wait event, then semtimedop() is issued for the sake of sleeping, and after semtimedop() has timed out, kslwtectx() to end the event. But then we start writing (which is the function of the log writer; the pwrite64 calls), but there seems to be a kslwtbctx() missing??? Actually, kslwtbctx() is there, but AFTER the pwrite64() calls!

Am I missing something? Is Oracle doing something incredibly smart? Well, let’s startup an extended SQL trace and throttle the IO. By throttling the write IO for the logwriter I can (severely) increase the IO time, so the IO time should increase drastically. (see the throttling blogpost on this blog; instead for the logwriter you must limit write IOPS instead read IOPS). I’ve done that, and this is what the extended SQL trace shows:

*** 2013-08-29 23:11:54.509
WAIT #0: nam='log file parallel write' ela= 512 files=2 blocks=4 requests=2 obj#=-1 tim=1377810714509122

*** 2013-08-29 23:11:54.509
Warning: log write elapsed time 2004ms, size 1KB

*** 2013-08-29 23:11:56.509
WAIT #0: nam='log file parallel write' ela= 541 files=2 blocks=2 requests=2 obj#=-1 tim=1377810716509653

*** 2013-08-29 23:11:56.509
Warning: log write elapsed time 2000ms, size 0KB

This is what I call caught with the fingers in the cooky jar! The wait event shows an (extremely) low timing for the IOs (the ela=), and after that we get a warning the log writes were terribly slow! Oracle doesn’t time it’s logwrite events correct if we do synchronous IO in this situation!!!

Now let’s have a look at asynchronous IO. I do the same: attach to lgwr with gdb, and now break and continue on the asynchronous IO system calls: io_submit and io_getevents (please mind this requires breaking on io_getevents_0_4 with linux):

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 3 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should now see a stream of kslwtbctx and kslwtectx coming, which is the timing of the semtimedop() system calls. If you insert something in a table you’ll see:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001750) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40078f8, timeout=0x7fffc4008900) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Here’s something worthy to see! The first line shows the ending of a wait (kslwtectx), then the logwriter submitting an IO request (io_submit). Notice there isn’t any wait event timing started at this point, because kslwtbctx is called AFTER the IOs are submitted. This means that the wait event is the timing of the reap of the submitted IOs, instead of measuring full IO latency time. I know this is and should be a tiny fragment of time, but nevertheless the wait only shows the waiting for the reaping of the IO, not the timing of the entire IO.

Now look at the io_submit/io_getevents strace output again shown earlier in the post:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2

Look at the struct timespec in the io_getevents call: {0, 0}. This is a non-blocking reap call just to peek at the O/S completion queue. In other words, if the submitted IOs take a long time/are not ready, this call will just be unsuccessful and execution continues. This looks like the asynchronous read implementation of direct path reads (see this blogpost about non-blocking io_getevents calls).

Let’s throttle the IO of the lgwr, and print out the timespec struct at the break on io_getevents:

(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>print *timeout
>c
>end

Now redo the insertion of data into a table, and commit to put the logwriter to work. Please mind IO is throttled!

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001640) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077e8, timeout=0x7fffc40087f0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$6 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077d8, timeout=0x7fffc40087e0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$7 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Ah! We see the io_submit call issuing two IOs (nr=2), then the starting of the timing of the log file parallel write event (kslwtbctx), and then two io_getevents calls, the non blocking first, which almost certainly does not succeed (and is NOT shown with strace!) after which a blocking io_getevents call is issued (because seconds in the timespec struct is set to 600).

I haven’t got an idea why the non-blocking io_getevents call is issued after which a blocking io_getevents call is issued.

Summary
This is the description taken from Oracle 11.2 reference guide of log file parallel write:

‘Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.’

This is actually a very good description, surprisingly accurate for asynchronous IO.
As we’ve seen, this is not true for synchronous IO, there Oracle just issues the I/Os it needs to do sequentially, which are not timed at all, but a wait time is displayed, which is wrong.

It turns out that for asynchronous log writes, the log file parallel write event times the time waiting for the writes to be flushed to disk. This means waiting for the slowest write is what is made visible with the event timing, all the fast(er) writes can have any time below the timing of the slowest write. This means the wait is timing from the perspective of the process, not IO latency timing. It gives a fair hint in the IO time of the slowest IO, so for the way most people using it (measure the impact of logwriter IO time spend), it’s fairly accurate, but for actual IO latency measurement it’s weak. Because, if you want to go beyond measuring the time logwriter spend on IO, you should be very well aware of the implementation. For example, you cannot divide the log file parallel write time by the number of IOs; the resulting time does not reflect anything.

Correction: Chris Buckel pointed out I had ‘db file parallel write’ instead of ‘log file parallel write’ in the summary. This is corrected.
Correction: Kevin Closson pointed out that the log block size can be 4kB too with newer versions of Oracle.

This is part 3 of a number of blogposts about huge Oracle database IO’s.
If you landed on this blogpost and did not read part 1 or part 2, please read part 1 here and part 2 here.

In part 1 I showed how database IOs of a full table scan could be bigger than 1MB by increasing the db_file_multiblock_read_count parameter to a number beyond 1MB expressed in Oracle blocks. These bigger IOs only happen with direct path reads, not with buffered multiblock reads (which are recognisable via the ‘db file scattered read’ wait event). In part 2 I showed that with Oracle Linux 6u4 X64, Oracle 11.2.0.3 64 bits and ASM the maximal size of an IO could go up to 1024 blocks if db_file_multiblock_read_count was set high enough, and that once an IO could be 1024, Oracle aligns the IO to block id being a multiplication of 1024.

With all the gained knowledge, the next thing I ask myself is: how did Oracle implement this? What operating system calls are done? How can we do IOs of 8MB whilst the maximal IO size of Linux is supposed to be 1MB? The obvious answer is take a peak!

In order to see the exact working, it is once again necessary to reduce the number of IOPS, to see both the io_getevents calls which probe the completion queue (which are not timed/captured by a wait event) and the blocking io_getevents call which is timed with the ‘direct path read’ wait event. To throttle IO (reduce IOPS), see this article.

Let’s focus on how direct path IO works first. Direct path IO (on Linux with ASM is what I researched) works roughly in 3 stages:

Before a direct path IO takes place, Oracle does some stuff which necessary to make direct path IO’s happen. Some of the things are issuing a object checkpoint (visible with the ‘enq: KO – fast object checkpoint’ waitevent) to make sure the latest version of the blocks involved are on disk, and a rebuild of an operating system asynchronous IO context (visible via the io_destroy() and io_setup() system calls; this is not captured by an Oracle waitevent. The Oracle waitevent ‘asynch descriptor resize’ which could show up during this phase actually is about the Oracle asynchronous IO tracking mechanism, which Oracle often refers to as ‘slots’).

a) submit/process phase:
After the above mentioned initial operations, Oracle submits IO’s via the io_submit() system call. Oracle issues two IO requests for two distinct ranges of blocks when it starts a direct path read scan, which is the minimal number of IO requests which are issued in parallel. This number could go up (up to 32 concurrent requests). These are two distinct io_submit() calls.

When Oracle enters the submit/process phase again, it does something interesting: it first processes a single (multiblock) IO request, then issues an IO request via io_submit(), then processes another IO request (if available), then issues an IO request (io_submit()), depending on the number of number IOs Oracle needs to submit and needs to process.

It’s important to note that Oracle did not put any waits in this phase. This is all instrumented as cpu time.

b) reap all phase:
Now Oracle has issued a number of IOs, Oracle does something truly interesting: it issues a number of IO requests (visible as io_getevents() calls) for ALL the IOs this process submitted with timeout set to 0 (zero). This does two things: it makes the call non blocking and it will be invisible for strace (the Linux system call trace utility) if they do not succeed (!!).

This phase is not covered by a waitevent. This means that if IOs are fast, and this phase succeeds in reaping all IOs, there is no wait, and are instrumented as cpu time.

Also, if all IOs are reaped in this phase, it can go back to the submit/process phase.

c) wait for IO phase:
If Oracle comes up to this phase, the process has done everything, and can do nothing but wait for IO to arrive. So that is exactly what the process is going to do: it issues a io_getevents() call with timeout set to 600 seconds, and waits for any SINGLE IO request to arrive. Here Oracle “truly” waits, and this is covered by a wait event: ‘direct path read’. Once a single IO request is reaped, this step is ready, and goes back to the submit/process phase.

Okay, now that we are up to date with how Oracle processes asynchronous direct path IO, let’s dig into how this works with huge IOs!

For investigating this issue, the linux strace utility actually works best, because it interprets the pointers in the system call, and shows the content of the struct, instead of just showing the pointer, which gdb does. As always, you need to careful choose the tools you use for the job. By using strace, the only hint in which phase the io_getevents() call was done is by looking at the timeout value(!).

This is a snippet of the strace output when Oracle was doing 1020 blocks IO requests:

io_submit(140019063115776, 8, {{0x7f58b89fbfb8, 0, 0, 0, 256}, {0x7f58b89fc210, 0, 0, 0, 257}, {0x7f58b89fc468, 0, 0, 0, 256}, {0x7f58b89fc6c0, 0, 0, 0, 257}, {0x7f58b89fc918, 0, 0, 0, 256}, {0x7f58b89fcb70, 0, 0, 0, 257}, {0x7f58b89fcdc8, 0, 0, 0, 256}, {0x7f58b89fd020, 0, 0, 0, 257}}) = 8
io_getevents(140019063115776, 8, 128, {{0x7f58b89f6db0, 0x7f58b89f6db0, 1048576, 0}, {0x7f58b89f7008, 0x7f58b89f7008, 1048576, 0}, {0x7f58b89f66a8, 0x7f58b89f66a8, 1048576, 0}, {0x7f58b89f6450, 0x7f58b89f6450, 1048576, 0}, {0x7f58b89f61f8, 0x7f58b89f61f8, 1048576, 0}, {0x7f58b89f7710, 0x7f58b89f7710, 1048576, 0}, {0x7f58b89f7968, 0x7f58b89f7968, 1048576, 0}, {0x7f58b89f7bc0, 0x7f58b89f7bc0, 1015808, 0}}, {600, 0}) = 8

This is a combination of an io_submit() call and an io_getevents(). There is a lot to say about this.
First, Oracle did not magically increase the maximum IO size of the operating system. As we can see, Oracle submits multiple IO requests in a single io_submit() call. If we peek at the size field in the io_getevents() calls, we see the size mostly being 1048576. In other words: the individual IO requests are 1MB, as we expect, Oracle submitted multiple requests in order to submit IO requests for the single (Oracle level) request of 1020 blocks.

If we add the IO sizes of the individual requests we get: (1048576*7)+1015808=8355840 bytes. 8355840/8192 (=blocksize) = 1020. So the size of the IO requests is the same as the number of Oracle blocks.

Next, if we look at the timeout “field” of io_getevents(), it reads {600, 0}. This means this was a blocking call, with timeout set to 600. We can deduct from that field that this was an io_getevents() call from the wait for IO phase.

I urge you to take an even closer look. If you look at the pointer to the iocb (the large hexadecimal number), you will see that the IOs submitted are different from the IOs reaped with io_getevents(). This is Oracle taking advantage of asynchronous IO.

I’ve throttled IO to 1 IOPS, so I know all the requests will go up to the wait for IO phase.

Let’s take a look at a strace output if IO is not throttled to 1 IOPS:

io_getevents(140136915922944, 28, 128, {{0x7f7429499710, 0x7f7429499710, 1048576, 0}, {0x7f7429499e18, 0x7f7429499e18, 1048576, 0}, {0x7f7429498900, 0x7f7429498900, 1048576, 0}, {0x7f742949c5f0, 0x7f742949c5f0, 1048576, 0}, {0x7f742949ba38, 0x7f742949ba38, 1048576, 0}, {0x7f7429498db0, 0x7f7429498db0, 1048576, 0}, {0x7f742949ac28, 0x7f742949ac28, 1048576, 0}, {0x7f742949a2c8, 0x7f742949a2c8, 1048576, 0}, {0x7f742949b7e0, 0x7f742949b7e0, 1048576, 0}, {0x7f742949a070, 0x7f742949a070, 1048576, 0}, {0x7f7429499260, 0x7f7429499260, 1048576, 0}, {0x7f74294981f8, 0x7f74294981f8, 1048576, 0}, {0x7f742949c848, 0x7f742949c848, 1048576, 0}, {0x7f742949a778, 0x7f742949a778, 1048576, 0}}, {0, 0}) = 14

The timeout struct is set to {0, 0}. So this is an io_getevents() call from the ‘get all IOs phase’. As we can see, all IOs in this case is 28 (that is a little less than 28MB!). Another interesting thing is this actual call returned 14 IO requests, but still succeeded, despite the minimal number of requests ready was set to 28. My guess is because because another io_getevents() call or calls reaped the other IOs, and this request completed the 28 requests, and made this call successful.

Because this is in the ‘get all IOs phase’, there is no wait event. So this huge amount of IOs being ready is not reflected in the wait events, nor can be assembled from it for this specific request. Well, you could calculate the number of IOs needed to read this segment by looking at DBA_EXTENTS, and calculate the total time taken for reading the segment, and calculate an average but that’s all.

Because the ‘direct path read’ wait event only shows up when we need to wait, which will be for a single request, a wait will always show maximally for 1024 blocks because that is the maximum (which I’ve found up to now). If all the IOs are ready before the ‘wait for IO phase’, it means there are more IO requests ready than a single Oracle IO.

This is a small note describing how Oracle implemented the situation which is covered by the db file parallel read wait event. This events happens if Oracle knows it must read multiple blocks which are not adjacent (thus from different random files and locations), and cannot continue processing with the result of a single block. In other words: if it cannot process something after reading a single block (otherwise Oracle will read a single block visible by the wait ‘db file sequential read’).

This is how it shows up if you enable sql trace:

WAIT #139658359011504: nam='db file parallel read' ela= 69997140 files=1 blocks=70 requests=70 obj#=76227 tim=1373200106669612

What this shows, is Oracle issuing a request for 70 blocks. This has an interesting implication for monitoring and looking at the time spend on the event ‘db file parallel read’: if you don’t know the number of blocks for which an IO request is issued, it’s impossible to say something about the time. So just monitoring or looking a cumulative time spend in ‘db file parallel read’ doesn’t say anything about IO latency, it only tells something about where the Oracle process did spend its time on.

How did Oracle implement this? This is obviously port specific (which means the implementation will be different on different operating systems). My test environment is Oracle Linux 6u3 X64, Oracle 11.2.0.3 64 bit using ASM.

This is how the requests are asynchronously submitted to the operating system:

Breakpoint 2, io_submit (ctx=0x7f04c0c8d000, nr=70, iocbs=0x7fff86d965f0) at io_submit.c:23

So all the IO requests are submitted in one go!

After the IO requests are submitted (which is not covered by a wait, which makes sense, because the io_submit call is/is supposed to be non blocking.

Next Oracle waits for ALL the IOs to finish, covered by the ‘db file parallel read’ wait event:

Breakpoint 13, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, io_getevents_0_4 (ctx=0x7f04c0c8d000, min_nr=70, nr=128, events=0x7fff86d9c798, timeout=0x7fff86d9d7a0) at io_getevents.c:46
Breakpoint 14, 0x0000000008fa1334 in kslwtectx ()

We see kslwtbctx which indicates the start of a waitevent, then a io_getevents call:

‘ctx’ is the IO context. This is how Linux keeps track of groups of asynchronous IO requests.
‘min_nr’ is the minimal number of requests that must be ready for this call to succeed, this call will wait until ‘timeout’ is reached. io_getevents will just peek if ‘timeout’ is set to zero.
‘nr’ is the maximal number of requests that io_getevents will “fetch”.
‘events’ is a struct (table like structure) that holds the information about the iocb’s (IO control blocks) of the requests.
‘timeout’ is a struct that sets the timeout of this call. For Oracle IO I see timeout being 600 (seconds) most of the time.

The last line show kslwtectx indicating that the wait has ended.

This is part 2 of a number of blogposts about huge Oracle database IO’s.
If you landed on this blogpost and did not read part 1, please read part 1 here.

In part 1 I showed how database IOs of a full table scan could be bigger than 1MB by increasing the db_file_multiblock_read_count parameter to a number beyond 1MB expressed in Oracle blocks. These bigger IOs only happen with direct path reads, not with buffered multiblock reads.

But how much bigger can these IOs be? In part 1 I showed Oracle IOs of 1020 blocks. Is that the limit? To investigate this, I created a much bigger table (table T2 in part 1 had a maximum extent size of 1024 blocks, which meant that the 1020 is the biggest IO possible from this table).

For the sake of this investigation I created a much bigger table to get larger extents:

 EXTENT_ID     BLOCKS
---------- ----------
	 0	    8
...
       263	 8192
       264	 8192
       265	 8192

266 rows selected.

So with my new, big table I got extents up to 8192 blocks. Let’s do the same test on this table as I did in the previous post, which means setting db_file_multiblock_read_count to 2000 with SQL trace at level 8 with throttled IOPS to get waits!

The IOs with the “small” extents of 8 blocks and 128 blocks are done exactly the same as the previous post. The IOs done against the extents with a size of 1024 blocks are the same too, simply because reading the 1020 blocks in one IO is as efficient as it could be! But let’s take a look at the point where the extents become 8192 blocks:

WAIT #140281084141152: nam='direct path read' ela= 7996378 file number=5 first dba=174468 block cnt=1020 obj#=76227 tim=1373104660882677
WAIT #140281084141152: nam='direct path read' ela= 7995820 file number=5 first dba=175489 block cnt=1023 obj#=76227 tim=1373104668882345
WAIT #140281084141152: nam='direct path read' ela= 7996472 file number=5 first dba=176520 block cnt=632 obj#=76227 tim=1373104676882677
WAIT #140281084141152: nam='direct path read' ela= 7998049 file number=5 first dba=177152 block cnt=1024 obj#=76227 tim=1373104684883512
WAIT #140281084141152: nam='direct path read' ela= 7995472 file number=5 first dba=178176 block cnt=1024 obj#=76227 tim=1373104692882932
WAIT #140281084141152: nam='direct path read' ela= 7993677 file number=5 first dba=179200 block cnt=1024 obj#=76227 tim=1373104700880106
WAIT #140281084141152: nam='direct path read' ela= 7996969 file number=5 first dba=180224 block cnt=1024 obj#=76227 tim=1373104708880891
WAIT #140281084141152: nam='direct path read' ela= 5998630 file number=5 first dba=181248 block cnt=1024 obj#=76227 tim=1373104714882889
WAIT #140281084141152: nam='direct path read' ela= 9996459 file number=5 first dba=182272 block cnt=1024 obj#=76227 tim=1373104724882545

I’ve included one IO of 1020 blocks, after which Oracle issues an IO request of 1023 blocks, then 632 blocks, and then it issues a number of requests for 1024 blocks.

It seems that the maximal amount of blocks of multiblock IO requests done by Oracle “unbuffered” or “direct path” is 1024 blocks, which is 8MB! Please mind I have tested this ONLY with Oracle version 11.2.0.3 64 bits on Linux X64 on ASM, with a database and segment blocksize of 8kB.

There’s another peculiarity I want to introduce. Let’s take a look at the dba and block cnt only, together with block_id and blocks (number of blocks in the extent, taken from dba_extents):

                                  BLOCK_ID	  BLOCKS         EXTENT_ID
dba=174468 block cnt=1020           174464          1024               197
dba=175489 block cnt=1023           175488          1024               198
dba=176520 block cnt=632            176512          8192               199
dba=177152 block cnt=1024
dba=178176 block cnt=1024
dba=179200 block cnt=1024
dba=180224 block cnt=1024
dba=181248 block cnt=1024
dba=182272 block cnt=1024
dba=183296 block cnt=1024 

There are two anomalies here: the first one is at dba 175489. It seems that dba 175489/extent id 198 has only one BMB, so Oracle can read 1023 blocks of the extent, instead of the 1020 blocks in the other extents with the size of 1024 blocks.

The second anomaly is much more interesting: extent id 199 starts at block id 176512. This is an extent consisting of 8192 blocks. The first IO starts at block id 176520. This means there are 176520-176512=8 blocks not read, which are probably BMB’s. But this is in line with my expectation. What is odd, is the first IO issued for datablocks in this extent which is 8192 blocks in length is “only” 632 blocks!

After the oddly sized IO request at the beginning of the extent (632), the other IO requests are the full 1024 blocks, which probably is the maximum size of direct path Oracle IO requests.

I wondered about the oddly sized request at the beginning. I looked at the dba’s of the requests, and noticed the dba’s of the 1024 blocks sized requests looked vaguely familiar. Then I spotted the familiarity and a possible reason for the “small” request at the beginning of the extent:

SYS@v11203 AS SYSDBA> select mod(176520,1024) from dual;

MOD(176520,1024)
----------------
	     392

SYS@v11203 AS SYSDBA> select mod(177152,1024) from dual;

MOD(177152,1024)
----------------
	       0

SYS@v11203 AS SYSDBA> select mod(178176,1024) from dual;

MOD(178176,1024)
----------------
	       0

ONLY when the direct path read requests can potentially read “the full” 1024 blocks in one IO, Oracle first issues an IO to align the later requests to a block id being a multiplication of 1024, after which it will issue IO requests of 1024 blocks.

I don’t know why the alignment takes place. I can imagine this alignment can line up with ASM, thus avoiding a single Oracle IO request touching more than one ASM allocation unit. But I am open to other people’s ideas and suggestions about potential reasons!

Stay tuned for further updates about extra huge database IOs!

It’s been a while since I presented the first incarnation of my ‘about multiblock reads’ presentation. When I did this at the UKOUG TEBS conference in Birmingham in 2011, Christian Antognini chaired my presentation. After my presentation Christian showed me it’s possible to set the parameter ‘db_file_multiblock_read_count’ higher than 1MB/db_block_size (which is 128 if your blocksize is 8kB), and you could benefit from it if your hardware is sufficient. In fact, Christian showed me AWR reports (could also be statspack reports, not sure) which showed the benefit.

My understanding of the parameter db_file_multiblock_read_count at the time was:

The maximum value is the operating system’s maximum I/O size expressed as Oracle blocks ((max I/O size)/DB_BLOCK_SIZE). If you set this parameter to a value greater than the maximum, Oracle uses the maximum.

Which is taken from the Reference guide, in which this text is there for both 10.2 and 11.2, and even for 12.1.

This is interesting, because the statement from the Oracle reference guide conflicts with Christians statement and measurements!

My test environment: Oracle 11.2.0.3 on Linux OL6u3 X64 with ASM, on my Macbook with SSD.

I’ve got a table created in a bigfile ASSM tablespace, system allocation type. This means objects get extents allocated which are increased progressively in size in steps:

 EXTENT_ID     BLOCKS
---------- ----------
	 0	    8
...
	15	    8
	16	  128
...
	78	  128
	79	 1024
...
	91	 1024

First let’s take a look at a full scan done “buffered”, alias blocks read to the buffercache, which is visible via the “db file scattered read” wait event. This is the “traditional” multiblock read.

I’ve flushed the buffer cache, set db_file_multiblock_read_count to 128, set event 10949 at level 1 and enabled SQL trace at level 8.
Event 10949 is needed to get a buffered multiblock read of an table which the Oracle database deems ‘big’ (to avoid non-buffered alias direct path reads).
Another thing is both ‘db file sequential read’ and ‘db file scattered read’ will always show up (even if the action (the physical read) is very fast), so there’s no need the change things (slow down I/O) to be sure we get these waits.

The statement executed is ‘select count(*) from t2′.

The first section of extent size 8, which shows the typical alteration of 8 and 7 blocks being read (after the initial smaller sizes) which is because of the space bitmap block every other extent:

WAIT #140150651756760: nam='db file sequential read' ela= 272 file#=5 block#=23938 blocks=1 obj#=75579 tim=1372970940124856
WAIT #140150651756760: nam='db file scattered read' ela= 2162 file#=5 block#=23939 blocks=5 obj#=75579 tim=1372970940127442
WAIT #140150651756760: nam='db file scattered read' ela= 725 file#=5 block#=23944 blocks=8 obj#=75579 tim=1372970940128478
WAIT #140150651756760: nam='db file scattered read' ela= 144 file#=5 block#=23953 blocks=7 obj#=75579 tim=1372970940128908
WAIT #140150651756760: nam='db file scattered read' ela= 253 file#=5 block#=23960 blocks=8 obj#=75579 tim=1372970940129636

Then the extents of 128 blocks, which show 126 blocks being read because of 2 space bitmap blocks at the beginning of every extent of 128 blocks:

WAIT #140150651756760: nam='db file scattered read' ela= 3741 file#=5 block#=24066 blocks=126 obj#=75579 tim=1372970940139996
WAIT #140150651756760: nam='db file scattered read' ela= 955 file#=5 block#=24194 blocks=126 obj#=75579 tim=1372970940142894
WAIT #140150651756760: nam='db file scattered read' ela= 2916 file#=5 block#=24322 blocks=126 obj#=75579 tim=1372970940148287
WAIT #140150651756760: nam='db file scattered read' ela= 2111 file#=5 block#=24450 blocks=126 obj#=75579 tim=1372970940152527
WAIT #140150651756760: nam='db file scattered read' ela= 2952 file#=5 block#=24578 blocks=126 obj#=75579 tim=1372970940157492

Then the extents of 1024 blocks, which start off with 4 space bitmap blocks at the beginning of every extents:

WAIT #140150651756760: nam='db file scattered read' ela= 3079 file#=5 block#=36228 blocks=128 obj#=75579 tim=1372970941036955
WAIT #140150651756760: nam='db file scattered read' ela= 2737 file#=5 block#=36356 blocks=128 obj#=75579 tim=1372970941048199
WAIT #140150651756760: nam='db file scattered read' ela= 2531 file#=5 block#=36484 blocks=128 obj#=75579 tim=1372970941059473
WAIT #140150651756760: nam='db file scattered read' ela= 2912 file#=5 block#=36612 blocks=128 obj#=75579 tim=1372970941071015
WAIT #140150651756760: nam='db file scattered read' ela= 2768 file#=5 block#=36740 blocks=128 obj#=75579 tim=1372970941082467
WAIT #140150651756760: nam='db file scattered read' ela= 287 file#=5 block#=36868 blocks=128 obj#=75579 tim=1372970941091378
WAIT #140150651756760: nam='db file scattered read' ela= 2831 file#=5 block#=36996 blocks=128 obj#=75579 tim=1372970941102969
WAIT #140150651756760: nam='db file scattered read' ela= 536 file#=5 block#=37124 blocks=124 obj#=75579 tim=1372970941111811

We see 7 lines of 128 blocks read (7*128=896) and one line of 124 blocks (896+124=1020) which is 4 blocks short of the extent size of 1024, which are the space bitmap blocks.

Okay, nothing new here.

If I raise the parameter db_file_multiblock_read_count to 2000, and reissue the same ‘select count(*) from t2′, I get exactly the same waits. In other words: nothing different.

So the conclusion is: when using buffered reads, the statement about 1MB being the limit of the size of a single multiblock read indeed is true.

Now let’s move on to direct path reads. In short, when a table is deemed ‘big’ by the database, meaning the number of blocks occupied by the table is more than five times _small_table_threshold (which is 2% of the buffercache by default) then the database foregrounds will not use the buffercache to read the table blocks to, but read them to the process’ PGA to avoid discarding usefull blocks in the cache, and avoid the overhead of concurrency control (latching). There are more criteria, but that’s outside the scope of this blogpost.

First let’s trace a session doing direct path reads. This is easier than the above situation where I forced buffered reads, because I can do the same as above, but skip setting event 10949. On the other hand it’s harder, because in order to see all the IOs using tracing (which means ‘get a wait for all IOs’), I need reduce the number of IOs (I use 1 IOPS), because otherwise some IOs will not get a wait, so not all IOs will be visible in the tracefile (look at this article for learning how to reduce IOPS with Linux, go see me present my About multiblock reads presentation to learn why some waits for direct path read IOs could not show up)

So, here are snippets from the tracefile of a ‘select count(*) from t2′, with db_file_multiblock_read_count set to 128 and SQL trace at level 8:

First we see the reads in the small extents (8 blocks) from BMB to BMB, not stopping at the extent border:

WAIT #140402935204600: nam='direct path read' ela= 996117 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1373055770533891
WAIT #140402935204600: nam='direct path read' ela= 997014 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1373055771531222
WAIT #140402935204600: nam='direct path read' ela= 1000275 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1373055772531716
WAIT #140402935204600: nam='direct path read' ela= 999111 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1373055773531249
WAIT #140402935204600: nam='direct path read' ela= 999628 file number=5 first dba=24001 block cnt=15 obj#=75579 tim=1373055774531093

Then we see the reads of 126 blocks, of the extents of 128 blocks:

WAIT #140402935204600: nam='direct path read' ela= 1004881 file number=5 first dba=24066 block cnt=126 obj#=75579 tim=1373055778541646
WAIT #140402935204600: nam='direct path read' ela= 989042 file number=5 first dba=24194 block cnt=126 obj#=75579 tim=1373055779531980
WAIT #140402935204600: nam='direct path read' ela= 1002835 file number=5 first dba=24322 block cnt=126 obj#=75579 tim=1373055780541509
WAIT #140402935204600: nam='direct path read' ela= 989545 file number=5 first dba=24450 block cnt=126 obj#=75579 tim=1373055781531621
WAIT #140402935204600: nam='direct path read' ela= 1007730 file number=5 first dba=24578 block cnt=126 obj#=75579 tim=1373055782539986

And then the reads of the extents of the size of 1024 blocks:

WAIT #140402935204600: nam='direct path read' ela= 2000086 file number=5 first dba=36228 block cnt=124 obj#=75579 tim=1373055875542615
WAIT #140402935204600: nam='direct path read' ela= 1995550 file number=5 first dba=36480 block cnt=128 obj#=75579 tim=1373055877539193
WAIT #140402935204600: nam='direct path read' ela= 1999327 file number=5 first dba=36736 block cnt=128 obj#=75579 tim=1373055879539584
WAIT #140402935204600: nam='direct path read' ela= 1998883 file number=5 first dba=36992 block cnt=128 obj#=75579 tim=1373055881539584
WAIT #140402935204600: nam='direct path read' ela= 1998753 file number=5 first dba=37252 block cnt=124 obj#=75579 tim=1373055883539416

If you take a close look, you’ll see IO’s are still missing (just add cnt to dba and see if it fits the next dba, with the exception of BMB’s, which are not read too, so will leave a gap). Nevertheless: this is as we expect.

Now let’s do the same, but set db_file_multiblock_read_count to 2000! In order to reduce length: the IO’s of the 8 blocks and 128 blocks extents are identical. Let’s look at the 1024 blocks extent IOs:

WAIT #140550163651400: nam='direct path read' ela= 7862301 file number=5 first dba=32132 block cnt=1020 obj#=75579 tim=1373057289713781
WAIT #140550163651400: nam='direct path read' ela= 7996018 file number=5 first dba=33156 block cnt=1020 obj#=75579 tim=1373057297713687
WAIT #140550163651400: nam='direct path read' ela= 7996272 file number=5 first dba=34180 block cnt=1020 obj#=75579 tim=1373057305713868
WAIT #140550163651400: nam='direct path read' ela= 7995719 file number=5 first dba=35204 block cnt=1020 obj#=75579 tim=1373057313713547
WAIT #140550163651400: nam='direct path read' ela= 7996310 file number=5 first dba=36228 block cnt=1020 obj#=75579 tim=1373057321713593

Wow! These are HUGE IOs! The entire 1024 block extent is read in one (oracle) IO excluding the 4 BMB’s.

So this means Christian was right, and the Oracle documentation is wrong about the maximum size of db_file_multiblock_read_count for direct path reads.

There’s much more to be said about huge IOs, but I’ll leave that for another blogpost.

This is yet another blogpost on Oracle’s direct path read feature which was introduced for non-parallel query processes in Oracle version 11.

For full table scans, a direct path read is done (according to my tests and current knowledge) when:

- The segment is bigger than 5 * _small_table_threshold.
- Less than 50% of the blocks of the table is already in the buffercache.
- Less than 25% of the blocks in the buffercache are dirty.

Also (thanks to Freek d’Hooge who pointed me to an article from Tanel Poder) you can change the optimizer statistics to change the segment size for the direct path decision. Please mind that whilst this uses the statistics the optimizer uses, this is NOT an optimizer decision, but a decision made in the “code path”, so during execution.

So let’s take a look at my lab environment (Oracle Linux 6.3, 64 bit, Oracle 11.2.0.3 and ASM)

Small table threshold:

NAME						   VALUE
-------------------------------------------------- -------
_small_table_threshold				   1011

Table information:

TS@v11203 > select blocks from user_segments where segment_name = 'T2';

    BLOCKS
----------
     21504

So if we take small table threshold times and multiply it by five, we get 5055. This means that the size of table T2 is more than enough so should be scanned via direct path:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  0

TS@v11203 > select count(*) from t2;

  COUNT(*)
----------
   1000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'table scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
table scans (direct read)						  1

Well, that’s that, this seems quite simple.

I’ve created a relatively big table and created a (normal) index on it in the same database. The index is created on a single column, called ‘id’. If I issue a count(id), the whole index needs to be scanned, and Oracle will choose a fast full index scan. A fast full index scan is a scan which just needs to read all the blocks, not necessarily in leaf order. This means it can use multiblock reads (which reads in the order of allocated adjacent blocks).

Let’s check just to be sure:

TS@v11203 > select count(id) from bigtable;

Execution Plan
----------------------------------------------------------
Plan hash value: 106863591

------------------------------------------------------------------------------------
| Id  | Operation	      | Name	   | Rows  | Bytes | Cost (%CPU)| Time	   |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	   |	 1 |	13 | 19662   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE       | 	   |	 1 |	13 |		|	   |
|   2 |   INDEX FAST FULL SCAN| I_BIGTABLE |	34M|   425M| 19662   (2)| 00:03:56 |
------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

If we look at the index size, the size of the index makes this segment a candidate for direct path reads:

TS@v11203 > select blocks from user_segments where segment_name = 'I_BIGTABLE';

    BLOCKS
----------
     72704

If we look at number of small table threshold times five (5055), this index is much bigger than that. Also, this is bigger than table T2. Let’s execute select count(id) from bigtable, and look at the statistic ‘index fast full scans (direct read)’:

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

TS@v11203 > select count(id) from bigtable;

 COUNT(ID)
----------
  32000000

TS@v11203 > select s.name, m.value from v$statname s, v$mystat m where m.statistic#=s.statistic# and s.name = 'index fast full scans (direct read)';

NAME								      VALUE
---------------------------------------------------------------- ----------
index fast full scans (direct read)					  0

Huh? This statistic tells me there hasn’t been a direct path read! This means that this read has been done in the “traditional way”. This is a bit…counter intuitive. I’ve traced the session, and indeed it’s doing the traditional multiblock reads via the scattered read waits.

I did a fair bit of fiddling around with the parameters which are reported to be involved, and found out I can get the database to do direct path reads by changing the parameter “_very_large_object_threshold”. The information found on the internet reports this value is in megabytes. A quick stroll through a number of different database (all on 11.2.0.3) shows this parameter is quite probably statically set at “500″.

If I calculate the size in megabytes of the index I_BIGTABLE, the size is 568M. This is clearly higher than the value of “_very_large_object_threshold”. I can get the same index scanned via direct path reads by changing the value of “_very_large_object_threshold” to 100.

This interesting, because it looks like this parameter does the same for full scans on index segments as “_small_table_threshold” does for full scans on table segments: the size of the segment to be scanned needs to be bigger than five times.

There are also differences: small table threshold is set in blocks, (apparently) very large object threshold is set in megabytes. Also, small table threshold is set by default at 2% of the size of the buffercache (so it scales up with bigger caches), very large object threshold seems to be fixed at 500. If my finding is correct, then it means an index segment needs to be bigger than 500*5=2500M to be considered for direct path reads. It’s unknown to me if the 50% limit for blocks in the cache and the 25% limit for dirty blocks is subject to this too.

Follow

Get every new post delivered to your Inbox.

Join 1,883 other followers

%d bloggers like this: