Oracle IO on Linux: log writer IO and wait events

This post is about log writer (lgwr) IO.
It’s good to point out the environment on which I do my testing:
Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM.

In order to look at what the logwriter is doing, a 10046 trace of lgwr at level 8 gives a good overview.
One way of enabling it is with oradebug. Be very careful with oradebug on production environments: it can cause the instance to crash.

This is how I did it:

SYS@v11203 AS SYSDBA> oradebug setospid 2491
Oracle pid: 11, Unix process pid: 2491, image: oracle@ol63-oracle.local (LGWR)
SYS@v11203 AS SYSDBA> oradebug unlimit
Statement processed.
SYS@v11203 AS SYSDBA> oradebug event 10046 trace name context forever, level 8
Statement processed.

Of course 2491 is the Linux process id of the logwriter, as can be seen from the “image” part of the oradebug output.
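
In case you want to repeat this: one way to find the logwriter’s operating system process id is a join of v$process and v$bgprocess (shown here in the same session style as above; output left out). In 11.2 “oradebug setorapname lgwr” should also work and skips the pid lookup altogether.

SYS@v11203 AS SYSDBA> select p.spid from v$process p, v$bgprocess b where b.paddr = p.addr and b.name = 'LGWR';

Or simply at the Linux prompt (the brackets keep grep from matching itself):

$ ps -ef | grep '[o]ra_lgwr'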

When you look at the tracefile of an idle database, the logwriter trace will look like this:

*** 2013-08-29 21:10:45.593
WAIT #0: nam='rdbms ipc message' ela= 3000575 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803445592696

*** 2013-08-29 21:10:48.595
WAIT #0: nam='rdbms ipc message' ela= 3000566 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803448594695

*** 2013-08-29 21:10:51.597
WAIT #0: nam='rdbms ipc message' ela= 3000812 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803451596813

*** 2013-08-29 21:10:54.599
WAIT #0: nam='rdbms ipc message' ela= 3000733 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803454598806

*** 2013-08-29 21:10:57.601
WAIT #0: nam='rdbms ipc message' ela= 3000460 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803457600717

*** 2013-08-29 21:11:00.603
WAIT #0: nam='rdbms ipc message' ela= 3000526 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803460602699

*** 2013-08-29 21:11:03.605
WAIT #0: nam='rdbms ipc message' ela= 3000640 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803463604715

*** 2013-08-29 21:11:04.218
WAIT #0: nam='rdbms ipc message' ela= 612031 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803464218054

What we see here is the logwriter spending roughly 3 seconds (ela= 3000812, in microseconds) on the event ‘rdbms ipc message’, simply because the timeout was set that way (timeout=300; the timeout is expressed in centiseconds, Oracle’s old “fine-grained” unit of time). To understand what this means from the perspective of the operating system (Linux), there’s a neat “trick”: use strace with verbose output of writes. That way the extended SQL trace itself “documents” the system calls. This is what that looks like:

[root@ol63-oracle ~]# strace -e write=all -e all -p 2491
Process 2491 attached - interrupt to quit
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
write(32, "\n*** 2013-08-29 21:35:50.740\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 30  2e 37 34 30 0a           21:35:50 .740.    |
lseek(32, 0, SEEK_CUR)                  = 118213
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 32 30  38 37 34 35 32 20 74 69   ela= 20 87452 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 30 37 34 30 33 37 33  13778049 50740373 |
write(33, "J?Jqz7~UW1\n", 11)           = 11
 | 00000  4a 3f 4a 71 7a 37 7e 55  57 31 0a                 J?Jqz7~U W1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
semtimedop(327683, {{15, -1, 0}}, 1, {0, 910000000}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
write(32, "\n*** 2013-08-29 21:35:51.651\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 31  2e 36 35 31 0a           21:35:51 .651.    |
lseek(32, 0, SEEK_CUR)                  = 118339
write(32, "WAIT #0: nam='rdbms ipc message'"..., 94) = 94
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 39 31  30 36 37 37 20 74 69 6d   ela= 91 0677 tim |
 | 00030  65 6f 75 74 3d 39 31 20  70 32 3d 30 20 70 33 3d  eout=91  p2=0 p3= |
 | 00040  30 20 6f 62 6a 23 3d 2d  31 20 74 69 6d 3d 31 33  0 obj#=- 1 tim=13 |
 | 00050  37 37 38 30 34 39 35 31  36 35 31 37 32 34        77804951 651724   |
write(33, "J?HRU3~UU1\n", 11)           = 11
 | 00000  4a 3f 48 52 55 33 7e 55  55 31 0a                 J?HRU3~U U1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
write(32, "\n*** 2013-08-29 21:35:54.652\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 34  2e 36 35 32 0a           21:35:54 .652.    |
lseek(32, 0, SEEK_CUR)                  = 118463
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 33 30  30 30 33 33 32 20 74 69   ela= 30 00332 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 34 36 35 32 38 32 36  13778049 54652826 |
write(33, "J?MZSB~UW1\n", 11)           = 11
 | 00000  4a 3f 4d 5a 53 42 7e 55  57 31 0a                 J?MZSB~U W1.      |
lseek(32, 0, SEEK_CUR)                  = 118559
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(33, "3?uG~01\n", 8)               = 8
 | 00000  33 3f 75 47 7e 30 31 0a                           3?uG~01.          |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037

First thing to notice (not relevant to this text about logwriter IOs, but nice to know): any time there’s a timestamp in the Oracle trace file (like *** 2013-08-29 21:38:58.721), Oracle issued a times() syscall just prior to it. If you want to know what that call does, just issue “man 2 times” at the Linux prompt. In summary: this syscall fetches the CPU time spent in user mode and in system/kernel mode, for the current process and for its children.
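
For the curious, here is a minimal C sketch (not Oracle’s code, purely an illustration) of what times() returns: the struct tms holds the four CPU time figures mentioned above, expressed in clock ticks.

/* times_demo.c: illustration of the times() syscall that precedes the
 * timestamp lines in the trace file (not Oracle's code).
 * Compile with: gcc -o times_demo times_demo.c
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>

int main(void)
{
    struct tms t;
    clock_t elapsed_ticks = times(&t);       /* return value: clock ticks since an arbitrary point */
    long hz = sysconf(_SC_CLK_TCK);          /* clock ticks per second, usually 100 */

    printf("return value (ticks)    : %ld\n", (long) elapsed_ticks);
    printf("user CPU time           : %.2f s\n", (double) t.tms_utime  / hz);
    printf("system CPU time         : %.2f s\n", (double) t.tms_stime  / hz);
    printf("children user CPU time  : %.2f s\n", (double) t.tms_cutime / hz);
    printf("children system CPU time: %.2f s\n", (double) t.tms_cstime / hz);
    return 0;
}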

Back to what this post is about: we see ‘rdbms ipc message’ waits. This is the instrumentation of the time the process actually spends sleeping on a semaphore, with the timeout mostly set to 3 seconds:

semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

Read the manpage of semtimedop for the full details, but the last argument ({3, 0}) is a struct timespec: the first number is seconds, the second number is nanoseconds. In other words: if you see the logwriter spending its time on ‘rdbms ipc message’, it is waiting for work. The semaphore construction is there so the process can be asked (by another process) to do something, and can start doing it immediately.
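
To make that mechanism tangible, here is a minimal C sketch of such a semaphore construction (this is not Oracle’s code; the semaphore set and values are made up for the example). The waiter sleeps in semtimedop() for at most 3 seconds, exactly the {3, 0} timespec visible in the strace output, and another process could wake it up instantly by posting the semaphore with semop().

/* sem_sketch.c: sketch of the sleep/post mechanism described above
 * (not Oracle's code). Compile with: gcc -o sem_sketch sem_sketch.c
 */
#define _GNU_SOURCE                                  /* for semtimedop() */
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

int main(void)
{
    /* create a private semaphore set with one semaphore (value 0 on Linux) */
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    struct sembuf wait_op = { 0, -1, 0 };            /* sem 0, "take one", no flags  */
    struct timespec timeout = { 3, 0 };              /* the {3, 0} seen in the strace */

    if (semid == -1) { perror("semget"); return 1; }

    /* sleep on the semaphore, but give up after 3 seconds */
    if (semtimedop(semid, &wait_op, 1, &timeout) == -1) {
        if (errno == EAGAIN)
            printf("timed out after 3 seconds, nobody posted us (EAGAIN)\n");
        else
            perror("semtimedop");
    } else {
        printf("woken up by a post\n");
    }

    /* another process knowing semid could wake the waiter immediately with:
     *   struct sembuf post_op = { 0, 1, 0 };  semop(semid, &post_op, 1);    */

    semctl(semid, 0, IPC_RMID);                      /* remove the semaphore set */
    return 0;
}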

I guess most people reading this blog, and coming this far, are aware of what the function of the logwriter is (writing the log buffer to disk), and the logwriter does that if it finds something to write, or if some other process asks it to do so. The process which asks the logwriter to write waits for a response from the logwriter on the event ‘log file sync’.

If you use the extended SQL trace plus strace “trick”, you see this:

io_submit(140612782833664, 1, {{0x7fe2f5369450, 0, 1, 0, 257}}) = 1
io_getevents(140612782833664, 1, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 1024, 0}}, {600, 0}) = 1
lseek(26, 0, SEEK_CUR)                  = 5190
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 30 37 20  write' e la= 107  |
 | 00030  66 69 6c 65 73 3d 31 20  62 6c 6f 63 6b 73 3d 32  files=1  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 31 20 6f 62 6a 23   request s=1 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 37 30  =-1 tim= 13778070 |
 | 00060  32 34 39 35 31 33 31 38                           24951318          |

What we see is an asynchronously submitted IO request (one request; io_submit), which is reaped right after that (io_getevents), after which the logwriter writes a wait event to the tracefile to record what happened. The ‘log file parallel write’ event statistics tell us it wrote to 1 file (equalling one IO, as can be seen from the IO system calls), with a length of 2 blocks. The log block size is 512 bytes on Linux, which can be seen in the io_getevents call (1024 equals 2*512). The elapsed time (ela) is 107 microseconds.

The log blocksize has traditionally been 512 bytes on Linux, but it can also be 4kB with more recent versions of Oracle, for newer (SSD) devices.
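
As an illustration of the system call pattern shown above, here is a minimal libaio sketch in C (not Oracle’s code; the filename is made up, and O_DIRECT/ASM specifics are left out for brevity). It submits one 1024-byte write (2 log blocks of 512 bytes) with io_submit and reaps it with io_getevents, mirroring the strace output.

/* aio_sketch.c: minimal libaio sketch of the io_submit()/io_getevents()
 * pair seen in the strace output above (not Oracle's code).
 * Compile with: gcc -o aio_sketch aio_sketch.c -laio
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <libaio.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb;
    struct iocb *iocbs[1] = { &cb };
    struct io_event events[128];                 /* nr=128, as in the strace output   */
    struct timespec timeout = { 600, 0 };        /* the {600, 0} timespec from strace */
    char buf[1024];                              /* 2 log blocks of 512 bytes          */
    int fd, got;

    if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    fd = open("redo_member.dbf", O_WRONLY | O_CREAT, 0644);   /* stand-in for a redo member */
    if (fd == -1) { perror("open"); return 1; }
    memset(buf, 'x', sizeof(buf));

    /* one asynchronous 1024-byte write at offset 0, submitted with one call */
    io_prep_pwrite(&cb, fd, buf, sizeof(buf), 0);
    if (io_submit(ctx, 1, iocbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* reap the completion; events[0].res holds the number of bytes written (1024) */
    got = io_getevents(ctx, 1, 128, events, &timeout);
    if (got == 1)
        printf("IO completed, %ld bytes written\n", (long) events[0].res);

    io_destroy(ctx);
    close(fd);
    return 0;
}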

It would be logical to think that in this example the IO time is 107 microseconds. But hold that thought, I’ll come back to it. Let’s add a logfile member to every logfile, and switch to a new logfile so writes go to both logfile members. This is what the logwriter writing looks like now:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2
lseek(26, 0, SEEK_CUR)                  = 101469
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 31 37 20  write' e la= 117  |
 | 00030  66 69 6c 65 73 3d 32 20  62 6c 6f 63 6b 73 3d 32  files=2  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 32 20 6f 62 6a 23   request s=2 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 38 36  =-1 tim= 13778086 |
 | 00060  33 32 30 36 37 31 32 35                           32067125          |

We see here that Oracle submits two IOs with a single call, and reaps both of them (read the manpage of io_getevents; the minimal number of IOs to wait for is set to 2). So ‘log file parallel write’ indeed writes in parallel. The blocks statistic is a bit puzzling though: earlier, the logwriter IO to a single online redologfile had blocks=2, which meant it wrote 1024 bytes (2 blocks of 512 bytes); here blocks=2 seems to mean the logwriter wrote 2 blocks in total (one 512-byte block per member, as the io_getevents result shows).

How would Oracle write when asynchronous IO is not enabled? Well, let’s set disk_asynch_io to false, re-enable the extended SQL trace, and strace the logwriter again:

pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 1415701504) = 512
pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 6560015360) = 512
lseek(25, 0, SEEK_CUR)                  = 4600
write(25, "WAIT #0: nam='log file parallel "..., 103) = 103
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 35 20 66  write' e la= 15 f |
 | 00030  69 6c 65 73 3d 32 20 62  6c 6f 63 6b 73 3d 32 20  iles=2 b locks=2  |
 | 00040  72 65 71 75 65 73 74 73  3d 32 20 6f 62 6a 23 3d  requests =2 obj#= |
 | 00050  2d 31 20 74 69 6d 3d 31  33 37 37 38 30 39 33 39  -1 tim=1 37780939 |
 | 00060  37 38 38 31 38 31 39                              7881819           |

Ah! So when there’s no asynchronous IO, Oracle issues two pwrite() calls sequentially. Not really parallel!
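
For comparison, here is a minimal C sketch of this synchronous variant (again not Oracle’s code; the filenames are made up): two pwrite() calls, one after the other, so the second write can only start after the first one has returned.

/* pwrite_sketch.c: sketch of the synchronous path (not Oracle's code):
 * with disk_asynch_io=false the two log members are written serially.
 * Compile with: gcc -o pwrite_sketch pwrite_sketch.c
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[512];                                                   /* one 512-byte log block */
    int fd1 = open("redo_member1.dbf", O_WRONLY | O_CREAT, 0644);    /* stand-ins for the */
    int fd2 = open("redo_member2.dbf", O_WRONLY | O_CREAT, 0644);    /* two redo members  */

    memset(buf, 'x', sizeof(buf));

    /* serial, not parallel: the total elapsed time is the sum of both write latencies */
    if (pwrite(fd1, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf)) perror("pwrite member 1");
    if (pwrite(fd2, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf)) perror("pwrite member 2");

    close(fd1);
    close(fd2);
    return 0;
}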

Now think about what the wait means in both situations. In the first situation, with asynchronous IO (which should be your setting if you have a modern system), the timing covers both IOs, and since they were issued in parallel, provided there is enough storage bandwidth left, it seems to indicate the time of the slowest IO. In the second situation, the IOs are issued serially, so the wait time seems to indicate the total latency of the two IOs. That would mean you have to divide the wait time by the number of requests to get the average IO time. Again: only if the IOs are done synchronously instead of asynchronously.
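
To make that concrete, take the two traces shown earlier at face value for a moment:

asynchronous: ela= 117, requests=2, both IOs submitted with one io_submit call
              -> 117 microseconds would roughly be the time of the slowest of the two IOs
synchronous : ela= 15, requests=2, issued as two sequential pwrite() calls
              -> naive average per IO: 15 / 2 = 7.5 microseconds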

But is this true?

Let’s dig a little deeper using gdb! First synchronous IO. Attach to the logwriter using gdb (gdb -p PID), and issue the following debugger commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 3 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) 

Now insert something into a table, and commit.

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Hey! Do you see what I see? First I see the logwriter nicely sleeping like a little baby. With that I mean: I see kslwtbctx() to indicate the start of the timing of a wait event, then semtimedop() is issued for the sake of sleeping, and after semtimedop() has timed out, kslwtectx() to end the event. But then the writing starts (which is the function of the logwriter; the pwrite64() calls), and there seems to be a kslwtbctx() missing??? Actually, kslwtbctx() is there, but AFTER the pwrite64() calls!

Am I missing something? Is Oracle doing something incredibly smart? Well, let’s start an extended SQL trace and throttle the IO. By throttling the write IO of the logwriter I can severely increase the IO time, so if the pwrite64() calls were timed, the wait time should increase drastically (see the throttling blogpost on this blog; for the logwriter you must limit write IOPS instead of read IOPS — a rough sketch of the throttling setup follows a bit further down). I’ve done that, and this is what the extended SQL trace shows:

*** 2013-08-29 23:11:54.509
WAIT #0: nam='log file parallel write' ela= 512 files=2 blocks=4 requests=2 obj#=-1 tim=1377810714509122

*** 2013-08-29 23:11:54.509
Warning: log write elapsed time 2004ms, size 1KB

*** 2013-08-29 23:11:56.509
WAIT #0: nam='log file parallel write' ela= 541 files=2 blocks=2 requests=2 obj#=-1 tim=1377810716509653

*** 2013-08-29 23:11:56.509
Warning: log write elapsed time 2000ms, size 0KB

This is what I call being caught with the fingers in the cookie jar! The wait event shows an (extremely) low timing for the IOs (the ela=), and right after that we get a warning that the log writes were terribly slow! Oracle doesn’t time its log write events correctly when synchronous IO is used in this situation!!!
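
For reference, the throttling mentioned above can be done along these lines with the cgroup v1 blkio controller (a rough sketch only: the mount point, the device major:minor numbers, the cgroup name and the limit of 1 write IOPS are assumptions for this example; see the throttling blogpost for the real procedure):

# the blkio controller is assumed to be mounted at /cgroup/blkio; adjust the path if yours differs
mkdir /cgroup/blkio/lgwrthrottle
# limit writes to the device holding the online redologs (major:minor 8:16 here) to 1 write IOPS
echo "8:16 1" > /cgroup/blkio/lgwrthrottle/blkio.throttle.write_iops_device
# move the logwriter (pid 2491 in this post) into the cgroup
echo 2491 > /cgroup/blkio/lgwrthrottle/tasks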

Now let’s have a look at asynchronous IO. I do the same: attach to lgwr with gdb, and now break and continue on the asynchronous IO calls io_submit and io_getevents (please mind that on Linux this requires breaking on io_getevents_0_4, the versioned symbol in libaio):

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 3 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should now see a steady stream of kslwtbctx and kslwtectx breakpoints, which is the timing of the semtimedop() calls. If you insert something into a table and commit, you’ll see:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001750) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40078f8, timeout=0x7fffc4008900) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Here’s something worth seeing! The first line shows the ending of a wait (kslwtectx), then the logwriter submitting an IO request (io_submit). Notice that no wait event timing has been started at this point, because kslwtbctx is called AFTER the IOs are submitted. This means the wait event times the reaping of the submitted IOs, rather than the full IO latency. I know the difference is, and should be, a tiny fragment of time, but nevertheless the wait only covers waiting for the IOs to be reaped, not the timing of the entire IO.

Now look at the io_submit/io_getevents strace output again shown earlier in the post:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2

Look at the struct timespec in the io_getevents call: {0, 0}. This is a non-blocking reap call that just peeks at the O/S completion queue. In other words, if the submitted IOs take a long time or are not ready yet, this call simply returns without them and execution continues. This looks like the asynchronous read implementation of direct path reads (see this blogpost about non-blocking io_getevents calls).

Let’s throttle the IO of the lgwr, and print out the timespec struct at the break on io_getevents:

(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>print *timeout
>c
>end

Now redo the insertion of data into a table, and commit to put the logwriter to work. Please mind IO is throttled!

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001640) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077e8, timeout=0x7fffc40087f0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$6 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077d8, timeout=0x7fffc40087e0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$7 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Ah! We see the io_submit call issuing two IOs (nr=2), then the start of the timing of the ‘log file parallel write’ event (kslwtbctx), and then two io_getevents calls: the non-blocking one first, which almost certainly does not succeed (and is NOT shown by strace!), after which a blocking io_getevents call is issued (because the seconds field of the timespec struct is set to 600).

I have no idea why a non-blocking io_getevents call is issued first, followed by a blocking io_getevents call.
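
For completeness, here is a minimal C sketch of that reap pattern expressed in libaio calls (not Oracle’s code; the filename is made up): a non-blocking peek with a {0, 0} timespec first, then a blocking io_getevents() with a {600, 0} timespec for anything that has not completed yet.

/* reap_sketch.c: the reap pattern observed with gdb above, expressed in
 * libaio calls (not Oracle's code).
 * Compile with: gcc -o reap_sketch reap_sketch.c -laio
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <libaio.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb1, cb2;
    struct iocb *iocbs[2] = { &cb1, &cb2 };
    struct io_event events[128];
    struct timespec peek  = { 0, 0 };      /* {0, 0}: just look at the completion queue */
    struct timespec block = { 600, 0 };    /* {600, 0}: wait up to 600 seconds          */
    char buf[512];
    int fd, got;

    if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }
    fd = open("redo_member.dbf", O_WRONLY | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return 1; }
    memset(buf, 'x', sizeof(buf));

    /* two 512-byte writes, submitted with one io_submit call (nr=2) */
    io_prep_pwrite(&cb1, fd, buf, sizeof(buf), 0);
    io_prep_pwrite(&cb2, fd, buf, sizeof(buf), 512);
    if (io_submit(ctx, 2, iocbs) != 2) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* non-blocking peek first ... */
    got = io_getevents(ctx, 2, 128, events, &peek);
    /* ... and if not everything completed yet, a blocking reap for the rest
     * (in the gdb output above the peek reaped nothing, so min_nr stayed 2) */
    if (got < 2)
        got += io_getevents(ctx, 2 - got, 128, events + got, &block);

    printf("reaped %d IOs\n", got);
    io_destroy(ctx);
    close(fd);
    return 0;
}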

Summary
This is the description of ‘log file parallel write’, taken from the Oracle 11.2 Reference guide:

‘Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.’

This is actually a very good description, surprisingly accurate for asynchronous IO.
As we’ve seen, this is not true for synchronous IO: there Oracle just issues the I/Os it needs to do sequentially, and they are not timed at all, yet a wait time is displayed, which is wrong.

It turns out that for asynchronous log writes, the ‘log file parallel write’ event times the waiting for the writes to be flushed to disk. What is made visible with the event timing is the wait for the slowest write; all the fast(er) writes can have any duration below that of the slowest write. In other words, the wait is timed from the perspective of the process, it is not an IO latency timing. It gives a fair hint of the IO time of the slowest IO, so for the way most people use it (measuring the impact of the time the logwriter spends on IO) it’s fairly accurate, but for actual IO latency measurement it’s weak. If you want to go beyond measuring the time the logwriter spends on IO, you should be very well aware of the implementation. For example, you cannot divide the ‘log file parallel write’ time by the number of IOs; the resulting number does not reflect anything.

Correction: Chris Buckel pointed out I had ‘db file parallel write’ instead of ‘log file parallel write’ in the summary. This is corrected.
Correction: Kevin Closson pointed out that the log block size can be 4kB too with newer versions of Oracle.

11 comments
  1. rsiz said:

    Very thorough Frits. Since asynch I/O probably should be set, the relative lack of instrumentation for synch probably does not hurt production systems, but it may distort measurements in a confusing way. With asynch, this clarifies why log writes can appear slower than OS level I/O latency tests if the location of logfile members can cause contention anywhere in the I/O infrastructure pathway to the members (a similar problem to when folks leave all the copies of the controlfile in the same default location.) To better measure the expected latency from the OS level therefore, parallel writes to the planned locations on the physical media for the log file members would be needed to see if there is a difference between individual writes to the target locations versus simultaneous writes to all members. On top of that of course might be the intermittent read load of arch, which was the basis in the days of having many relatively small devices of the idea of ping ponging the online redo log member locations (especially when such devices had little or no write cache in front of them). How this might show up in actual increased latency through ASM to underlying media supporting permanence would be complex to measure except through the actual Oracle software. I believe this could potentially explain a difference in the measurement of latency by the “storage team” versus the “dba team” of up to the number of members per online redo log if something in the underlying infrastructure (oh, such as a seek not masked by write cache) were to occur. Thanks for the clarity of your detail!

  2. Nice post Frits.

    You said that redo blocks are 512 bytes full stop. That is true…unless one is using flash for redo, in which case they should set the blocksize parameter to 4K when creating the logs.

    Personally I see no problem with Oracle’s timing method in the async case. A log file parallel write event is a batched write and they are timing how long it took to complete the batch of writes. It’s true that, for instance, in the case of 4 writes, 3 may be sub-millisecond and one of them may be, say, 10ms, but since it is a batch event, who really cares about the few that are fast? It’s the slowest individual transfer that matters. The sync I/O case is just stupid but using synchronous I/O for redo is, well, uh, I won’t say it.

  3. Agree on the 4kB as redo block size, will correct that in the text.

    I try to make the readers aware of the implementation of the wait, so they understand the event. Especially that this event does not tell the entire latency story.

    • >I try to make the readers aware of the implementation of the wait, so they understand the event. Especially that this event does not tell the entire latency story.

      I think you’ve done a really good thing…but I’m on the philosophical side now. Think about it this way (and see if you agree): lgwr waits for log file parallel sync, say, 10ms. Suppose further that the write batch had 8 writes to different files and offsets. Does it matter at all if 7 of the 8 finished in even 42µs if the slowest was 10ms? I guess what I’m trying to say is I don’t care about the fast ones at all really. It’s the slow one that I care about…if, on the other hand, Oracle was silly enough to report an average of the 8 in that same scenario then that would either be stupid or just be called a feature by Oracle 🙂

      • For absolute performance testing, indeed it’s the slowest IO in the batch that makes the performance, and is obviously the one to know.

        But now think of a situation where performance is an issue, and it’s diagnosed to be the logwriter which causes the problem. If you are not aware of the implementation, it’s quite easy to end up with a totally wrong number which is deemed “average IO latency”. Depending on what you do this can be way too low (if you divide ‘log file parallel write’ time by the number of log IOs), or way too high (if you divide ‘log file parallel write’ time by the number of waits), and -probably- you are going to have a strange conversation with the company’s SAN/NAS/storage administrator, because the database figures are radically different from the figures of the storage.

        Having an average and not knowing it would be on the edge of destructive.

  4. Frits,

    Regarding your synchronous I/O scenario, you are right, that’s not at all parallel so the moniker log file parallel write is not fitting. I think maybe you should disable async I/O as you did but configure log writer _lgwr_io_slaves as an experiment. I recall from days of old that having parallel synchronous writers worked fairly well. Would get you back to functionality more suitable to the event name I think.

    • It’s enough for me to know what is happening, even if that means that I get to understand it’s broken or not correct.

      I actually logged a bug for the instrumentation of direct path reads with synchronous IO. The SR in which the bug is logged has been inactive for months now. It’s remarkable how bad Oracle Support is at handling a report of internal errors. I can agree that’s not what support is for, but it’s impossible to get it to development, and to get in touch with someone who actually understands me.

      • >get in touch with someone who actually understands me.

        …someone who cares more like it.

        …really, you should try your same technique with synchronous _lgwr_io_slaves. See what happens.

  5. Timothy Woodcock said:

    Fritz,

    Do you have any updates on this? I started going down this rabbit hole due to target log write size wait events kicking back in EM12c oracle 12.1.0.2.0 on linux.
