Oracle IO on linux: database writer IO and wait events

This post is about database writer (dbwr, mostly seen as dbw0 nowadays) IO.
The test environment in which I made the measurements for this post: Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM. The test environment is a (VMware Fusion) VM with 2 CPUs.

It might be a good idea to read my previous blog about logwriter IO.

The number of database writers depends on the number of CPUs visible to the instance (when it is not explicitly set with the DB_WRITER_PROCESSES parameter), and mostly seems to be CEIL(CPU_COUNT/8). There might be other things that influence the number (NUMA comes to mind). In my case I’ve got 2 CPUs visible, which means CEIL(2/8) = 1 database writer (dbw0).

Let’s start the investigation! I enable a 10046 level 8 trace on the database writer. When you look at an idle Oracle database, you will probably see the database writer process behaving the same way as described in the first part of the log writer IO blog: sleeping on the wait event ‘rdbms ipc message’. This is the database writer waiting for work:

*** 2013-08-30 12:42:00.291
WAIT #0: nam='rdbms ipc message' ela= 3001013 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859320291675

*** 2013-08-30 12:42:03.292
WAIT #0: nam='rdbms ipc message' ela= 3000870 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859323292672

*** 2013-08-30 12:42:06.293
WAIT #0: nam='rdbms ipc message' ela= 3000806 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859326293581

There’s a description of this wait in the previously mentioned log writer blog. The short version: ‘rdbms ipc message’ means waiting to be posted for work.

It helps to make a little table, insert one or a few rows into it, and then commit. This dirties blocks in the buffer cache. After that, issue ‘alter system flush buffer_cache’ to make Oracle write all the dirty blocks currently in the buffer cache. This is what is shown in the database writer trace file:

WAIT #0: nam='rdbms ipc message' ela= 328100 timeout=300 p2=0 p3=0 obj#=-1 tim=1377868410713860
WAIT #0: nam='db file async I/O submit' ela= 1 requests=14 interrupt=0 timeout=0 obj#=-1 tim=1377868410715372
WAIT #0: nam='db file parallel write' ela= 26 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1377868410715427

The first line is an ‘rdbms ipc message’ with an elapsed time (ela=) shorter than 3 seconds, because we triggered the database writer with the ‘alter system flush buffer_cache’. The database writer is notified to do something, and stops waiting.

Then we see two waits: ‘db file async I/O submit’ and ‘db file parallel write’. ‘db file parallel write’ is a wait I expected: Oracle must write these dirty blocks, IO costs time, and it’s logical to have a wait event for that. I had seen the wait event ‘db file async I/O submit’ in some AWR reports, but never at the top end of the waits, which means I never looked, or needed to look, at it yet…

Well, let’s dive one layer deeper, and combine strace with a 10046 level 8 trace file using ‘strace -e write=all -e all -p PID’:

semtimedop(851971, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
write(18, "\n*** 2013-08-30 15:32:44.063\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 35 3a 33 32 3a 34 34  2e 30 36 33 0a           15:32:44 .063.    |
lseek(18, 0, SEEK_CUR)                  = 682279
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 32  38 31 32 31 31 20 74 69   ela= 12 81211 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 36 39 35  36 34 30 36 33 36 30 31  13778695 64063601 |
write(19, "J?2-u4~UW1\n", 11)           = 11
 | 00000  4a 3f 32 2d 75 34 7e 55  57 31 0a                 J?2-u4~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
io_submit(139931766530048, 24, {{0x7f44671a7a50, 0, 1, 0, 256}, {0x7f44671a7ca8, 0, 1, 0, 257}, {0x7f44671a7f00, 0, 1, 0, 256}, {0x7f44671a8158, 0, 1, 0, 256}, {0x7f44671a83b0, 0, 1, 0, 256}, {0x7f44671a8608, 0, 1, 0, 257}, {0x7f44671a8860, 0, 1, 0, 257}, {0x7f44671a8ab8, 0, 1, 0, 257}, {0x7f44671a8d10, 0, 1, 0, 257}, {0x7f44671a8f68, 0, 1, 0, 257}, {0x7f44671a91c0, 0, 1, 0, 257}, {0x7f44671a9418, 0, 1, 0, 257}, {0x7f44671a9670, 0, 1, 0, 257}, {0x7f446719e1f8, 0, 1, 0, 256}, {0x7f44671a77f8, 0, 1, 0, 257}, {0x7f44671a75a0, 0, 1, 0, 257}, {0x7f44671a7348, 0, 1, 0, 256}, {0x7f44671a70f0, 0, 1, 0, 256}, {0x7f44671a6e98, 0, 1, 0, 256}, {0x7f44671a6c40, 0, 1, 0, 256}, {0x7f44671a69e8, 0, 1, 0, 256}, {0x7f44671a6790, 0, 1, 0, 256}, {0x7f44671a6538, 0, 1, 0, 256}, {0x7f44671a62e0, 0, 1, 0, 256}}) = 24
lseek(18, 0, SEEK_CUR)                  = 682376
write(18, "WAIT #0: nam='db file async I/O "..., 109) = 109
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 61 73  79 6e 63 20 49 2f 4f 20   file as ync I/O  |
 | 00020  73 75 62 6d 69 74 27 20  65 6c 61 3d 20 31 20 72  submit'  ela= 1 r |
 | 00030  65 71 75 65 73 74 73 3d  32 34 20 69 6e 74 65 72  equests= 24 inter |
 | 00040  72 75 70 74 3d 30 20 74  69 6d 65 6f 75 74 3d 30  rupt=0 t imeout=0 |
 | 00050  20 6f 62 6a 23 3d 2d 31  20 74 69 6d 3d 31 33 37   obj#=-1  tim=137 |
 | 00060  37 38 36 39 35 36 34 30  36 36 31 36 37           78695640 66167    |
write(19, "J?JV~1j1\n", 9)              = 9
 | 00000  4a 3f 4a 56 7e 31 6a 31  0a                       J?JV~1j1 .        |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
io_getevents(139931766530048, 2, 128, {{0x7f44671a7a50, 0x7f44671a7a50, 8192, 0}, {0x7f44671a7ca8, 0x7f44671a7ca8, 8192, 0}, {0x7f44671a7f00, 0x7f44671a7f00, 8192, 0}, {0x7f44671a8158, 0x7f44671a8158, 8192, 0}, {0x7f44671a83b0, 0x7f44671a83b0, 8192, 0}, {0x7f44671a8608, 0x7f44671a8608, 8192, 0}, {0x7f44671a8860, 0x7f44671a8860, 8192, 0}, {0x7f44671a8ab8, 0x7f44671a8ab8, 8192, 0}, {0x7f44671a8d10, 0x7f44671a8d10, 8192, 0}, {0x7f44671a8f68, 0x7f44671a8f68, 8192, 0}, {0x7f44671a91c0, 0x7f44671a91c0, 8192, 0}, {0x7f44671a9418, 0x7f44671a9418, 8192, 0}, {0x7f44671a9670, 0x7f44671a9670, 8192, 0}, {0x7f446719e1f8, 0x7f446719e1f8, 8192, 0}, {0x7f44671a77f8, 0x7f44671a77f8, 8192, 0}, {0x7f44671a75a0, 0x7f44671a75a0, 8192, 0}, {0x7f44671a7348, 0x7f44671a7348, 16384, 0}, {0x7f44671a70f0, 0x7f44671a70f0, 8192, 0}, {0x7f44671a6e98, 0x7f44671a6e98, 8192, 0}, {0x7f44671a6c40, 0x7f44671a6c40, 16384, 0}, {0x7f44671a69e8, 0x7f44671a69e8, 8192, 0}, {0x7f44671a6790, 0x7f44671a6790, 8192, 0}, {0x7f44671a6538, 0x7f44671a6538, 8192, 0}, {0x7f44671a62e0, 0x7f44671a62e0, 8192, 0}}, {600, 0}) = 24
write(18, "WAIT #0: nam='db file parallel w"..., 117) = 117
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 35 33 20 72  rite' el a= 153 r |
 | 00030  65 71 75 65 73 74 73 3d  32 20 69 6e 74 65 72 72  equests= 2 interr |
 | 00040  75 70 74 3d 30 20 74 69  6d 65 6f 75 74 3d 32 31  upt=0 ti meout=21 |
 | 00050  34 37 34 38 33 36 34 37  20 6f 62 6a 23 3d 2d 31  47483647  obj#=-1 |
 | 00060  20 74 69 6d 3d 31 33 37  37 38 36 39 35 36 34 30   tim=137 78695640 |
 | 00070  36 36 35 35 30                                    66550             |

I started this snippet of strace output with the semtimedop() system call. If you look at the extended SQL trace (10046 level 8) output, you see that the database writer stopped sleeping on the semaphore after 1.28 seconds. Then some writes are done (to the .trc and .trm files), and a times() system call fetches user-mode and kernel-mode CPU times, after which an important system call is shown: io_submit(). io_submit() submits 24 IOs here.
After the io_submit() system call, the process writes a wait event to the trace: ‘db file async I/O submit’, with an elapsed time (ela=) of 1 microsecond.

This looks like Oracle instrumented the submission of IOs for the database writer! This is different from the log writer and normal database foreground processes, for which the io_submit() system call is not instrumented!

This wait event also shows the number of IOs (requests=24) which are submitted by the io_submit() call.

Then we see the call to reap the writes: io_getevents(). The io_getevents() system call needed at least 2 IOs ready to stop waiting (!). But instead, it was able to reap all 24 submitted IOs (you could count all the iocbs in the io_getevents() call, or look at the return code of the call, which is 24). The wait event of the reap is ‘db file parallel write’, as you can see from the write() system call. This is where it gets truly interesting, because when you look at the information in the wait line, you see requests=2, which probably has a relationship with the min_nr parameter of the io_getevents() call. We also see the io_getevents() call has a timeout of 600 seconds. This seems to be a general timeout for asynchronous IO calls done by Oracle on linux. What stands out to me is that the wait line does not contain any information about how many requests it actually reaped; two IOs ready was enough for the system call to succeed.
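To make these io_submit()/io_getevents() semantics a bit more tangible, here is a minimal libaio sketch. This is my own code, not Oracle’s; the file name ‘scratch.dbf’, the batch size and the block size are made up. It submits a batch of writes in one io_submit() call and then reaps them with min_nr=2 and a 600-second timeout, just like the calls above: the reap returns as soon as at least two completions are available, but may return up to nr completions in one go.

/* aio_batch_sketch.c -- a minimal libaio sketch of the io_submit()/io_getevents()
   pattern seen in the database writer trace. Not Oracle code: the file name,
   batch size and block size are made up for illustration.
   Build: gcc aio_batch_sketch.c -o aio_batch_sketch -laio */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NR_IOS 8            /* writes in one batch (the dbwr above submitted 24)   */
#define BLKSZ  8192         /* Oracle-style 8 KiB blocks                           */

int main(void)
{
    io_context_t ctx = 0;
    struct iocb iocbs[NR_IOS], *iocbp[NR_IOS];
    struct io_event events[128];
    struct timespec timeout = { 600, 0 };    /* the same 600 s timeout seen above  */

    int fd = open("scratch.dbf", O_RDWR | O_CREAT | O_DIRECT, 0644);  /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }
    if (io_setup(128, &ctx) != 0) { perror("io_setup"); return 1; }

    for (int i = 0; i < NR_IOS; i++) {
        void *buf;
        if (posix_memalign(&buf, 4096, BLKSZ)) return 1;   /* O_DIRECT needs aligned buffers */
        memset(buf, 0, BLKSZ);
        /* one iocb per dirty block: write BLKSZ bytes at the block's own offset */
        io_prep_pwrite(&iocbs[i], fd, buf, BLKSZ, (long long)i * BLKSZ);
        iocbp[i] = &iocbs[i];
    }

    /* one system call submits the whole batch; this count is what 'requests='
       in 'db file async I/O submit' shows */
    int submitted = io_submit(ctx, NR_IOS, iocbp);
    printf("io_submit: %d requests submitted\n", submitted);

    /* blocking reap: return as soon as at least 2 completions are available
       (min_nr=2), but take up to 128 if more are already done -- this is the
       io_getevents() call underneath 'db file parallel write' */
    int reaped = io_getevents(ctx, 2, 128, events, &timeout);
    printf("io_getevents: reaped %d of %d\n", reaped, submitted);

    io_destroy(ctx);
    close(fd);
    return 0;
}

Running this under strace shows the same shape as the database writer snippet above: one io_submit() for the whole batch, and one io_getevents() that may reap more IOs than its min_nr asked for.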

Now let’s switch to synchronous IO. I have probably mentioned this a couple of times, but if you don’t do asynchronous IO, you are doing it seriously wrong. Many of the newer IO functions in Oracle have been purpose-built for asynchronous IO; direct path reads are a famous example, for which asynchronous IO gives you the opportunity to overcome the IO latency penalty of every distinct IO.

This is what writing by the database writer looks like with the combination of extended SQL trace and strace when disk_asynch_io is set to false, and the database writer is triggered by ‘alter system flush buffer_cache’:

semtimedop(983043, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
write(18, "\n*** 2013-08-30 16:34:19.162\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 36 3a 33 34 3a 31 39  2e 31 36 32 0a           16:34:19 .162.    |
lseek(18, 0, SEEK_CUR)                  = 9915
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 36  35 34 35 38 36 20 74 69   ela= 16 54586 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 37 33 32  35 39 31 36 32 37 36 32  13778732 59162762 |
write(19, "J?e7K6~UW1\n", 11)           = 11
 | 00000  4a 3f 65 37 4b 36 7e 55  57 31 0a                 J?e7K6~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
pwrite(256, "\6\242\0\0\207 \f\0\214\255O\0\0\0\n\6\236\316\0\0\1\0\0\0%1\1\0X\231O\0"..., 8192, 6221258752) = 8192
pwrite(257, "\2\242\0\0\326\22\300\0\214\255O\0\0\0\t\4\10\207\0\0\1\0\16\0\371\4\0\0005\2\32\32"..., 8192, 3414867968) = 8192
pwrite(256, "\2\242\0\0*\2\300\0\205\255O\0\0\0\1\4\236W\0\0\10\0\33\0=\6\0\0\305\2\v\v"..., 8192, 1359298560) = 8192
pwrite(256, "\2\242\0\0d\1\300\0\207\255O\0\0\0\1\4\321e\0\0\4\0\33\0#\5\0\0\35\3\20\20"..., 8192, 1358725120) = 8192
pwrite(257, "&\242\0\0\360\0\300\0\206\255O\0\0\0\1\4[\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1368260608) = 8192
pwrite(257, "&\242\0\0\260\0\300\0\210\255O\0\0\0\1\4\320\351\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367736320) = 8192
pwrite(257, "&\242\0\0\200\0\300\0\214\255O\0\0\0\2\4.)\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367343104) = 8192
pwrite(257, "\6\242\0\0\6\t@\0\210\255O\0\0\0\1\6Z\37\0\0\1\0\0\0N&\1\0\21\254O\0"..., 8192, 749780992) = 8192
pwrite(256, "\6\242\0\0\231\2@\0\206\255O\0\0\0\1\6q\207\0\0\1\0\0\0J\0\0\0\253\254O\0"..., 8192, 734208000) = 8192
lseek(18, 0, SEEK_CUR)                  = 10012
write(18, "WAIT #0: nam='db file parallel w"..., 107) = 107
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 32 34 20 72 65  rite' el a= 24 re |
 | 00030  71 75 65 73 74 73 3d 39  20 69 6e 74 65 72 72 75  quests=9  interru |
 | 00040  70 74 3d 30 20 74 69 6d  65 6f 75 74 3d 30 20 6f  pt=0 tim eout=0 o |
 | 00050  62 6a 23 3d 2d 31 20 74  69 6d 3d 31 33 37 37 38  bj#=-1 t im=13778 |
 | 00060  37 33 32 35 39 31 38 33  32 32 37                 73259183 227      |
write(19, "J?ny4~1h1\n", 10)            = 10
 | 00000  4a 3f 6e 79 34 7e 31 68  31 0a                    J?ny4~1h 1.       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(18, "WAIT #0: nam='db file parallel w"..., 106) = 106
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 20 72 65 71  rite' el a= 1 req |
 | 00030  75 65 73 74 73 3d 39 20  69 6e 74 65 72 72 75 70  uests=9  interrup |
 | 00040  74 3d 30 20 74 69 6d 65  6f 75 74 3d 30 20 6f 62  t=0 time out=0 ob |
 | 00050  6a 23 3d 2d 31 20 74 69  6d 3d 31 33 37 37 38 37  j#=-1 ti m=137787 |
 | 00060  33 32 35 39 31 38 33 34  32 30                    32591834 20       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semctl(983043, 16, SETVAL, 0x1)         = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semtimedop(983043, {{14, -1, 0}}, 1, {1, 320000000}) = -1 EAGAIN (Resource temporarily unavailable)

Mmmh… I see a number of very suspicious things. I started this snippet with the sleep on the semaphore, after which the process writes the wait event information about it. The next nine system calls are pwrite() system calls, which is the database writer writing dirty blocks to the datafiles. As you can see, this happens sequentially (one by one). After the writes, the process writes the wait event information. We get no ‘db file async I/O submit’ wait, because there is no asynchronous IO; we go directly to the wait ‘db file parallel write’. The number of requests in the wait is 9, which is the same as the number of pwrite() system calls.
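For comparison, the synchronous code path is conceptually nothing more than a plain loop of pwrite() calls, one block after another. A minimal sketch (again my own code with a made-up file and offsets, not Oracle’s):

/* sync_write_sketch.c -- the synchronous database writer path boils down to
   sequential pwrite() calls, one per dirty block. File name, block count and
   offsets are made up for illustration.
   Build: gcc sync_write_sketch.c -o sync_write_sketch */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ 8192

int main(void)
{
    char buf[BLKSZ];
    memset(buf, 0, BLKSZ);

    int fd = open("scratch.dbf", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* nine dirty blocks, written one by one: each pwrite() only returns when
       the write has been handed over to the OS, so the total elapsed time is
       the sum of the nine individual calls */
    for (int i = 0; i < 9; i++) {
        if (pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ) {
            perror("pwrite");
            return 1;
        }
    }
    close(fd);
    return 0;
}

Each pwrite() blocks for the duration of its own write, which is exactly the time the wait events below turn out not to capture.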

When I look further (still at the first wait line!) I see ela=24. This is very suspicious… can I really do 9 write calls to the OS in 24 microseconds? But wait, it gets weirder!

I see yet another ‘db file parallel write’ wait event after the first one, without any pwrite()s happening in between. That is strange. My guess is this wait represents the same 9 pwrite() requests, and it was timed at 1 microsecond (ela=1). I’ve verified this behaviour (the wait event being emitted twice for the same set of writes); it happens at least with every ‘alter system flush buffer_cache’. An elapsed time of 1 microsecond is also very suspicious.

Let’s dig one layer deeper and look at the start and end of the wait event timing. In order to do this, I use gdb and the following commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) command
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) command
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 4 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

After gdb continues, I issue a ‘alter system flush buffer_cache’, and look at the gdb output:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

I’ve included the two semaphore waits. It is easy to spot that the pwrite64() calls are not timed. That is the reason the elapsed times in the ‘db file parallel write’ waits are so short… they both do not time any actual IO calls! Unless someone can prove me wrong, it looks like the wait event instrumentation of the database writer with synchronous IO is seriously broken.

But, as I’ve said: you really should have a platform which has the ability to do asynchronous IO.

Let’s switch to asynchronous IO again, and look at it at the same layer. Here are the gdb commands I’ve used:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660: file io_submit.c, line 23.
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 5 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should see the semtimedop() system calls with their accompanying kslwt[be]ctx() calls scrolling over the screen every 3 seconds. Insert a few rows into a dummy table, commit, issue an ‘alter system flush buffer_cache’, and look what happens:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

I included a semaphore wait at the beginning and at the end of this snippet.
The next remarkable thing is that the io_submit() system call does not seem to be timed by a wait event, because it is not enclosed in a kslwtbctx() and kslwtectx() combination. That’s a pity, because the name of the wait event ‘db file async I/O submit’ strongly hints at it being the timing of submitting IO to the operating system.
After this, we see kslwtbctx(), io_getevents_0_4(), kslwtectx(). This is Oracle timing the reap of the writes submitted earlier.
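To make explicit what that kslwtbctx()/kslwtectx() placement implies, here is a sketch in which gettimeofday() stands in for the wait begin/end calls (my own code with a hypothetical file, not Oracle’s instrumentation): whatever time io_submit() takes falls outside the timed section; only the reap is inside it.

/* timed_reap_sketch.c -- which part of the asynchronous write is actually timed?
   Per the gdb output, the wait brackets (kslwtbctx/kslwtectx) only enclose
   io_getevents(); io_submit() happens outside any wait. gettimeofday() mimics
   that bracketing here. Not Oracle code; file name and sizes are made up.
   Build: gcc timed_reap_sketch.c -o timed_reap_sketch -laio */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbp = &cb;
    struct io_event ev[128];
    struct timespec timeout = { 600, 0 };
    struct timeval t0, t1;
    void *buf;

    int fd = open("scratch.dbf", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0 || io_setup(128, &ctx) != 0) { perror("setup"); return 1; }
    if (posix_memalign(&buf, 4096, 8192)) return 1;
    memset(buf, 0, 8192);
    io_prep_pwrite(&cb, fd, buf, 8192, 0);

    io_submit(ctx, 1, &cbp);           /* NOT inside a wait: however long this takes,
                                          it never ends up in the ela= of
                                          'db file async I/O submit'               */

    gettimeofday(&t0, NULL);           /* ~ kslwtbctx(): 'db file parallel write'   */
    int reaped = io_getevents(ctx, 1, 128, ev, &timeout);
    gettimeofday(&t1, NULL);           /* ~ kslwtectx(): ela= only covers the reap  */

    printf("reaped %d IO(s), timed section: %ld us\n", reaped,
           (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
    io_destroy(ctx);
    close(fd);
    return 0;
}

If you run the sketch, the printed timed section only reflects the reap, no matter how long the submit took; that matches what the ‘shell sleep 1’ experiment below shows for the real database writer process.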

To be absolutely sure there isn’t some kind of magic happening with io_submit(), let’s make this call last longer.
In order to do so, you can change the commands attached to the breakpoint on io_submit():

(gdb) commands 4
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end
(gdb) 

The inclusion of ‘shell sleep 1’ means that when the debugger breaks execution of the database writer on encountering io_submit(), it executes ‘sleep 1’ (sleep for 1 second), after which the ‘c’ (continue) command is executed by the debugger, which continues execution of the database writer process. The result is that, seen from the database writer process, the io_submit() system call now takes at least 1 second.

However, having done so, a 10046/8 trace still looks like this on my test system:

*** 2013-09-06 11:12:16.337
WAIT #0: nam='db file async I/O submit' ela= 470 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458736337487
WAIT #0: nam='db file parallel write' ela= 1386 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458736340152

*** 2013-09-06 11:12:17.344
WAIT #0: nam='db file async I/O submit' ela= 545 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458737344837
WAIT #0: nam='db file parallel write' ela= 1554 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458737346981

What we see is that even with the io_submit() call taking at least 1 second, the event ‘db file async I/O submit’ is still timed at approximately 0.5 milliseconds (ela= 470 and 545 microseconds). So at this point it’s safe to say ‘db file async I/O submit’ is NOT the timing of submitting IO requests to the operating system.

Now take a look at the event ‘db file parallel write’ again in the 10046/8 output. We’ve already established that “requests” in the event ‘db file parallel write’ does not reflect the number of IO requests it actually reaped, but instead is the minimum number of requests this io_getevents() call should reap from the operating system completion queue before the system call is ready. The actual number of IO requests it reaped is not shown. Another oddity is the timeout. The number is recognisable as 2^31-1 to anyone working in IT. If the timeout is measured in hundredths of a second (according to the Oracle reference guide, version 11.2), it means the write times out after 2147483647/100 = 21,474,836 seconds, which is roughly 248 days. That seems a bit lengthy to me.

The nice thing is we can actually see the timeout of the operating system call, io_getevents(). This is extremely easy with strace (the timeout is the last argument, shown enclosed in curly brackets). Another way to do so is to ask the debugger to print the struct:

(gdb) commands 5
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) c
Continuing.

This is the result:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$1 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

So the timeout of io_getevents() is actually set to 600 seconds ($1 = {tv_sec = 600, tv_nsec = 0}), and not the enormous value the wait event’s timeout parameter suggests. Of course this conclusion is about the system call the database writer executes; there is a possibility that the Oracle code retries the call, but I strongly doubt it.

Up to this point the database writer only needed to flush a few blocks to disk. Because of the low number of blocks to write, and because the system on which I perform my tests is a VM whose IOs are likely to be cached, the latency of these IOs is extremely low. What if we limit the number of IOs the database writer can do? This blogpost describes how that can be done on linux.

At this point there are a few million rows in my test table (I generated more dirty blocks for writing using, for example, ‘insert into t select 1+(level-1) from dual connect by level < 100000;’). I set up IO throttling using cgroups and put the database writer in the task list to be throttled. After that, I started deleting all rows from my dummy table t using ‘delete from t’.

Please mind the database writer still has event 10046/8 active, so I can follow what it is doing at the Oracle level. I also modified the breakpoints in gdb a bit so I can detect the number of IOs io_getevents() was able to reap: I use the ‘finish’ command, which waits until the current function ends and prints out the return value. io_getevents() returns the number of IOs it was able to reap. The downside of using ‘finish’ is that execution is stopped and must be continued manually. Nevertheless, this is the result:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=153, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=38, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$12 = {tv_sec = 600, tv_nsec = 0}
c
0x0000000002cfa74e in skgfrliopo ()
Value returned is $13 = 38
(gdb) c
Continuing.

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$14 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $15 = 0
(gdb) c
Continuing.

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$16 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $17 = 18
(gdb) 

At the top we see kslwtectx(), and then io_submit() submitting 153 IO requests. After that, we see kslwtbctx()-kslwtectx() for the event ‘db file async I/O submit’.
Then we see the start of a wait (kslwtbctx()) for ‘db file parallel write’, and io_getevents(). This reap request must reap at least 38 IO requests, with a timeout set to 600 seconds. Because IOPS for both reads and writes are aggressively throttled (to 1 IOPS), this call reaps only the minimum number of IO requests: 38.

I was able to see this because of the ‘finish’ command in gdb. It returns me to the calling function (skgfrliopo()), which is an Oracle function.

After continuing the process, we see something unexpected (to me at least): the wait is ended (kslwtectx()), and Oracle tries to reap the leftover IOs using two non-blocking calls (the so-called ‘snacking’). Once both calls are done, either new IOs are submitted (which starts the process described here over again), or Oracle registers another ‘db file parallel write’ event and issues a blocking io_getevents() call for a number of IOs (lower than the total number of outstanding requests) to reap. If the number of active IO requests for this process grows, the number of blocking io_getevents() calls inside ‘db file parallel write’ can increase to two. The next snippet shows that; a sketch summarising the whole reaping pattern follows it.

This is what it looks like when the number of active requests gets higher:

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=685, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=114, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 0, tv_nsec = 0}
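Putting the pieces from these two gdb snippets together, the reaping logic can be sketched roughly as follows. This is my own simplification with made-up thresholds, not Oracle’s actual code, and it is written as a fragment meant to follow an io_submit() like the one sketched earlier: one blocking io_getevents() inside the ‘db file parallel write’ wait for part of the outstanding IOs (possibly two such calls when many IOs are outstanding), followed by up to two zero-timeout calls outside the wait for whatever else has already completed.

/* reap_pattern_sketch.c -- the io_getevents() pattern the gdb output suggests,
   as a fragment meant to follow an io_submit() like the one sketched earlier.
   The min_nr fraction is made up; this is not Oracle's actual logic. */
#include <libaio.h>
#include <time.h>

static int reap_like_dbwr(io_context_t ctx, int outstanding)
{
    struct io_event events[128];
    struct timespec block_to = { 600, 0 };  /* blocking reap: 600 s timeout            */
    struct timespec poll_to  = { 0, 0 };    /* zero timeout: return whatever is done   */
    int reaped = 0, n;

    /* inside 'db file parallel write': block until only part of the batch is done
       (in the throttled example above: min_nr=38 of 153 submitted); with more
       outstanding IOs there can be two of these blocking calls */
    int min_nr = outstanding / 4;           /* made-up fraction */
    if (min_nr < 1) min_nr = 1;
    if (min_nr > 128) min_nr = 128;
    n = io_getevents(ctx, min_nr, 128, events, &block_to);
    if (n > 0) reaped += n;

    /* after kslwtectx(), outside the wait: up to two non-blocking "snacks" that
       pick up whatever has completed in the meantime, possibly nothing */
    for (int i = 0; i < 2 && reaped < outstanding; i++) {
        int want = outstanding - reaped > 128 ? 128 : outstanding - reaped;
        n = io_getevents(ctx, want, 128, events, &poll_to);   /* returns immediately */
        if (n > 0) reaped += n;
    }
    return reaped;   /* leftovers are dealt with in a later round */
}

The exact thresholds Oracle uses are unknown to me; the point is only the shape: a blocking reap for part of the batch inside the wait, and zero-timeout reaps for the rest outside it.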

Summary
The database writer related wait events are kind of strange, and what they show is absolutely not obvious.

The name ‘db file parallel write’ is only accurate for asynchronous IO; with synchronous IO, the writes are done sequentially.

The database writer has a unique event named ‘db file async I/O submit’ which, despite its name, does NOT time the submission of asynchronous IOs to the operating system. This event is not shown with synchronous IO, obviously.

The ‘db file parallel write’ event does not time the write calls with synchronous IO, and seems to be printed twice for a single set of write IOs. I consider this wait implementation severely broken.

The event ‘db file async I/O submit’ shows the accurate number of IO requests which are submitted to the operating system using io_submit().
The event ‘db file parallel write’ shows the minimum number of IO requests the process needs to reap using io_getevents(), which can be lower than the number submitted with io_submit(). The timeout shown in this event does not seem to relate to anything. The actual (operating system) timeout is 600 seconds, not roughly 248 days.

Because the ‘db file parallel write’ event waits for a lower number of IO requests than were submitted, the database writer tries to reap any leftover IOs using up to two non-blocking io_getevents() calls, after which it is possible that new IO requests are submitted.

At this moment my conclusion is that ‘db file parallel write’ is an indication of database writer busyness, but absolutely nothing more than that, and certainly not a measurement of IO latency.
