Archive

Linux

This is a quick post about how you can get the debuginfo packages on your Oracle Linux system in the easiest way imaginable: via yum.

I guess most people reading this are familiar with Oracle Linux, and do know how to install it, and how to use the public yum server to install and update packages on Linux right from Oracle’s free internet repository. If you do not know, follow the link and learn.
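In short, at the time of writing the setup comes down to placing the public yum repository file in /etc/yum.repos.d (a minimal sketch; see the public yum page for the authoritative steps):

# cd /etc/yum.repos.d
# wget http://public-yum.oracle.com/public-yum-ol6.repo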

As a sidestep from the purpose of this blog article: during the ACE Director briefing prior to Oracle OpenWorld, Wim Coekaerts announced that the public-yum repository is now hosted on Akamai instead of somewhere “unofficial” in the Oracle infrastructure. This is really, really noticeable when using it now: previously I could not get beyond a speed of approximately 500K/s, now I get speeds of 10M/s.

Anyways, in some cases you want to look deep inside your Linux system (for example using gdb), and want gdb to understand the functions it encounters as much as possible. In order for gdb to understand functions, the debuginfo package of the executable you are running must be installed. Please mind some things simply do not have a corresponding debuginfo package, or even an rpm package; for example the Oracle database.

Finding debuginfo packages has always been a painful process, at least for me. For Oracle Linux, the debuginfo packages are at http://oss.oracle.com/ol6/debuginfo. As far as I know, this is not widely known (but this could be my ignorance).

But it gets even better! You can use this location as a yum location, and install the debuginfo packages via yum, if you add this piece of text in /etc/yum.repos.d as the file ‘debuginfo.repo’:

[ol6_debuginfo]
name=Oracle Linux 6 debuginfo
baseurl=http://oss.oracle.com/ol6/debuginfo
gpgkey=https://oss.oracle.com/ol6/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
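
With this repository in place, installing a debuginfo package is an ordinary yum operation. A minimal sketch; the kernel debuginfo package matching the running kernel is just an example, pick the package belonging to whatever you want to debug:

# yum install kernel-uek-debuginfo-$(uname -r)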

A word of caution: I’ve found the yum metadata to be outdated with the available packages, which means that I’ve had to install a package “directly” (via rpm) from the site instead of via yum, because the metadata didn’t contain this new package. So if you update your kernel (for example), and can’t find the corresponding debuginfo package, take a look at the package list yourself, it might be there, but not in the metadata at that time.
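In that case, you can fetch the rpm from the same location yourself and install it directly. A sketch; the package name and version here are hypothetical, check the directory listing for the exact file:

# wget http://oss.oracle.com/ol6/debuginfo/kernel-uek-debuginfo-2.6.39-400.17.1.el6uek.x86_64.rpm
# rpm -ivh kernel-uek-debuginfo-2.6.39-400.17.1.el6uek.x86_64.rpm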

This post is about database writer (dbwr, mostly seen as dbw0 nowadays) IO.
The test environment in which I made the measurements in this post: Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM. The test environment is a (VMware Fusion) VM with 2 CPUs.

It might be a good idea to read my previous blog about logwriter IO.

The number of database writers depends on the number of CPUs visible to the instance (when not explicitly set with the DB_WRITER_PROCESSES parameter), and mostly seems to be CEIL(CPU_COUNT/8). There might be other things which influence the number (NUMA comes to mind). In my case, I’ve got 2 CPUs visible, which means I get one database writer (dbw0).
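
This is easy to verify at the operating system level (v11203 is my test instance; the pid shown is obviously just an example):

$ ps -ef | grep ora_dbw | grep -v grep
oracle    2480     1  0 12:39 ?        00:00:01 ora_dbw0_v11203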

Let’s start the investigation! I enable a 10046 level 8 trace on the database writer. When you look at an idle Oracle database, you probably see the database writer process behaving the same as described in the first part of the log writer IO blog, which is sleeping at the wait event ‘rdbms ipc message’. This is the database writer waiting for work:

*** 2013-08-30 12:42:00.291
WAIT #0: nam='rdbms ipc message' ela= 3001013 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859320291675

*** 2013-08-30 12:42:03.292
WAIT #0: nam='rdbms ipc message' ela= 3000870 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859323292672

*** 2013-08-30 12:42:06.293
WAIT #0: nam='rdbms ipc message' ela= 3000806 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859326293581

There’s a description about this wait in the previously mentioned log writer blog. The short version is: ‘rdbms ipc message’ means: waiting to be posted for work.

It helps to make a little table and insert one or a few rows into it, and then commit. This dirties blocks in the buffer cache. After that, issue ‘alter system flush buffer_cache’ to make Oracle write all the dirty blocks currently in the buffer cache.
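
A minimal sketch of these steps in SQL*Plus (the flush requires DBA privileges; the table name and column are arbitrary):

SQL> create table t (id number);
SQL> insert into t values (1);
SQL> commit;
SQL> alter system flush buffer_cache;

This is what is then shown in the database writer trace file: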

WAIT #0: nam='rdbms ipc message' ela= 328100 timeout=300 p2=0 p3=0 obj#=-1 tim=1377868410713860
WAIT #0: nam='db file async I/O submit' ela= 1 requests=14 interrupt=0 timeout=0 obj#=-1 tim=1377868410715372
WAIT #0: nam='db file parallel write' ela= 26 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1377868410715427

The first line is an ‘rdbms ipc message’ with an elapsed time (ela=) shorter than 3 seconds, because we triggered the database writer with the ‘alter system flush buffer_cache’. The database writer is notified to do something, and stops waiting.

Then we see two waits: ‘db file async I/O submit’ and ‘db file parallel write’. ‘db file parallel write’ is a wait I expected: Oracle must write these dirty blocks, I/O costs time, and it’s logical to have a wait event for that. I did see the wait event ‘db file async I/O submit’ in some AWR reports, but haven’t encountered it at the top end of the waits, which means I haven’t looked/needed to look at it yet…

Well, let’s dive one layer deeper, and combine strace with a 10046 level 8 tracefile, using ‘strace -e write=all -e all -p PID’:

semtimedop(851971, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
write(18, "\n*** 2013-08-30 15:32:44.063\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 35 3a 33 32 3a 34 34  2e 30 36 33 0a           15:32:44 .063.    |
lseek(18, 0, SEEK_CUR)                  = 682279
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 32  38 31 32 31 31 20 74 69   ela= 12 81211 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 36 39 35  36 34 30 36 33 36 30 31  13778695 64063601 |
write(19, "J?2-u4~UW1\n", 11)           = 11
 | 00000  4a 3f 32 2d 75 34 7e 55  57 31 0a                 J?2-u4~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
io_submit(139931766530048, 24, {{0x7f44671a7a50, 0, 1, 0, 256}, {0x7f44671a7ca8, 0, 1, 0, 257}, {0x7f44671a7f00, 0, 1, 0, 256}, {0x7f44671a8158, 0, 1, 0, 256}, {0x7f44671a83b0, 0, 1, 0, 256}, {0x7f44671a8608, 0, 1, 0, 257}, {0x7f44671a8860, 0, 1, 0, 257}, {0x7f44671a8ab8, 0, 1, 0, 257}, {0x7f44671a8d10, 0, 1, 0, 257}, {0x7f44671a8f68, 0, 1, 0, 257}, {0x7f44671a91c0, 0, 1, 0, 257}, {0x7f44671a9418, 0, 1, 0, 257}, {0x7f44671a9670, 0, 1, 0, 257}, {0x7f446719e1f8, 0, 1, 0, 256}, {0x7f44671a77f8, 0, 1, 0, 257}, {0x7f44671a75a0, 0, 1, 0, 257}, {0x7f44671a7348, 0, 1, 0, 256}, {0x7f44671a70f0, 0, 1, 0, 256}, {0x7f44671a6e98, 0, 1, 0, 256}, {0x7f44671a6c40, 0, 1, 0, 256}, {0x7f44671a69e8, 0, 1, 0, 256}, {0x7f44671a6790, 0, 1, 0, 256}, {0x7f44671a6538, 0, 1, 0, 256}, {0x7f44671a62e0, 0, 1, 0, 256}}) = 24
lseek(18, 0, SEEK_CUR)                  = 682376
write(18, "WAIT #0: nam='db file async I/O "..., 109) = 109
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 61 73  79 6e 63 20 49 2f 4f 20   file as ync I/O  |
 | 00020  73 75 62 6d 69 74 27 20  65 6c 61 3d 20 31 20 72  submit'  ela= 1 r |
 | 00030  65 71 75 65 73 74 73 3d  32 34 20 69 6e 74 65 72  equests= 24 inter |
 | 00040  72 75 70 74 3d 30 20 74  69 6d 65 6f 75 74 3d 30  rupt=0 t imeout=0 |
 | 00050  20 6f 62 6a 23 3d 2d 31  20 74 69 6d 3d 31 33 37   obj#=-1  tim=137 |
 | 00060  37 38 36 39 35 36 34 30  36 36 31 36 37           78695640 66167    |
write(19, "J?JV~1j1\n", 9)              = 9
 | 00000  4a 3f 4a 56 7e 31 6a 31  0a                       J?JV~1j1 .        |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
io_getevents(139931766530048, 2, 128, {{0x7f44671a7a50, 0x7f44671a7a50, 8192, 0}, {0x7f44671a7ca8, 0x7f44671a7ca8, 8192, 0}, {0x7f44671a7f00, 0x7f44671a7f00, 8192, 0}, {0x7f44671a8158, 0x7f44671a8158, 8192, 0}, {0x7f44671a83b0, 0x7f44671a83b0, 8192, 0}, {0x7f44671a8608, 0x7f44671a8608, 8192, 0}, {0x7f44671a8860, 0x7f44671a8860, 8192, 0}, {0x7f44671a8ab8, 0x7f44671a8ab8, 8192, 0}, {0x7f44671a8d10, 0x7f44671a8d10, 8192, 0}, {0x7f44671a8f68, 0x7f44671a8f68, 8192, 0}, {0x7f44671a91c0, 0x7f44671a91c0, 8192, 0}, {0x7f44671a9418, 0x7f44671a9418, 8192, 0}, {0x7f44671a9670, 0x7f44671a9670, 8192, 0}, {0x7f446719e1f8, 0x7f446719e1f8, 8192, 0}, {0x7f44671a77f8, 0x7f44671a77f8, 8192, 0}, {0x7f44671a75a0, 0x7f44671a75a0, 8192, 0}, {0x7f44671a7348, 0x7f44671a7348, 16384, 0}, {0x7f44671a70f0, 0x7f44671a70f0, 8192, 0}, {0x7f44671a6e98, 0x7f44671a6e98, 8192, 0}, {0x7f44671a6c40, 0x7f44671a6c40, 16384, 0}, {0x7f44671a69e8, 0x7f44671a69e8, 8192, 0}, {0x7f44671a6790, 0x7f44671a6790, 8192, 0}, {0x7f44671a6538, 0x7f44671a6538, 8192, 0}, {0x7f44671a62e0, 0x7f44671a62e0, 8192, 0}}, {600, 0}) = 24
write(18, "WAIT #0: nam='db file parallel w"..., 117) = 117
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 35 33 20 72  rite' el a= 153 r |
 | 00030  65 71 75 65 73 74 73 3d  32 20 69 6e 74 65 72 72  equests= 2 interr |
 | 00040  75 70 74 3d 30 20 74 69  6d 65 6f 75 74 3d 32 31  upt=0 ti meout=21 |
 | 00050  34 37 34 38 33 36 34 37  20 6f 62 6a 23 3d 2d 31  47483647  obj#=-1 |
 | 00060  20 74 69 6d 3d 31 33 37  37 38 36 39 35 36 34 30   tim=137 78695640 |
 | 00070  36 36 35 35 30                                    66550             |

I started this snippet from strace with the semtimedop() system call. If you look at the extended SQL trace (10046 level 8) output, you see that the database writer stopped sleeping on the semaphore after 1.28 seconds. Then some writes are done (to the .trc and .trm files), and a times() system call to fetch usermode and kernelmode times, after which an important system call is shown: io_submit(). io_submit() submits 24 IOs here.
After the io_submit() system call, the process writes a wait event to the trace: ‘db file async I/O submit’, with an elapsed time (ela=) of 1 microsecond.

This looks like Oracle instrumented the submitting of IOs for the database writer! This is different from the logwriter and normal database foreground processes, where the io_submit() system call is not instrumented!

This wait event also shows the number of IOs (requests=24) which are submitted by the io_submit() call.

Then we see the call to reap the writes, io_getevents(). The io_getevents() system call needed at least 2 IOs ready to stop waiting (!). In this case, it was able to reap all 24 submitted IOs (you can count the iocbs in the io_getevents() call, or look at the return code of the call, which is 24). The wait event of the reap is ‘db file parallel write’, as you can see from the write() system call. This is where it gets truly interesting, because when you look at the information in the wait line, you see requests=2, which probably has a relationship with the min_nr parameter in the io_getevents() call. We also see the io_getevents() call has a timeout of 600 seconds. This seems to be a general timeout for asynchronous IO calls done by Oracle on Linux. What stands out for me is that the wait line does not contain any information about how many requests it did reap; two IOs ready would have been enough for the system call to succeed.

Now let’s switch to synchronous IO. I probably have mentioned this a couple of times, but if you don’t do asynchronous IO, you are doing it seriously wrong. Many of the newer IO functions in Oracle have been purpose-built for asynchronous IO; direct path reads is a famous example, for which asynchronous IO gives you the opportunity to overcome the IO latency penalty of every distinct IO.
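
Switching the instance to synchronous IO for this test is done by setting disk_asynch_io to false; since this is a static parameter, it requires an instance restart. A sketch:

SQL> alter system set disk_asynch_io=false scope=spfile;
SQL> shutdown immediate
SQL> startup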

This is what writing by the database writer looks like with the combination of extended SQL trace and strace when disk_asynch_io is set to false, and the database writer is triggered by ‘alter system flush buffer_cache’:

semtimedop(983043, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
write(18, "\n*** 2013-08-30 16:34:19.162\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 36 3a 33 34 3a 31 39  2e 31 36 32 0a           16:34:19 .162.    |
lseek(18, 0, SEEK_CUR)                  = 9915
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 36  35 34 35 38 36 20 74 69   ela= 16 54586 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 37 33 32  35 39 31 36 32 37 36 32  13778732 59162762 |
write(19, "J?e7K6~UW1\n", 11)           = 11
 | 00000  4a 3f 65 37 4b 36 7e 55  57 31 0a                 J?e7K6~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
pwrite(256, "\6\242\0\0\207 \f\0\214\255O\0\0\0\n\6\236\316\0\0\1\0\0\0%1\1\0X\231O\0"..., 8192, 6221258752) = 8192
pwrite(257, "\2\242\0\0\326\22\300\0\214\255O\0\0\0\t\4\10\207\0\0\1\0\16\0\371\4\0\0005\2\32\32"..., 8192, 3414867968) = 8192
pwrite(256, "\2\242\0\0*\2\300\0\205\255O\0\0\0\1\4\236W\0\0\10\0\33\0=\6\0\0\305\2\v\v"..., 8192, 1359298560) = 8192
pwrite(256, "\2\242\0\0d\1\300\0\207\255O\0\0\0\1\4\321e\0\0\4\0\33\0#\5\0\0\35\3\20\20"..., 8192, 1358725120) = 8192
pwrite(257, "&\242\0\0\360\0\300\0\206\255O\0\0\0\1\4[\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1368260608) = 8192
pwrite(257, "&\242\0\0\260\0\300\0\210\255O\0\0\0\1\4\320\351\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367736320) = 8192
pwrite(257, "&\242\0\0\200\0\300\0\214\255O\0\0\0\2\4.)\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367343104) = 8192
pwrite(257, "\6\242\0\0\6\t@\0\210\255O\0\0\0\1\6Z\37\0\0\1\0\0\0N&\1\0\21\254O\0"..., 8192, 749780992) = 8192
pwrite(256, "\6\242\0\0\231\2@\0\206\255O\0\0\0\1\6q\207\0\0\1\0\0\0J\0\0\0\253\254O\0"..., 8192, 734208000) = 8192
lseek(18, 0, SEEK_CUR)                  = 10012
write(18, "WAIT #0: nam='db file parallel w"..., 107) = 107
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 32 34 20 72 65  rite' el a= 24 re |
 | 00030  71 75 65 73 74 73 3d 39  20 69 6e 74 65 72 72 75  quests=9  interru |
 | 00040  70 74 3d 30 20 74 69 6d  65 6f 75 74 3d 30 20 6f  pt=0 tim eout=0 o |
 | 00050  62 6a 23 3d 2d 31 20 74  69 6d 3d 31 33 37 37 38  bj#=-1 t im=13778 |
 | 00060  37 33 32 35 39 31 38 33  32 32 37                 73259183 227      |
write(19, "J?ny4~1h1\n", 10)            = 10
 | 00000  4a 3f 6e 79 34 7e 31 68  31 0a                    J?ny4~1h 1.       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(18, "WAIT #0: nam='db file parallel w"..., 106) = 106
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 20 72 65 71  rite' el a= 1 req |
 | 00030  75 65 73 74 73 3d 39 20  69 6e 74 65 72 72 75 70  uests=9  interrup |
 | 00040  74 3d 30 20 74 69 6d 65  6f 75 74 3d 30 20 6f 62  t=0 time out=0 ob |
 | 00050  6a 23 3d 2d 31 20 74 69  6d 3d 31 33 37 37 38 37  j#=-1 ti m=137787 |
 | 00060  33 32 35 39 31 38 33 34  32 30                    32591834 20       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semctl(983043, 16, SETVAL, 0x1)         = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semtimedop(983043, {{14, -1, 0}}, 1, {1, 320000000}) = -1 EAGAIN (Resource temporarily unavailable)

Mmmh…I see a number of very suspicious things. I started this snippet with the sleep on the semaphore, after which the process writes the wait event information about this. The next nine system calls are pwrite() calls, which is the database writer writing dirty blocks to the datafiles. As you can see, this happens in a sequential way (one by one). After the writes, the process writes the wait event information. We get no ‘db file async I/O submit’ wait, because there is no asynchronous IO; we go directly to the wait ‘db file parallel write’. The number of requests in the wait is 9, which is the same as the number of pwrite() system calls.

When I look further (still in the first wait line!) I see ela=24. This is very suspicious...can I do 9 write calls to the O/S in 24 microseconds? But wait, it gets more weird!

I see yet another wait event ‘db file parallel write’ after the first one, without any pwrite()s happening in between. That is strange. My guess is this wait represents the same 9 pwrite() requests, and it was timed at 1 microsecond (ela=1). I’ve verified this behaviour (two wait events for the same set of writes); it happens at least with every ‘alter system flush buffer_cache’. Also, an elapsed time of 1 microsecond is very suspicious.

Let’s dig one layer deeper and look at the start and end of the wait event timing. In order to do this, I attach gdb to the database writer process (gdb -p PID) and use the following commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) command
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) command
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 4 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

After gdb continues, I issue an ‘alter system flush buffer_cache’, and look at the gdb output:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

I’ve included the two semaphore waits. It is easy to spot that the pwrite64() calls are not timed. That is the reason the elapsed time in the ‘db file parallel write’ waits is so short…they both do not time any actual IO calls! Unless someone can prove me wrong, it looks like the wait event implementation for database writer synchronous IO is seriously broken.

But, as I’ve said: you really should have a platform which has the ability to do asynchronous IO.

Let’s switch to asynchronous IO again, and look at it at the same layer. Here are the gdb commands I’ve used:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660: file io_submit.c, line 23.
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 5 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should see the semtimedop() system calls with accompanying kslwt[be]ctx stepping over the screen every 3 seconds. Insert a few rows into a dummy table and commit, and issue an ‘alter system flush buffer_cache’, and look what happens:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

I included a semaphore wait at the beginning and at the end of this snippet.
The next remarkable thing is that the io_submit() system call does not seem to be timed by a wait event, because it is not enclosed in a kslwtbctx()/kslwtectx() combination. That’s a pity, because the wait event ‘db file async I/O submit’ strongly hints at being the timing of submitting IO to the operating system.
After this, we see kslwtbctx(), io_getevents_0_4(), kslwtectx(). This is Oracle timing the reap of the writes submitted earlier.

To be absolutely sure there isn’t some kind of magic happening with io_submit(), let’s make this call last longer.
In order to do so, you can change the commands with the break on io_submit():

(gdb) commands 4
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end
(gdb) 

The inclusion of ‘shell sleep 1’ means that when the debugger breaks execution of the database writer upon encountering io_submit(), it will execute ‘sleep 1’ (wait/sleep 1 second), after which the ‘c’ (continue) command is executed by the debugger, which continues execution of the database writer process. The result is that, for the database writer process, the io_submit() system call will now take at least 1 second.

However, when done so, a 10046/8 trace (still) looks like this on my test system:

*** 2013-09-06 11:12:16.337
WAIT #0: nam='db file async I/O submit' ela= 470 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458736337487
WAIT #0: nam='db file parallel write' ela= 1386 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458736340152

*** 2013-09-06 11:12:17.344
WAIT #0: nam='db file async I/O submit' ela= 545 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458737344837
WAIT #0: nam='db file parallel write' ela= 1554 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458737346981

What we see is that after making the io_submit() call take at least 1 second, the event ‘db file async I/O submit’ still is timed at approximately 0.5 milliseconds. So at this point it’s safe to say ‘db file async I/O submit’ is NOT the timing of submitting IO requests to the operating system.

Now take a look at the event ‘db file parallel write’ again in the 10046/8 output. We’ve already established that “requests” with the event ‘db file parallel write’ does not reflect the number of IO requests it did reap, but instead is the minimum number of requests that this io_getevents() call should reap from the operating system completion queue before the system call is ready. The actual number of IO requests it did reap is not shown. Another oddity is timeout. The number is recognisable as 2^31-1 by any person working in IT. If the measure of timeout is in hundredths of a second (according to the Oracle reference guide version 11.2), it means the write times out after 21,474,836 seconds. That would mean a write times out after roughly 248 days. That seems a bit lengthy to me.

The nice thing is we can actually see the timeout of the operating system call, io_getevents(). This is extremely easy with strace (the timeout is the last argument, shown enclosed in curly braces). Another way to do so is asking the debugger to print the struct:

(gdb) commands 5
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) c
Continuing.

This is the result:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$1 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

So, actually the timeout of io_getevents() is set to 600 seconds ($1 = {tv_sec = 600, tv_nsec = 0}), and not the 21,474,836 seconds the documentation suggests. Of course this conclusion is about the system call the database writer executes; there is a possibility that the Oracle code retries the call, but I strongly doubt that.

Up to this point the database writer only needed to flush a few blocks to disk. Because of the low number of blocks to write, and because the system on which I perform my tests is a VM for which the IOs are likely to be cached, it means the latency of these IOs is extremely low. What if we limit the number of IOs the database writer can do? This blogpost describes how that can be done on linux.
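
The principle of that blogpost is cgroups blkio throttling. A rough sketch of the idea on OL6 (the cgroup mount point, the device major:minor numbers and the database writer pid are assumptions for my system):

# mkdir /cgroup/blkio/iothrottle
# echo "8:16 1" > /cgroup/blkio/iothrottle/blkio.throttle.read_iops_device
# echo "8:16 1" > /cgroup/blkio/iothrottle/blkio.throttle.write_iops_device
# echo 2480 > /cgroup/blkio/iothrottle/tasks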

At this point there are a few million rows in my test table (I tried generating more dirty blocks for writing, using ‘insert into t select 1+(level-1) from dual connect by level < 100000;’ for example). I set up IO throttling using cgroups, and put the database writer in the task list to be throttled. After that, I started deleting all rows from my dummy table t using ‘delete from t’.

Please mind the database writer still has event 10046/8 active, so I can follow what the database writer is doing at the Oracle level. Also, I modified the breaks in gdb a bit so I can detect the number of IOs io_getevents() was able to reap: I use the ‘finish’ command, which waits until the current function ends, and prints out the return code. io_getevents() returns the number of IOs it was able to reap. The downside of using ‘finish’ is that execution is stopped, and must be continued manually.
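
The modified breakpoint commands look roughly like this (a sketch; the breakpoint number of the io_getevents_0_4 break may differ, in my session it had become breakpoint 6):

(gdb) commands 6
Type commands for breakpoint(s) 6, one per line.
End with a line saying just "end".
>print *timeout
>finish
>end

Nevertheless, this is the result: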

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=153, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=38, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$12 = {tv_sec = 600, tv_nsec = 0}
c
0x0000000002cfa74e in skgfrliopo ()
Value returned is $13 = 38
(gdb) c
Continuing.

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$14 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $15 = 0
(gdb) c
Continuing.

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$16 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $17 = 18
(gdb) 

At the top we see kslwtectx(), and then io_submit() submitting 153 IO requests. After that, we see kslwtbctx()-kslwtectx() for the event ‘db file async I/O submit’.
Then we see the start of a wait (kslwtbctx()) for ‘db file parallel write’, and io_getevents(). This reap request must reap at least 38 IO requests, with a timeout set to 600 seconds. Because IOPS for both read and write are aggressively throttled at 1 IOPS, this request only reaps the minimal number of IO requests: 38.

I was able to see this because of the ‘finish’ command in gdb. It returns me to the calling function (skgfrliopo()), which is an Oracle function.

After continuing the process, we see something unexpected (to me at least): the wait is ended (kslwtectx()), and Oracle tries to reap the leftover IOs using two non-blocking calls (the so-called ‘snacking’). Once both calls are done, either new IOs are submitted (which starts over the process I am describing here), or Oracle registers another ‘db file parallel write’ event and issues a blocking io_getevents() call for a number of IOs (lower than the total number of outstanding requests) to reap. If the number of active IO requests for this process grows, the number of blocking io_getevents() calls inside ‘db file parallel write’ can increase to two.

This is what it looks like when the number of active requests gets higher:

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=685, iocbs=0x7fff872031a0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=114, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 0, tv_nsec = 0}

Summary
The database writer related wait events are kind of strange, and what they actually measure is absolutely not obvious.

The naming ‘db file parallel write’ is only true for asynchronous IO; with synchronous IO, the writes are done sequentially.

The database writer has a unique event named ‘db file async I/O submit’, which, despite its naming, does NOT time submitting asynchronous IOs to the operating system. This event is not shown with synchronous IO, obviously.

The ‘db file parallel write’ event does not time the write calls with synchronous IO, and seems to be printed two times for a set of write IOs. I consider this wait implementation severely broken.

The event ‘db file async I/O submit’ shows the accurate number of IO requests which are submitted to the operating system using io_submit().
The event ‘db file parallel write’ shows the minimal number of IO requests the process needs to reap using io_getevents(), which can be a lower number than submitted with io_submit(). The timeout which is shown in this event does not seem to relate to anything. The actual (operating system) timeout is 600 seconds, not the roughly 248 days the event value suggests.

Because the event ‘db file parallel write’ waits for a lower number of IO requests than were submitted, the database writer tries to reap any leftover IOs using up to two non-blocking io_getevents() calls, after which it’s possible new IO requests are introduced.

At this moment my conclusion is ‘db file parallel write’ is an “indication of database writer busyness”, but absolutely nothing more than that, certainly not a measurement of IO latency.

This post is about log writer (lgwr) IO.
It’s good to point out the environment on which I do my testing:
Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM.

In order to look at what the logwriter is doing, a 10046 trace of the lgwr at level 8 gives an overview.
A way of doing so is using oradebug. Be very careful about using oradebug on production environments, it can/may cause the instance to crash.
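
The operating system pid of the log writer can be found with ps (v11203 is my test instance):

$ ps -ef | grep ora_lgwr | grep -v grep
oracle    2491     1  0 21:05 ?        00:00:01 ora_lgwr_v11203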

This is how I did it:

SYS@v11203 AS SYSDBA> oradebug setospid 2491
Oracle pid: 11, Unix process pid: 2491, image: oracle@ol63-oracle.local (LGWR)
SYS@v11203 AS SYSDBA> oradebug unlimit
Statement processed.
SYS@v11203 AS SYSDBA> oradebug event 10046 trace name context forever, level 8
Statement processed.

Of course 2491 is the Linux process id of the log writer; the “image” part of the oradebug output confirms it is the LGWR.

When you look at the tracefile, the logwriter trace of an idle database will look like this:

*** 2013-08-29 21:10:45.593
WAIT #0: nam='rdbms ipc message' ela= 3000575 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803445592696

*** 2013-08-29 21:10:48.595
WAIT #0: nam='rdbms ipc message' ela= 3000566 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803448594695

*** 2013-08-29 21:10:51.597
WAIT #0: nam='rdbms ipc message' ela= 3000812 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803451596813

*** 2013-08-29 21:10:54.599
WAIT #0: nam='rdbms ipc message' ela= 3000733 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803454598806

*** 2013-08-29 21:10:57.601
WAIT #0: nam='rdbms ipc message' ela= 3000460 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803457600717

*** 2013-08-29 21:11:00.603
WAIT #0: nam='rdbms ipc message' ela= 3000526 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803460602699

*** 2013-08-29 21:11:03.605
WAIT #0: nam='rdbms ipc message' ela= 3000640 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803463604715

*** 2013-08-29 21:11:04.218
WAIT #0: nam='rdbms ipc message' ela= 612031 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803464218054

What we see here is the logwriter spending 3 seconds (ela= 3000812) on the event ‘rdbms ipc message’, simply because the timeout (the timeout column, obviously) was set so (timeout=300; the timeout is set in centiseconds, Oracle’s old “fine-grained” unit of time). To understand what this means from the perspective of the operating system (Linux), there’s a neat “trick”: use strace with verbose output of the writes. That way the extended SQL trace will “document” the system calls. This is what that looks like:

[root@ol63-oracle ~]# strace -e write=all -e all -p 2491
Process 2491 attached - interrupt to quit
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
write(32, "\n*** 2013-08-29 21:35:50.740\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 30  2e 37 34 30 0a           21:35:50 .740.    |
lseek(32, 0, SEEK_CUR)                  = 118213
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 32 30  38 37 34 35 32 20 74 69   ela= 20 87452 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 30 37 34 30 33 37 33  13778049 50740373 |
write(33, "J?Jqz7~UW1\n", 11)           = 11
 | 00000  4a 3f 4a 71 7a 37 7e 55  57 31 0a                 J?Jqz7~U W1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
semtimedop(327683, {{15, -1, 0}}, 1, {0, 910000000}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
write(32, "\n*** 2013-08-29 21:35:51.651\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 31  2e 36 35 31 0a           21:35:51 .651.    |
lseek(32, 0, SEEK_CUR)                  = 118339
write(32, "WAIT #0: nam='rdbms ipc message'"..., 94) = 94
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 39 31  30 36 37 37 20 74 69 6d   ela= 91 0677 tim |
 | 00030  65 6f 75 74 3d 39 31 20  70 32 3d 30 20 70 33 3d  eout=91  p2=0 p3= |
 | 00040  30 20 6f 62 6a 23 3d 2d  31 20 74 69 6d 3d 31 33  0 obj#=- 1 tim=13 |
 | 00050  37 37 38 30 34 39 35 31  36 35 31 37 32 34        77804951 651724   |
write(33, "J?HRU3~UU1\n", 11)           = 11
 | 00000  4a 3f 48 52 55 33 7e 55  55 31 0a                 J?HRU3~U U1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
write(32, "\n*** 2013-08-29 21:35:54.652\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 34  2e 36 35 32 0a           21:35:54 .652.    |
lseek(32, 0, SEEK_CUR)                  = 118463
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 33 30  30 30 33 33 32 20 74 69   ela= 30 00332 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 34 36 35 32 38 32 36  13778049 54652826 |
write(33, "J?MZSB~UW1\n", 11)           = 11
 | 00000  4a 3f 4d 5a 53 42 7e 55  57 31 0a                 J?MZSB~U W1.      |
lseek(32, 0, SEEK_CUR)                  = 118559
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(33, "3?uG~01\n", 8)               = 8
 | 00000  33 3f 75 47 7e 30 31 0a                           3?uG~01.          |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037

First thing to notice (which is not relevant to this text about logwriter IOs, but nice to know): any time there’s a timestamp in the Oracle trace file (like *** 2013-08-29 21:38:58.721), Oracle did issue a times() syscall prior to that. If you want to know what that does, just issue “man 2 times” on the Linux prompt. As a summary: this syscall fetches the CPU time spent in user mode and system/kernel mode, for the current process and for its children.

Back to what this post is about: we see ‘rdbms ipc message’ waits. This is the instrumentation of the time the process actually spends sleeping on a semaphore, with the timeout mostly set to 3 seconds:

semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

Use the manpage of semtimedop to understand all the arguments; the last argument ({3, 0}) is a struct timespec. The first number is seconds, the second number is nanoseconds. In other words: if you see the logwriter spending its time on ‘rdbms ipc message’, it’s waiting for work. The semaphore construction is made so the process can be asked (by another process) to do something, and can react to that immediately.

I guess most people reading this blog, and coming this far, are aware of what the function of the logwriter is (writing the log buffer to disk), and the logwriter does that if it finds something to write, or if some other process asks the logwriter to do so. The process which asks the logwriter to write waits for a response from the logwriter on the event ‘log file sync’.

If you use the extended SQL trace with strace “trick” you see this:

io_submit(140612782833664, 1, {{0x7fe2f5369450, 0, 1, 0, 257}}) = 1
io_getevents(140612782833664, 1, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 1024, 0}}, {600, 0}) = 1
lseek(26, 0, SEEK_CUR)                  = 5190
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 30 37 20  write' e la= 107  |
 | 00030  66 69 6c 65 73 3d 31 20  62 6c 6f 63 6b 73 3d 32  files=1  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 31 20 6f 62 6a 23   request s=1 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 37 30  =-1 tim= 13778070 |
 | 00060  32 34 39 35 31 33 31 38                           24951318          |

What we see is an asynchronously submitted IO request (one IO request; io_submit), which is reaped after that (io_getevents), after which the logwriter writes a wait event to the tracefile indicating this has happened. The logwriter’s ‘log file parallel write’ event statistics tell us it wrote to 1 file (equalling one IO, as can be seen from the IO system calls), with a length of 2 blocks. As we know, the log block size is and has been 512 bytes on Linux, which can be seen in the io_getevents call (1024 equals 2*512). The elapsed time (ela) is 107 microseconds.

As a side note: the log block size can also be 4kB with more recent versions of Oracle, for newer SSD devices.

It would be logical to think in this example the IO time is 107 microseconds. But hold that thought, I’ll come to that. Let’s add a logfile member to every logfile, and switch to a new logfile to make writing to both logfile members happen. Let’s look at the logwriter writing something now:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2
lseek(26, 0, SEEK_CUR)                  = 101469
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 31 37 20  write' e la= 117  |
 | 00030  66 69 6c 65 73 3d 32 20  62 6c 6f 63 6b 73 3d 32  files=2  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 32 20 6f 62 6a 23   request s=2 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 38 36  =-1 tim= 13778086 |
 | 00060  33 32 30 36 37 31 32 35                           32067125          |

We see here that Oracle submits two IOs at the same time, using one call, and reaps both of them (read the manpage of io_getevents; the minimal number of IOs to wait for, min_nr, is set to 2). So indeed ‘log file parallel write’ writes in parallel. The number of blocks is a bit puzzling: in total there are two (log) blocks written (blocks=2), while the earlier logwriter IO to a single online redologfile also had blocks set to 2, which meant it wrote 1024 bytes (2 blocks of 512 bytes); here blocks=2 seems to mean the logwriter wrote 2 blocks in total, spread over the two members.

How would Oracle write when asynchronous IO is not enabled? Well, let’s set disk_asynch_io to false, re-enable the extended SQL trace, and strace it:

pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 1415701504) = 512
pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 6560015360) = 512
lseek(25, 0, SEEK_CUR)                  = 4600
write(25, "WAIT #0: nam='log file parallel "..., 103) = 103
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 35 20 66  write' e la= 15 f |
 | 00030  69 6c 65 73 3d 32 20 62  6c 6f 63 6b 73 3d 32 20  iles=2 b locks=2  |
 | 00040  72 65 71 75 65 73 74 73  3d 32 20 6f 62 6a 23 3d  requests =2 obj#= |
 | 00050  2d 31 20 74 69 6d 3d 31  33 37 37 38 30 39 33 39  -1 tim=1 37780939 |
 | 00060  37 38 38 31 38 31 39                              7881819           |

Ah! So when there’s no asynchronous IO, Oracle issues two pwrite() calls sequentially. Not really parallel!

Now think about what the wait means in both situations. In the first situation, with asynchronous IO (which should be your setting, if you have a modern system) the timing is about two IOs, and it seems to indicate the time both IOs took. Since they have been issued in parallel, if you have enough storage bandwidth left, this seems to indicate the time of the slowest IO. In the second situation, the IOs are issued serially, so the wait time seems to indicate the total latency time of the two IOs. This means you have to divide the wait time by the number of requests to get the average IO time of the IOs. Again: if the IOs are done synchronously instead of asynchronously.
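
To make this concrete with the synchronous strace output above: ela= 15 with requests=2 would mean an average of roughly 7.5 microseconds per write call, which already looks suspiciously fast for writes that are supposed to have gone to disk.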

But is this true?

Let’s dig a little deeper using gdb! First synchronous IO. Attach to the logwriter using gdb (gdb -p PID), and issue the following debugger commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 3 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) 

Now insert something into a table, and commit.

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Hey! Do you see what I see? First I see the logwriter nicely sleeping like a little baby. With that I mean: I see kslwtbctx() to indicate the starting of the timing of a wait event, then semtimedop() is issued for the sake of sleeping, and after semtimedop() has timed out, kslwtectx() to end the event. Then we start writing (which is the function of the log writer; the pwrite64 calls), but there seems to be a kslwtbctx() missing??? Actually, kslwtbctx() is there, but AFTER the pwrite64() calls!

Am I missing something? Is Oracle doing something incredibly smart? Well, let’s start up an extended SQL trace and throttle the IO. By throttling the write IO for the logwriter I can (severely) increase the IO time, so the wait time should increase drastically (see the throttling blogpost on this blog; for the logwriter you must limit write IOPS instead of read IOPS). I’ve done that, and this is what the extended SQL trace shows:

*** 2013-08-29 23:11:54.509
WAIT #0: nam='log file parallel write' ela= 512 files=2 blocks=4 requests=2 obj#=-1 tim=1377810714509122

*** 2013-08-29 23:11:54.509
Warning: log write elapsed time 2004ms, size 1KB

*** 2013-08-29 23:11:56.509
WAIT #0: nam='log file parallel write' ela= 541 files=2 blocks=2 requests=2 obj#=-1 tim=1377810716509653

*** 2013-08-29 23:11:56.509
Warning: log write elapsed time 2000ms, size 0KB

This is what I call being caught with the fingers in the cookie jar! The wait event shows an (extremely) low timing for the IOs (the ela=), and right after that we get a warning that the log writes were terribly slow! Oracle doesn’t time its log write events correctly when doing synchronous IO in this situation!!!

Now let’s have a look at asynchronous IO. I do the same: attach to lgwr with gdb, and now break and continue on the asynchronous IO system calls: io_submit and io_getevents (please mind this requires breaking on io_getevents_0_4 on Linux):

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 3 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should now see a stream of kslwtbctx and kslwtectx coming, which is the timing of the semtimedop() system calls. If you insert something in a table you’ll see:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001750) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40078f8, timeout=0x7fffc4008900) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Here’s something worth seeing! The first line shows the ending of a wait (kslwtectx), then the logwriter submitting an IO request (io_submit). Notice there isn’t any wait event timing started at this point, because kslwtbctx is called AFTER the IOs are submitted. This means that the wait event is the timing of the reap of the submitted IOs, instead of measuring the full IO latency. I know this is and should be a tiny fragment of time, but nevertheless the wait only shows the waiting for the reaping of the IO, not the timing of the entire IO.

Now look at the io_submit/io_getevents strace output again shown earlier in the post:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2

Look at the struct timespec in the io_getevents call: {0, 0}. This is a non-blocking reap call, just to peek at the O/S completion queue. In other words, if the submitted IOs take a long time or are not ready, this call will simply be unsuccessful and execution continues. This looks like the asynchronous read implementation of direct path reads (see this blogpost about non-blocking io_getevents calls).
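
To get a feel for what such a zero-timeout call does, here is a minimal C sketch using libaio. This is my own illustration of the mechanism, not Oracle code; the file name /tmp/testfile and the 4kB read size are just assumptions for the example. With the timespec set to {0, 0}, io_getevents() returns immediately with whatever has completed, possibly nothing:

/* peek.c - submit one asynchronous read and do a non-blocking reap.
   Compile with: gcc peek.c -o peek -laio */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <libaio.h>

int main(void) {
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event events[1];
    struct timespec zero = { 0, 0 };       /* the {0, 0} timespec seen in the strace output */
    void *buf;
    int fd, ret;

    if (posix_memalign(&buf, 4096, 4096))  /* O_DIRECT wants an aligned buffer */
        return 1;
    fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0 || io_setup(128, &ctx) < 0)
        return 1;

    io_prep_pread(&cb, fd, buf, 4096, 0);  /* describe a single 4kB read at offset 0 */
    io_submit(ctx, 1, cbs);                /* hand the request to the kernel */

    /* the peek: ask for one event, but do not wait; ret is 0 if the IO is not ready yet */
    ret = io_getevents(ctx, 1, 1, events, &zero);
    printf("peek reaped %d IO(s)\n", ret);

    io_destroy(ctx);
    return 0;
}

On a fast device the peek will usually already return 1; slow the device down (as described in the throttling blogpost) to see it return 0.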

Let’s throttle the IO of the lgwr, and print out the timespec struct at the break on io_getevents:

(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>print *timeout
>c
>end

Now redo the insertion of data into a table, and commit to put the logwriter to work. Please mind IO is throttled!

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001640) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077e8, timeout=0x7fffc40087f0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$6 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077d8, timeout=0x7fffc40087e0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$7 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Ah! We see the io_submit call issuing two IOs (nr=2), then the start of the timing of the log file parallel write event (kslwtbctx), and then two io_getevents calls: the non-blocking one first, which almost certainly does not succeed (and is NOT shown by strace!), after which a blocking io_getevents call is issued (because the seconds field in the timespec struct is set to 600).

I haven’t got an idea why a non-blocking io_getevents call is issued first, after which a blocking io_getevents call is issued.

Summary
This is the description of ‘log file parallel write’ taken from the Oracle 11.2 Reference guide:

‘Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.’

This is actually a very good description, surprisingly accurate for asynchronous IO.
As we’ve seen, this is not true for synchronous IO: there Oracle just issues the I/Os it needs to do sequentially, and these are not timed at all, yet a wait time is displayed, which is wrong.

It turns out that for asynchronous log writes, the log file parallel write event times the wait for the submitted writes to be flushed to disk. This means the event timing makes the slowest write visible; all the faster writes can take any time below the time of the slowest write. In other words, the wait is timed from the perspective of the process, not as IO latency. It gives a fair hint of the IO time of the slowest IO, so for the way most people use it (measuring the impact of the time the logwriter spent on IO) it is fairly accurate, but for actual IO latency measurement it is weak. If you want to go beyond measuring the time the logwriter spent on IO, you should be very well aware of the implementation. For example, you cannot divide the log file parallel write time by the number of IOs; the resulting number does not reflect anything.

Correction: Chris Buckel pointed out I had ‘db file parallel write’ instead of ‘log file parallel write’ in the summary. This is corrected.
Correction: Kevin Closson pointed out that the log block size can be 4kB too with newer versions of Oracle.

This is part 3 of a number of blogposts about huge Oracle database IO’s.
If you landed on this blogpost and did not read part 1 or part 2, please read part 1 here and part 2 here.

In part 1 I showed how database IOs of a full table scan could be bigger than 1MB by increasing the db_file_multiblock_read_count parameter to a number beyond 1MB expressed in Oracle blocks. These bigger IOs only happen with direct path reads, not with buffered multiblock reads (which are recognisable via the ‘db file scattered read’ wait event). In part 2 I showed that with Oracle Linux 6u4 X64, Oracle 11.2.0.3 64 bits and ASM the maximum size of an IO could go up to 1024 blocks if db_file_multiblock_read_count was set high enough, and that once an IO could be 1024 blocks, Oracle aligns the IO to a block id that is a multiple of 1024.

With all the gained knowledge, the next thing I ask myself is: how did Oracle implement this? What operating system calls are done? How can we do IOs of 8MB whilst the maximum IO size on Linux is supposed to be 1MB? The obvious answer is: take a peek!

In order to see the exact workings, it is once again necessary to reduce the number of IOPS, to see both the io_getevents calls which probe the completion queue (which are not timed/captured by a wait event) and the blocking io_getevents call which is timed with the ‘direct path read’ wait event. To throttle IO (reduce IOPS), see this article.

Let’s focus on how direct path IO works first. Direct path IO (I researched this on Linux with ASM) works roughly in 3 stages:

Before a direct path IO takes place, Oracle does some preparatory work which is necessary to make direct path IO’s happen. Some of the things are issuing an object checkpoint (visible with the ‘enq: KO – fast object checkpoint’ waitevent) to make sure the latest version of the blocks involved are on disk, and a rebuild of an operating system asynchronous IO context (visible via the io_destroy() and io_setup() system calls; this is not captured by an Oracle waitevent. The Oracle waitevent ‘asynch descriptor resize’, which could show up during this phase, is actually about the Oracle asynchronous IO tracking mechanism, which Oracle often refers to as ‘slots’).

a) submit/process phase:
After the above mentioned initial operations, Oracle submits IO’s via the io_submit() system call. Oracle issues two IO requests for two distinct ranges of blocks when it starts a direct path read scan, which is the minimum number of IO requests which are issued in parallel. This number could go up (up to 32 concurrent requests). These are two distinct io_submit() calls.

When Oracle enters the submit/process phase again, it does something interesting: it first processes a single (multiblock) IO request, then issues an IO request via io_submit(), then processes another IO request (if available), then issues an IO request (io_submit()), depending on the number of IOs Oracle needs to submit and needs to process.

It’s important to note that Oracle does not register any waits in this phase. This is all instrumented as cpu time.

b) reap all phase:
Now that Oracle has issued a number of IOs, it does something truly interesting: it issues io_getevents() calls (reap requests) for ALL the IOs this process submitted, with the timeout set to 0 (zero). This does two things: it makes the call non-blocking, and it makes the call invisible to strace (the Linux system call trace utility) if it does not succeed (!!).

This phase is not covered by a waitevent. This means that if IOs are fast, and this phase succeeds in reaping all IOs, there is no wait, and the IOs are instrumented as cpu time.

Also, if all IOs are reaped in this phase, the process can go back to the submit/process phase.

c) wait for IO phase:
If Oracle gets to this phase, the process has done everything it can, and can do nothing but wait for IO to arrive. So that is exactly what the process is going to do: it issues an io_getevents() call with the timeout set to 600 seconds, and waits for any SINGLE IO request to arrive. Here Oracle “truly” waits, and this is covered by a wait event: ‘direct path read’. Once a single IO request is reaped, this step is done, and the process goes back to the submit/process phase.
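
To make the three phases a bit more tangible outside of Oracle, here is a rough C sketch of the same pattern using libaio. This is my interpretation of the observed system calls, not Oracle source code; the file name, the two 1MB requests and the (minimal) error handling are assumptions for the sake of the example:

/* dpr_phases.c - submit, peek for all, then block for a single IO.
   Compile with: gcc dpr_phases.c -o dpr_phases -laio */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <libaio.h>

#define NR 2                    /* minimum of two outstanding requests, as observed */
#define SZ (1024 * 1024)        /* 1MB per operating system IO request */

int main(void) {
    io_context_t ctx = 0;
    struct iocb cb[NR], *cbs[NR];
    struct io_event events[128];
    struct timespec peek = { 0, 0 }, block = { 600, 0 };
    void *buf[NR];
    int fd, i, done = 0;

    fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0 || io_setup(128, &ctx) < 0)
        return 1;

    /* phase a: submit the IO requests */
    for (i = 0; i < NR; i++) {
        if (posix_memalign(&buf[i], 4096, SZ))
            return 1;
        io_prep_pread(&cb[i], fd, buf[i], SZ, (long long)i * SZ);
        cbs[i] = &cb[i];
    }
    io_submit(ctx, NR, cbs);

    while (done < NR) {
        /* phase b: non-blocking peek for ALL outstanding IOs; invisible to strace when nothing is ready */
        int ret = io_getevents(ctx, NR - done, 128, events, &peek);
        if (ret > 0) { done += ret; continue; }

        /* phase c: nothing was ready, so truly wait; any SINGLE completed IO ends the wait */
        ret = io_getevents(ctx, 1, 128, events, &block);
        if (ret > 0) done += ret;
    }
    printf("all %d IOs reaped\n", done);
    io_destroy(ctx);
    return 0;
}

In the Oracle case the submit/process phase of course also submits new IO requests every time it comes around; this sketch only shows the reaping logic.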

Okay, now that we are up to date with how Oracle processes asynchronous direct path IO, let’s dig into how this works with huge IOs!

For investigating this issue, the Linux strace utility actually works best, because it interprets the pointers in the system call and shows the content of the struct, instead of just showing the pointer, which is what gdb does. As always, you need to carefully choose the tools you use for the job. When using strace, the only hint about the phase in which an io_getevents() call was done is the timeout value(!).

This is a snippet of the strace output when Oracle was doing 1020 blocks IO requests:

io_submit(140019063115776, 8, {{0x7f58b89fbfb8, 0, 0, 0, 256}, {0x7f58b89fc210, 0, 0, 0, 257}, {0x7f58b89fc468, 0, 0, 0, 256}, {0x7f58b89fc6c0, 0, 0, 0, 257}, {0x7f58b89fc918, 0, 0, 0, 256}, {0x7f58b89fcb70, 0, 0, 0, 257}, {0x7f58b89fcdc8, 0, 0, 0, 256}, {0x7f58b89fd020, 0, 0, 0, 257}}) = 8
io_getevents(140019063115776, 8, 128, {{0x7f58b89f6db0, 0x7f58b89f6db0, 1048576, 0}, {0x7f58b89f7008, 0x7f58b89f7008, 1048576, 0}, {0x7f58b89f66a8, 0x7f58b89f66a8, 1048576, 0}, {0x7f58b89f6450, 0x7f58b89f6450, 1048576, 0}, {0x7f58b89f61f8, 0x7f58b89f61f8, 1048576, 0}, {0x7f58b89f7710, 0x7f58b89f7710, 1048576, 0}, {0x7f58b89f7968, 0x7f58b89f7968, 1048576, 0}, {0x7f58b89f7bc0, 0x7f58b89f7bc0, 1015808, 0}}, {600, 0}) = 8

This is a combination of an io_submit() call and an io_getevents(). There is a lot to say about this.
First, Oracle did not magically increase the maximum IO size of the operating system. As we can see, Oracle submits multiple IO requests in a single io_submit() call. If we peek at the size field in the io_getevents() call, we see the size mostly being 1048576. In other words: the individual IO requests are 1MB, as we expect; Oracle submitted multiple operating system IO requests in order to satisfy the single (Oracle level) request of 1020 blocks.

If we add up the IO sizes of the individual requests we get: (1048576*7)+1015808=8355840 bytes. 8355840/8192 (=blocksize) = 1020. So the combined size of the operating system IO requests equals the 1020 Oracle blocks of the Oracle IO request.

Next, if we look at the timeout “field” of io_getevents(), it reads {600, 0}. This means this was a blocking call, with the timeout set to 600 seconds. We can deduce from that field that this was an io_getevents() call from the wait for IO phase.

I urge you to take an even closer look. If you look at the pointer to the iocb (the large hexadecimal number), you will see that the IOs submitted are different from the IOs reaped with io_getevents(). This is Oracle taking advantage of asynchronous IO.
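
To sketch how multiple 1MB operating system IOs can make up one logical 8MB read from a programmer’s point of view, here is a minimal libaio example: eight iocb’s describing adjacent 1MB ranges, handed to the kernel in a single io_submit() call and reaped with one blocking io_getevents() call. Again, this is my own illustration with a made-up file name and simplified buffer handling, not Oracle code:

/* big_io.c - one logical 8MB read built from eight 1MB AIO requests.
   Compile with: gcc big_io.c -o big_io -laio */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <libaio.h>

#define NR 8
#define SZ (1024 * 1024)

int main(void) {
    io_context_t ctx = 0;
    struct iocb cb[NR], *cbs[NR];
    struct io_event events[NR];
    struct timespec block = { 600, 0 };    /* the blocking 600 second timeout seen in the strace output */
    long long total = 0;
    void *buf;
    int fd, i, ret;

    if (posix_memalign(&buf, 4096, NR * SZ))
        return 1;
    fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0 || io_setup(128, &ctx) < 0)
        return 1;

    /* eight adjacent 1MB requests together describe a single contiguous 8MB range */
    for (i = 0; i < NR; i++) {
        io_prep_pread(&cb[i], fd, (char *)buf + (long long)i * SZ, SZ, (long long)i * SZ);
        cbs[i] = &cb[i];
    }
    io_submit(ctx, NR, cbs);               /* one system call, eight IO requests */

    ret = io_getevents(ctx, NR, NR, events, &block);    /* wait until all eight are done */
    for (i = 0; i < ret; i++)
        total += events[i].res;            /* res holds the number of bytes transferred per request */
    printf("reaped %d IOs, %lld bytes in total\n", ret, total);

    io_destroy(ctx);
    return 0;
}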

I’ve throttled IO to 1 IOPS, so I know all the requests will go up to the wait for IO phase.

Let’s take a look at a strace output if IO is not throttled to 1 IOPS:

io_getevents(140136915922944, 28, 128, {{0x7f7429499710, 0x7f7429499710, 1048576, 0}, {0x7f7429499e18, 0x7f7429499e18, 1048576, 0}, {0x7f7429498900, 0x7f7429498900, 1048576, 0}, {0x7f742949c5f0, 0x7f742949c5f0, 1048576, 0}, {0x7f742949ba38, 0x7f742949ba38, 1048576, 0}, {0x7f7429498db0, 0x7f7429498db0, 1048576, 0}, {0x7f742949ac28, 0x7f742949ac28, 1048576, 0}, {0x7f742949a2c8, 0x7f742949a2c8, 1048576, 0}, {0x7f742949b7e0, 0x7f742949b7e0, 1048576, 0}, {0x7f742949a070, 0x7f742949a070, 1048576, 0}, {0x7f7429499260, 0x7f7429499260, 1048576, 0}, {0x7f74294981f8, 0x7f74294981f8, 1048576, 0}, {0x7f742949c848, 0x7f742949c848, 1048576, 0}, {0x7f742949a778, 0x7f742949a778, 1048576, 0}}, {0, 0}) = 14

The timeout struct is set to {0, 0}. So this is an io_getevents() call from the ‘reap all’ phase. As we can see, the total number of IOs in this case is 28 (that is a little less than 28MB!). Another interesting thing is that this call returned 14 IO requests, but still succeeded, despite the minimum number of ready requests being set to 28. My guess is that another io_getevents() call or calls reaped the other IOs, and this request completed the 28 requests, which made this call successful.

Because this is in the ‘reap all’ phase, there is no wait event. So this huge amount of IOs being ready is not reflected in the wait events, nor can it be derived from them for this specific request. Well, you could calculate the number of IOs needed to read this segment by looking at DBA_EXTENTS, calculate the total time taken for reading the segment, and calculate an average, but that’s all.

Because the ‘direct path read’ wait event only shows up when the process needs to wait, which will be for a single request, a wait will always cover at most 1024 blocks, because that is the maximum Oracle IO size (which I’ve found up to now). If all the IOs are already ready before the ‘wait for IO’ phase, it means more IO requests than a single Oracle IO completed without being visible in a wait.

This is a small note describing how Oracle implemented the situation which is covered by the db file parallel read wait event. This event happens if Oracle knows it must read multiple blocks which are not adjacent (thus from different random files and locations), and cannot continue processing with the result of a single block. In other words: if it cannot process something after reading a single block (otherwise Oracle will read a single block, visible via the wait ‘db file sequential read’).

This is how it shows up if you enable sql trace:

WAIT #139658359011504: nam='db file parallel read' ela= 69997140 files=1 blocks=70 requests=70 obj#=76227 tim=1373200106669612

What this shows is Oracle issuing a request for 70 blocks. This has an interesting implication for monitoring and looking at the time spent on the event ‘db file parallel read’: if you don’t know the number of blocks for which an IO request is issued, it’s impossible to say something about the time. So just monitoring or looking at the cumulative time spent in ‘db file parallel read’ doesn’t say anything about IO latency, it only tells you what the Oracle process spent its time on.

How did Oracle implement this? This is obviously port specific (which means the implementation will be different on different operating systems). My test environment is Oracle Linux 6u3 X64, Oracle 11.2.0.3 64 bit using ASM.

This is how the requests are asynchronously submitted to the operating system:

Breakpoint 2, io_submit (ctx=0x7f04c0c8d000, nr=70, iocbs=0x7fff86d965f0) at io_submit.c:23

So all the IO requests are submitted in one go!

The submission of the IO requests is not covered by a wait, which makes sense, because the io_submit call is (supposed to be) non-blocking.

Next Oracle waits for ALL the IOs to finish, covered by the ‘db file parallel read’ wait event:

Breakpoint 13, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, io_getevents_0_4 (ctx=0x7f04c0c8d000, min_nr=70, nr=128, events=0x7fff86d9c798, timeout=0x7fff86d9d7a0) at io_getevents.c:46
Breakpoint 14, 0x0000000008fa1334 in kslwtectx ()

We see kslwtbctx, which indicates the start of a waitevent, then an io_getevents call with the following arguments:

‘ctx’ is the IO context. This is how Linux keeps track of a group of asynchronous IO requests.
‘min_nr’ is the minimum number of requests that must be ready for this call to succeed; the call will wait until ‘timeout’ is reached. io_getevents will just peek if ‘timeout’ is set to zero.
‘nr’ is the maximum number of requests that io_getevents will “fetch”.
‘events’ is a struct (table-like structure) that holds the information about the iocb’s (IO control blocks) of the requests.
‘timeout’ is a struct that sets the timeout of this call. For Oracle IO I see the timeout being 600 (seconds) most of the time.

The last line shows kslwtectx, indicating that the wait has ended.
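
To make the parameter semantics concrete, here is a minimal C sketch that mirrors the observed call: a handful of single-block reads at scattered offsets are submitted in one go, after which one io_getevents() call is issued with ‘min_nr’ equal to the number of submitted requests, so the call only returns when ALL of them are done. The file name, the offsets and the number of requests are assumptions of mine, not taken from Oracle:

/* parallel_read.c - wait for ALL submitted IOs, the pattern behind 'db file parallel read'.
   Compile with: gcc parallel_read.c -o parallel_read -laio */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <libaio.h>

#define NR 4                               /* number of non-adjacent blocks to read */
#define BLKSZ 8192                         /* 8kB, a common Oracle block size */

int main(void) {
    io_context_t ctx = 0;                  /* 'ctx': the IO context the kernel tracks these requests in */
    struct iocb cb[NR], *cbs[NR];
    struct io_event events[128];           /* 'events': filled with one entry per completed request */
    struct timespec timeout = { 600, 0 };  /* 'timeout': give up waiting after 600 seconds */
    long long offsets[NR] = { 0, 81920, 417792, 1048576 };    /* scattered, non-adjacent offsets */
    void *buf[NR];
    int fd, i, ret;

    fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0 || io_setup(128, &ctx) < 0)
        return 1;

    for (i = 0; i < NR; i++) {
        if (posix_memalign(&buf[i], 4096, BLKSZ))
            return 1;
        io_prep_pread(&cb[i], fd, buf[i], BLKSZ, offsets[i]);
        cbs[i] = &cb[i];
    }
    io_submit(ctx, NR, cbs);               /* all requests submitted in one go */

    /* 'min_nr' = NR: do not return before ALL submitted IOs are ready; 'nr' = 128: reap at most 128 */
    ret = io_getevents(ctx, NR, 128, events, &timeout);
    printf("reaped %d of %d IOs\n", ret, NR);

    io_destroy(ctx);
    return 0;
}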

This is part 2 of a number of blogposts about huge Oracle database IO’s.
If you landed on this blogpost and did not read part 1, please read part 1 here.

In part 1 I showed how database IOs of a full table scan could be bigger than 1MB by increasing the db_file_multiblock_read_count parameter to a number beyond 1MB expressed in Oracle blocks. These bigger IOs only happen with direct path reads, not with buffered multiblock reads.

But how much bigger can these IOs be? In part 1 I showed Oracle IOs of 1020 blocks. Is that the limit? Table T2 in part 1 had a maximum extent size of 1024 blocks, which meant that 1020 blocks is the biggest IO possible from that table.

For the sake of this investigation I created a much bigger table to get larger extents:

 EXTENT_ID     BLOCKS
---------- ----------
	 0	    8
...
       263	 8192
       264	 8192
       265	 8192

266 rows selected.

So with my new, big table I got extents of up to 8192 blocks. Let’s do the same test on this table as I did in the previous post, which means setting db_file_multiblock_read_count to 2000 with SQL trace at level 8 and throttled IOPS to get waits!

The IOs against the “small” extents of 8 blocks and 128 blocks are done exactly the same as in the previous post. The IOs done against the extents with a size of 1024 blocks are the same too, simply because reading the 1020 blocks in one IO is as efficient as it can be! But let’s take a look at the point where the extents become 8192 blocks:

WAIT #140281084141152: nam='direct path read' ela= 7996378 file number=5 first dba=174468 block cnt=1020 obj#=76227 tim=1373104660882677
WAIT #140281084141152: nam='direct path read' ela= 7995820 file number=5 first dba=175489 block cnt=1023 obj#=76227 tim=1373104668882345
WAIT #140281084141152: nam='direct path read' ela= 7996472 file number=5 first dba=176520 block cnt=632 obj#=76227 tim=1373104676882677
WAIT #140281084141152: nam='direct path read' ela= 7998049 file number=5 first dba=177152 block cnt=1024 obj#=76227 tim=1373104684883512
WAIT #140281084141152: nam='direct path read' ela= 7995472 file number=5 first dba=178176 block cnt=1024 obj#=76227 tim=1373104692882932
WAIT #140281084141152: nam='direct path read' ela= 7993677 file number=5 first dba=179200 block cnt=1024 obj#=76227 tim=1373104700880106
WAIT #140281084141152: nam='direct path read' ela= 7996969 file number=5 first dba=180224 block cnt=1024 obj#=76227 tim=1373104708880891
WAIT #140281084141152: nam='direct path read' ela= 5998630 file number=5 first dba=181248 block cnt=1024 obj#=76227 tim=1373104714882889
WAIT #140281084141152: nam='direct path read' ela= 9996459 file number=5 first dba=182272 block cnt=1024 obj#=76227 tim=1373104724882545

I’ve included one IO of 1020 blocks, after which Oracle issues an IO request of 1023 blocks, then 632 blocks, and then it issues a number of requests for 1024 blocks.

It seems that the maximum number of blocks of a multiblock IO request done by Oracle "unbuffered" or "direct path" is 1024 blocks, which is 8MB! Please mind I have tested this ONLY with Oracle version 11.2.0.3 64 bits on Linux X64 on ASM, with a database and segment blocksize of 8kB.

There’s another peculiarity I want to introduce. Let’s take a look at the dba and block cnt only, together with block_id and blocks (number of blocks in the extent, taken from dba_extents):

                                  BLOCK_ID	  BLOCKS         EXTENT_ID
dba=174468 block cnt=1020           174464          1024               197
dba=175489 block cnt=1023           175488          1024               198
dba=176520 block cnt=632            176512          8192               199
dba=177152 block cnt=1024
dba=178176 block cnt=1024
dba=179200 block cnt=1024
dba=180224 block cnt=1024
dba=181248 block cnt=1024
dba=182272 block cnt=1024
dba=183296 block cnt=1024 

There are two anomalies here: the first one is at dba 175489. It seems that dba 175489/extent id 198 has only one BMB, so Oracle can read 1023 blocks of the extent, instead of the 1020 blocks in the other extents with the size of 1024 blocks.

The second anomaly is much more interesting: extent id 199 starts at block id 176512. This is an extent consisting of 8192 blocks. The first IO starts at block id 176520. This means there are 176520-176512=8 blocks not read, which are probably BMB’s. But this is in line with my expectation. What is odd is that the first IO issued for datablocks in this extent of 8192 blocks is “only” 632 blocks!

After the oddly sized IO request at the beginning of the extent (632), the other IO requests are the full 1024 blocks, which probably is the maximum size of direct path Oracle IO requests.

I wondered about the oddly sized request at the beginning. I looked at the dba’s of the requests, and noticed the dba’s of the 1024-block sized requests looked vaguely familiar. Then I spotted the familiarity and a possible reason for the “small” request at the beginning of the extent:

SYS@v11203 AS SYSDBA> select mod(176520,1024) from dual;

MOD(176520,1024)
----------------
	     392

SYS@v11203 AS SYSDBA> select mod(177152,1024) from dual;

MOD(177152,1024)
----------------
	       0

SYS@v11203 AS SYSDBA> select mod(178176,1024) from dual;

MOD(178176,1024)
----------------
	       0

ONLY when the direct path read requests can potentially read “the full” 1024 blocks in one IO, Oracle first issues a smaller IO to align the later requests to a block id that is a multiple of 1024, after which it issues IO requests of 1024 blocks. That also explains the size of the first, odd IO: 1024-mod(176520,1024) = 1024-392 = 632 blocks, which makes the next request start at block id 177152, a multiple of 1024.

I don’t know why the alignment takes place. I can imagine this alignment can line up with ASM, thus avoiding a single Oracle IO request touching more than one ASM allocation unit. But I am open to other people’s ideas and suggestions about potential reasons!

Stay tuned for further updates about extra huge database IOs!

It’s been a while since I presented the first incarnation of my ‘about multiblock reads’ presentation. When I did this at the UKOUG TEBS conference in Birmingham in 2011, Christian Antognini chaired my presentation. After my presentation Christian showed me it’s possible to set the parameter ‘db_file_multiblock_read_count’ higher than 1MB/db_block_size (which is 128 if your blocksize is 8kB), and you could benefit from it if your hardware is sufficient. In fact, Christian showed me AWR reports (could also be statspack reports, not sure) which showed the benefit.

My understanding of the parameter db_file_multiblock_read_count at the time was:

The maximum value is the operating system’s maximum I/O size expressed as Oracle blocks ((max I/O size)/DB_BLOCK_SIZE). If you set this parameter to a value greater than the maximum, Oracle uses the maximum.

This is taken from the Reference guide; the same text is there for both 10.2 and 11.2, and even for 12.1.

This is interesting, because the statement from the Oracle reference guide conflicts with Christian’s statement and measurements!

My test environment: Oracle 11.2.0.3 on Linux OL6u3 X64 with ASM, on my Macbook with SSD.

I’ve got a table created in a bigfile ASSM tablespace with system allocation type. This means objects get extents allocated which increase progressively in size, in steps:

 EXTENT_ID     BLOCKS
---------- ----------
	 0	    8
...
	15	    8
	16	  128
...
	78	  128
	79	 1024
...
	91	 1024

First let’s take a look at a full scan done “buffered”, that is, with the blocks read into the buffercache, which is visible via the “db file scattered read” wait event. This is the “traditional” multiblock read.

I’ve flushed the buffer cache, set db_file_multiblock_read_count to 128, set event 10949 at level 1 and enabled SQL trace at level 8.
Event 10949 is needed to get a buffered multiblock read of a table which the Oracle database deems ‘big’ (in order to avoid non-buffered, alias direct path, reads).
Another thing is that both ‘db file sequential read’ and ‘db file scattered read’ will always show up (even if the action (the physical read) is very fast), so there’s no need to change things (slow down I/O) to be sure we get these waits.

The statement executed is ‘select count(*) from t2’.

The first section is the extents of size 8, which show the typical alternation of 8 and 7 blocks being read (after the initial smaller sizes), which is because of the space bitmap block in every other extent:

WAIT #140150651756760: nam='db file sequential read' ela= 272 file#=5 block#=23938 blocks=1 obj#=75579 tim=1372970940124856
WAIT #140150651756760: nam='db file scattered read' ela= 2162 file#=5 block#=23939 blocks=5 obj#=75579 tim=1372970940127442
WAIT #140150651756760: nam='db file scattered read' ela= 725 file#=5 block#=23944 blocks=8 obj#=75579 tim=1372970940128478
WAIT #140150651756760: nam='db file scattered read' ela= 144 file#=5 block#=23953 blocks=7 obj#=75579 tim=1372970940128908
WAIT #140150651756760: nam='db file scattered read' ela= 253 file#=5 block#=23960 blocks=8 obj#=75579 tim=1372970940129636

Then the extents of 128 blocks, which show 126 blocks being read because of 2 space bitmap blocks at the beginning of every extent of 128 blocks:

WAIT #140150651756760: nam='db file scattered read' ela= 3741 file#=5 block#=24066 blocks=126 obj#=75579 tim=1372970940139996
WAIT #140150651756760: nam='db file scattered read' ela= 955 file#=5 block#=24194 blocks=126 obj#=75579 tim=1372970940142894
WAIT #140150651756760: nam='db file scattered read' ela= 2916 file#=5 block#=24322 blocks=126 obj#=75579 tim=1372970940148287
WAIT #140150651756760: nam='db file scattered read' ela= 2111 file#=5 block#=24450 blocks=126 obj#=75579 tim=1372970940152527
WAIT #140150651756760: nam='db file scattered read' ela= 2952 file#=5 block#=24578 blocks=126 obj#=75579 tim=1372970940157492

Then the extents of 1024 blocks, which start off with 4 space bitmap blocks at the beginning of every extent:

WAIT #140150651756760: nam='db file scattered read' ela= 3079 file#=5 block#=36228 blocks=128 obj#=75579 tim=1372970941036955
WAIT #140150651756760: nam='db file scattered read' ela= 2737 file#=5 block#=36356 blocks=128 obj#=75579 tim=1372970941048199
WAIT #140150651756760: nam='db file scattered read' ela= 2531 file#=5 block#=36484 blocks=128 obj#=75579 tim=1372970941059473
WAIT #140150651756760: nam='db file scattered read' ela= 2912 file#=5 block#=36612 blocks=128 obj#=75579 tim=1372970941071015
WAIT #140150651756760: nam='db file scattered read' ela= 2768 file#=5 block#=36740 blocks=128 obj#=75579 tim=1372970941082467
WAIT #140150651756760: nam='db file scattered read' ela= 287 file#=5 block#=36868 blocks=128 obj#=75579 tim=1372970941091378
WAIT #140150651756760: nam='db file scattered read' ela= 2831 file#=5 block#=36996 blocks=128 obj#=75579 tim=1372970941102969
WAIT #140150651756760: nam='db file scattered read' ela= 536 file#=5 block#=37124 blocks=124 obj#=75579 tim=1372970941111811

We see 7 lines of 128 blocks read (7*128=896) and one line of 124 blocks (896+124=1020) which is 4 blocks short of the extent size of 1024, which are the space bitmap blocks.

Okay, nothing new here.

If I raise the parameter db_file_multiblock_read_count to 2000, and reissue the same ‘select count(*) from t2’, I get exactly the same waits. In other words: nothing different.

So the conclusion is: when using buffered reads, the statement about 1MB being the limit of the size of a single multiblock read indeed is true.

Now let’s move on to direct path reads. In short, when a table is deemed ‘big’ by the database, meaning the number of blocks occupied by the table is more than five times _small_table_threshold (which is 2% of the buffercache by default), the database foregrounds will not use the buffercache to read the table blocks into, but read them into the process’ PGA, to avoid discarding useful blocks in the cache and to avoid the overhead of concurrency control (latching). There are more criteria, but that’s outside the scope of this blogpost.

First let’s trace a session doing direct path reads. This is easier than the above situation where I forced buffered reads, because I can do the same as above but skip setting event 10949. On the other hand it’s harder, because in order to see all the IOs using tracing (which means ‘get a wait for all IOs’), I need to reduce the number of IOPS (I use 1 IOPS), because otherwise some IOs will not get a wait, so not all IOs will be visible in the tracefile (look at this article to learn how to reduce IOPS with Linux, or see me present my About multiblock reads presentation to learn why some waits for direct path read IOs might not show up).

So, here are snippets from the tracefile of a ‘select count(*) from t2’, with db_file_multiblock_read_count set to 128 and SQL trace at level 8:

First we see the reads in the small extents (8 blocks) from BMB to BMB, not stopping at the extent border:

WAIT #140402935204600: nam='direct path read' ela= 996117 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1373055770533891
WAIT #140402935204600: nam='direct path read' ela= 997014 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1373055771531222
WAIT #140402935204600: nam='direct path read' ela= 1000275 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1373055772531716
WAIT #140402935204600: nam='direct path read' ela= 999111 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1373055773531249
WAIT #140402935204600: nam='direct path read' ela= 999628 file number=5 first dba=24001 block cnt=15 obj#=75579 tim=1373055774531093

Then we see the reads of 126 blocks, of the extents of 128 blocks:

WAIT #140402935204600: nam='direct path read' ela= 1004881 file number=5 first dba=24066 block cnt=126 obj#=75579 tim=1373055778541646
WAIT #140402935204600: nam='direct path read' ela= 989042 file number=5 first dba=24194 block cnt=126 obj#=75579 tim=1373055779531980
WAIT #140402935204600: nam='direct path read' ela= 1002835 file number=5 first dba=24322 block cnt=126 obj#=75579 tim=1373055780541509
WAIT #140402935204600: nam='direct path read' ela= 989545 file number=5 first dba=24450 block cnt=126 obj#=75579 tim=1373055781531621
WAIT #140402935204600: nam='direct path read' ela= 1007730 file number=5 first dba=24578 block cnt=126 obj#=75579 tim=1373055782539986

And then the reads of the extents of the size of 1024 blocks:

WAIT #140402935204600: nam='direct path read' ela= 2000086 file number=5 first dba=36228 block cnt=124 obj#=75579 tim=1373055875542615
WAIT #140402935204600: nam='direct path read' ela= 1995550 file number=5 first dba=36480 block cnt=128 obj#=75579 tim=1373055877539193
WAIT #140402935204600: nam='direct path read' ela= 1999327 file number=5 first dba=36736 block cnt=128 obj#=75579 tim=1373055879539584
WAIT #140402935204600: nam='direct path read' ela= 1998883 file number=5 first dba=36992 block cnt=128 obj#=75579 tim=1373055881539584
WAIT #140402935204600: nam='direct path read' ela= 1998753 file number=5 first dba=37252 block cnt=124 obj#=75579 tim=1373055883539416

If you take a close look, you’ll see IO’s are still missing (just add cnt to dba and see if it fits the next dba, with the exception of BMB’s, which are not read either, so they leave a gap). Nevertheless: this is as we expect.

Now let’s do the same, but set db_file_multiblock_read_count to 2000! In order to keep this short: the IO’s against the 8-block and 128-block extents are identical. Let’s look at the IOs against the 1024-block extents:

WAIT #140550163651400: nam='direct path read' ela= 7862301 file number=5 first dba=32132 block cnt=1020 obj#=75579 tim=1373057289713781
WAIT #140550163651400: nam='direct path read' ela= 7996018 file number=5 first dba=33156 block cnt=1020 obj#=75579 tim=1373057297713687
WAIT #140550163651400: nam='direct path read' ela= 7996272 file number=5 first dba=34180 block cnt=1020 obj#=75579 tim=1373057305713868
WAIT #140550163651400: nam='direct path read' ela= 7995719 file number=5 first dba=35204 block cnt=1020 obj#=75579 tim=1373057313713547
WAIT #140550163651400: nam='direct path read' ela= 7996310 file number=5 first dba=36228 block cnt=1020 obj#=75579 tim=1373057321713593

Wow! These are HUGE IOs! The entire 1024-block extent is read in one (Oracle) IO, excluding the 4 BMB’s.

So this means Christian was right, and the Oracle documentation is wrong about the maximum size of db_file_multiblock_read_count for direct path reads.

There’s much more to be said about huge IOs, but I’ll leave that for another blogpost.

This post is just a reminder to myself how to install the vmware (fusion) stuff in a freshly installed OL6u4 installation. To be honest, it isn’t entirely “fresh”: I’ve ‘yum update’ed the installation, and I’ve installed the 11gR2 preinstall package (yum install oracle-rdbms-server-11gR2-preinstall) already.

1. Reboot
After the yum update and yum install of the oracle database preinstall package everything is updated, but not yet activated (new kernel). This means you need to reboot your VM first.

2. Install the kernel headers
Now the whole system is installed and set up, we can *almost* install the vmware tools. However: the vmware tools need the kernel headers. If you look for the kernel headers, you *think* you see them:

# rpm -qa | grep headers
glibc-headers-2.12-1.107.el6.x86_64
kernel-uek-headers-2.6.32-400.29.1.el6uek.x86_64

But this is not the case. In fact, the system uses a different kernel version (2.6.39 in my case, which is the UEK2 kernel). But more importantly: the kernel headers are not in the kernel-uek-headers package (!!). In order to install the kernel headers which the vmware tools driver installation needs, you need to install the kernel-uek-devel package:

# yum install kernel-uek-devel

3. Make the vmware tools available to the VM, and extract the installer package
Virtual Machine>Install VMWare Tools.
Inside the linux VM:
# mount /dev/dvd /mnt -t iso9660
# cd /tmp
# tar xzvf /mnt/VMwareTools-9.2.2-893683.tar.gz

4. Install the vmware tools
# cd vmware-tools-distrib
# ./vmware-install.pl -default
(the vmware tools installation starts, and it will choose all the default values)

For the installation of the grid and database software I’ve got the files set up/extracted outside of my VM, and use the ‘shared folders’ feature of vmware (fusion) to mount these in my VM:

1. Enable shared folders in the vmware fusion hypervisor, and share a directory
Virtual Machine>Settings>Sharing
enable shared folders (off by default)
Press ‘+’, this will make a window available to select a folder. Select the folder in which the installation media (database and grid directory for Oracle 11.2.0.3 installation) is present, and click ‘add’.
(optional: set permissions in the shared folders window to ‘readonly’)
Take note of the name field, the name of the shared folder is what you need in the next step.
At this point the folder you selected is shared with the VM, and the sharing is enabled.

2. Make the shared folders visible in the Oracle Linux VM
The last step is to create a mountpoint for the shared folder (vmware creates its own directory ‘/mnt/hgfs’ but I like to create my own):

# mkdir /mnt/oracle11203

And make the mount permanent in fstab:

# vi /etc/fstab
.host:/oracle11203        /mnt/oracle11203      vmhgfs  defaults        0 0

All is setup now! Use the newly created static mount:

# mount -a

And make sure it’s there:

# mount 
...
.host:/ on /mnt/hgfs type vmhgfs (rw,ttl=1)
.host:/oracle11203 on /mnt/oracle11203 type vmhgfs (rw,ttl=1)

When an Oracle process starts executing a query and needs to do a full segment scan, it needs to decide whether it’s going to use ‘blockmode’ or ‘offloaded mode’. Blockmode is the normal way of working on non-Exadata Oracle databases: blocks are read from disk and processed by the Oracle foreground process, either “cached” (read from disk and put in the database buffercache) or “direct” (read from disk and put in the process’ PGA). In offloaded mode, part of the execution is done by the cell server.

The code layer where the Oracle database process initiates the offloading is ‘kcfis’; an educated guess is that it stands for Kernel Cache File Intelligent Storage. Does a “normal”, alias non-Exadata, database ever use the ‘kcfis’ layer? My first guess would be ‘no’, but we all know guessing takes you nowhere (right?). Let’s see if a “normal” database uses the ‘kcfis’ functions on a Linux x64 (OL 6.3) system with Oracle 11.2.0.3 64 bit using ASM.

The only way I am aware of to profile functions in the Oracle executable is using ‘gdb’ and breaking on those functions:
(the process id shown below ought to be the process id of the Oracle database foreground process you are going to execute the query in. Do not, I repeat: do not do this with other processes, especially ones that do important tasks!)

# gdb -p 42
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
...
(gdb) rbreak ^kcfis.*
Breakpoint 1 at 0x2204094
<function, no debug info> kcfis_get_sched_delay;
Breakpoint 2 at 0x220410a
<function, no debug info> kcfis_capability_tab_get;
Breakpoint 3 at 0x2204150
<function, no debug info> kcfis_can_session_migrate;
Breakpoint 4 at 0x2204190
<function, no debug info> kcfis_fob_storage_attr;
Breakpoint 5 at 0x22041d0
<function, no debug info> kcfis_init_resource_limit;
Breakpoint 6 at 0x22041f2
<function, no debug info> kcfis_tablespace_is_on_sage;
...
(gdb) c
Continuing.

Okay, we’ve got the debugger set up, now let’s execute a simple query (doing a full scan) to see if kcfis is touched on a “normal” server or not!

TS@v11203 > select count(*) from t2;

I am on OS-X (Apple) using iTerm, and I see the tab of the gdb session turning red: gdb generated some output on the screen!

(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) 

So, we hit a kcfis function! Let me state upfront that I do not have source code of any kind, so my statements about what a function does are actually guesses. Anyway: from the name of the function (kcfis_tablespace_is_on_sage) it looks like a function in the kcfis layer which determines if a tablespace is on an Exadata storage server (Exadata’s internal name is/was ‘sage’). Nice, so we hit the layer that determines whether the tablespace is on Exadata.

Let’s continue the profiling with gdb:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000220410a in kcfis_capability_tab_get ()
(gdb) c
Continuing.

Breakpoint 6, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
(gdb) c
Continuing.

We see another call to the function ‘kcfis_tablespace_is_on_sage’, and a call to ‘kcfis_capability_tab_get’. The latter function probably probes the table (but it could be the tablespace) to get its capabilities. This could be the function which checks the requirements for hybrid columnar compression, but I am not sure about that.

At what point during the processing of the full segment scan does the kcfis_tablespace_is_on_sage call occur? One way of investigating this is profiling some functions we know a (direct path) full scan uses, and seeing where kcfis_tablespace_is_on_sage kicks in. When the buffer cache is flushed prior to executing a full scan, and the SQL is made unique so it has to be parsed, the following sequence of events happens:

- Parse
- Execute
- A ‘SQL*Net message to client’ wait
- A ‘db file sequential read’ wait (for reading the segment header)
- Potentially an ‘asynch descriptor resize’ wait
- The full scan is done asynchronously (potentially revealing some ‘direct path read’ waits)

So if we profile on start and end of a wait (kslwtbctx and kslwtectx), a single block read (pread64), AIO (io_submit and io_getevents_0_4) and of course kcfis_tablespace_is_on_sage, we should be able to see that:

Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 1, pread64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 4, 0x00000000022041f2 in kcfis_tablespace_is_on_sage ()
Breakpoint 2, 0x0000000008f9a652 in kslwtbctx ()
Breakpoint 3, 0x0000000008fa1334 in kslwtectx ()
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 5, io_submit (ctx=0x7fb42f475000, nr=1, iocbs=0x7fffb4c5e100) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
Breakpoint 6, io_getevents_0_4 (ctx=0x7fb42f475000, min_nr=2, nr=128, events=0x7fffb4c66768, timeout=0x7fffb4c67770) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

So what do we see here?
Row 1 : Here the wait for ‘SQL*Net message from client’ ends.
Row 2-3 : This is the ‘SQL*Net message to client’ wait.
Row 5-6-7 : This is the begin of a wait, the pread64 for reading the segment header, and the end of the wait.
Row 8&9 : The session probes for Exadata.
Row 10-11 : Start and end of a wait, ‘asynch descriptor resize’ (verified by looking in the trace file).
Row 12- : Here the full scan takes off.

So… a query starts (I’ve made it unique by adding a dummy hint, so it’s parsed again), the SQL*Net roundtrip occurs, the segment header is read, then the process checks if it’s on Exadata, which it isn’t here, and starts an asynchronous full scan.

So let’s repeat this check on a database which is on Exadata! In order to do so, we need to be aware that several calls are not done on Exadata: pread64, io_submit and io_getevents_0_4, because we do not do local IO, but issue IO over InfiniBand. So I broke on kslwtbctx, kslwtectx and kcfis_tablespace_is_on_sage. This is how that looks:

(gdb) c
Continuing.
Breakpoint 2, 0x000000000905cf62 in kslwtectx ()
Breakpoint 3, 0x0000000002230366 in kcfis_tablespace_is_on_sage ()

Mmmh, this is different. The ‘SQL*Net message from client’ wait ends, and before any other wait occurs, the existence of Exadata is checked. This is different from the non-Exadata case. Let’s take a look at the backtrace of the break on kcfis_tablespace_is_on_sage:

(gdb) bt
#0  0x0000000002230366 in kcfis_tablespace_is_on_sage ()
#1  0x0000000001402eb0 in qesSageEnabled ()
#2  0x0000000009234d20 in kkdlgstd ()
#3  0x0000000001a6111d in kkmfcblo ()
#4  0x000000000922f76d in kkmpfcbk ()
#5  0x000000000942e538 in qcsprfro ()
#6  0x000000000942de29 in qcsprfro_tree ()
#7  0x000000000942de6e in qcsprfro_tree ()
#8  0x0000000002dd80c5 in qcspafq ()
#9  0x0000000002dd51d9 in qcspqbDescendents ()
#10 0x0000000002dd91e4 in qcspqb ()
#11 0x0000000001a6b2be in kkmdrv ()
#12 0x0000000002584c76 in opiSem ()
#13 0x000000000258ac8b in opiDeferredSem ()
#14 0x000000000257dc32 in opitca ()
#15 0x0000000001ec3d7d in kksFullTypeCheck ()
#16 0x00000000092a7256 in rpiswu2 ()
#17 0x0000000001eca977 in kksLoadChild ()
#18 0x0000000009298448 in kxsGetRuntimeLock ()
#19 0x000000000925aa34 in kksfbc ()
#20 0x000000000925556e in kkspsc0 ()
#21 0x0000000009254e6a in kksParseCursor ()
#22 0x000000000933cb25 in opiosq0 ()
#23 0x0000000001b82a46 in kpooprx ()
#24 0x0000000001b80d2c in kpoal8 ()
#25 0x00000000091fb8b8 in opiodr ()
#26 0x000000000939e696 in ttcpip ()
#27 0x000000000180f011 in opitsk ()
#28 0x0000000001813c0a in opiino ()
#29 0x00000000091fb8b8 in opiodr ()
#30 0x000000000180af4c in opidrv ()
#31 0x0000000001e0a77b in sou2o ()
#32 0x0000000000a0cc05 in opimai_real ()
#33 0x0000000001e106ec in ssthrdmain ()
#34 0x0000000000a0cb71 in main ()

What is interesting to see is line 23 of the backtrace, frame number 21: kksParseCursor. So the detection of storage servers actually already happens during parsing, not at the point where the full segment scan starts and the decision whether or not to do a smartscan has to be made.

This is just a very small post on how to watch the progress of the “CopyBack” state of a freshly inserted disk in an Exadata “Computing” (database) node. A disk failed in the (LSI hardware) RAID5 set, and the hotspare disk was automatically used. The failed disk was replaced, and we are now waiting for the intermediate “CopyBack” phase to finish.

The current state of the disks is visible using the following command:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Slot Number: 0
Firmware state: Copyback
Device Firmware Level: 0D70
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: 0B70

But what is the progress? Some googling came up with (almost) the answer (http://trac.camsentry.com/wordpress/tag/ldinfo/); I modified it a tiny bit to make it refresh:

while $(true); do /opt/MegaRAID/MegaCli/MegaCli64 adpeventlog getlatest 200 -f ~/adpeventlog.txt a0; awk '/^Time/{TIME=$0};/Seconds/{SECS=$5}/^Event Desc/{printf("%25.25s %5.5s %s\n",TIME,SECS,$0);TIME=" ";SECS=""}' ~/adpeventlog.txt|grep -v fan|tac; sleep 5; done

This will refresh every 5 seconds and show the progress of the CopyBack, and the state changes after it, once it gets that far.
