Oracle 11.2 and the direct path read event

In my previous post I touched the topic of a “new” codepath (codepath meaning “way of utilising the operating system”) for full segment scans. A full segment scan means scanning a segment (I use “segment” because it could also be a partition of a table or an index), which is commonly visible in an explain plan (but not restricted to it) as the rowsources “TABLE ACCESS FULL”, “FAST FULL INDEX SCAN” and “BITMAP FULL SCAN”.

Look at my presentation “About multiblock reads” to see how and when direct path reads kick in, and what the difference between buffered multiblock reads and direct path reads is. Most notably, Oracle has released very little information about asynchronous direct path reads.

This post is about the implementation of direct path reads in the Oracle executable in 11.2.0.3 and the OS implementation on Linux.
For completeness: this is Oracle Linux 6.3 x86_64 with kernel 2.6.39-300.17.3.el6uek.x86_64, and database version 11.2.0.3 64 bit, without any PSU/CPU applied.
The database uses two ASM disks and the clusterware in single machine mode.
The system runs in a VM on my MacBook (OS X 10.8.2, SSD) with VMware Fusion Professional 5.0.2.

The database action traced with gdb is a simple “select count(*) from t2”, where t2 is a table without any indexes or constraints, and big enough to make the database engine choose a direct path read. Oracle can choose from a number of new, largely undocumented mechanisms specifically for optimising direct path reads. The mechanism observed in this blogpost is truly asynchronous reading, which means the foreground process can issue several IO read requests, reap the ones that have finished, and increase the number of simultaneous IO requests in flight.

1. Observing the IO requests: getting the picture straight

First we restore the setup from my previous post, which means a normal database process (using sqlplus), and a root session with gdb attached to this process using “gdb -p <PID>”.

Now set breakpoints for the operating system IO calls (with the intention to see io_submit() and io_getevents()) to show the IO calls of the oracle foreground process, and let the oracle process continue processing by entering “c” (continue) in the debugger:

(gdb) rbreak ^io_.*
Breakpoint 1 at 0xa08c20
 io_prep_pwritev;
Note: breakpoint 1 also set at pc 0xa08c20.
...
Breakpoint 45 at 0x7f2e71b1dc0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 1-45, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now issue the full segment scan in the Oracle session, and look at the output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()

The first thing that comes to mind is: why are there TWO calls for io_submit (io_submit@plt () and io_submit () from /lib64/libaio.so.1), but only one call for io_getevents (io_getevents@plt ())?

This is where ‘plt’ needs an introduction. ‘plt’ means procedure linkage table. It is a construction used for functions in an executable that uses dynamically linked shared libraries, which is exactly what the ‘oracle’ executable does. If you look at the address of the @plt call, you see it lies inside the oracle executable (see /proc/<pid>/maps), while io_submit () from /lib64/libaio.so.1 is, as the line indicates, inside the dynamically linked library libaio.so.1 (which can also be seen in /proc/<pid>/maps).
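
As a small aside: the question “which object does this address belong to?” can also be answered programmatically via the dynamic loader, which is what we just did by hand with /proc/<pid>/maps. The sketch below is my own (the file name whereis.c is hypothetical), not anything the oracle executable does; compile it with “gcc -o whereis whereis.c -ldl”:

#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    /* resolve io_submit through the dynamic loader... */
    void *handle = dlopen("libaio.so.1", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    void *addr = dlsym(handle, "io_submit");   /* address inside the shared library */

    /* ...and ask which loaded object that address belongs to */
    Dl_info info;
    if (addr && dladdr(addr, &info))
        printf("io_submit is at %p in %s\n", addr, info.dli_fname);

    dlclose(handle);
    return 0;
}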

Back to io_getevents. We see only the io_getevents@plt call, which means that either the executable fakes the system call (?), or that we are missing something. This led me to investigate the libaio library itself. This can be done using ‘nm -D’ (nm lists the symbols in an object file or library):

# nm -D /lib64/libaio.so.1
0000000000000000 A LIBAIO_0.1
0000000000000000 A LIBAIO_0.4
0000003f38200710 T io_cancel
0000003f38200670 T io_cancel
0000003f38200690 T io_destroy
0000003f382006a0 T io_getevents
0000003f38200620 T io_getevents
0000003f38200570 T io_queue_init
0000003f38200590 T io_queue_release
0000003f382005b0 T io_queue_run
0000003f382005a0 T io_queue_wait
0000003f382006e0 T io_queue_wait
0000003f38200680 T io_setup
0000003f38200660 T io_submit

Aha! There are TWO io_getevents entries here! Now THAT is interesting!

Let’s see which io_getevents call or calls are used by the oracle executable:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) rbreak ^io_get.*
Breakpoint 46 at 0x3f382006a0
 io_getevents;
...
Breakpoint 53 at 0xa09030
 io_getevents@plt;

And list the breakpoints:

(gdb) info break
Num     Type           Disp Enb Address            What
46      breakpoint     keep y   0x0000003f382006a0 io_getevents
47      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
48      breakpoint     keep y   0x0000003f382006a0 io_getevents
49      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
50      breakpoint     keep y   0x0000003f382006a0 io_getevents
51      breakpoint     keep y   0x0000003f382006a0 io_getevents
52      breakpoint     keep y   0x0000003f382006a0 io_getevents
53      breakpoint     keep y   0x0000000000a09030 io_getevents@plt

So, all the io_getevents breakpoints in libaio are set at address 0x0000003f382006a0! If we look back at the libaio contents, we see there are two io_getevents functions, at the addresses 0x0000003f382006a0 and 0x0000003f38200620.
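
A plausible explanation for the duplicate entry is symbol versioning: the “A LIBAIO_0.1” and “A LIBAIO_0.4” lines at the top of the nm output are version definitions, which suggests libaio exports both an old and a current ABI of io_getevents. This is my interpretation, not anything Oracle has documented; the sketch below (my own, with the version names taken from the nm output above) fetches both versioned symbols with dlvsym. Compile with “gcc -o versions versions.c -ldl”:

#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *handle = dlopen("libaio.so.1", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    /* fetch both versioned io_getevents symbols by name and version */
    void *old_abi = dlvsym(handle, "io_getevents", "LIBAIO_0.1");
    void *new_abi = dlvsym(handle, "io_getevents", "LIBAIO_0.4");
    printf("io_getevents@LIBAIO_0.1: %p\n", old_abi);
    printf("io_getevents@LIBAIO_0.4: %p\n", new_abi);

    dlclose(handle);
    return 0;
}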

Let’s make gdb break and continue at the second io_getevents address too!

(gdb) break *0x0000003f38200620
Breakpoint 54 at 0x3f38200620
(gdb) commands
Type commands for breakpoint(s) 54, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now run the full segment scan in the sqlplus session again, and see what gdb shows:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1

Okay! We got the system call back!

So, we now have breakpoints set in gdb which are able to show us the asynchronous operating system calls Oracle uses during a scan, which are io_submit and io_getevents. Oracle uses more calls for managing asynchronous IO, but for clarity I focus on io_submit and io_getevents.
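
For readers who have never used these calls directly, below is a minimal, self-contained sketch of the libaio sequence the breakpoints expose: create an IO context, prepare and submit read requests, and reap the completions. This is my own illustration and not Oracle's code (Oracle additionally opens its datafiles and ASM devices with O_DIRECT and manages its own IO slots); compile with “gcc -std=gnu99 -o aioread aioread.c -laio”:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <libaio.h>

#define NREADS 2
#define BLKSZ  8192

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(128, &ctx) != 0) {               /* create the kernel aio context */
        fprintf(stderr, "io_setup failed\n");
        return 1;
    }

    struct iocb cb[NREADS], *cbs[NREADS];
    for (int i = 0; i < NREADS; i++) {
        /* fill an iocb per read request (this is what io_prep_pread does) */
        io_prep_pread(&cb[i], fd, malloc(BLKSZ), BLKSZ, (long long)i * BLKSZ);
        cbs[i] = &cb[i];
    }

    if (io_submit(ctx, NREADS, cbs) != NREADS) {  /* hand the read requests to the kernel */
        fprintf(stderr, "io_submit failed\n");
        return 1;
    }

    struct io_event events[NREADS];
    int got = io_getevents(ctx, NREADS, NREADS, events, NULL);  /* reap the completions */
    printf("reaped %d IOs\n", got);

    io_destroy(ctx);
    close(fd);
    return 0;
}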

2. Adding wait events to the direct path read profile

At this point I can add two breakpoints for the wait instrumentation functions, kslwtbctx (enter wait) and kslwtectx (end wait), using:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 55 at 0x8f9a652
 kslwtbctx;
Breakpoint 56 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 55-56, one per line.
End with a line saying just "end".
>silent
>f
>c
>end

If I rerun the full scan, I can see there are no waits around the IO related calls: just io_submit, then io_getevents, which I can tell because I don’t see kslwtbctx and kslwtectx during the IO related calls. I see them AFTER the IO is done, which is normal, expected behaviour. Please mind the cause of this is that this is a VM where the disks are on an SSD, and most probably the disks (which are files with VMware Fusion) are in the cache of my MacBook, which makes IO very fast.

As introduced in the previous post, the last function call seen is the entering of a wait using ‘kslwtbctx’, which here is the wait event ‘SQL*Net message from client’: the foreground process waiting for user input to process:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

Now let’s see what happens if I slow the IOs down using cgroups.

I slowed down both my (ASM-)disks to 1 IOPS, and ran the scan in sqlplus again. This is a snippet of the gdb output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008fa1334 in kslwtectx ()

Actually, the above is the typical pattern I see, provided IO isn’t too fast: an IO submitted via io_submit, four io_getevents calls, the registration of a wait event, then another io_getevents call, and the ending of the wait event. (In real life it all depends on the speed of the IO beneath the operating system, and the output mostly is a mixture of this and the previous gdb output.)

The pattern can take multiple forms. For example, when the scan starts, after the initialisation of the asynchronous IO dependencies (like an IO context on the OS side and the IO slots in Oracle), the first calls are two io_submit calls, in order to get the minimal number of asynchronous IOs (2) in flight.

Apparently, the four io_getevents calls after io_submit are non-blocking. This is my assumption, simply because I see these calls scrolling over my screen until the wait event is registered, and the scrolling stops at another io_getevents call. But can we be sure?

3. Digging deeper with debuginfo

In order to look deeper into the io_getevents calls, specifically the arguments of the io_getevents function, the debuginfo package of libaio needs to be installed (the debuginfo packages are available at http://oss.oracle.com/ol6/debuginfo). After that, the gdb session needs to be restarted in order to pick up the debug information.

This is what the gdb output looks like with the libaio debuginfo installed:

#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 io_submit (ctx=0x7ff6ceb2c000, nr=1, iocbs=0x7fff2a4e09e0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e9048, timeout=0x7fff2a4ea050) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ec128, timeout=0x7fff2a4ed130) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e8e48, timeout=0x7fff2a4e9e50) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ebf28, timeout=0x7fff2a4ecf30) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=1, nr=128, events=0x7fff2a4e8e38, timeout=0x7fff2a4e9e40) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008fa1334 in kslwtectx ()

Now we see a LOT more interesting information!

First of all, all the IO calls involved are done in the same “aio_context”: ctx=0x7ff6ceb2c000. This means that any asynchronous IO done during this query reports its readiness here, and thus that IOs can be reaped out of order(!).

Next is min_nr. The non-instrumented io_getevents calls have min_nr set to 2; the io_getevents call which is instrumented has min_nr set to 1.

nr is the maximum number of IOs that can be reaped by this call in this aio_context. As far as I know, the number of IO slots cannot be higher than 32, which means that, if I am right, there will never be more than 32 outstanding requests.

The last io_getevents argument is timeout, which is really interesting for understanding and verifying the behaviour I described. This value is a pointer to a struct which holds the timeout specification. In order to know the actual timeout value, we need to print the contents of that struct.
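
To see what the timeout argument means in practice, here is a tiny sketch (my own, not Oracle code; the file name polltest.c is hypothetical): with a zeroed struct timespec, io_getevents() merely polls the completion queue and returns immediately, even if fewer than min_nr completions are available; with a NULL timeout it would block until min_nr completions arrive, and with a filled-in timespec it blocks for at most that long. Compile with “gcc -std=gnu99 -o polltest polltest.c -laio”:

#include <stdio.h>
#include <time.h>
#include <libaio.h>

int main(void)
{
    io_context_t ctx = 0;
    struct io_event events[128];
    struct timespec zero = { 0, 0 };

    if (io_setup(128, &ctx) != 0) {
        fprintf(stderr, "io_setup failed\n");
        return 1;
    }

    /* nothing has been submitted into this context, yet this call returns
       0 right away instead of waiting for min_nr=2 completions, because
       the timeout is { tv_sec = 0, tv_nsec = 0 } */
    int got = io_getevents(ctx, 2, 128, events, &zero);
    printf("io_getevents returned %d without blocking\n", got);

    io_destroy(ctx);
    return 0;
}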

This is where gdb, once the debuginfo is available, can help. Let’s modify the breakpoints the following way:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) break *0x0000003f38200620
Breakpoint 38 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 38, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 39 at 0x8f9a652
 kslwtbctx;
Breakpoint 40 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 39-40, one per line.
End with a line saying just "end".
>silent
>f
>end
(gdb) rbreak ^io_.*
Breakpoint 41 at 0x3f38200570: file io_queue_init.c, line 28.
int io_queue_init(int, io_context_t *);
...
Breakpoint 74 at 0x7f549fd44c0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 41-74, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Notice the “print *timeout” with breakpoint 38 (*0x0000003f38200620), in other words the “correct” io_getevents call.
Using “print *timeout” we let gdb follow the pointer and print the contents of the struct. Let’s see if my assumption about the non-instrumented io_getevents calls is correct:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  io_submit (ctx=0x7f54a1956000, nr=1, iocbs=0x7fff78f059b0) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0dfd8, timeout=0x7fff78f0efe0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$21 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f110b8, timeout=0x7fff78f120c0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0ddd8, timeout=0x7fff78f0ede0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f10eb8, timeout=0x7fff78f11ec0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=1, nr=128, events=0x7fff78f0ddc8, timeout=0x7fff78f0edd0) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 600, tv_nsec = 0}

Yes! What I saw scrolling over my screen in gdb is now proven by printing the timeout struct!! The non-instrumented io_getevents calls just peek in the completion queue (ctx) four times to see if min_nr (3 in this case) IO requests are ready. Then a wait event is registered and another io_getevents call is issued against the same completion queue (ctx), this time asking for just 1 ready IO request, but now willing to wait for 600 seconds.
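
Put together, the observed behaviour maps onto a sketch like the one below. This is my own reconstruction, not Oracle’s actual code, and it leaves out the O_DIRECT opening and buffer alignment Oracle does: submit a batch of reads, poll the completion queue a few times with a zero timeout outside of any wait event, and only then block in io_getevents with min_nr=1 and a 600 second timeout while the wait event is active. Compile with “gcc -std=gnu99 -o dprpattern dprpattern.c -laio”:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <libaio.h>

#define NREADS 2
#define BLKSZ  1048576                          /* 1MB read requests */

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(128, &ctx) != 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cb[NREADS], *cbs[NREADS];
    for (int i = 0; i < NREADS; i++) {
        io_prep_pread(&cb[i], fd, malloc(BLKSZ), BLKSZ, (long long)i * BLKSZ);
        cbs[i] = &cb[i];
    }
    io_submit(ctx, NREADS, cbs);                /* issue the batch of reads */

    struct io_event events[128];
    struct timespec zero = { 0, 0 }, longwait = { 600, 0 };
    int reaped = 0, n;

    /* the non-blocking polls: "are all NREADS IOs done yet?" - these are the
       io_getevents calls that are NOT covered by a wait event */
    for (int attempt = 0; attempt < 4 && reaped < NREADS; attempt++) {
        n = io_getevents(ctx, NREADS, 128, events, &zero);
        if (n > 0) reaped += n;
    }

    /* not everything was ready: this is where a wait event gets registered and
       the process blocks for at least 1 completion, for at most 600 seconds */
    if (reaped < NREADS) {
        n = io_getevents(ctx, 1, 128, events, &longwait);
        if (n > 0) reaped += n;
    }
    printf("reaped %d of %d IOs\n", reaped, NREADS);

    io_destroy(ctx);
    close(fd);
    return 0;
}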

4. Conclusion

The reason for this post is to show the essence of how direct path reads work. There is much more to be said on this subject, especially about the adaptive or “auto tune” mechanism, which scales up the number of asynchronous IOs in flight.
