Extra huge database IOs, part 3

This is part 3 of a series of blogposts about huge Oracle database IOs.
If you landed on this blogpost and did not read part 1 or part 2, please read part 1 here and part 2 here.

In part 1 I showed how database IOs of a full table scan could be bigger than 1MB by increasing the db_file_multiblock_read_count parameter to a number beyond 1MB expressed in Oracle blocks. These bigger IOs only happen with direct path reads, not with buffered multiblock reads (which are recognisable via the ‘db file scattered read’ wait event). In part 2 I showed that with Oracle Linux 6u4 X64, Oracle 11.2.0.3 64 bits and ASM, the maximal size of an IO could go up to 1024 blocks if db_file_multiblock_read_count was set high enough, and that once an IO could be 1024 blocks, Oracle aligns the IO to a block id that is a multiple of 1024.

With all the gained knowledge, the next thing I ask myself is: how did Oracle implement this? What operating system calls are done? How can we do IOs of 8MB whilst the maximal IO size of Linux is supposed to be 1MB? The obvious answer is to take a peek!

In order to see exactly how this works, it is once again necessary to reduce the number of IOPS, to see both the io_getevents calls which probe the completion queue (which are not timed/captured by a wait event) and the blocking io_getevents call which is timed with the ‘direct path read’ wait event. To throttle IO (reduce IOPS), see this article.

Let’s focus on how direct path IO works first. Direct path IO (on Linux with ASM, which is what I researched) works roughly in three stages:

Before a direct path IO takes place, Oracle does a few things which are necessary to make direct path IOs happen. Among these are issuing an object checkpoint (visible with the ‘enq: KO – fast object checkpoint’ waitevent) to make sure the latest version of the blocks involved are on disk, and a rebuild of the operating system asynchronous IO context (visible via the io_destroy() and io_setup() system calls; this is not captured by an Oracle waitevent). The Oracle waitevent ‘asynch descriptor resize’, which could show up during this phase, actually is about Oracle’s own asynchronous IO tracking mechanism, which Oracle often refers to as ‘slots’.
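
To illustrate what that context rebuild looks like at the operating system level, here is a minimal sketch using the libaio wrappers for the io_setup() and io_destroy() system calls. The value of 128 in-flight requests is an assumption for this sketch, not a value taken from Oracle.

#include <libaio.h>
#include <stdio.h>

int main(void)
{
    io_context_t ctx = 0;   /* must be zeroed before io_setup() */
    int ret;

    /* create an asynchronous IO context; 128 is an illustrative value */
    ret = io_setup(128, &ctx);
    if (ret < 0) {
        fprintf(stderr, "io_setup failed: %d\n", ret);
        return 1;
    }

    /* ... io_submit() and io_getevents() would happen here ... */

    /* tear the context down again, as seen when Oracle rebuilds it */
    io_destroy(ctx);
    return 0;
}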

a) submit/process phase:
After the above mentioned initial operations, Oracle submits IOs via the io_submit() system call. When it starts a direct path read scan, Oracle issues two IO requests for two distinct ranges of blocks, which is the minimal number of IO requests issued in parallel; this number can go up to 32 concurrent requests. These are two distinct io_submit() calls.

When Oracle enters the submit/process phase again, it does something interesting: it first processes a single (multiblock) IO request, then issues an IO request via io_submit(), then processes another IO request (if available), then issues an IO request (io_submit()), depending on the number of IOs Oracle needs to submit and needs to process.

It’s important to note that Oracle did not instrument this phase with any wait events. This is all instrumented as cpu time.
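
As an illustration of the submit side, here is a minimal sketch using libaio; the file descriptor, buffers, offsets and the 1MB request size are assumptions for the sketch, not Oracle’s actual code.

#include <libaio.h>

/* submit asynchronous reads for two distinct block ranges, using two
   distinct io_submit() calls, like Oracle does at the start of a scan */
int submit_two_ranges(io_context_t ctx, int fd,
                      void *buf1, long long offset1,
                      void *buf2, long long offset2)
{
    struct iocb cb1, cb2;
    struct iocb *first[1]  = { &cb1 };
    struct iocb *second[1] = { &cb2 };
    int ret;

    /* the buffers must stay valid until the IOs are reaped */
    io_prep_pread(&cb1, fd, buf1, 1048576, offset1);  /* first block range  */
    io_prep_pread(&cb2, fd, buf2, 1048576, offset2);  /* second block range */

    ret = io_submit(ctx, 1, first);
    if (ret < 0)
        return ret;
    return io_submit(ctx, 1, second);
}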

b) reap all phase:
Now that Oracle has issued a number of IOs, it does something truly interesting: it issues reap requests (visible as io_getevents() calls) for ALL the IOs this process submitted, with the timeout set to 0 (zero). This does two things: it makes the call non-blocking, and it makes the call invisible to strace (the Linux system call trace utility) if it does not succeed (!!).

This phase is not covered by a waitevent. This means that if IOs are fast and this phase succeeds in reaping all IOs, there is no wait, and the time spent here is instrumented as cpu time.

Also, if all IOs are reaped in this phase, it can go back to the submit/process phase.
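
A minimal sketch of this probe, again using libaio: a non-blocking io_getevents() call with the timeout set to {0, 0} and the minimum number of events set to everything this process submitted. The array size of 128 mirrors what shows up in the strace output further down, but is an assumption here.

#include <libaio.h>
#include <time.h>

/* try to reap ALL submitted IOs without blocking; the return value is the
   number of completed requests, which may be less than nr_submitted */
int reap_all(io_context_t ctx, long nr_submitted)
{
    struct io_event events[128];
    struct timespec timeout = { 0, 0 };   /* timeout 0: do not block */

    return io_getevents(ctx, nr_submitted, 128, events, &timeout);
}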

c) wait for IO phase:
If Oracle reaches this phase, the process has done everything it can, and can do nothing but wait for IO to arrive. So that is exactly what the process is going to do: it issues an io_getevents() call with the timeout set to 600 seconds, and waits for any SINGLE IO request to arrive. Here Oracle “truly” waits, and this is covered by a wait event: ‘direct path read’. Once a single IO request is reaped, this step is done, and the process goes back to the submit/process phase.
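
A minimal sketch of that blocking call: timeout {600, 0} and a minimum of 1 event, so the call returns as soon as any single IO request completes (or after 600 seconds). The array size of 128 is again an assumption.

#include <libaio.h>
#include <time.h>

/* block for up to 600 seconds until at least one IO request completes;
   this is the call that is timed as the 'direct path read' wait event */
int wait_for_single_io(io_context_t ctx)
{
    struct io_event events[128];
    struct timespec timeout = { 600, 0 };   /* {600, 0}, as seen in strace */

    return io_getevents(ctx, 1, 128, events, &timeout);
}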

Okay, now that we are up to date with how Oracle processes asynchronous direct path IO, let’s dig into how this works with huge IOs!

For investigating this issue, the Linux strace utility actually works best, because it interprets the pointers in the system call and shows the contents of the structs, instead of just showing the pointer, which is what gdb does. As always, you need to carefully choose the tools you use for the job. When using strace, the only hint about the phase in which an io_getevents() call was done is the timeout value(!).
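
For reference, a strace invocation along these lines limits the output to the two calls of interest (the process id 1234 is just an example; -tt adds timestamps, which helps to correlate the calls with wait events):

strace -tt -e trace=io_submit,io_getevents -p 1234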

This is a snippet of the strace output when Oracle was doing 1020 blocks IO requests:

io_submit(140019063115776, 8, {{0x7f58b89fbfb8, 0, 0, 0, 256}, {0x7f58b89fc210, 0, 0, 0, 257}, {0x7f58b89fc468, 0, 0, 0, 256}, {0x7f58b89fc6c0, 0, 0, 0, 257}, {0x7f58b89fc918, 0, 0, 0, 256}, {0x7f58b89fcb70, 0, 0, 0, 257}, {0x7f58b89fcdc8, 0, 0, 0, 256}, {0x7f58b89fd020, 0, 0, 0, 257}}) = 8
io_getevents(140019063115776, 8, 128, {{0x7f58b89f6db0, 0x7f58b89f6db0, 1048576, 0}, {0x7f58b89f7008, 0x7f58b89f7008, 1048576, 0}, {0x7f58b89f66a8, 0x7f58b89f66a8, 1048576, 0}, {0x7f58b89f6450, 0x7f58b89f6450, 1048576, 0}, {0x7f58b89f61f8, 0x7f58b89f61f8, 1048576, 0}, {0x7f58b89f7710, 0x7f58b89f7710, 1048576, 0}, {0x7f58b89f7968, 0x7f58b89f7968, 1048576, 0}, {0x7f58b89f7bc0, 0x7f58b89f7bc0, 1015808, 0}}, {600, 0}) = 8

This is a combination of an io_submit() call and an io_getevents() call. There is a lot to say about this.
First, Oracle did not magically increase the maximum IO size of the operating system. As we can see, Oracle submits multiple IO requests in a single io_submit() call. If we peek at the size field in the io_getevents() call, we see the size mostly being 1048576. In other words: the individual IO requests are 1MB, as we expect; Oracle submitted multiple operating system requests in order to cover the single (Oracle level) request of 1020 blocks.

If we add up the IO sizes of the individual requests we get: (1048576*7)+1015808=8355840 bytes. 8355840/8192 (the blocksize) = 1020. So the combined size of the IO requests matches the 1020 Oracle blocks of the request.
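
A sketch of how such an Oracle-level request of 1020 blocks could be broken up into operating system IO requests of at most 1MB and handed to a single io_submit() call (libaio again; the file descriptor, offset and buffer handling are assumptions for illustration, not Oracle’s actual implementation):

#include <libaio.h>
#include <stdlib.h>

#define MAX_OS_IO  1048576                /* 1MB operating system IO limit */
#define BLOCKSIZE  8192                   /* Oracle block size             */

/* submit one multiblock read (e.g. 1020 blocks) as a series of iocbs of
   at most 1MB each: (1048576*7)+1015808 = 8355840 bytes = 8 iocbs */
int submit_multiblock_read(io_context_t ctx, int fd,
                           long long offset, int nr_blocks)
{
    long long remaining = (long long)nr_blocks * BLOCKSIZE;
    int nr_iocbs = (int)((remaining + MAX_OS_IO - 1) / MAX_OS_IO);
    struct iocb  *cbs  = calloc(nr_iocbs, sizeof(struct iocb));
    struct iocb **ptrs = calloc(nr_iocbs, sizeof(struct iocb *));

    for (int i = 0; i < nr_iocbs; i++) {
        size_t size = remaining > MAX_OS_IO ? MAX_OS_IO : (size_t)remaining;
        void *buf;

        /* direct IO requires aligned buffers */
        if (posix_memalign(&buf, 4096, size) != 0)
            return -1;
        io_prep_pread(&cbs[i], fd, buf, size, offset);
        ptrs[i] = &cbs[i];
        offset    += size;
        remaining -= size;
    }
    /* one io_submit() call carrying all iocbs, like the call shown above */
    return io_submit(ctx, nr_iocbs, ptrs);
}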

Next, if we look at the timeout “field” of io_getevents(), it reads {600, 0}. This means this was a blocking call, with the timeout set to 600 seconds. We can deduce from that field that this was an io_getevents() call from the wait for IO phase.

I urge you to take an even closer look. If you look at the pointers to the iocbs (the large hexadecimal numbers), you will see that the IOs submitted are different from the IOs reaped with io_getevents(). This is Oracle taking advantage of asynchronous IO.

I’ve throttled IO to 1 IOPS, so I know all the requests will go up to the wait for IO phase.

Let’s take a look at a strace output if IO is not throttled to 1 IOPS:

io_getevents(140136915922944, 28, 128, {{0x7f7429499710, 0x7f7429499710, 1048576, 0}, {0x7f7429499e18, 0x7f7429499e18, 1048576, 0}, {0x7f7429498900, 0x7f7429498900, 1048576, 0}, {0x7f742949c5f0, 0x7f742949c5f0, 1048576, 0}, {0x7f742949ba38, 0x7f742949ba38, 1048576, 0}, {0x7f7429498db0, 0x7f7429498db0, 1048576, 0}, {0x7f742949ac28, 0x7f742949ac28, 1048576, 0}, {0x7f742949a2c8, 0x7f742949a2c8, 1048576, 0}, {0x7f742949b7e0, 0x7f742949b7e0, 1048576, 0}, {0x7f742949a070, 0x7f742949a070, 1048576, 0}, {0x7f7429499260, 0x7f7429499260, 1048576, 0}, {0x7f74294981f8, 0x7f74294981f8, 1048576, 0}, {0x7f742949c848, 0x7f742949c848, 1048576, 0}, {0x7f742949a778, 0x7f742949a778, 1048576, 0}}, {0, 0}) = 14

The timeout struct is set to {0, 0}, so this is an io_getevents() call from the ‘reap all’ phase. As we can see, the total number of IOs asked for in this case is 28 (that is a little less than 28MB!). Another interesting thing is that this call returned only 14 IO requests, yet still succeeded, despite the minimal number of requests being set to 28. My guess is that another io_getevents() call or calls reaped the other IOs, and this call completed the 28 requests, which made it successful.

Because this is in the ‘reap all’ phase, there is no wait event. So this huge amount of IOs being ready is not reflected in the wait events, nor can it be derived from them for this specific request. Well, you could calculate the number of IOs needed to read this segment by looking at DBA_EXTENTS, calculate the total time taken for reading the segment, and derive an average, but that’s all.

Because the ‘direct path read’ wait event only shows up when we need to wait, and that wait is for a single request, a wait will show at most 1024 blocks, because that is the maximum I’ve found up to now. If all the IOs are ready before the ‘wait for IO’ phase, it means there were more IO requests ready than the single Oracle IO a wait event would show.
