Linux strace doesn’t lie after all.

strace is a linux utility to profile system calls. Using strace you can see the system calls that a process executes, in order to investigate the inner working or performance. In my presentation about multiblock reads I put the text ‘strace lies’. This is NOT correct. My current understanding is that strace does show every system call made by an executable. So…why did I make that statement? (editorial note: this article dives into the inner working of Linux AIO)

During the hotsos symposium in Dallas I was chatting with Tanel Poder, and he asked me to look a little bit more into the linux io_getevents() call and strace, because there might be an optimisation which keeps the call from truly issuing a system call, which means strace could be right. We started thinking about it a bit, and came to the conclusion it should be possible for the linux AIO code to cut the corner and peek at the IOs before executing the io_getevents system call (as a spoiler: because the IO context is in userspace).

So, what to do to investigate this? Well, let’s just look at how it works. The Oracle executable executes io_getevents_0_4() in order to do the system call io_getevents(). The function io_getevents_0_4() comes from libaio (the linux asynchronous IO library). After a small search, it appears libaio has a git source code repository, so we can peek into the source code directly from our browser!

If you browse to the source tree, you see the file io_getevents.c. If you click on it, you see the contents of this file, which has the function io_getevents_0_4() in it. This is a very simple function (actual function source code):

int io_getevents_0_4(io_context_t ctx, long min_nr, long nr, struct io_event * events, struct timespec * timeout)
{
	struct aio_ring *ring;
	ring = (struct aio_ring*)ctx;
	if (ring==NULL || ring->magic != AIO_RING_MAGIC)
		goto do_syscall;
	if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
		if (ring->head == ring->tail)
			return 0;
	}
	
do_syscall:	
	return __io_getevents_0_4(ctx, min_nr, nr, events, timeout);
}

If you look at line 7, you see ‘if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0)’. In other words: if timeout (struct) is set to any value (not NULL), and if the timeout->tv_sec (the seconds portion of the timeout struct) is set to 0 and if the timeout->nsec (the nanoseconds portion of the timeout struct) is set to 0, we enter this function. Once in the function, we look at the struct ring, which is defined as a struct aio_ring from the pointer ctx which is passed to the function io_getevents_0_4(); the first argument. If ring->head is the same as ring-> tail, in other words: if the ring (buffer) is empty, we cut the corner and return 0, without executing the system call. In any other case, the function __io_getevents_0_4() is executed, which executes the system call.

A way to check if this truly is happening, is using the gdb ‘catch syscall’ functionality. In my investigation, I executed ‘break io_getevents_0_4’, which breaks on the userland portion of the io_getevents() function, ‘catch syscall io_getevents’, which breaks when the system call truly is executed, and ‘break io_submit’ to understand which getevents are executed in what number. I setup a testcase with a sqlplus session with the server process throttled to 1 IO per second (see my article on using cgroups to throttle IO), attached to the server process with gdb, and executed the following commands:

break io_submit
commands
silent
printf "io_submit\n"
c
end
break io_getevents_0_4
commands
silent
printf "io_getevents_0_4-libaio\n"
c 
end
catch syscall io_getevents
commands
silent
printf "io_getevents-syscall\n"
c
end

Next I executed a SQL which did a direct path full table scan. This is the result:

io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall
io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall

If you recall what is in the about multiblock reads presentation: after the io_submit ‘phase’, Oracle executes up to 4 io_getevents() calls non-blocking to look for IO. In this case you see the calls being done in user land, but not making it to the system call, because of the shortcut in the io_getevents_0_4() code. After 4 times, Oracle executes io_getevents() with timeout set to 600 seconds, which makes the call truly execute a system call. Please mind that ‘catch syscall’ triggers twice (as can be seen from the two ‘io_getevents-syscall’ in the above example), but is in reality only 1 system call. This proves the working of the code of the io_getevents_0_4() function we looked into, and the reason why I thought the strace utility lied.

2 comments
  1. gerwinh said:

    Hi Frits, thanks for the investigation. Did some investigation myself regarding strace and the fact that certain system calls looked to be absent keeped me puzzled how reliable strace in the end actually is. The investigation you did made in my opinion clear that strace is reliable and we can fully use it in system call related investigations.

    Regards,

    Gerwin

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.