Investigating kernel dives using ftrace.

This blogpost is about using the linux ftrace kernel facility. If you are familiar with ftrace and specifically the function_graph tracer, you might already be aware of this functionality. This is Linux specific, and this facility is at least available in kernel 2.6.39 (Oracle’s UEK2 kernel).

What is a ‘kernel dive’? Whenever a process is running, it should mostly be in ‘user mode’, executing the program it is supposed to run. However, during running there could be situations (a lot of situations, depending on what the program is doing!) that the program needs something “from” the system. Such a thing could be allocating memory, or using a device that is shared like a block device, or a network device. These things are controlled by the kernel, and require a process to issue a system call. A user process executes a system call to request actions to such things. Starting from the system call, the execution ‘dives’ in the kernel, and executes in kernel or system mode. However, kernel dives are not limited to system calls; for example a page fault (paging in backing memory for allocated virtual memory) switches to system mode, as well as handling an interrupt.

The Linux kernel has a facility that is called ‘ftrace’. The name ftrace originally was named because of function tracing, but it has evolved into a tracing framework. It’s important to point out that ftrace currently ONLY works in KERNELSPACE. This means you miss the userspace code.

Ftrace uses both explicit tracepoints (defined in the linux kernel source), as well as dynamic tracepoints, for which the gcc -pg (profiling data) flag is used to capture function entry. For function exit a ‘trampoline’ is used. A trampoline here is an extra function executed (mcount) at function entry that stores the return address, and replaces the return address with that of the trampoline, so an exit can be detected.

the linux debugfs filesystem must be mounted for ftrace to work. You can check if the debugfs filesystem is mounted using:

[root@bigmachine ~]# mount -t debugfs

It does not return any rows if debugfs is not mounted. You can mount debugfs the following way:

mount -t debugfs none /sys/kernel/debug

Let’s do some basic steps first, just tracing an Oracle session!
First go to the tracing directory:

[root@bigmachine ~]# cd /sys/kernel/debug/tracing/
[root@bigmachine tracing]#

Obtain the PID of an Oracle foreground process, and enable tracing for this PID:

[root@bigmachine tracing]# echo 6431 > set_ftrace_pid

We also need to choose what we want to trace. Something very beneficial for understanding what is going on in the kernel is the tracer ‘function_graph’. Here’s how you enable that trace:

[root@bigmachine tracing]# echo function_graph > current_tracer

The trace output is in ‘trace’. Assuming that the PID to trace is idle, this is how the trace output looks like:

[root@bigmachine tracing]# cat trace
# tracer: function_graph
# |     |   |                     |   |   |   |

Now execute something silly (something which does little!).

TS@fv12102 > select * from dual;


Now let’s first show how much information is gathered:

[root@bigmachine tracing]# cat trace | wc -l

That’s correct: approximately 1900 rows of trace data are created during only ‘select * from dual’. Now think about this: most of the things ‘select * from dual’ does are done in userspace. What does ‘select * from dual’ actually do in kernel space? Here’s a little grep to see what the Oracle process did:

[root@bigmachine tracing]# grep \|\ \ [a-zA-Z_0-9]*\(\)\ \{ trace
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
...much more...

Probably you are aware Oracle executes a lot of times() and getrusage() calls. As you can see, the system calls are also audited by the Linux system. Now just open the trace file with your favourite file viewer (I use ‘less’, you can use the vi commands to search for pieces of text), and peek in the file:

   0)               |  sys_getrusage() {
   0)               |    getrusage() {
   0)               |      k_getrusage() {
   0)               |        task_cputime_adjusted() {
   0)               |          cputime_adjust() {
   0)   0.045 us    |            nsecs_to_jiffies();
   0)   0.503 us    |          }
   0)   0.876 us    |        }
   0)   0.041 us    |        jiffies_to_timeval();
   0)   0.042 us    |        jiffies_to_timeval();
   0)               |        get_task_mm() {
   0)   0.047 us    |          _raw_spin_lock();
   0)   0.380 us    |        }
   0)               |        mmput() {
   0)   0.041 us    |          _cond_resched();
   0)   0.351 us    |        }
   0)   3.836 us    |      }
   0)   0.043 us    |      _cond_resched();
   0)   4.596 us    |    }
   0)   5.004 us    |  }

I skipped the beginning of the trace, which is actually the ending of the kernel code of the waiting on a next command of the Oracle process. If you scroll down to the end, you will see how the waiting on a next command (instrumented by the wait event ‘SQL*Net message from client’) is actually implemented on the kernel side using a read function on a pipe. You find the above shown systemcall, getrusage, just after the ending of the read function in the beginning of the trace (or search for ‘sys_getrusage’).

What is interesting is that this trace is showing the different functions in the kernel and which function is calling what function, made visible by accolades and indention, made to look like a c program. In other words: this allows you to see in what specific function of the kernel the time is spend, and how the total time of a kernel function is build up!

Now that basic usage is known, let’s step up to something interesting; the getrusage timing in interesting, but just an example. One of such really interesting things is IO.

First clear the trace file:

[root@bigmachine tracing]# echo > trace

Verify that the current tracer still is function_graph:

[root@bigmachine tracing]# cat current_tracer

Verify the process id to trace (it should list the process id of the oracle foreground process you want to trace):

[root@bigmachine tracing]# cat set_ftrace_pid

Now an additional next step to only trace the pread systemcall using the ‘set_graph_function’ facility:

[root@bigmachine tracing]# echo sys_pread64 > set_graph_function

Now make the Oracle session that is traced do a pread call. I used ‘select * from t1 where rownum=1’. Because I flushed the buffer cache prior to doing this (to make sure physical IO is needed), I did get 2 physical IOs, one for the segment header and one for the data block.

To make sure nothing else will get into the trace buffer (the file really is a buffer in memory), do the following to stop further tracing:

[root@bigmachine tracing]# echo 0 > tracing_on

This is how my output looks like:
I added line numbering to it (you can do that yourself too with the nl linux utility), so there are a few things I can point you to.

The first pread call ends at line# 568. That is a lot of information. It also shows how much stuff is done during an IO.
– The filesystem type matters! At line 7 you see vfs_read (the linux filesystem abstraction layer), but at line 19 you see filesystem specific code!
– At line 27 you see a XFS specific function indicating DIO (direct IO) is used.
– At line 33 you see an interaction with the block layer using the blk_start_plug() function. Linux uses a method to group IOs that is logically equal to filling up and later draining a bathtub using a plug.
– At line 328 you see the IO request has been built by the filesystem code, and the the request queue is unplugged: blk_flush_plug_list. You see some IO scheduler functions (starting with ‘elv’) and you see the deadline scheduler is used (deadline_add_request). Not very far after that, we enter the scsi layer (indicated by functions starting with ‘scsi_’).
– At line 431 you see the request being submitted to the device using the mpt_put_msg_frame function. We are now in the low level driver layer. We see there is time involved (82ms, you can see this is a virtual machine). You see all kinds of loops ending here. At this point the IO request has been sent to the IO device. However, there is more interesting stuff coming!
– At line 429 we see the kernel is preparing for waiting on the IO request to return. This is done in using the dio_await_completion function.
– At line 436 we are totally done submitting the IO request, and entering the Linux process scheduler. In the scheduler code we see housekeeping (update_blocked_averages, line 456), and rebalancing the process, which means trying to find the best cpu thread to execute on (load_balance, line 462), before the process finally goes to sleep.
– At line 487 the process is woken by an interrupt (not visible in the trace) from it’s uninterruptible sleep (state ‘D’). The IO then really finishes up; at function dio_bio_complete, line 500 the data of the IO request is put in a BIO structure which can be passed on to userspace, some further housekeeping is done (dio_complete, line 539), the access time is updated (touch_atime, line 552) and some xfs housekeeping, unlocking the inode (xfs_iunlock, line 559).

Let me show you something which shows how valuable this tracing is for finding (linux operating system) issues. This is another trace on the very same system with slightly different kernel settings:

Can you spot the difference? If not, take a look at this diff:, at line number 115. The function called the second time is gup_huge_pmd. That’s a function for handling user pages, and this function is using huge pages! The change I made was setting vm.nr_hugepages from zero to a number higher than the database instance needed. So not only the Oracle database can benefit from huge pages, but also the Linux kernel IO subsystem!

If this wetted your appetite, read on! Here’s a trace, once again of pread: Can you see what is the difference here? If not, I created a diff of this trace and the previous one: A nice example of the difference is at line 42. The first trace file contains all kinds of functions for doing extent management for the XFS filesystem, until line 89. This is all replaced with a single function blkdev_get_block.

The similarity is it’s all the system call pread. The difference between the second and the third is that with the third pread trace I used Oracle’s ASM facility, alias database IO to a block device directly without a filesystem. This trace shows pread is not executing all the filesystem functions, because there is no filesystem. However, please do realise it’s all about spend time, not how many rows there are in a trace.

Now before jumping to conclusions about the greatness of ASM because it can skip a lot of the code path, please do realise that ASM adds code path inside the Oracle database, because some kind of disk space management must be done. Also there is an entire ASM instance for managing the diskspace (which is NOT part of database sessions’ code path). These are facts, not opinion. There are use cases for both filesystem usage and for ASM, although I have a personal preference for ASM.

Ftrace does not get the attention it deserves. It’s a great tool for investigating time spend in the kernel, and it’s available by mounting the debugfs filesystem. This blogpost describes the function_graph tracer, there are other tracers too, it’s absolutely not limited to the description in this blogpost.

To show how ftrace with the function_graph tracer works I took the pread system call as an example. It turns out the pread system call comes in many shapes. Unless you intimately know a system, there can be different layers in the kernel in play when executing pread. However, on a normal system the main time component of the pread system call should be off CPU in uninterruptible state. And that is exactly what the function_graph tracer can tell.

Again, as a reminder, ftrace only works for kernel level (“system”) execution of a process.

1 comment

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: