Tag Archives: systemcall

During investigating how Oracle works with regards to waiting, I discovered an oddity. I was researching for my redo blog posts, and found that in a certain case, Oracle uses the ‘nanosleep’ system call. As the name of this system call suggests, this call allows you to let a process sleep with nanosecond precision.

The oddity that I found, was the following:

$ strace -Tp 42
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

I executed strace with the ‘-T’ argument (show time spend in system calls), and ‘-p’ to attach it to the process I want to measure. The nanosleep function was called with struct time spec set to {0, 1000}, which means 0 seconds and 1000 nanoseconds. However, the actual time spend was between 0.000191 and 0.000253 seconds, which is 1910000 and 253000 in nanoseconds. That’s approximately 200 times longer than it’s set to wait!

Since then I have spend a fair amount of time to understand why this is happening. So far I found a number of facts that I think explain this, but if anyone else has something to throw in here, please leave a comment.

The first thing I did was create a C program that just runs nanosleep in a loop, so I can run the nanosleep systemcall in isolation. Here’s the nanosleep PoC code (measure_time.c):

#include <time>
int main(int argc, char **argv)
  int       loop_counter, loop_total;
  struct    timespec sleep;



  while ( loop_counter &lt; loop_total ) {

Then, when researching I found that there is a systemcall that shows the actual clock resolution time! I created another very small C program to run just that (measure_resulution.c):

#include <time>
int main(int argc, char **argv)
  int       result;
  struct    timespec resolution;
  clockid_t clk_id;

  clock_getres(clk_id, &amp;resolution);

  printf("Resolution: %ld s, %ld ns\n", resolution.tv_sec, resolution.tv_nsec);

This c program can be compiled using ‘gcc measure_resolution.c -o measure_resolution’. This is what it showed:

$ ./measure_resolution
Resolution: 0 s, 1 ns

So, my system has a 1 ns resolution, despite my nanosleep systemcalls taking way longer. A little later I found out this information can be obtained directly in /proc/timer_list:

$ grep resolution /proc/timer_list | uniq
  .resolution: 1 nsecs

The first thing I found while researching, is that when I change the priority of the process (not the 1000ns), I can get lower sleeping times:

# strace -T chrt --rr 99 ./measure_time
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

The lowest time here is 0.036ms (36000ns). In order to look further, I found that newer kernel versions have an addition to perf that can measure wait time, scheduler delay (‘run queue’) and actual run time; perf sched timehist. So at this point it seemed that was a way to understand more about the timing of nanosleep. In order to look at that I created a virtual machine with fedora 27 (kernel version 4.14.13), and compiled my measure_time.c program on it.

The next thing to do is run the measure_time executable with perf sched record:

# perf sched record chrt --rr 99 ./measure_time
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.183 MB (54 samples) ]

Then run perf sched timehist to look at the data and find the pid of the measure_time executable:

# perf sched timehist
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.760817 [0000]                                0.000      0.000      0.000
   37745.760823 [0000]  rcu_sched[8]                        0.000      0.000      0.005
   37745.762367 [0002]  perf[2777]                          0.000      0.000      0.000
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.762431 [0002]  migration/2[21]                     0.000      0.001      0.011
   37745.762479 [0003]                                0.000      0.000      0.000
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764108 [0003]                                1.583      0.000      0.045
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005

So it’s pid 2778, now add that to perf sched timehist:

# perf sched timehist -p 2778
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005
   37745.764153 [0003]  measure_time[2778]                  0.034      0.002      0.004
   37745.764195 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764236 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764291 [0003]  measure_time[2778]                  0.050      0.002      0.004
   37745.764347 [0003]  measure_time[2778]                  0.051      0.002      0.004
   37745.764405 [0003]  measure_time[2778]                  0.052      0.002      0.004
   37745.764478 [0003]  measure_time[2778]                  0.067      0.006      0.005
   37745.764506 [0003]  measure_time[2778]                  0.022      0.002      0.005
   37745.764581 [0003]  measure_time[2778]                  0.022      0.002      0.052

The way to look at this: first perf is run. Because we just started running, there is no wait time, the task gets the state running, then it needs to be scheduled to run. That time is in ‘sch delay’, obviously scheduling delay. Then it gets truly is running and runs for some time, which time is visible in the ‘run time’ column. Next, the measure_time executable is run.

The first time measure_time runs, it needs to do some initialisation, like the dynamic loader loading libc, allocate memory (mmap calls), etcetera. This is visible in the run time of 1.583 ms. However, then it runs into the nanosleep call. The nanosleep call puts the process into an interruptible sleep. The sleep is visible in the wait time column. Here we see it’s waiting/sleeping for between 0.022 and 0.067 ms. After waiting, the process gets the status running, after which the scheduler must schedule the task and make it running. That time is in scheduling time again. Then it runs, but all it needs to do is end the nanosleep call, add one to the loop_counter variable, jump back and enter the nanosleep call again, as we can see, that only takes 0.004 ms.

Okay, so far it was quite a geeking out session, but not really an answer has been found to why a 1000ns sleep turns out to be way higher. I sought out help via social media, and gotten some sane advise from Kevin Closson. Essentially, the advise was to run it on physical hardware instead of my test virtual machine, and make the load as low as possible in order for the scheduler to be able to schedule my process as quickly as it can. Yes, very obvious actually.

So, the next target is a physical machine in our Enkitec lab. This is on a X3 generation machine (Sandy Bridge EP). I compiled measure_time.c, and ran ‘strace -T chrt –rr 99 ./measure_time’ again:

nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

That’s way better! But still the nanosleep timing is approximately 13000ns, instead of 1000ns that the call is supposed to take.

At this point I had no clue, and obviously needed to rethink it again; the time the call is supposed to sleep and thus take and the wallclock time I measured are still quite far apart in my measurements. On one hand you could argue that the time it is set to wait is so small that this could be simply caused by the code in the nanosleep function itself, on the other hand it’s not even close.

Then I gotten a eureka moment: strace is using ptrace facilities to attach to the process, and ptrace is known to quite severely slow processes down. Hat tip to Martin Bach who contacted me with this at the very same moment I realised this. So then the question becomes: can I measure nanosleep in another way?

Then I thought about the linux kernel ftrace facilities. In fact, the physical machines at my disposal in the Enkitec lab all run an Exadata images, and the Exadata images do not contain bleeding edge kernels like fedora 27 has, so I couldn’t repeat the perf sched timehist tracing. I wrote about ftrace in the past.

So, let’s redo the timing investigation again using ftrace:
1. Make sure the debug filesystem is mounted:

# mount /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug

2. Enter the trace directory in /sys/kernel/debug:

# cd /sys/kernel/debug/tracing

3. Create another session on the same machine and obtain the process id:

# echo $$

4. In the debug/tracing session, setup nanosleep system call tracing:
Enter the process id as the pid to trace:

echo 62733 > set_ftrace_pid

Set the type of trace. The default ‘tracer’ is nop (no tracing). Set it to ‘function_graph’:

echo function_graph > current_tracer

If there is an option ‘(no)function-fork’ in trace_options, set it to ‘function-fork’ to make ftrace trace child processes too (function-fork is an addition in recent kernels):

# egrep ^nofunction-fork trace_options && echo function-fork > trace_options

Set the function to trace. For the nanosleep system call, this is SyS_nanosleep:

# echo SyS_nanosleep > set_ftrace_filter

Now enable tracing:

# echo 1 > tracing_on

5. In the session to trace, execute chrt –rr 99 ./measure_time.
6. In the debug/tracing session, stop the trace and look at the results:

# echo 0 > tracing_on
# cat trace
# tracer: function_graph
# |     |   |                     |   |   |   |
 20)   8.958 us    |  SyS_nanosleep();
 20)   5.077 us    |  SyS_nanosleep();
 20)   4.260 us    |  SyS_nanosleep();
 20)   9.384 us    |  SyS_nanosleep();
 20)   5.495 us    |  SyS_nanosleep();
 20)   5.526 us    |  SyS_nanosleep();
 20)   5.039 us    |  SyS_nanosleep();
 20)   4.936 us    |  SyS_nanosleep();
 20)   4.898 us    |  SyS_nanosleep();
 20)   4.889 us    |  SyS_nanosleep();

This shows the waiting time still fluctuating, but going down to as low as 4.260 us, alias 4260 nanoseconds. This still is roughly four times the sleeping time set (1000ns), but it gotten way lower than earlier! Probably, the tracing increased the latency a bit, so my guess would be the true sleeping time when no trace is applied is around 2000ns. Please mind this is with changed (increased) process priorities (chrt –rr 99); when measure_time is ran without any priorities set, it looks like this:

# cat trace
# tracer: function_graph
# |     |   |                     |   |   |   |
 21) + 60.583 us   |  SyS_nanosleep();
 21) + 56.504 us   |  SyS_nanosleep();
 21) + 55.940 us   |  SyS_nanosleep();
 21) + 56.118 us   |  SyS_nanosleep();
 21) + 56.076 us   |  SyS_nanosleep();
 21) + 56.078 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 56.100 us   |  SyS_nanosleep();
 21) + 56.868 us   |  SyS_nanosleep();

But: there is Oracle grid infrastructure running, and multiple databases are running.

What would be interesting to see is how nanosleep is actually executed. Ftrace provides a way to do exactly that! The important thing to keep in mind is that ftrace is kernel only. In order to see how SyS_nanosleep is executed, do the following:

# > set_ftrace_filter
# echo SyS_nanosleep > set_graph_function
# > trace

Then execute ‘chrt –rr 99 ./measure_time’ again, and look in trace again:
(please mind I picked the second occurrence of SyS_nanosleep, of course the trace shows them all)

 14)               |  SyS_nanosleep() {
 14)               |    hrtimer_nanosleep() {
 14)   0.045 us    |      hrtimer_init();
 14)               |      do_nanosleep() {
 14)               |        hrtimer_start_range_ns() {
 14)               |          __hrtimer_start_range_ns() {
 14)               |            lock_hrtimer_base() {
 14)   0.039 us    |              _raw_spin_lock_irqsave();
 14)   0.527 us    |            } /* lock_hrtimer_base */
 14)   0.048 us    |            ktime_get();
 14)   0.043 us    |            get_nohz_timer_target();
 14)   0.079 us    |            enqueue_hrtimer();
 14)               |            tick_program_event() {
 14)               |              clockevents_program_event() {
 14)   0.048 us    |                ktime_get();
 14)   0.446 us    |              } /* clockevents_program_event */
 14)   0.876 us    |            } /* tick_program_event */
 14)   0.040 us    |            _raw_spin_unlock();
 14)   0.040 us    |            __raise_softirq_irqoff();
 14)   4.318 us    |          } /* __hrtimer_start_range_ns */
 14)   4.792 us    |        } /* hrtimer_start_range_ns */
 14)               |        schedule() {
 14)               |          __schedule() {
 14)   0.041 us    |            rcu_note_context_switch();
 14)   0.039 us    |            _raw_spin_lock_irq();
 14)               |            deactivate_task() {
 14)               |              dequeue_task() {
 14)   0.056 us    |                update_rq_clock();
 14)               |                dequeue_task_rt() {
 14)               |                  update_curr_rt() {
 14)   0.057 us    |                    cpuacct_charge();
 14)   0.053 us    |                    sched_avg_update();
 14)   0.046 us    |                    _raw_spin_lock();
 14)               |                    sched_rt_runtime_exceeded() {
 14)   0.041 us    |                      balance_runtime();
 14)   0.402 us    |                    } /* sched_rt_runtime_exceeded */
 14)   0.046 us    |                    _raw_spin_unlock();
 14)   2.701 us    |                  } /* update_curr_rt */
 14)               |                  dequeue_rt_entity() {
 14)               |                    dequeue_rt_stack() {
 14)               |                      __dequeue_rt_entity() {
 14)   0.062 us    |                        cpupri_set();
 14)   0.057 us    |                        update_rt_migration();
 14)   0.861 us    |                      } /* __dequeue_rt_entity */
 14)   1.258 us    |                    } /* dequeue_rt_stack */
 14)   0.047 us    |                    enqueue_top_rt_rq();
 14)   1.908 us    |                  } /* dequeue_rt_entity */
 14)   0.048 us    |                  dequeue_pushable_task();
 14)   5.564 us    |                } /* dequeue_task_rt */
 14)   6.445 us    |              } /* dequeue_task */
 14)   6.789 us    |            } /* deactivate_task */
 14)   0.044 us    |            pick_next_task_stop();
 14)   0.041 us    |            pick_next_task_dl();
 14)               |            pick_next_task_rt() {
 14)   0.042 us    |              pull_rt_task();
 14)   0.050 us    |              update_curr_rt();
 14)   0.823 us    |            } /* pick_next_task_rt */
 14)               |            pick_next_task_fair() {
 14)               |              put_prev_task_rt() {
 14)   0.042 us    |                update_curr_rt();
 14)   0.377 us    |              } /* put_prev_task_rt */
 14)               |              pick_next_entity() {
 14)   0.039 us    |                clear_buddies();
 14)   0.340 us    |              } /* pick_next_entity */
 14)               |              set_next_entity() {
 14)   0.065 us    |                __dequeue_entity();
 14)   0.426 us    |              } /* set_next_entity */
 14)   2.159 us    |            } /* pick_next_task_fair */
 14)   0.052 us    |            finish_task_switch();
 14) + 15.656 us   |          } /* __schedule */
 14) + 16.003 us   |        } /* schedule */
 14)   0.039 us    |        _cond_resched();
 14)               |        hrtimer_cancel() {
 14)               |          hrtimer_try_to_cancel() {
 14)               |            lock_hrtimer_base() {
 14)   0.040 us    |              _raw_spin_lock_irqsave();
 14)   0.365 us    |            } /* lock_hrtimer_base */
 14)   0.046 us    |            _raw_spin_unlock_irqrestore();
 14)   1.006 us    |          } /* hrtimer_try_to_cancel */
 14)   1.403 us    |        } /* hrtimer_cancel */
 14) + 23.559 us   |      } /* do_nanosleep */
 14) + 24.444 us   |    } /* hrtimer_nanosleep */
 14) + 24.842 us   |  } /* SyS_nanosleep */

This gives a full overview of the functions that SyS_nanosleep is executing in the kernel, including timing.
Essentially, everything happens in ‘do_nanosleep’. Inside do_nanosleep, the following functions are directly called:
– hrtimer_start_range_ns -> 4.792 us
– schedule -> 16.003 us
– _cond_resched -> 0.039 us
– hrtimer_cancel -> 1.403 us
This is more than the 4260 ns we saw earlier, it’s clear the tracing influences the latency of the execution.

In order to get a breakdown on where the time goes for the 4260 ns, remove SyS_nanosleep from set_graph_function, empty the trace and add the four functions mentioned above to set_ftrace_filter:

# > set_graph_function
# > trace
# echo 'hrtimer_start_range_ns schedule _cond_resched hrtimer_cancel' > set_ftrace_filter

Now execute ‘./chrt –rr 99 ./measure_time’ again, and look in trace; this is a snippet that shows a single nanosleep invocation:

  2)   0.938 us    |  hrtimer_start_range_ns();
  2)   3.406 us    |  schedule();
  2)   0.077 us    |  _cond_resched();
  2)   0.123 us    |  hrtimer_cancel();

What this shows is that apparently the time it takes for the schedule function to finish is taking the majority of the time. There will be influence of ftrace being active, but taking that into account on this system the time to get rescheduled after waiting for a short time is around 3 to 4 us (microseconds).

There is a certain point at which timing is so fine grained, that anything you do, even running a (kernel) function itself distorts time. It looks like this is 1us (microsecond) granularity for the systems I’ve looked at.

Obviously, when you run on a virtualised platform, especially like I did with a ‘desktop’ virtualisation product like Virtualbox, but it applies to any virtualisation, scheduling times increase because the virtualisation needs to abstract running on a CPU. With modern virtualisation running user-mode this overhead is minimal or non-existent. However, kernel mode access needs to be abstracted in order to keep virtual machines from invading each others resources. This can become significant if resources are oversubscribed (we gotten a glimpse of that in this article with the first nanosleep measurements at roughly 220 us).

The nanosleep times gotten lower when the process priority was set as high as possible. Then when moving onto physical hardware, the latency times gotten even lower. The priority of the process strongly influences the scheduling time, which is kind of logical, because that’s the entire purpose of the priority.

The linux strace utility influences process execution latency quite significantly when looking at a microsecond granularity. Ftrace provides a way to trace in a very flexible way with way lesser overhead than strace, although it still does impose an overhead on the process it is tracing.

Using ftrace, the minimal time that I could measure for a process to get scheduled again after sleeping was between 3 to 4 microseconds. This latency is influenced by the tracing, so it probably comes down to between 2 to 3 microseconds without it. Also, this is strongly influenced by the priority, I needed to set it to the highest priority in order to get the low latency. Another big influencer will be the hardware that I am running, and there are a fair amount of processes running, some at high priority (cluster ware ocssd, Oracle database lms processes, linux daemons).

This blogpost is about using the linux ftrace kernel facility. If you are familiar with ftrace and specifically the function_graph tracer, you might already be aware of this functionality. This is Linux specific, and this facility is at least available in kernel 2.6.39 (Oracle’s UEK2 kernel).

What is a ‘kernel dive’? Whenever a process is running, it should mostly be in ‘user mode’, executing the program it is supposed to run. However, during running there could be situations (a lot of situations, depending on what the program is doing!) that the program needs something “from” the system. Such a thing could be allocating memory, or using a device that is shared like a block device, or a network device. These things are controlled by the kernel, and require a process to issue a system call. A user process executes a system call to request actions to such things. Starting from the system call, the execution ‘dives’ in the kernel, and executes in kernel or system mode. However, kernel dives are not limited to system calls; for example a page fault (paging in backing memory for allocated virtual memory) switches to system mode, as well as handling an interrupt.

The Linux kernel has a facility that is called ‘ftrace’. The name ftrace originally was named because of function tracing, but it has evolved into a tracing framework. It’s important to point out that ftrace currently ONLY works in KERNELSPACE. This means you miss the userspace code.

Ftrace uses both explicit tracepoints (defined in the linux kernel source), as well as dynamic tracepoints, for which the gcc -pg (profiling data) flag is used to capture function entry. For function exit a ‘trampoline’ is used. A trampoline here is an extra function executed (mcount) at function entry that stores the return address, and replaces the return address with that of the trampoline, so an exit can be detected.

the linux debugfs filesystem must be mounted for ftrace to work. You can check if the debugfs filesystem is mounted using:

[root@bigmachine ~]# mount -t debugfs

It does not return any rows if debugfs is not mounted. You can mount debugfs the following way:

mount -t debugfs none /sys/kernel/debug

Let’s do some basic steps first, just tracing an Oracle session!
First go to the tracing directory:

[root@bigmachine ~]# cd /sys/kernel/debug/tracing/
[root@bigmachine tracing]#

Obtain the PID of an Oracle foreground process, and enable tracing for this PID:

[root@bigmachine tracing]# echo 6431 > set_ftrace_pid

We also need to choose what we want to trace. Something very beneficial for understanding what is going on in the kernel is the tracer ‘function_graph’. Here’s how you enable that trace:

[root@bigmachine tracing]# echo function_graph > current_tracer

The trace output is in ‘trace’. Assuming that the PID to trace is idle, this is how the trace output looks like:

[root@bigmachine tracing]# cat trace
# tracer: function_graph
# |     |   |                     |   |   |   |

Now execute something silly (something which does little!).

TS@fv12102 > select * from dual;


Now let’s first show how much information is gathered:

[root@bigmachine tracing]# cat trace | wc -l

That’s correct: approximately 1900 rows of trace data are created during only ‘select * from dual’. Now think about this: most of the things ‘select * from dual’ does are done in userspace. What does ‘select * from dual’ actually do in kernel space? Here’s a little grep to see what the Oracle process did:

[root@bigmachine tracing]# grep \|\ \ [a-zA-Z_0-9]*\(\)\ \{ trace
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
...much more...

Probably you are aware Oracle executes a lot of times() and getrusage() calls. As you can see, the system calls are also audited by the Linux system. Now just open the trace file with your favourite file viewer (I use ‘less’, you can use the vi commands to search for pieces of text), and peek in the file:

   0)               |  sys_getrusage() {
   0)               |    getrusage() {
   0)               |      k_getrusage() {
   0)               |        task_cputime_adjusted() {
   0)               |          cputime_adjust() {
   0)   0.045 us    |            nsecs_to_jiffies();
   0)   0.503 us    |          }
   0)   0.876 us    |        }
   0)   0.041 us    |        jiffies_to_timeval();
   0)   0.042 us    |        jiffies_to_timeval();
   0)               |        get_task_mm() {
   0)   0.047 us    |          _raw_spin_lock();
   0)   0.380 us    |        }
   0)               |        mmput() {
   0)   0.041 us    |          _cond_resched();
   0)   0.351 us    |        }
   0)   3.836 us    |      }
   0)   0.043 us    |      _cond_resched();
   0)   4.596 us    |    }
   0)   5.004 us    |  }

I skipped the beginning of the trace, which is actually the ending of the kernel code of the waiting on a next command of the Oracle process. If you scroll down to the end, you will see how the waiting on a next command (instrumented by the wait event ‘SQL*Net message from client’) is actually implemented on the kernel side using a read function on a pipe. You find the above shown systemcall, getrusage, just after the ending of the read function in the beginning of the trace (or search for ‘sys_getrusage’).

What is interesting is that this trace is showing the different functions in the kernel and which function is calling what function, made visible by accolades and indention, made to look like a c program. In other words: this allows you to see in what specific function of the kernel the time is spend, and how the total time of a kernel function is build up!

Now that basic usage is known, let’s step up to something interesting; the getrusage timing in interesting, but just an example. One of such really interesting things is IO.

First clear the trace file:

[root@bigmachine tracing]# echo > trace

Verify that the current tracer still is function_graph:

[root@bigmachine tracing]# cat current_tracer

Verify the process id to trace (it should list the process id of the oracle foreground process you want to trace):

[root@bigmachine tracing]# cat set_ftrace_pid

Now an additional next step to only trace the pread systemcall using the ‘set_graph_function’ facility:

[root@bigmachine tracing]# echo sys_pread64 > set_graph_function

Now make the Oracle session that is traced do a pread call. I used ‘select * from t1 where rownum=1’. Because I flushed the buffer cache prior to doing this (to make sure physical IO is needed), I did get 2 physical IOs, one for the segment header and one for the data block.

To make sure nothing else will get into the trace buffer (the file really is a buffer in memory), do the following to stop further tracing:

[root@bigmachine tracing]# echo 0 > tracing_on

This is how my output looks like:
I added line numbering to it (you can do that yourself too with the nl linux utility), so there are a few things I can point you to.

The first pread call ends at line# 568. That is a lot of information. It also shows how much stuff is done during an IO.
– The filesystem type matters! At line 7 you see vfs_read (the linux filesystem abstraction layer), but at line 19 you see filesystem specific code!
– At line 27 you see a XFS specific function indicating DIO (direct IO) is used.
– At line 33 you see an interaction with the block layer using the blk_start_plug() function. Linux uses a method to group IOs that is logically equal to filling up and later draining a bathtub using a plug.
– At line 328 you see the IO request has been built by the filesystem code, and the the request queue is unplugged: blk_flush_plug_list. You see some IO scheduler functions (starting with ‘elv’) and you see the deadline scheduler is used (deadline_add_request). Not very far after that, we enter the scsi layer (indicated by functions starting with ‘scsi_’).
– At line 431 you see the request being submitted to the device using the mpt_put_msg_frame function. We are now in the low level driver layer. We see there is time involved (82ms, you can see this is a virtual machine). You see all kinds of loops ending here. At this point the IO request has been sent to the IO device. However, there is more interesting stuff coming!
– At line 429 we see the kernel is preparing for waiting on the IO request to return. This is done in using the dio_await_completion function.
– At line 436 we are totally done submitting the IO request, and entering the Linux process scheduler. In the scheduler code we see housekeeping (update_blocked_averages, line 456), and rebalancing the process, which means trying to find the best cpu thread to execute on (load_balance, line 462), before the process finally goes to sleep.
– At line 487 the process is woken by an interrupt (not visible in the trace) from it’s uninterruptible sleep (state ‘D’). The IO then really finishes up; at function dio_bio_complete, line 500 the data of the IO request is put in a BIO structure which can be passed on to userspace, some further housekeeping is done (dio_complete, line 539), the access time is updated (touch_atime, line 552) and some xfs housekeeping, unlocking the inode (xfs_iunlock, line 559).

Let me show you something which shows how valuable this tracing is for finding (linux operating system) issues. This is another trace on the very same system with slightly different kernel settings:

Can you spot the difference? If not, take a look at this diff:, at line number 115. The function called the second time is gup_huge_pmd. That’s a function for handling user pages, and this function is using huge pages! The change I made was setting vm.nr_hugepages from zero to a number higher than the database instance needed. So not only the Oracle database can benefit from huge pages, but also the Linux kernel IO subsystem!

If this wetted your appetite, read on! Here’s a trace, once again of pread: Can you see what is the difference here? If not, I created a diff of this trace and the previous one: A nice example of the difference is at line 42. The first trace file contains all kinds of functions for doing extent management for the XFS filesystem, until line 89. This is all replaced with a single function blkdev_get_block.

The similarity is it’s all the system call pread. The difference between the second and the third is that with the third pread trace I used Oracle’s ASM facility, alias database IO to a block device directly without a filesystem. This trace shows pread is not executing all the filesystem functions, because there is no filesystem. However, please do realise it’s all about spend time, not how many rows there are in a trace.

Now before jumping to conclusions about the greatness of ASM because it can skip a lot of the code path, please do realise that ASM adds code path inside the Oracle database, because some kind of disk space management must be done. Also there is an entire ASM instance for managing the diskspace (which is NOT part of database sessions’ code path). These are facts, not opinion. There are use cases for both filesystem usage and for ASM, although I have a personal preference for ASM.

Ftrace does not get the attention it deserves. It’s a great tool for investigating time spend in the kernel, and it’s available by mounting the debugfs filesystem. This blogpost describes the function_graph tracer, there are other tracers too, it’s absolutely not limited to the description in this blogpost.

To show how ftrace with the function_graph tracer works I took the pread system call as an example. It turns out the pread system call comes in many shapes. Unless you intimately know a system, there can be different layers in the kernel in play when executing pread. However, on a normal system the main time component of the pread system call should be off CPU in uninterruptible state. And that is exactly what the function_graph tracer can tell.

Again, as a reminder, ftrace only works for kernel level (“system”) execution of a process.

%d bloggers like this: