The Oracle wait interface granularity of measurement

The intention of this blogpost is to show the granularity of Oracle wait time and of Oracle database time measurement in general. One of the reasons for doing this is that the Oracle database switched from using the gettimeofday() function up to version 11.2 to using clock_gettime() to measure time.

This switch is understandable: gettimeofday() is the kernel’s best guess of the wall clock time, while clock_gettime(CLOCK_MONOTONIC,…) is a monotonically increasing timer, which means it is more precise and cannot jump backward, which gettimeofday() can do in certain circumstances, such as time adjustments via NTP.
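To see the difference between the two calls side by side, here is a minimal C sketch (my own illustration, not something taken from the Oracle code; the file name clocks.c is made up): gettimeofday() fills a struct timeval with wall clock time at microsecond precision, while clock_gettime(CLOCK_MONOTONIC,…) fills a struct timespec with nanosecond precision time counted from an arbitrary starting point.

/* clocks.c -- compare gettimeofday() and clock_gettime(CLOCK_MONOTONIC).
 * Compile with: gcc -o clocks clocks.c (add -lrt on older glibc versions). */
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void)
{
	struct timeval  tv;	/* seconds + microseconds, wall clock time */
	struct timespec ts;	/* seconds + nanoseconds, arbitrary starting point */

	gettimeofday(&tv, NULL);
	clock_gettime(CLOCK_MONOTONIC, &ts);

	/* wall clock time: can be stepped forward or backward by an administrator or NTP */
	printf("gettimeofday  : %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
	/* monotonic time: only moves forward, typically counted from boot */
	printf("clock_gettime : %ld.%09ld\n", (long)ts.tv_sec, (long)ts.tv_nsec);
	return 0;
}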

The first thing I wanted to prove was the switch from the gettimeofday() call to the clock_gettime() call. This turned out not to be as simple as I thought.

Because a lot of applications and executables need timing information, which is traditionally obtained via the gettimeofday() call, the Linux kernel maintainers understood the importance of making this call as performant as possible. Calling a (hardware) clock means you request information from a resource on your computer system. Requesting something like that normally requires the program to switch to kernel mode first. In order to lower the resources and time needed to get time information, the Linux kernel includes a “trick” to get that information, which is known as a virtual system call or vsyscall. Essentially this means the information is provided in userspace, so fewer resources are needed and there is no need to switch to kernel mode. James Morle has an excellent article describing this mechanism. By staying in userspace, the gettimeofday() and clock_gettime() calls are “userland” calls, and do not show up when using strace to look at the system calls of an executing process.
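If you want to verify this on your own system, a small test program (again my own illustration; the name time_loop.c is made up) can be run under strace. Assuming the kernel maps the vDSO into the process and the current clock source can be read from userspace, no gettimeofday() or clock_gettime() system calls should appear in the strace output:

/* time_loop.c -- call the timing functions in a tight loop.
 * Compile with: gcc -o time_loop time_loop.c (add -lrt on older glibc versions),
 * then run: strace -c ./time_loop
 * When the vDSO handles these calls, (almost) no gettimeofday or clock_gettime
 * system calls show up in the strace summary. */
#include <sys/time.h>
#include <time.h>

int main(void)
{
	struct timeval  tv;
	struct timespec ts;
	int i;

	for (i = 0; i < 1000000; i++) {
		gettimeofday(&tv, NULL);
		clock_gettime(CLOCK_MONOTONIC, &ts);
	}
	return 0;
}

Note that if the active clock source cannot be read from userspace, glibc falls back to real system calls, and the calls do show up in strace.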

However, as I said, it wasn’t as easy as I thought. My first idea was to make the vsyscalls visible by echoing “0” into /proc/sys/kernel/vsyscall64. However, I am working with kernel version 3.8.13 for this part of the research, which does not have /proc/sys/kernel/vsyscall64, which means I can’t turn off the vsyscall optimisation and make both gettimeofday() and clock_gettime() visible as system calls.

Searching for kernel.vsyscall64 on the internet, I found out that vsyscall64 was removed in early versions of the Linux 3 kernel. This means there is no simple switch I can flip to make the calls visible. So, instead of going straight to the thing I wanted to research, I got stuck doing the necessary preparation and essential preliminary investigation for it.

Can I do it another way? Yes, this can be done using gdb, the GNU debugger. Start up a foreground (Oracle database) session, fetch the process ID of that session, and attach to it with gdb:

gdb -p PID
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...etc...
(gdb)

Now simply break on gettimeofday and clock_gettime, and make gdb notify you when it encounters these calls. First 11.2.0.4:

(gdb) break gettimeofday
Breakpoint 1 at 0x332fc9c7c0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>printf "gettimeofday\n"
>continue
>end
(gdb) break clock_gettime
Breakpoint 2 at 0x3330803e10
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "clock_gettime\n"
>continue
>end
(gdb) 

You can save these gdb macros to time.gdb by executing “save breakpoints time.gdb”. Now execute “c” (continue) and press enter, to make the process you attached to run again. Then execute something very simple, like:

SQL> select * from dual;

This is the result in Oracle 11.2.0.4:

(gdb) c
Continuing.
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
...etc...

That is expected; we already knew the Oracle database executes the gettimeofday function a lot. Now let’s do exactly the same, but with Oracle version 12.1.0.2. If you saved the breakpoints and macros, you can attach to an Oracle 12.1.0.2 foreground process with gdb and execute ‘source time.gdb’ to set the breakpoints and macros. When ‘select * from dual’ is executed in this version of the database, it looks like this:

(gdb) c
Continuing.
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
...etc...

It is clearly (mostly) executing the clock_gettime() function.

The clock_gettime() function can use a variety of time sources. If you read the manpage of clock_gettime, you will see that the first argument is the clock source. You can see the clock sources in the kernel source file include/uapi/linux/time.h, which shows:

/*
 * The IDs of the various system clocks (for POSIX.1b interval timers):
 */
#define CLOCK_REALTIME                  0
#define CLOCK_MONOTONIC                 1
#define CLOCK_PROCESS_CPUTIME_ID        2
#define CLOCK_THREAD_CPUTIME_ID         3
#define CLOCK_MONOTONIC_RAW             4
 ...

The first argument of clock_gettime() is the type of clock, so if I remove the macro attached to the clock_gettime breakpoint, execution stops when clock_gettime() is called:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
        silent
        printf "clock_gettime\n"
        c
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
(gdb) c
Continuing.

Now execute something in the sqlplus session. This is what will happen in the gdb session:

Breakpoint 2, 0x0000003330803e10 in clock_gettime () from /lib64/librt.so.1
(gdb)

Now look up the first argument of the call. On x86-64 Linux, the first argument of a function call is passed in the rdi register:

(gdb) print $rdi
$1 = 1

So Oracle is using CLOCK_MONOTONIC. Not the point of this article, but this means Oracle database time measurement is granular at the microsecond level.
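If you want to see what resolution the clock source itself reports on your system, clock_getres() can be queried. This is a minimal sketch of my own (the name clock_res.c is made up), not something taken from the Oracle code:

/* clock_res.c -- ask the kernel what resolution it reports for CLOCK_MONOTONIC.
 * On a modern Linux system with a TSC-based clock source this typically prints
 * 1 nanosecond; the Oracle wait interface itself externalises its timings in
 * microseconds. Compile with: gcc -o clock_res clock_res.c (add -lrt on older glibc). */
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec res;

	if (clock_getres(CLOCK_MONOTONIC, &res) == 0)
		printf("CLOCK_MONOTONIC resolution: %ld s, %ld ns\n",
		       (long)res.tv_sec, (long)res.tv_nsec);
	return 0;
}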

Now let’s look at how much time the Oracle wait interface takes itself. The Oracle wait interface uses the functions kslwtbctx() (kernel service layer wait begin context) and kslwtectx() (kernel service layer wait end context) to start and stop measuring a wait event. Please mind that instead of looking at the time the wait interface reports, this means looking at the time spent executing the kslwtbctx() and kslwtectx() functions themselves. This can be done using systemtap:

global kslwtbctx_time, kslwtectx_time, kslwtbctx_count=0, kslwtbctx_tot=0, kslwtectx_count=0, kslwtectx_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	if ( pid() == target() ) {
		kslwtbctx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx").return {
	if ( pid() == target() ) {
		printf("kslwtbctx: %12d\n", local_clock_ns()-kslwtbctx_time)
		kslwtbctx_tot+=local_clock_ns()-kslwtbctx_time
		kslwtbctx_count++
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	if ( pid() == target() ) {
		kslwtectx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx").return {
	if ( pid() == target() ) {
		printf("kslwtectx: %12d\n", local_clock_ns()-kslwtectx_time)
		kslwtectx_tot+=local_clock_ns()-kslwtectx_time
		kslwtectx_count++
	}
}
probe end {
	printf("\nkslwtbctx: avg %12d\nkslwtectx: avg %12d\n",kslwtbctx_tot/kslwtbctx_count,kslwtectx_tot/kslwtectx_count)
}

This systemtap script has a ‘begin probe’, which executes once the systemtap script starts running. I simply print ‘Begin.’ with a newline. The idea is that it tells me when the systemtap script is actually running.

Then there is a (userspace) process-based probe on the oracle executable. There are two probes for each of the kslwtbctx and kslwtectx functions in the oracle executable. The first one (.function(“kslwtbctx”)) fires when the function is entered, the second one (.function(“kslwtbctx”).return) fires when the function returns.

The ‘if ( pid() == target() )’ condition filters all the invocations and returns of the probed functions for the PID set with the “-x PID” parameter. Otherwise, any invocation of the probed function by any process would be picked up.

The entry probe simply records the time in nanoseconds in a variable. The return probe subtracts the previously recorded time from the current time, which means the time between entering and returning is shown. Also, the return probe adds the time the function took to another variable, and counts the number of times the return probe has fired.

The end probe shows the total time spent in each of the two functions, divided by the number of times the return probe fired. This way the average time spent in the two functions is calculated. As you will see, the time spent in the functions varies.

When this is executed against an Oracle foreground session, this is what it looks like:

# stap -x 2914 wait_interface.stap
Begin.
kslwtectx:         9806
kslwtbctx:         3182
kslwtectx:         1605
kslwtbctx:         1311
kslwtectx:         4200
kslwtbctx:         1126
kslwtectx:         1014
kslwtbctx:          840
kslwtectx:         4402
kslwtbctx:         2636
kslwtectx:         2023
kslwtbctx:         1586
^C
kslwtbctx: avg         2165
kslwtectx: avg         4305

The times measured are in nanoseconds. The average wait interface overhead is roughly 6 microseconds on my system, including systemtap overhead.

The obvious thought you might have is: “why is this important?”. Well, this is actually important, because the 6us dictates what the wait interface should measure and what it should not measure. What I mean to say is that anything that is called inside the Oracle database for which the time spent is in the same ballpark as the wait interface overhead or lower should not be measured by the wait interface, because the measurement would influence the overall performance in a negative way.

A good example of this is latch gets. The Oracle database does not instrument individual latch gets via the wait interface, but rather exposes waiting for a latch via the wait interface only when a process has spun for it (a latch is a spinlock) and decides to go to sleep (on a semaphore), waiting to be woken once the latch becomes available.

However, using systemtap we can actually measure latch gets! Here is a simple script to measure the latch gets for non-shared latches via the kslgetl() function:

global kslgetl_time, kslgetl_count=0, kslgetl_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl") {
	if ( pid() == target() )
		kslgetl_time=local_clock_ns()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl").return {
	if ( pid() == target() ) {
		printf("kslgetl: %12d\n", local_clock_ns()-kslgetl_time)
		kslgetl_tot+=local_clock_ns()-kslgetl_time
		kslgetl_count++
	}
}
probe end {
	printf("\nkslgetl: avg %12d\n",kslgetl_tot/kslgetl_count)
}

This systemtap script follows exactly the same structure and logic as the previous systemtap script.

This is how it looks on my system when executed against a database foreground session executing ‘select * from dual’:

# stap -x 2914 kslgetl.stap
Begin.
kslgetl:         3363
kslgetl:          786
kslgetl:          744
kslgetl:          782
kslgetl:          721
kslgetl:          707
kslgetl:         1037
kslgetl:          728
kslgetl:          711
kslgetl:          736
kslgetl:          719
kslgetl:          714
kslgetl:         1671
kslgetl:          929
kslgetl:          968
kslgetl:          919
kslgetl:          883
kslgetl:          869
kslgetl:         3030
kslgetl:          750
^C
kslgetl: avg         1362

As you can see, the measured average time spent inside the kslgetl function is 1.3us on my system, which includes systemtap overhead. Clearly, the time for taking a latch is less than the time the wait interface itself takes, which means not instrumenting the kslgetl() function via the wait interface is a sensible choice.

This means that with the current state of the wait interface, it does not make sense to measure a lot more very fine grained events, even though the timer can time at microsecond granularity. Please mind I am not saying that it does not make sense to detail the response time. I think that with modern computer systems with a lot of memory, the Oracle database can run more and more without needing to wait for things like disk IOs. This means modern database servers can spend a lot of time just running on CPU, making it hard to understand in which routines the time is spent. Tuning on-CPU execution requires insight into where time is spent. The only option to understand how CPU time in Oracle is composed is to use external (operating system based) tools like perf and flame graphs to detail CPU time. It would be great if an option existed in the database to break down on-CPU time.

8 comments
  1. Hi Frits,
    thank you very much for this research.

    > This switch is understandable: gettimeofday() is the kernel’s best guess of the wall clock time, while clock_gettime(CLOCK_MONOTONIC,…) is a monotonically increasing timer, which means it is more precise and cannot jump backward, which gettimeofday() can do in certain circumstances, such as time adjustments via NTP.

    Never looked at this so deeply, but it actually makes sense, because gettimeofday() uses the clock_gettime(CLOCK_REALTIME) back end nowadays, afaik. However, are you sure that clock_gettime(CLOCK_MONOTONIC) is not adjusted by NTP due to adjtimex() calls? Unfortunately I have no internet access from my lab to trace ntp.

    Thanks.

    Regards
    Stefan

    • Hi Stefan, thanks for reading.
      CLOCK_REALTIME is a representation of the wall clock time. So if an administrator or a daemon changes it, it jumps accordingly. Also think about DST.
      CLOCK_MONOTONIC is guaranteed to increase only. Some reading up on the inter web shows it can be influenced by NTP, however by being MONOTONIC it avoids “jumping all over the place”.

      Another thing that I constantly run into when searching for CLOCK_MONOTONIC is problems with virtualisation. It seems Xen and Xen-based virtualisation implementations (OVM) account for “stolen time” (CPU ticks a VM should have gotten, but didn’t due to CPU pressure; this is externalised as %steal in sar). I simply don’t know about VirtualBox and VMware. If these use a time source on the CPU, it would be totally independent of stolen ticks.

  2. Hi Frits,
    it’s me again 😉

    > The average wait interface overhead is roughly 6 microseconds on my system, including systemtap overhead.
    > …
    > As you can see, the measured average time spent inside the kslgetl function is 1.3us on my system, which includes systemtap overhead. Clearly, the time for taking a latch is less than the time the wait interface itself takes, which means not instrumenting the kslgetl() function via the wait interface is a sensible choice. This means that with the current state of the wait interface, it does not make sense to measure a lot more very fine grained events, even though the timer can time at microsecond granularity.

    If I understand you correctly, you mean that the wait interface is not able to capture/handle wait events that last shorter than 6 microseconds (minus SystemTap overhead) correctly. Right?

    However, Oracle introduced a microsecond histogram with 12c (http://docs.oracle.com/database/121/REFRN/GUID-7B359986-CC9E-45DE-811E-381D5783519F.htm#REFRN30743) where they break waits down into < 1 us, < 2 us, < 4 us, < 8 us, etc. So it seems like they can use “light-weight” wait interface code or something like that to handle and time these microsecond ranges.

    Thanks.

    Regards
    Stefan

    • Hi Stefan, thanks for reading!
      Let me clarify my point: the wait interface can measure anything, and the resolution depends on the clock source. However, if the activity to be measured is supposed to take less time than the time it takes to measure it, measuring it will influence the performance of the activity severely, especially if the activity is done a lot of times.

      I took a latch get, which I measured at 1.3us, as an example. The measurement (i.e. the time spent in kslwtbctx() and kslwtectx()) takes 6us on my system. If a latch get were measured by the wait interface (which is not done, of course), the measurement would take 6/1.3*100=462% of the time of the latch get itself, or roughly four and a half times as much time. That would not make sense.

      Now for the microsecond histograms in 12c. The Oracle database has the wait interface set up at fixed locations where the database developers suspect the database will spend time waiting. However, sometimes the database does not spend a lot of time in these locations. A great example is ‘SQL*Net message to client’ during a normal execution. There isn’t any waiting in the normal case, and typically you will see very low times (I just measured 1us on my test system). Because the wait ‘SQL*Net message to client’ is typically only triggered a couple of times during execution, it doesn’t impact performance severely.

      So, if there were a wait that is mostly in the 1-2us area and is executed with great repetition in a session, the time spent in it would be mostly the wait interface instead of the activity it is supposed to measure.

      This also means that if IO latency drops to the point of 1-5us, Oracle should rethink the waits surrounding it.

      • Hi Frits,
        glad we are talking about the same thing 🙂

        > So, if there were a wait that is mostly in the 1-2us area and is executed with great repetition in a session, the time spent in it would be mostly the wait interface instead of the activity it is supposed to measure

        … but isn’t this happening with V$EVENT_HISTOGRAM_MICRO? It is tracked all the time (when TIMED_STATISTICS=TRUE) and it has this granularity. So, for example, you would see your latch example in the time ranges < 1 us, < 2 us and < 4 us with different wait counts. So I am not quite sure that Oracle does not have some more “light-weight” measurement code.

        I just ran a quick test case and it counts wait events in the “1 microsecond” range in V$EVENT_HISTOGRAM_MICRO. So maybe they just accept the time in the wait interface (even though it is much more than the measured activity itself), or they have some more “light-weight” measurement code.

        Regards
        Stefan

      • Hi Stefan, I am convinced that the waits in the < 1/2/4 us range are measured via the regular kslwt[be]ctx() functions. I am not aware of alternative wait interface code (which is not a very "hard" argument, it could just as well exist; it means I haven't seen it), nor of a way to switch to something else because of very low latency. So if you see these very low latency numbers, it just means Oracle spent approximately 6us executing the wait interface functions measuring something that took 1/2/4us.

  3. oracle48 said:

    Hello Frits,

    I’m digging into an issue where ‘log file parallel write’ is much smaller than ‘log file sync’, and I was strace-ing the LGWR process. I could see a lot of ‘gettimeofday’ (~35%) and ‘clock_gettime’ (~15%) calls and ‘pwrite’ (~35%); that’s how I came to this blog.

    Just wanting to let you know that I’m on OEL 7.2, kernel 3.10.0-327, running Oracle 12.1.0.2, and I do see these in the output of ‘strace’, without doing anything special.

    Maybe I misunderstood the “gettimeofday() and clock_gettime() calls are ‘userland’ calls, and do not show up when using strace to look at the system calls of an executing process” comment in the post, but just letting you know in case I do not…

    Regards, Ian.

    • Hi Ian, these are good observations!
      Please mind there are more things in play than the ones you mention in your question. First of all, when executing a commit (which is the command that yields ‘log file sync’ waits), there are 3 stages that the process goes through:
      1) generating a commit record and copying the commit record to the public redo log buffer (and copying the change vectors in the private strand and applying the block changes in the buffer cache if private strands are used).
      2) sending a message to the log writer (master process) to request the public redo log buffer to be written.
      3) waiting for the log writer to write the change vectors to disk.
      The time between 2 and 3 is instrumented as ‘log file sync’.

      However, there are a couple of remarks based on the information provided:
      – please be aware that since Oracle version 12.1, Oracle by default uses ‘adaptive scalable log writer workers’, which are LGnn processes that can do the writing instead of the master log writer process, and, as the name indicates, this behaviour can change dynamically.
      – on a normal system you should see io_submit/io_getevents for writes, not pwrite.
      – gettimeofday and clock_gettime by definition are system calls, because they need information from ‘the system’. However, it is weird that you see these calls with strace, because in any normal situation they should get special handling via the vDSO (virtual dynamic shared object), which lowers the resources needed for selected, frequently executed system calls like gettimeofday and clock_gettime: http://man7.org/linux/man-pages/man7/vdso.7.html

      First, more information is needed to answer your question. Are you using RAC? Are you running in archive log mode? How many sockets/cores/threads, and how many active processes?
