This is a blogpost about how I set up my test virtual machines. The seasoned sysadmin and DBA will notice that the techniques used here are perfectly usable for real production environments. The most important point is that there is no need to download or stage any software beforehand: everything is downloaded when it is needed during installation. Obviously this works best when you have reasonable bandwidth available to the internet.

The main infrastructure software components of this setup are:
VirtualBox as the virtualisation software.
Ansible as the configuration and management engine.
Vagrant as the virtualisation manager.

Installation (Mac OSX specific).
– VirtualBox is installed by downloading the installation image and installing it in the normal way.
– Ansible requires a few steps, because it relies on python.
First install pip using easy_install:

$ sudo easy_install pip

Then install ansible for your current user:

$ pip install --user ansible

Now, to pick up the local ansible install, add the following to ~/.bash_profile:

export PATH=$PATH:~/Library/Python/2.7/bin
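
To pick up the new PATH in your current shell without opening a new terminal, source the profile:

$ source ~/.bash_profile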

– Vagrant is installed by downloading the installation image and installing it in the normal way.

My versions:

$ VBoxManage -v
$ ansible --version
  config file = /Users/fritshoogland/.ansible.cfg
  configured module search path = Default w/o overrides
$ vagrant version
Installed Version: 1.8.5
Latest Version: 1.8.5

You're running an up-to-date version of Vagrant!

Okay! Now all the software is installed, on to the next step!

SSH key issue for vagrant (OSX specific).
For ssh key based authentication to work, the private key file needs to be protected by mode 600 (rw-------). Vagrant uses an ssh key to access the virtual machine. I am running the vagrant directory from a share, which forces the mode of the files to be 700. To work around that issue, I set up my own keys in ~/.vagrant_ssh:

$ mkdir .vagrant_ssh
$ cd .vagrant_ssh
$ ssh-keygen -q -N "" -f id_rsa

Later on some settings are made in the Vagrantfile to use the private key just generated.
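
For reference, such settings typically look something like this in a Vagrantfile (a sketch; the exact lines in the repository's Vagrantfile may differ):

config.ssh.insert_key = false
config.ssh.private_key_path = "~/.vagrant_ssh/id_rsa"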

SSH keys for normal authentication (OSX specific).
The Ansible scripts copy your ssh public key and put it in the authorized_keys file of both the oracle and the root user. With your public key in the authorized_keys file of a user, the authentication part of logging on is done via your key pair instead of a password.

However, for this to work, you first need to have a private and public key pair. There are many ways of creating one; this is an example:

Check if you have a dsa key pair:

$ cd ~
$ find .ssh

In this case there are two id_dsa files: the one with the extension ".pub" is the public key, and the one without an extension is the private key. The private key should be kept private at all times. If you don't have id_dsa keys, set them up in the following way:

$ cd ~
$ mkdir .ssh
$ chmod 700 .ssh
$ cd .ssh
$ ssh-keygen -q -N "" -t dsa -f id_dsa

Setting the VirtualBox images directory.
VirtualBox will put the virtual machines in its default machine folder. You can see what the default machine folder is set to using the VBoxManage utility:

$ VBoxManage list systemproperties | grep machine
Default machine folder:          /Volumes/VirtualBox

Because the virtual machines are disk space intensive, you might want to change the location; I later decided to dedicate an external disk to it. The way to get the disk images in another location is to change the VirtualBox settings:

$ VBoxManage setproperty machinefolder /Volumes/Virtualbox

The vagrant ‘box’ location.
Vagrant uses operating system images that it calls 'boxes'. A 'box' is the base image that is copied when a virtual machine is created; it is kept locally so it does not have to be downloaded over and over. The default location of these 'boxes' is ~/.vagrant.d/boxes. If you want to store the 'boxes' somewhere else, you can set the environment variable VAGRANT_HOME to make vagrant use a different location for its global state and the 'boxes'.
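
For example, to keep vagrant's global state and the 'boxes' on the external disk mentioned earlier (the path is just an example), you could add this to your .bash_profile:

export VAGRANT_HOME=/Volumes/VirtualBox/vagrant.d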

Setting up the vagrant directory.
The next step is to set up the directory where the virtual machine will be run. The simplest way of doing so is cloning the files from github:

$ git clone

This will create a ‘Vagrant-oracle-setup’ directory.
Now you need to fill out the Vagrantfile in the directory:
a) line 12: set the hostname. (change optional)
b) line 14: set an IP address. (do not choose an already existing network, like the default virtualbox 10.0.2/24 network)
c) line 18/19: set the memory and CPU capacity for the VM to use.
d) line 41/42: set your My Oracle Support credentials to download the installation media.
e) line 43: set the IP address for ansible, as set at b)/line 14.
f) line 46: set the database name to create. (change optional)
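
For illustration, the hostname, network and sizing entries could end up looking roughly like this (all values are examples; the exact lines depend on the Vagrantfile in the repository):

config.vm.hostname = "oracle-vm"
config.vm.network "private_network", ip: "192.168.56.31"
config.vm.provider "virtualbox" do |vb|
  vb.memory = 4096
  vb.cpus = 2
end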

Install the virtual machine with linux and the oracle database, and set up a database.
Now the cool part: install linux, install the oracle database software and create a database without any manual intervention! For this you need to go into the Vagrant-oracle-setup directory, and execute:

$ vagrant up

That’s it.

You can follow the steps vagrant and ansible are executing. You'll see vagrant first fetching (downloading) the OS image (which it saves so it does not have to be downloaded again), and then copying this image to become the image the virtual machine runs from. Vagrant then applies some settings, mostly network related, and hands control over to ansible.

Ansible executes the playbook as set in the Vagrantfile, which I’ve put in the ‘ansible’ directory. Ansible sets up linux, configures the extra block devices using LVM for the software home and the database, gets the installation media from My Oracle Support, installs the software and then runs DBCA to create a database.

Further things.
Using vagrant to administer the virtual machine is easy. Go into the directory with the Vagrantfile and use:
– vagrant halt: stop the virtual machine.
– vagrant up: for a new machine, get the base image and provision the machine. For an existing machine: startup the virtual machine.
– vagrant destroy: stop the virtual machine and remove all the information it has used like configuration/provisioning status and the disks.
– vagrant provision: run the ansible playbook again.

Anything else? Well, yes, I didn't set up a listener, for example. It should be really simple and straightforward: just start the listener. The listener will start listening on all network interfaces, and the database will register itself. There might be much more, depending on your specific needs.
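
As a minimal sketch, logged on to the virtual machine as the oracle user (with the environment set for the database home), that would be:

$ lsnrctl start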

Thanks to Maris Elsins for getMOSPatch.

This blogpost is about using the linux ftrace kernel facility. If you are familiar with ftrace, and specifically the function_graph tracer, you might already be aware of this functionality. This is Linux specific, and the facility is available at least since kernel 2.6.39 (Oracle's UEK2 kernel).

What is a 'kernel dive'? Whenever a process is running, it should mostly be in 'user mode', executing the program it is supposed to run. However, while running there can be situations (a lot of situations, depending on what the program is doing!) in which the program needs something "from" the system. Such a thing could be allocating memory, or using a device that is shared, like a block device or a network device. These things are controlled by the kernel, and a user process requests them by issuing a system call. Starting from the system call, the execution 'dives' into the kernel and executes in kernel or system mode. However, kernel dives are not limited to system calls; for example a page fault (paging in backing memory for allocated virtual memory) switches to system mode, as does handling an interrupt.

The Linux kernel has a facility called 'ftrace'. The name originally stood for function tracing, but it has evolved into a tracing framework. It's important to point out that ftrace currently ONLY works in KERNELSPACE, which means you miss the userspace code.

Ftrace uses both explicit tracepoints (defined in the linux kernel source) and dynamic tracepoints, for which the gcc -pg (profiling) flag is used to capture function entry. For function exit a 'trampoline' is used: an extra function (mcount) executed at function entry that stores the return address and replaces it with the address of the trampoline, so the exit can be detected.

The linux debugfs filesystem must be mounted for ftrace to work. You can check whether the debugfs filesystem is mounted using:

[root@bigmachine ~]# mount -t debugfs

It does not return any rows if debugfs is not mounted. You can mount debugfs the following way:

[root@bigmachine ~]# mount -t debugfs none /sys/kernel/debug
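
If you want debugfs to be mounted persistently across reboots, an entry along these lines in /etc/fstab does the trick (a sketch):

debugfs  /sys/kernel/debug  debugfs  defaults  0 0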

Let’s do some basic steps first, just tracing an Oracle session!
First go to the tracing directory:

[root@bigmachine ~]# cd /sys/kernel/debug/tracing/
[root@bigmachine tracing]#

Obtain the PID of an Oracle foreground process, and enable tracing for this PID:

[root@bigmachine tracing]# echo 6431 > set_ftrace_pid

We also need to choose what we want to trace. Something very beneficial for understanding what is going on in the kernel is the tracer ‘function_graph’. Here’s how you enable that trace:

[root@bigmachine tracing]# echo function_graph > current_tracer

The trace output is in the file 'trace'. Assuming that the PID to trace is idle, this is what the trace output looks like:

[root@bigmachine tracing]# cat trace
# tracer: function_graph
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |

Now execute something silly (something which does little!).

TS@fv12102 > select * from dual;


Now let’s first show how much information is gathered:

[root@bigmachine tracing]# cat trace | wc -l

That's right: approximately 1900 lines of trace data are created for just 'select * from dual'. Now think about this: most of the things 'select * from dual' does are done in userspace. So what does 'select * from dual' actually do in kernel space? Here's a little grep to see what the Oracle process did:

[root@bigmachine tracing]# grep \|\ \ [a-zA-Z_0-9]*\(\)\ \{ trace
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
...much more...

You are probably aware Oracle executes a lot of times() and getrusage() calls. As you can see, the system calls are also audited by the Linux audit facility. Now just open the trace file with your favourite pager (I use 'less', in which you can use the vi commands to search for pieces of text), and peek into the file:

   0)               |  sys_getrusage() {
   0)               |    getrusage() {
   0)               |      k_getrusage() {
   0)               |        task_cputime_adjusted() {
   0)               |          cputime_adjust() {
   0)   0.045 us    |            nsecs_to_jiffies();
   0)   0.503 us    |          }
   0)   0.876 us    |        }
   0)   0.041 us    |        jiffies_to_timeval();
   0)   0.042 us    |        jiffies_to_timeval();
   0)               |        get_task_mm() {
   0)   0.047 us    |          _raw_spin_lock();
   0)   0.380 us    |        }
   0)               |        mmput() {
   0)   0.041 us    |          _cond_resched();
   0)   0.351 us    |        }
   0)   3.836 us    |      }
   0)   0.043 us    |      _cond_resched();
   0)   4.596 us    |    }
   0)   5.004 us    |  }

I skipped the beginning of the trace, which is actually the end of the kernel code for the Oracle process waiting on its next command. If you scroll down to the end, you will see how waiting on the next command (instrumented by the wait event 'SQL*Net message from client') is actually implemented on the kernel side, using a read function on a pipe. You find the system call shown above, getrusage, just after the end of that read function near the beginning of the trace (or search for 'sys_getrusage').

What is interesting is that this trace shows the different functions in the kernel and which function calls which, made visible by curly braces and indentation, making it look like a C program. In other words: this allows you to see in which specific kernel function the time is spent, and how the total time of a kernel function is built up!

Now that the basic usage is known, let's step up to something more interesting; the getrusage timing is nice, but it was just an example. One really interesting thing is IO.

First clear the trace file:

[root@bigmachine tracing]# echo > trace

Verify that the current tracer still is function_graph:

[root@bigmachine tracing]# cat current_tracer

Verify the process id to trace (it should list the process id of the oracle foreground process you want to trace):

[root@bigmachine tracing]# cat set_ftrace_pid

Now an additional step, to trace only the pread system call using the 'set_graph_function' facility:

[root@bigmachine tracing]# echo sys_pread64 > set_graph_function

Now make the Oracle session that is traced do a pread call. I used ‘select * from t1 where rownum=1’. Because I flushed the buffer cache prior to doing this (to make sure physical IO is needed), I did get 2 physical IOs, one for the segment header and one for the data block.

To make sure nothing else will get into the trace buffer (the file really is a buffer in memory), do the following to stop further tracing:

[root@bigmachine tracing]# echo 0 > tracing_on

This is what my output looks like:
I added line numbers to it (you can do that yourself too with the nl linux utility), so there are a few things I can point you to.
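
If you want to reproduce the numbering yourself, something along these lines works (the output file name is just an example):

[root@bigmachine tracing]# nl trace > /tmp/pread_trace.txt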

The first pread call ends at line# 568. That is a lot of information. It also shows how much stuff is done during an IO.
– The filesystem type matters! At line 7 you see vfs_read (the linux filesystem abstraction layer), but at line 19 you see filesystem specific code!
– At line 27 you see a XFS specific function indicating DIO (direct IO) is used.
– At line 33 you see an interaction with the block layer using the blk_start_plug() function. Linux uses a method to group IOs that is logically equal to filling up and later draining a bathtub using a plug.
– At line 328 you see the IO request has been built by the filesystem code, and the request queue is unplugged: blk_flush_plug_list. You see some IO scheduler functions (starting with ‘elv’) and you see the deadline scheduler is used (deadline_add_request). Not very far after that, we enter the scsi layer (indicated by functions starting with ‘scsi_’).
– At line 431 you see the request being submitted to the device using the mpt_put_msg_frame function. We are now in the low level driver layer. We see there is time involved (82ms, you can see this is a virtual machine). You see all kinds of loops ending here. At this point the IO request has been sent to the IO device. However, there is more interesting stuff coming!
– At line 429 we see the kernel is preparing for waiting on the IO request to return. This is done using the dio_await_completion function.
– At line 436 we are totally done submitting the IO request, and entering the Linux process scheduler. In the scheduler code we see housekeeping (update_blocked_averages, line 456), and rebalancing the process, which means trying to find the best cpu thread to execute on (load_balance, line 462), before the process finally goes to sleep.
– At line 487 the process is woken by an interrupt (not visible in the trace) from its uninterruptible sleep (state ‘D’). The IO then really finishes up; at the function dio_bio_complete, line 500, the data of the IO request is put in a BIO structure which can be passed on to userspace, some further housekeeping is done (dio_complete, line 539), the access time is updated (touch_atime, line 552), and some xfs housekeeping is done, unlocking the inode (xfs_iunlock, line 559).

Let me show you how valuable this tracing is for finding (linux operating system) issues. This is another trace on the very same system with slightly different kernel settings:

Can you spot the difference? If not, take a look at the diff, at line number 115. The function called the second time is gup_huge_pmd. That's a function for handling user pages ('gup' stands for get user pages), and this one is using huge pages! The change I made was setting vm.nr_hugepages from zero to a number higher than the database instance needed. So not only the Oracle database can benefit from huge pages, but also the Linux kernel IO subsystem!

If this whetted your appetite, read on! Here's a trace, once again of pread. Can you see what the difference is here? If not, I created a diff of this trace and the previous one. A nice example of the difference is at line 42: the first trace file contains all kinds of functions doing extent management for the XFS filesystem, up to line 89, which is all replaced with a single function, blkdev_get_block.

The similarity is that it's all the same system call, pread. The difference between the second and the third trace is that for the third pread trace I used Oracle's ASM facility, in other words database IO directly to a block device without a filesystem. This trace shows pread does not execute all the filesystem functions, because there is no filesystem. However, please do realise it's all about time spent, not about how many lines there are in a trace.

Now before jumping to conclusions about the greatness of ASM because it can skip a lot of the code path, please do realise that ASM adds code path inside the Oracle database, because some kind of disk space management must be done. Also there is an entire ASM instance for managing the diskspace (which is NOT part of database sessions’ code path). These are facts, not opinion. There are use cases for both filesystem usage and for ASM, although I have a personal preference for ASM.

Ftrace does not get the attention it deserves. It's a great tool for investigating time spent in the kernel, and it's available simply by mounting the debugfs filesystem. This blogpost describes the function_graph tracer; there are other tracers too, so ftrace is absolutely not limited to what is described here.

To show how ftrace with the function_graph tracer works I took the pread system call as an example. It turns out the pread system call comes in many shapes. Unless you intimately know a system, there can be different layers in the kernel in play when executing pread. However, on a normal system the main time component of the pread system call should be spent off CPU, in uninterruptible state. And that is exactly what the function_graph tracer can show.

Again, as a reminder, ftrace only works for kernel level (“system”) execution of a process.
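
As a practical note: when you are done, you probably want to reset the trace settings. A possible cleanup sequence (run in the tracing directory; a sketch, check the ftrace documentation of your kernel) is:

[root@bigmachine tracing]# echo nop > current_tracer
[root@bigmachine tracing]# echo > set_ftrace_pid
[root@bigmachine tracing]# echo > set_graph_function
[root@bigmachine tracing]# echo 1 > tracing_on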

In a previous article called ‘memory allocation on startup’ I touched on the subject of NUMA; Non Uniform Memory Access. This article is about how to configure NUMA, how to look into NUMA usage and a real life case of NUMA optimisation using in-memory parallel execution.

At this point in time (the start of the summer of 2016) we see that the CPU speed competition has stagnated and settled somewhere below 4 gigahertz, and that instead the number of cores and the size of memory are growing. The commonly used server type in the market I am in is a two socket server. It is not unthinkable that in the near future servers with more than two sockets will increase in popularity, which facilitates the increase in (parallel) computing capacity and the maximum amount of memory.

Of course there already are servers with more than two sockets, some of which have been multi socket for quite a while: the Oracle/Sun M series, the IBM P series, and a large variety of intel based 2+ socket servers, like the Oracle/Sun "-4" and "-8" series. The investigations in this blogpost are done on an SGI UV 300RL, which is a system with (up to) 32 sockets, 576 cores, and 1152 CPU threads.

There already have been a lot of blogposts about Linux and NUMA:
Kevin Closson
Martin Bach: linux-large-pages-and-non-uniform-memory-distribution, _enable_numa_support and numactl
Yves Colin
Bertrand Drouvot
I am sure there will be more, these are the ones that I read in the past.

This article will take you through the configuration of NUMA for the Oracle database and how to measure specifics from the Linux side. It will touch some of the things mentioned in the other blogposts, in order to provide a consistent story.
The database version is:
The operating system is Oracle Linux 7.1
Kernel: 3.8.13-98.4.1.el7uek.x86_64

1. NUMA or not.
In order to sensibly use NUMA optimisation with the Oracle database, you first need to have hardware that is actually NUMA capable. This means multiple sockets in use that have local memory (please mind this includes common two socket server hardware). The next thing needed to use NUMA is NUMA support enabled in the Linux kernel. With the current version of the oracle-preinstall rpm, NUMA is disabled on the kernel load line in grub (numa=off). When NUMA is disabled, numactl will show only one NUMA node (=socket).
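
A quick way to check whether the running kernel was booted with NUMA disabled is to look at the kernel command line:

# grep numa /proc/cmdline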

First make sure you are on NUMA enabled hardware by issuing 'numactl --hardware' and verifying that multiple nodes are listed.

2. Default Oracle.
When you don’t modify (undocumented) NUMA parameters, Oracle will NOT run in NUMA aware mode. In other words, you explicitly need to enable NUMA in the database. This is also visible if you look in the numa_maps, which shows process memory and NUMA information:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3339 mapmax=904 N0=112 N1=98 N2=98 N3=98 N4=98 N5=98 N6=98 N7=97 N8=96 N9=96 N10=96 N11=101 N12=108 N13=106 N14=105 N15=105 N16=107 N17=107 N18=106 N19=109 N20=106 N21=106 N22=107 N23=107 N24=106 N25=107 N26=108 N27=108 N28=109 N29=112 N30=111 N31=113
9c460000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:1 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N0=1

A few things to note: I took the pmon process, and my instance name is slob (pmon_slob). The SGA is allocated exclusively from huge pages, so I grepped for 'huge'. numa_maps only shows the process' paged memory; my SGA is way bigger. The important thing to see here is that the main SGA memory allocation at 80000000 reads 'interleave:0-31', which means the memory is spread over all the NUMA nodes. This can also be seen on the same line, because the memory allocation per node is shown as Nnn=number of pages.

3. Set parameters
In Oracle you need to set two parameters to fully enable NUMA support in the database:

alter system set "_enable_NUMA_support"=true scope=spfile;
alter system set "_px_numa_support_enabled"=true scope=spfile;

These are undocumented parameters. However, these parameters are explicitly mentioned in MOS article ‘What Is Correct Setting for _px_numa_support_enabled for 12c Databases That Wish to Use NUMA?’ (Doc ID 1956463.1).

Once you have set these parameters, Oracle will start up in NUMA aware mode. One of the things that is different in NUMA aware mode, is how the buffercache is created. This can be seen by looking at the numa_maps file of pmon again:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3587 mapmax=904 N0=112 N1=112 N2=112 N3=112 N4=112 N5=112 N6=112 N7=112 N8=112 N9=112 N10=112 N11=112 N12=113 N13=112 N14=113 N15=112 N16=112 N17=112 N18=113 N19=112 N20=112 N21=112 N22=112 N23=112 N24=112 N25=112 N26=112 N27=112 N28=112 N29=112 N30=112 N31=112
8860000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=1536 N25=1536
d260000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=512 N26=512
11c40000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=512 N27=512
16620000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=512 N28=512
1b000000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=512 N29=512
1f9e0000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=256 N30=256
243c0000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=512 N31=512
28da0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=256 N0=256
2d780000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=512 N1=512
32160000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=256 N2=256
36b40000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=512 N3=512
3b520000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=256 N4=256
3ff00000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=512 N5=512
448e0000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=256 N6=256
492c0000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=512 N7=512
4dca0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=256 N8=256
52680000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=512 N9=512
57060000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=256 N10=256
5ba40000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=512 N11=512
60420000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=256 N12=256
64e00000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=512 N13=512
697e0000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=256 N14=256
6e1c0000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=512 N15=512
72ba0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=256 N16=256
77580000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=512 N17=512
7bf60000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=256 N18=256
80940000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=512 N19=512
85320000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=256 N20=256
89d00000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=512 N21=512
8e6e0000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=256 N22=256
930c0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=512 N23=512
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=512 N24=512
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:2 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N5=1

If we look at the output, a few segments are still allocated interleaved (these are the fixed SGA and the shared pool), and we see allocations for every NUMA node. These per NUMA node allocations are the buffer cache.

Okay. Now the database is set up for NUMA, and we checked what this looks like on the operating system level using Linux tools. However, we need to realise that NUMA is no magic that shifts your database into the next gear. Most things you do are not NUMA optimised by the database, simply because it wouldn't make sense.

For example, when I perform a buffered full table scan without parallel query, it will spread out the blocks over the allocated segments. Why? Because my session is running on a random CPU thread, and could run on another CPU thread the next time it is scheduled. What I am saying here is: only specific things are NUMA optimised.

The next logical question would be: what, then, is actually NUMA optimised? One obvious thing that is NUMA optimised is the database writer (this is mentioned in the alert.log at startup). Another NUMA optimised thing is parallel query. If you look at the parallel query worker processes, the status file gives a hint:

# grep allowed_list /proc/$(pgrep p000_slob)/status
Cpus_allowed_list:	450-467,1026-1043
Mems_allowed_list:	0-31

This shows that the first parallel query worker process is limited in which CPU threads it is allowed to use. All parallel query worker processes are limited to ranges of CPU threads. Please mind it is NOT limited in which memory it can use, so it can use both local and remote memory. It may look weird that it can still access remote memory, seemingly ignoring NUMA optimised (=local) memory access.

First the CPU thread ranges. If you look at what the range means (it's rather obvious), you see this:

# lscpu | grep 450
NUMA node25 CPU(s):    450-467,1026-1043

The CPU thread range is exactly a NUMA node. So whatever happens, parallel query worker process p000 can only run on the CPU threads of NUMA node 25!

Then the absence of memory limits: this means the parallel query process can still access anything it needs, so anything that used to work with parallel query still runs as it did without or before the NUMA optimisations. In specific cases, parallel query can take advantage of the NUMA setting.

One such case is in-memory parallel execution. With in-memory parallel execution, the parallel query slaves read blocks via the buffer cache instead of performing a direct path read. In-memory parallel execution is enabled when the parameter parallel_degree_policy is set to auto; the database heuristically determines whether it is feasible.
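
For completeness, enabling it looks like this (whether auto is appropriate for your system is a separate discussion, because it also enables automatic DOP and parallel statement queueing):

alter system set parallel_degree_policy=auto scope=both;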

I set up a table of approximately 700GB, and performed an in-memory parallel query scan on a freshly started instance, in order to have the parallel query worker processes touch as little memory as possible beforehand. At the end of the scan, this is what the numa_maps of parallel query worker process p000 looks like:

# grep huge /proc/$(pgrep p000_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=906 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=4608 N0=144 N1=144 N2=144 N3=144 N4=144 N5=144 N6=144 N7=144 N8=144 N9=144 N10=144 N11=144 N12=144 N13=144 N14=144 N15=144 N16=144 N17=144 N18=144 N19=144 N20=144 N21=144 N22=144 N23=144 N24=144 N25=144 N26=144 N27=144 N28=144 N29=144 N30=144 N31=144
8880000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=50176 N25=50176
d2a0000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N26=6145
11c80000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=4608 N27=4608
16660000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=70 N28=5121
1b040000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=14336 N29=14336
1fa20000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=72 N30=5121
24400000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=4608 N31=4608
28de0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N0=6145
2d7c0000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=4099 N1=4099
321a0000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=2528 mapmax=73 N2=2528
36b80000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=2222 N3=2222
3b560000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=2504 mapmax=69 N4=2504
3ff40000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=2249 N5=2249
44920000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=4754 mapmax=69 N6=4754
49300000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=2296 N7=2296
4dce0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=2463 mapmax=70 N8=2463
526c0000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=2246 N9=2246
570a0000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=2440 mapmax=72 N10=2440
5ba80000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=2233 N11=2233
60460000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=2469 mapmax=69 N12=2469
64e40000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=2241 N13=2241
69820000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=2434 mapmax=69 N14=2434
6e200000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=2342 N15=2342
72be0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=2461 mapmax=70 N16=2461
775c0000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=2329 N17=2329
7bfa0000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=2499 mapmax=71 N18=2499
80980000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=2334 N19=2334
85360000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=2466 mapmax=73 N20=2466
89d40000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=2328 N21=2328
8e720000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=2180 mapmax=69 N22=2180
930e0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=2208 mapmax=69 N23=2208
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=2167 mapmax=68 N24=2167
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=147 N26=2
9c480000000 prefer:0 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=906 N2=1

If you look at memory address 8880000000, you see that parallel query worker process p000 used 50176 pages, while on the other (buffer cache/'prefer') segments it touched somewhere between 2200 and 6150 pages. So it clearly used its local memory more than the remote memory. This is proof that in-memory parallel query is NUMA optimised.

In fact, when I time the query using in-memory parallel query with NUMA enabled, it takes a little less than 8 seconds consistently once the data is loaded in the buffercache. When I disable NUMA optimisations (_enable_NUMA_support=false), and execute the same query, also doing in-memory parallel query, the time goes up to a little less than 24 seconds, only scanning the buffercache.

This is a small blogpost on using ‘perf’. I got an error message when I tried to run ‘perf top’ systemwide:

# perf top
Too many events are opened.
Try again after reducing the number of events

What is actually the case here is described in the perf wiki:

Open file limits
The design of the perf_event kernel interface which is used by the perf tool, is such that it uses one file descriptor per event per-thread or per-cpu.
On a 16-way system, when you do:
perf stat -e cycles sleep 1
You are effectively creating 16 events, and thus consuming 16 file descriptors.

The point of this blogpost is that perf (in Oracle Linux 7.1) says 'too many events', and that, hidden away in the perf wiki, the true reason for the message is made clear: perf opens a file descriptor per cpu thread, which means that on a big system you might get this message if the open files limit (actually a file descriptor limit) is set lower than the number of cpu threads.

You can see the current set limits using ‘ulimit -a’:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 189909832
max locked memory       (kbytes, -l) 21878354152
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

If you are root, you can simply set the 'open files' limit higher than the number of cpu threads, and perf will work:

# ulimit -n 10240
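
To make a higher limit persistent for a specific user, it can be set in /etc/security/limits.conf; for example (user name and value are just examples):

oracle   soft   nofile   10240
oracle   hard   nofile   10240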

Recently I have been presenting on what running on a large intel based NUMA system looks like (on the OTN EMEA tour in Düsseldorf and Milan, and I will be presenting about this at the Dutch AMIS 25th anniversary event in June). The investigation for this presentation was done on an SGI UV 300 machine with 24 terabytes of memory, 32 sockets (=NUMA nodes), 480 cores and 960 threads.

Recently I have been given access to a new version of the UV 300, the UV 300 RL, for which the CPU has moved from Ivy Bridge to Haswell and now has 18 cores per socket instead of 15. This means the number of cores on a fully equipped system is 576, which makes 1152 threads.

Now to get back to the actual purpose of this blogpost: SGA memory allocation on startup. One of the things you can do on such a system is allocate an extremely large amount of memory. In my case I chose to allocate 10 terabytes for the Oracle SGA. When you go (excessively) outside of normal numbers, you run into things, simply because there is a (very) low chance that this has actually been tested, and very few people (or none) have actually done it before.

First let me show you the other technical details:
Operating system: Oracle Linux version 7.1
Kernel: kernel-uek-3.8.13-98.4.1.el7uek.x86_64
Oracle grid infrastructure version:
Oracle database version:
For the database instance the use_large_pages parameter is set to ONLY. It simply does not make sense not to use huge pages in a normal database; in my opinion it really is mandatory if you work with large memory databases.

Firing up the instance…
At first I set sga_target to 10737418240000, or 10 terabytes, and started up the instance. My sqlplus session got busy after pressing enter, and it remained busy. I got coffee downstairs, got back to my office, and it was still busy. Long story short: after 30 minutes it was still busy and I wondered if something was wrong.

What can you do? I first checked the alert.log of the instance. It told me:

Tue May 24 07:52:35 2016
Starting ORACLE instance (normal) (OS id: 535803)
Tue May 24 07:52:35 2016
CLI notifier numLatches:131 maxDescs:3801

Not a lot of information, but also no indication anything was wrong.

Then I looked with the ‘top’ utility what was active at my system:

 535803 oracle    20   0  9.765t  22492  17396 R  99.4  0.0  11:10.56 oracleSLOB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

I see my local bequeath session running at or near 100% CPU. It's working, and that is the correct state when you fire up an Oracle database instance. But what is it doing?

Next logical thing: strace:

[oracle@skynet5 frits]$ strace -fp 535803
Process 535803 attached

Yes, this is meant to show that strace does not show any output, which means the process is not executing any system calls. This does NOT mean the process is doing nothing: we saw with 'top' that it's busy, running close to 100% on a CPU (thread).

When strace does not show anything, the next layer to dive into is the C function layer, with perf. The first thing I tried, and normally try first, is 'perf top'. 'perf top -p' zooms in on one process:

[oracle@skynet5 frits]$ perf top -p 535803
   PerfTop:    4003 irqs/sec  kernel:99.9%  exact:  0.0% [4000Hz cycles],  (target_pid: 535803)

    95.86%  [kernel]   [k] clear_page_c_e
     1.58%  [kernel]   [k] clear_huge_page
     0.41%  [kernel]   [k] __ticket_spin_lock
     0.32%  [kernel]   [k] clear_page
     0.29%  [kernel]   [k] _cond_resched
     0.28%  [kernel]   [k] mutex_lock
     0.11%  [kernel]   [k] apic_timer_interrupt
     0.11%  [kernel]   [k] hugetlb_fault

Ah! This shows what the process is actually doing! We can see it’s executing a function called ‘clear_page_c_e’ in kernel mode in 96% of the samples perf took of the running process. The name of the function describes what it does quite well: it is a kernel function to clear a page.

So this shows that the process starting up the instance is actually spending all of its time clearing memory, but it does not show how this relates to what the Oracle database is actually doing. In order to dig further into what Oracle is doing, we can use perf in another way: record backtraces, and report on them later. What perf record does is sample full backtraces into a file called 'perf.data'. When perf report is later used on that file, it shows the function that was sampled most, and all the backtraces leading to that function (when perf record is called with -g, "call-graph").

[oracle@skynet5 frits]$ perf record -g -p 535803
...after some time: ^C, then:
[oracle@skynet5 frits]$ perf report
# Overhead  Command       Shared Object                                      Symbol
# ........  .......  ..................  ..........................................
    97.14%   oracle  [kernel.kallsyms]   [k] clear_page_c_e
             --- clear_page_c_e
                |--100.00%-- hugetlb_fault
                |          handle_mm_fault
                |          __do_page_fault
                |          do_page_fault
                |          page_fault
                |          slaac_int
                |          slrac
                |          sskgm_segment_notify_action
                |          skgmcrone
                |          skgm_allocate_areas
                |          skgmcreate
                |          ksmcrealm
                |          ksmcsg
                |          opistr_real
                |          opistr
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                 --0.00%-- [...]

What this shows, is again the kernel function ‘clear_page_c_e’ (it also shows [k] to indicate it’s a kernel function), and when we follow the backtrace down, we see
* hugetlb_fault: this is a page fault handling function, which shows that I am using huge pages.
* handle_mm_fault, __do_page_fault, do_page_fault, page_fault: these are linux kernel functions indicating it’s handling a page fault.
* slaac_int, slrac, sskgm_segment_notify_action: these are Oracle database functions. Obviously, slaac_int is the function actually ‘touching’ memory, because it results in a page_fault.

So what is a page fault? A page fault is an interrupt that is raised when a program tries to access memory that has already been allocated into the process’ virtual address space, but not actually loaded into main memory. The interrupt causes the kernel to handle the page fault, and make the page available. Essentially Linux, like most other operating systems, makes pages available once they are actually used instead of when they are ‘just’ allocated. (this is a summary, there is a lot to tell about page faults alone, but that’s beyond the scope of this article)

It's important to realise there is no system call visible that allocates memory (think of mmap(), malloc(), etc.). If you want to know more about these calls, please read up on them using 'man SYSCALLNAME' (on linux, obviously). So, the above backtrace looks like code that deliberately touches the Oracle SGA memory in order to get it truly allocated!

Actually, in my case, with NUMA enabled and a large amount of memory to be allocated, the above process of getting 10 terabytes 'touched' or 'page faulted' can be watched in progress in the proc 'numa_maps' file of the 'bequeathing' process (the process that is starting up the instance). The linux 'watch' utility is handy for this:

[oracle@skynet5 frits]$ watch -n 1 -d cat /proc/535803/numa_maps
00400000 prefer:1 file=/u01/app/oracle/product/ mapped=3124 N21=3124
10e74000 prefer:1 file=/u01/app/oracle/product/ anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:1 file=/u01/app/oracle/product/ anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:1 anon=28 dirty=28 N1=28
129b4000 prefer:1 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=360268 N0=11259 N1=11259 N2=11259 N3=11259 N4=11259 N5=11259 N6=11259 N7=11259 N8
=11259 N9=11259 N10=11259 N11=11259 N12=11258 N13=11258 N14=11258 N15=11258 N16=11258 N17=11258 N18=11258 N19=11258 N20=11258 N21=11258 N22=11258 N23
=11258 N24=11258 N25=11258 N26=11258 N27=11258 N28=11258 N29=11258 N30=11258 N31=11258
fca0000000 prefer:1 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:1 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:1 anon=534 dirty=534 N1=73 N2=461

With ‘watch’, ‘-n 1’ means a 1 second interval, ‘-d’ means highlight differences.
When the instance is starting (with NUMA enabled, and with a huge amount of memory allocated for the SGA), you will first see that it's busy in the line indicated with '80000000' (which is a memory address). In my case it says 'interleave:0-31', which means the memory allocation is done over all the indicated NUMA nodes. Nxx=nr means an allocation of nr pages on NUMA node xx. In my case I see the numbers growing per NUMA node, jumping from one NUMA node to another for some time. This is the shared pool allocation, which is spread out over all NUMA nodes.

However, after some time, you’ll see this:

00400000 prefer:0 file=/u01/app/oracle/product/ mapped=3124 N21=3124
10e74000 prefer:0 file=/u01/app/oracle/product/ anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:0 file=/u01/app/oracle/product/ anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:0 anon=28 dirty=28 N1=28
129b4000 prefer:0 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=516352 N0=16136 N1=16136 N2=16136 N3=16136 N4=16136 N5=16136 N6=16136 N7=16136 N8
=16136 N9=16136 N10=16136 N11=16136 N12=16136 N13=16136 N14=16136 N15=16136 N16=16136 N17=16136 N18=16136 N19=16136 N20=16136 N21=16136 N22=16136 N23
=16136 N24=16136 N25=16136 N26=16136 N27=16136 N28=16136 N29=16136 N30=16136 N31=16136
fca0000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=144384 N25=144384
14320000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=143872 N26=143872
18960000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=12232 N27=12232
1cfa0000000 prefer:0 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:0 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:0 anon=534 dirty=534 N1=73 N2=461

Once the process is done allocating the shared pool, it starts allocating buffer cache memory. It does this per NUMA node. You can see it here at the lines fca0000000, 14320000000 and 18960000000. Actually, I copied these figures while the instance was still allocating memory at memory address 18960000000. You can see that it specifically allocates memory local to a specific NUMA node by the 'prefer:nr' indication.

At the end, when all memory was allocated, the sqlplus session showed the sizes of the different memory pools and opened the database. This took approximately 90 minutes. At first, I thought this had to do with the parameter ‘PRE_PAGE_SGA’. However, this parameter was set at its default value, FALSE.

Some investigating and testing revealed this behaviour (touching the entire SGA) is governed by the hidden parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION, which defaults to TRUE on Linux.
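
If you want to test this yourself, setting the parameter looks like this (it is an undocumented parameter, so use it at your own risk, preferably after consulting Oracle support):

alter system set "_touch_sga_pages_during_allocation"=false scope=spfile;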

Next I tested setting it to FALSE, and my startup time reduced to 3.5 minutes (!).

I used perf record again to look at where the time is spent when _TOUCH_SGA_PAGES_DURING_ALLOCATION is set to FALSE:

# Overhead          Command         Shared Object                                       Symbol
# ........  ...............  ....................  ...........................................
    41.25%  oracle_563318_s  [kernel.kallsyms]     [k] clear_page_c_e
            --- clear_page_c_e
               |--99.99%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          |
               |          |--93.26%-- __intel_new_memset
               |          |          |
               |          |          |--79.48%-- ksmnsin
               |          |          |          |
               |          |          |          |--95.98%-- kcbnfy
               |          |          |          |          kscnfy
               |          |          |          |          ksmcsg
               |          |          |          |          opistr_real
               |          |          |          |          opistr
               |          |          |          |          opiodr
               |          |          |          |          ttcpip
               |          |          |          |          opitsk
               |          |          |          |          opiino
               |          |          |          |          opiodr
               |          |          |          |          opidrv
               |          |          |          |          sou2o
               |          |          |          |          opimai_real
               |          |          |          |          ssthrdmain
               |          |          |          |          main
               |          |          |          |          __libc_start_main
               |          |          |          |
               |          |          |          |--3.32%-- kcbw_setup

This is a new backtrace, which looks like the former backtrace at the top (between clear_page_c_e and page_fault), simply because it's doing exactly the same thing: handling a page fault. However, the Oracle functions are different. What is happening here is that Oracle calls memset() (here the Intel optimised __intel_new_memset), a C library function used to wipe a memory area. If we go further down the stack trace, we see (full speculation!):
* ksmnsin: kernel system memory numa segment initialisation?
* kcbnfy/kscnfy: kernel cache buffers numa FY?/kernel system cache numa FY?
* ksmcsg: kernel system memory create sga?

When looking at the memory allocations in numa_maps during startup of the bequeathing process with _TOUCH_SGA_PAGES_DURING_ALLOCATION set to FALSE, far fewer memory pages are touched. numa_maps only shows truly allocated pages, not the logical memory allocations. The allocation pattern is the same: the shared pool allocations done per numa node (passing over the numa nodes several times), and then the buffer cache allocations per numa node, now also passing over the different numa nodes several times. But because of the much smaller number of pages actually touched, the startup time of the instance is greatly reduced.

So, is this a solution to the long startup time?
On one hand yes, on the other hand no. What that means is: you need to page the memory, no matter what you do. If you choose not to page the memory during startup (_TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE), you need to do it later. Sounds logical, right: pay now, or pay later? But what does that mean?

After startup of the instance with _TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE, a lot of the memory is not really allocated yet. If you start to use the new pristine SGA with a normal session, you pay the price of the paging which the bequeathing session otherwise took. Let me show you a 'perf record -g' example of a full table scan (which reads the Oracle blocks into the buffercache, not doing a direct path read):

# Overhead          Command       Shared Object                                Symbol
# ........  ...............  ..................  ....................................
    61.25%  oracle_564482_s  [kernel.kallsyms]   [k] clear_page_c_e
            --- clear_page_c_e
               |--99.99%-- hugetlb_fault
               |          |
               |          |--99.86%-- handle_mm_fault
               |          |          __do_page_fault
               |          |          do_page_fault
               |          |          page_fault
               |          |          __intel_ssse3_rep_memcpy
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

If we follow the call stack from bottom to top:
* qergsFetch, qertbFetch: query execute row source code.
* kdsttgr: kernel data scan table get row.
* kdstf00000010000100kmP: kernel data scan table full, the ultra fast full table scan.
* kdst: other functions in kernel data scan table functions.
* ktrget2: kernel transaction layer.
* kcbgtcr: kernel cache buffers get consistent row.
* kcbz: kernel cache buffers Z, physical IO helper functions.
* __intel_ssse3_rep_memcpy: this is actually the function memcpy, but replaced by an optimised version for intel CPUs.
* page_fault, do_page_fault, __do_page_fault, handle_mm_fault, hugetlb_fault, clear_page_c_e: these are the same page fault kernel functions we saw earlier.

This clearly shows my full table scan now needs to do the paging!

If I set _TOUCH_SGA_PAGES_DURING_ALLOCATION to TRUE, start up the instance (during which the pages are touched and thus paged), and profile a full table scan, I see:

# Overhead          Command         Shared Object                                    Symbol
# ........  ...............  ....................  ........................................
    24.88%  oracle_577611_s  oracle                [.] ksl_get_shared_latch
            --- ksl_get_shared_latch
               |--78.46%-- kcbzfb
               |          kcbzgb
               |          |
               |          |--99.63%-- kcbzgm
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

This shows that the most prominent function called is 'ksl_get_shared_latch'. No paging to be seen.

At this point the reason for having _TOUCH_SGA_PAGES_DURING_ALLOCATION should be clear. The question I had at this point was: what about PRE_PAGE_SGA? In essence, this parameter is supposed to more or less solve the same issue: having the SGA pages touched at startup to prevent paging for foreground sessions.
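
For reference, PRE_PAGE_SGA is a documented, static parameter; setting it looks like this:

alter system set pre_page_sga=true scope=spfile;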

By the way, if you read about PRE_PAGE_SGA in the online documentation, it gives a reason for using PRE_PAGE_SGA that is not true (that page table entries are prebuilt for the SGA pages), and it indicates the paging (=page faults) is done at startup, which is also not true. It also claims 'every process that starts must access every page in the SGA'; again, this is not true.

From what I can see, what happens when PRE_PAGE_SGA is set to true is that a background process is started that touches all SGA pages AFTER the instance has started and is open for usage. The background process I witnessed is 'sa00'. When recording the backtraces of that process, I see:

# Overhead        Command      Shared Object                                      Symbol
# ........  .............  .................  ..........................................
    97.57%  ora_sa00_slob  [kernel.kallsyms]  [k] clear_page_c_e
            --- clear_page_c_e
               |--100.00%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          ksmprepage_memory
               |          ksm_prepage_sga_seg
               |          skgmapply
               |          ksmprepage
               |          ksm_sslv_exec_cbk
               |          ksvrdp
               |          opirip
               |          opidrv
               |          sou2o
               |          opimai_real
               |          ssthrdmain
               |          main
               |          __libc_start_main
                --0.00%-- [...]

The kernel paging functions are exactly the same as we have seen several times now. It's clear the functions executed by this process are specifically for the prepage functionality. The pre-paging done on behalf of _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE is done as part of the SGA creation and allocation (as can be seen from the Oracle function names). PRE_PAGE_SGA seems to be a 'workaround' if you don't want to spend the long time paging on startup, but still want to page the memory as soon as possible after startup. Needless to say, this is not the same as _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE: PRE_PAGE_SGA paging is done serially by a single process after startup, when the database is already open for usage. So normal foreground processes that encounter non-paged memory, which means they use it before the sa00 process pages it, still need to do the paging themselves.

If you allocate a large SGA with Oracle (this may apply to earlier versions too), the startup time can be significant. The reason for that is the bequeathing session pages the memory during startup. This can be turned off by setting the undocumented parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION to FALSE. As a result, foreground (normal user) sessions need to do the paging. You can set the PRE_PAGE_SGA parameter to TRUE to do the paging, however the paging is then done by a single process (sa00) that serially pages the memory after startup. Foreground processes that encounter non-paged memory, which means they use it before the sa00 process could page it, need to page it themselves.
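
For reference, this is roughly how these parameters could be set. This is a sketch only; test it first, and be careful: underscore parameters are undocumented and should normally only be set on the advice of Oracle support.

$ sqlplus / as sysdba <<EOF
-- skip paging the SGA during startup:
alter system set "_touch_sga_pages_during_allocation"=false scope=spfile;
-- optionally have the sa00 process page the SGA after startup:
alter system set pre_page_sga=true scope=spfile;
EOF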

Thanks to: Klaas-Jan Jongsma for proofreading.

I was setting up disk devices for ASM in Oracle Linux 7. I knew things had changed between Oracle Linux 6 and Oracle Linux 7, but only a little bit.

First of all, let’s take a look at the current disk usage. To see what disk devices are visible and how they are used, use lsblk:

# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0           2:0    1    4K  0 disk
sda           8:0    0   40G  0 disk
├─sda1        8:1    0  500M  0 part /boot
└─sda2        8:2    0 39.5G  0 part
  ├─ol-swap 253:0    0  3.9G  0 lvm  [SWAP]
  └─ol-root 253:1    0 35.6G  0 lvm  /
sdb           8:16   0   40G  0 disk
sr0          11:0    1 1024M  0 rom

As you can see, I got a floppy disk drive (fd0), a device sda which contains two partitions, sda1 and sda2. The partition sda1 is mounted to /boot. The second partition sda2 is used by LVM, and contains two logical volumes, one for swap, the other has a filesystem that is mounted on /. The device sdb is currently not used (I want to use that for ASM), and there is a device sr0, which is a DVD drive.
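
If you want to be sure a disk really is unused before handing it to ASM, a quick check is to look for existing filesystem, LVM or RAID signatures with blkid; if it prints nothing for the device, no known signature was found:

# blkid /dev/sdb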

The only correct way to use udev (that I am aware of) for disk devices is using the UUID of the disk. Please mind that if there are multiple connections to the disk/slice presented to the host, which is typically the case with fibre channel connections to a SAN, you should use multipathd to create a multipath device first, and use that device. Each path towards a SAN slice shows up as a separate disk device. The way to distinguish disk devices from paths is the UUID (with a subtle difference between the SCSI UUID and the WWID), which is exactly what the multipath daemon uses too. In my setup there are no multiple paths to a disk device.
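
If you do have multiple paths, the multipath devices, their WWIDs and the underlying paths can be listed with the multipath utility, for example:

# multipath -ll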

I want to use sdb as a disk for ASM. In order to make sdb accessible for ASM, it needs to have its user and group set to oracle. I find it convenient to have the devices for use with ASM in a separate place (/dev/oracleasm/), so I can point my ASM_DISKSTRING to that entire directory. This is what we are going to use udev for.

First we need to obtain the UUID. One way of doing this, is via lsscsi:

# lsscsi -i
[1:0:0:0]    cd/dvd  NECVMWar VMware IDE CDR10 1.00  /dev/sr0   -
[2:0:0:0]    disk    VMware,  VMware Virtual S 1.0   /dev/sda   36000c29eea23a3f6b958ca77007bac53
[2:0:1:0]    disk    VMware,  VMware Virtual S 1.0   /dev/sdb   36000c2912554c8f44edc68106edd5040
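
If lsscsi is not installed, the same identifier can be obtained by calling scsi_id directly; on OL7 it is not in the default PATH but lives in /usr/lib/udev. This should print the same value as the ID_SERIAL used in the rule below:

# /usr/lib/udev/scsi_id --whitelisted --device=/dev/sdb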

Now we need to create a udev rules file to make udev create a device for us in /dev/oracleasm, and set the correct permissions. A rules file should be created in /etc/udev/rules.d/. Typically, rules files start with a number, which is used by the udev daemon to understand the sequence/ordering of the rules. In OL7 there are no additional rules files in /etc/udev/rules.d/, which is different from OL6, where multiple rules files are already present.

# vi /etc/udev/rules.d/99-oracle-asmdevices.rules
KERNEL=="sd*", ENV{ID_SERIAL}=="36000c2912554c8f44edc68106edd5040", SYMLINK+="oracleasm/disk1", OWNER="oracle", GROUP="dba", MODE="0660"

This is the simplest form such a rule can have. The KERNEL match key says it should only consider devices matching ‘sd*’, ENV{ID_SERIAL} matches the UUID, SYMLINK creates a symbolic link, and OWNER, GROUP and MODE are obvious. In OL6, you could use NAME instead of SYMLINK, which would create the device node indicated and remove the device from its original place (/dev/sdb in my case). In OL7, NAME is not allowed anymore.

At this point a link has been created with the correct permissions to be used with Oracle ASM.
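
Once the rule is active (see below for testing and triggering), you can verify the result and point the ASM instance to the directory. A sketch, assuming the diskstring should cover everything in /dev/oracleasm:

# ls -l /dev/oracleasm/disk1     # symbolic link pointing to the real device
# ls -l /dev/sdb                 # device node, should now be owned by oracle:dba with mode 0660

$ sqlplus / as sysasm <<EOF
alter system set asm_diskstring='/dev/oracleasm/*';
EOF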

However, udev is a black box to a lot of DBAs and sysadmins. Let’s look a bit deeper into udev!

Obtaining information about a device.
In my case, I wanted to use /dev/sdb. You can look what udev information exists about /dev/sdb by using:

# udevadm info --name sdb
P: /devices/pci0000:00/0000:00:10.0/host2/target2:0:1/2:0:1:0/block/sdb
N: sdb
S: disk/by-id/scsi-36000c2912554c8f44edc68106edd5040
S: disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040
S: disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0
E: DEVLINKS=/dev/disk/by-id/scsi-36000c2912554c8f44edc68106edd5040 /dev/disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040 /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0
E: DEVNAME=/dev/sdb
E: DEVPATH=/devices/pci0000:00/0000:00:10.0/host2/target2:0:1/2:0:1:0/block/sdb
E: ID_BUS=scsi
E: ID_MODEL=VMware_Virtual_S
E: ID_MODEL_ENC=VMware\x20Virtual\x20S
E: ID_PATH=pci-0000:00:10.0-scsi-0:0:1:0
E: ID_PATH_TAG=pci-0000_00_10_0-scsi-0_0_1_0
E: ID_SCSI_SERIAL=6000c2912554c8f44edc68106edd5040
E: ID_SERIAL=36000c2912554c8f44edc68106edd5040
E: ID_SERIAL_SHORT=6000c2912554c8f44edc68106edd5040
E: ID_TYPE=disk
E: ID_VENDOR_ENC=VMware\x2c\x20
E: ID_WWN=0x6000c2912554c8f4
E: ID_WWN_VENDOR_EXTENSION=0x4edc68106edd5040
E: ID_WWN_WITH_EXTENSION=0x6000c2912554c8f44edc68106edd5040
E: TAGS=:systemd:

This is all the information that udev sees currently, with no (custom) udev rule applied to it.
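
If you are only interested in the property needed for the rule (ID_SERIAL), udevadm can also print the properties one per line, which is easy to grep:

# udevadm info --query=property --name sdb | grep ID_SERIAL
ID_SERIAL=36000c2912554c8f44edc68106edd5040
ID_SERIAL_SHORT=6000c2912554c8f44edc68106edd5040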

Testing udev rules.
Next up, we need to write a udev rules file. You can test the results of a udev rule using udevadm test. I’ve put 'KERNEL=="sd*", ENV{ID_SERIAL}=="36000c2912554c8f44edc68106edd5040", NAME+="oracleasm/disk1", OWNER="oracle", GROUP="dba", MODE="0660"' in /etc/udev/rules.d/99-oracle-asmdevices.rules. This is the OL6 version of the rule, which uses NAME as a key, which is not allowed in OL7. Let’s test this specific rule:

# udevadm test /sys/block/sdb
calling: test
version 219
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

=== trie on-disk ===
tool version:          219
file size:         6984832 bytes
header size             80 bytes
strings            1805856 bytes
nodes              5178896 bytes
Load module index
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
Reading rules file: /usr/lib/udev/rules.d/10-dm.rules
Reading rules file: /usr/lib/udev/rules.d/100-balloon.rules
Reading rules file: /usr/lib/udev/rules.d/11-dm-lvm.rules
Reading rules file: /usr/lib/udev/rules.d/13-dm-disk.rules
Reading rules file: /usr/lib/udev/rules.d/40-redhat.rules
Reading rules file: /usr/lib/udev/rules.d/42-usb-hid-pm.rules
Reading rules file: /usr/lib/udev/rules.d/50-udev-default.rules
Reading rules file: /usr/lib/udev/rules.d/60-alias-kmsg.rules
Reading rules file: /usr/lib/udev/rules.d/60-cdrom_id.rules
Reading rules file: /usr/lib/udev/rules.d/60-drm.rules
Reading rules file: /usr/lib/udev/rules.d/60-keyboard.rules
Reading rules file: /usr/lib/udev/rules.d/60-net.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-alsa.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-input.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-serial.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-storage-tape.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-storage.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-v4l.rules
Reading rules file: /usr/lib/udev/rules.d/60-raw.rules
Reading rules file: /usr/lib/udev/rules.d/61-accelerometer.rules
Reading rules file: /usr/lib/udev/rules.d/64-btrfs.rules
Reading rules file: /usr/lib/udev/rules.d/69-dm-lvm-metad.rules
Reading rules file: /usr/lib/udev/rules.d/70-mouse.rules
Reading rules file: /usr/lib/udev/rules.d/70-power-switch.rules
Reading rules file: /usr/lib/udev/rules.d/70-touchpad.rules
Reading rules file: /usr/lib/udev/rules.d/70-uaccess.rules
Reading rules file: /usr/lib/udev/rules.d/71-biosdevname.rules
Reading rules file: /usr/lib/udev/rules.d/71-seat.rules
Reading rules file: /usr/lib/udev/rules.d/73-idrac.rules
Reading rules file: /usr/lib/udev/rules.d/73-seat-late.rules
Reading rules file: /usr/lib/udev/rules.d/75-net-description.rules
Reading rules file: /usr/lib/udev/rules.d/75-probe_mtd.rules
Reading rules file: /usr/lib/udev/rules.d/75-tty-description.rules
Reading rules file: /usr/lib/udev/rules.d/78-sound-card.rules
Reading rules file: /usr/lib/udev/rules.d/80-drivers.rules
Reading rules file: /usr/lib/udev/rules.d/80-net-name-slot.rules
Reading rules file: /usr/lib/udev/rules.d/80-net-setup-link.rules
Reading rules file: /usr/lib/udev/rules.d/81-kvm-rhel.rules
Reading rules file: /usr/lib/udev/rules.d/85-nm-unmanaged.rules
Reading rules file: /usr/lib/udev/rules.d/90-alsa-tools-firmware.rules
Reading rules file: /usr/lib/udev/rules.d/90-iprutils.rules
Reading rules file: /usr/lib/udev/rules.d/90-vconsole.rules
Reading rules file: /usr/lib/udev/rules.d/91-drm-modeset.rules
Reading rules file: /usr/lib/udev/rules.d/95-dm-notify.rules
Reading rules file: /usr/lib/udev/rules.d/95-udev-late.rules
Reading rules file: /usr/lib/udev/rules.d/98-kexec.rules
Reading rules file: /etc/udev/rules.d/99-oracle-asmdevices.rules
Reading rules file: /usr/lib/udev/rules.d/99-systemd.rules
rules contain 24576 bytes tokens (2048 * 12 bytes), 12641 bytes strings
1896 strings (23762 bytes), 1263 de-duplicated (11755 bytes), 634 trie nodes used
IMPORT 'scsi_id --export --whitelisted -d /dev/sdb' /usr/lib/udev/rules.d/60-persistent-storage.rules:40
starting 'scsi_id --export --whitelisted -d /dev/sdb'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_SCSI=1'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_VENDOR=VMware_'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_VENDOR_ENC=VMware\x2c\x20'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_MODEL=VMware_Virtual_S'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_MODEL_ENC=VMware\x20Virtual\x20S'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_REVISION=1.0'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_TYPE=disk'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_SERIAL=36000c2912554c8f44edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_SERIAL_SHORT=6000c2912554c8f44edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_WWN=0x6000c2912554c8f4'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_WWN_VENDOR_EXTENSION=0x4edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_WWN_WITH_EXTENSION=0x6000c2912554c8f44edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_SCSI_SERIAL=6000c2912554c8f44edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb' [3130] exit with return code 0
LINK 'disk/by-id/scsi-36000c2912554c8f44edc68106edd5040' /usr/lib/udev/rules.d/60-persistent-storage.rules:42
IMPORT builtin 'path_id' /usr/lib/udev/rules.d/60-persistent-storage.rules:55
LINK 'disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0' /usr/lib/udev/rules.d/60-persistent-storage.rules:56
IMPORT builtin 'blkid' /usr/lib/udev/rules.d/60-persistent-storage.rules:72
probe /dev/sdb raid offset=0
LINK 'disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040' /usr/lib/udev/rules.d/60-persistent-storage.rules:82
no db file to read /run/udev/data/+scsi:2:0:1:0: No such file or directory
OWNER 54321 /etc/udev/rules.d/99-oracle-asmdevices.rules:1
GROUP 54322 /etc/udev/rules.d/99-oracle-asmdevices.rules:1
MODE 0660 /etc/udev/rules.d/99-oracle-asmdevices.rules:1
NAME="oracleasm/disk1" ignored, kernel device nodes can not be renamed; please fix it in /etc/udev/rules.d/99-oracle-asmdevices.rules:1
handling device node '/dev/sdb', devnum=b8:16, mode=0660, uid=54321, gid=54322
preserve permissions /dev/sdb, 060660, uid=54321, gid=54322
preserve already existing symlink '/dev/block/8:16' to '../sdb'
found 'b8:16' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fscsi-36000c2912554c8f44edc68106edd5040'
creating link '/dev/disk/by-id/scsi-36000c2912554c8f44edc68106edd5040' to '/dev/sdb'
preserve already existing symlink '/dev/disk/by-id/scsi-36000c2912554c8f44edc68106edd5040' to '../../sdb'
found 'b8:16' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fwwn-0x6000c2912554c8f44edc68106edd5040'
creating link '/dev/disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040' to '/dev/sdb'
preserve already existing symlink '/dev/disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040' to '../../sdb'
found 'b8:16' claiming '/run/udev/links/\x2fdisk\x2fby-path\x2fpci-0000:00:10.0-scsi-0:0:1:0'
creating link '/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0' to '/dev/sdb'
preserve already existing symlink '/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0' to '../../sdb'
created db file '/run/udev/data/b8:16' for '/devices/pci0000:00/0000:00:10.0/host2/target2:0:1/2:0:1:0/block/sdb'
ACTION=KERNEL=="sd*", ENV{ID_SERIAL}=="36000c2912554c8f44edc68106edd5040", NAME+="oracleasm/disk1", OWNER="oracle", GROUP="dba", MODE="0660"
DEVLINKS=/dev/disk/by-id/scsi-36000c2912554c8f44edc68106edd5040 /dev/disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040 /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0
Unload module index
Unloaded link configuration context.

As you can see near the end of the output, udev recognised the device from the rule and applied the OWNER, GROUP and MODE settings, but it ignored NAME, because kernel device nodes can not be renamed. When you need to use udev, it’s very important to run a test like this: it checks the validity of the syntax, shows whether the rule actually does what you want it to do, and shows which rules file and line is applied (the rules file and line number are printed at the end of the OWNER, GROUP, MODE and NAME lines in the output).

Applying a udev rule from a rules file.
After you have tested your new udev rule and made sure it does what you want, you need to apply the rule. You can apply it specifically for this device using the following command:

# udevadm trigger --name sdb
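
If the rule does not appear to be picked up because the rules file was added or changed while the udev daemon was already running, you may need to make udev reload its rules first, and then trigger again:

# udevadm control --reload-rules
# udevadm trigger --name sdb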

The intention of this blogpost is to show the Oracle wait time granularity and the Oracle database time measurement granularity. One of the reasons for doing this is that the Oracle database switched from the gettimeofday() function, which it used up to version 11.2, to clock_gettime() to measure time.

This switch is understandable, as gettimeofday() is a best guess of the kernel of the wall clock time, while clock_gettime(CLOCK_MONOTONIC,…) is a monotonically increasing timer, which means it is more precise and can not drift backward, which gettimeofday() can do in certain circumstances, like time adjustments via NTP.

The first thing I wanted to prove was the switch from the gettimeofday() call to the clock_gettime() call. This turned out not to be as simple as I thought.

Because a lot of applications and executables need timing information, which is traditionally obtained via the gettimeofday() call, the Linux kernel maintainers understood the importance of making this call as performant as possible. Calling a (hardware) clock means requesting information from a resource on your computer system, and such a request requires the program to switch to kernel mode first. In order to lower the resources and time needed to get time information, the Linux kernel includes a “trick”, known as a virtual system call or vsyscall. Essentially this means the information is provided in userspace, so fewer resources are needed and there is no need to switch to kernel mode. James Morle has an excellent article describing this. By staying in userspace, the gettimeofday() and clock_gettime() calls are “userland” calls, and do not show up when using “strace” to look at the system calls of an executing process.
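
You can see this for yourself by stracing a simple command that requests the current time; on a kernel and glibc where these calls are serviced via the vsyscall/vDSO mechanism, little or nothing will show up (a sketch, the outcome depends on the kernel and glibc versions in use):

$ strace -e trace=gettimeofday,clock_gettime date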

However, I said it wasn’t as easy as I thought. I was looking into this, and thought of making the vsyscalls visible by echoing “0” into /proc/sys/kernel/vsyscall64. However, I am working with kernel version 3.8.13 for this part of the research… which does not have /proc/sys/kernel/vsyscall64, which means I can’t turn off the vsyscall optimisation and make both gettimeofday() and clock_gettime() visible as system calls.

Searching for kernel.vsyscall64 on the internet, I found out that vsyscall64 was removed in the early versions of the Linux 3 kernel. This means I can’t use a simple switch to make the calls visible. So, instead of going straight to the thing I wanted to research, I got stuck doing the necessary preparation and essential preliminary investigation for it.

Can I do it in another way? Yes, this can be done using gdb, the GNU debugger. Start up a foreground (Oracle database) session, fetch the process ID of that session, and attach to it with gdb:

gdb -p PID
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>

Now simply break on gettimeofday and clock_gettime, and make gdb notify you each time it encounters one of these calls. First set a breakpoint on each function and attach commands to it:

(gdb) break gettimeofday
Breakpoint 1 at 0x332fc9c7c0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>printf "gettimeofday\n"
(gdb) break clock_gettime
Breakpoint 2 at 0x3330803e10
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>printf "clock_gettime\n"

You can save this gdb macro to time.gdb by executing “save breakpoints time.gdb”. Now execute “c” (continue) and enter, to make the process you attached to run again. Execute something very simple, like:

SQL> select * from dual;

With the Oracle version that still uses gettimeofday(), this results in the gdb session printing ‘gettimeofday’ over and over again:

(gdb) c

That is expected; we already knew the Oracle database executes the gettimeofday function a lot. Now let’s do exactly the same, but with a later Oracle version. If you saved the breakpoints and macros, you can attach to an Oracle foreground process with gdb and execute ‘source time.gdb’ to set the breakpoints and macros. When the ‘select * from dual’ is executed in this version of the database, it looks like this:

(gdb) c

It is clearly (mostly) executing the clock_gettime() function.

The clock_gettime() function can use a variety of time sources. If you read the manpage of clock_gettime you will see that the first argument is the clock source. You can see the clock sources in the kernel source file Linux/include/uapi/linux/time.h, which shows:

/*
 * The IDs of the various system clocks (for POSIX.1b interval timers):
 */
#define CLOCK_REALTIME                  0
#define CLOCK_MONOTONIC                 1
#define CLOCK_PROCESS_CPUTIME_ID        2
#define CLOCK_THREAD_CPUTIME_ID         3
#define CLOCK_MONOTONIC_RAW             4

The first argument of clock_gettime is the type of clock, so if I remove the macro attached to the clock_gettime breakpoint, execution stops when clock_gettime is called, and I can inspect the argument:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        printf "gettimeofday\n"
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
        printf "clock_gettime\n"
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        printf "gettimeofday\n"
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
(gdb) c

Now execute something in the sqlplus session. What will happen in the gdb session is:

Breakpoint 2, 0x0000003330803e10 in clock_gettime () from /lib64/

Now look up the first argument of the call:

(gdb) print $rdi
$1 = 1

So Oracle is using CLOCK_MONOTONIC. Not the point of this article, but this means Oracle database time measurement is granular at the microsecond level.

Now let’s look how much time the Oracle wait interface takes itself. The Oracle wait interface is using the functions kslwtbctx() (kernel service layer wait begin context) and kslwtectx() (kernel service layer wait end context) to start and stop measuring a wait event. Please mind that instead of looking at the time the wait interface provides, this means looking at the time that is taken executing in the kslwtbctx() and kslwtectx() functions. This can be done using systemtap:

global kslwtbctx_time, kslwtectx_time, kslwtbctx_count=0, kslwtbctx_tot=0, kslwtectx_count=0, kslwtectx_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/").function("kslwtbctx") {
	if ( pid() == target() ) kslwtbctx_time=local_clock_ns()
}
probe process("/u01/app/oracle/product/").function("kslwtbctx").return {
	if ( pid() == target() ) {
		printf("kslwtbctx: %12d\n", local_clock_ns()-kslwtbctx_time)
		kslwtbctx_tot+=local_clock_ns()-kslwtbctx_time
		kslwtbctx_count++
	}
}
probe process("/u01/app/oracle/product/").function("kslwtectx") {
	if ( pid() == target() ) kslwtectx_time=local_clock_ns()
}
probe process("/u01/app/oracle/product/").function("kslwtectx").return {
	if ( pid() == target() ) {
		printf("kslwtectx: %12d\n", local_clock_ns()-kslwtectx_time)
		kslwtectx_tot+=local_clock_ns()-kslwtectx_time
		kslwtectx_count++
	}
}
probe end {
	printf("\nkslwtbctx: avg %12d\nkslwtectx: avg %12d\n",kslwtbctx_tot/kslwtbctx_count,kslwtectx_tot/kslwtectx_count)
}

This systemtap script has a ‘begin probe’, which executes once the systemtap script starts running. I simply print ‘Begin.’ with a newline. The idea is that it prompts me once the systemtap script is actually running.

Then there is a (userspace) process based probe for the oracle process. There are two probes for both the kslwtbctx and the kslwtectx function in the oracle process. The first one (.function(“kslwtbctx”)) fires when the function is entered, the second one (.function(“kslwtbctx”).return) fires when the function has ended.

The ‘if ( pid() == target() )’ condition filters the invocations and returns of the probed functions to the PID set with the “-x PID” option. Otherwise any invocation of the probed function by any process would be picked up.

The entering probe simply records the time in nanoseconds in a variable. The returning probe subtracts the previous recorded time from the current time, which means the time between entering and returning is shown. Also, the returning probe adds the time the function took to another variable, and counts the number of times the return probe has fired.

The end probe shows the total time spent in each of the two functions, divided by the number of times the return probe was fired. This way the average time spent in the two functions is calculated. As you will see, the time spent in the functions varies.

When this is executed against an Oracle foreground session, this is what it looks like:

# stap -x 2914 wait_interface.stap
kslwtectx:         9806
kslwtbctx:         3182
kslwtectx:         1605
kslwtbctx:         1311
kslwtectx:         4200
kslwtbctx:         1126
kslwtectx:         1014
kslwtbctx:          840
kslwtectx:         4402
kslwtbctx:         2636
kslwtectx:         2023
kslwtbctx:         1586
kslwtbctx: avg         2165
kslwtectx: avg         4305

The time measured is in nanoseconds. The average wait interface overhead is roughly 6 microseconds including systemtap overhead on my system.

The obvious thought you might have is: “why is this important?”. Well, this is actually important, because these roughly 6us dictate what the wait interface should measure, and what it should not measure. What I mean to say is that anything called inside the Oracle database for which the time spent is in the same ballpark as the wait interface overhead, or lower, should not be measured by the wait interface, because the measurement would influence the overall performance in a negative way.

A good example of this are latch gets. The Oracle database does not instrument individual latch gets via the wait interface, but rather exposes waiting for a latch via the wait interface when a process has spun for it (a latch is a spinlock), and decides to go to sleep (on a semaphore) waiting to get woken once the latch becomes available.

However, using systemtap we can actually measure latch gets! Here is a simple script to measure the latch gets for non-shared latches via the kslgetl() function:

global kslgetl_time, kslgetl_count=0, kslgetl_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/").function("kslgetl") {
	if ( pid() == target() ) kslgetl_time=local_clock_ns()
}
probe process("/u01/app/oracle/product/").function("kslgetl").return {
	if ( pid() == target() ) {
		printf("kslgetl: %12d\n", local_clock_ns()-kslgetl_time)
		kslgetl_tot+=local_clock_ns()-kslgetl_time
		kslgetl_count++
	}
}
probe end {
	printf("\nkslgetl: avg %12d\n",kslgetl_tot/kslgetl_count)
}

This systemtap script follows the exact same structure and logic of the previous systemtap script.

This is what it looks like on my system when executed against a database foreground session executing ‘select * from dual’:

# stap -x 2914 kslgetl.stap
kslgetl:         3363
kslgetl:          786
kslgetl:          744
kslgetl:          782
kslgetl:          721
kslgetl:          707
kslgetl:         1037
kslgetl:          728
kslgetl:          711
kslgetl:          736
kslgetl:          719
kslgetl:          714
kslgetl:         1671
kslgetl:          929
kslgetl:          968
kslgetl:          919
kslgetl:          883
kslgetl:          869
kslgetl:         3030
kslgetl:          750
kslgetl: avg         1362

As you can see, the measured average time spent inside the kslgetl function is roughly 1.4us (1362ns) on my system, which includes systemtap overhead. Clearly, taking a latch takes less time than the wait interface itself does, which means not instrumenting the kslgetl() function in the wait interface is a sensible thing.

This means that with the current state of the wait interface, it does not make sense to measure a lot more very fine grained events, even though the timer can time with microsecond granularity. Please mind I am not saying that it does not make sense to detail the response time. I think that with modern computer systems with a lot of memory, the Oracle database can run more and more without needing to wait for things like disk IOs. This means modern database servers can spend a lot of time just running on CPU, making it hard to understand in which routines the time is spent. Tuning on-CPU execution requires insight into where the time is spent. The only option to understand how CPU time in Oracle is composed, is by using external (operating system based) tools like perf and flame graphs to detail CPU time. It would be great if an option existed in the database to break down on-CPU time in the same way.
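
As an illustration of such external tooling: the on-CPU call stacks of a single Oracle process can be sampled with perf and rendered as a flame graph with Brendan Gregg’s FlameGraph scripts. A sketch, assuming the FlameGraph scripts have been downloaded to /path/to/FlameGraph:

# perf record -g -p <PID> -- sleep 30
# perf script | /path/to/FlameGraph/stackcollapse-perf.pl | /path/to/FlameGraph/flamegraph.pl > oracle_cpu.svg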
