Tag Archives: linux

There are many posts about the amount of memory that is taken by the Oracle database executables and the database SGA and PGA. The reason for adding yet another one on this topic is a question I recently gotten, and the complexities which surrounds memory usage on modern systems. The intention for this blogpost is to show a tiny bit about page sharing of linux for private pages, then move on to shared pages, and discuss how page allocation looks like with Oracle ASMM (sga_target or manual memory).

The version of linux in this blogpost is Oracle Linux 7.2, using kernel: 4.1.12-37.6.3.el7uek.x86_64 (UEK4)
The version of the Oracle database software is (july 2016).

Memory usage of virtual memory systems is complicated. For that reason I see a lot of people getting very confused about this topic. Let me state a very simple rule: the memory actively being used on a system should fit in physical memory. Swap (a file or partition), increases total virtual memory, but really only is a safety net for saving your system from an out of memory situation at the cost of moving pages from and to disk. Because modern linux kernels have swappiness (willingness to swap) to a non-zero value, it’s not uncommon to have some swap being used, despite physical memory not being oversubscribed. A system stops performing as soon as paging in and out starts to occur, and for that reason should not happen.

1. Private pages for linux executables
When an executable is executed on linux from the shell, the shell executes a fork() call to create a new process, which is implemented as a clone() system call on linux. Using the clone() system call, the virtual memory space of the newly created process is shared (readonly) with it’s parent. This includes the private allocations! Once the child process needs to write in it’s memory space, it will page fault and create it’s own version, abandoning the version of its parent.

Can we actually prove this is happening? Yes, the /proc/ filesystem gives an insight to a process’ virtual memory space.
Let’s start off with a very simple example: we execute ‘cat /proc/self/maps’ to see our own address space:

[oracle@oracle-linux ~]$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
0060c000-0060d000 rw-p 0000c000 fb:00 201666243                          /usr/bin/cat
00e41000-00e62000 rw-p 00000000 00:00 0                                  [heap]
7f69729be000-7f6978ee5000 r--p 00000000 fb:00 576065                     /usr/lib/locale/locale-archive
7f6978ee5000-7f6979099000 r-xp 00000000 fb:00 522359                     /usr/lib64/
7f6979099000-7f6979298000 ---p 001b4000 fb:00 522359                     /usr/lib64/
7f6979298000-7f697929c000 r--p 001b3000 fb:00 522359                     /usr/lib64/
7f697929c000-7f697929e000 rw-p 001b7000 fb:00 522359                     /usr/lib64/
7f697929e000-7f69792a3000 rw-p 00000000 00:00 0
7f69792a3000-7f69792c4000 r-xp 00000000 fb:00 522352                     /usr/lib64/
7f69794b9000-7f69794bc000 rw-p 00000000 00:00 0
7f69794c3000-7f69794c4000 rw-p 00000000 00:00 0
7f69794c4000-7f69794c5000 r--p 00021000 fb:00 522352                     /usr/lib64/
7f69794c5000-7f69794c6000 rw-p 00022000 fb:00 522352                     /usr/lib64/
7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0
7ffdab1c7000-7ffdab1e8000 rw-p 00000000 00:00 0                          [stack]
7ffdab1ea000-7ffdab1ec000 r--p 00000000 00:00 0                          [vvar]
7ffdab1ec000-7ffdab1ee000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Here’s a lot to see, but we see the cat executable at 0x00400000. The reason for three memory allocations are (linux/ELF) executables uses different sections with specific functions. A full overview of these can be obtained using the readelf executable. A simpler overview of an executable, which matches the above three memory allocations for the cat executable can be obtained using ‘size -B’ (the size executable, -B means ‘berkeley style’, which is default):

[oracle@oracle-linux ~]$ size -B /usr/bin/cat
   text	   data	    bss	    dec	    hex	filename
  43905	   1712	   2440	  48057	   bbb9	/usr/bin/cat

This describes the three memory sections an linux executable can have: text (the machine instructions, alias ‘the program’), data (all initialised variables declared in the program) and BSS (uninitialised data).
The first section always is the text allocation (not sure if it’s impossible to have the text section not being the first allocation, I have never seen it different). If you look at the memory flags, ‘r-xp’, this totally makes sense: ‘r-‘ meaning: read(only, followed by a’-‘ instead of a ‘w’), ‘x’: executable and ‘p’: this is a private allocation. The next allocation is the data section. We don’t execute variables, we read them, which is reflected in the flags: ‘r–p’. But what if we change the value of a variable? That is where the third section is for: changed values of initialised variables. This can be seen from the flag of this section: ‘rw-p’, read, write and private. The fourth allocation lists [heap], this is a mandatory allocation in every process’ memory space, which holds (small) memory allocations, this is NOT the BSS section. In this case, the BSS section does not seem to be allocated.

By having memory allocations for /usr/lib64/ we can see this is a dynamically linked executable. You can also see this by executing ‘file’ on the executable:

[oracle@oracle-linux ~]$ file /usr/bin/cat
/usr/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=3207edc47638918ceaeede21947a20a4a496cf63, stripped

If a linux executable is dynamically linked, you can see the libraries that are loaded by the dynamic linker/loader using the ldd utility:

[oracle@oracle-linux ~]$ ldd /usr/bin/cat =>  (0x00007ffceb3e4000) => /lib64/ (0x00007fd46fb7e000)
       	/lib64/ (0x000055d5253c9000)

This output shows the dynamic loader (/lib64/, and two libraries the dynamic loader loads: and The first one, libc, is the standard C library. The second one, linux-vdso is for virtual dynamic shared object, which is an optimisation for certain system calls to be executed in user space (notably gettimeofday()).
The other allocations that exist in our example are anonymous mappings (usually done by programs using the mmap() call):

7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0

And some allocations for system purposes, like stack, var, vdso and vsyscall.

Now that you have become familiar with some basic linux memory address space specifics, let’s take it a little further. It’s possible to see more about the memory segments using the proc filesystem smaps file:

[oracle@oracle-linux ~]$ cat /proc/self/smaps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
Size:                 44 kB
Rss:                  44 kB
Pss:                  44 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:        44 kB
Private_Dirty:         0 kB
Referenced:           44 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
Size:                  4 kB
Rss:                   4 kB

Per allocation there are a lot of properties to be seen. ‘Size’ is the full size, ‘Rss’ is the resident set size, alias the amount of data of this segment that is truly resident for this process in it’s address space. ‘Pss’ is fairly unknown, and is the proportional size of this segment. The way it is proportional is that if pages in this allocation are shared with other processes, the size of these pages are divided by the number processes it is shared with. In this case, we have loaded the text segment of the cat executable into the process’ address space, which all is resident (size and rss are the same) and it’s not shared with any process (rss equals pss). There are many more properties, but these are out of scope for this blogpost.

Now let’s move on to Oracle. If you look at the maps output of the pmon process for example, you’ll see:

[oracle@oracle-linux 14153]$ cat maps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/
10b8f000-10de8000 rw-p 1058f000 fb:03 67209358                           /u01/app/oracle/product/
10de8000-10e19000 rw-p 00000000 00:00 0
1190f000-11930000 rw-p 00000000 00:00 0                                  [heap]

Here we see the Oracle executable, with a text segment, a readonly data segment and a read/write data segment, and we see an anonymous mapping directly following the data segments. That’s the BSS segment!
However, what is more interesting to see, is the properties of the distinct memory allocations in smaps:

[oracle@oracle-linux 14153]$ cat smaps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/
Size:             267704 kB
Rss:               40584 kB
Pss:                 819 kB
Shared_Clean:      40584 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:        40584 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/
Size:                136 kB
Rss:                 124 kB

If we look at the text segment for the oracle binary, we see the total text size is 267704 kB (size), but resident (truly available for this process in its address space) is only 40584 kB (rss), and because the oracle executable’s text segment is shared with a lot of processes, the proportional size is only 819 kB (pss).

If you want to understand how much memory is taken in the system, the size is telling the total size of the segment, but it doesn’t say anything on true memory usage. The rss size tells the amount of pages for the segment that is paged in to the address space of every process, and can (and is, for oracle) different for every process. The pss size is the proportional size for every process. Probably the only way to tell the true amount of memory taken by executables and libraries is to add up all the pss sizes. Any other value only tells something about the process’ point of view on memory usage, but not overall, true consumed space because that would lead to counting too much.

This is different for anonymous allocations. Since anonymous allocations are created when a process is run, I’ve only seen them initialised purely private. For that reason rss and pss sizes are equal, because every process initialises it strictly for itself. This too works in a lazy allocation way. When memory is allocated, the size is defined, but is only really allocated once it’s truly used, which is expressed by a difference between size and rss.

2. shared pages
The Oracle databases relies on shared caches and data structures, which are put into what is called the SGA, the system global area. The main components of the SGA are the shared pool (shared structures), log buffer (change vectors to be written to disk to persist changes) and the buffer cache, amongst others. With any memory management option (manual management, ASMM (automatic shared memory management, sga_target) and AMM (automatic memory management, memory_target)) there is a SGA. Depending on the memory option, these are visible in a different way.

When manual memory or ASMM is used, shared memory is allocated as system V shared memory. The ‘classic’ way of looking at system V shared memory is using ipcs -m (m is for shared memory, you can also use s for semaphores and q for message queues):

[oracle@oracle-linux ~]$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 655360     oracle     600        2932736    124
0x00000000 688129     oracle     600        905969664  62
0x00000000 720898     oracle     600        139673600  62
0x5f921964 753667     oracle     600        20480      62

Please mind that if you have more than one instance active, or an ASM instance active, you will see more shared memory allocations.
Apparently, the oracle database allocates a couple of shared memory segments. If you want to understand what these memory allocations are for, you can use the oradebug ipc command to see what their functions are:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug ipc
IPC information written to the trace file

This generates a trace file in the ‘trace’ directory in the diagnostics destination. Here is how this looks like (partial output with content of interest to this blogpost):

 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002cbe70 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2   655360 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cc000 00000000002cc000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000036000000 Minimum Subarea size 00400000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0   688129 0x00000060400000 0x00000060400000 0x00000060400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000036000000 0000000036000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1   720898 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008534000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000005000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3   753667 0x0000009ec00000 0x0000009ec00000 0x0000009ec00000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000005000 0000000000005000 default       readwrite

The first allocation is ‘fixed size’, alias the fixed SGA, the second allocation is the ‘variable size’, which contains the shared pool and the buffercache, the third allocation is the ‘redo buffers’ and the fourth is the ‘skgm overhead’ alias the index into the shared memory structures for this instance.

Because any memory allocation is visible in maps and smaps, this method can be used for shared memory too, to see how the shared memory segments are mapped into the process address space. All oracle database server processes have the shared memory segments for the instance mapped into their address space. The usage is different per process, so the amount of shared memory paged into the address space will be different:

12bcd000-12bee000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:05 688129                             /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:05 720898                             /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:05 753667                             /SYSV5f921964 (deleted)
7f473004e000-7f47301d4000 r-xp 00000000 fb:02 212635773                  /u01/app/oracle/product/

Shared memory is easily identified by the ‘s’, at which “normal” private memory mappings have ‘p’. If you want to know more about the process’ perspective of the shared memory, we can use smaps, just like with private memory mappings (virtual memory space of pmon):

60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
Size:                  4 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd sh mr mw me ms sd
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
Size:               2860 kB
Rss:                 392 kB
Pss:                  36 kB
Shared_Clean:          0 kB
Shared_Dirty:        372 kB
Private_Clean:         0 kB
Private_Dirty:        20 kB
Referenced:          392 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms sd

These two shared memory segments are belonging to the fixed sga. The reason for two segments is the first page (0x1000 equals 4096, alias a single linux page) is readonly (r–s). The other fixed SGA segment is read write (rw-s). Here we see that from the process’ perspective it really doesn’t matter much if a piece of mapped memory is shared or private; it’s exactly handled the same way, which means the full segment is mapped into the process’ virtual memory space, but only once pages are touched (alias truly used), the process registers the address in its pagetable, and the pages become resident (as can be seen in the difference between the total size and the rss). The sole purpose of shared memory is it is shared between process. That the pages are shared is very well visible with the difference between rss and pss size. Its also easy to spot this shared memory segment is created from small pages; MMUPageSize and KernelPageSize is 4kB.

However, this yields an interesting question: shared memory does not belong to any single process. Does that mean that if a shared memory segment is created, it is truly allocated, or can shared memory be lazy allocated as well? Please mind that above statistics are the process’ perspective, not the kernel’s perspective.

One way to see the state of shared memory system wide, is using the ‘-u’ flag with the ipcs command:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  255684
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This is a really useful view! What we can see from the output from this command, is that nearly all pages allocated as shared memory are resident. By having statistics for shared memory pages allocated and resident we can conclude that shared memory too could be allocated in a lazy, alias on demand. Also, there is a difference between resident and allocated, which indicates lazy allocation too.

Inside the database I am aware of two parameters that could influence shared pages usage; pre_page_sga and _touch_sga_pages_during_allocation, see my article on these. However, what is interesting, is that these parameters are different for the instance I am testing with for this blogpost, which is running on a VM:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: page
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%page%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden: Y
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('Y','Y',3,2))

-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
olap_page_pool_size    				   0   									  TRUE 	   FALSE      FALSE
pre_page_sga   					   TRUE        								  TRUE 	   FALSE      FALSE
use_large_pages        				   TRUE        								  TRUE 	   FALSE      FALSE
_max_largepage_alloc_time_secs 			   10  									  TRUE 	   FALSE      FALSE
_olap_page_pool_expand_rate    			   20  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hi     				   50  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hit_target     			   100 									  TRUE 	   FALSE      FALSE
_olap_page_pool_low    				   262144      								  TRUE 	   FALSE      FALSE
_olap_page_pool_pressure       			   90  									  TRUE 	   FALSE      FALSE
_olap_page_pool_shrink_rate    			   50  									  TRUE 	   FALSE      FALSE
_realfree_heap_pagesize        			   65536       								  TRUE 	   FALSE      FALSE
_realfree_pq_heap_pagesize     			   65536       								  TRUE 	   FALSE      FALSE
_session_page_extent   				   2048        								  TRUE 	   FALSE      FALSE
_touch_sga_pages_during_allocation     		   FALSE       								  TRUE 	   FALSE      FALSE

14 rows selected.

In the database I created on my VM, pre_page_sga equals to TRUE and _touch_sga_pages_during_allocation to FALSE, which is the exact inverse of the settings of a database (PSU 160419) on a huge machine. Perhaps these parameters are dynamically set based on size of the SGA and logic (if _touch_sga_pages_during_allocation is TRUE, it makes sense to set pre_page_sga to FALSE, as it’s function has been performed by the bequeathing session.

However, having pre_page_sga set to TRUE it makes sense almost all SGA (shared) pages are allocated, because pre_page_sga (at least in Oracle 12, not sure about earlier versions, because the Oracle description of this parameter is different from what happens in Oracle 12) spawns a background process (sa00) that scans SGA pages, which means it pages them, resulting in the actual allocation. Let’s test this by setting pre_page_sga to false, it should lead to way lesser shared memory pages allocated, which will eventually be allocated as database processes are paging them in:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force;

And then look at ipcs -mu again:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  92696
pages swapped   0
Swap performance: 0 attempts   	 0 successes

As expected, only the bare necessary pages are resident after startup force, all the other shared pages will be slowly paged in as foreground and background processes touching SGA pages during execution.

How would that work when we set sga_max_size to a different value than sga_target? If the pages beyond the sga_target are never allocated, you could control the amount of SGA pages used by setting sga_target, but ‘reserve’ extra memory to use by setting sga_max_size higher, which is never allocated, so it is not wasted. Let’s setup the instance:

SQL> alter system set pre_page_sga=true scope=spfile;
SQL> show spparameter sga_target

SID    	 NAME  			       TYPE    	   VALUE
-------- ----------------------------- ----------- ----------------------------
*      	 sga_target    		       big integer 1000M
SQL> ! ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  102512
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This sets the pre_page_sga parameter from the spfile, which means the instance will spawn a process to touch SGA pages on next startup.
Currently, the sga_target for sizing the SGA is set to 1000M in the spfile.
ipcs tells us we got 256005 pages are allocated, which makes sense: 256005*4=1024020k, which is slightly more than the set 1000M, which means essentially sga_target equals pages allocated.

SQL> alter system set sga_max_size=2g scope=spfile;
SQL> startup force;
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

This sets sga_max_size to double the amount of sga_target, and ‘startup force’ bounces the instance.

SQL> show parameter sga_target

NAME   				     TYPE      	 VALUE
------------------------------------ ----------- ------------------------------
sga_target     			     big integer 1008M

Here we see the actual parameter in the database is set to 1008M. Now let’s look at the ipcs -mu values again:

> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  521923
pages swapped   0
Swap performance: 0 attempts   	 0 successes

521923*4=2087692. So (almost) all the memory set for sga_max_size is allocated. In fact, if you look at the values at instance startup values reported above, you see ‘Total System Global Area’ showing the 2G, it’s all SGA, so it’s all touched because of pre_page_sga being set to TRUE. So the next test would be to have pre_page_sga being set to FALSE:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

All memory is still declared SGA, as we can see. However, by having _touch_sga_pages_during_allocation set to FALSE and pre_page_sga set to FALSE, we should see only the actual used SGA pages being allocated:

SQL> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  91692
pages swapped   0
Swap performance: 0 attempts   	 0 successes

The above output shows the shared memory status directly after I restart my instance, so this is not only less than sga_max_size, it is even less than sga_target (91692*4=336768, ~ 336M). This will grow up to sga_target, because these pages will get paged in by the database processes.

How does this look like when we add in huge pages? In Oracle in my instance the parameter to tell oracle to allocate huge pages if there are any (‘use_large_pages’) is set to TRUE. This will make Oracle use large pages if any are available. This is true, even if there are not enough huge pages to satisfy the entire SGA; Oracle will just allocate all that can be allocated, and create a new shared memory segment using small pages for the remainder of the needed shared memory.

Sadly, it seems per memory segment statistics like rss, pss, shared and private clean and dirty, etc. are not implemented for huge pages:

[oracle@oracle-linux [testdb] ~]$ cat /proc/$(pgrep pmon)/smaps
61000000-d8000000 rw-s 00000000 00:0e 688129                             /SYSV00000000 (deleted)
Size:            1949696 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms de ht sd

This is the main shared memory segment, allocated from huge pages (as can be seen with KernelPageSize and MMUPageSize), which means it’s the segment holding the shared pool and buffercache. This can also be seen by the size: 1949696 kB, which is nearly the 2G of sga_max_size.

However, we can just use the global information on system V shared memory (ipcs -mu) and we can use the huge page information in /proc/meminfo:

[oracle@oracle-linux [testdb] ~]$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:    1100
HugePages_Free:      880
HugePages_Rsvd:      805
HugePages_Surp:        0
Hugepagesize:       2048 kB

The statistics of interest are:
hugepages_total: the total number of huge pages allocated. warning: huge pages memory allocated by the kernel is NOT available for allocation of regular sized pages (which means you can starve your processes and the kernel for normal pages by setting the number of huge pages too high).
hugepages_free: the number of huge pages which are not used currently. warning: this includes allocated but not yet initialised pages, which hugepages_rsvd shows.
hugepages_rsvd: the number of huge pages allocated but not yet initialised.
hugepages_surp: the number of huge pages allocated (truly allocated and not yet initialised) greater than the total number of huge pages set. this value can be greater than zero if the kernel setting vm.nr_overcommit_hugepages is greater than zero. The value of this setting is zero by default, and at least for usage with the Oracle database, this value should remain zero.

The same information can be obtained using ipcs -mu, but with a twist:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524803
pages resident  122881
pages swapped   0
Swap performance: 0 attempts   	 0 successes

Some of you might get the twist on this by looking at the number.
It turns out ipcs has no facility for huge pages, it just reports the number of pages as if these were 4 kB.
524803*4 (kB) / 1024 (to make it MB) = 2050.

Now going back to the goal of looking into this: I told shared memory is allocated and paged at startup time when _touch_sga_pages_during_allocation is set to TRUE (set to false as default value in my current database), and it could be explicitly paged by the background process sa00 after startup of the instance when pre_page_sga is set to TRUE. When both are set to false, shared memory allocated from default sized 4kB pages is allocated only when it’s used. In the above examples with huge pages, the tests were done with pre_page_sga set to false. This shows exactly the same ‘lazy allocation’ behaviour as 4kB pages.

When ‘extra’ memory is reserved from the operating system by setting sga_max_size to a higher value than sga_target, this will all be allocated and paged if either _touch_sga_pages_during_allocation or pre_page_sga is set to TRUE, which doesn’t make sense; if the memory is taken, you might as well use it. However, this is different if both _touch_sga_pages_during_allocation and pre_page_sga are set to false. All memory beyond sga_target up to sga_max_size is allocated, but never touched, and thus never paged in, so never truly allocated. Please mind linux itself understands this perfectly (aiming at huge pages and ‘reserved’ pages), however the system V ipc kernel settings do not; you need to set the shared memory values high enough to facilitate the total sum of sga_max_size values, not the truly used sizes as indicated by the sum of sga_target values.

The inspiration for this investigation came from a question on my blog. However, the question was about memory_target and memory_max_target and AIX. I do not have an AIX system at hand. I did not investigate the implementation of memory_target and memory_max_target on AIX. So I can’t comment on that. What I can say, is that on Linux, you really, really should use automatic shared memory management (ASMM) alias setting sga_target or setting it manually (and set huge pages!). If you are used to these memory management settings on databases not on AIX, it probably makes sense to use that on AIX too, even if the automatic memory management (AMM) alias setting memory_target is implemented brilliantly on AIX, for the sake of predictability and standardisation.

This blogpost is about using the linux ftrace kernel facility. If you are familiar with ftrace and specifically the function_graph tracer, you might already be aware of this functionality. This is Linux specific, and this facility is at least available in kernel 2.6.39 (Oracle’s UEK2 kernel).

What is a ‘kernel dive’? Whenever a process is running, it should mostly be in ‘user mode’, executing the program it is supposed to run. However, during running there could be situations (a lot of situations, depending on what the program is doing!) that the program needs something “from” the system. Such a thing could be allocating memory, or using a device that is shared like a block device, or a network device. These things are controlled by the kernel, and require a process to issue a system call. A user process executes a system call to request actions to such things. Starting from the system call, the execution ‘dives’ in the kernel, and executes in kernel or system mode. However, kernel dives are not limited to system calls; for example a page fault (paging in backing memory for allocated virtual memory) switches to system mode, as well as handling an interrupt.

The Linux kernel has a facility that is called ‘ftrace’. The name ftrace originally was named because of function tracing, but it has evolved into a tracing framework. It’s important to point out that ftrace currently ONLY works in KERNELSPACE. This means you miss the userspace code.

Ftrace uses both explicit tracepoints (defined in the linux kernel source), as well as dynamic tracepoints, for which the gcc -pg (profiling data) flag is used to capture function entry. For function exit a ‘trampoline’ is used. A trampoline here is an extra function executed (mcount) at function entry that stores the return address, and replaces the return address with that of the trampoline, so an exit can be detected.

the linux debugfs filesystem must be mounted for ftrace to work. You can check if the debugfs filesystem is mounted using:

[root@bigmachine ~]# mount -t debugfs

It does not return any rows if debugfs is not mounted. You can mount debugfs the following way:

mount -t debugfs none /sys/kernel/debug

Let’s do some basic steps first, just tracing an Oracle session!
First go to the tracing directory:

[root@bigmachine ~]# cd /sys/kernel/debug/tracing/
[root@bigmachine tracing]#

Obtain the PID of an Oracle foreground process, and enable tracing for this PID:

[root@bigmachine tracing]# echo 6431 > set_ftrace_pid

We also need to choose what we want to trace. Something very beneficial for understanding what is going on in the kernel is the tracer ‘function_graph’. Here’s how you enable that trace:

[root@bigmachine tracing]# echo function_graph > current_tracer

The trace output is in ‘trace’. Assuming that the PID to trace is idle, this is how the trace output looks like:

[root@bigmachine tracing]# cat trace
# tracer: function_graph
# |     |   |                     |   |   |   |

Now execute something silly (something which does little!).

TS@fv12102 > select * from dual;


Now let’s first show how much information is gathered:

[root@bigmachine tracing]# cat trace | wc -l

That’s correct: approximately 1900 rows of trace data are created during only ‘select * from dual’. Now think about this: most of the things ‘select * from dual’ does are done in userspace. What does ‘select * from dual’ actually do in kernel space? Here’s a little grep to see what the Oracle process did:

[root@bigmachine tracing]# grep \|\ \ [a-zA-Z_0-9]*\(\)\ \{ trace
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_getrusage() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
   0)               |  sys_times() {
   0)               |  __audit_syscall_exit() {
   0)               |  __audit_syscall_entry() {
...much more...

Probably you are aware Oracle executes a lot of times() and getrusage() calls. As you can see, the system calls are also audited by the Linux system. Now just open the trace file with your favourite file viewer (I use ‘less’, you can use the vi commands to search for pieces of text), and peek in the file:

   0)               |  sys_getrusage() {
   0)               |    getrusage() {
   0)               |      k_getrusage() {
   0)               |        task_cputime_adjusted() {
   0)               |          cputime_adjust() {
   0)   0.045 us    |            nsecs_to_jiffies();
   0)   0.503 us    |          }
   0)   0.876 us    |        }
   0)   0.041 us    |        jiffies_to_timeval();
   0)   0.042 us    |        jiffies_to_timeval();
   0)               |        get_task_mm() {
   0)   0.047 us    |          _raw_spin_lock();
   0)   0.380 us    |        }
   0)               |        mmput() {
   0)   0.041 us    |          _cond_resched();
   0)   0.351 us    |        }
   0)   3.836 us    |      }
   0)   0.043 us    |      _cond_resched();
   0)   4.596 us    |    }
   0)   5.004 us    |  }

I skipped the beginning of the trace, which is actually the ending of the kernel code of the waiting on a next command of the Oracle process. If you scroll down to the end, you will see how the waiting on a next command (instrumented by the wait event ‘SQL*Net message from client’) is actually implemented on the kernel side using a read function on a pipe. You find the above shown systemcall, getrusage, just after the ending of the read function in the beginning of the trace (or search for ‘sys_getrusage’).

What is interesting is that this trace is showing the different functions in the kernel and which function is calling what function, made visible by accolades and indention, made to look like a c program. In other words: this allows you to see in what specific function of the kernel the time is spend, and how the total time of a kernel function is build up!

Now that basic usage is known, let’s step up to something interesting; the getrusage timing in interesting, but just an example. One of such really interesting things is IO.

First clear the trace file:

[root@bigmachine tracing]# echo > trace

Verify that the current tracer still is function_graph:

[root@bigmachine tracing]# cat current_tracer

Verify the process id to trace (it should list the process id of the oracle foreground process you want to trace):

[root@bigmachine tracing]# cat set_ftrace_pid

Now an additional next step to only trace the pread systemcall using the ‘set_graph_function’ facility:

[root@bigmachine tracing]# echo sys_pread64 > set_graph_function

Now make the Oracle session that is traced do a pread call. I used ‘select * from t1 where rownum=1’. Because I flushed the buffer cache prior to doing this (to make sure physical IO is needed), I did get 2 physical IOs, one for the segment header and one for the data block.

To make sure nothing else will get into the trace buffer (the file really is a buffer in memory), do the following to stop further tracing:

[root@bigmachine tracing]# echo 0 > tracing_on

This is how my output looks like:
I added line numbering to it (you can do that yourself too with the nl linux utility), so there are a few things I can point you to.

The first pread call ends at line# 568. That is a lot of information. It also shows how much stuff is done during an IO.
– The filesystem type matters! At line 7 you see vfs_read (the linux filesystem abstraction layer), but at line 19 you see filesystem specific code!
– At line 27 you see a XFS specific function indicating DIO (direct IO) is used.
– At line 33 you see an interaction with the block layer using the blk_start_plug() function. Linux uses a method to group IOs that is logically equal to filling up and later draining a bathtub using a plug.
– At line 328 you see the IO request has been built by the filesystem code, and the the request queue is unplugged: blk_flush_plug_list. You see some IO scheduler functions (starting with ‘elv’) and you see the deadline scheduler is used (deadline_add_request). Not very far after that, we enter the scsi layer (indicated by functions starting with ‘scsi_’).
– At line 431 you see the request being submitted to the device using the mpt_put_msg_frame function. We are now in the low level driver layer. We see there is time involved (82ms, you can see this is a virtual machine). You see all kinds of loops ending here. At this point the IO request has been sent to the IO device. However, there is more interesting stuff coming!
– At line 429 we see the kernel is preparing for waiting on the IO request to return. This is done in using the dio_await_completion function.
– At line 436 we are totally done submitting the IO request, and entering the Linux process scheduler. In the scheduler code we see housekeeping (update_blocked_averages, line 456), and rebalancing the process, which means trying to find the best cpu thread to execute on (load_balance, line 462), before the process finally goes to sleep.
– At line 487 the process is woken by an interrupt (not visible in the trace) from it’s uninterruptible sleep (state ‘D’). The IO then really finishes up; at function dio_bio_complete, line 500 the data of the IO request is put in a BIO structure which can be passed on to userspace, some further housekeeping is done (dio_complete, line 539), the access time is updated (touch_atime, line 552) and some xfs housekeeping, unlocking the inode (xfs_iunlock, line 559).

Let me show you something which shows how valuable this tracing is for finding (linux operating system) issues. This is another trace on the very same system with slightly different kernel settings:

Can you spot the difference? If not, take a look at this diff:, at line number 115. The function called the second time is gup_huge_pmd. That’s a function for handling user pages, and this function is using huge pages! The change I made was setting vm.nr_hugepages from zero to a number higher than the database instance needed. So not only the Oracle database can benefit from huge pages, but also the Linux kernel IO subsystem!

If this wetted your appetite, read on! Here’s a trace, once again of pread: Can you see what is the difference here? If not, I created a diff of this trace and the previous one: A nice example of the difference is at line 42. The first trace file contains all kinds of functions for doing extent management for the XFS filesystem, until line 89. This is all replaced with a single function blkdev_get_block.

The similarity is it’s all the system call pread. The difference between the second and the third is that with the third pread trace I used Oracle’s ASM facility, alias database IO to a block device directly without a filesystem. This trace shows pread is not executing all the filesystem functions, because there is no filesystem. However, please do realise it’s all about spend time, not how many rows there are in a trace.

Now before jumping to conclusions about the greatness of ASM because it can skip a lot of the code path, please do realise that ASM adds code path inside the Oracle database, because some kind of disk space management must be done. Also there is an entire ASM instance for managing the diskspace (which is NOT part of database sessions’ code path). These are facts, not opinion. There are use cases for both filesystem usage and for ASM, although I have a personal preference for ASM.

Ftrace does not get the attention it deserves. It’s a great tool for investigating time spend in the kernel, and it’s available by mounting the debugfs filesystem. This blogpost describes the function_graph tracer, there are other tracers too, it’s absolutely not limited to the description in this blogpost.

To show how ftrace with the function_graph tracer works I took the pread system call as an example. It turns out the pread system call comes in many shapes. Unless you intimately know a system, there can be different layers in the kernel in play when executing pread. However, on a normal system the main time component of the pread system call should be off CPU in uninterruptible state. And that is exactly what the function_graph tracer can tell.

Again, as a reminder, ftrace only works for kernel level (“system”) execution of a process.

This is a small blogpost on using ‘perf’. I got an error message when I tried to run ‘perf top’ systemwide:

# perf top
Too many events are opened.
Try again after reducing the number of events

What actually is the case here, is actually described in the perf wiki:

Open file limits
The design of the perf_event kernel interface which is used by the perf tool, is such that it uses one file descriptor per event per-thread or per-cpu.
On a 16-way system, when you do:
perf stat -e cycles sleep 1
You are effectively creating 16 events, and thus consuming 16 file descriptors.

The point for this blogpost is perf (in Oracle Linux 7.1) says ‘too many events’, and hidden away in the perf wiki the true reason for the message is made clear: perf opens up a file descriptor per cpu thread, which means that if you are on a big system you might get this message if the open files (file descriptors actually) limit is set lower than the number of cpu threads.

You can see the current set limits using ‘ulimit -a’:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 189909832
max locked memory       (kbytes, -l) 21878354152
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

If you are root, you can simply set the ‘open files’ limit higher than the amount of cpu threads, and perf will work:

# ulimit -n 10240

I was setting up disk devices for ASM in Oracle Linux 7. I knew things have changed between Oracle Linux 6 and Oracle Linux 7, but only just a little bit.

First of all, let’s take a look at the current disk usage. To see what disk devices are visible and how they are used, use lsblk:

# lsblk
fd0           2:0    1    4K  0 disk
sda           8:0    0   40G  0 disk
├─sda1        8:1    0  500M  0 part /boot
└─sda2        8:2    0 39.5G  0 part
  ├─ol-swap 253:0    0  3.9G  0 lvm  [SWAP]
  └─ol-root 253:1    0 35.6G  0 lvm  /
sdb           8:16   0   40G  0 disk
sr0          11:0    1 1024M  0 rom

As you can see, I got a floppy disk drive (fd0), a device sda which contains two partitions, sda1 and sda2. The partition sda1 is mounted to /boot. The second partition sda2 is used by LVM, and contains two logical volumes, one for swap, the other has a filesystem that is mounted on /. The device sdb is currently not used (I want to use that for ASM), and there is a device sr0, which is a DVD drive.

The only correct way to use udev (that I am aware of) for disk devices, is using the UUID of the disk. Please mind that if there are multiple connections to the disk/slice presented to the host, which is typically done with fiberchannel connections to a SAN, you should use multipathd to create a multi path device first, and use that device. Each path towards a SAN slice show up as a disk device. The way to distinguish the disk devices from paths is the UUID (with a subtle difference between the SCSI UUID and the WW ID), which is exactly what the multi path daemon uses too. In my setup there are no multiple paths to a disk device.

I want to use gdb as disk for ASM. In order to make sdb accessible for ASM, it needs to have user and group set to oracle. I find it convenient to have the device for usage with ASM in a different place (/dev/oracleasm/), so I can point my ASM_DISKSTRING to the entire directory. This is what we are going to use udev for.

First we need to obtain the UUID. One way of doing this, is via lsscsi:

# lsscsi -i
[1:0:0:0]    cd/dvd  NECVMWar VMware IDE CDR10 1.00  /dev/sr0   -
[2:0:0:0]    disk    VMware,  VMware Virtual S 1.0   /dev/sda   36000c29eea23a3f6b958ca77007bac53
[2:0:1:0]    disk    VMware,  VMware Virtual S 1.0   /dev/sdb   36000c2912554c8f44edc68106edd5040

Now we need to create a udev rules file to make udev create a device for us in /dev/oracleasm, and set the correct permissions. A rules file should be created in /etc/udev/rules.d/. Typically, rules files start with a number, which is used by the udev daemon to understand the sequence/ordering of the rules. In OL7 there are no additional rules files in /etc/udev/rules.d/, which is different from OL6, where multiple rules files are already present.

# vi /etc/udev/rules.d/99-oraccle-asmdevices.rules
KERNEL=="sd*", ENV{ID_SERIAL}=="36000c2912554c8f44edc68106edd5040", SYMLINK+="oracleasm/disk1", OWNER="oracle", GROUP="dba", MODE="0660"

This is the simplest form such a rule can have. The KERNEL matchkey says it should filter only devices matching ‘sd*’, ENV{ID_SERIAL} matches the UUID, SYMLINK creates a symbolic link, and OWNER, GROUP and MODE are obvious. In OL6, you could use NAME instead of SYMLINK which would create the device indicated, and remove the device from the original place (/dev/sdb in my case). In OL7 NAME is not allowed anymore.

At this point a link has been created with the correct permissions to be used with Oracle ASM.

However, udev is a black box to a lot of DBAs and sysadmins. Let’s look a bit deeper into udev!

Obtaining information about a device.
In my case, I wanted to use /dev/sdb. You can look what udev information exists about /dev/sdb by using:

# udevadm info --name sdb
P: /devices/pci0000:00/0000:00:10.0/host2/target2:0:1/2:0:1:0/block/sdb
N: sdb
S: disk/by-id/scsi-36000c2912554c8f44edc68106edd5040
S: disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040
S: disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0
E: DEVLINKS=/dev/disk/by-id/scsi-36000c2912554c8f44edc68106edd5040 /dev/disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040 /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0
E: DEVNAME=/dev/sdb
E: DEVPATH=/devices/pci0000:00/0000:00:10.0/host2/target2:0:1/2:0:1:0/block/sdb
E: ID_BUS=scsi
E: ID_MODEL=VMware_Virtual_S
E: ID_MODEL_ENC=VMware\x20Virtual\x20S
E: ID_PATH=pci-0000:00:10.0-scsi-0:0:1:0
E: ID_PATH_TAG=pci-0000_00_10_0-scsi-0_0_1_0
E: ID_SCSI_SERIAL=6000c2912554c8f44edc68106edd5040
E: ID_SERIAL=36000c2912554c8f44edc68106edd5040
E: ID_SERIAL_SHORT=6000c2912554c8f44edc68106edd5040
E: ID_TYPE=disk
E: ID_VENDOR_ENC=VMware\x2c\x20
E: ID_WWN=0x6000c2912554c8f4
E: ID_WWN_VENDOR_EXTENSION=0x4edc68106edd5040
E: ID_WWN_WITH_EXTENSION=0x6000c2912554c8f44edc68106edd5040
E: TAGS=:systemd:

This is all the information that udev sees currently, with no (custom) udev rule applied to it.

Testing udev rules.
Next up, we need to write a udev rules file. You can test the results of a udev rule using udevadm test. I’ve put ‘KERNEL==”sd*”, ENV{ID_SERIAL}==”36000c2912554c8f44edc68106edd5040″, NAME+=”oracleasm/disk1″, OWNER=”oracle”, GROUP=”dba”, MODE=”0660″‘ in /etc/udev/rules.d/99-oracle-asmdevices.rules, which is the OL6 version, which contains NAME as key, which is not allowed with OL7. Let’s test this specific rule:

# udevadm test /sys/block/sdb
calling: test
version 219
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

=== trie on-disk ===
tool version:          219
file size:         6984832 bytes
header size             80 bytes
strings            1805856 bytes
nodes              5178896 bytes
Load module index
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
Reading rules file: /usr/lib/udev/rules.d/10-dm.rules
Reading rules file: /usr/lib/udev/rules.d/100-balloon.rules
Reading rules file: /usr/lib/udev/rules.d/11-dm-lvm.rules
Reading rules file: /usr/lib/udev/rules.d/13-dm-disk.rules
Reading rules file: /usr/lib/udev/rules.d/40-redhat.rules
Reading rules file: /usr/lib/udev/rules.d/42-usb-hid-pm.rules
Reading rules file: /usr/lib/udev/rules.d/50-udev-default.rules
Reading rules file: /usr/lib/udev/rules.d/60-alias-kmsg.rules
Reading rules file: /usr/lib/udev/rules.d/60-cdrom_id.rules
Reading rules file: /usr/lib/udev/rules.d/60-drm.rules
Reading rules file: /usr/lib/udev/rules.d/60-keyboard.rules
Reading rules file: /usr/lib/udev/rules.d/60-net.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-alsa.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-input.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-serial.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-storage-tape.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-storage.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-v4l.rules
Reading rules file: /usr/lib/udev/rules.d/60-raw.rules
Reading rules file: /usr/lib/udev/rules.d/61-accelerometer.rules
Reading rules file: /usr/lib/udev/rules.d/64-btrfs.rules
Reading rules file: /usr/lib/udev/rules.d/69-dm-lvm-metad.rules
Reading rules file: /usr/lib/udev/rules.d/70-mouse.rules
Reading rules file: /usr/lib/udev/rules.d/70-power-switch.rules
Reading rules file: /usr/lib/udev/rules.d/70-touchpad.rules
Reading rules file: /usr/lib/udev/rules.d/70-uaccess.rules
Reading rules file: /usr/lib/udev/rules.d/71-biosdevname.rules
Reading rules file: /usr/lib/udev/rules.d/71-seat.rules
Reading rules file: /usr/lib/udev/rules.d/73-idrac.rules
Reading rules file: /usr/lib/udev/rules.d/73-seat-late.rules
Reading rules file: /usr/lib/udev/rules.d/75-net-description.rules
Reading rules file: /usr/lib/udev/rules.d/75-probe_mtd.rules
Reading rules file: /usr/lib/udev/rules.d/75-tty-description.rules
Reading rules file: /usr/lib/udev/rules.d/78-sound-card.rules
Reading rules file: /usr/lib/udev/rules.d/80-drivers.rules
Reading rules file: /usr/lib/udev/rules.d/80-net-name-slot.rules
Reading rules file: /usr/lib/udev/rules.d/80-net-setup-link.rules
Reading rules file: /usr/lib/udev/rules.d/81-kvm-rhel.rules
Reading rules file: /usr/lib/udev/rules.d/85-nm-unmanaged.rules
Reading rules file: /usr/lib/udev/rules.d/90-alsa-tools-firmware.rules
Reading rules file: /usr/lib/udev/rules.d/90-iprutils.rules
Reading rules file: /usr/lib/udev/rules.d/90-vconsole.rules
Reading rules file: /usr/lib/udev/rules.d/91-drm-modeset.rules
Reading rules file: /usr/lib/udev/rules.d/95-dm-notify.rules
Reading rules file: /usr/lib/udev/rules.d/95-udev-late.rules
Reading rules file: /usr/lib/udev/rules.d/98-kexec.rules
Reading rules file: /etc/udev/rules.d/99-oracle-asmdevices.rules
Reading rules file: /usr/lib/udev/rules.d/99-systemd.rules
rules contain 24576 bytes tokens (2048 * 12 bytes), 12641 bytes strings
1896 strings (23762 bytes), 1263 de-duplicated (11755 bytes), 634 trie nodes used
IMPORT 'scsi_id --export --whitelisted -d /dev/sdb' /usr/lib/udev/rules.d/60-persistent-storage.rules:40
starting 'scsi_id --export --whitelisted -d /dev/sdb'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_SCSI=1'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_VENDOR=VMware_'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_VENDOR_ENC=VMware\x2c\x20'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_MODEL=VMware_Virtual_S'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_MODEL_ENC=VMware\x20Virtual\x20S'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_REVISION=1.0'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_TYPE=disk'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_SERIAL=36000c2912554c8f44edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_SERIAL_SHORT=6000c2912554c8f44edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_WWN=0x6000c2912554c8f4'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_WWN_VENDOR_EXTENSION=0x4edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_WWN_WITH_EXTENSION=0x6000c2912554c8f44edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb'(out) 'ID_SCSI_SERIAL=6000c2912554c8f44edc68106edd5040'
'scsi_id --export --whitelisted -d /dev/sdb' [3130] exit with return code 0
LINK 'disk/by-id/scsi-36000c2912554c8f44edc68106edd5040' /usr/lib/udev/rules.d/60-persistent-storage.rules:42
IMPORT builtin 'path_id' /usr/lib/udev/rules.d/60-persistent-storage.rules:55
LINK 'disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0' /usr/lib/udev/rules.d/60-persistent-storage.rules:56
IMPORT builtin 'blkid' /usr/lib/udev/rules.d/60-persistent-storage.rules:72
probe /dev/sdb raid offset=0
LINK 'disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040' /usr/lib/udev/rules.d/60-persistent-storage.rules:82
no db file to read /run/udev/data/+scsi:2:0:1:0: No such file or directory
OWNER 54321 /etc/udev/rules.d/99-oracle-asmdevices.rules:1
GROUP 54322 /etc/udev/rules.d/99-oracle-asmdevices.rules:1
MODE 0660 /etc/udev/rules.d/99-oracle-asmdevices.rules:1
NAME="oracleasm/disk1" ignored, kernel device nodes can not be renamed; please fix it in /etc/udev/rules.d/99-oracle-asmdevices.rules:1
handling device node '/dev/sdb', devnum=b8:16, mode=0660, uid=54321, gid=54322
preserve permissions /dev/sdb, 060660, uid=54321, gid=54322
preserve already existing symlink '/dev/block/8:16' to '../sdb'
found 'b8:16' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fscsi-36000c2912554c8f44edc68106edd5040'
creating link '/dev/disk/by-id/scsi-36000c2912554c8f44edc68106edd5040' to '/dev/sdb'
preserve already existing symlink '/dev/disk/by-id/scsi-36000c2912554c8f44edc68106edd5040' to '../../sdb'
found 'b8:16' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fwwn-0x6000c2912554c8f44edc68106edd5040'
creating link '/dev/disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040' to '/dev/sdb'
preserve already existing symlink '/dev/disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040' to '../../sdb'
found 'b8:16' claiming '/run/udev/links/\x2fdisk\x2fby-path\x2fpci-0000:00:10.0-scsi-0:0:1:0'
creating link '/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0' to '/dev/sdb'
preserve already existing symlink '/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0' to '../../sdb'
created db file '/run/udev/data/b8:16' for '/devices/pci0000:00/0000:00:10.0/host2/target2:0:1/2:0:1:0/block/sdb'
ACTION=KERNEL=="sd*", ENV{ID_SERIAL}=="36000c2912554c8f44edc68106edd5040", NAME+="oracleasm/disk1", OWNER="oracle", GROUP="dba", MODE="0660"
DEVLINKS=/dev/disk/by-id/scsi-36000c2912554c8f44edc68106edd5040 /dev/disk/by-id/wwn-0x6000c2912554c8f44edc68106edd5040 /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0
Unload module index
Unloaded link configuration context.

As you can see in the highlighted row, udev recognised the device from the rule and applied the settings, but it ignored NAME, because it is not allowed to rename kernel device nodes. When you need to use udev, it’s very important you run a test like this to check for the validity of the syntax and if the rule actually does what you want it to do, and the correct line in the udev rules file is applied (you can see the rules file and row number in lines 90-93).

Applying a udev rule from a rules file.
After you tested the your new udev rule, and made sure it does what you want, you need to apply the rule. You can specifically apply it for the device using the following command:

# udevadm trigger --name sdb

(warning: this is a rather detailed technical post on the internal working of the Oracle database’s commit interactions between the committing foreground processes and the log writer)

After the Trivadis Performance days I was chatting to Jonathan Lewis. I presented my Profiling the log writer and database writer presentation, in which I state the foreground (user/server) process looks at the commit SCN in order to determine if its logbuffer contents are written to disk by the logwriter(s). Jonathan suggested looking deeper into this matter, because looking at the commit SCN might not the way it truly works.

The reasoning is the foreground process flushes its log information from its private strand into the public logbuffer, and as such only needs to keep track of the log blocks it allocated in the public logbuffer to see if these are written. Different processes can allocate different blocks in the public log buffer, which potentially do not have to be written all in SCN order. In other words: it could be hard to figure out on write time what the exact on-disk SCN of the redo is.

In order to verify this, let’s first recap what I tell about how a foreground process behaves on commit:

This is a schematic drawing of a foreground process committing.
There are two timelines, one of the foreground process, and one of the logwriter.
The foreground process commits. The commit copies the changes (redo vectors) from the process’ private strand into the public logbuffer (not shown, all indicated with “commit;”), after which it signals (“posts”) the logwriter via the semctl() system call.

When idle, the logwriter sleeps in a systemcall called semtimedop() for 3 seconds after which it performs some “household” tasks (measure time and resource usage for example) then calls semtimedop() again. When signalled, or when it finds contents in the public log buffer, the log writer writes the public log buffer contents to the online redo log file(s) via AIO if available. This is indicated by the io_submit() and io_getevents() calls.

In the meantime, after executing semctl(), the foreground process enters the function kcrf_commit_force(), or the kcrf_commit_force_int() function with Oracle 12. Inside this function, it calls the function kcscur3() three times. If the log writer was able to write the public log buffer contents before kcscur3() has been executed 3 times, the foreground will not enter the wait event ‘log file sync’, and not sleep on semtimedop() in order to wait for being signalled by the log writer, which is why I put it in a gray bar (indicating “optional”). Also, in that case, because the foreground process is not awaiting to be signalled by the logwriter, the log writer does not execute semctl(). The log writer knows what process(es) it needs to signal via a list that administers posters and waiters, which is called the “post-wait queue”, which is not externalised. If the log writer did not write the public log buffer contents fast enough, the foreground process registers it started waiting (in the aforementioned post-wait queue), and starts sleeping in semtimedop() for 100ms, after which it executes kcscur3() two times, do some household actions like measuring time and resource usage, and then calls semtimedop() again. This repeats until the process is receiving a message from the log writer via semctl().

Starting from Oracle version, the underscore parameter “_use_adaptive_log_file_sync” is set to TRUE, which enables a feature called “adaptive log file sync”. This settings means the database can and will (!) switch between post-wait mode, which has just been described, and a mode called ‘polling’. Polling means the processes that requested the log writer to write will not wait until they are posted by the log writer, but look at the log writer write progress, and continue after the log writer has written their specific log blocks from the public log buffer.

This is how that schematically looks like:

As you can see, it is quite alike the post wait mode, only the sleeping is done using nanosleep() instead of semtimedop() and there is no semctl() on the log writer timeline. The foreground process issues the same kcscur3() functions, but because these measure SCN progress, the foreground can determine if its public log buffer contents have been written to disk, and stop waiting and continue processing.

I made the conclusion that the kcscur3() function is used to determine the commit SCN based on profiling the function call sequence and logic reasoning. Encouraged by Jonathan’s reasoning, let’s try to see if we can dig deeper.

The first thing we need to do, is see if we can obtain more information on the kcscur3() function. One way of doing that, is investigating the function’s arguments. This can be done without source code or debug symbols, because the arguments of a function are passed via CPU registers on X86_64 Linux: first to fourth arguments are in registers RDI, RSI, RDX, RCX.

A way to do this, is insert data into a table, then attach to this process using gdb, and execute the following gdb macro:

break kcscur3
printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx

After continuing the foreground via gdb, gdb shows kcscur3 is executed 6 times:

kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Interesting! But what are these numbers? Values? Addresses?
In order to understand if these arguments mean anything, let’s first get the addresses for the general shared memory area’s. This can be done using ipc command with oradebug:

SYS@v12102 AS SYSDBA> oradebug setmypid
Statement processed.
SYS@v12102 AS SYSDBA> oradebug ipc
IPC information written to the trace file
SYS@v12102 AS SYSDBA> @lt

@lt is a script to look at the current active trace file. The relevant contents of this trace file are the shared memory area’s:

Handle:            0x113c2940 `/u01/app/oracle/product/'
 Dump of unix-generic realm handle `/u01/app/oracle/product/', flags = 00000000
  key 604726764 actual_key 604726764 num_areas 4 num_subareas 4
  primary shmid: 112590854 primary sanum 3 version 3
  deferred alloc: FALSE (0) def_post_create: FALSE (0) exp_memlock: 2050M
 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002ca788 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2 112492547 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cb000 0000000000400000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000077000000 Minimum Subarea size 01000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0 112525316 0x00000061000000 0x00000061000000 0x00000061000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000077000000 0000000077000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008d35000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1 112558085 0x000000d8000000 0x000000d8000000 0x000000d8000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008d35000 0000000008e00000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000003000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3 112590854 0x000000e1000000 0x000000e1000000 0x000000e1000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000003000 0000000000003000 default       readwrite

We see the 4 shared area’s and their memory address:
– Fixed size, start address 0x60000000, size 0x2cb000
– Variable size, start address 0x61000000, size 0x77000000
– Redo buffers, start address 0xd8000000, size 0x8d35000
– Skgm overhead, start 0x1e000000, size 0x3000

If we combine this information with the kcscur3() arguments, we see that the first argument points to the ‘fixed size’ area, in other words: the fixed SGA. The fixed SGA variables metadata is listed in x$ksmfsv. Please mind this view does not list all the contents of the fixed SGA, for example the latches are in the fixed SGA too.

We got 3 addresses which are inside the fixed SGA on calling commit: 0x6001fbb0, 0x60027c68 and 0x60027c98. Let’s see if we can find them in x$ksmfsv:

SYS@v12102 AS SYSDBA> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv
2  where to_number('6001fbb0','XXXXXXXX')
3  between to_number(ksmfsadr,'XXXXXXXXXXXXXXXX') and to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')+ksmfssiz-1;

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcsgscn_                                                         000000006001FBB0         48

SYS@v12102 AS SYSDBA> c/6001fbb0/60027c68/
  2* where to_number('60027c68','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

SYS@v12102 AS SYSDBA> c/60027c68/60027c98/
  2* where to_number('60027c98','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

The first fixed SGA address, 0x6001fbb0, points to a variable called kcsgscn_. Some sources on the internet report this is the current SCN. (Kernel Cache System Global SCN?)
Let’s test this! First I set a watchpoint on 0x6001fbb0 in gdb, then query v$database.current_scn:

(gdb) awatch *0x6001fbb0
SYS@v12102 AS SYSDBA> select current_scn from v$database;

This triggers the watchpoint!

Hardware access (read/write) watchpoint 1: *0x6001fbb0

Old value = 11278094
New value = 11278163
0x000000000cc77983 in kcscur3 ()

Ah! So when querying the current_scn, it uses the kcscur3() function too, and apparently, kcscur3() can change a value too (we see an old value, and the new value). I press ‘c’ and enter to let the debugger continue the program it debugs. In fact, I have to do this multiple times in the functions kcscur3(), kcsgssn(), kcsgcsn() and kcsgbsn. The watchpoint shows memory address 0x6001fbb0 is left with a number exactly one higher than is shown in the query ‘select current_scn from v$database’.

The second and third fixed SGA addresses, 0x60027c68 and 0x60027c98, both point to a variable called kcrfsg_. Actually, they do not point to the starting address of the variable, but rather to “somewhere” in this variable. First let’s see what value is stored at these addresses in the variable kcrfsg_ using gdb:

(gdb) x/dw 0x60027c68
0x60027c68:	11278915
(gdb) x/dw 0x60027c98
0x60027c98:	11278917

These are large numbers, which are highly likely to be SCNs of some sort.

The variable kcrfsg_, which is quite probably a c “struct” (a variable composed of multiple variables, alike records in a table) is linked with the x$ view x$kcrfws:

SYS@v12102 AS SYSDBA> select addr from x$kcrfws;


The address reported is 8 bits into kcrfsg_. The x$kcrfws view is only used by the v$ view v$xstream_capture, and because of that it is reported to have something to do with replication. That is incorrect. My current assumption is x$kcrfws means Kernel Cache Redo Write Status.

In order to figure out which field in x$kcrfws is linked to which memory address (0x60027c68 and 0x60027c98), I use gdb once again, and use a watchpoint on a memory address. Oradebug also provides this functionality, but it doesn’t work in my setup (mprotect error 22). I attach gdb to a SYSDBA session, and execute:

(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 1: *0x60027c68

Then query the fields in x$kcrfws one by one. It turns out, memory address 0x60027c68 is accessed for the fields lwn_scn_bas and lwn_scn_wrp, and memory address 0x60027c98 is accessed for the fields on_disk_scn_bas and on_disk_scn_wrp.

So, what do we know now? It becomes apparent Oracle uses the kcscur3() function for reading SCN values. The function seems to be dynamic and can be used for multiple locations holding different types of SCNs. We witnessed reading the instance current SCN, the on disk SCN and the LWN SCN.

The on disk SCN (x$kcrfws.on_disk_scn_(bas|wrp)) seems to be a registration of up to which SCN is written by the logwriter, and the LWN SCN (x$kcrfws.lwn_scn_(bas|wrp)) is a registration of up to which SCN is in the current log write number (LWN). The log write number (LWN) seems to be a number appointed to groups of redo blocks for writing them in batch.

This information is needed to make more sense of how my foreground works. In order to make the tracing of the foreground more meaningful, we need to add a break on semctl() to understand when all redo vectors are copied into the public log buffer, and the foreground actually starts waiting on the log writer, and peeks at its progress. It is also handy to add breaks to semtimedop() and nanosleep(), so we know what logwrite mode is in use:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000000cc77970 <kcscur3>
	breakpoint already hit 6 times
        printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
(gdb) break semctl
Breakpoint 2 at 0x3bfdaeb030
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>printf "semctl\n"
(gdb) break semtimedop
Breakpoint 3 at 0x3bfdaeb060
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>printf "semtimedop\n"
(gdb) break nanosleep
Breakpoint 4 at 0x3bfde0ef90
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>printf "nanosleep\n"
(gdb) disable

I disabled all breakpoints (disable command), and do the insert. After the insert, the enable command enables all breakpoints, and watch what the function call sequence:

(gdb) c
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

So, after a foreground process has written its change vectors into the public log buffer and requests the log writer to write using semctl(), it doesn’t poll a single location for a SCN, but requests the on disk SCN and the current SCN! It also seemed my log writer was speedy enough to write, because there is no nanosleep() nor semtimedop() call.

It seems we need to manipulate the logwriter too, in order to get the call sequences I showed in my slides. In order to do that, I open another terminal session, and attached a debugger session to the lgwr too. At this point I should point out I made sure my Oracle ( database was using the single lgwr process and not the log writer workers by setting the “_use_single_log_writer” undocumented parameter to true. (do NOT do this on a production database; both setting an undocumented parameter without the blessing of oracle support, nor attach gdb to a log writer process!!)

I added a break to the io_getevents_0_4 function, which is the function for reaping asynchronous submitted IO, and then disable the breakpoint:

break io_getevents_0_4

I also disabled the breakpoints in the other debugger session attached to the foreground session, and insert another record. After the insertion, enable the breakpoints in both gdb sessions, and enter commit. Now the log writer debugger session will break on io_getevents_0_4, and the foreground debugger session should show a repeating pattern:

(gdb) enable
(gdb) c
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 60027c68, f6ab0118, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0

The nanosleep and kcscur3 calls for address 60027c98 and 60027c68 continue to be executed. After continuing the log writer debugger session, the foreground debugger session shows:

kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Bingo! We see the nanosleep() call, so we are in polling mode, not in post-wait mode. And because we artificially made the log writer stop from progressing , the foreground process is reading the on-disk SCN and LWN SCN, then calls nano sleep, then scans the on-disk and LWN SCNs again, etcetera, until the on-disk SCN gets higher than the foreground process commit SCN. Interestingly, it seems that once kcscur3() on address 0x60027c98 was able to identify the log writer progressed beyond it’s commit SCN, the next kcscur3() call does not read address 0x60027c68, but instead reads address 0x6001fbb0, alias the current SCN.

Okay, so this points to the on-disk SCN being responsible for commit progress. If this is actually true, we should now take a look at the log writer and see if the log writer indeed changes the on-disk SCN at memory location 0x60027c98 after the write. For this I can use the memory watch facility of the debugger again:

(gdb) awatch *0x60027c98
Hardware access (read/write) watchpoint 2: *0x60027c98
(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 3: *0x60027c68
(gdb) c

This will quite probably immediately break on access of these memory areas. The log writer is looking at these memory area’s a lot. Here’s a snippet of the logwriter approximately during the commit of the foreground:

(gdb) c
Hardware access (read/write) watchpoint 3: *0x60027c68

Old value = 11174890
New value = 11174892
0x000000000cc77ac4 in kcsnew3 ()
(gdb) c

Breakpoint 1, io_getevents_0_4 (ctx=0x7f165de70000, min_nr=2, nr=128, events=0x7ffc696611b8, timeout=0x7ffc696621b8)
    at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
(gdb) c
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x000000000cc77983 in kcscur3 ()
(gdb) c
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x0000000002d70bb9 in kcsadj3 ()
(gdb) c
Hardware access (read/write) watchpoint 2: *0x60027c98

Old value = 11174890
New value = 11174892
0x0000000002d70c13 in kcsadj3 ()

First we see a function called kcsnew3() change the value at 0x60027c68. Then we see the breakpoint at io_getevents() indicating it looks for submitted IOs to return, in other words: the log writer was asked to write something or found something to write. After the write we see the log writer reading the memory location 0x60027c98 using the kcscur3() function, just as our foreground process does. Then we find a function called kcsadj3() which first reads the on-disk SCN location 0x60027c98, and then changing it. This indicates the log writer is increasing the value of the on-disk SCN in 0x60027c98 after it has written using the function kcsadj3(), a safe bet is Kernel Cache Service Adjust, which the foreground in polling mode is reading to understand if the log writer has written the contents the process put in the public log buffer.

Of course this still isn’t the ultimate proof, I break at a few, arbitrary functions, there can be all kinds of other things going on. This means that still the foreground process can just keep the list of log buffer blocks and really use that, and only use the SCN values for verification, because I am looking at only a small subset of the functions it is executing.

However, there is a way to test this!
=> Another word of caution, the next manipulation can render your database into a smoking pile of bits and bytes! <=
The log writer mode needs to be polling. Also, for simplicity, in Oracle 12 “_use_single_log_writer” must be set to true, to be sure the log writer is performing all the tasks, and not log writer slaves.

In order to proof that a foreground session is truly looking at 0x60027c98 for the on-disk SCN to understand if the log writer has written its log buffer contents, we let the log writer stop before it has adjusted the on-disk SCN. That can be accomplished by attaching to the log writer with gdb, and break on io_getevents_0_4:

break io_getevents_0_4

Now the breakpoint is set, disabled, and execution is resumed.

The next thing to do, is go to the sqlplus foreground session and insert a row of data. Once the insert is done, go to the debugger session, stop execution, enter “enable”, and continue. The debugger resumed execution again, but now will break execution once it encounter io_getevents. Now go back to the foreground sqlplus session and enter commit.

This will trigger the log writer to write, and break execution because of the breakpoint. Because the log writer is stopped the foreground session will appear to be “hanging”. In fact what the foreground session is doing is sleeping in nanosleep(), then executing kcscur3() to look at the on-disk and LWN SCNs, then sleeping in nanosleep() again, etc.

This next part is experimental!
Because the log writer is stopped at the io_getevents() call, it didn’t execute kcsadj3() yet to update the on-disk SCN in 0x60027c98. What we will do, is read the current value in 0x60027c98, and increase the value using gdb. If the foreground is truly only waiting for the value at 0x60027c98, increasing the value at this memory address should give back the prompt of the sqlplus session, because it thinks the commit fully succeeded because the log writer performed all its functions.
First query the on-disk SCN:

(gdb) x/dw 0x60027c98
0x60027c98:	11175042

So, the on-disk SCN value is 11175042. Mind you the sqlplus session appears hanging, but really is “polling” the on-disk SCN, and the log writer has stopped executing, because of the breakpoint.

Let’s see if increasing the on-disk SCN makes the sqlplus session think the commit succeeded. Please mind I randomly increased the value from above by changing the 7 to 8 on the 5th position.

(gdb) set {int}0x60027c98 = 11185042

Voila! the foreground process continues and returns the prompt, indicating it is thinking the log writer has finished!

This means that Oracle uses the log writer on-disk SCN for processes to determine if their log buffer contents have been written to disk.

When the log writer is in post-wait mode, this mechanism is in place too, but the foreground process, once waiting, needs to wait for the log writer to post it via semctl(). There is one exception to this, which is true for both the polling and the post-wait mechanisms: right after a foreground process signalled the log writer to write, it will call kcscur3() looking for the on-disk SCN and if it finds the on-disk SCN beyond its commit SCN, there will not be a wait triggered. In all other cases the process will register a wait “log file sync”.

The kcscur3() function is a function to read and update SCNs. There are multiple SCNs, like current SCN, on-disk SCN and LWN SCN, which all can be read with kcscur3(). Also kcscur3() is not exclusive to commit, when querying the current_scn field in v$database, this function is used too.

When a process commits, right after it signalled the log writer, it checks for the log writer progress via the on-disk SCN. If the log writer was fast enough to have the contents flushed to disk and the on-disk SCN updated, the process will not wait for ‘log file sync’ at all.

Both log writer modi, post-wait and polling, look for the log writer write progress via the on-disk SCN. Both modi will show the wait ‘log file sync’ if the above “fast log writer exception” didn’t happen.

When the log writer is in post-wait mode, a process without the “fast log writer exception” will register itself in the post-wait queue and sleep on a semaphore waiting to be posted by the log writer, and register it is waiting for event ‘log file sync’. Still, it does keep track the on-disk SCN (twice) with a check of the LWN SCN in between, via calls to kcscur3(). After being signalled by the log writer, it again verifies the on-disk SCN and looks for the current SCN, after which it continues.

When the log writer is in polling mode, a process without the “fast log writer exception” will register a waiting for event ‘log file sync’. Now it sleeps in nanosleep() and only executes two checks: the on-disk SCN and the LWN SCN, via calls to kcscur3(). Once the log writer progressed the on-disk SCN beyond the process’ commit SCN, it will continue.

This is a question that I played with for a long time. There have been statements on logical IO performance (“Logical IO is x times faster than Physical IO”), but nobody could answer the question what the actual logical IO time is. Of course you can see part of it in the system and session statistics (v$sysstat/v$sesstat), statistic name “session logical reads”. However, if you divide the number of logical reads by the total time a query took, the logical IO time is too high, because then it assumed all the time the query took was spend on doing logical IO, which obviously is not the case, because there is time spend on parsing, maybe physical IO, etc. Also, when doing that, you calculate an average. Averages are known to hide actual behaviour.

Luckily, with Redhat Enterprise Linux and Oracle Linux version 7, there is the kernel version 3.10 as the stock RedHat kernel version. This kernel version supports systemtap userspace return probes. A systemtap probe is a way to trigger an action when a certain action (the probed event) is started, a return probe is an action triggered when an action is finished. The Oracle UEK kernel version 3 at the time of writing is version 3.8, which does not support this.

My current knowledge is consistent reads are handled by the Oracle database C function kcbgtcr(). Current reads are quite probably handled by the function kcbgcur(). Having said that, I know of at least one exception to this: scans on hybrid columnar compressed segments do not use the kcbgtcr() function.

Please mind all kernel code translations, like kcbgtcr (kernel cache buffers get consistent read) are pure guesses, albeit somewhat educated guesses, as there are a lot of internet publications naming these, including My Oracle Support itself.

With the 3.10 version kernel, we can create a small systemtap script to measure the time between the start and stop of the kcbgtcr routine. If you want to experiment with this, it is probably best to download the latest version of systemtap and compile it yourself.. This is the reason you see /usr/local/bin/stap in the shebang.


global latency

probe begin {

probe process("/u01/app/oracle/product/").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()

probe process("/u01/app/oracle/product/").function("kcbgtcr").return {
	printf("<kcbgtcr, latency(us): %d\n", gettimeofday_us() - latency[pid()])

Now make the script executable, and run it against a database session:

# ./lio.stap -x 3877

Next, I execute a scan in the database foreground session, and watch the systemtap script output:
(execution of the systemtap script can be cancelled by pressing CTRL-c)

<kcbgtcr, latency(us): 2
<kcbgtcr, latency(us): 79542
<kcbgtcr, latency(us): 4
<kcbgtcr, latency(us): 2
<kcbgtcr, latency(us): 2
<kcbgtcr, latency(us): 13
<kcbgtcr, latency(us): 1

This shows the function being executed by the database session. However, it seems the time spend in the kcbgtcr() function is not consistent.

In order to get a better overview, we can add a histogram of the kcbgtcr latencies:

global latency, latency_histogram

probe begin {

probe process("/u01/app/oracle/product/").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()

probe process("/u01/app/oracle/product/").function("kcbgtcr").return {
	latency[pid()] = gettimeofday_us() - latency[pid()]
	latency_histogram <<< latency[pid()]
	printf("<kcbgtcr, latency(us): %d\n", latency[pid()])

probe end {
	if ( @count(latency_histogram) > 0 ) {
		printf("\n==kcbgtcr latency==\n")

Next, attach the systemtap script to the database session again, and issue a scan. Once the scan in the database session is finished, press CTRL-c to finish the systemtap script:


==kcbgtcr latency==
  value |-------------------------------------------------- count
      0 |                                                       0
      1 |@@@@@@@@@@@@@@@@@@@@                                5998
      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14284
      4 |                                                     268
      8 |                                                      58
     16 |                                                     137
     32 |                                                      25
     64 |                                                       1
    128 |                                                      15
    256 |                                                      69
    512 |                                                       1
   1024 |                                                       0
   2048 |                                                       0
   4096 |                                                       1
   8192 |                                                       5
  16384 |                                                      35
  32768 |                                                      39
  65536 |                                                       8
 131072 |                                                       0
 262144 |                                                       1

Okay, the majority is 2 microseconds, but outside the 1 and 2 microseconds buckets, there are a lot of executions that totally fall outside of these, up to 262144 microseconds (262 milliseconds)!

What could cause these huge changes in logical IO time?

At this point I am quite much turning the squelch down and make a lot of information visible (this is a warning!). Here is the systemtap script I am using:

global latency, latency_histogram

probe begin {

probe process("/u01/app/oracle/product/").function("kc*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
	if ( probefunc() == "kcbgtcr" )
			latency[pid()] = gettimeofday_us()
probe process("/u01/app/oracle/product/").function("kc*").return {
	printf("%s < %s", thread_indent(-1), ppfunc())
	if ( ppfunc() == "kcbgtcr" ) {
		latency[pid()] = gettimeofday_us() - latency[pid()]
		latency_histogram <<< latency[pid()]
		printf(" -- latency(us): %d", latency[pid()])
probe process("/u01/app/oracle/product/").function("qe*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
probe process("/u01/app/oracle/product/").function("qe*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
probe process("/u01/app/oracle/product/").function("kt*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
probe process("/u01/app/oracle/product/").function("kt*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
probe process("/u01/app/oracle/product/").function("kd*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
probe process("/u01/app/oracle/product/").function("kd*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
probe process("/u01/app/oracle/product/").function("opiosq0") {
probe process("/u01/app/oracle/product/").function("opiexe") {
probe process("/u01/app/oracle/product/").function("opifch2") {
probe process("/u01/app/oracle/product/").function("opiclo") {
probe process("/u01/app/oracle/product/").function("kslwtbctx") {
probe process("/u01/app/oracle/product/").function("kslwtectx") {
probe process("/lib64/").function("io_submit") {
probe process("/lib64/").function("io_getevents_0_4") {
probe process("/lib64/").function("pread64") {

Warning! This script will run for a long time before it is compiled as kernel module!! Systemtap works by dynamically creating a kernel module from the system tap script, and insert it into the kernel. Because I injected a huge chunk of code to run in the kernel a lot of things are potentially influenced. I found the Oracle database to drop core’s for example.

After running this against a session and doing a very simple full table scan with a count(*), I searched for the logical IO, in other words where the kcbgrcr() function was performed. Here it is:

166149 oracle_5487_tes(5487):     < kdst_fetch
166155 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166159 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets
166162 oracle_5487_tes(5487):     > kdst_fetch
166164 oracle_5487_tes(5487):      > kdst_fetch0
166167 oracle_5487_tes(5487):       > kcbipnns
166170 oracle_5487_tes(5487):       < kcbipnns
166173 oracle_5487_tes(5487):       > kcbrls
166177 oracle_5487_tes(5487):        > kcbrls_direct
166181 oracle_5487_tes(5487):        < kcbrls_direct
166183 oracle_5487_tes(5487):       < kcbrls
166185 oracle_5487_tes(5487):       > kdstsnb
166188 oracle_5487_tes(5487):       < kdstsnb
166191 oracle_5487_tes(5487):       > ktrget2
166194 oracle_5487_tes(5487):        > ktsmg_max_query
166197 oracle_5487_tes(5487):        < ktsmg_max_query
166200 oracle_5487_tes(5487):        > kcbgtcr
166204 oracle_5487_tes(5487):         > kcbldrget
166209 oracle_5487_tes(5487):          > kcblgt
166211 oracle_5487_tes(5487):          < kcblgt
166213 oracle_5487_tes(5487):          > kcbzvb
166216 oracle_5487_tes(5487):           > kcbhvbo
166219 oracle_5487_tes(5487):            > kcbhxoro
166222 oracle_5487_tes(5487):            < kcbhxoro
166224 oracle_5487_tes(5487):           < kcbhvbo
166226 oracle_5487_tes(5487):          < kcbzvb
166228 oracle_5487_tes(5487):          > kcbztek_trace_blk
166230 oracle_5487_tes(5487):          < kcbztek_trace_blk
166233 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
166236 oracle_5487_tes(5487):           > kcbtgobj
166239 oracle_5487_tes(5487):            > kd4obj
166251 oracle_5487_tes(5487):            < kd4obj
166254 oracle_5487_tes(5487):           < kcbtgobj
166257 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
166260 oracle_5487_tes(5487):         < kcbldrget
166262 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 62
166265 oracle_5487_tes(5487):        > ktrgcm
166268 oracle_5487_tes(5487):         > kcbcge
166271 oracle_5487_tes(5487):         < kcbcge
166273 oracle_5487_tes(5487):         > ktcckv
166275 oracle_5487_tes(5487):         < ktcckv
166285 oracle_5487_tes(5487):        < ktrgcm
166286 oracle_5487_tes(5487):       < ktrget2
166289 oracle_5487_tes(5487):       > kdr9ir2blk
166291 oracle_5487_tes(5487):       < kdr9ir2blk
166293 oracle_5487_tes(5487):      < kdst_fetch0
166294 oracle_5487_tes(5487):     < kdst_fetch
166297 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166300 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

How to read: “>” means entering a function, “<" means return from a function.
I selected a piece of the systemtap output/tracing where the counting procedure is visible.

The first row is "< kdst_fetch" in other words: returning from kernel data scan table fetch. So this function has performed a fetch. Not surprisingly, the next function is qeaeCn1SerialRowsets, which I think is the count function.

After the count, the kdst_fetch function is entered again, then kdst_fetch0. The next functions are kcbipnns, kcbrls and kcbrls_direct. Probably these functions are related to pinning and releasing blocks. See Alexander Anokhin’s post on that.

A few functions further we encounter kcbgrcr(). Interestingly, it is followed by the kcbldrget() function, which is kernel cache buffers direct path loader get. In other words, Oracle has chosen to do a direct path read, because this is the function that starts off the direct path read code. The next function, kcblget() requests a block. This means that blocks read in this code path are from PGA memory, not SGA (buffer cache) memory.

The next functions, kcbzvb, kcbhvbo and kcbhxoro are the block XOR checking functionality. The function actually performing this (sxorchk) is not visible because I did not probe for it.

The next functions are not clear to me at this point (kcbztek_trace_blk, kcbl_objdchk_with_cache_reread, kcbtgobj, kd4obj). Then the kcbldrget returns, and the kcbgtcr function too.

Here we can see that probing a huge number of functions does influence the performance of a process. While previously we saw kcbgtcr() took 2us, probably because of all the probes the time the kcbgtcr function took 62us.

The next piece of execution is done by the ktrgcm function. This is handling undo and buffer cleanout. After that function, we cross kdr9ir2blk (function unknown), after which kdst_fetch0 and kdst_fetch return, and the count is done using qeaeCn1SerialRowsets.

Okay, so far so good, but the question was: why are there logical IOs/kcbgtcr() executions that take excessively more time?

After a little searching, I was able to find a very good reason:

169252 oracle_5487_tes(5487):     > kdst_fetch
169254 oracle_5487_tes(5487):      > kdst_fetch0
169256 oracle_5487_tes(5487):       > kcbipnns
169258 oracle_5487_tes(5487):       < kcbipnns
169259 oracle_5487_tes(5487):       > kcbrls
169262 oracle_5487_tes(5487):        > kcbrls_direct
169264 oracle_5487_tes(5487):        < kcbrls_direct
169265 oracle_5487_tes(5487):       < kcbrls
169267 oracle_5487_tes(5487):       > kdstsnb
169269 oracle_5487_tes(5487):       < kdstsnb
169270 oracle_5487_tes(5487):       > ktrget2
169272 oracle_5487_tes(5487):        > ktsmg_max_query
169274 oracle_5487_tes(5487):        < ktsmg_max_query
169275 oracle_5487_tes(5487):        > kcbgtcr
169278 oracle_5487_tes(5487):         > kcbldrget
169280 oracle_5487_tes(5487):          > kcblgt
169283 oracle_5487_tes(5487):           > kcblrs
169286 oracle_5487_tes(5487):            > kdsdrcbk
169288 oracle_5487_tes(5487):            < kdsdrcbk
169291 oracle_5487_tes(5487):            > kcbldio
169296 oracle_5487_tes(5487):             > kcfaioe
169299 oracle_5487_tes(5487):             < kcfaioe
169301 oracle_5487_tes(5487):             > kcflbi
169314 oracle_5487_tes(5487):              > kcf_hard_ftype_check
169317 oracle_5487_tes(5487):              < kcf_hard_ftype_check
169416 oracle_5487_tes(5487):             < kcflbi
169420 oracle_5487_tes(5487):            < kcbldio
169425 oracle_5487_tes(5487):            > kcblcffln
169429 oracle_5487_tes(5487):            < kcblcffln
169432 oracle_5487_tes(5487):           < kcblrs
169434 oracle_5487_tes(5487):           > kcblsinc
169438 oracle_5487_tes(5487):           < kcblsinc
169440 oracle_5487_tes(5487):           > kcblcio
169443 oracle_5487_tes(5487):            > kcblci
169447 oracle_5487_tes(5487):             > kcflci
169481 oracle_5487_tes(5487):              > kcflwi
179477 oracle_5487_tes(5487):              < kcflwi
179484 oracle_5487_tes(5487):             < kcflci
179488 oracle_5487_tes(5487):            < kcblci
179491 oracle_5487_tes(5487):           < kcblcio
179494 oracle_5487_tes(5487):          < kcblgt
179497 oracle_5487_tes(5487):          > kcbzvb
179509 oracle_5487_tes(5487):           > kcbhvbo
179513 oracle_5487_tes(5487):            > kcbhxoro
179516 oracle_5487_tes(5487):            < kcbhxoro
179518 oracle_5487_tes(5487):           < kcbhvbo
179520 oracle_5487_tes(5487):          < kcbzvb
179539 oracle_5487_tes(5487):          > kcbztek_trace_blk
179544 oracle_5487_tes(5487):          < kcbztek_trace_blk
179549 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
179555 oracle_5487_tes(5487):           > kcbtgobj
179559 oracle_5487_tes(5487):            > kd4obj
179562 oracle_5487_tes(5487):            < kd4obj
179563 oracle_5487_tes(5487):           < kcbtgobj
179565 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
179569 oracle_5487_tes(5487):         < kcbldrget
179571 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 10295
179576 oracle_5487_tes(5487):        > ktrgcm
179580 oracle_5487_tes(5487):         > kcbcge
179582 oracle_5487_tes(5487):         < kcbcge
179585 oracle_5487_tes(5487):         > ktcckv
179587 oracle_5487_tes(5487):         < ktcckv
179589 oracle_5487_tes(5487):        < ktrgcm
179591 oracle_5487_tes(5487):       < ktrget2
179593 oracle_5487_tes(5487):       > kdr9ir2blk
179606 oracle_5487_tes(5487):       < kdr9ir2blk
179609 oracle_5487_tes(5487):      < kdst_fetch0
179611 oracle_5487_tes(5487):     < kdst_fetch
179616 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
179620 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

If you go through the calls, you will see that the start is exactly the same, until line 17. After kcbgtcr>kcbldrgt (consistent read request function choosing direct path reads), the kcblgt function does not return immediately, but rather starts off a lot of extra code path.

This code path fetches new blocks. The most striking thing here is that kcbgtcr requests the blocks, and physical IO is done on behalf of the consistent read request, in other words: on behalf of the logical IO. This is obvious if you think about it, if you want to read blocks you have to look if they are available or not, and if they are not available, you have to fetch them.

As a conclusion: if kcbgtcr() together with kcbgcur() and a couple of other functions is considered the logical IO, then a logical IO has a variable time, instead of a consistent one. Because physical IO is requested inside the logical IO request, technically the physical IO is part of the logical IO. This means that it is technically incorrect to state that a physical IO is slower than logical IO, because a physical IO is part of the logical IO request that needed that physical IO. and as such a physical IO can not be slower than a logical IO

This article is written with examples taken from an (virtualised) Oracle Linux 6u6 X86_64 operating system, and Oracle database version However, I think the same behaviour is true for Oracle 11 and 10 and earlier versions.

Probably most readers of this blog are aware that a “map” of mapped memory for a process exists for every process in /proc, in a pseudo file called “maps”. If I want to look at my current process’ mappings, I can simply issue:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 786125                             /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 786125                             /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0080a000-0080b000 rw-p 0000a000 fc:00 786125                             /bin/cat
01243000-01264000 rw-p 00000000 00:00 0                                  [heap]
345b000000-345b020000 r-xp 00000000 fc:00 276143                         /lib64/
345b21f000-345b220000 r--p 0001f000 fc:00 276143                         /lib64/
345b220000-345b221000 rw-p 00020000 fc:00 276143                         /lib64/
345b221000-345b222000 rw-p 00000000 00:00 0
345b800000-345b98a000 r-xp 00000000 fc:00 276144                         /lib64/
345b98a000-345bb8a000 ---p 0018a000 fc:00 276144                         /lib64/
345bb8a000-345bb8e000 r--p 0018a000 fc:00 276144                         /lib64/
345bb8e000-345bb8f000 rw-p 0018e000 fc:00 276144                         /lib64/
345bb8f000-345bb94000 rw-p 00000000 00:00 0
7f8f69686000-7f8f6f517000 r--p 00000000 fc:00 396081                     /usr/lib/locale/locale-archive
7f8f6f517000-7f8f6f51a000 rw-p 00000000 00:00 0
7f8f6f524000-7f8f6f525000 rw-p 00000000 00:00 0
7fff2b5a5000-7fff2b5c6000 rw-p 00000000 00:00 0                          [stack]
7fff2b5fe000-7fff2b600000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

What we see, is the start and end address, the rights (rwx), absence of rights is shown with a ‘-‘, and an indication of the mapped memory region is (p)rivate or (s)hared. In this example, there are no shared memory regions. Then an offset of the mapped file, then the device (major and minor device number). In our case sometimes this is ‘fc:00’. If you wonder what device this might be:

$ echo "ibase=16; FC" | bc
$ ls -l /dev | egrep 252,\ *0
brw-rw---- 1 root disk    252,   0 Mar 23 14:19 dm-0
$ sudo dmsetup info /dev/dm-0
Name:              vg_oggdest-lv_root
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      252, 0
Number of targets: 2
UUID: LVM-q4nr4HQXgotaaJFaGF1nzd4eZPPTohndgz553dw6O5pTlvM0SQGLFsdp170pgHuw

So, this is a logical volume lv_root (in the volume group vg_oggdest).

Then the inode number (if a file was mapped, if anonymous memory was mapped the number 0 is shown), and then the path if a file was mapped. This is empty for anonymous mapped memory (which is memory which is added to a process using the mmap() call). Please mind there are also special regions like: [heap],[stack],[vdso] and [vsyscall].

Okay, so far I’ve shown there is a pseudo file called ‘maps’ which shows mapped memory and told a bit about the fields in the file. Now let’s move on to the actual topic of this blog: the Oracle database SGA memory, and the indicator this is deleted!

In this example I pick the maps file of the PMON process of an Oracle database. Of course the database must use system V shared memory, not shared memory in /dev/shm (which is typically what you see when Oracle’s automatic memory (AMM) feature is used). This is a snippet from the maps file of the pmon process on my server:

 cat /proc/2895/maps
00400000-1093f000 r-xp 00000000 fc:00 1326518                            /u01/app/oracle/product/
10b3e000-10dbf000 rw-p 1053e000 fc:00 1326518                            /u01/app/oracle/product/
10dbf000-10df0000 rw-p 00000000 00:00 0
12844000-1289d000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:04 111902723                          /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:04 111902723                          /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:04 111935492                          /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:04 111968261                          /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:04 112001030                          /SYSV6ce0e164 (deleted)
345b000000-345b020000 r-xp 00000000 fc:00 276143                         /lib64/
345b21f000-345b220000 r--p 0001f000 fc:00 276143                         /lib64/

If you look closely, you see the oracle executable first, with two entries, one being readonly (r-xp), the other being read-write (rw-p). The first entry is readonly because it is shared with other processes, which means that there is no need for all the processes to load the Oracle database executable in memory, it shares the executable with other process. There’s much to say about that too, which should be done in another blogpost.

After the executable there are two anonymous memory mappings, of which one is the process’ heap memory.

Then we see what this blogpost is about: there are 5 mappings which are shared (r–s and rw-s). These are the shared memory regions of the Oracle database SGA. What is very odd, is that at the end of the lines it says “(deleted)”.

Of course we all know what “deleted” means. But what does it mean in this context? Did somebody delete the memory segments? Which actually can be done with the ‘ipcrm’ command…

If you go look at the maps of other Oracle processes and other databases you will see that every database’s shared memory segment are indicated as ‘(deleted)’.

Word of warning: only execute the steps below on a test environment, do NOT do this in a production situation.

In order to understand this, the best way to see what actually is happening, is starting up the Oracle database with a process which is traced with the ‘strace’ utility with the ‘-f’ option set (follow). Together with the ‘-o’ option this will produce a (long) file with all the system calls and the arguments of the calls which happened during startup:

$ strace -f -o /tmp/oracle_startup.txt sqlplus / as sysdba

Now start up the database. Depending on your system you will notice the instance startup takes longer. This is because for every system call, strace needs to write a line in the file /tmp/oracle_start.txt. Because of this setup, stop the database as soon as it has started, on order to stop the tracing from crippling the database performance.

Now open the resulting trace file (/tmp/oracle_startup.txt) and filter it for the system calls that are relevant (calls with ‘shm’ in their name):

$ grep shm /tmp/oracle_startup.txt | less

Scroll through the output until you see a line alike ‘shmget(IPC_PRIVATE, 4096, 0600) = 130777091’:

4545  shmget(IPC_PRIVATE, 4096, 0600)   = 130777091
4545  shmat(130777091, 0, 0)            = ?
4545  shmctl(130777091, IPC_STAT, 0x7fff9eb9da30) = 0
4545  shmdt(0x7f406f2ba000)             = 0
4545  shmctl(130777091, IPC_RMID, 0)    = 0
4545  shmget(IPC_PRIVATE, 4096, 0600)   = 130809859
4545  shmat(130809859, 0, 0)            = ?
4545  shmctl(130809859, IPC_STAT, 0x7fff9eb9da30) = 0
4545  shmdt(0x7f406f2ba000)             = 0
4545  shmctl(130809859, IPC_RMID, 0)    = 0

What we see here is a (filtered) sequence of systems calls that could explain the status deleted of the shared memory segments. If you look up what process id is in front of these shm system calls, you will see it’s the foreground process starting up the instance. If you look closely, you’ll that there is a sequence which is repeated often:

1. shmget(IPC_PRIVATE, 4096, 0600) = 130777091
The system call shmget allocates a shared memory segment of 4 kilobyte, rights set to 600. The return value is the shared memory identifier of the requested shared memory segment.

2. shmat(130777091, 0, 0) = ?
The system call shmat attaches the a shared memory segment to the process’ address space. The first argument is the shared memory identifier, the second argument is the address to attach the segment to. If the argument is zero, like in the call above, it means the operating system is tasked with finding a suitable (non used) address. The third argument is for flags, the value zero here means no flags are used. The returncode (here indicated with a question mark) is the address at which the segment is attached. This being a question mark means strace is not able to read the address, which is a shame, because we can’t be 100% certain at which memory address this shared memory segment is mapped.

3. shmctl(130777091, IPC_STAT, 0x7fff9eb9da30) = 0
The system call shmctl with the argument IPC_STAT has the function to read the (kernel) shared memory information of the shared memory identifier indicated by the first argument, and write it at the memory location in the third argument in a struct called shmid_ds.

4. shmdt(0x7f406f2ba000) = 0
With this system call, the shared memory segment is detached from the process’ address space. For the sake of the investigation, I assumed that the address in this call is the address which is returned by the shmat() call earlier.

5. shmctl(130777091, IPC_RMID, 0) = 0
This is another shared memory control system call, concerning our just created shared memory segment (shared memory identifier 130777091), with the command ‘IPC_RMID’. This is what the manpage says about IPC_RMID:

       IPC_RMID  Mark the segment to be destroyed.  The segment will only  actually  be  destroyed
                 after the last process detaches it (i.e., when the shm_nattch member of the asso-
                 ciated structure shmid_ds is zero).  The caller must be the owner or creator,  or
                 be privileged.  If a segment has been marked for destruction, then the (non-stan-
                 dard) SHM_DEST flag of the shm_perm.mode field in the associated  data  structure
                 retrieved by IPC_STAT will be set.

What I thought this means was:
It looked like to me the database instance starts building up its shared memory segments per 4096 page. Because IPC_RMID only marks the segment to be destroyed, and because it will only be truly destroyed when there are no processes attached to the shared memory segment, it looked like to me the background processes were pointed to the shared memory segment which was marked destroyed (in some way I hadn’t discovered yet), which meant the shared memory segment would actually survive and all database processes can use it. If ALL the database processes would be killed for any reason, for example with a shutdown abort, the processes would stop being connected to the shared memory segment, which would mean the shared memory segment would vanish automatically, because it was marked for destruction.
Sounds compelling, right?

Well…I was wrong! The sequence of creating and destroying small shared memory segments is done, but it turns out these are truly destroyed with the shmctl(…,IPC_RMID,…) call. I don’t know why the sequence of creating shared memory segments is happening.

I started looking for the actual calls that create the final, usable shared memory segments in the /tmp/oracle_startup.txt file. This is actually quite easy to do; first look up the shared memory segment identifiers using the sysresv utility (make sure the database’s ORACLE_HOME and ORACLE_SID are set):

$ sysresv
...a lot of other output...
Shared Memory:
197394436	0x00000000
197427205	0x00000000
197361667	0x00000000
197459974	0x6ce0e164
1015811 	0xd5cdbca4
Oracle Instance alive for sid "dest"

Actually the ‘sysresv’ utility (system remove system V memory I think is what the name means) has the task of removing memory segments if there is no instance left to use them. It will not remove the memory segments if it finds the instance alive. It prints out a lot of information as a bonus.

Now that we got the shared memory identifiers, simply search in the trace file generated by strace, and search for the creation of the memory segment with the identifiers: (please mind searching with ‘less’ is done with the forward slash)

$ less /tmp/oracle_startup.txt
9492  shmget(IPC_PRIVATE, 905969664, IPC_CREAT|IPC_EXCL|0640) = 197394436
9492  shmat(197394436, 0x60400000, 0)   = ?
9492  times(NULL)                       = 430497743
9492  write(4, " Shared memory segment allocated"..., 109) = 109
9492  write(4, "\n", 1)                 = 1

Aha! here we see shmget() again, but now with a size (905969664) that looks much more like a real shared memory segment size used by the database! After the shared memory identifier is created, the process attaches it to its addressing space with shmat() to a specific memory address: 0x60400000.

The next thing to do, is to look for any shmctl() call for this identifier. Oracle could still do the trick of marking the segment for destruction…
…But…there are no shmctl() calls for this identifier, nor for any of the other identifiers shown with the sysresv utility. This is rather odd, because Linux shows them as “(deleted)”. There ARE dozens of shmat() calls, of the other (background) processes forked from the starting process when they attach to the shared memory segments.

So, conclusion at this point is Linux shows the shared memory segments as deleted in ‘maps’, but the Oracle database does not mark the segments for destruction after creation. This means that either Linux is lying, or something mysterious is happening in the Oracle executable which I didn’t discover yet.

I could only think of one way to verify what is truly happening here. That is to create a program myself that uses shared memory, so I have 100% full control over what is happening, and can control every distinct step.

This is what I came up with:

#include <stdio.h>
#include <sys/shm.h>
#include <sys/stat.h>

int main ()
  int segment_id;
  char* shared_memory;
  struct shmid_ds shmbuffer;
  int segment_size;
  const int shared_segment_size = 0x6400;

  /* Allocate a shared memory segment.  */
  segment_id = shmget (IPC_PRIVATE, shared_segment_size,
                     IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
  printf ("1.shmget done\n");
  /* Attach the shared memory segment.  */
  shared_memory = (char*) shmat (segment_id, 0, 0);
  printf ("shared memory attached at address %p\n", shared_memory);
  printf ("2.shmat done\n");
  /* Determine the segment's size. */
  shmctl (segment_id, IPC_STAT, &shmbuffer);
  segment_size  =               shmbuffer.shm_segsz;
  printf ("segment size: %d\n", segment_size);
  printf ("3.shmctl done\n");
  /* Write a string to the shared memory segment.  */
  sprintf (shared_memory, "Hello, world.");
  /* Detach the shared memory segment.  */
  shmdt (shared_memory);
  printf ("4.shmdt done\n");

  /* Deallocate the shared memory segment.  */
  shmctl (segment_id, IPC_RMID, 0);
  printf ("5.shmctl ipc_rmid done\n");

  return 0;

(I took the code from this site, and modified it a bit for my purposes)
If you’ve got a linux system which is setup with the preinstall rpm, you should be able to copy this in a file on your (TEST!) linux database server, in let’s say ‘shm.c’, and compile it using ‘cc shm.c -o smh’. This will create an executable ‘shm’ from this c file.

This program does more or less the same sequence we saw earlier:
1. Create a shared memory identifier.
2. Attach to the shared memory identifier.
3. Get information on the shared memory segment in a shmid_ds struct.
4. Detach the shared memory segment.
5. Destroy it using shmctl(IPC_RMID).

What I did was have two terminals open, one to run the shm program, and one to look for the results of the steps.

Step 1. (shmget)

$ ./shm
1. shmget done

When looking with ipcs, you can see the shared memory segment which is created because of the shmget() call:

$ ipcs -m

------ Shared Memory Segments --------
0x00000000 451608583  oracle     600        25600      0

when looking in the address space of the process running the shm program, the shared memory segment is not found. This is exactly what I expect, because it’s only created, not attached yet.

Step 2. (shmat)

shared memory attached at address 0x7f3c4aa6e000
2.shmat done

Of course the shared memory segment is still visible with ipcs:

0x00000000 451608583  oracle     600        25600      1

And we can see from ipcs in the last column (‘1’) that one process attached to the segment. Of course exactly what we suspected.
But now that we attached the shared memory to the addressing space, it should be visible in maps:

7f3c4aa6e000-7f3c4aa75000 rw-s 00000000 00:04 451608583                  /SYSV00000000 (deleted)

Bingo! The shared memory segment is visible, as it should be, because we just attached it with shmat(). But look: it’s deleted already according to Linux!

However I am pretty sure, as in 100% sure, that I did not do any attempts to mark the shared memory segment destroyed or do anything else to make it appear to be deleted. So, this means maps lies to us.

So, the conclusion is the shared memory Oracle uses is not deleted, it’s something that Linux shows us, and is wrong. When looking at the maps output again, we can see the shared memory identifier is put at the place of the inode number. This is handy, because it allows you to take the identifier, and look with ipcs for shared memory segments and understand which specific shared memory segment a process is using. It probably means that maps tries to look up the identifier number as inode number, which it will not be able to find, and then comes to the conclusion that it’s deleted.

However, this is speculation. Anyone with more or better insight is welcome to react on this article.

%d bloggers like this: