Oracle DBAs who are so old that they remember the days before Oracle 11.2 probably remember the tuning efforts for latches. I can still recall the latch number for cache buffers chains from the top of my head: number 98. In the older days this was another number, 157.

But it seems latches have become less of a problem in the modern days of Oracle 11.2 and higher. Still, when I generate heavy concurrency I can see some latch waits. (I am talking about you and SLOB mister Closson).

I decided to look into latches on Oracle instance on Oracle Linux 7. This might also be a good time to go through how you think they work for yourself, it might be different than you think or have been taught.

In order to understand how latching works, I searched for Oracle related traces. I could find event 10005, but it did not return anything latching related. My current understanding is that event 10005 is Oracle KST tracing, for which the results are put in X$TRACE.

Luckily, I could get a great headstart by studying the work of Andrey Nikolaev. However, that work seems to be strictly Solaris based.

I decided to take a look on how this works on Linux. In order to do this, I did setup a system for the specific purpose of this test. Disclaimer: The techniques below are for educational purposes only, and should never be done on a real database!

The work of Andrey shows kslgetl() as the overal latching function, which does:
– sskgslgf (immediate latch get)
– kslges (wait latch get)
— kskthbwt
— kslwlmod (setup wait list)
— sskgslgf (immediate latch get)
— skgpwwait (sleep latch get)
— semop

In order to do predictable latch gets and generate latch misses, in a very predictable way, I used the method that Andrey shows on his website (which he attributed to Tanel), which is using oradebug call to call the latch get function. In order to get latch waits, you need at least two processes doing something with a latch, one holding the latch, and another one requesting it. In order to facilitate this, I setup two sqlplus / as sysdba sessions.

Taking a latch manually can be done using the kslgetl or the ksl_get_shared_latch functions. Freeing a latch is done using the kslfre function. The kslfre function just takes the latch address as argument. The kslgetl and ksl_get_shared_latch functions take the following arguments:
1-latch address
2-immediate get (0 means yes, 1 means no)
3-where (X$KSLLW.INDX)
5-mode (8=shared,16=exclusive; only for ksl_get_shared_latch function)

Immediate mode get for a non shared latch.
An immediate mode get just tries to fetch a latch once. In order to manually do an immediate latch get, I done:

Session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 1

This means session 1 has gotten latch 0x60023a80. I gotten the latch in willing to wait mode, but that does not really matter here. The session did get the latch.

If you want to check this, simply use V$LATCHHOLDER view to verify (in another session):

SQL> select * from v$latchholder;

        PID        SID LADDR            NAME                                                                   GETS     CON_ID
---------- ---------- ---------------- ---------------------------------------------------------------- ---------- ----------
        38        134 0000000060023A80 cache table scan latch                                                  709          0

Please mind that with a latch manually gotten, you need to free the latch before you try to do anything else with your session, otherwise you encounter a (non critical) ORA-600. Freeing a latch is done using oradebug call kslfre and one argument: the latch address.

Session 2:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 0

The ‘Function returned 0′ means the immediate latch get failed. As you can see this was an immediate get because the second argument is 0.

In order to understand which functions are involved, I first used the perf record linux utility. However, because the immediate get does not spin, and a latch get by all means has speed/low overhead as one of the principal design criterions, I could not see the function.

This meant I needed to go to one of the tools I have used extensively in the past: gdb (the GNU debugger). You need to attach to the Oracle database server shared process locally on the database server. Here is what I did:

# gdb -p 4600
(gdb) set pagination off
(gdb) rbreak ^ksl.*
Breakpoint 262 at 0x8ea7b0
<function, no debug info> ksl_event_stats_rollup;
(gdb) commands 1-262
type commands for breakpoint(s) 1-262, one per line.
End with a line saying just "end".
(gdb) c

The first gdb function turns off having to press enter for every full screen of output of gdb, the second function breaks on all functions in the oracle executable that start with ‘ksl’. The commands command creates commands that are executed in gdb if breakpoints 1-262 are encountered, which is ‘c': continue.

Now, with the debugger set, I executed the kslgetl function again:

Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 247, 0x000000000c8e10a0 in kslws_check_waitstack ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 242, 0x000000000c8ddcb0 in kslgetl ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()

It is important to understand Oracle does a lot of other stuff outside the latch get via kslgetl. Most of the stuff above are functions which start with kslwt, which is the Oracle wait interface. A couple of times a shared latch is taken (as can be seen by the function ksl_get_shared_latch), and freed (kslfre). The important part here is: kslgetl is executed once, and did not go into any other function to try to get the latch.

Getting a latch in willing to wait mode for a non shared latch.
Now let’s do something a bit more exciting: getting a taken latch in willing to wait mode. The first session can do exactly the same, just take the latch. The second session needs to be changed a little bit to indicate it is willing to wait:

SQL> oradebug cell kslgetl 0x60023a80 1 0 2442

This will call additional functions. In order to understand what these functions are, I used perf record, perf report and perf script.

I was able to create a smaller, more specific gdb script to see what is going on:

break kslgetl
    printf "kslgetl laddr:%x, willing:%d, where:%d, why:%d\n", $rdi, $rsi, $rdx, $rcx
break kslges
    printf "kslges %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
break kslwlmod
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
break skgpwwait
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
break sskgpwwait
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
break semop
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx

This is how the gdb output looks like when the latch get in willing to wait mode is executed:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
kslwlmod 13311368, -1780327896, 1610758784, 1
skgpwwait 13311608, -1767360, -1780326976, 0
sskgpwwait 13311608, -1767360, -1780326976, 0
semop 360451, 13310840, 1, -1

Interestingly, if the latch is not taken, this is how the latch get sequence looks like:

kslgetl laddr:60023a80, willing:1, where:0, why:2442

In other words, for getting a non shared latch in willing to wait mode:

1-the function kslgetl is called, which tries to fetch the latch.
If the latch can be taken, the function returns, if not:
2-the function kslges (kernel service latch get spinning) is called, which supposedly also tries to take the same latch.
If the latch still can not be taken, the next function is:
3-the function kslwlmod (kernel service latch waiting list modify) is entered.
In this function the process registers itself as waiting in the post/wait list.
4-the function skgpwwait (system kernel generic post/wait wait) is entered.
This function sets up the waiting for the process so it can be posted.
5-the function sskgpwwait (system system kernel generic post/wait wait)
My current understanding is the ‘ss’ function contain the platform specific code for database functions.
6-the (operating system) function semop (semaphore operation) is called.
This will make the process sleep waiting on a semaphore (operating system signalling mechanism). This way, the process will not be runnable on the CPU unless the semaphore is posed.

The information that is missing here, is the spinning. The earlier work of Andrey Nikolaev showed that in the Solaris port of the database, a distinct function (sskgslgf [immediate] and sskgslspin [spin]) was used to get the latch, which made it easy to count.

Some searching around revealed that a CPU register reveals this information. Add this to the above gdb script:

break *0xc29b51
    printf " kslges loop: %d\n", $ecx

And try to get a non shared taken latch in willing to wait mode:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
 kslges loop: 19999
 kslges loop: 19998
 kslges loop: 1
 kslges loop: 0
kslwlmod 1208568840, -1780327896, 1610758784, 1
skgpwwait 1208569080, -1281169344, -1780326976, 0
sskgpwwait 1208569080, -1281169344, -1780326976, 0
semop 360451, 1208568312, 1, -1

So…this reveals that getting a non shared latch in willing to wait mode will spin 10*_spin_count. In other words: not _spin_count, which is 2000 by default. What is even more interesting, is for the described type of latch, there is no (short) timed sleep done; the kslges function spins 10*_spin_count times for the latch, then puts itself on the waiting list, and goes to sleep on a semaphore. The way the latching mechanism works has been described (by Oracle AFAIK) as that a process will spin for _spin_count times trying to get a latch, then goes to sleep for some time, and then spins trying to get the latch, goes to sleep, etc.

I strace’d the process holding the latch to see if it is doing a semctl systemcall to signal the waiting process if the latch is freed, and indeed that is what is happening. This is different from the “old days” where processes spinning on latches (cache buffers chains latches most of the time) were really hammering the system and would eat up a lot of the CPU slices trying to get a latch.

Part of the classic latching problem, specifically the cache buffers chains latching problem, quite probably is mitigated by having shared latches, which were introduced in Oracle in different versions for different ports.

Watch out for a second blogpost where I do the same investigation for shared latches.

This is a question that I played with for a long time. There have been statements on logical IO performance (“Logical IO is x times faster than Physical IO”), but nobody could answer the question what the actual logical IO time is. Of course you can see part of it in the system and session statistics (v$sysstat/v$sesstat), statistic name “session logical reads”. However, if you divide the number of logical reads by the total time a query took, the logical IO time is too high, because then it assumed all the time the query took was spend on doing logical IO, which obviously is not the case, because there is time spend on parsing, maybe physical IO, etc. Also, when doing that, you calculate an average. Averages are known to hide actual behaviour.

Luckily, with Redhat Enterprise Linux and Oracle Linux version 7, there is the kernel version 3.10 as the stock RedHat kernel version. This kernel version supports systemtap userspace return probes. A systemtap probe is a way to trigger an action when a certain action (the probed event) is started, a return probe is an action triggered when an action is finished. The Oracle UEK kernel version 3 at the time of writing is version 3.8, which does not support this.

My current knowledge is consistent reads are handled by the Oracle database C function kcbgtcr(). Current reads are quite probably handled by the function kcbgcur(). Having said that, I know of at least one exception to this: scans on hybrid columnar compressed segments do not use the kcbgtcr() function.

Please mind all kernel code translations, like kcbgtcr (kernel cache buffers get consistent read) are pure guesses, albeit somewhat educated guesses, as there are a lot of internet publications naming these, including My Oracle Support itself.

With the 3.10 version kernel, we can create a small systemtap script to measure the time between the start and stop of the kcbgtcr routine. If you want to experiment with this, it is probably best to download the latest version of systemtap and compile it yourself.. This is the reason you see /usr/local/bin/stap in the shebang.


global latency

probe begin {

probe process("/u01/app/oracle/product/").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()

probe process("/u01/app/oracle/product/").function("kcbgtcr").return {
	printf("<kcbgtcr, latency(us): %d\n", gettimeofday_us() - latency[pid()])

Now make the script executable, and run it against a database session:

# ./lio.stap -x 3877

Next, I execute a scan in the database foreground session, and watch the systemtap script output:
(execution of the systemtap script can be cancelled by pressing CTRL-c)

<kcbgtcr, latency(us): 2
<kcbgtcr, latency(us): 79542
<kcbgtcr, latency(us): 4
<kcbgtcr, latency(us): 2
<kcbgtcr, latency(us): 2
<kcbgtcr, latency(us): 13
<kcbgtcr, latency(us): 1

This shows the function being executed by the database session. However, it seems the time spend in the kcbgtcr() function is not consistent.

In order to get a better overview, we can add a histogram of the kcbgtcr latencies:

global latency, latency_histogram

probe begin {

probe process("/u01/app/oracle/product/").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()

probe process("/u01/app/oracle/product/").function("kcbgtcr").return {
	latency[pid()] = gettimeofday_us() - latency[pid()]
	latency_histogram <<< latency[pid()]
	printf("<kcbgtcr, latency(us): %d\n", latency[pid()])

probe end {
	if ( @count(latency_histogram) > 0 ) {
		printf("\n==kcbgtcr latency==\n")

Next, attach the systemtap script to the database session again, and issue a scan. Once the scan in the database session is finished, press CTRL-c to finish the systemtap script:


==kcbgtcr latency==
  value |-------------------------------------------------- count
      0 |                                                       0
      1 |@@@@@@@@@@@@@@@@@@@@                                5998
      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14284
      4 |                                                     268
      8 |                                                      58
     16 |                                                     137
     32 |                                                      25
     64 |                                                       1
    128 |                                                      15
    256 |                                                      69
    512 |                                                       1
   1024 |                                                       0
   2048 |                                                       0
   4096 |                                                       1
   8192 |                                                       5
  16384 |                                                      35
  32768 |                                                      39
  65536 |                                                       8
 131072 |                                                       0
 262144 |                                                       1

Okay, the majority is 2 microseconds, but outside the 1 and 2 microseconds buckets, there are a lot of executions that totally fall outside of these, up to 262144 microseconds (262 milliseconds)!

What could cause these huge changes in logical IO time?

At this point I am quite much turning the squelch down and make a lot of information visible (this is a warning!). Here is the systemtap script I am using:

global latency, latency_histogram

probe begin {

probe process("/u01/app/oracle/product/").function("kc*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
	if ( probefunc() == "kcbgtcr" )
			latency[pid()] = gettimeofday_us()
probe process("/u01/app/oracle/product/").function("kc*").return {
	printf("%s < %s", thread_indent(-1), ppfunc())
	if ( ppfunc() == "kcbgtcr" ) {
		latency[pid()] = gettimeofday_us() - latency[pid()]
		latency_histogram <<< latency[pid()]
		printf(" -- latency(us): %d", latency[pid()])
probe process("/u01/app/oracle/product/").function("qe*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
probe process("/u01/app/oracle/product/").function("qe*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
probe process("/u01/app/oracle/product/").function("kt*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
probe process("/u01/app/oracle/product/").function("kt*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
probe process("/u01/app/oracle/product/").function("kd*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
probe process("/u01/app/oracle/product/").function("kd*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
probe process("/u01/app/oracle/product/").function("opiosq0") {
probe process("/u01/app/oracle/product/").function("opiexe") {
probe process("/u01/app/oracle/product/").function("opifch2") {
probe process("/u01/app/oracle/product/").function("opiclo") {
probe process("/u01/app/oracle/product/").function("kslwtbctx") {
probe process("/u01/app/oracle/product/").function("kslwtectx") {
probe process("/lib64/").function("io_submit") {
probe process("/lib64/").function("io_getevents_0_4") {
probe process("/lib64/").function("pread64") {

Warning! This script will run for a long time before it is compiled as kernel module!! Systemtap works by dynamically creating a kernel module from the system tap script, and insert it into the kernel. Because I injected a huge chunk of code to run in the kernel a lot of things are potentially influenced. I found the Oracle database to drop core’s for example.

After running this against a session and doing a very simple full table scan with a count(*), I searched for the logical IO, in other words where the kcbgrcr() function was performed. Here it is:

166149 oracle_5487_tes(5487):     < kdst_fetch
166155 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166159 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets
166162 oracle_5487_tes(5487):     > kdst_fetch
166164 oracle_5487_tes(5487):      > kdst_fetch0
166167 oracle_5487_tes(5487):       > kcbipnns
166170 oracle_5487_tes(5487):       < kcbipnns
166173 oracle_5487_tes(5487):       > kcbrls
166177 oracle_5487_tes(5487):        > kcbrls_direct
166181 oracle_5487_tes(5487):        < kcbrls_direct
166183 oracle_5487_tes(5487):       < kcbrls
166185 oracle_5487_tes(5487):       > kdstsnb
166188 oracle_5487_tes(5487):       < kdstsnb
166191 oracle_5487_tes(5487):       > ktrget2
166194 oracle_5487_tes(5487):        > ktsmg_max_query
166197 oracle_5487_tes(5487):        < ktsmg_max_query
166200 oracle_5487_tes(5487):        > kcbgtcr
166204 oracle_5487_tes(5487):         > kcbldrget
166209 oracle_5487_tes(5487):          > kcblgt
166211 oracle_5487_tes(5487):          < kcblgt
166213 oracle_5487_tes(5487):          > kcbzvb
166216 oracle_5487_tes(5487):           > kcbhvbo
166219 oracle_5487_tes(5487):            > kcbhxoro
166222 oracle_5487_tes(5487):            < kcbhxoro
166224 oracle_5487_tes(5487):           < kcbhvbo
166226 oracle_5487_tes(5487):          < kcbzvb
166228 oracle_5487_tes(5487):          > kcbztek_trace_blk
166230 oracle_5487_tes(5487):          < kcbztek_trace_blk
166233 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
166236 oracle_5487_tes(5487):           > kcbtgobj
166239 oracle_5487_tes(5487):            > kd4obj
166251 oracle_5487_tes(5487):            < kd4obj
166254 oracle_5487_tes(5487):           < kcbtgobj
166257 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
166260 oracle_5487_tes(5487):         < kcbldrget
166262 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 62
166265 oracle_5487_tes(5487):        > ktrgcm
166268 oracle_5487_tes(5487):         > kcbcge
166271 oracle_5487_tes(5487):         < kcbcge
166273 oracle_5487_tes(5487):         > ktcckv
166275 oracle_5487_tes(5487):         < ktcckv
166285 oracle_5487_tes(5487):        < ktrgcm
166286 oracle_5487_tes(5487):       < ktrget2
166289 oracle_5487_tes(5487):       > kdr9ir2blk
166291 oracle_5487_tes(5487):       < kdr9ir2blk
166293 oracle_5487_tes(5487):      < kdst_fetch0
166294 oracle_5487_tes(5487):     < kdst_fetch
166297 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166300 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

How to read: “>” means entering a function, “<" means return from a function.
I selected a piece of the systemtap output/tracing where the counting procedure is visible.

The first row is "< kdst_fetch" in other words: returning from kernel data scan table fetch. So this function has performed a fetch. Not surprisingly, the next function is qeaeCn1SerialRowsets, which I think is the count function.

After the count, the kdst_fetch function is entered again, then kdst_fetch0. The next functions are kcbipnns, kcbrls and kcbrls_direct. Probably these functions are related to pinning and releasing blocks. See Alexander Anokhin’s post on that.

A few functions further we encounter kcbgrcr(). Interestingly, it is followed by the kcbldrget() function, which is kernel cache buffers direct path loader get. In other words, Oracle has chosen to do a direct path read, because this is the function that starts off the direct path read code. The next function, kcblget() requests a block. This means that blocks read in this code path are from PGA memory, not SGA (buffer cache) memory.

The next functions, kcbzvb, kcbhvbo and kcbhxoro are the block XOR checking functionality. The function actually performing this (sxorchk) is not visible because I did not probe for it.

The next functions are not clear to me at this point (kcbztek_trace_blk, kcbl_objdchk_with_cache_reread, kcbtgobj, kd4obj). Then the kcbldrget returns, and the kcbgtcr function too.

Here we can see that probing a huge number of functions does influence the performance of a process. While previously we saw kcbgtcr() took 2us, probably because of all the probes the time the kcbgtcr function took 62us.

The next piece of execution is done by the ktrgcm function. This is handling undo and buffer cleanout. After that function, we cross kdr9ir2blk (function unknown), after which kdst_fetch0 and kdst_fetch return, and the count is done using qeaeCn1SerialRowsets.

Okay, so far so good, but the question was: why are there logical IOs/kcbgtcr() executions that take excessively more time?

After a little searching, I was able to find a very good reason:

169252 oracle_5487_tes(5487):     > kdst_fetch
169254 oracle_5487_tes(5487):      > kdst_fetch0
169256 oracle_5487_tes(5487):       > kcbipnns
169258 oracle_5487_tes(5487):       < kcbipnns
169259 oracle_5487_tes(5487):       > kcbrls
169262 oracle_5487_tes(5487):        > kcbrls_direct
169264 oracle_5487_tes(5487):        < kcbrls_direct
169265 oracle_5487_tes(5487):       < kcbrls
169267 oracle_5487_tes(5487):       > kdstsnb
169269 oracle_5487_tes(5487):       < kdstsnb
169270 oracle_5487_tes(5487):       > ktrget2
169272 oracle_5487_tes(5487):        > ktsmg_max_query
169274 oracle_5487_tes(5487):        < ktsmg_max_query
169275 oracle_5487_tes(5487):        > kcbgtcr
169278 oracle_5487_tes(5487):         > kcbldrget
169280 oracle_5487_tes(5487):          > kcblgt
169283 oracle_5487_tes(5487):           > kcblrs
169286 oracle_5487_tes(5487):            > kdsdrcbk
169288 oracle_5487_tes(5487):            < kdsdrcbk
169291 oracle_5487_tes(5487):            > kcbldio
169296 oracle_5487_tes(5487):             > kcfaioe
169299 oracle_5487_tes(5487):             < kcfaioe
169301 oracle_5487_tes(5487):             > kcflbi
169314 oracle_5487_tes(5487):              > kcf_hard_ftype_check
169317 oracle_5487_tes(5487):              < kcf_hard_ftype_check
169416 oracle_5487_tes(5487):             < kcflbi
169420 oracle_5487_tes(5487):            < kcbldio
169425 oracle_5487_tes(5487):            > kcblcffln
169429 oracle_5487_tes(5487):            < kcblcffln
169432 oracle_5487_tes(5487):           < kcblrs
169434 oracle_5487_tes(5487):           > kcblsinc
169438 oracle_5487_tes(5487):           < kcblsinc
169440 oracle_5487_tes(5487):           > kcblcio
169443 oracle_5487_tes(5487):            > kcblci
169447 oracle_5487_tes(5487):             > kcflci
169481 oracle_5487_tes(5487):              > kcflwi
179477 oracle_5487_tes(5487):              < kcflwi
179484 oracle_5487_tes(5487):             < kcflci
179488 oracle_5487_tes(5487):            < kcblci
179491 oracle_5487_tes(5487):           < kcblcio
179494 oracle_5487_tes(5487):          < kcblgt
179497 oracle_5487_tes(5487):          > kcbzvb
179509 oracle_5487_tes(5487):           > kcbhvbo
179513 oracle_5487_tes(5487):            > kcbhxoro
179516 oracle_5487_tes(5487):            < kcbhxoro
179518 oracle_5487_tes(5487):           < kcbhvbo
179520 oracle_5487_tes(5487):          < kcbzvb
179539 oracle_5487_tes(5487):          > kcbztek_trace_blk
179544 oracle_5487_tes(5487):          < kcbztek_trace_blk
179549 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
179555 oracle_5487_tes(5487):           > kcbtgobj
179559 oracle_5487_tes(5487):            > kd4obj
179562 oracle_5487_tes(5487):            < kd4obj
179563 oracle_5487_tes(5487):           < kcbtgobj
179565 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
179569 oracle_5487_tes(5487):         < kcbldrget
179571 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 10295
179576 oracle_5487_tes(5487):        > ktrgcm
179580 oracle_5487_tes(5487):         > kcbcge
179582 oracle_5487_tes(5487):         < kcbcge
179585 oracle_5487_tes(5487):         > ktcckv
179587 oracle_5487_tes(5487):         < ktcckv
179589 oracle_5487_tes(5487):        < ktrgcm
179591 oracle_5487_tes(5487):       < ktrget2
179593 oracle_5487_tes(5487):       > kdr9ir2blk
179606 oracle_5487_tes(5487):       < kdr9ir2blk
179609 oracle_5487_tes(5487):      < kdst_fetch0
179611 oracle_5487_tes(5487):     < kdst_fetch
179616 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
179620 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

If you go through the calls, you will see that the start is exactly the same, until line 17. After kcbgtcr>kcbldrgt (consistent read request function choosing direct path reads), the kcblgt function does not return immediately, but rather starts off a lot of extra code path.

This code path fetches new blocks. The most striking thing here is that kcbgtcr requests the blocks, and physical IO is done on behalf of the consistent read request, in other words: on behalf of the logical IO. This is obvious if you think about it, if you want to read blocks you have to look if they are available or not, and if they are not available, you have to fetch them.

As a conclusion: if kcbgtcr() together with kcbgcur() and a couple of other functions is considered the logical IO, then a logical IO has a variable time, instead of a consistent one. Because physical IO is requested inside the logical IO request, technically the physical IO is part of the logical IO. This means that it is technically incorrect to state that a physical IO is slower than logical IO, because a physical IO is part of the logical IO request that needed that physical IO. and as such a physical IO can not be slower than a logical IO

In my previous article I started exploring the memory usage of a process on a recent linux kernel (2.6.39-400.243.1 (UEK2)), recent means “recent for the Enterprise Linux distributions” in this context, linux kernel developers would point out that the kernel itself is at version 3.19 (“stable version” at the time of writing of this blogpost).

The previous article showed that every process has its own address space, and that different allocations exists for execution. These allocations can be seen in the proc pseudo filesystem, in a process specific file called ‘maps’. The process itself needs some administration area’s which are anonymous allocations which are marked with [heap] and [stack], and some a few others called [vdso] and [vsyscall]. Every process executes something (not all people realise that: a process can wait for something, but essentially is always executing). So there always will be an executable in a (regular) process’ address space. In a lot of cases, the executable uses shared libraries. In that case, the libraries are loaded in the address space of the process too, alongside the executable.

The executable contains (at least) two sections; the code segment, which is readonly and potentially shared, and the data segment, which is read write and gets truly allocated instead of used shared with the parent if the process needs to write. In a lot of cases, the executable uses shared libraries, which means it uses functions which are stored in that library. A library also needs to be loaded, and also contains multiple sections, which can be read only or read write, and are shared unless the process needs to write to that segment.

For completeness, here’s the complete maps output of a process executing the ‘cat’ executable again:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0139d000-013be000 rw-p 00000000 00:00 0                                  [heap]
7f444468d000-7f444a51e000 r--p 00000000 fc:00 821535                     /usr/lib/locale/locale-archive
7f444a51e000-7f444a6a8000 r-xp 00000000 fc:00 3801096                    /lib64/
7f444a6a8000-7f444a8a8000 ---p 0018a000 fc:00 3801096                    /lib64/
7f444a8a8000-7f444a8ac000 r--p 0018a000 fc:00 3801096                    /lib64/
7f444a8ac000-7f444a8ad000 rw-p 0018e000 fc:00 3801096                    /lib64/
7f444a8ad000-7f444a8b2000 rw-p 00000000 00:00 0
7f444a8b2000-7f444a8d2000 r-xp 00000000 fc:00 3801089                    /lib64/
7f444aacd000-7f444aad1000 rw-p 00000000 00:00 0
7f444aad1000-7f444aad2000 r--p 0001f000 fc:00 3801089                    /lib64/
7f444aad2000-7f444aad3000 rw-p 00020000 fc:00 3801089                    /lib64/
7f444aad3000-7f444aad4000 rw-p 00000000 00:00 0
7fff51980000-7fff519a1000 rw-p 00000000 00:00 0                          [stack]
7fff519ff000-7fff51a00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

If you look closely, you will see that I didn’t explain one type of allocation yet: the anonymous allocations. The anonymous allocations are visible in lines 4, 11, 13 and 16. The anonymous mappings directly following the data segment of either an executable or a library mapped into a process address space is called the BSS. The data segment and the BSS store static variables, however the data segment stores initialised variables, the BSS stores uninitialised variables. The anonymous mapping for the BSS section might exist for a library, as seen above, or might not exist; not all Oracle database executable libraries use an anonymous memory mapping for example. Actually, there is one other memory allocation visible, /usr/lib/locale/locale-archive, which is a file for locale (multi-language support) functions in the C library, which is out of scope for this article.

When a process requests memory to store something, the system call malloc() (memory allocation) can be called. This system call inspects the size of the allocation, and will allocate memory from either the process’ heap (the memory mapping with [heap], using the system call brk()) or it will allocate space using a new anonymous memory segment, using the system call mmap(). If you follow the link with malloc(), you can read the source code of the malloc() call. There are different malloc()’s, which fulfil different purposes (embedded devices have different requirements than huge servers), the implementation that Enterprise Linuxes use is one called ptmalloc2, which is based on a version written bij Doug Lea. If you read the comments in the source code, specifically at ‘Why use this malloc?’, you will see that it tries to be smart with requests up to 128KB (for memory re-usability, to avoid fragmentation and memory wastage), which are allocated from the heap. If an allocation is larger than 128KB, it will use the system memory mapping facilities.

Okay, this brings us back at the original question: how much memory does this process take? I hope you recall from the first blogpost that Linux tries to share as much memory as possible, and when a new process is created, the allocations in the address space of this new process are pointers to the memory areas of the parent process. Let’s first use a utility a lot of people are using: top.

top - 10:53:40 up 9 days, 14:11,  2 users,  load average: 1.34, 1.36, 1.37
Tasks: 1124 total,   1 running, 1123 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9%us,  0.8%sy,  0.1%ni, 98.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  98807316k total, 97411444k used,  1395872k free,   400288k buffers
Swap: 25165820k total,  3560852k used, 21604968k free, 27573200k cached

 16391 oracle    -2   0 2369m 4960 4812 S  0.7  0.0  91:41.37 asm_vktm_+asm1
 16407 oracle    20   0 2388m  20m 8580 S  0.7  0.0  46:16.40 asm_dia0_+asm1

I edited the output a bit to show only two process from an ASM instance.
The columns that show information on memory are VIRT, RES, SHR, %MEM.

VIRT is described in the man-page of top as ‘The total amount of virtual memory used by the task’. This means it’s ALL the memory visible in the addressing space of the process. A useful utility to get the contents of the virtual memory allocations for a process is pmap, let’s use it for process 16391, which is asm_vktm_+asm1:

$ pmap -x 16391
16391:   asm_vktm_+ASM1
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000  246644    2520       0 r-x--  oracle
000000000f6dd000    1952      44      20 rw---  oracle
000000000f8c5000     140       0       0 rw---    [ anon ]
000000001042c000     356       0       0 rw---    [ anon ]
0000000060000000    4096       0       0 rw-s-  SYSV00000000 (deleted)
0000000061000000 2080768       0       0 rw-s-  SYSV00000000 (deleted)
00000000e0000000      12       4       4 rw-s-    [ shmid=0x4e78003 ]
00007fa275589000      72      12       0 r-x--
00007fa27559b000    2044       0       0 -----
00007fa27579a000       8       0       0 rw---
00007fa27579c000    1604       0       0 r-x--
00007fa27592d000    2044       0       0 -----
00007fa275b2c000      72       0       0 rw---
00007fa275b3e000      20       4       0 r-x--
00007fa275b43000    2044       0       0 -----
00007fa275d42000       4       0       0 rw---
00007fa27abd5000       8       0       0 r-x--
00007fa27abd7000    2044       0       0 -----
00007fa27add6000       4       0       0 rw---
00007fa27add7000     128     116       0 r-x--
00007fa27adf8000     512      12      12 rw---    [ anon ]
00007fa27ae78000     212       4       4 r--s-  passwd
00007fa27aead000    1260      68      68 rw---    [ anon ]
00007fa27aff3000       4       4       0 rw-s-  hc_+ASM1.dat
00007fa27aff4000       8       0       0 rw---    [ anon ]
00007fa27aff6000       4       0       0 r----
00007fa27aff7000       4       0       0 rw---
00007fa27aff8000       4       0       0 rw---    [ anon ]
00007fff7b116000     132       8       8 rw---    [ stack ]
00007fff7b1ff000       4       4       0 r-x--    [ anon ]
ffffffffff600000       4       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB         2426668    5056     156

The column ‘Kbytes’ represents the full size of the executable, libraries, shared memory, anonymous mappings and other mappings of this process. For completeness sake: 2426668/1024=2369.79, which matches the 2369 in the top output. This is all the memory this process can see, and could use. Does this tell us anything on what memory the process 16391 actually takes? No. (parts of) the Oracle executable’s allocations are potentially shared, the shared memory (SYSV00000000 (deleted) and [ shmid=0x4e78003 ], which represent the Oracle SGA) is shared, the memory allocations for the libraries are potentially shared. The anonymous memory mappings have been defined, but the actual allocation is not visible in the Kbytes column. What this column in top does for me, is tell the approximate SGA size, especially if the SGA is larger (meaning multiple gigabytes).

The second memory column in top is RES. RES is described as: ‘The non-swapped physical memory a task is using’. RES is sometimes referred to as RSS, and called ‘resident set size’. As we can see from the total, the RSS is way lesser than the virtual memory size. One important thing in the RES description of top is that it described that swapped memory pages are not counted for the RES/RSS value. RES/RSS corresponds to the actual used (“touched”) memory by a process, and is directly usable. If you look back to the RSS column of pmap above, you see the oracle executable’s two mappings, one has a RSS size of 2520, and one has a RSS size of 44. But…if you remember that the code/readonly segment is potentially shared with other process, and then look at the 2520 value (which is of the oracle memory segment with the rights r-x–, which means the code segment), I hope you understand this just means this process (vktm) read a subset of the entire executable, and more importantly: the RSS size does not reflect physical memory uniquely allocated by this process.

If we look at the shared memory segments, it’s interesting to see what happens during normal life of a database session. I think it is needless to say that you should calculate shared memory outside of process memory usage, since it’s a distinct memory set that is truly shared by all the processes that are created for the instance.

This is a session which has just started:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     348     348 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848    2316    2316 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592       0       0 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]

This instance has a SGA set to 4G. Because the session just started, it only touched 2316(KB) of the SGA. Next, I do a big (buffered!) full table scan, requiring the session to put a lot of blocks into the buffercache. After the scan, look at the shared memory segment using pmap again:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     384     384 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848 2279040 2279040 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592   66564   66564 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]

The session has touched half of the SGA shared memory segment (visible in the RSS column of the 6th line). This is logical if you understand what is going on: the process does a buffered table scan, which means the blocks read from disk need to be stored in the buffer cache, which is one of the memory structures in the Linux shared memory segments. However, if you look strictly at the top utility output of a database that has just started up, you see the RSS size of the all the processes that work with the buffercache growing. This phenomenon has lead to a repetitive question on the Oracle Linux forums if Oracle database processes are leaking memory. Of course the correct answer is that the RSS size just grows because the process just touches more of the shared memory (=SGA) that has been mapped into its address space. It will stop increasing once it touched all the memory it could touch.

%MEM is the RES/RSS size expressed as a percentage of the total physical memory.

SHR is the amount of shared memory. The manpage of top says ‘It simply reflects memory that could be potentially shared with other processes’. Do not confuse this with shared memory segments mapped into the process’ address space. Empirical tests show the SHR value always seems to be lower than the RSS size, which means it seems to track the RSS value of memory, and shows RSS (touched) memory that could be shared (which seems to contain both touched memory from the shared memory segments, as well as privately mapped executable and libraries). At least from the perspective of an Oracle performance and troubleshooting perspective I can’t see any benefit from using this value.

The conclusion of this part on memory usage is that both the virtual set size (VIRT/VSZ) and resident set size (RES/RSS) are no figures you can add up to indicate physical memory usage of a single process or a group of processes.

The virtual set size gives you the total amount of virtual memory available to a single process, which includes the executable, and potentially shared libraries, anonymous memory mappings and files mapped into the address space and shared memory. In a lot of cases, you get an indication of the total SGA size of the instance, because the shared memory segments which contain the SGA are entirely mapped into the address space of the process.

The resident set size shows you how much of the memory and files mapped into the address space are actually “touched”, and directly usable. Some of the memory usage result in memory pages private to the process, because of writing to the memory and the Copy On Write mechanism of the Linux memory manager. A lot of other memory mappings can be used, increasing the resident set size, while these are shared with other processes. A third potential component is the actual usage of memory as a result of anonymous memory mappings (versus the total allocation, which can be much more), which are private to the process.

This blogpost is about finding the actual amount of memory a process is taking. In order to do so, this post dives into the memory mechanisms of Linux. The examples in this article are taken from an Oracle Linux version 6.6 server, with kernel 2.6.39-400.243.1 (UEK2). This is written with the Oracle database processes in mind, but actually uses examples of a processes running ‘cat’, which means the contents of this post are absolutely not limited to Oracle database processes.

Let’s start off with a simple example. Let’s look at our own memory map. In order to do so, I use the ‘cat’ executable and the ‘maps’ entry in the proc pseudo-filesystem. This is how that is done, including the result:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0139d000-013be000 rw-p 00000000 00:00 0                                  [heap]
7f444468d000-7f444a51e000 r--p 00000000 fc:00 821535                     /usr/lib/locale/locale-archive
7f444a51e000-7f444a6a8000 r-xp 00000000 fc:00 3801096                    /lib64/
7f444a6a8000-7f444a8a8000 ---p 0018a000 fc:00 3801096                    /lib64/
7f444a8a8000-7f444a8ac000 r--p 0018a000 fc:00 3801096                    /lib64/
7f444a8ac000-7f444a8ad000 rw-p 0018e000 fc:00 3801096                    /lib64/
7f444a8ad000-7f444a8b2000 rw-p 00000000 00:00 0
7f444a8b2000-7f444a8d2000 r-xp 00000000 fc:00 3801089                    /lib64/
7f444aacd000-7f444aad1000 rw-p 00000000 00:00 0
7f444aad1000-7f444aad2000 r--p 0001f000 fc:00 3801089                    /lib64/
7f444aad2000-7f444aad3000 rw-p 00020000 fc:00 3801089                    /lib64/
7f444aad3000-7f444aad4000 rw-p 00000000 00:00 0
7fff51980000-7fff519a1000 rw-p 00000000 00:00 0                          [stack]
7fff519ff000-7fff51a00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Most people know that in order to execute ‘cat’, the shell forks and executes the cat command in a new process.

What we see, is the executable (/bin/cat), the heap ([heap]), two dynamic libraries (/lib64/ and /lib64/, the stack ([stack]), two entries called [vdso] (virtual dynamically linked shared objects) and [vsyscall] (virtual syscall), and anonymous memory allocations (00000000 00:00 0 allocations without a marker to indicate a process function).

In order to understand the libraries, we first need to know something about the executable itself, using the command ‘file':

$ file /bin/cat
/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped

There actually is a great deal of information to be seen from this one line:
This is an ELF format executable. Another executable format is COFF, which is used on the Windows platform.
The executable is 64 bits (x86-64).
The most important part for this article: the executable is dynamically linked, and uses shared libraries.
The last word ‘stripped’ deserves some explanation too: the executables is stripped, which means a lot of symbolic information (function names for example) are removed from the executable. One reason for doing so is to make the executable smaller. If you look at the oracle executable, you will see it’s not stripped. Except for the Oracle XE executable, which is stripped.

We now established this is a dynamically linked executable. The next step is to see the libraries it is using. This is done with the ‘ldd’ (loader dependencies) executable:

$ ldd /bin/cat =>  (0x00007fffa9388000) => /lib64/ (0x00007f6ffa3dd000)
	/lib64/ (0x00007f6ffa772000)

Please mind these are the dependencies for executing /bin/cat, which does not necessarily means you see all those in the address space of the process executing the /bin/cat executable. Two you don’t see is, which is used for virtual dynamic shared objects, which essentially means from the oracle perspective that some system calls can be executed fully in userspace, most notably for Oracle engineers: gettimeofday(). Fellow Oaky James Morle wrote a nice article explaining this.
The other one is /lib64/, which is the dynamic loader needed for executing the executable. The last one is, which “truly” is a library that is dynamically loaded for use during execution. The other ones discussed earlier are necessary for the instantiating the execution, not so much during the execution.

If we now look back to the maps output, we see the library, and we see a library. The library is the dynamic loader library, to provide dynamic loading function on runtime.

Okay, we now gained a some understanding on executables and the libraries. Now let’s look a bit more in detail to the executable in maps:

00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat

Why is the executable mentioned twice?
The answer is this is because of the way an ELF object is built up. In order to look deeper into this ELF binary, we can use the ‘readelf’ executable:

$ readelf -l /bin/cat

Elf file type is EXEC (Executable file)
Entry point 0x401850
There are 8 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001c0 0x00000000000001c0  R E    8
  INTERP         0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000000000a204 0x000000000000a204  R E    200000
  LOAD           0x000000000000a208 0x000000000060a208 0x000000000060a208
                 0x0000000000000648 0x0000000000001000  RW     200000
  DYNAMIC        0x000000000000a3e8 0x000000000060a3e8 0x000000000060a3e8
                 0x0000000000000190 0x0000000000000190  RW     8
  NOTE           0x000000000000021c 0x000000000040021c 0x000000000040021c
                 0x0000000000000044 0x0000000000000044  R      4
  GNU_EH_FRAME   0x0000000000009414 0x0000000000409414 0x0000000000409414
                 0x000000000000027c 0x000000000000027c  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     8

 Section to Segment mapping:
  Segment Sections...
   01     .interp
   02     .interp .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame
   03     .ctors .dtors .jcr .dynamic .got .got.plt .data .bss
   04     .dynamic
   05     .note.ABI-tag
   06     .eh_frame_hdr

When ‘readelf’ is invoked with the -l option, readelf displays the information that is contained in the ELF segment headers of the executable. This is a lot of detail into which I don’t go too deep.

The biggest distinction between the two entries of the executable in maps is the first entry is readonly executable (r-xp), and the second entry is read-write (rw-p). If we now look at the readelf -l output, we see the same flags in the FLAGS column, now indicated by R, W and E for Read, Write and Execute. If we look in the segment sections part, we see section 02 containing a lot of entries, and if we look up at the headers, and we count from zero, we see “LOAD”, and the entry being flagged as ‘R E’. This memory area often is referred to as the code segment.

The some of the types of memory allocations in section 02 are:
– Dynamic linking information (.interp, .gnu.hash, .dynsyms, dynstr)
– C runtime code (.init, .fini)
– Relocation information (.rela.dyn, .rela.plt)
– String constants (.rodata)
– Machine instructions (.text)
– Procedure Linkage Table (.plt)

The next section that deserves attention is section 03. If we look in the headers, count from zero, we see this section is a “LOAD” section too, but the entry being flagged as ‘RW ‘. This memory area often is referred to as the data segment.

The some of the types of memory allocations in section 03 are:
– Dynamic linking into (.dynamic)
– Relocated pointer values to external symbols (.got)
– Procedure linkage table of the global offset table (.got.plt)
– C runtime data (.ctors, .dtors)

As you can see with these two, section 02 contains information which can actually be readonly (the executable itself is never changed on runtime), which is why it’s readonly. Section 03 contains information which by the nature of dynamic linking, really can’t be readonly, because we don’t know how the library looks like which the executable needs to dynamically load via the dynamic loader; the only thing that is required from the library is that it contains the functions the executable needs from that library.

At this point I think it should be clear why there are two sections for a single executable. But how about the libraries?

Actually, simplified, a library is an executable with only functions, and not a program to run any of these functions. Libraries also can use other shared libraries for its functions. This means you can use the above mentioned executables (file, ldd, readelf) to examine libraries, and get quite similar results. The programmer can choose how the memory segments look like and how these are divided, which is the reason the dynamic loader library ( has 3 entries, and the libc library ( has 4. Although 4 or 3 sections seems to be extremely common for libraries, as well as 2 for executables.

Okay, now that we gained some understanding on the memory segments for a process, let’s continue on actual memory usage. When a process is forked, Linux tries to do the least possible and save as much memory as possible. To do so, the process gets its own virtual memory address space which is a duplicate of the process which called fork, including its memory. However, what really happens is the memory areas are the ones from the forking process, linked via pointers. Only when a change is needed, the process starts truly allocating its own memory, of course not using the memory from the forking process. This technique is often referred to as COW (Copy On Write). The ‘pointer trick’ is made possible by the kernel, and the use of virtual memory: every process allocates at the same (virtual) memory addresses, which are kept separate from other processes by translating them to different physical locations, or the same locations for copy on write. Of course the read-only sections as we saw in maps will never change, and as a result only will be in memory at a single place, and potentially referenced/used by a lot of processes. But even the writable sections are only physically allocated as soon as there truly is a write action.

The ‘just in time’ principle of memory allocation is also used for anonymous memory allocations via mmap(), which are most of the Oracle database processes PGA allocations. When a process allocates memory via mmap(), it results in an anonymous memory section, and is visible in maps. However, it isn’t allocated yet, only when the memory truly is getting used it is physically allocated.

I hope you start to see at this point that you can see the memory areas a process is using, but you can not tell how much of that is really, uniquely allocated for that process. Also, and maybe even more importantly, just adding memory figures of Oracle databases processes will do (a lot of) double counting. In a future blogpost or posts I will dive into how bits and pieces of the working can actually be seen from for example the /proc/PID/smaps meta-file, and how this knowledge can be applied for sizing Oracle databases.

I was testing Oracle Goldengate on a non-clustered Oracle database with ASM. With ASM, you need to have the grid infrastructure installed. The cluster ware for the single node install is called ‘oracle restart’.

The most convenient way to have Goldengate running at startup that I could find, was using the Oracle Grid Infrastructure Agents. These agents are not installed by default, you need to download these from the Oracle Technology Network. The download is with the grid infrastructure downloads in the database section.

The installation is very simple: unzip the file, and run the script with the arguments ‘–install’ and ‘–directory’ arguments. The directory argument needs to point to a directory outside of the grid infrastructure home. My first choice would be to have it in the infrastructure home, but the install script does not allow that.

$ ./ --install --directory /u01/app/oracle/product/xag
Installing Oracle Grid Infrastructure Agents on: ogg-dest

The control utility, ‘agctl’ can be run without any environment variables set to point to the directory where the utility is installed, or the grid infrastructure home.

$ /u01/app/oracle/product/xag/bin/agctl query releaseversion
The Oracle Grid Infrastructure Agents release version is 6.1.1

To add a golden gate resource, add it via the agctl utility:

$ /u01/app/oracle/product/xag/bin/agctl add goldengate gg1 --gg_home /u01/app/oracle/product/12.1.2/oggcore_1 --instance_type target --oracle_home /u01/app/oracle/product/ --databases ora.dest.db

This is a very simple example, where a cluster resource called ‘gg1′ is created, for which we point out the golden gate home, the database ORACLE_HOME and the database cluster resource (not the database name as some documentation from Oracle says).

In my case, golden gate was not running. After the golden gate cluster resource was added, it’s offline:

      1        OFFLINE OFFLINE                               STABLE

Now start the resource using the ‘agctl’ utility:

/u01/app/oracle/product/xag/bin/agctl start goldengate gg1

(please mind you could do exactly the same with crsctl: crsctl start res xag.gg1.goldengate)
Starting does not result in any message. Let’s look at the golden gate cluster resource:

      1        ONLINE  ONLINE       ogg-dest                 STABLE

What it does, is start the golden gate manager (mgr) process:

$ /u01/app/oracle/product/12.1.2/oggcore_1/ggsci << H
> info all
> H

Oracle GoldenGate Command Interpreter for Oracle
Version OGGCORE_12.
Linux, x64, 64bit (optimized), Oracle 12c on Aug  7 2014 10:21:34
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2014, Oracle and/or its affiliates. All rights reserved.

GGSCI (ogg-dest.local) 1>
Program     Status      Group       Lag at Chkpt  Time Since Chkpt

REPLICAT    ABENDED     REP1        00:00:00      17:55:53

Let’s add monitoring of the replicat. In real life scenario’s it’s the extracts or replicats that are important, these perform the actual work golden gate is supposed to do!

/u01/app/oracle/product/xag/bin/agctl modify goldengate gg1 --monitor_replicats rep1

The cluster ware checks the golden gate resource type status every 30 seconds. So if you check the cluster immediately after modifying the golden gate resource, it could not have performed the check with the added check for the replicat.

      1        ONLINE  INTERMEDIATE ogg-dest                 ER(s) not running :

It now neatly reports the replicat group rep1 not running! I guess ‘ER’ means Extract and Replicat.
Now start the replicat, and check the status again:

      1        ONLINE  ONLINE       ogg-dest                 STABLE

A few final notes. This is the simplest possible setup. I couldn’t find a lot of information on the infrastructure agents except for some Oracle provided ones, which is why I created this blog. If you apply this on a cluster, I would urge you to look into the oracle provided ones, and create the full setup with golden gate writing its files on a cluster filesystem, using an application VIP, etcetera, so golden gate could be started or failed over to another host.

Goldengate is not the only thing this agents can incorporate into the cluster ware. The current version (6.1.1) supports, besides golden gate: tomcat, apache, JDE, mysql, various types of peoplesoft servers, siebel and web logic.

Deinstalling is very simple, but might put you on your wrong foot. The agents are deinstalled by using the setup script ( with the ‘–deinstall’ argument. This of course is quite normal, but it deinstalls the agents from the location where you call This means that if you run from the location where you unzipped it for installation, it will ‘deinstall’, which means clean out, from that directory, leaving the true installation alone. The documentation states that upgrades to the agents are done by removing them and installing the new version.

This article is written with examples taken from an (virtualised) Oracle Linux 6u6 X86_64 operating system, and Oracle database version However, I think the same behaviour is true for Oracle 11 and 10 and earlier versions.

Probably most readers of this blog are aware that a “map” of mapped memory for a process exists for every process in /proc, in a pseudo file called “maps”. If I want to look at my current process’ mappings, I can simply issue:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 786125                             /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 786125                             /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0080a000-0080b000 rw-p 0000a000 fc:00 786125                             /bin/cat
01243000-01264000 rw-p 00000000 00:00 0                                  [heap]
345b000000-345b020000 r-xp 00000000 fc:00 276143                         /lib64/
345b21f000-345b220000 r--p 0001f000 fc:00 276143                         /lib64/
345b220000-345b221000 rw-p 00020000 fc:00 276143                         /lib64/
345b221000-345b222000 rw-p 00000000 00:00 0
345b800000-345b98a000 r-xp 00000000 fc:00 276144                         /lib64/
345b98a000-345bb8a000 ---p 0018a000 fc:00 276144                         /lib64/
345bb8a000-345bb8e000 r--p 0018a000 fc:00 276144                         /lib64/
345bb8e000-345bb8f000 rw-p 0018e000 fc:00 276144                         /lib64/
345bb8f000-345bb94000 rw-p 00000000 00:00 0
7f8f69686000-7f8f6f517000 r--p 00000000 fc:00 396081                     /usr/lib/locale/locale-archive
7f8f6f517000-7f8f6f51a000 rw-p 00000000 00:00 0
7f8f6f524000-7f8f6f525000 rw-p 00000000 00:00 0
7fff2b5a5000-7fff2b5c6000 rw-p 00000000 00:00 0                          [stack]
7fff2b5fe000-7fff2b600000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

What we see, is the start and end address, the rights (rwx), absence of rights is shown with a ‘-‘, and an indication of the mapped memory region is (p)rivate or (s)hared. In this example, there are no shared memory regions. Then an offset of the mapped file, then the device (major and minor device number). In our case sometimes this is ‘fc:00′. If you wonder what device this might be:

$ echo "ibase=16; FC" | bc
$ ls -l /dev | egrep 252,\ *0
brw-rw---- 1 root disk    252,   0 Mar 23 14:19 dm-0
$ sudo dmsetup info /dev/dm-0
Name:              vg_oggdest-lv_root
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      252, 0
Number of targets: 2
UUID: LVM-q4nr4HQXgotaaJFaGF1nzd4eZPPTohndgz553dw6O5pTlvM0SQGLFsdp170pgHuw

So, this is a logical volume lv_root (in the volume group vg_oggdest).

Then the inode number (if a file was mapped, if anonymous memory was mapped the number 0 is shown), and then the path if a file was mapped. This is empty for anonymous mapped memory (which is memory which is added to a process using the mmap() call). Please mind there are also special regions like: [heap],[stack],[vdso] and [vsyscall].

Okay, so far I’ve shown there is a pseudo file called ‘maps’ which shows mapped memory and told a bit about the fields in the file. Now let’s move on to the actual topic of this blog: the Oracle database SGA memory, and the indicator this is deleted!

In this example I pick the maps file of the PMON process of an Oracle database. Of course the database must use system V shared memory, not shared memory in /dev/shm (which is typically what you see when Oracle’s automatic memory (AMM) feature is used). This is a snippet from the maps file of the pmon process on my server:

 cat /proc/2895/maps
00400000-1093f000 r-xp 00000000 fc:00 1326518                            /u01/app/oracle/product/
10b3e000-10dbf000 rw-p 1053e000 fc:00 1326518                            /u01/app/oracle/product/
10dbf000-10df0000 rw-p 00000000 00:00 0
12844000-1289d000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:04 111902723                          /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:04 111902723                          /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:04 111935492                          /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:04 111968261                          /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:04 112001030                          /SYSV6ce0e164 (deleted)
345b000000-345b020000 r-xp 00000000 fc:00 276143                         /lib64/
345b21f000-345b220000 r--p 0001f000 fc:00 276143                         /lib64/

If you look closely, you see the oracle executable first, with two entries, one being readonly (r-xp), the other being read-write (rw-p). The first entry is readonly because it is shared with other processes, which means that there is no need for all the processes to load the Oracle database executable in memory, it shares the executable with other process. There’s much to say about that too, which should be done in another blogpost.

After the executable there are two anonymous memory mappings, of which one is the process’ heap memory.

Then we see what this blogpost is about: there are 5 mappings which are shared (r–s and rw-s). These are the shared memory regions of the Oracle database SGA. What is very odd, is that at the end of the lines it says “(deleted)”.

Of course we all know what “deleted” means. But what does it mean in this context? Did somebody delete the memory segments? Which actually can be done with the ‘ipcrm’ command…

If you go look at the maps of other Oracle processes and other databases you will see that every database’s shared memory segment are indicated as ‘(deleted)’.

Word of warning: only execute the steps below on a test environment, do NOT do this in a production situation.

In order to understand this, the best way to see what actually is happening, is starting up the Oracle database with a process which is traced with the ‘strace’ utility with the ‘-f’ option set (follow). Together with the ‘-o’ option this will produce a (long) file with all the system calls and the arguments of the calls which happened during startup:

$ strace -f -o /tmp/oracle_startup.txt sqlplus / as sysdba

Now start up the database. Depending on your system you will notice the instance startup takes longer. This is because for every system call, strace needs to write a line in the file /tmp/oracle_start.txt. Because of this setup, stop the database as soon as it has started, on order to stop the tracing from crippling the database performance.

Now open the resulting trace file (/tmp/oracle_startup.txt) and filter it for the system calls that are relevant (calls with ‘shm’ in their name):

$ grep shm /tmp/oracle_startup.txt | less

Scroll through the output until you see a line alike ‘shmget(IPC_PRIVATE, 4096, 0600) = 130777091′:

4545  shmget(IPC_PRIVATE, 4096, 0600)   = 130777091
4545  shmat(130777091, 0, 0)            = ?
4545  shmctl(130777091, IPC_STAT, 0x7fff9eb9da30) = 0
4545  shmdt(0x7f406f2ba000)             = 0
4545  shmctl(130777091, IPC_RMID, 0)    = 0
4545  shmget(IPC_PRIVATE, 4096, 0600)   = 130809859
4545  shmat(130809859, 0, 0)            = ?
4545  shmctl(130809859, IPC_STAT, 0x7fff9eb9da30) = 0
4545  shmdt(0x7f406f2ba000)             = 0
4545  shmctl(130809859, IPC_RMID, 0)    = 0

What we see here is a (filtered) sequence of systems calls that could explain the status deleted of the shared memory segments. If you look up what process id is in front of these shm system calls, you will see it’s the foreground process starting up the instance. If you look closely, you’ll that there is a sequence which is repeated often:

1. shmget(IPC_PRIVATE, 4096, 0600) = 130777091
The system call shmget allocates a shared memory segment of 4 kilobyte, rights set to 600. The return value is the shared memory identifier of the requested shared memory segment.

2. shmat(130777091, 0, 0) = ?
The system call shmat attaches the a shared memory segment to the process’ address space. The first argument is the shared memory identifier, the second argument is the address to attach the segment to. If the argument is zero, like in the call above, it means the operating system is tasked with finding a suitable (non used) address. The third argument is for flags, the value zero here means no flags are used. The returncode (here indicated with a question mark) is the address at which the segment is attached. This being a question mark means strace is not able to read the address, which is a shame, because we can’t be 100% certain at which memory address this shared memory segment is mapped.

3. shmctl(130777091, IPC_STAT, 0x7fff9eb9da30) = 0
The system call shmctl with the argument IPC_STAT has the function to read the (kernel) shared memory information of the shared memory identifier indicated by the first argument, and write it at the memory location in the third argument in a struct called shmid_ds.

4. shmdt(0x7f406f2ba000) = 0
With this system call, the shared memory segment is detached from the process’ address space. For the sake of the investigation, I assumed that the address in this call is the address which is returned by the shmat() call earlier.

5. shmctl(130777091, IPC_RMID, 0) = 0
This is another shared memory control system call, concerning our just created shared memory segment (shared memory identifier 130777091), with the command ‘IPC_RMID’. This is what the manpage says about IPC_RMID:

       IPC_RMID  Mark the segment to be destroyed.  The segment will only  actually  be  destroyed
                 after the last process detaches it (i.e., when the shm_nattch member of the asso-
                 ciated structure shmid_ds is zero).  The caller must be the owner or creator,  or
                 be privileged.  If a segment has been marked for destruction, then the (non-stan-
                 dard) SHM_DEST flag of the shm_perm.mode field in the associated  data  structure
                 retrieved by IPC_STAT will be set.

What I thought this means was:
It looked like to me the database instance starts building up its shared memory segments per 4096 page. Because IPC_RMID only marks the segment to be destroyed, and because it will only be truly destroyed when there are no processes attached to the shared memory segment, it looked like to me the background processes were pointed to the shared memory segment which was marked destroyed (in some way I hadn’t discovered yet), which meant the shared memory segment would actually survive and all database processes can use it. If ALL the database processes would be killed for any reason, for example with a shutdown abort, the processes would stop being connected to the shared memory segment, which would mean the shared memory segment would vanish automatically, because it was marked for destruction.
Sounds compelling, right?

Well…I was wrong! The sequence of creating and destroying small shared memory segments is done, but it turns out these are truly destroyed with the shmctl(…,IPC_RMID,…) call. I don’t know why the sequence of creating shared memory segments is happening.

I started looking for the actual calls that create the final, usable shared memory segments in the /tmp/oracle_startup.txt file. This is actually quite easy to do; first look up the shared memory segment identifiers using the sysresv utility (make sure the database’s ORACLE_HOME and ORACLE_SID are set):

$ sysresv
...a lot of other output...
Shared Memory:
197394436	0x00000000
197427205	0x00000000
197361667	0x00000000
197459974	0x6ce0e164
1015811 	0xd5cdbca4
Oracle Instance alive for sid "dest"

Actually the ‘sysresv’ utility (system remove system V memory I think is what the name means) has the task of removing memory segments if there is no instance left to use them. It will not remove the memory segments if it finds the instance alive. It prints out a lot of information as a bonus.

Now that we got the shared memory identifiers, simply search in the trace file generated by strace, and search for the creation of the memory segment with the identifiers: (please mind searching with ‘less’ is done with the forward slash)

$ less /tmp/oracle_startup.txt
9492  shmget(IPC_PRIVATE, 905969664, IPC_CREAT|IPC_EXCL|0640) = 197394436
9492  shmat(197394436, 0x60400000, 0)   = ?
9492  times(NULL)                       = 430497743
9492  write(4, " Shared memory segment allocated"..., 109) = 109
9492  write(4, "\n", 1)                 = 1

Aha! here we see shmget() again, but now with a size (905969664) that looks much more like a real shared memory segment size used by the database! After the shared memory identifier is created, the process attaches it to its addressing space with shmat() to a specific memory address: 0x60400000.

The next thing to do, is to look for any shmctl() call for this identifier. Oracle could still do the trick of marking the segment for destruction…
…But…there are no shmctl() calls for this identifier, nor for any of the other identifiers shown with the sysresv utility. This is rather odd, because Linux shows them as “(deleted)”. There ARE dozens of shmat() calls, of the other (background) processes forked from the starting process when they attach to the shared memory segments.

So, conclusion at this point is Linux shows the shared memory segments as deleted in ‘maps’, but the Oracle database does not mark the segments for destruction after creation. This means that either Linux is lying, or something mysterious is happening in the Oracle executable which I didn’t discover yet.

I could only think of one way to verify what is truly happening here. That is to create a program myself that uses shared memory, so I have 100% full control over what is happening, and can control every distinct step.

This is what I came up with:

#include <stdio.h>
#include <sys/shm.h>
#include <sys/stat.h>

int main ()
  int segment_id;
  char* shared_memory;
  struct shmid_ds shmbuffer;
  int segment_size;
  const int shared_segment_size = 0x6400;

  /* Allocate a shared memory segment.  */
  segment_id = shmget (IPC_PRIVATE, shared_segment_size,
                     IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
  printf ("1.shmget done\n");
  /* Attach the shared memory segment.  */
  shared_memory = (char*) shmat (segment_id, 0, 0);
  printf ("shared memory attached at address %p\n", shared_memory);
  printf ("2.shmat done\n");
  /* Determine the segment's size. */
  shmctl (segment_id, IPC_STAT, &shmbuffer);
  segment_size  =               shmbuffer.shm_segsz;
  printf ("segment size: %d\n", segment_size);
  printf ("3.shmctl done\n");
  /* Write a string to the shared memory segment.  */
  sprintf (shared_memory, "Hello, world.");
  /* Detach the shared memory segment.  */
  shmdt (shared_memory);
  printf ("4.shmdt done\n");

  /* Deallocate the shared memory segment.  */
  shmctl (segment_id, IPC_RMID, 0);
  printf ("5.shmctl ipc_rmid done\n");

  return 0;

(I took the code from this site, and modified it a bit for my purposes)
If you’ve got a linux system which is setup with the preinstall rpm, you should be able to copy this in a file on your (TEST!) linux database server, in let’s say ‘shm.c’, and compile it using ‘cc shm.c -o smh’. This will create an executable ‘shm’ from this c file.

This program does more or less the same sequence we saw earlier:
1. Create a shared memory identifier.
2. Attach to the shared memory identifier.
3. Get information on the shared memory segment in a shmid_ds struct.
4. Detach the shared memory segment.
5. Destroy it using shmctl(IPC_RMID).

What I did was have two terminals open, one to run the shm program, and one to look for the results of the steps.

Step 1. (shmget)

$ ./shm
1. shmget done

When looking with ipcs, you can see the shared memory segment which is created because of the shmget() call:

$ ipcs -m

------ Shared Memory Segments --------
0x00000000 451608583  oracle     600        25600      0

when looking in the address space of the process running the shm program, the shared memory segment is not found. This is exactly what I expect, because it’s only created, not attached yet.

Step 2. (shmat)

shared memory attached at address 0x7f3c4aa6e000
2.shmat done

Of course the shared memory segment is still visible with ipcs:

0x00000000 451608583  oracle     600        25600      1

And we can see from ipcs in the last column (‘1′) that one process attached to the segment. Of course exactly what we suspected.
But now that we attached the shared memory to the addressing space, it should be visible in maps:

7f3c4aa6e000-7f3c4aa75000 rw-s 00000000 00:04 451608583                  /SYSV00000000 (deleted)

Bingo! The shared memory segment is visible, as it should be, because we just attached it with shmat(). But look: it’s deleted already according to Linux!

However I am pretty sure, as in 100% sure, that I did not do any attempts to mark the shared memory segment destroyed or do anything else to make it appear to be deleted. So, this means maps lies to us.

So, the conclusion is the shared memory Oracle uses is not deleted, it’s something that Linux shows us, and is wrong. When looking at the maps output again, we can see the shared memory identifier is put at the place of the inode number. This is handy, because it allows you to take the identifier, and look with ipcs for shared memory segments and understand which specific shared memory segment a process is using. It probably means that maps tries to look up the identifier number as inode number, which it will not be able to find, and then comes to the conclusion that it’s deleted.

However, this is speculation. Anyone with more or better insight is welcome to react on this article.

Sometimes you need to see the difference between two pieces of console output. When I research, this can be two stacktraces, but also /proc//maps and smaps output; really anything. Of course, there’s diff, but the diff output is not very visual. Also, diff doesn’t do diffing between more than two files.

This can be done reasonably simple in vim. Here’s how to do that:
1. start vi; vi
2. do a vertical split using a new buffer; :vnew
3. open the first (left side) file; :r path/file or goto insert mode (esc i) and paste text.
4. goto the second window: ctrl+w ctrl+w
5. open the second (right side) file; :r path/file or goto insert mode (esc i) and paste text.
6. diff the two windows; :windo diffthis
7. turn diff mode off; :windo diffoff

You can also expand your diffing to three windows:
1. goto the rightside; ctrl+w l
2. change new window placement to the right side; :set splitright
3. do another virtual split; :vnew
4. open another file or paste text
5. diff again; : windo diffthis

I wrote this down for myself, but hopefully this helps other people too.


Get every new post delivered to your Inbox.

Join 2,502 other followers

%d bloggers like this: