
In my previous post, I introduced Intel Pin. If you are new to pin, please follow this link to my previous post on how to set it up and how to run it.

One of the things you can do with Pin is profile memory access. Profiling memory access using the pintool ‘pinatrace’ is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

The pid is the pid of an Oracle database foreground process. Now execute something in the session you attached Pin to, and you will find the ‘pinatrace’ output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind profiling memory access generates A LOT of information! The roughly 10MB above is what a single ‘select * from dual’ generates (!)

This is what the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the address of the instruction that performed the access, the second field is R or W (read or write, obviously), the third field is the memory address that was read or written, the fourth field is the number of bytes read or written and the fifth field is the value that was read or written.
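To get a quick feel for the volume and mix of the accesses, the trace can be summarised with standard shell tools. This is just a minimal sketch against the format shown above (the comment lines starting with ‘#’ are skipped):

$ grep -v '^#' pinatrace.out | awk '{ tot++; if ($2=="R") r++; else if ($2=="W") w++; bytes+=$4 } END { printf "accesses: %d (R: %d, W: %d), bytes touched: %d\n", tot, r, w, bytes }'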

The function that is used can be looked up using the addr2line linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0
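Since addr2line reads addresses from standard input when none are given on the command line, all unique instruction pointer addresses in the trace can be resolved in one go. A minimal sketch (the path to the oracle executable is the one from my system, adjust it to yours; for a big trace this takes a while):

$ grep -v '^#' pinatrace.out | awk '{ sub(":","",$1); print $1 }' | sort -u | \
  addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle | sort -u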

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of such usage is Mahmoud Hatem’s blog on tracing access to a memory location. Wouldn’t it be nice if we could change the function addresses to function names, and the memory addresses to named memory locations whenever possible?

That is why I created the pinatrace annotate oracle tool. This is a small set of scripts that generates memory information from the instance, after which the instruction pointer addresses and memory locations in a pinatrace.out file generated by pinatrace are translated to function names and memory area names. Let’s have a look at what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

The usefulness of this is limited in this form. The only thing I could derive is that big numbers in the memory access column (‘0x00007fff6fd2b2b8’) are probably PGA related, and the numbers between roughly 0x0000000060000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So, now you can see that the reason I picked a seemingly arbitrary range of lines is that this range contains the memory accesses of the ksl_get_shared_latch function. This annotated version shows a shared latch get for the ‘session idle bit’ latch. It is also visible that the function uses PGA memory, some of it annotated and some of it not, and that most of the shared pool access is for the latch itself (a latch essentially is a memory range used to serialise access to a resource), which lives in the shared pool because it is a child latch. It is also visible that memory belonging to X$KSUPR is read and written (X$KSUPR is the table behind V$PROCESS; the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).
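If you want to verify for yourself which address ranges are backed by the SGA rather than by process private memory, the shared memory mappings of the traced process can be checked in /proc (12284 is the pid Pin was attached to earlier; this assumes the SGA is allocated as system V shared memory, in which case the segments show up as /SYSV mappings):

$ grep SYSV /proc/12284/maps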

Why are a lot of the assumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? Well, PGA memory allocations are very transient in nature. Because the PGA memory snapshot is made at a certain point in time, it represents the memory layout of that moment, and there is a high probability that memory has since been deallocated and freed to the operating system. A lot of the SGA/shared pool allocations, on the other hand, are intended to be reusable and thus are not freed immediately after use, which gives the SGA memory snapshot a good chance of capturing a lot of the memory allocations.

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell; it might not work in other shells such as ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script writes its output to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt), or if you want to view and save it at the same time, pipe it through tee (| tee file.txt). A consolidated example session is sketched below.
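Put together, a full run could look roughly like the following sketch; the pid, the directory the repository was cloned into and the way of connecting as SYSDBA are assumptions specific to my environment:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so
$ # run the statement of interest in the attached session, then:
$ cd ~/pinatrace_annotate_oracle
$ mv $ORACLE_HOME/dbs/pinatrace.out .
$ sqlplus / as sysdba @0_get_pga_detail.sql
$ sqlplus / as sysdba @1_generate_memory_ranges.sql
$ sqlplus / as sysdba @2_generate_memory_ranges_xtables.sql
$ sqlplus / as sysdba @3_generate_memory_ranges_pga.sql
$ ./pinatrace_annotate_oracle.sh pinatrace.out | tee annotated.txt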

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manual is to think of Pin as a JIT (just in time) compiler, where the compiler does not take bytecode (as JIT compilation does with Java), but the executable of the process Pin is run against. This means Pin inserts itself into the process’ execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see this process is an oracle foreground process (the first three lines), after which we see a lot of stuff that Pin pushed into the process’ address space.

For this reason, it’s vitally important to run pin as the same user as the process you want to attach it to. The way pin works is that, upon execution, the pin executable inserts itself into the process’ address space, gains control and then tries to load the necessary libraries. If it can’t find these libraries, it will send a SIGKILL to the process, effectively stopping it!

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This will output a lot of stuff to your screen and compile everything in the tools directory.

Now that we have seen an introduction, let’s use a Pin tool to do something useful! One such useful tool is ‘DebugTrace’. First obtain the process id (pid) of an Oracle server foreground process. Then execute the pintool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so

Upon execution, nothing is returned, because most of the work happens in the process Pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let’s look at what this produces! This is what my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it looks rather big (remember I only ran ‘select * from dual’)!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a ‘return underflow’, because these functions were called before the trace was attached to the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called, and the first two arguments of the function.
Returning: the address and function where the return is executed, and the return code.
Tailcall: this is a function that is called as the last instruction of the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).
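Because the trace quickly grows to an enormous number of lines, it can help to summarise it. A minimal sketch that counts which functions are called most often, based on the ‘Call’ lines in the format shown above:

$ awk '/ Call 0x/ { for (i=1; i<=NF; i++) if ($i=="->") { f=$(i+2); sub(/.*:/,"",f); sub(/\(.*/,"",f); print f; break } }' debugtrace.out | sort | uniq -c | sort -rn | head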

Please mind that I am currently not aware of a way to remove the pin framework and tool from a process’ address space. Also mind that because of how this works (a lot of intercepting in the execution flow), it will slow down the attached process significantly. Use this tool and the techniques described above at your own risk.

This blogpost is about the Oracle redo log structures and redo efficiency in modern Oracle databases. Actually, a lot of subtle things have changed around redo (starting from Oracle 10, in fact) which have gone fairly unnoticed. One place where the changes have gone unnoticed is the Oracle documentation: its description of redo is accurate for Oracle 9, not for how it works in Oracle 10 or today in Oracle 12.1.0.2.

My test environment is a virtual machine with Oracle Linux 7.2 and Oracle 12.1.0.2.161018, and a “bare metal” server running Oracle Linux 6.7 and Oracle 12.1.0.2.160419. Versions are important, as things can change between versions.

Multi threaded redo and dynamic strands
One of the prominent changes to the redo mechanism is what is referred to as ‘multi threaded redo’ or ‘multiple log buffers’. Most of the time the redo structures are described as a single redo log buffer in the SGA to which change vectors are written (in a circular fashion), protected by a latch (the redo allocation latch). That has not been an accurate description since Oracle 9.2.

Starting from Oracle 9.2, multiple log buffers can be and most probably are created. At least in Oracle 12.1 you get 2 redo log buffers. Each of these buffers is protected by a redo allocation latch. The benefit of multiple log buffers is that copying redo into the log buffer is not strictly serialised. However, this has some consequences too. One of the consequences is that redo could potentially be written not in strict SCN order by the log writer or its worker processes.

You can see the log buffers in the view x$kcrfstrand:

SYS@testdb AS SYSDBA> select strand_size_kcrfa from x$kcrfstrand where ptr_kcrf_pvt_strand = hextoraw(0) and pnext_buf_kcrfa_cln != hextoraw(0);

STRAND_SIZE_KCRFA
-----------------
         67108864
         67108864

Both values added together exactly match the size of the log buffer:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: log_buffer
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%log_buffer%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden:
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('','Y',3,2))

NAME                         VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
---------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
log_buffer                   134217728                                                              FALSE    FALSE      TRUE

The parameters that handle dynamic redo strands are “hidden”:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: log_parallelism
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%log_parallelism%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden: Y
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('Y','Y',3,2))

NAME                            VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_log_parallelism_dynamic        TRUE                                                                   TRUE     FALSE      FALSE
_log_parallelism_max            2                                                                      TRUE     FALSE      FALSE

The mechanisms can further be investigated using fixed SGA variables:

SYS@testdb AS SYSDBA> oradebug setmypid
Statement processed.
SYS@testdb AS SYSDBA> oradebug dumpvar sga kcrf_max_strands
uword kcrf_max_strands_ [0600283C8, 0600283CC) = 00000002
SYS@testdb AS SYSDBA> oradebug dumpvar sga kcrf_actv_strands
uword kcrf_actv_strands_ [0600283E0, 0600283E4) = 00000001

Note: the output of oradebug is hexadecimal (!). Of course here it doesn’t matter since the values are smaller than 10.
This shows the maximum number of public redo strands is 2 (kcrf_max_strands), as we have seen with x$kcrfstrand, and the number of active strands is 1 (kcrf_actv_strands), which makes sense in my test database, since I am the sole user of the database.
The maximum number of public redo strands seems to be at least 2; some limited research shows the value is set by the formula CPU_COUNT/16 for systems with higher CPU counts (for example, a server with 64 CPUs would get 64/16 = 4 public strands). Some sources report that the value for _log_parallelism_max will be reduced to CPU_COUNT if it is set higher manually.

When I count the number of redo allocation latches, I see something odd:

SYS@testdb AS SYSDBA> select count(*) from v$latch_children where name = 'redo allocation';

  COUNT(*)
----------
	20

A redo allocation latch protects a log strand. I just showed that my system actually has two public log strands. Why do I have 20 redo allocation latches if my system has a fixed number of public redo buffers? This finding leads to private redo strands.

Private strands
Outside of multi threaded public redo buffers, Oracle introduced another optimisation for redo, which are private strands. Private strands optimise redo generation by letting a session create the change vectors for a transaction directly in a private redo buffer in the SGA, instead of generating the change vectors in the PGA, and later copying these to the public redo strand. Some restrictions apply on the use of it, although I am not aware of a concrete list of restrictions and limitations (the closest thing to such a list would be the table x$ktiff shown further below with in-memory undo, which is directly connected to private strands). Whenever a private strand can not be used, Oracle reverts to the ‘classic’ mechanism of redo generation using the public redo log buffer. Each private strand is also protected by a redo allocation latch, just like the public strands.

The private strands SGA buffers are allocated at startup time:

SYS@testdb AS SYSDBA> select pool, name, bytes from v$sgastat where name like 'private strands';

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  private strands               2451456

It seems the number of private strands allocated from the SGA (and, as a consequence, the number of redo allocation latches on top of the public strand count) depends on the transactions database parameter, and takes 10% (_log_private_parallelism_mul) of the transactions value multiplied by 129KB plus 4KB overhead:

SYS@testdb AS SYSDBA> select trunc(value * KSPPSTVL / 100) * (129+4) * 1024 "private strands"
                      from (select value from v$parameter where name = 'transactions') a,
                           (select val.KSPPSTVL
                            from sys.x$ksppi nam, sys.x$ksppsv val
                            where nam.indx = val.indx AND nam.ksppinm = '_log_private_parallelism_mul') b;

private strands
---------------
        2451456

However, that is not all that is in play for private strands. Once the private strands are allocated, and after every log switch, the number of usable private strands is determined by the size of the current online redologfile minus the size of the log buffer, multiplied by the 5% multiplier (_log_private_mul):

SYS@testdb AS SYSDBA> select trunc(((select bytes from v$log where status = 'CURRENT') - (select to_number(value) from v$parameter where name = 'log_buffer'))*
                             (select to_number(val.KSPPSTVL)
                              from sys.x$ksppi nam, sys.x$ksppsv val
                              where nam.indx = val.indx AND nam.ksppinm = '_log_private_mul') / 100 / 66560)
                             as "private strands"
                       from dual;

private strands
---------------
            -22

Yes, that is a negative number. Let’s look at the number of private strands in this instance:

SYS@testdb AS SYSDBA> select indx,strand_size_kcrfa from x$kcrfstrand where last_buf_kcrfa = '00';

no rows selected

The instance takes the lesser of the two calculations above for the number of private strands. That explains my test instance not having any private strands! The size of my log buffer is 128M, while the size of my online redologfiles is 100M. This shows that sizing the online redologfiles too small can lead to inefficiencies (!!).

In-Memory Undo
Another mechanism that has been changed in order to optimise generating redo is in-memory undo. Actually, in-memory undo is directly tied to private redo strands. The idea of in-memory undo is to store the undo (read consistency related) part of a transaction in an in-memory undo buffer until it is committed or rolled back. The in-memory undo pool can be seen in the shared pool using:

SYS@testdb AS SYSDBA> select pool, name, bytes from v$sgastat where name = 'KTI-UNDO';

POOL         NAME                                                    BYTES
------------ -------------------------------------------------- ----------
shared pool  KTI-UNDO                                              1274976

The in-memory undo pool is externalised through x$ktifp. This view shows the in-memory buffers and their sizes:

SYS@testdb AS SYSDBA> select ktifpno, ktifppsi from x$ktifp;

   KTIFPNO   KTIFPPSI
---------- ----------
         0      65535
...
        17      65535

The number of in-memory undo buffers is determined by the transactions parameter:

SYS@testdb AS SYSDBA> show parameter transactions

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
transactions                         integer     189
transactions_per_rollback_segment    integer     5

It defaults to 10% of the transactions parameter. Every in-memory undo buffer is protected by an in-memory undo latch:

SYS@testdb AS SYSDBA> select count(*) from v$latch_children where name like 'In memory undo latch';
        18

In-memory undo can be switched on and off at the instance level or at the session level using the ‘_in_memory_undo’ parameter, and is true/on by default. Switching off in-memory undo switches off private strands too.

In fact some of these optimisations are announced in the alert log file during startup of an instance:

IMODE=BR
ILAT =18

BR means batched redo mode, which means the private strands feature is turned on (I found that with the too small online redologfiles described above, which caused no private strands to be made available, the database still announces ‘BR’).
ILAT means the number of in-memory undo pools and latches.
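To check which mode your own instance started in, the alert.log can simply be grepped. A sketch; the path assumes the standard diagnostic destination layout for my instance named testdb, adjust it to your own environment:

$ grep -e IMODE -e ILAT /u01/app/oracle/diag/rdbms/testdb/testdb/trace/alert_testdb.log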

Because in-memory undo is directly tied to private redo strands, the usage of private strands and in-memory undo can be monitored using the view x$ktiff:

SYS@testdb AS SYSDBA> select ktiffcat, ktiffflc from x$ktiff;

KTIFFCAT															   KTIFFFLC
-------------------------------------------------------------------------------------------------------------------------------- ----------
Undo pool overflow flushes														  0
Stack cv flushes															  1
Multi-block undo flushes														  0
Max. chgs flushes															  0
NTP flushes																  0
Contention flushes															  0
Redo pool overflow flushes														  0
Logfile space flushes															  0
Multiple persistent buffer flushes													  0
Bind time flushes															  0
Rollback flushes															  2
Commit flushes															       1694
Recursive txn flushes															  0
Redo only CR flushes															  0
Ditributed txn flushes															  0
Set txn use rbs flushes 														  0
Bitmap state change flushes														  0
Presumed commit violation														  0
Securefile direct-write lob update flushes												  0
Unknown 																  0
Unknown 																  0

This view shows the reasons for flushing the in-memory undo buffers.

Throw-away undo
The reason that the database mentions ‘IMODE’ is that there is actually another mode besides ‘BR’: ‘TUA’. The ‘TUA’ abbreviation is a bit weird (Throw Undo Away vs. Throw-away undo?), but it means the database is set to ‘throw-away undo mode’. This is accomplished by setting the undo_retention parameter to a value of 3 or lower, and setting the undocumented parameter _imu_pools to a value greater than 3. Once this is done and the instance is restarted, it will be reported in the alert.log file as follows:

IMODE=TUA
ILAT =189
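For completeness, this is roughly how such a change could be scripted. This is a sketch only: it assumes an spfile is in use, the value 189 for _imu_pools is just an illustration taken from the ILAT value above, and because of the consequences described below this should not be done on a production database without careful consideration:

$ sqlplus / as sysdba <<EOF
alter system set undo_retention=3 scope=spfile;
alter system set "_imu_pools"=189 scope=spfile;
shutdown immediate
startup
EOF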

The purpose of throw-away undo mode is to throw away undo instead of applying it (the name already hints at that :-)). By enabling throw-away undo, some of the actions normally executed are not done anymore, such as applying undo (not all undo application is skipped), which reduces block changes, reads, writes, buffer activity and redo activity. By enabling this mode, and thus not performing some of the actions that are normally done, there are consequences: it significantly increases the risk of ORA-1555 snapshot too old errors. Also, a lot of database features are not compatible with ‘TUA mode’: flashback database, cluster_database=TRUE and supplemental logging, to name a few. However, it looks to me like a mode designed for a specific purpose: being able to do massive loads or changes at a faster rate, with some consequences. I would add that this mode should always be a temporary setting, and once the desired action is completed, ‘TUA’ mode should be reverted back to normal behaviour by resetting _imu_pools and increasing undo_retention back to your business transaction needs.

Conclusion
The aim of this blogpost is to describe private strands, in-memory undo and throw-away undo. The name of the blogpost is actually a hat-tip, mainly to the paper from Stephan Haisley, which covers most of what is written here. So in that sense, it is a ‘redo’ of his paper, although I tested all of this on the Oracle 12.1.0.2 database instances mentioned above.

Private strands and in-memory undo reduce the work needed when changes are applied to the database. You might want to make sure you are using these features, especially since too small online redologfiles can prevent private strands from being made available. I come across a lot of databases which have inappropriately sized online redologfiles.

Throw-away undo mode, or TUA mode, is a mode which can reduce the time spent on bulk actions. However, everything comes at a price: for TUA mode it means the database loses some of its durability properties. If this is used for importing huge amounts of data, or doing mass changes without any session concurrency, it could be a welcome time saver.

References:
Hellodba
Stephan Haisley’s 2008 presentation ‘Redo and Undo Optimisations in Oracle 10g’
Tanel Põder’s presentation ‘Performance and Scalability Improvements in Oracle 10g and 11g’
Jonathan Lewis’ write-ups about private strands and ‘Oracle Core essential internals for DBAs and Developers’.
Twitter: @piontekdd, @martinberx, @westendwookie
Thanks: Mauro Pagano

There are many posts about the amount of memory that is taken by the Oracle database executables and the database SGA and PGA. The reason for adding yet another one on this topic is a question I recently got, and the complexity that surrounds memory usage on modern systems. The intention of this blogpost is to show a tiny bit about page sharing on linux for private pages, then move on to shared pages, and discuss what page allocation looks like with Oracle ASMM (sga_target) or manual memory management.

The version of linux in this blogpost is Oracle Linux 7.2, using kernel: 4.1.12-37.6.3.el7uek.x86_64 (UEK4)
The version of the Oracle database software is 12.1.0.2.160719 (july 2016).

Memory usage of virtual memory systems is complicated. For that reason I see a lot of people getting very confused about this topic. Let me state a very simple rule: the memory actively being used on a system should fit in physical memory. Swap (a file or partition) increases total virtual memory, but really is only a safety net that saves your system from an out-of-memory situation, at the cost of moving pages from and to disk. Because modern linux kernels have swappiness (willingness to swap) set to a non-zero value, it’s not uncommon to have some swap being used, despite physical memory not being oversubscribed. A system stops performing as soon as paging in and out starts to occur, so this should not happen.
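If you want to check the willingness to swap of your own system, it can be read from the proc filesystem (or via sysctl):

$ cat /proc/sys/vm/swappiness
$ sysctl vm.swappiness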

1. Private pages for linux executables
When an executable is executed on linux from the shell, the shell executes a fork() call to create a new process, which is implemented as a clone() system call on linux. Using the clone() system call, the virtual memory space of the newly created process is shared (read-only) with its parent. This includes the private allocations! Once the child process needs to write in its memory space, it will page fault and create its own version, abandoning the version of its parent.

Can we actually prove this is happening? Yes, the /proc/ filesystem gives insight into a process’ virtual memory space.
Let’s start off with a very simple example: we execute ‘cat /proc/self/maps’ to see our own address space:

[oracle@oracle-linux ~]$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
0060c000-0060d000 rw-p 0000c000 fb:00 201666243                          /usr/bin/cat
00e41000-00e62000 rw-p 00000000 00:00 0                                  [heap]
7f69729be000-7f6978ee5000 r--p 00000000 fb:00 576065                     /usr/lib/locale/locale-archive
7f6978ee5000-7f6979099000 r-xp 00000000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f6979099000-7f6979298000 ---p 001b4000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f6979298000-7f697929c000 r--p 001b3000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f697929c000-7f697929e000 rw-p 001b7000 fb:00 522359                     /usr/lib64/libc-2.17.so
7f697929e000-7f69792a3000 rw-p 00000000 00:00 0
7f69792a3000-7f69792c4000 r-xp 00000000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794b9000-7f69794bc000 rw-p 00000000 00:00 0
7f69794c3000-7f69794c4000 rw-p 00000000 00:00 0
7f69794c4000-7f69794c5000 r--p 00021000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794c5000-7f69794c6000 rw-p 00022000 fb:00 522352                     /usr/lib64/ld-2.17.so
7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0
7ffdab1c7000-7ffdab1e8000 rw-p 00000000 00:00 0                          [stack]
7ffdab1ea000-7ffdab1ec000 r--p 00000000 00:00 0                          [vvar]
7ffdab1ec000-7ffdab1ee000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

There’s a lot to see here, but we see the cat executable at 0x00400000. The reason for the three memory allocations is that (linux/ELF) executables use different sections with specific functions. A full overview of these can be obtained using the readelf executable. A simpler overview of an executable, which matches the above three memory allocations for the cat executable, can be obtained using ‘size -B’ (the size executable; -B means ‘berkeley style’, which is the default):

[oracle@oracle-linux ~]$ size -B /usr/bin/cat
   text	   data	    bss	    dec	    hex	filename
  43905	   1712	   2440	  48057	   bbb9	/usr/bin/cat

This describes the three memory sections a linux executable can have: text (the machine instructions, alias ‘the program’), data (all initialised variables declared in the program) and BSS (uninitialised data).
The first section always is the text allocation (I am not sure if it’s impossible for the text section not to be the first allocation, I have never seen it different). If you look at the memory flags, ‘r-xp’, this totally makes sense: ‘r-’ meaning read (only, followed by a ‘-’ instead of a ‘w’), ‘x’: executable and ‘p’: this is a private allocation. The next allocation is the data section. We don’t execute variables, we read them, which is reflected in the flags: ‘r--p’. But what if we change the value of a variable? That is what the third section is for: changed values of initialised variables. This can be seen from the flags of this section: ‘rw-p’, read, write and private. The fourth allocation lists [heap]; this is a mandatory allocation in every process’ memory space, which holds (small) memory allocations, and is NOT the BSS section. In this case, the BSS section does not seem to be allocated.
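For reference, the section-level detail that ‘size -B’ summarises can be listed with readelf, which was mentioned above. A minimal sketch that picks out the three sections just discussed (some binaries show additional, related sections):

$ readelf -S -W /usr/bin/cat | grep -E '\.text|\.data|\.bss'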

By having memory allocations for /usr/lib64/ld-2.17.so we can see this is a dynamically linked executable. You can also see this by executing ‘file’ on the executable:

[oracle@oracle-linux ~]$ file /usr/bin/cat
/usr/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=3207edc47638918ceaeede21947a20a4a496cf63, stripped

If a linux executable is dynamically linked, you can see the libraries that are loaded by the dynamic linker/loader using the ldd utility:

[oracle@oracle-linux ~]$ ldd /usr/bin/cat
       	linux-vdso.so.1 =>  (0x00007ffceb3e4000)
       	libc.so.6 => /lib64/libc.so.6 (0x00007fd46fb7e000)
       	/lib64/ld-linux-x86-64.so.2 (0x000055d5253c9000)

This output shows the dynamic loader (/lib64/ld-linux-x86-64.so.2), and two libraries the dynamic loader loads: libc.so.6 and linux-vdso.so.1. The first one, libc, is the standard C library. The second one, linux-vdso, stands for virtual dynamic shared object, which is an optimisation that allows certain system calls to be executed in user space (notably gettimeofday()).
The other allocations that exist in our example are anonymous mappings (usually done by programs using the mmap() call):

7f69794c6000-7f69794c7000 rw-p 00000000 00:00 0

And some allocations for system purposes, like stack, var, vdso and vsyscall.

Now that you have become familiar with some basic linux memory address space specifics, let’s take it a little further. It’s possible to see more about the memory segments using the proc filesystem smaps file:

[oracle@oracle-linux ~]$ cat /proc/self/smaps
00400000-0040b000 r-xp 00000000 fb:00 201666243                          /usr/bin/cat
Size:                 44 kB
Rss:                  44 kB
Pss:                  44 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:        44 kB
Private_Dirty:         0 kB
Referenced:           44 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
0060b000-0060c000 r--p 0000b000 fb:00 201666243                          /usr/bin/cat
Size:                  4 kB
Rss:                   4 kB
...etc...

Per allocation there are a lot of properties to be seen. ‘Size’ is the full size, ‘Rss’ is the resident set size, alias the amount of data of this segment that is truly resident for this process in its address space. ‘Pss’ is fairly unknown, and is the proportional size of this segment. The way it is proportional is that if pages in this allocation are shared with other processes, the size of those pages is divided by the number of processes they are shared with. In this case, we have loaded the text segment of the cat executable into the process’ address space, which is all resident (size and rss are the same) and not shared with any process (rss equals pss). There are many more properties, but these are out of scope for this blogpost.

Now let’s move on to Oracle. If you look at the maps output of the pmon process for example, you’ll see:

[oracle@oracle-linux 14153]$ cat maps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8f000-10de8000 rw-p 1058f000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10de8000-10e19000 rw-p 00000000 00:00 0
1190f000-11930000 rw-p 00000000 00:00 0                                  [heap]
...

Here we see the Oracle executable, with a text segment, a readonly data segment and a read/write data segment, and we see an anonymous mapping directly following the data segments. That’s the BSS segment!
However, what is more interesting to see are the properties of the distinct memory allocations in smaps:

[oracle@oracle-linux 14153]$ cat smaps
00400000-1096e000 r-xp 00000000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
Size:             267704 kB
Rss:               40584 kB
Pss:                 819 kB
Shared_Clean:      40584 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:        40584 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd
10b6d000-10b8f000 r--p 1056d000 fb:03 67209358                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
Size:                136 kB
Rss:                 124 kB
...

If we look at the text segment for the oracle binary, we see the total text size is 267704 kB (size), but resident (truly available for this process in its address space) is only 40584 kB (rss), and because the oracle executable’s text segment is shared with a lot of processes, the proportional size is only 819 kB (pss).

If you want to understand how much memory is taken in the system, the size tells the total size of the segment, but it doesn’t say anything about true memory usage. The rss size tells the amount of pages of the segment that are paged into the address space of a process, and can be (and is, for oracle) different for every process. The pss size is the proportional size for every process. Probably the only way to tell the true amount of memory taken by executables and libraries is to add up all the pss sizes. Any other value only tells something about the process’ point of view on memory usage, but not the overall, truly consumed space, because that would lead to counting too much.
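Adding up the pss figures is easy to do from the proc filesystem. A minimal sketch for a single process (14153 is the pmon pid used above), followed by a rough total over all processes of the oracle user:

$ awk '/^Pss:/ { sum += $2 } END { print sum " kB" }' /proc/14153/smaps
$ for p in $(pgrep -u oracle); do awk '/^Pss:/ { s += $2 } END { print s+0 }' /proc/$p/smaps; done | awk '{ t += $1 } END { print t " kB total pss" }'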

This is different for anonymous allocations. Since anonymous allocations are created when a process is run, I’ve only seen them initialised purely private. For that reason the rss and pss sizes are equal, because every process initialises them strictly for itself. This too works in a lazy allocation way: when memory is allocated, the size is defined, but it is only really allocated once it’s truly used, which is expressed by a difference between size and rss.

2. shared pages
The Oracle databases relies on shared caches and data structures, which are put into what is called the SGA, the system global area. The main components of the SGA are the shared pool (shared structures), log buffer (change vectors to be written to disk to persist changes) and the buffer cache, amongst others. With any memory management option (manual management, ASMM (automatic shared memory management, sga_target) and AMM (automatic memory management, memory_target)) there is a SGA. Depending on the memory option, these are visible in a different way.

When manual memory or ASMM is used, shared memory is allocated as system V shared memory. The ‘classic’ way of looking at system V shared memory is using ipcs -m (m is for shared memory, you can also use s for semaphores and q for message queues):

[oracle@oracle-linux ~]$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 655360     oracle     600        2932736    124
0x00000000 688129     oracle     600        905969664  62
0x00000000 720898     oracle     600        139673600  62
0x5f921964 753667     oracle     600        20480      62

Please mind that if you have more than one instance active, or an ASM instance active, you will see more shared memory allocations.
Apparently, the oracle database allocates a couple of shared memory segments. If you want to understand what these memory allocations are for, you can use the oradebug ipc command to see what their functions are:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug ipc
IPC information written to the trace file

This generates a trace file in the ‘trace’ directory in the diagnostics destination. This is what it looks like (partial output, with the content of interest to this blogpost):

 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002cbe70 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2   655360 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cc000 00000000002cc000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000036000000 Minimum Subarea size 00400000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0   688129 0x00000060400000 0x00000060400000 0x00000060400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000036000000 0000000036000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1   720898 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008534000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000005000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3   753667 0x0000009ec00000 0x0000009ec00000 0x0000009ec00000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000005000 0000000000005000 default       readwrite

The first allocation is ‘fixed size’, alias the fixed SGA, the second allocation is the ‘variable size’, which contains the shared pool and the buffercache, the third allocation is the ‘redo buffers’ and the fourth is the ‘skgm overhead’ alias the index into the shared memory structures for this instance.

Because any memory allocation is visible in maps and smaps, this method can be used for shared memory too, to see how the shared memory segments are mapped into the process address space. All oracle database server processes have the shared memory segments for the instance mapped into their address space. The usage is different per process, so the amount of shared memory paged into the address space will be different:

...
12bcd000-12bee000 rw-p 00000000 00:00 0                                  [heap]
60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
60400000-96400000 rw-s 00000000 00:05 688129                             /SYSV00000000 (deleted)
96400000-9e934000 rw-s 00000000 00:05 720898                             /SYSV00000000 (deleted)
9ec00000-9ec05000 rw-s 00000000 00:05 753667                             /SYSV5f921964 (deleted)
7f473004e000-7f47301d4000 r-xp 00000000 fb:02 212635773                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
...

Shared memory is easily identified by the ‘s’, where “normal” private memory mappings have ‘p’. If you want to know more about the process’ perspective of the shared memory, we can use smaps, just like with private memory mappings (this is the virtual memory space of pmon):

60000000-60001000 r--s 00000000 00:05 655360                             /SYSV00000000 (deleted)
Size:                  4 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd sh mr mw me ms sd
60001000-602cc000 rw-s 00001000 00:05 655360                             /SYSV00000000 (deleted)
Size:               2860 kB
Rss:                 392 kB
Pss:                  36 kB
Shared_Clean:          0 kB
Shared_Dirty:        372 kB
Private_Clean:         0 kB
Private_Dirty:        20 kB
Referenced:          392 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms sd

These two shared memory segments belong to the fixed SGA. The reason for two segments is that the first page (0x1000 equals 4096, a.k.a. a single Linux page) is readonly (r--s), while the other fixed SGA segment is read write (rw-s). Here we see that from the process’ perspective it really doesn’t matter much whether a piece of mapped memory is shared or private; it is handled in exactly the same way: the full segment is mapped into the process’ virtual memory space, but only once pages are touched (truly used) does the process register the address in its pagetable and do the pages become resident (as can be seen in the difference between the total size and the rss). The sole purpose of shared memory is that it is shared between processes. That the pages are shared is very well visible in the difference between the rss and pss sizes. It’s also easy to spot that this shared memory segment is created from small pages: MMUPageSize and KernelPageSize are 4 kB.
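
If you want a quick overview of how much of the shared memory a process has actually paged in, you can sum the Rss and Pss figures of the System V mappings in its smaps. A minimal sketch, assuming a single instance so that ‘pgrep pmon’ returns exactly one pid:

$ awk '/SYSV/ {s=1} s && /^Rss:/ {rss+=$2} s && /^Pss:/ {pss+=$2; s=0} END {printf "Rss: %d kB, Pss: %d kB\n", rss, pss}' /proc/$(pgrep pmon)/smaps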

However, this yields an interesting question: shared memory does not belong to any single process. Does that mean that if a shared memory segment is created, it is truly allocated, or can shared memory be lazily allocated as well? Please mind that the above statistics are the process’ perspective, not the kernel’s perspective.

One way to see the state of shared memory system wide, is using the ‘-u’ flag with the ipcs command:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  255684
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This is a really useful view! What we can see from the output of this command is that nearly all pages allocated as shared memory are resident. Because there are separate statistics for shared memory pages allocated and pages resident, we can conclude that shared memory too can be allocated in a lazy fashion, a.k.a. on demand. Also, there is a difference between resident and allocated here, which indicates lazy allocation too.
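
To put a number on that difference, the allocated and resident page counts can be compared directly. A small sketch, assuming the 4 kB page size that ipcs reports in:

$ ipcs -mu | awk '/pages allocated/ {a=$3} /pages resident/ {r=$3} END {printf "allocated: %d kB, resident: %d kB (%.1f%% resident)\n", a*4, r*4, r*100/a}'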

Inside the database I am aware of two parameters that can influence shared page usage: pre_page_sga and _touch_sga_pages_during_allocation, see my article on these. However, what is interesting is that these parameters are different for the instance I am testing with for this blogpost, which is running on a VM:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: page
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%page%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden: Y
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('Y','Y',3,2))

NAME   						   VALUE       								  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
olap_page_pool_size    				   0   									  TRUE 	   FALSE      FALSE
pre_page_sga   					   TRUE        								  TRUE 	   FALSE      FALSE
use_large_pages        				   TRUE        								  TRUE 	   FALSE      FALSE
_max_largepage_alloc_time_secs 			   10  									  TRUE 	   FALSE      FALSE
_olap_page_pool_expand_rate    			   20  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hi     				   50  									  TRUE 	   FALSE      FALSE
_olap_page_pool_hit_target     			   100 									  TRUE 	   FALSE      FALSE
_olap_page_pool_low    				   262144      								  TRUE 	   FALSE      FALSE
_olap_page_pool_pressure       			   90  									  TRUE 	   FALSE      FALSE
_olap_page_pool_shrink_rate    			   50  									  TRUE 	   FALSE      FALSE
_realfree_heap_pagesize        			   65536       								  TRUE 	   FALSE      FALSE
_realfree_pq_heap_pagesize     			   65536       								  TRUE 	   FALSE      FALSE
_session_page_extent   				   2048        								  TRUE 	   FALSE      FALSE
_touch_sga_pages_during_allocation     		   FALSE       								  TRUE 	   FALSE      FALSE

14 rows selected.

In the database I created on my VM, pre_page_sga is TRUE and _touch_sga_pages_during_allocation is FALSE, which is the exact inverse of the settings of a database (PSU 160419) on a huge machine. Perhaps these parameters are set dynamically based on the size of the SGA and some logic (if _touch_sga_pages_during_allocation is TRUE, it makes sense to set pre_page_sga to FALSE, as its function has already been performed by the bequeathing session).

However, with pre_page_sga set to TRUE it makes sense that almost all SGA (shared) pages are allocated, because pre_page_sga (at least in Oracle 12, not sure about earlier versions, because the Oracle description of this parameter is different from what happens in Oracle 12) spawns a background process (sa00) that scans SGA pages, which means it pages them, resulting in the actual allocation. Let’s test this by setting pre_page_sga to false; it should lead to far fewer shared memory pages being allocated, which will eventually be allocated as database processes page them in:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force;

And then look at ipcs -mu again:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  92696
pages swapped   0
Swap performance: 0 attempts   	 0 successes

As expected, only the strictly necessary pages are resident after ‘startup force’; all the other shared pages will slowly be paged in as foreground and background processes touch SGA pages during execution.
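
One way to actually watch this happening is to leave a loop running that samples ipcs while the instance is being used; the resident count then slowly climbs towards the allocated count. A simple sketch:

$ watch -n 10 "ipcs -mu | grep -E 'allocated|resident'"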

How would that work when we set sga_max_size to a different value than sga_target? If the pages beyond sga_target are never allocated, you could control the number of SGA pages used by setting sga_target, but ‘reserve’ extra memory by setting sga_max_size higher; since that extra memory is never allocated, it is not wasted. Let’s set up the instance:

SQL> alter system set pre_page_sga=true scope=spfile;
SQL> show spparameter sga_target

SID    	 NAME  			       TYPE    	   VALUE
-------- ----------------------------- ----------- ----------------------------
*      	 sga_target    		       big integer 1000M
SQL> ! ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 256005
pages resident  102512
pages swapped   0
Swap performance: 0 attempts   	 0 successes

This sets the pre_page_sga parameter in the spfile, which means the instance will spawn a process to touch SGA pages on the next startup.
Currently, the sga_target for sizing the SGA is set to 1000M in the spfile.
ipcs tells us 256005 pages are allocated, which makes sense: 256005*4=1024020 kB, slightly more than the 1000M that was set, so essentially pages allocated equals sga_target.

SQL> alter system set sga_max_size=2g scope=spfile;
SQL> startup force;
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

This sets sga_max_size to double the amount of sga_target, and ‘startup force’ bounces the instance.

SQL> show parameter sga_target

NAME   				     TYPE      	 VALUE
------------------------------------ ----------- ------------------------------
sga_target     			     big integer 1008M

Here we see the actual parameter in the database is set to 1008M. Now let’s look at the ipcs -mu values again:

SQL> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  521923
pages swapped   0
Swap performance: 0 attempts   	 0 successes

521923*4=2087692 kB. So (almost) all the memory set for sga_max_size is allocated. In fact, if you look at the instance startup values reported above, you see ‘Total System Global Area’ showing the full 2G: it’s all SGA, and it’s all touched because pre_page_sga is set to TRUE. So the next test is to have pre_page_sga set to FALSE:

SQL> alter system set pre_page_sga=false scope=spfile;
SQL> startup force
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size     		    2926472 bytes
Variable Size  		 1358956664 bytes
Database Buffers       	  637534208 bytes
Redo Buffers   		  148066304 bytes
Database mounted.
Database opened.

All memory is still declared SGA, as we can see. However, with both _touch_sga_pages_during_allocation and pre_page_sga set to FALSE, we should see only the actually used SGA pages become resident:

SQL> !ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524291
pages resident  91692
pages swapped   0
Swap performance: 0 attempts   	 0 successes

The above output shows the shared memory status directly after I restarted my instance, so this is not only less than sga_max_size, it is even less than sga_target (91692*4=366768 kB, ~358M). This will grow towards sga_target, because these pages will get paged in by the database processes.

What does this look like when we add in huge pages? In Oracle 12.1.0.2.160719 in my instance, the parameter that tells Oracle to allocate huge pages if there are any (‘use_large_pages’) is set to TRUE. This will make Oracle use large pages if any are available, even if there are not enough huge pages to satisfy the entire SGA; Oracle will just allocate what can be allocated from huge pages, and create a new shared memory segment using small pages for the remainder of the needed shared memory.

Sadly, it seems per memory segment statistics like rss, pss, shared and private clean and dirty, etc. are not implemented for huge pages:

[oracle@oracle-linux [testdb] ~]$ cat /proc/$(pgrep pmon)/smaps
...
61000000-d8000000 rw-s 00000000 00:0e 688129                             /SYSV00000000 (deleted)
Size:            1949696 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB
VmFlags: rd wr sh mr mw me ms de ht sd
...

This is the main shared memory segment, allocated from huge pages (as can be seen with KernelPageSize and MMUPageSize), which means it’s the segment holding the shared pool and buffercache. This can also be seen by the size: 1949696 kB, which is nearly the 2G of sga_max_size.

However, we can just use the global information on system V shared memory (ipcs -mu) and we can use the huge page information in /proc/meminfo:

[oracle@oracle-linux [testdb] ~]$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:    1100
HugePages_Free:      880
HugePages_Rsvd:      805
HugePages_Surp:        0
Hugepagesize:       2048 kB

The statistics of interest are (a small calculation with them follows after this list):
HugePages_Total: the total number of huge pages allocated. Warning: huge page memory allocated by the kernel is NOT available for allocation of regular sized pages, which means you can starve your processes and the kernel of normal pages by setting the number of huge pages too high.
HugePages_Free: the number of huge pages which are currently not in use. Warning: this includes pages that are allocated but not yet initialised, which HugePages_Rsvd shows.
HugePages_Rsvd: the number of huge pages allocated but not yet initialised.
HugePages_Surp: the number of huge pages allocated beyond the total number of huge pages set. This value can be greater than zero if the kernel setting vm.nr_overcommit_hugepages is greater than zero. The value of this setting is zero by default, and at least for usage with the Oracle database, it should remain zero.
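
From these counters you can derive how many huge pages an instance effectively claims: the pages in use are HugePages_Total minus HugePages_Free, and the reserved (allocated but not yet initialised) pages come on top of that. A small sketch of that calculation:

$ awk '/HugePages_Total/ {t=$2} /HugePages_Free/ {f=$2} /HugePages_Rsvd/ {r=$2} END {printf "in use: %d, claimed incl. reserved: %d, total: %d\n", t-f, t-f+r, t}' /proc/meminfo

With the figures shown above this reports 220 pages in use and 1025 pages claimed including reserved, which at 2 MB per page is roughly the 2G of sga_max_size.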

The same information can be obtained using ipcs -mu, but with a twist:

[oracle@oracle-linux [testdb] ~]$ ipcs -mu

------ Shared Memory Status --------
segments allocated 4
pages allocated 524803
pages resident  122881
pages swapped   0
Swap performance: 0 attempts   	 0 successes

Some of you might spot the twist by looking at the numbers.
It turns out ipcs has no facility for huge pages; it just reports the number of pages as if they were 4 kB.
524803*4 (kB) / 1024 (to make it MB) = 2050.

Now going back to the goal of looking into this: I showed that shared memory is allocated and paged at startup time when _touch_sga_pages_during_allocation is set to TRUE (it is set to FALSE by default in my current database), and that it can be explicitly paged by the background process sa00 after startup of the instance when pre_page_sga is set to TRUE. When both are set to FALSE, shared memory allocated from default-sized 4 kB pages is only paged in when it is used. In the above examples with huge pages, the tests were done with pre_page_sga set to FALSE. This shows exactly the same ‘lazy allocation’ behaviour as with 4 kB pages.

When ‘extra’ memory is reserved from the operating system by setting sga_max_size to a higher value than sga_target, all of it will be allocated and paged if either _touch_sga_pages_during_allocation or pre_page_sga is set to TRUE, which doesn’t make sense; if the memory is taken, you might as well use it. However, this is different if both _touch_sga_pages_during_allocation and pre_page_sga are set to FALSE. All memory beyond sga_target up to sga_max_size is allocated, but never touched, and thus never paged in, so never truly allocated. Please mind Linux itself handles this perfectly (think of huge pages and ‘reserved’ pages); however, the System V IPC kernel settings do not: you need to set the shared memory values high enough to accommodate the total sum of the sga_max_size values, not the truly used sizes as indicated by the sum of the sga_target values.
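
A minimal sketch of what that means for the System V kernel parameters, assuming a 4 kB PAGE_SIZE and a summed sga_max_size of 64G across all instances on the host (the values are illustrative, not a recommendation):

# sysctl -w kernel.shmmax=68719476736   # largest single segment in bytes; must cover the biggest sga_max_size
# sysctl -w kernel.shmall=16777216      # total shared memory in PAGE_SIZE (4 kB) pages: 64G / 4 kB
# sysctl -w kernel.shmmni=4096          # maximum number of shared memory segments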

The inspiration for this investigation came from a question on my blog. However, that question was about memory_target and memory_max_target on AIX. I do not have an AIX system at hand and did not investigate the implementation of memory_target and memory_max_target on AIX, so I can’t comment on that. What I can say is that on Linux you really, really should use automatic shared memory management (ASMM), a.k.a. setting sga_target, or size the SGA manually (and set huge pages!). If you are used to these memory management settings on databases not on AIX, it probably makes sense to use them on AIX too, for the sake of predictability and standardisation, even if automatic memory management (AMM), a.k.a. setting memory_target, is implemented brilliantly on AIX.

In a previous article called ‘memory allocation on startup’ I touched on the subject of NUMA: Non-Uniform Memory Access. This article is about how to configure NUMA, how to look into NUMA usage, and a real life case of NUMA optimisation using in-memory parallel execution.

At this point in time (the start of the summer of 2016) we see that the CPU speed competition has stagnated and settled somewhere below 4 gigahertz, and that instead the number of cores and the size of memory are growing. The commonly used server type in the market I am in is a two socket server. It is not unthinkable that in the near future servers with more than two sockets will increase in popularity, which facilitates the increase in (parallel) computing capacity and the maximum amount of memory.

On the other hand there already are servers with more than two sockets, some of which have been multi socket for quite a while: Oracle/Sun M series, IBM P series, and a large diversity of Intel-based 2+ socket servers, like the Oracle/Sun “-4” and “-8” series. The investigations in this blogpost are done on an SGI UV 300RL, which is a system with (up to) 32 sockets, 576 cores, and 1152 CPU threads.

There already have been a lot of blogposts about Linux and NUMA:
Kevin Closson
Martin Bach: linux-large-pages-and-non-uniform-memory-distribution, _enable_numa_support and numactl
Yves Colin
Bertrand Drouvot
I am sure there will be more, these are the ones that I read in the past.

This article will take you through the configuration of NUMA for the Oracle database and how to measure specifics from the Linux side. It will touch some of the things mentioned in the other blogposts, in order to provide a consistent story.
The database version is: 12.1.0.2.160419
The operating system is Oracle Linux 7.1
Kernel: 3.8.13-98.4.1.el7uek.x86_64

1. NUMA or not.
In order to sensibly use NUMA optimisation with the Oracle database, you first need hardware that is actually NUMA capable. This means multiple sockets in use that have local memory (please mind this includes common two socket server hardware). The next thing needed to use NUMA is NUMA support enabled in the Linux kernel. With the current version of the oracle-preinstall rpm, NUMA is disabled on the kernel load line in grub (numa=off). When NUMA is disabled, numactl will show only one NUMA node (=socket).

First make sure you are on NUMA enabled hardware by issuing ‘numactl --hardware’ and checking that multiple nodes are listed.
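
A quick sketch of checking both: see whether the kernel was booted with numa=off, and whether numactl reports more than one node (on NUMA enabled two socket hardware numactl typically starts its output with something like ‘available: 2 nodes (0-1)’):

$ grep -o numa=off /proc/cmdline
$ numactl --hardware | grep available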

2. Default Oracle.
When you don’t modify (undocumented) NUMA parameters, Oracle will NOT run in NUMA aware mode. In other words, you explicitly need to enable NUMA in the database. This is also visible if you look in the numa_maps, which shows process memory and NUMA information:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3339 mapmax=904 N0=112 N1=98 N2=98 N3=98 N4=98 N5=98 N6=98 N7=97 N8=96 N9=96 N10=96 N11=101 N12=108 N13=106 N14=105 N15=105 N16=107 N17=107 N18=106 N19=109 N20=106 N21=106 N22=107 N23=107 N24=106 N25=107 N26=108 N27=108 N28=109 N29=112 N30=111 N31=113
9c460000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:1 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N0=1

A few things to note: I took the pmon process, and my instance name is slob (pmon_slob). The SGA is allocated exclusively from huge pages, so I grepped for ‘huge’. numa_maps only shows the process’ paged memory; my SGA is way bigger. The important thing to see here is that the main SGA memory allocation at 80000000 reads ‘interleave:0-31’, which means the memory is spread over all the NUMA nodes. This can also be seen on the same line, because the memory allocation per node is shown as Nnn=nr.pages.

3. Set parameters
In Oracle 12.1.0.2 you need to set two parameters to fully enable NUMA support in the database:

alter system set "_enable_NUMA_support"=true scope=spfile;
alter system set "_px_numa_support_enabled"=true scope=spfile;

These are undocumented parameters. However, these parameters are explicitly mentioned in MOS article ‘What Is Correct Setting for _px_numa_support_enabled for 12c Databases That Wish to Use NUMA?’ (Doc ID 1956463.1).

Once you have set these parameters, Oracle will start up in NUMA aware mode. One of the things that is different in NUMA aware mode, is how the buffercache is created. This can be seen by looking at the numa_maps file of pmon again:

$ grep huge /proc/$(pgrep pmon_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=904 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=3587 mapmax=904 N0=112 N1=112 N2=112 N3=112 N4=112 N5=112 N6=112 N7=112 N8=112 N9=112 N10=112 N11=112 N12=113 N13=112 N14=113 N15=112 N16=112 N17=112 N18=113 N19=112 N20=112 N21=112 N22=112 N23=112 N24=112 N25=112 N26=112 N27=112 N28=112 N29=112 N30=112 N31=112
8860000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=1536 N25=1536
d260000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=512 N26=512
11c40000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=512 N27=512
16620000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=512 N28=512
1b000000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=512 N29=512
1f9e0000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=256 N30=256
243c0000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=512 N31=512
28da0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=256 N0=256
2d780000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=512 N1=512
32160000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=256 N2=256
36b40000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=512 N3=512
3b520000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=256 N4=256
3ff00000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=512 N5=512
448e0000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=256 N6=256
492c0000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=512 N7=512
4dca0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=256 N8=256
52680000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=512 N9=512
57060000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=256 N10=256
5ba40000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=512 N11=512
60420000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=256 N12=256
64e00000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=512 N13=512
697e0000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=256 N14=256
6e1c0000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=512 N15=512
72ba0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=256 N16=256
77580000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=512 N17=512
7bf60000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=256 N18=256
80940000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=512 N19=512
85320000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=256 N20=256
89d00000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=512 N21=512
8e6e0000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=256 N22=256
930c0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=512 N23=512
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=512 N24=512
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge
9c480000000 prefer:2 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=904 N5=1

If we look at the output, a few segments are still allocated interleaved (these are the fixed SGA and the shared pool), and we see allocations for every NUMA node. These per NUMA node allocations are the buffer cache.

Okay. Now the database is set up for NUMA, and we checked what this looks like on the operating system level using Linux tools. However, we need to realise that NUMA is no magic that shifts your database into the next gear. Most things you do are not NUMA optimised by the database, simply because it wouldn’t make sense.

For example, when I perform a buffered full table scan without parallel query, it will spread the blocks out over the allocated segments. Why? Because my session runs on an arbitrary CPU thread and could run on another CPU thread the next time it is scheduled. What I am saying here is: only specific things are NUMA optimised.

A next logical question would be: but then what is actually NUMA optimised? One obvious thing that is NUMA optimised is the database writer (this is mentioned in the alert.log at startup). Another NUMA optimised thing is parallel query. If you look at the parallel query worker processes, the status file gives a hint:

# grep allowed_list /proc/$(pgrep p000_slob)/status
Cpus_allowed_list:	450-467,1026-1043
Mems_allowed_list:	0-31

This shows that the first parallel query worker process is limited in which CPU threads it is allowed to use. All parallel query worker processes are limited to ranges of CPU threads. Please mind it is NOT limited in what memory it can use, so it can use both local and remote memory. It may look weird that it can still access remote memory, seemingly ignoring NUMA optimised (=local) memory access.

First the CPU thread ranges. If you look at what the range means (it’s rather obvious), you see this:

# lscpu | grep 450
NUMA node25 CPU(s):    450-467,1026-1043

The CPU thread range is exactly a NUMA node. So whatever happens, parallel query worker process p000 can only run on the CPU threads of NUMA node 25!
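
To see this for all parallel query worker processes at once, you can loop over them while they exist; a sketch, assuming the instance name slob used above:

$ for p in $(pgrep -f "ora_p[0-9].*_slob"); do echo -n "$p: "; grep Cpus_allowed_list /proc/$p/status; done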

Then the absence of memory limits: this means the parallel query process can still access anything it needs, so anything that used to work with parallel query still runs as it did without or before the NUMA optimisations. In specific cases, parallel query can take advantage of the NUMA setting.

One such case is in-memory parallel execution. With in-memory parallel execution, the parallel query slaves read blocks from the buffer cache instead of performing a direct path read. In-memory parallel execution is enabled when the parameter parallel_degree_policy is set to auto; the database then heuristically determines whether it is feasible.
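
For reference, a minimal sketch of switching it on (parallel_degree_policy is a regular, documented parameter; whether a particular scan then actually uses in-memory parallel execution is decided by the heuristics mentioned above):

$ sqlplus / as sysdba <<EOF
alter system set parallel_degree_policy=auto scope=both;
EOF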

I set up a table of approximately 700GB, and performed an in-memory parallel query scan on a freshly started instance, so that the parallel query worker processes had touched as little memory as possible. At the end of the scan, this is how numa_maps of parallel query worker process p000 looks:

# grep huge /proc/$(pgrep p000_slob)/numa_maps
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=906 N0=1 N1=1
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=4608 N0=144 N1=144 N2=144 N3=144 N4=144 N5=144 N6=144 N7=144 N8=144 N9=144 N10=144 N11=144 N12=144 N13=144 N14=144 N15=144 N16=144 N17=144 N18=144 N19=144 N20=144 N21=144 N22=144 N23=144 N24=144 N25=144 N26=144 N27=144 N28=144 N29=144 N30=144 N31=144
8880000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=50176 N25=50176
d2a0000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N26=6145
11c80000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=4608 N27=4608
16660000000 prefer:28 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=70 N28=5121
1b040000000 prefer:29 file=/SYSV00000000\040(deleted) huge dirty=14336 N29=14336
1fa20000000 prefer:30 file=/SYSV00000000\040(deleted) huge dirty=5121 mapmax=72 N30=5121
24400000000 prefer:31 file=/SYSV00000000\040(deleted) huge dirty=4608 N31=4608
28de0000000 prefer:0 file=/SYSV00000000\040(deleted) huge dirty=6145 mapmax=70 N0=6145
2d7c0000000 prefer:1 file=/SYSV00000000\040(deleted) huge dirty=4099 N1=4099
321a0000000 prefer:2 file=/SYSV00000000\040(deleted) huge dirty=2528 mapmax=73 N2=2528
36b80000000 prefer:3 file=/SYSV00000000\040(deleted) huge dirty=2222 N3=2222
3b560000000 prefer:4 file=/SYSV00000000\040(deleted) huge dirty=2504 mapmax=69 N4=2504
3ff40000000 prefer:5 file=/SYSV00000000\040(deleted) huge dirty=2249 N5=2249
44920000000 prefer:6 file=/SYSV00000000\040(deleted) huge dirty=4754 mapmax=69 N6=4754
49300000000 prefer:7 file=/SYSV00000000\040(deleted) huge dirty=2296 N7=2296
4dce0000000 prefer:8 file=/SYSV00000000\040(deleted) huge dirty=2463 mapmax=70 N8=2463
526c0000000 prefer:9 file=/SYSV00000000\040(deleted) huge dirty=2246 N9=2246
570a0000000 prefer:10 file=/SYSV00000000\040(deleted) huge dirty=2440 mapmax=72 N10=2440
5ba80000000 prefer:11 file=/SYSV00000000\040(deleted) huge dirty=2233 N11=2233
60460000000 prefer:12 file=/SYSV00000000\040(deleted) huge dirty=2469 mapmax=69 N12=2469
64e40000000 prefer:13 file=/SYSV00000000\040(deleted) huge dirty=2241 N13=2241
69820000000 prefer:14 file=/SYSV00000000\040(deleted) huge dirty=2434 mapmax=69 N14=2434
6e200000000 prefer:15 file=/SYSV00000000\040(deleted) huge dirty=2342 N15=2342
72be0000000 prefer:16 file=/SYSV00000000\040(deleted) huge dirty=2461 mapmax=70 N16=2461
775c0000000 prefer:17 file=/SYSV00000000\040(deleted) huge dirty=2329 N17=2329
7bfa0000000 prefer:18 file=/SYSV00000000\040(deleted) huge dirty=2499 mapmax=71 N18=2499
80980000000 prefer:19 file=/SYSV00000000\040(deleted) huge dirty=2334 N19=2334
85360000000 prefer:20 file=/SYSV00000000\040(deleted) huge dirty=2466 mapmax=73 N20=2466
89d40000000 prefer:21 file=/SYSV00000000\040(deleted) huge dirty=2328 N21=2328
8e720000000 prefer:22 file=/SYSV00000000\040(deleted) huge dirty=2180 mapmax=69 N22=2180
930e0000000 prefer:23 file=/SYSV00000000\040(deleted) huge dirty=2208 mapmax=69 N23=2208
97aa0000000 prefer:24 file=/SYSV00000000\040(deleted) huge dirty=2167 mapmax=68 N24=2167
9c460000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=147 N26=2
9c480000000 prefer:0 file=/SYSV85a066e4\040(deleted) huge dirty=1 mapmax=906 N2=1

If you look at memory address 8880000000, you see that parallel query worker process p000 used 50176 pages, while on the other (buffer cache/‘prefer’) segments it touched somewhere between 2200 and 6150 pages. So it clearly used its local memory more than the remote memories. This is proof that in-memory parallel query is NUMA optimised.

In fact, when I time the query using in-memory parallel query with NUMA enabled, it takes a little less than 8 seconds consistently once the data is loaded in the buffercache. When I disable NUMA optimisations (_enable_NUMA_support=false), and execute the same query, also doing in-memory parallel query, the time goes up to a little less than 24 seconds, only scanning the buffercache.

Recently I have been presenting on what running on a large Intel-based NUMA system looks like (the OTN EMEA tour in Düsseldorf and Milan, and I will be presenting about this at the Dutch AMIS 25th anniversary event in June). The investigation for this presentation was done on an SGI UV 300 machine with 24 terabyte of memory, 32 sockets (=NUMA nodes), 480 cores and 960 threads.

Recently I have been given access to a new version of the UV 300, the UV 300 RL, for which the CPU has moved from Ivy Bridge to Haswell and now has 18 cores per socket instead of 15, which means the number of cores on a fully equipped system is 576, making 1152 threads.

Now to get back to the actual purpose of this blogpost: SGA memory allocation on startup. One of the things you can do on such a system is allocate an extremely large amount of memory. In my case I chose to allocate 10 terabyte for the Oracle SGA. When you go (excessively) outside of normal numbers, you run into things simply because there is a (very) low chance that it has actually been tested, and very few (or none) have actually done it before.

First let me show you the other technical details:
Operating system: Oracle Linux version 7.1
Kernel: kernel-uek-3.8.13-98.4.1.el7uek.x86_64
Oracle grid infrastructure version: 12.1.0.2.160419
Oracle database version: 12.1.0.2.160419
For the database instance the use_large_pages parameter is set to ONLY. It simply does not make sense not to use it in a normal database; in my opinion it really is mandatory if you work with large memory databases.

Firing up the instance…
At first I set sga_target to 10737418240000, a.k.a. 10 terabyte, and started up the instance. My sqlplus session got busy after pressing enter, and it remained busy. I went downstairs for coffee, got back to my office, and it was still busy. Long story short: after 30 minutes it was still busy and I wondered if something was wrong.

What can you do? I first checked the alert.log of the instance. It told me:

Tue May 24 07:52:35 2016
Starting ORACLE instance (normal) (OS id: 535803)
Tue May 24 07:52:35 2016
CLI notifier numLatches:131 maxDescs:3801

Not a lot of information, but also no indication anything was wrong.

Then I used the ‘top’ utility to see what was active on my system:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 535803 oracle    20   0  9.765t  22492  17396 R  99.4  0.0  11:10.56 oracleSLOB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

I see my local bequeath session running at (or near) 100% CPU. It’s working, and that is the correct state when you fire up an Oracle database instance. But what is it doing?

Next logical thing: strace:

[oracle@skynet5 frits]$ strace -fp 535803
Process 535803 attached

Yes, this is meant to show that strace does not show any output, which means the process is not making any system calls. This does NOT mean the process is doing nothing; we saw with ‘top’ that it’s busy, running close to 100% on a CPU (thread).

When strace does not show anything, the next layer to dive into is the C function layer, with perf. The first thing I normally try is ‘perf top’; ‘perf top -p’ zooms in on one process:

[oracle@skynet5 frits]$ perf top -p 535803
   PerfTop:    4003 irqs/sec  kernel:99.9%  exact:  0.0% [4000Hz cycles],  (target_pid: 535803)
-----------------------------------------------------------------------------------------------------------------------------------

    95.86%  [kernel]   [k] clear_page_c_e
     1.58%  [kernel]   [k] clear_huge_page
     0.41%  [kernel]   [k] __ticket_spin_lock
     0.32%  [kernel]   [k] clear_page
     0.29%  [kernel]   [k] _cond_resched
     0.28%  [kernel]   [k] mutex_lock
     0.11%  [kernel]   [k] apic_timer_interrupt
     0.11%  [kernel]   [k] hugetlb_fault

Ah! This shows what the process is actually doing! We can see it’s executing a function called ‘clear_page_c_e’ in kernel mode in 96% of the samples perf took of the running process. The name of the function describes what it does quite well: it is a kernel function to clear a page.

So this shows that the process starting up the instance is actually spending all of its time clearing memory, but it does not show how this relates to what the Oracle database is doing. In order to dig further into what Oracle is doing, we can use perf in another way: record backtraces, and report on them later. perf record samples full backtraces into a file called ‘perf.data’. When perf report is later used on that file, it shows the function that was sampled most, and (when perf record is called with -g, “call-graph”) all the backtraces leading to that function.

[oracle@skynet5 frits]$ perf record -g -p 535803
...after some time: ^C, then:
[oracle@skynet5 frits]$ perf report
# Overhead  Command       Shared Object                                      Symbol
# ........  .......  ..................  ..........................................
#
    97.14%   oracle  [kernel.kallsyms]   [k] clear_page_c_e
             |
             --- clear_page_c_e
                |
                |--100.00%-- hugetlb_fault
                |          handle_mm_fault
                |          __do_page_fault
                |          do_page_fault
                |          page_fault
                |          slaac_int
                |          slrac
                |          sskgm_segment_notify_action
                |          skgmcrone
                |          skgm_allocate_areas
                |          skgmcreate
                |          ksmcrealm
                |          ksmcsg
                |          opistr_real
                |          opistr
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                 --0.00%-- [...]

What this shows, is again the kernel function ‘clear_page_c_e’ (it also shows [k] to indicate it’s a kernel function), and when we follow the backtrace down, we see
* hugetlb_fault: this is a page fault handling function, which shows that I am using huge pages.
* handle_mm_fault, __do_page_fault, do_page_fault, page_fault: these are linux kernel functions indicating it’s handling a page fault.
* slaac_int, slrac, sskgm_segment_notify_action: these are Oracle database functions. Obviously, slaac_int is the function actually ‘touching’ memory, because it results in a page_fault.

So what is a page fault? A page fault is an interrupt that is raised when a program tries to access memory that has already been allocated into the process’ virtual address space, but not actually loaded into main memory. The interrupt causes the kernel to handle the page fault, and make the page available. Essentially Linux, like most other operating systems, makes pages available once they are actually used instead of when they are ‘just’ allocated. (this is a summary, there is a lot to tell about page faults alone, but that’s beyond the scope of this article)

It’s important to realise there is no system call visible that allocates memory (think of mmap(), malloc(), etc.). If you want to know more about these calls, please read up on them using ‘man SYSCALLNAME’ (on Linux, obviously). So, the above backtrace looks like code that deliberately touches the Oracle SGA memory in order to get it truly allocated!
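
A lightweight way to confirm this from another terminal is to count the page faults of the starting process for a while with perf stat; a sketch, using the pid of the bequeathing process shown above:

$ perf stat -e page-faults -p 535803 -- sleep 10

While the SGA is being touched, this counter should climb at a steady rate; once the paging is done, it should drop to (near) zero.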

Actually, in my case, with NUMA enabled and a large amount of memory to be allocated, the above process of getting 10 terabyte ‘touched’ or ‘page faulted’ can be watched in progress in the proc ‘numa_maps’ file of the ‘bequeathing’ process (the process that is starting up the instance). The Linux ‘watch’ utility is handy for this:

[oracle@skynet5 frits]$ watch -n 1 -d cat /proc/535803/numa_maps
00400000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:1 anon=28 dirty=28 N1=28
129b4000 prefer:1 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=360268 N0=11259 N1=11259 N2=11259 N3=11259 N4=11259 N5=11259 N6=11259 N7=11259 N8
=11259 N9=11259 N10=11259 N11=11259 N12=11258 N13=11258 N14=11258 N15=11258 N16=11258 N17=11258 N18=11258 N19=11258 N20=11258 N21=11258 N22=11258 N23
=11258 N24=11258 N25=11258 N26=11258 N27=11258 N28=11258 N29=11258 N30=11258 N31=11258
fca0000000 prefer:1 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:1 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:1 anon=534 dirty=534 N1=73 N2=461
...

With ‘watch’, ‘-n 1’ means a 1 second interval and ‘-d’ means highlight differences.
When the instance is starting (with NUMA enabled, and with a huge amount of memory allocated for the SGA), you will first see it being busy on the line starting with ‘80000000’ (which is a memory address). In my case it says ‘interleave:0-31’, which means the memory allocation is spread over all the indicated NUMA nodes. Nx=y means y pages are allocated on NUMA node x. In my case I see the numbers growing per NUMA node, jumping from one NUMA node to another for some time. This is the shared pool allocation, which is spread out over all NUMA nodes.

However, after some time, you’ll see this:

00400000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:0 anon=28 dirty=28 N1=28
129b4000 prefer:0 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=516352 N0=16136 N1=16136 N2=16136 N3=16136 N4=16136 N5=16136 N6=16136 N7=16136 N8
=16136 N9=16136 N10=16136 N11=16136 N12=16136 N13=16136 N14=16136 N15=16136 N16=16136 N17=16136 N18=16136 N19=16136 N20=16136 N21=16136 N22=16136 N23
=16136 N24=16136 N25=16136 N26=16136 N27=16136 N28=16136 N29=16136 N30=16136 N31=16136
fca0000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=144384 N25=144384
14320000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=143872 N26=143872
18960000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=12232 N27=12232
1cfa0000000 prefer:0 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:0 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:0 anon=534 dirty=534 N1=73 N2=461

Once the process is done allocating the shared pool, it starts allocating buffer cache memory. It does this per NUMA node. You can see it here at the lines fca0000000, 14320000000 and 18960000000. Actually, I copied these figures when the instance was still allocating memory at memory address 18960000000. You can see it allocates memory local to a specific NUMA node by the ‘prefer:nr’ annotation.

At the end, when all memory was allocated, the sqlplus session showed the sizes of the different memory pools and opened the database. This took approximately 90 minutes. At first, I thought this had to do with the parameter ‘PRE_PAGE_SGA’. However, this parameter was set at its default value, FALSE.

Some investigating and some testing revealed this behaviour (touching the entire SGA) was governed by the hidden parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION, which defaults to TRUE in 12.1.0.2 on Linux.

Next I tested setting it to FALSE, and my startup time was reduced to 3.5 minutes (!).

I used perf record again to see where the time is spent when _TOUCH_SGA_PAGES_DURING_ALLOCATION is set to FALSE:

# Overhead          Command         Shared Object                                       Symbol
# ........  ...............  ....................  ...........................................
#
    41.25%  oracle_563318_s  [kernel.kallsyms]     [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          |
               |          |--93.26%-- __intel_new_memset
               |          |          |
               |          |          |--79.48%-- ksmnsin
               |          |          |          |
               |          |          |          |--95.98%-- kcbnfy
               |          |          |          |          kscnfy
               |          |          |          |          ksmcsg
               |          |          |          |          opistr_real
               |          |          |          |          opistr
               |          |          |          |          opiodr
               |          |          |          |          ttcpip
               |          |          |          |          opitsk
               |          |          |          |          opiino
               |          |          |          |          opiodr
               |          |          |          |          opidrv
               |          |          |          |          sou2o
               |          |          |          |          opimai_real
               |          |          |          |          ssthrdmain
               |          |          |          |          main
               |          |          |          |          __libc_start_main
               |          |          |          |
               |          |          |          |--3.32%-- kcbw_setup

This is a new backtrace, which looks like the former backtrace at the top (between clear_page_c_e and page_fault), simply because it’s doing exactly the same thing: handling a page fault. However, the Oracle functions are different. What is happening here is that Oracle calls memset() (a C library function, not a system call), which is used to wipe a memory area. If we go further down the stack trace, we see (full speculation!):
* ksmnsin: kernel system memory numa segment initialisation?
* kcbnfy/kscnfy: kernel cache buffers numa FY?/kernel system cache numa FY?
* ksmcsg: kernel system memory create sga?

When looking at numa_maps during startup of the bequeathing process with _TOUCH_SGA_PAGES_DURING_ALLOCATION set to FALSE, far fewer memory pages are touched. numa_maps only shows truly allocated (paged) memory, not the logical allocations. The allocation pattern is the same: the shared pool allocations are done per NUMA node (passing over the NUMA nodes several times), followed by the buffer cache allocations per NUMA node, again passing over the different NUMA nodes several times. Because far fewer pages are touched, the startup time of the instance is greatly reduced.

So, is this a solution to the long startup time?
On one hand yes, on the other hand no. What that means is: you need to page the memory, no matter what you do. If you choose not to page the memory during startup (_TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE), you need to do it later. Sounds logical, right: pay now, or pay later? But what does that mean?

After startup of the instance with _TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE, a lot of the memory is not really allocated. If you start to use the new, pristine SGA with a normal session, you pay the price of paging which the bequeathing session otherwise took. Let me show you a ‘perf record -g’ example of a full table scan (which is reading the Oracle blocks into the buffercache, not doing a direct path read):

# Overhead          Command       Shared Object                                Symbol
# ........  ...............  ..................  ....................................
#
    61.25%  oracle_564482_s  [kernel.kallsyms]   [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |          |
               |          |--99.86%-- handle_mm_fault
               |          |          __do_page_fault
               |          |          do_page_fault
               |          |          page_fault
               |          |          __intel_ssse3_rep_memcpy
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

If we follow the call stack from bottom to top:
* qergsFetch, qertbFetch: query execute row source code.
* kdsttgr: kernel data scan table get row.
* kdstf00000010000100kmP: kernel data scan table full, the ultra fast full table scan.
* kdst: other functions in kernel data scan table functions.
* ktrget2: kernel transaction layer.
* kcbgtcr: kernel cache buffers get consistent row.
* kcbz: kernel cache buffers Z, physical IO helper functions.
* __intel_ssse3_rep_memcpy: this is actually the function memcpy, but replaced by an optimised version for intel CPUs.
* page_fault, do_page_fault, __do_page_fault, handle_mm_fault, hugetlb_fault, clear_page_c_e: these are the same page fault kernel functions we saw earlier.

This clearly shows my full table scan now needs to do the paging!

If I set _TOUCH_SGA_PAGES_DURING_ALLOCATION to TRUE, start up the instance (during which the pages are touched and thus paged), and profile a full table scan, I see:

# Overhead          Command         Shared Object                                    Symbol
# ........  ...............  ....................  ........................................
#
    24.88%  oracle_577611_s  oracle                [.] ksl_get_shared_latch
            |
            --- ksl_get_shared_latch
               |
               |--78.46%-- kcbzfb
               |          kcbzgb
               |          |
               |          |--99.63%-- kcbzgm
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

This shows that the most prominent function called is ‘ksl_get_shared_latch’. No paging to be seen.

At this point the reason for having _TOUCH_SGA_PAGES_DURING_ALLOCATION should be clear. The question I had on this point is: but how about PRE_PAGE_SGA? In essence, this parameter is supposed to more or less solve the same issue, having the SGA pages being touched at startup to prevent paging for foreground sessions.

By the way, if you read about PRE_PAGE_SGA in the online documentation, it gives a reason for using PRE_PAGE_SGA which is not true (that page table entries are prebuilt for the SGA pages), and it indicates the paging (=page faults) is done at startup, which also is not true. It also claims ‘every process that starts must access every page in the SGA’; again this is not true.

From what I can see, what happens when PRE_PAGE_SGA is set to TRUE is that a background process is started that touches all SGA pages AFTER the instance has started and is open for usage. The background process I witnessed is ‘sa00’. When recording the backtraces of that process, I see:

# Overhead        Command      Shared Object                                      Symbol
# ........  .............  .................  ..........................................
#
    97.57%  ora_sa00_slob  [kernel.kallsyms]  [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--100.00%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          ksmprepage_memory
               |          ksm_prepage_sga_seg
               |          skgmapply
               |          ksmprepage
               |          ksm_sslv_exec_cbk
               |          ksvrdp
               |          opirip
               |          opidrv
               |          sou2o
               |          opimai_real
               |          ssthrdmain
               |          main
               |          __libc_start_main
                --0.00%-- [...]

The kernel paging functions are exactly the same as we have seen several times now. It’s clear the functions executed by this process are specifically there for the prepage functionality. The pre-paging done on behalf of _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE is done as part of the SGA creation and allocation (as can be seen from the Oracle function names). PRE_PAGE_SGA seems to be a ‘workaround’ if you don’t want to spend the long time paging at startup, but still want to page the memory as soon as possible after startup. Needless to say, this is not the same as _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE: PRE_PAGE_SGA paging is done serially by a single process after startup, when the database is open for usage. So normal foreground processes that encounter non-paged memory, which means they use it before the sa00 process pages it, still need to do the paging themselves.

Conclusion
If you want to allocate a large SGA with Oracle 12.1.0.2 (this may apply to earlier versions too), the startup time can be significant. The reason is that the bequeathing session pages the memory at startup. This can be turned off by setting the undocumented parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION to FALSE. As a result, foreground (normal user) sessions need to do the paging. You can set the PRE_PAGE_SGA parameter to TRUE to do the paging; however, the paging is then done by a single process (sa00) that serially pages the memory after startup. Foreground processes that encounter non-paged memory, which means they use it before the sa00 process could page it, still need to page it themselves.

Thanks to: Klaas-Jan Jongsma for proofreading.

The intention of this blogpost is to show the Oracle wait time granularity and the Oracle database time measurement granularity. One of the reasons for doing this is that the Oracle database switched from using the function gettimeofday() (up to version 11.2) to clock_gettime() to measure time.

This switch is understandable, as gettimeofday() is the kernel’s best guess of the wall clock time, while clock_gettime(CLOCK_MONOTONIC,…) is a monotonically increasing timer, which means it is more precise and cannot drift backward, which gettimeofday() can do in certain circumstances, like time adjustments via NTP.

The first thing I wanted to prove is the switch from the gettimeofday() call to the clock_gettime() call. This turned out not to be as simple as I thought.

Because a lot of applications and executables need timing information, which is traditionally obtained via the gettimeofday() call, the Linux kernel maintainers understood the importance of making this call as performant as possible. Calling a (hardware) clock means you request information from a resource on your computer system, which normally requires a program to switch to kernel mode first. In order to lower the resources and time needed to get time information, the Linux kernel includes a “trick” to get that information, which is known as a virtual system call or vsyscall. Essentially this means this information is provided in userspace, so fewer resources are needed and there is no need to switch to kernel mode. James Morle has an excellent article describing this; this line is a link to that. By staying in userspace, the gettimeofday() and clock_gettime() calls are “userland” calls, and do not show up when using “strace” to see the system calls of an executing process.
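
The piece of kernel code that provides these userland time calls is the vDSO, which is mapped into every process; a quick sketch to see it:

$ grep -E 'vdso|vsyscall' /proc/self/maps

The [vdso] mapping is what gettimeofday() and clock_gettime() jump into, which is why strace does not show them as system calls.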

However, I said it wasn’t as easy as I thought. I was looking into this, and thought I could make the vsyscalls visible by echoing “0” into /proc/sys/kernel/vsyscall64. However, I am working with kernel version 3.8.13 for this part of the research, which does not have /proc/sys/kernel/vsyscall64, which means I can’t turn off the vsyscall optimisation and make both gettimeofday() and clock_gettime() visible as system calls.

Searching for kernel.vsyscall64 on the internet, I found out that vsyscall64 was removed in early versions of Linux kernel 3. This means I can’t use a simple switch to flip to make the calls visible. So, instead of going straight to the thing I wanted to research, I got stuck doing the necessary preparation and preliminary investigation for it.

Can I do it in another way? Yes, this can be done using gdb, the GNU debugger. Start up a foreground (Oracle database) session, fetch the process ID of that session, and attach to it with gdb:

gdb -p PID
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...etc...
(gdb)

Now simply break on gettimeofday and clock_gettime, and make gdb notify you it encountered the call. First 11.2.0.4:

(gdb) break gettimeofday
Breakpoint 1 at 0x332fc9c7c0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>printf "gettimeofday\n"
>continue
>end
(gdb) break clock_gettime
Breakpoint 2 at 0x3330803e10
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "clock_gettime\n"
>continue
>end
(gdb) 

You can save this gdb macro to time.gdb by executing “save breakpoints time.gdb”. Now execute “c” (continue) and press enter to let the process you attached to run again. Then execute something very simple, like:

SQL> select * from dual;

This is the result in Oracle 11.2.0.4:

(gdb) c
Continuing.
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
...etc...

That is expected; we already knew the Oracle database executes the gettimeofday function a lot. Now let's do exactly the same, but with Oracle version 12.1.0.2. If you saved the breakpoints and macros, you can attach to an Oracle 12.1.0.2 foreground process with gdb and execute ‘source time.gdb’ to set the breakpoints and macros. When the ‘select * from dual’ is executed in this version of the database, it looks like this:

(gdb) c
Continuing.
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
...etc...

It is clearly (mostly) executing the clock_gettime() function.

The clock_gettime() function can use a variety of time sources. If you read the manpage of clock_gettime you will see that the first argument is the clock source. You can see the available clock sources in the Linux kernel source file include/uapi/linux/time.h, which shows:

/*
 * The IDs of the various system clocks (for POSIX.1b interval timers):
 */
 #define CLOCK_REALTIME                  0
 #define CLOCK_MONOTONIC                 1
 #define CLOCK_PROCESS_CPUTIME_ID        2
 #define CLOCK_THREAD_CPUTIME_ID         3
 #define CLOCK_MONOTONIC_RAW             4
 ...

The first argument of clock_gettime is the clock type. So if I remove the macro from the clock_gettime breakpoint, execution stops when clock_gettime is called:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
        silent
        printf "clock_gettime\n"
        c
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
(gdb) c
Continuing.

Now execute something in the sqlplus session. What will happen in the gdb session is:

Breakpoint 2, 0x0000003330803e10 in clock_gettime () from /lib64/librt.so.1
(gdb)

Now look up the first argument of the call:

(gdb) print $rdi
$1 = 1

So Oracle is using CLOCK_MONOTONIC (value 1). Not the point of this article, but this means Oracle database time measurement is granular at the microsecond level.
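As a small cross-check of the clock id and its granularity, here is a C sketch of my own (not part of the original investigation) that prints the numeric value of CLOCK_MONOTONIC, which should match the $rdi value of 1 seen above, and the resolution the kernel reports for that clock via clock_getres():

/* clock_res.c -- print the clock id and the reported clock resolution; illustration only.
 * Compile with: gcc -o clock_res clock_res.c -lrt
 */
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec res;

	/* CLOCK_MONOTONIC is defined as 1, matching the first argument ($rdi) seen in gdb */
	printf("CLOCK_MONOTONIC id : %d\n", (int)CLOCK_MONOTONIC);

	clock_getres(CLOCK_MONOTONIC, &res);
	printf("reported resolution: %ld s %ld ns\n", (long)res.tv_sec, res.tv_nsec);
	return 0;
}

On Linux systems with high resolution timers enabled, this typically reports a resolution of 1 nanosecond, so the clock source itself is not what limits the microsecond granularity exposed by the database.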

Now let's look at how much time the Oracle wait interface takes itself. The Oracle wait interface uses the functions kslwtbctx() (kernel service layer wait begin context) and kslwtectx() (kernel service layer wait end context) to start and stop measuring a wait event. Please mind that instead of looking at the time the wait interface reports, this means looking at the time spent executing in the kslwtbctx() and kslwtectx() functions themselves. This can be done using systemtap:

global kslwtbctx_time, kslwtectx_time, kslwtbctx_count=0, kslwtbctx_tot=0, kslwtectx_count=0, kslwtectx_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	if ( pid() == target() ) {
		kslwtbctx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx").return {
	if ( pid() == target() ) {
		printf("kslwtbctx: %12d\n", local_clock_ns()-kslwtbctx_time)
		kslwtbctx_tot+=local_clock_ns()-kslwtbctx_time
		kslwtbctx_count++
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	if ( pid() == target() ) {
		kslwtectx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx").return {
	if ( pid() == target() ) {
		printf("kslwtectx: %12d\n", local_clock_ns()-kslwtectx_time)
		kslwtectx_tot+=local_clock_ns()-kslwtectx_time
		kslwtectx_count++
	}
}
probe end {
	printf("\nkslwtbctx: avg %12d\nkslwtectx: avg %12d\n",kslwtbctx_tot/kslwtbctx_count,kslwtectx_tot/kslwtectx_count)
}

This systemtap script has a ‘begin probe’, which executes once the systemtap script starts running. I simply print ‘Begin.’ with a newline. The idea is that it prompts me once the systemtap script is actually running.

Then there is a (userspace) process-based probe for the oracle executable. There are two probes for each of the kslwtbctx and kslwtectx functions. The first one (.function(“kslwtbctx”)) fires when the function is entered, the second one (.function(“kslwtbctx”).return) fires when the function returns.

The ‘if ( pid() == target() )’ condition filters the invocations and returns of the probed functions for the PID set with the “-x PID” parameter. Otherwise any invocation of the probed functions by any process would be picked up.

The entering probe simply records the time in nanoseconds in a variable. The returning probe subtracts the previously recorded time from the current time, which gives the time between entering and returning. The returning probe also adds that time to a running total, and counts the number of times it has fired.

The end probe shows the total time spent in each of the two functions, divided by the number of times the return probe fired. This way the average time spent in the two functions is calculated. As you will see, the time spent in the functions varies.

When this is executed against an Oracle foreground session, this is how it looks:

# stap -x 2914 wait_interface.stap
Begin.
kslwtectx:         9806
kslwtbctx:         3182
kslwtectx:         1605
kslwtbctx:         1311
kslwtectx:         4200
kslwtbctx:         1126
kslwtectx:         1014
kslwtbctx:          840
kslwtectx:         4402
kslwtbctx:         2636
kslwtectx:         2023
kslwtbctx:         1586
^C
kslwtbctx: avg         2165
kslwtectx: avg         4305

The time measured is in nanoseconds. The average wait interface overhead is roughly 6 microseconds including systemtap overhead on my system.

The obvious thought you might have is: “why is this important?”. Well, this is actually important, because this 6us dictates what the wait interface should and should not measure. What I mean to say is that anything called inside the Oracle database for which the time spent is in the same ballpark as the wait interface overhead or lower should not be measured by the wait interface, because the measurement would influence overall performance in a negative way. For example, instrumenting an operation that itself takes around 1 microsecond would add roughly 6 microseconds of measurement time, making the instrumentation several times more expensive than the operation it measures.

A good example of this is latch gets. The Oracle database does not instrument individual latch gets via the wait interface, but rather exposes waiting for a latch via the wait interface only when a process has spun for it (a latch is a spinlock) and decides to go to sleep (on a semaphore), waiting to be woken once the latch becomes available.
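To illustrate the spin-then-sleep idea, here is a generic C sketch of my own; it is not Oracle's actual kslgetl() implementation, and the structure, names and spin count are made up for illustration:

/* latch_sketch.c -- generic spin-then-sleep lock, NOT Oracle's kslgetl() implementation.
 * Compile as part of a program with: gcc -std=c11 -pthread ...
 */
#include <stdatomic.h>
#include <semaphore.h>

#define SPIN_LIMIT 20000                      /* arbitrary spin count, for illustration only */

struct latch {
	atomic_flag locked;                   /* the latch word itself */
	sem_t       sem;                      /* sleep here when spinning did not succeed */
};

void latch_init(struct latch *l)
{
	atomic_flag_clear(&l->locked);        /* latch starts free */
	sem_init(&l->sem, 0, 0);              /* no pending wakeups */
}

void latch_get(struct latch *l)
{
	/* Phase 1: spin. A single attempt costs well under a microsecond, so timing every
	 * attempt with a ~6us wait interface would cost more than the latch get itself. */
	for (int i = 0; i < SPIN_LIMIT; i++)
		if (!atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
			return;               /* got the latch while spinning: no wait event */

	/* Phase 2: give up spinning and sleep. This is the point where a real implementation
	 * would register a wait event, because a sleep costs far more than the instrumentation. */
	while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
		sem_wait(&l->sem);            /* woken when the holder posts the semaphore */
}

void latch_release(struct latch *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
	sem_post(&l->sem);                    /* wake up a sleeper, if there is one */
}

The essential point of the sketch is that only phase 2, the sleep, is expensive enough to justify being timed and exposed as a wait event.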

However, using systemtap we can actually measure latch gets! Here is a simple script to measure the latch gets for non-shared latches via the kslgetl() function:

global kslgetl_time, kslgetl_count=0, kslgetl_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl") {
	if ( pid() == target() )
		kslgetl_time=local_clock_ns()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl").return {
	if ( pid() == target() ) {
		printf("kslgetl: %12d\n", local_clock_ns()-kslgetl_time)
		kslgetl_tot+=local_clock_ns()-kslgetl_time
		kslgetl_count++
	}
}
probe end {
	printf("\nkslgetl: avg %12d\n",kslgetl_tot/kslgetl_count)
}

This systemtap script follows the exact same structure and logic as the previous systemtap script.

This is how it looks on my system when executed against a database foreground session executing ‘select * from dual’:

# stap -x 2914 kslgetl.stap
Begin.
kslgetl:         3363
kslgetl:          786
kslgetl:          744
kslgetl:          782
kslgetl:          721
kslgetl:          707
kslgetl:         1037
kslgetl:          728
kslgetl:          711
kslgetl:          736
kslgetl:          719
kslgetl:          714
kslgetl:         1671
kslgetl:          929
kslgetl:          968
kslgetl:          919
kslgetl:          883
kslgetl:          869
kslgetl:         3030
kslgetl:          750
^C
kslgetl: avg         1362

As you can see, the measured average time spent inside the kslgetl function is 1.3us on my system, which includes systemtap overhead. Clearly, taking a latch takes less time than the wait interface itself does, which means not instrumenting the kslgetl() function in the wait interface is a sensible thing.

This means that with the current state of the wait interface, it does not make sense to measure a lot more very fine-grained events, even though the timer can measure with microsecond granularity. Please mind I am not saying that it does not make sense to detail the response time further. I think that with modern computer systems with a lot of memory, the Oracle database can run more and more without needing to wait for things like disk IOs. This means modern database servers can spend a lot of time just running on CPU, making it hard to understand in which routines the time is spent. Tuning on-CPU execution requires insight into where the time is spent. Currently, the only option to understand how CPU time in Oracle is composed is to use external (operating system based) tools like perf and flame graphs. It would be great if an option existed in the database to detail time spent on CPU.
