
In my previous post, I introduced Intel Pin. If you are new to pin, please follow this link to my previous post on how to set it up and how to run it.

One of the things you can do with Pin is profile memory access. Profiling memory access using the 'pinatrace' pintool is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so
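
The pid given to -pid must be the operating system process id (SPID) of the target process. A hedged sketch of finding it for your own session from sqlplus (this is the standard V$SESSION to V$PROCESS join, nothing pin specific):

SQL> select spid from v$process
     where addr = ( select paddr from v$session where sid = sys_context('userenv','sid') );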

In this case the pid is that of an Oracle database foreground process. Now execute something in the session you attached pin to, and you will find the 'pinatrace' output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind memory access generates A LOT of information! The above 11MB is what a ‘select * from dual’ generates (!)

This is what the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the address of the instruction that performs the access, the second field is R or W (read or write, obviously), the third field is the memory location that is read or written, the fourth field is the size of the access in bytes, and the fifth field is the value that is read or written.

The function an instruction address belongs to can be looked up using the addr2line Linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of that is Mahmoud Hatem's blog on tracing access to a memory location. Wouldn't it be nice if we could translate the function addresses to function names, and the memory addresses to named memory areas whenever possible?

That is why I created the pinatrace annotate oracle tool. This is a small set of scripts that generates memory information from the instance, after which the instruction pointer addresses and memory locations in a pinatrace.out file generated by pinatrace are translated to function names and memory area names. Let's have a look at what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

The usefulness of this is limited in this form. The only thing I could derive is that big numbers in the memory access column (like '0x00007fff6fd2b2b8') are probably PGA related, and the numbers between roughly 0x0000000060000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So now you can see that the reason I picked a seemingly arbitrary range of lines is that this range contains the memory accesses of the ksl_get_shared_latch function. This annotated version shows a shared latch get for the 'session idle bit' latch. It is also visible that the function uses PGA memory (some of it annotated, some of it not), and that most of the shared pool access is for the latch itself (a latch essentially is a memory range used to serialise access to a resource), which lives in the shared pool because it is a child latch. It is also visible that memory belonging to X$KSUPR is read and written (X$KSUPR is the X$ table underlying V$PROCESS; the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).

Why are a lot of the presumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? PGA memory allocations are very transient in nature. Because the PGA memory snapshot is made at one point in time, it represents the memory layout of that moment only, and there is a high probability that allocations seen in the trace have already been deallocated and returned to the operating system. A lot of SGA/shared pool allocations, on the other hand, are meant to be reused and are not freed immediately after use, which gives the SGA memory snapshot a good chance of capturing most of the accesses in the trace.

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell, it might not work in other shells like ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the Oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script writes to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt), or if you want to watch it and save it at the same time, pipe it through tee (| tee file.txt). A condensed sketch of the whole workflow is shown below.
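
A hedged sketch of that workflow as a terminal session (the directory and the tee redirection are just examples, the script names are the ones listed above):

$ cp $ORACLE_HOME/dbs/pinatrace.out ~/pinatrace_annotate_oracle/
$ cd ~/pinatrace_annotate_oracle
$ sqlplus / as sysdba @0_get_pga_detail.sql
$ sqlplus / as sysdba @1_generate_memory_ranges.sql
$ sqlplus / as sysdba @2_generate_memory_ranges_xtables.sql
$ sqlplus / as sysdba @3_generate_memory_ranges_pga.sql
$ ./pinatrace_annotate_oracle.sh pinatrace.out | tee pinatrace_annotated.txt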

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manual is to think of Pin as a JIT (just in time) compiler, where the compiler does not take bytecode (as JIT compilation does with Java), but the executable of the process Pin is run against. This means Pin inserts itself into the process' execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see this process is an oracle foreground (rows 1-3 of the output), after which we see a lot of stuff that pin pushed into the process' address space.

For this reason, it is vitally important to run pin as the same user as the process you want to attach it to. The way pin works is that, upon execution, the pin executable inserts itself into the process' address space, gains control and then tries to load the necessary libraries. If it can't find these libraries, it will send a SIGKILL to the process, effectively stopping it!
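
A quick sanity check before attaching (a trivial sketch, with 12284 as an example pid):

$ ps -o user= -p 12284   # the user owning the target process
$ id -un                 # the user running pin; the two should match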

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This will output a lot of text on your screen and compile everything in the tools directory.
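
If you only need a single pintool, the kit's makefiles also let you build just that one; a hedged sketch, assuming the standard pin 3.x make targets:

$ cd source/tools/SimpleExamples
$ make obj-intel64/pinatrace.so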

Now that we have seen an introduction, let's use a Pin tool to do something useful! One such useful tool is 'DebugTrace'. First obtain the process id (pid) of an Oracle server foreground process. Then execute the pin tool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so

Upon execution, nothing is returned, because most of the work happens in the process pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let's look at what this produces! This is what my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it looks rather big (remember I only ran 'select * from dual')!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a 'return underflow', because these functions were called before the trace was put on the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called, and the first two arguments of the called function.
Returning: the address and function where the return is executed, and the return value.
Tailcall: a function that is called as the last instruction of the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).

Please mind that I am currently not aware of a way to remove the pin framework and tool from a process' address space once attached. Also mind that, because of how this works (a lot of interception of the execution flow), it will slow down the process significantly. Use this tool and the techniques described above at your own risk.

Recently I was investigating the inner workings of Oracle. One of the things that is fundamental to the Oracle database is the SCN (system change number). SCNs are used to synchronise changes in the database. There is one source for SCNs in every instance (kcbgscn; the global or current SCN in the fixed SGA), and there are multiple tasks for which Oracle keeps track of synchronisation using SCNs. A few of the tasks for which Oracle stores and uses SCNs to keep track of progress are the on disk SCN and the LWN SCN.

This blogpost is about some oddities I found when using gdb (the GNU debugger) to watch memory locations of a running Oracle database. This should not be done on a production instance, and is purely for research purposes. Only use the methods mentioned in this article if you are absolutely sure what you are doing, and/or if you are using an Oracle instance that can be crashed and can be restored.

The situation
I was researching the 'on disk scn' in an 11.2.0.2 database. I used the following (simplified) gdb macro:

break kcrf_commit_force
  commands
    print $rip
    c
  end
awatch *0x60022659
  commands
    c
  end

This macro breaks on the function kcrf_commit_force, essentially to understand the different phases of a commit, and uses the 'awatch' command (access watchpoint, which means the watchpoint fires on both reads and writes) to look at memory location 0x60022659. That memory location is the on disk SCN for this instance.
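
For completeness, this is how such a macro can be fed to gdb against a running foreground process (a hedged sketch: I am assuming the macro is saved in a file called scn.gdb and that 12284 is the pid of the foreground process):

$ gdb -p 12284 -x scn.gdb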

Upon executing a commit after an insert, I saw the following output:

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$1 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Old value = 898
New value = 899
0x0000000008fb5ad1 in kcscur3 ()

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$2 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()

What this shows is the following (the line numbers refer to the gdb output above):
– lines 1 and 9: the kcrf_commit_force function is called.
– lines 11, 15 and 19: the memory location *0x60022659 is read, resulting in the value 899.
– line 3: the memory location *0x60022659 is written: it contained 898, and the new value is 899.

This greatly puzzled me. As far as I understand the way Oracle works, a server process that issues 'commit' is supposed to READ the on disk SCN, not write it, because the only process that changes the on disk SCN is the logwriter. However, gdb quite clearly tells me the server process is writing it. A further consideration: IF the server process is writing it, where would that value come from?

After some further investigation, I came to the conclusion that the only way to prove to myself whether what I was seeing was correct, was to create a program that mimics the behaviour witnessed above, because I did not believe what I was seeing.

I created two little C programs. The first one creates a shared memory segment, writes a letter into it, then waits for a keystroke and puts that keystroke into the same location. The second one reads the shared memory segment, waits for enter to be pressed, reads the segment again, and prints the letter if it has changed. I called the first one 'server' and the second one 'client'. This way I can attach gdb to the client executable which, very much like an Oracle server process, reads a shared memory location that I can change at will.

The code can be found on GitHub: shared memory test source. Please mind this is simple proof of concept code that is not exhaustive in error checking and other sanity checks in any way. The source files can be compiled and linked in the following way:

$ cc server.c -o server
$ cc client.c -o client

The server executable can be run in the following way:

$ ./server
Type a letter. q quits.

This creates the shared memory segment and puts the 'a' character in the variable c (line 42, c = 'a'). As long as the c variable is not 'q', it writes the c variable into the shared memory segment using the pointer s (*s = c), and then waits for user input using the scanf function.

Once the shared memory segment is created and written into, the client program can be started:

$ ./client
client started. ^C to quit.
waiting for enter

The first time enter is pressed, it will output 'keystroke: a', because the p variable was not initialised and is then assigned the 'a' character (p = *s). The second time enter is pressed, 'keystroke:' is not printed, because the p variable and the memory location the pointer s points to both contain 'a'. What is vital is that the client program NEVER writes to the shared memory location, it only reads it, on three occasions. In line 50:

if( p != *s) {

In line 51:

printf("keystroke: %c\n", *s);

And in line 52:

p = *s;

Okay. So at this point we have a program that mimics what I think an Oracle server process does at commit time (reading shared memory), which is the program 'client', and we have a program that creates and can manipulate the shared memory location the client reads from, which is 'server'. Start up the server program (./server), and start the client using gdb:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb)

This is different from attaching to a process using gdb -p PID, but attaching with '-p' is the only way that I am aware of to use gdb on Oracle database processes. Now that we have started gdb with the client program, we must first run it in order to see the program's runtime environment:

(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter

Now interrupt the program (using control-C) to get back to the gdb prompt, and look at the memory mappings to see where the shared memory segment is:

(gdb) info proc mappings
process 3069
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]

Line 19 contains the shared memory segment, at memory address 0x7ffff7ff8000. To do the same as watching the on disk SCN location in Oracle, put an access watchpoint at 0x7ffff7ff8000:

(gdb) awatch *0x7ffff7ff8000
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now press enter, and we see the shared memory location being accessed three times:

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

If you look at the source code, you'll see that these are the lines of code at rows 50-52:

       if( p != *s ) {
          printf("keystroke: %c\n", *s);
          p = *s;

Every one of these lines reads the shared memory location via the pointer s. It is also clear these are reads, because gdb displays only a single value (97, the ASCII value of 'a') instead of an old and a new value.

Now go over to the server side, and press ‘b’ and enter. Now the shared memory segment contains ‘b’. Next, go back to the client program, and press enter:

waiting for enter

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

Hey! The first watchpoint now shows what looks like a write, because the memory location is touched and gdb reports the value at the location as changed. However, because we know the source code, we know the client program only reads the shared memory location and never writes to it! So what this shows is that gdb knew what the previous value of the memory location was, and detected that it had changed.

To see what a true write looks like, you can start the server executable with gdb too, and set an 'awatch' watchpoint on the shared memory location in exactly the same way as shown with the client program above. This is what that looks like:

(gdb) c
Continuing.
b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x00000000004006e9 in main ()

So the conclusion here is that with an 'awatch' watchpoint you cannot distinguish a genuine write to a memory location from a read that observes a value changed elsewhere.

Now that we understand how 'awatch' works, would there be a way to make the distinction between reading and writing? Besides 'awatch' (access watchpoint), there is the 'watch' command, which is a watchpoint on writes to a memory location, and 'rwatch', which is a watchpoint on reads of a memory location. This is what the output of 'rwatch' looks like in the client program:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7b23670 in __read_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info proc map
process 3517
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]
(gdb) rwatch *0x7ffff7ff8000
Hardware read watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

At line 1 gdb is started with the 'client' program. Once inside gdb, the executable is run (line 12). Once it's running, we interrupt it in order to get back to gdb (line 16). In gdb, the info proc map command is used (line 20) to look at the memory mappings. From the memory mappings, we see address 0x7ffff7ff8000 is the shared memory segment. A read watchpoint is set and a c (continue) command is added to it (lines 46 and 48). Then execution is continued with the c command (line 53).

Now we are running in the executable again. After continuing, I pressed enter to go through the loop in the client executable, and the watchpoint is triggered three times (lines 56, 60 and 68), as we saw earlier, for source code lines 50, 51 and 52.

Now I changed the letter in the shared memory segment from 'a' to 'b' using the server executable, and pressed enter in the client again. The interesting thing that happens is that the read watchpoint is triggered only TWO times, where it should have been three times (!!).

There is a way to make the now invisible read visible again, using the following command in gdb:

(gdb) watch *0x7ffff7ff8000
Hardware watchpoint 2: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Yes, that's a WRITE watchpoint. After adding the write watchpoint, this is what the output looks like when I change the letter from 'b' to 'c':

Hardware watchpoint 2: *0x7ffff7ff8000

Old value = 98
New value = 99
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040071e in main ()
keystroke: c
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040073c in main ()
waiting for enter

There we have the three reads visible again; however, one is 'disguised' as a write. To conclude: it seems gdb is not able to make a distinction between reads and writes for watchpoints if the watched memory location is changed outside of the executable active in the debugger. Gdb seems to report a memory access as a write whenever the value found is different from the value it last read.
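
To summarise the three gdb watchpoint flavours used in this post, here they are as a small gdb command file (the address is just the example location from above):

# read watchpoint: fires when the location is read
rwatch *0x7ffff7ff8000
# write watchpoint: fires when the value at the location is changed
watch *0x7ffff7ff8000
# access watchpoint: fires on both reads and writes
awatch *0x7ffff7ff8000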

Bonus information, not applicable to Oracle:
When you have the source code of an application, you can make debugging even better by adding in debug information for gdb in the executable. This is done by adding the ‘-g’ flag to the cc command:

$ cc -g client.c -o client

With the debug information available, it’s easier to debug:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...done.
(gdb) tbreak 1
Temporary breakpoint 1 at 0x40067c: file client.c, line 1.
(gdb) run
Starting program: /home/oracle/src/shm_test/client

Temporary breakpoint 1, main () at client.c:22
22	    key = 666;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info local
shmid = 0
key = 0
shm = 0x400590 "1\355I\211\321^H\211\342H\203\344\360PTI\307\300P\a@"
s = 0x7fffffffe550 "\001"
p = 0 '\000'
(gdb) awatch s
Hardware access (read/write) watchpoint 2: s
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Old value = 0x7fffffffe550 "\001"
New value = 0x7ffff7ff8000 "a"
main () at client.c:45
45	    printf("client started. ^C to quit.\n");
(gdb) c
Continuing.
client started. ^C to quit.
waiting for enter

Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040070f in main () at client.c:50
50	       if( p != *s ) {
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040071b in main () at client.c:51
51	          printf("keystroke: %c\n", *s);
(gdb) c
Continuing.
keystroke: a
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x0000000000400739 in main () at client.c:52
52	          p = *s;
(gdb) c
Continuing.
waiting for enter

Above you see starting up the executable with gdb at line 1. After gdb is started, I first set a temporary breakpoint on line 1 (of the source code; shown on line 12 above), to have gdb 'enter' the executable and stop when I run it. If I didn't do that, the program would 'wait' in the getchar() function, which is a function in libc, and the libc library would be my scope; line 19 tells me that I have not loaded debug symbols for glibc. After setting the breakpoint, I run the executable, which immediately breaks on the temporary breakpoint (line 14).

Now that the executable is active, I can list the variables used using the info local command (line 20). Also, I can simply set a watchpoint on the variable s (line 26), instead of having to search the memory address myself. When the watchpoint is triggered, it now shows the line number in the source file, and the actual source code on that line (lines 34, 35, 44, 45, etc).

This is the second blogpost on using PL/SQL inside SQL. If you landed on this page and have not read the first part, click this link and read that first. I got some reactions to the first article, one of which was: what does this look like with 'pragma udf' in the function?

Pragma udf (user defined function) is a way to speed up calling PL/SQL functions from SQL, available starting from version 12. If you want to know more about the use of pragma udf, when it helps and when it doesn't, please google for it.

create or replace function add_one( value number ) return number is
        pragma udf;
        l_value number(10):= value;
begin
        return l_value+1;
end;
/

select sum(add_one(id)) from t2;

As you can see, really the only thing you have to do is add ‘pragma udf’ in the declaration section of PL/SQL.

Here is what the flamegraph looks like:

What is visible is that between the PL/SQL interpreter (pfrrun) and the function that makes the operand evaluation switch to PL/SQL (evapls) there now is only one function, peidxrex. However, inside the evapls function two additional functions are called (kkxmsagof and kkxmsagif, not readable in the graph) which take noticeable time. The conclusion at this point is that pragma udf does it in yet another way than a native PL/SQL function and than subquery factoring.

Profiling this using the systemtap script:

global evapls_time, pfrrun_time, evapls_tot=0, pfrrun_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls") {
	if ( pid() == target() )
		evapls_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evapls").return {
	if ( pid() == target() )
		evapls_tot+=local_clock_us()-evapls_time
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun") {
	if ( pid() == target() )
		pfrrun_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("pfrrun").return {
	if ( pid() == target() )
		pfrrun_tot+=local_clock_us()-pfrrun_time
}

probe end {
	printf("\nevapls time: %12d\npfrrun time: %12d\n", evapls_tot, pfrrun_tot)
}

Shows:

# stap -x 92509 plsql.stap
Begin.
^C
evapls time:      2211412
pfrrun time:       804178

So that is very close to calling this function using subquery factoring, just a bit longer (2192685 microseconds in that case). This strongly depends on what is actually done, so mileage may vary for your own use.

While we are at it, let's have a look at what this looks like when no PL/SQL is used at all:

select sum(id+1) from t2;

Here it is:

The function used for the addition is now evaaddrset. From the flamegraph it can be seen that, compared to the kdst_fetch function, it takes far less time. Let's measure it with a changed version of the systemtap script:

global evaaddrset_time, evaaddrset_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evaaddrset") {
	if ( pid() == target() )
		evaaddrset_time=local_clock_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("evaaddrset").return {
	if ( pid() == target() )
		evaaddrset_tot+=local_clock_us()-evaaddrset_time
}

probe end {
	printf("\nevaaddrset time: %12d\n", evaaddrset_tot)
}

This is what the output looks like:

# stap -x 92509 plsql.stap
Begin.
^C
evaaddrset time:        43389

A simple calculation shows that doing the addition natively in SQL takes only 43389/2211412*100 ≈ 2% of the runtime of PL/SQL with pragma udf.

This post is about manually getting and freeing a shared latch. Credits should go to Andrey Nikolaev, who covered this in his presentation at UKOUG Tech 15, which I am very sorry to have missed.

Essentially, in my 'Oracle 12 and shared latches part 2' blogpost, which is about shared latches, I showed how to get a shared latch:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call ksl_get_shared_latch 0x94af8768 1 0 2303 16
Function returned 1

Which works okay, but leaves a bit of a mess when freed:

SQL> oradebug call kslfre 0x93FC02B8
ORA-03113: end-of-file on communication channel
ORA-24323: value not allowed

(please read part 2 to get the full story on how to set it and how to use it for investigations)

Andrey Nikolaev shows there is another function, kslgetsl_w(latch address, wait, why, where, mode), which can be used to get a shared latch (just like the function ksl_get_shared_latch) but without the oradebug session blowing up in your face:

SQL> oradebug call kslgetsl_w 0x93faec78 1 0 2329 16
Function returned 1

And after testing the latch:

SQL> oradebug call kslfre 0x93faec78
Function returned 1

However, when querying v$latchholder from the same session as the oradebug call, an ORA-600 is thrown:

SQL> select * from v$latchholder;
select * from v$latchholder
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [504], [0x095306210], [2], [1], [session idle bit], [1], [0x093FAEC78], [cache buffers chains], [], [], [], []

After which the latch is freed.

The error message strongly hints at latches being taken in an incompatible order. The session idle bit seems to be a latch taken to indicate the session state switching between active and inactive (according to an article here), and indeed, while holding a cache buffers chains latch, a session should not switch state at all. A second sysdba session can however be used to query v$latchholder.

Some very limited investigation shows that the function kslgetsl_w is not used by the database itself, which rather uses ksl_get_shared_latch. Again, the investigation was very limited; there might be situations in which this function is used.

Conclusion
The reason for this blogpost is to point out an alternative to calling the ksl_get_shared_latch function manually for investigation purposes. This post also shows that you should not change or play with internals on production systems, because it may lead to all kinds of unexpected behaviour or plain errors, of which the ORA-600 above is a good example.

Oracle DBAs who are old enough to remember the days before Oracle 11.2 probably remember the tuning efforts for latches. I can still recall the latch number for cache buffers chains off the top of my head: number 98. In older days this was another number, 157.

But it seems latches have become less of a problem in the modern days of Oracle 11.2 and higher. Still, when I generate heavy concurrency I can see some latch waits. (I am talking about you and SLOB mister Closson).

I decided to look into latches on an Oracle 12.1.0.2 instance on Oracle Linux 7. This might also be a good time to go through how you think they work yourself; it might be different from what you think or have been taught.

In order to understand how latching works, I searched for Oracle related traces. I could find event 10005, but it did not return anything latching related. My current understanding is that event 10005 is Oracle KST tracing, for which the results are put in X$TRACE.

Luckily, I could get a great headstart by studying the work of Andrey Nikolaev. However, that work seems to be strictly Solaris based.

I decided to take a look at how this works on Linux. In order to do this, I set up a system for the specific purpose of this test. Disclaimer: the techniques below are for educational purposes only, and should never be used on a real database!

The work of Andrey shows kslgetl() as the overall latching function, which does:
kslgetl
– sskgslgf (immediate latch get)
– kslges (wait latch get)
— kskthbwt
— kslwlmod (setup wait list)
— sskgslgf (immediate latch get)
— skgpwwait (sleep latch get)
— semop

In order to do latch gets and generate latch misses in a very predictable way, I used the method that Andrey shows on his website (which he attributes to Tanel), which is using oradebug call to call the latch get function. In order to get latch waits, you need at least two processes doing something with a latch: one holding the latch, and another one requesting it. In order to facilitate this, I set up two 'sqlplus / as sysdba' sessions.

Taking a latch manually can be done using the kslgetl or the ksl_get_shared_latch functions. Freeing a latch is done using the kslfre function, which just takes the latch address as its argument. The kslgetl and ksl_get_shared_latch functions take the following arguments (a sketch of how to find a latch address follows the list):
1-latch address
2-immediate get (0 means yes, 1 means no)
3-where (X$KSLLW.INDX)
4-why (X$KSLWSC.INDX)
5-mode (8=shared,16=exclusive; only for ksl_get_shared_latch function)
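
For the first argument, a latch address can be looked up in the V$LATCH_PARENT and V$LATCH_CHILDREN views; a hedged sqlplus sketch (the latch names are just examples):

SQL> select addr, name from v$latch_parent where name = 'cache table scan latch';
SQL> select addr, name from v$latch_children where name = 'cache buffers chains' and rownum < 4;

The ADDR column is a raw value; prefix it with 0x when passing it to oradebug call, as in the examples below.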

Immediate mode get for a non shared latch.
An immediate mode get just tries to fetch a latch once. In order to manually do an immediate latch get, I did the following:

Session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 1

This means session 1 has taken latch 0x60023a80. I took the latch with an immediate get (second argument 0), but that does not really matter here: the session did get the latch.

If you want to check this, simply use the V$LATCHHOLDER view to verify (in another session):

SQL> select * from v$latchholder;

        PID        SID LADDR            NAME                                                                   GETS     CON_ID
---------- ---------- ---------------- ---------------------------------------------------------------- ---------- ----------
        38        134 0000000060023A80 cache table scan latch                                                  709          0

Please mind that with a manually taken latch, you need to free the latch before you do anything else with your session, otherwise you will encounter a (non critical) ORA-600. Freeing a latch is done using oradebug call kslfre with one argument: the latch address.
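
For example, freeing the latch taken above looks like this (same latch address as the get):

SQL> oradebug call kslfre 0x60023a80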

Session 2:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 0

The ‘Function returned 0’ means the immediate latch get failed. As you can see this was an immediate get because the second argument is 0.

In order to understand which functions are involved, I first used the Linux perf record utility. However, because an immediate get does not spin, and a latch get by design has speed and low overhead as one of its principal design criteria, I could not see the function in the profile.

This meant I needed to go to one of the tools I have used extensively in the past: gdb (the GNU debugger). You need to attach to the Oracle database server process locally on the database server. Here is what I did:

# gdb -p 4600
...
(gdb) set pagination off
(gdb) rbreak ^ksl.*
...
Breakpoint 262 at 0x8ea7b0
<function, no debug info> ksl_event_stats_rollup;
(gdb) commands 1-262
type commands for breakpoint(s) 1-262, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

The first gdb command turns off having to press enter for every full screen of gdb output, the second command sets breakpoints on all functions in the oracle executable that start with 'ksl'. The commands command defines what is executed by gdb when breakpoints 1-262 are hit, which here is just 'c': continue.

Now, with the debugger set, I executed the kslgetl function again:

Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 247, 0x000000000c8e10a0 in kslws_check_waitstack ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 242, 0x000000000c8ddcb0 in kslgetl ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()

It is important to understand that Oracle does a lot of other work besides the latch get via kslgetl. Most of the functions above start with kslwt, which is the Oracle wait interface. A couple of times a shared latch is taken (visible via the function ksl_get_shared_latch) and freed (kslfre). The important part here is: kslgetl is executed once, and did not descend into any other function to try to get the latch.

Getting a latch in willing to wait mode for a non shared latch.
Now let’s do something a bit more exciting: requesting an already taken latch in willing to wait mode. The first session does exactly the same as before, it just takes the latch. The call in the second session needs to change a little to indicate it is willing to wait:

SQL> oradebug call kslgetl 0x60023a80 1 0 2442

This will call additional functions. In order to understand what these functions are, I used perf record, perf report and perf script.
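
For reference, this is roughly how such a recording could be done; the pid is the one of the session executing the latch get (4600 was the pid used with gdb above), and the exact flags are my own choice rather than something documented in this test:

$ perf record -g -p 4600
$ perf report
$ perf script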

I was able to create a smaller, more specific gdb script to see what is going on:

break kslgetl
  commands
    silent
    printf "kslgetl laddr:%x, willing:%d, where:%d, why:%d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslges
  commands
    silent
    printf "kslges %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslwlmod
  commands
    silent
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break skgpwwait
  commands
    silent
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break sskgpwwait
  commands
    silent
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break semop
  commands
    silent
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end

This is how the gdb output looks when the latch get in willing to wait mode is executed:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
kslwlmod 13311368, -1780327896, 1610758784, 1
skgpwwait 13311608, -1767360, -1780326976, 0
sskgpwwait 13311608, -1767360, -1780326976, 0
semop 360451, 13310840, 1, -1

Interestingly, if the latch is not taken, this is how the latch get sequence looks like:

kslgetl laddr:60023a80, willing:1, where:0, why:2442

In other words, for getting a non shared latch in willing to wait mode:

1-the function kslgetl is called, which tries to fetch the latch.
If the latch can be taken, the function returns, if not:
2-the function kslges (kernel service latch get spinning) is called, which supposedly also tries to take the same latch.
If the latch still can not be taken, the next function is:
3-the function kslwlmod (kernel service latch waiting list modify) is entered.
In this function the process registers itself as waiting in the post/wait list.
4-the function skgpwwait (system kernel generic post/wait wait) is entered.
This function sets up the waiting for the process so it can be posted.
5-the function sskgpwwait (system system kernel generic post/wait wait) is entered.
My current understanding is that the ‘ss’ functions contain the platform-specific code for database functions.
6-the (operating system) function semop (semaphore operation) is called.
This will make the process sleep waiting on a semaphore (an operating system signalling mechanism). This way, the process will not be runnable on the CPU unless the semaphore is posted.

The information that is missing here is the spinning. The earlier work of Andrey Nikolaev showed that in the Solaris port of the database, distinct functions (sskgslgf [immediate get] and sskgslspin [spin]) were used to get the latch, which made the spinning easy to count.

Some searching around showed that a CPU register holds this information. Add this to the above gdb script:

break *0xc29b51
  commands
    silent
    printf " kslges loop: %d\n", $ecx
    c
  end

And try to get a non shared taken latch in willing to wait mode:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
 kslges loop: 19999
 kslges loop: 19998
...
 kslges loop: 1
 kslges loop: 0
kslwlmod 1208568840, -1780327896, 1610758784, 1
skgpwwait 1208569080, -1281169344, -1780326976, 0
sskgpwwait 1208569080, -1281169344, -1780326976, 0
semop 360451, 1208568312, 1, -1

So…this reveals that getting a non shared latch in willing to wait mode will spin 10*_spin_count times, in other words: not _spin_count times, which is 2000 by default. What is even more interesting is that for the described type of latch, there is no (short) timed sleep: the kslges function spins 10*_spin_count times for the latch, then puts itself on the waiting list, and goes to sleep on a semaphore. The way the latching mechanism works has been described (by Oracle, AFAIK) as a process spinning _spin_count times trying to get a latch, then sleeping for some time, then spinning again trying to get the latch, sleeping again, and so on.
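
If you want to verify the _spin_count value on your own instance, a query along these lines against the underlying parameter views should show it (this is a generic sketch requiring a sysdba connection, not something taken from this test):

SQL> select n.ksppinm, v.ksppstvl from x$ksppi n, x$ksppcv v where n.indx = v.indx and n.ksppinm = '_spin_count';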

I straced the process holding the latch to see if it does a semctl system call to signal the waiting process when the latch is freed, and indeed that is what happens. This is different from the “old days” where processes spinning on latches (cache buffers chains latches most of the time) were really hammering the system and would eat up a lot of the CPU slices trying to get a latch.
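
Something along these lines should show it; the pid has to be the operating system pid of the session holding the latch, and the filter on semctl is my own choice:

# strace -e trace=semctl -p <ospid of latch holder>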

Part of the classic latching problem, specifically the cache buffers chains latching problem, quite probably is mitigated by having shared latches, which were introduced in Oracle in different versions for different ports.

Watch out for a second blogpost where I do the same investigation for shared latches.

This is a question that I played with for a long time. There have been statements on logical IO performance (“Logical IO is x times faster than Physical IO”), but nobody could answer the question of what the actual logical IO time is. Of course you can see part of it in the system and session statistics (v$sysstat/v$sesstat), statistic name “session logical reads”. However, if you divide the total time a query took by the number of logical reads, the resulting logical IO time is too high, because that assumes all the time the query took was spent on doing logical IO, which obviously is not the case: there is time spent on parsing, maybe physical IO, etc. Also, when doing that, you calculate an average, and averages are known to hide actual behaviour.
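
For illustration, fetching that statistic for a single session could look like the sketch below; the sid is hypothetical, and the elapsed time of the query would have to be measured separately before you could even attempt the (flawed) division:

SQL> select s.value from v$sesstat s, v$statname n where s.statistic# = n.statistic# and n.name = 'session logical reads' and s.sid = 134;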

Luckily, with Red Hat Enterprise Linux and Oracle Linux version 7, the stock kernel is version 3.10. This kernel version supports systemtap userspace return probes. A systemtap probe is a way to trigger an action when a certain event (the probed event) starts; a return probe is an action triggered when that event finishes. The Oracle UEK version 3 kernel at the time of writing is version 3.8, which does not support this.

My current knowledge is consistent reads are handled by the Oracle database C function kcbgtcr(). Current reads are quite probably handled by the function kcbgcur(). Having said that, I know of at least one exception to this: scans on hybrid columnar compressed segments do not use the kcbgtcr() function.

Please mind all kernel function name translations, like kcbgtcr (kernel cache buffers get consistent read), are pure guesses, albeit somewhat educated guesses, as there are a lot of internet publications naming these, including My Oracle Support itself.

With the 3.10 version kernel, we can create a small systemtap script to measure the time between the start and the end of the kcbgtcr routine. If you want to experiment with this, it is probably best to download the latest version of systemtap and compile it yourself. This is the reason you see /usr/local/bin/stap in the shebang.
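
Building systemtap from source is roughly the usual configure/make routine; this is just a sketch of the general steps after downloading and unpacking the source (prerequisites such as the elfutils development packages vary per system, so check the systemtap documentation):

$ ./configure
$ make
# make install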

#!/usr/local/bin/stap

global latency

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	printf("<kcbgtcr, latency(us): %d\n", gettimeofday_us() - latency[pid()])
}

Now make the script executable, and run it against a database session.
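
Making it executable is a plain chmod (assuming the script was saved as lio.stap, the name used below); it is then attached to the database session with the -x flag and the pid of the foreground process:

$ chmod +x lio.stap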

# ./lio.stap -x 3877

Next, I execute a scan in the database foreground session, and watch the systemtap script output:
(execution of the systemtap script can be cancelled by pressing CTRL-c)

...
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 79542
>kcbgtcr
<kcbgtcr, latency(us): 4
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 2
>kcbgtcr
<kcbgtcr, latency(us): 13
>kcbgtcr
<kcbgtcr, latency(us): 1
...

This shows the function being executed by the database session. However, it seems the time spent in the kcbgtcr() function is not consistent.

In order to get a better overview, we can add a histogram of the kcbgtcr latencies:

#!/usr/local/bin/stap
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	latency[pid()] = gettimeofday_us() - latency[pid()]
	latency_histogram <<< latency[pid()]
	printf("<kcbgtcr, latency(us): %d\n", latency[pid()])
}

probe end {
	if ( @count(latency_histogram) > 0 ) {
		printf("\n==kcbgtcr latency==\n")
		println(@hist_log(latency_histogram))
	}
}

Next, attach the systemtap script to the database session again, and issue a scan. Once the scan in the database session is finished, press CTRL-c to finish the systemtap script:

...

==kcbgtcr latency==
  value |-------------------------------------------------- count
      0 |                                                       0
      1 |@@@@@@@@@@@@@@@@@@@@                                5998
      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14284
      4 |                                                     268
      8 |                                                      58
     16 |                                                     137
     32 |                                                      25
     64 |                                                       1
    128 |                                                      15
    256 |                                                      69
    512 |                                                       1
   1024 |                                                       0
   2048 |                                                       0
   4096 |                                                       1
   8192 |                                                       5
  16384 |                                                      35
  32768 |                                                      39
  65536 |                                                       8
 131072 |                                                       0
 262144 |                                                       1

Okay, the majority is 2 microseconds, but outside the 1 and 2 microsecond buckets there are quite a few executions that take far longer, up to 262144 microseconds (262 milliseconds)!

What could cause these huge changes in logical IO time?

At this point I am turning the squelch way down and making a lot of information visible (this is a warning!). Here is the systemtap script I am using:

#!/usr/local/bin/stap
#
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
	if ( probefunc() == "kcbgtcr" )
			latency[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*").return {
	printf("%s < %s", thread_indent(-1), ppfunc())
	if ( ppfunc() == "kcbgtcr" ) {
		latency[pid()] = gettimeofday_us() - latency[pid()]
		latency_histogram <<< latency[pid()]
		printf(" -- latency(us): %d", latency[pid()])
	}
	printf("\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiosq0") {
	printf("=PARSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiexe") {
	printf("=EXECUTE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opifch2") {
	printf("=FETCH\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiclo") {
	printf("=CLOSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	printf("=KSLWTBCTX\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	printf("=KSLWTECTX\n")
}
probe process("/lib64/libaio.so.1").function("io_submit") {
	printf("=IO_SUBMIT\n")
}
probe process("/lib64/libaio.so.1").function("io_getevents_0_4") {
	printf("=IO_GETEVENTS\n")
}
probe process("/lib64/libc.so.6").function("pread64") {
	printf("=PREAD\n")
}

Warning! This script takes a long time to compile into a kernel module! Systemtap works by dynamically creating a kernel module from the systemtap script and inserting it into the kernel. Because I injected a huge chunk of code to run in the kernel, a lot of things are potentially influenced. I found the Oracle database dropping cores, for example.

After running this against a session and doing a very simple full table scan with a count(*), I searched for the logical IO, in other words where the kcbgtcr() function was performed. Here it is:

166149 oracle_5487_tes(5487):     < kdst_fetch
166155 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166159 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets
166162 oracle_5487_tes(5487):     > kdst_fetch
166164 oracle_5487_tes(5487):      > kdst_fetch0
166167 oracle_5487_tes(5487):       > kcbipnns
166170 oracle_5487_tes(5487):       < kcbipnns
166173 oracle_5487_tes(5487):       > kcbrls
166177 oracle_5487_tes(5487):        > kcbrls_direct
166181 oracle_5487_tes(5487):        < kcbrls_direct
166183 oracle_5487_tes(5487):       < kcbrls
166185 oracle_5487_tes(5487):       > kdstsnb
166188 oracle_5487_tes(5487):       < kdstsnb
166191 oracle_5487_tes(5487):       > ktrget2
166194 oracle_5487_tes(5487):        > ktsmg_max_query
166197 oracle_5487_tes(5487):        < ktsmg_max_query
166200 oracle_5487_tes(5487):        > kcbgtcr
166204 oracle_5487_tes(5487):         > kcbldrget
166209 oracle_5487_tes(5487):          > kcblgt
166211 oracle_5487_tes(5487):          < kcblgt
166213 oracle_5487_tes(5487):          > kcbzvb
166216 oracle_5487_tes(5487):           > kcbhvbo
166219 oracle_5487_tes(5487):            > kcbhxoro
166222 oracle_5487_tes(5487):            < kcbhxoro
166224 oracle_5487_tes(5487):           < kcbhvbo
166226 oracle_5487_tes(5487):          < kcbzvb
166228 oracle_5487_tes(5487):          > kcbztek_trace_blk
166230 oracle_5487_tes(5487):          < kcbztek_trace_blk
166233 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
166236 oracle_5487_tes(5487):           > kcbtgobj
166239 oracle_5487_tes(5487):            > kd4obj
166251 oracle_5487_tes(5487):            < kd4obj
166254 oracle_5487_tes(5487):           < kcbtgobj
166257 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
166260 oracle_5487_tes(5487):         < kcbldrget
166262 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 62
166265 oracle_5487_tes(5487):        > ktrgcm
166268 oracle_5487_tes(5487):         > kcbcge
166271 oracle_5487_tes(5487):         < kcbcge
166273 oracle_5487_tes(5487):         > ktcckv
166275 oracle_5487_tes(5487):         < ktcckv
166285 oracle_5487_tes(5487):        < ktrgcm
166286 oracle_5487_tes(5487):       < ktrget2
166289 oracle_5487_tes(5487):       > kdr9ir2blk
166291 oracle_5487_tes(5487):       < kdr9ir2blk
166293 oracle_5487_tes(5487):      < kdst_fetch0
166294 oracle_5487_tes(5487):     < kdst_fetch
166297 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166300 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

How to read: “>” means entering a function, “<” means returning from a function.
I selected a piece of the systemtap output/tracing where the counting procedure is visible.

The first row is "< kdst_fetch", in other words: returning from kernel data scan table fetch. So this function has just performed a fetch. Not surprisingly, the next function is qeaeCn1SerialRowsets, which I think is the count function.

After the count, the kdst_fetch function is entered again, then kdst_fetch0. The next functions are kcbipnns, kcbrls and kcbrls_direct. Probably these functions are related to pinning and releasing blocks. See Alexander Anokhin’s post on that.

A few functions further we encounter kcbgtcr(). Interestingly, it is followed by the kcbldrget() function, which is kernel cache buffers direct path loader get. In other words, Oracle has chosen to do a direct path read, because this is the function that starts off the direct path read code. The next function, kcblgt(), requests a block. This means that blocks read in this code path come from PGA memory, not SGA (buffer cache) memory.

The next functions, kcbzvb, kcbhvbo and kcbhxoro are the block XOR checking functionality. The function actually performing this (sxorchk) is not visible because I did not probe for it.

The next functions are not clear to me at this point (kcbztek_trace_blk, kcbl_objdchk_with_cache_reread, kcbtgobj, kd4obj). Then the kcbldrget returns, and the kcbgtcr function too.

Here we can see that probing a huge number of functions does influence the performance of a process. While previously we saw kcbgtcr() take 2us, the kcbgtcr function now took 62us, probably because of all the probes.

The next piece of execution is done by the ktrgcm function. This is handling undo and buffer cleanout. After that function, we cross kdr9ir2blk (function unknown), after which kdst_fetch0 and kdst_fetch return, and the count is done using qeaeCn1SerialRowsets.

Okay, so far so good, but the question was: why are there logical IOs/kcbgtcr() executions that take excessively more time?

After a little searching, I was able to find a very good reason:

169252 oracle_5487_tes(5487):     > kdst_fetch
169254 oracle_5487_tes(5487):      > kdst_fetch0
169256 oracle_5487_tes(5487):       > kcbipnns
169258 oracle_5487_tes(5487):       < kcbipnns
169259 oracle_5487_tes(5487):       > kcbrls
169262 oracle_5487_tes(5487):        > kcbrls_direct
169264 oracle_5487_tes(5487):        < kcbrls_direct
169265 oracle_5487_tes(5487):       < kcbrls
169267 oracle_5487_tes(5487):       > kdstsnb
169269 oracle_5487_tes(5487):       < kdstsnb
169270 oracle_5487_tes(5487):       > ktrget2
169272 oracle_5487_tes(5487):        > ktsmg_max_query
169274 oracle_5487_tes(5487):        < ktsmg_max_query
169275 oracle_5487_tes(5487):        > kcbgtcr
169278 oracle_5487_tes(5487):         > kcbldrget
169280 oracle_5487_tes(5487):          > kcblgt
169283 oracle_5487_tes(5487):           > kcblrs
169286 oracle_5487_tes(5487):            > kdsdrcbk
169288 oracle_5487_tes(5487):            < kdsdrcbk
169291 oracle_5487_tes(5487):            > kcbldio
169296 oracle_5487_tes(5487):             > kcfaioe
169299 oracle_5487_tes(5487):             < kcfaioe
169301 oracle_5487_tes(5487):             > kcflbi
169314 oracle_5487_tes(5487):              > kcf_hard_ftype_check
169317 oracle_5487_tes(5487):              < kcf_hard_ftype_check
=IO_SUBMIT
169416 oracle_5487_tes(5487):             < kcflbi
169420 oracle_5487_tes(5487):            < kcbldio
169425 oracle_5487_tes(5487):            > kcblcffln
169429 oracle_5487_tes(5487):            < kcblcffln
169432 oracle_5487_tes(5487):           < kcblrs
169434 oracle_5487_tes(5487):           > kcblsinc
169438 oracle_5487_tes(5487):           < kcblsinc
169440 oracle_5487_tes(5487):           > kcblcio
169443 oracle_5487_tes(5487):            > kcblci
169447 oracle_5487_tes(5487):             > kcflci
=IO_GETEVENTS
=IO_GETEVENTS
169481 oracle_5487_tes(5487):              > kcflwi
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTBCTX
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTECTX
179477 oracle_5487_tes(5487):              < kcflwi
179484 oracle_5487_tes(5487):             < kcflci
179488 oracle_5487_tes(5487):            < kcblci
179491 oracle_5487_tes(5487):           < kcblcio
179494 oracle_5487_tes(5487):          < kcblgt
179497 oracle_5487_tes(5487):          > kcbzvb
179509 oracle_5487_tes(5487):           > kcbhvbo
179513 oracle_5487_tes(5487):            > kcbhxoro
179516 oracle_5487_tes(5487):            < kcbhxoro
179518 oracle_5487_tes(5487):           < kcbhvbo
179520 oracle_5487_tes(5487):          < kcbzvb
179539 oracle_5487_tes(5487):          > kcbztek_trace_blk
179544 oracle_5487_tes(5487):          < kcbztek_trace_blk
179549 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
179555 oracle_5487_tes(5487):           > kcbtgobj
179559 oracle_5487_tes(5487):            > kd4obj
179562 oracle_5487_tes(5487):            < kd4obj
179563 oracle_5487_tes(5487):           < kcbtgobj
179565 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
179569 oracle_5487_tes(5487):         < kcbldrget
179571 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 10295
179576 oracle_5487_tes(5487):        > ktrgcm
179580 oracle_5487_tes(5487):         > kcbcge
179582 oracle_5487_tes(5487):         < kcbcge
179585 oracle_5487_tes(5487):         > ktcckv
179587 oracle_5487_tes(5487):         < ktcckv
179589 oracle_5487_tes(5487):        < ktrgcm
179591 oracle_5487_tes(5487):       < ktrget2
179593 oracle_5487_tes(5487):       > kdr9ir2blk
179606 oracle_5487_tes(5487):       < kdr9ir2blk
179609 oracle_5487_tes(5487):      < kdst_fetch0
179611 oracle_5487_tes(5487):     < kdst_fetch
179616 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
179620 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

If you go through the calls, you will see that the start is exactly the same, until line 17. After kcbgtcr > kcbldrget (the consistent read request function choosing direct path reads), the kcblgt function does not return immediately, but rather starts off a lot of extra code path.

This code path fetches new blocks. The most striking thing here is that kcbgtcr requests the blocks, and physical IO is done on behalf of the consistent read request, in other words: on behalf of the logical IO. This is obvious if you think about it: if you want to read blocks, you have to check whether they are available, and if they are not, you have to fetch them.

As a conclusion: if kcbgtcr(), together with kcbgcur() and a couple of other functions, is considered the logical IO, then a logical IO has a variable time, instead of a consistent one. Because physical IO is requested inside the logical IO request, technically the physical IO is part of the logical IO. This means it is technically incorrect to state that a physical IO is slower than a logical IO: a physical IO is part of the logical IO request that needed it, and as such a physical IO can not be slower than a logical IO.
