Archive

Tag Archives: debug

In my previous post, I introduced Intel Pin. If you are new to pin, please follow this link to my previous post on how to set it up and how to run it.

One of the things you can do with Pin, is profile memory access. Profiling memory access using the pin tool ‘pinatrace’ is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

The pid is a pid of an oracle database foreground process. Now execute something in the session you attached pin to and you find the ‘pinatrace’ output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind memory access generates A LOT of information! The above 11MB is what a ‘select * from dual’ generates (!)

This is how the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the function location, the second field is R or W (reading or writing obviously), the third field is the memory location read or written the fourth field is the amount of bits read and the fifth field is prefetched memory.

The function that is used can be looked up using the addr2line linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of that is Mahmoud Hatem’s blog on tracing access to a memory location. Wouldn’t it be nice if we can change the functions addresses to function names, and the memory addresses to named memory locations whenever possible?

That’s where I created the pinatrace annotate oracle tool for. This is a little scriptset that contains scripts to generate memory information from the instance, after which the instruction pointer addresses and memory locations of a pinatrace.out file generated by pinatrace are translated to function names and memory area names. Let’s have a look what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

The usefulness of this is limited in this form. The only thing I could derive is that big numbers in the memory access column (‘0x00007fff6fd2ac60’) are probably PGA related, and the numbers between roughly 0x000000006000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So, now you can see the reason I picked a seemingly arbitrary range of lines actually was because that range is the memory accesses of the ksl_get_shared_latch function. This annotated version show a shared latch get for the ‘session idle bit’ latch. It’s also visible the function uses PGA memory, some of it annotated, some of it not, and that most of the shared pool access is for the latch (a latch essentially is a memory range with the function of serialising access to a resource), which is in the shared pool because it’s a child latch. It’s also visible memory belonging to X$KSUPR is read and written (X$KSUPR is the table responsible for V$PROCESS, the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).

Why are a lot of the assumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? Well, PGA memory allocations are very transient of nature. Because a PGA memory snapshot is made at a certain point in time, this snapshot represents the memory layout of that moment, which has a high probability of having memory deallocated and freed to the operating system. A lot of the SGA/shared pool allocations on the other hand have the intention of re-usability, and thus are not freed immediately after usage, which gives the SGA memory snapshot a good chance of capturing a lot of the memory allocations.

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell, it might not work in other shells like ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script outputs to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt) or if you want to look and redirect to a file: | tee file.txt.

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manualis to think of Pin as a JIT (just in time) compiler, where the compiler does not take byte code (as JIT compilation does with Java), but the executable of the process pin is executed against. This means pin inserts itself into the process’ execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see this process an oracle foreground (rows 1-3), after which we see a lot of stuff that pin pushed into the process’ address space.

For this reason, it’s vitally important to use and run pin as the same user as the process you want to run pin against. The way pin works is that, upon execution of pin, the pin executable inserts itself into the process’ address space, gains control and then tries to load necessary libraries. If it can’t find these libraries, it will send a SIGKILL to the process, effectively stopping it!

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This will output a lot of stuff on your screen and compiles everything in the tools directory.

Now that we have seen an introduction, let’s use a Pin tool to do something useful! One such useful is the tool ‘DebugTrace’. First obtain the process id (pid) from an Oracle server foreground process. Then execute the pin tool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so

Upon execution, nothing is returned, because the majority of the things are happening in the process pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let’s look what this produces! This is how my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it look rather big (remember I only ran ‘select * from dual’)!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a ‘return underflow’, which is because these functions have been called before the trace was put on the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called and first two arguments of the function.
Returning: the address and function where the return is executed, and the returncode.
Tailcall: this is a function that is called as the last instruction in the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).

Please mind currently I am not aware of a way to remove the pin framework and tool from a process address space. Also mind that because of how this works (a lot of intercepting in the execution flow), it will slow down the process significantly. Use this tool and the above described techniques at your own risk.

It’s general knowledge that the Oracle database is ACID compliant, and that SCNs or ‘system change numbers’ are at the heart of this mechanism. This blogpost dives into the details of how the Oracle engine uses these numbers.

Oracle database version 12.1.0.2.161018
Operating system version: OL 7.2, kernel: 4.1.12-61.1.14.el7uek.x86_64 (UEK4)

Redo generation
Whenever DML is executed, redo is generated in the form of ‘change vectors’. These change vectors are copied into the redo buffer as part of the transaction, during the transaction. The function that performs this action is called ‘kcrfw_copy_cv()’. This can be derived by watching the foreground process perform memory copy into the memory area of the redo buffer.

In order to do this, you first need to find the memory area of the redo buffer. This can be done by executing ‘oradebug setmypid’ and ‘oradebug ipc’ as sysdba, and examine the resulting trace file:

 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1  4784130 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008600000 default       readwrite

This shows the redo buffers are allocated between 0x96400000 and 0x9E934000 (actual address and actual address + segment size).

The function Oracle uses to copy memory is called “_intel_fast_memcpy()”. This probably is an optimised version of the standard C memcpy() function. What is important, is that the arguments of memcpy function:

$ man memcpy
NAME        
       memcpy - copy memory area
SYNOPSIS    
       #include <string.h>
       void *memcpy(void *dest, const void *src, size_t n);

So the first argument of memcpy is a pointer to the destination, the second argument is a pointer to the source, and the third argument is the size of the memory area to be copied.

In order to find the function that uses memcpy() to copy into the redo buffers area, we can use the following gdb macro:

break _intel_fast_memcpy if $rdi > 0x96400000 && $rdi < 0x9E934000
commands
silent
printf "_intel_fast_memcpy dest %lx, source %lx, size %d\n", $rdi, $rsi, $rdx
frame 1
c
end

The macro creates a conditional breakpoint on the function _intel_fast_memcpy(). The condition is that the CPU register RDI (first argument, destination) should be higher than 0x96400000 and lower than 0x9E934000, which is the memory area the redo log buffers are mapped to. If the breakpoint and its conditions are triggered, the function name is printed, together with the arguments to memcpy, and ‘frame 1’ to list the function that called _intel_fast_memcpy().

To execute this macro, startup an Oracle foreground session, and attach to the session with gdb, and execute the macro. Next, execute an insert statement. In my case, I created a very simple table called ‘T’ with one field, T, varchar2(1). Please mind that you need to watch out for deferred segment creation, so ideally you first need to create the table, and insert a few rows to take care of that.

This is how the gdb side looks like when executing an insert statement:

_intel_fast_memcpy dest 96817454, source 7ffc9fcc9598, size 36
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817478, source 7ffc9fcc9658, size 32
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817498, source 7faf6ad677f0, size 44
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174c4, source 7ffc9fcc9678, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174d8, source 7ffc9fcc95e4, size 76
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817524, source 7faf6ad67888, size 8
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 9681752c, source 7faf6ad67760, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817540, source 7faf6ad69ff8, size 40
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817568, source 7faf6ad6a070, size 24
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817580, source 7faf6ad69f60, size 49
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968175b4, source 89753568, size 1
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()

The destination of the memcpy() action is the logbuffers, the break conditions made sure of that. The source of all except for the last memcpy() is the PGA (visible by the high number), the last memcpy() copies from the SGA. The memory address points to an SQLA component (cursor) in the SGA.

Okay, at this point we established the redo of a DML operation is put in the redo buffers by a function called kcrfw_copy_cv(). Also please mind that modern Oracle versions do put redo into the redo buffers during doing DML, not after it!

To move further on the topic of transactions, let’s look at when the current SCN is fetched and when kcrfw_copy_cv() is called. To do this, I use the following gdb macro:

break kcrfw_copy_cv
commands
silent
printf "kcrfw_copy_cv\n"
c
end
awatch *0x6001fbb0
commands
c
end

When running this gdb macro against an Oracle session running an insert statement, the following output is generated:

Hardware access (read/write) watchpoint 2: *0x6001fbb0

Old value = 347661
New value = 347666
0x000000000ccbdfb3 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbeb04 in kcsgssn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbdfb3 in kcscur3 ()
kcrfw_copy_cv
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()

Actually, I was surprised by this output. I suspected a call to generate an unique SCN for this change. However, the calls/functions which are visible are:
– kcscur3: a function to read a SCN (the global SCN, but is also used to read other SCNs, like the LWN SCN and the on-disk SCN).
– kcsgrsn: kernel cache service get recent SCN, for reading the global SCN.
– kcsgcsn: kernel cache service get current SCN, for reading the global SCN.
– kcsgssn: kernel cache service get snapshot SCN, for reading the global SCN.
All of these functions do READ a SCN value, but do not modify it. In order to update a SCN, the function kcsgbsn (kernel cache service get batched SCN) should be called. There is a memory write shown (row 1-5), but that really is a read, it is an anomaly of watchpoints with memory changed by other processes.

So, conclusion at this point is that DML (in this case an insert statement) just reads the global SCN, and does not seem to update it.

This got me thinking: if an insert statement just reads the global SCN, instead of generating an unique SCN, it should be possible that the changes (redo vectors actually) of DML executed simultaneously in different sessions on different segments share the same SCN? It sounds counter intuitive to what I have been taught about Oracle.

This can be tested in the following way:
1 – Have two foreground sessions in a different schema execute an insert statement at exactly the same time. This can be done in iTerm2 using the ‘broadcast input’ option, which allows input to be executed in multiple panes at the same time. The command executed in both pane’s is ‘insert into t values (‘a’);’.

2 – Have gdb attached to the two sessions, having an access watchpoint at the global SCN, to be able to identify the SCN that both the session used during executing the insert, and a breakpoint at kcrfw_copy_cv() to understand which SCN it uses. I used the following gdb macro:

break kcrfw_copy_cv
  commands
    silent
    printf "kcrfw_copy_cv\n"
    c
  end
awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    c
  end

3 – To validate what is visible in gdb, dump the current logfile and find the SCN as shown in gdb. The current logfile can be dumped to trace using the following PL/SQL procedure:

----------------------------------------
-- This script will dump the current redolog file.
-- If the log file is big, this might take some time.
--
--
-- Original Author : Riyaj Shamsudeen
-- Modifications by: Frits Hoogland
-- No implied or explicit warranty !
----------------------------------------
set serveroutput on size 1000000
declare
  v_sqltext varchar2(255);
begin
 select 'alter system dump logfile '||chr(39)||member||chr(39) into v_sqltext
  from v$log lg, v$logfile lgfile
  where lg.group# = lgfile.group# and
  lg.sequence# = ( select sequence# from v$log where status='CURRENT' )
  and rownum <2;
  dbms_output.put_line ('Executing :'||v_sqltext);
  execute immediate v_sqltext;
end;
/

In step 2, I got the following gdb output on both the inserts:

(1st session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860
kcrfw_copy_cv

(2nd session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860
kcrfw_copy_cv

This is as expected based on our observation, because the insert statement only reads the global SCN. Now let’s dump the current redologfile as shown in step 3 above, and see if we can find SCN 5c860.
Here are the redo records for both changes:

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.0010 LEN: 0x01a8 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.01b8 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

Watch the SCN from the gdb session and at the line with SCN and the SUBSCN value being the same. Also watch the RBA (redo block address) being different.

Both redo records contain 3 changes:

(1st session)
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0005c801 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:32 AFN:3 DBA:0x000000f9 OBJ:4294967295 SCN:0x0000.0005c800 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600c OBJ:20455 SCN:0x0000.0005c857 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

(2nd session)
CHANGE #1 CON_ID:0 TYP:0 CLS:35 AFN:3 DBA:0x00000110 OBJ:4294967295 SCN:0x0000.0005c7b2 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:36 AFN:3 DBA:0x00000895 OBJ:4294967295 SCN:0x0000.0005c78c SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600e OBJ:20455 SCN:0x0000.0005c858 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

These 3 transactions are consistent with an insert of a single row, this is visible by the opcodes:
5.2=update rollback segment header
5.1=undo block
11.2=insert row piece

This means the DML of the two inserts have exactly the same SCN! A logic thought that you might come up with is: but how is Oracle then able to apply these in the correct order, which is one of the fundamental reasons for having SCNs? For that, it’s important to realise this is just the change to a block, but not a committed transaction yet. In order to get a full understanding of how the SCN mechanism works for at least a single row insert, we need to change the gdb macro from the test above:

awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    if $rip >= 0xccbe340 && $rip <= 0xccbe7b0
      printf "kcsgbsn - SCN: %x (@%x)\n", *0x6001fbb0, $rip
    end
    c
  end
break kcrfw_copy_cv
  commands
    silent
    printf "kcrfw_copy_cv\n"
    c
  end

The function kcsgbsn means: kernel cache service get batched SCN. This function DOES update the global SCN(!). However, this function does not get called during the insert. It does get called when commit is called (or rollback).

So, in order take another step in understanding SCN handling by Oracle, let’s use the updated version of the gdb macro, and insert another row into the table in two sessions, and then also commit:

(1st session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcrfw_copy_cv
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcrfw_copy_cv

(2nd session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcrfw_copy_cv
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcsgbsn - SCN: 7e39c (@ccbe63a)
kcrfw_copy_cv

In both sessions we see the insert (lines 2-4 and 11-13), and then the commit (lines 5-8 and 14-18).
During the insert, we see both sessions using the same SCN (7e39b), which can be verified by dumping the current redologfile:

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.0010 LEN: 0x01c0 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.01d0 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

And we see kcsgbsn() being called. As has been documented on multiple places, the current value of the global SCN is read, and then advanced (get and advance). This can be seen in lines 6 and 7; 0x7e39b is read, and then advanced to 0x7e39c. However, to be honest to my surprise, the second session does the same, as seen in lines 15-17, it advances from 0x7e39b to 0x7e39c too. However it rereads the SCN one additional time. The surprise is the sessions seem to have the same SCN for the commit. Validation from the dump of the current redologfile:

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.0174 LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.01dc LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:19 AFN:3 DBA:0x00000090 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

Yes, both commits (visible by the opcode (OP) 5.4) have the same SCN. At this point I think it’s fitting to theorise why this is happening. I think this is happening and makes sense because the table is protected for consistency at another level in the database, which is the locking mechanisms (TX locks for rows, TM locks for segments), which guarantees that two transactions at the same time do not compromise data consistency. And because of that, it’s not a problem to batch them with the same SCN.

Conclusion
Changes in the database are synchronised by system change numbers, also known as SCNs. SCNs are the way Oracle serialises changes to data. However, this post shows that the Oracle engine tries to maximise the number of changes inside a SCN, and as such a SCN is not an highly granular serialisation mechanism. This is not a problem, since Oracle’s transactional protection mechanisms (like TX and TM locks) do work alongside SCNs to make sure no conflicting changes can happen at the same time. The changes itself do just read and pick the current global SCN from the instance, a commit of these changes does increase the global SCN, but as little as possible. You might find multiple changes in the same SCN, and you might even find commits with the same SCNs.

Bonus topic: finding nemokcscur3.
In case you wondered what this little macro means, and how got the memory addresses:

awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    c
  end

awatch is a memory access watch function. The commands executed when this watch triggers are:
-silent (do not display output)
-if $rip >= 0xccbdfa0 && $rip <= 0xccbe010 (if the instruction pointer contains a value higher or equal to 0xccbdfa0 and the instruction pointer contains a value lower or equal to 0xccbe010)
-printf "kcscur3 – SCN: %x\n", *0x6001fbb0 (print some stuff and print (in hexadecimal format) where the value the pointer 0x6001fbb0 points to)
The text 'kcscur3' in the printf spoils that these addresses are actually the memory addresses where the function kcscur3 is located. How to find these? The starting address can be found in a variety of ways, for example in gdb: 'info function kcscur3', or in the shell using nm: 'nm oracle | grep kcscur3'. It is a bit harder to find the ending address of a function. However, it's still quite simple doable with nm: 'nm -S oracle | grep kcscur3':

[root@o12102fs bin]# nm -S oracle | grep kcscur3
000000000ccbdfa0 0000000000000070 T kcscur3

The first column is the starting memory address of the kcscur3 function, the second column is the length (in hexadecimal) of the function code. 0xccbdfa0+0x70=0xccbe010.

Update oktober 24, 2016:
Roger noted the following: Minor correction, Oracle doesn’t do DML directly, it first generates change vectors which are then applied to the local database which is effectively running in recovery. This way Oracle uniquely guarantees recovery on the primary or physical standby will always do exactly the same thing as the original DML.

Stefan noted the following: Hey Frits. Great work – as always. However i am not quite sure about your kcrfw_copy_cv() PGA interpretation for private redo strands. AFAIK private redo strands reside in permanent memory allocations (in SGA) and you should see memory copy operations between these memory chunks and the public redo buffer(s) on commit. Your GDB breakpoint should only kick in under these circumstances as anything else is handled in private redo (at least on your mentioned database version) and no memory operations on redo buffer. I also modeled a short test case with 12.1.0.1 on OEL 6.7 to verify my theory/assumption and the result was as expected – not quite sure about your observed and mentioned PGA to redo log buffer copy operations then.

I should add that when I run snapper on the insert in my database, the statistics ‘IMU- failed to get a private strand’ and ‘IMU pool not allocated’ are increased, giving a hint no private strands are used.

During investigating I ran once again into statistics in the Oracle database that still provide a useful details, but the actual naming of the statistic is describing a situation that in reality does not exist anymore. The statistics I am talking about are ‘calls to kcmgcs’, ‘calls to kcmgrs’, ‘calls to kcmgas’ and ‘calls to get snapshot scn: kcmgss’.

Disclaimer: this is research. Any of these techniques potentially can crash your instance or leave your database in a corrupted state. Test the techniques used in this article severely before applying it in an actual situation. Use at your own risk.

Back to the ‘calls to’ statistics. To see what I mean here, you can look up the functions in symbol table in the Oracle executable. There are several ways to do that, one way is using gdb:

(gdb) info function ^kcmg
All functions matching regular expression "^kcmg":

Non-debugging symbols:
0x0000000002d12410  kcmgtsUTC
0x0000000002d12be0  kcmgts
0x0000000002d139a0  kcmgtsRSL
0x000000000cc924b0  kcmglscn
0x000000000cc92560  kcmgtsf

See? There are no functions kcmgcs, kcmgrs, kcmgas and kcmgss. Are these statistics of any use then? I think so, although the only really relevant statistic probably is ‘calls to kcmgas’ to indicate an approximate count of the number of calls for a new SCN (gas probably means ‘get and advance’). The meaning of them is documented (at least up to version 12.1) in the online Oracle documentation, Statistics Description:
– ‘calls to get snapshot scn: kcmgss’: Number of times a snapshot system change number (SCN) was allocated. The SCN is allocated at the start of a transaction.
– ‘calls to kcmgas’: Number of calls to routine kcmgas to get a new SCN
– ‘calls to kcmgcs’: Number of calls to routine kcmgcs to get a current SCN
– ‘calls to kcmgrs’: Number of calls to routine kcsgrs to get a recent SCN
(I think ‘routine kcsgrs’ is a typo or a partial improvement and should be either kcmgrs to stay in line with the other (wrong) descriptions, or it should be kcsgrsn (added ‘n’) to be correct)

Would there be a way to find out how this now works? We can’t work from the function names, since I have just proven the functions do not exist anymore (I have no sufficient old enough database to prove this actually, feel free to test this on your old 7/8i/9i/10g database!). However, we can work this out when starting from the other end as seen from what the functions are supposed to do. The functions are supposed to provide a SCN for the session to work with. Since we know the source of SCN values in the SGA (kcsgscn, the global SCN), we can watch which functions do call this variable for a new SCN value or a reference value!

Obtain the address of kcsgscn:

SQL> select ksmfsnam, ksmfsadr from x$ksmfsv
2  where ksmfsnam = 'kcsgscn_';

KSMFSNAM                                                         KSMFSADR
---------------------------------------------------------------- ----------------
kcsgscn_                                                         000000006001FBB0

Now put a access watchpoint on the kcsgscn address for a session using gdb:

$ gdb -p PID
...
(gdb) awatch *0x6001fbb0
Hardware access (read/write) watchpoint 1: *0x6001fbb0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now let’s test something very simple, like a ‘select * from dual’:

Hardware access (read/write) watchpoint 1: *0x6001fbb0

Old value = 331907
New value = 332367
0x000000000cca0f74 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0423 in kcscur3 ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0f74 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0423 in kcscur3 ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()

That’s my session accessing address 0x6001fbb0, alias the global SCN, eight times.
The functions that access 0x6001fbbo in this case are kcsgssn, kcscur3 and kcsgcsn.

When we look at the session statistics of the above executed SQL (please mind there is a difference between the first execution and subsequent executions), and focus on the ‘calls to’ statistics this is what this shows (partial output from Tanel’s snapper):

130, SYS       , STAT, calls to kcmgcs                                           ,             4,
130, SYS       , STAT, calls to get snapshot scn: kcmgss                         ,             2,

If carefully combine the snapper output with the function names the watchpoint gives, it becomes apparent that the amount of the statistic with kcmgcs is equal to the times the kcsgcsn function was called, and the amount of the statistic with kcmgss is equal to the times the kcsgssn function was called.

This is only a very simple example, when you try different SQL’s, especially DML, you will notice a lot of calls to the global SCN are made, because the metadata is protected by SCN’s too. I assume Oracle kept the statistics names the same for the sake of not breaking products and scripts using these statistic names. However in reality the functions using the global SCN are different from how the database truly works.

Recently I was investigating the inner working of Oracle. One of the things that is fundamental to the Oracle database is the SCN (system change number). SCNs are used to synchronise changes in the database. There is one source for SCNs in every instance (kcbgscn; the global or current SCN in the fixed SGA), and there are multiple tasks for which Oracle keeps track of synchronisation using SCNs. A few of these tasks for which Oracle stores and uses SCNs to keep track of progress are on disk SCN and lwn SCN.

This blogpost is about some oddities I found when using gdb (the GNU debugger) to watch memory locations of a running Oracle database. This should not be done on a production instance, and is purely for research purposes. Only use the methods mentioned in this article if you are absolutely sure what you are doing, and/or if you are using an Oracle instance that can be crashed and can be restored.

The situation
I was researching the ‘on disk scn’ in a 11.2.0.2 database. I used the following (simplified) gdb macro:

break kcrf_commit_force
  commands
    print $rip
    c
  end
awatch *0x60022659
  commands
    c
  end

This macro breaks on the function kcrf_commit_force essentially to understand the different phases of a commit, and uses the ‘awatch’ function (access watch; which means this watchpoint fires at both reading and writing) to look at memory location 0x60022659. That memory location is the on disk SCN for this instance.

Upon executing a commit after an insert, I saw the following output:

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$1 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Old value = 898
New value = 899
0x0000000008fb5ad1 in kcscur3 ()

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$2 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()

What this shows, is the following:
– line 1 and 9: the kcrf_commit_force function is called.
– line 11, 15 and 19: the memory location *0x60022659 is read, resulting in the value 899.
– line 3: the memory location *0x60022659 is written, it contained 898, and the new value is 899.

This greatly puzzled me. As far as I understand the way Oracle works, a server process that issues ‘commit’ is supposed to READ the on disk SCN, it does not write the on disk SCN, because the only process that changes the on disk SCN is the logwriter. However, quite clearly gdb tells me the server process is writing it. Further consideration is that IF the server process is writing it, where does this information come from?

After some further investigation, I came to the conclusion that the only way to prove to myself that what I am seeing is correct, is to create a program myself mimicking the behaviour witnessed above to truly understand what is going on, because I don’t believe what I am seeing is correct.

I created two little c programs. One that starts up, creates a shared memory segment, writes a letter into the shared memory segment and then waits for a keystroke, and puts the keystroke into the same location, and one that reads the same shared memory segment, then waits for enter to pressed, and reads the shared memory segment again, and outputs if the letter read from the shared memory is different. I called the first one ‘server’, and the second one ‘client’. This way, I can attach gdb to the client executable, which very much like an Oracle server process, reads a shared memory location which I can change at will.

The code can be found on GitHub: shared memory test source. Please mind this is simple proof of concept code that is not exhaustive in error checking and other sanity checks in any way. The source files can be compiled and linked in the following way:

$ cc server.c -o server
$ cc client.c -o client

The server executable can be run in the following way:

$ ./server
Type a letter. q quits.

This creates the shared memory segment, and puts the ‘a’ character in the variable c (line 42, c = ‘a’). As long as the c variable is not ‘q’, it writes the c variable into shared memory segment using the pointer s (*s = c) , and then waits for user input using the scanf function.

Once the shared memory segment is created and written into, the client program can be started:

$ ./client
client started. ^C to quit.
waiting for enter

The first time enter is pressed it will output ‘keystroke: a’ because the p variable was not initialised, and will be assigned the ‘a’ character (p = *s), the second time enter is pressed, ‘keystroke:’ is not printed, because the p variable and the memory location the pointer *s points to both contain ‘a’. What is vital, is that the client program NEVER writes, only reads the shared memory location, in three occasions: in line 50:

if( p != *s) {

In line 51:

printf("keystroke: %c\n", *s);

And in line 52:

p = *s;

Okay. So at this point we got a program that mimics what I think an Oracle server process is doing too on commit time (reading shared memory), which is the program ‘client’, and we got a program which creates and can manipulate the shared memory location the client is reading from, which is ‘server’. Startup the server program (./server), and startup the client using gdb:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb)

This is different than attaching to a process using gdb -p PID, but attaching to a process using ‘-p’ is the only way that I am aware of to use gdb on Oracle database processes. Now that we started gdb with the client program, we must first run it in order to see the program’s runtime environment:

(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter

Now invoke gdb (using control-C) and look at the memory mappings to see where the shared memory segment is:

(gdb) info proc mappings
process 3069
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]

Line 19 contains the shared memory segment, at memory address 0x7ffff7ff8000. To do the same as watching the on disk SCN location in Oracle, put an access watchpoint at 0x7ffff7ff8000:

(gdb) awatch *0x7ffff7ff8000
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now press enter, and we see the shared memory location being accessed three times:

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

If you take the source code, you’ll see that these are the following lines of code at rows 50-52:

       if( p != *s ) {
          printf("keystroke: %c\n", *s);
          p = *s;

Every line reads the shared memory location via the pointer *s. It’s also clear it’s a read because the value in the memory location (97) is displayed once.

Now go over to the server side, and press ‘b’ and enter. Now the shared memory segment contains ‘b’. Next, go back to the client program, and press enter:

waiting for enter

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

Hey! This shows what looks like a write with the first watchpoint, because the memory location is touched, and it reports the value at the location being changed. However, because we know the source code, we know we only read the shared memory location with the client program, not write to it! So what this shows in this case is that gdb knew what the previous value was of the memory location was, and detected that it was changed.

To see how a true write looks like, you can startup the server executable with gdb too, and set an ‘awatch’ watchpoint on the shared memory location exactly the same way as shown with the client program above. This is how that looks like:

(gdb) c
Continuing.
b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x00000000004006e9 in main ()

So conclusion here is that you can’t see the difference between a write to a memory location or a change of value when reading when using an ‘awatch’ watchpoint.

Now that we understand how ‘awatch’ works, would there be a way to make the distinction between reading and writing? Outside of ‘awatch’ (access watchpoint), there’s the ‘watch’ command, which is a watchpoint on writing to a memory location and ‘rwatch’, which is a watchpoint on reading a memory location. This is how the output of ‘rwatch’ looks like in the client program:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7b23670 in __read_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info proc map
process 3517
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]
(gdb) rwatch *0x7ffff7ff8000
Hardware read watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

At line 1 gdb is started with the ‘client’ program. Once inside gdb, the executable is run (line 12). Once it’s running we interrupt in order to get to gdb (line 16). In gdb, the info proc map command is used (line 20) to look at the memory mappings. From the memory mappings, we see address 0x7ffff7ff8000 is the shared memory segment. A read watch is set and a c (continue) command is added to the read watchpoint (line 46 and 48). Then the execution is continued with the c command (line 53).

Now we are running in the executable again. After continuing, I pressed enter to go through the loop in the client executable, and the watchpoint is triggered three times (lines 56, 60 and 68), as we did see earlier, for source code lines 50, 51 and 52.

Now I changed the letter in the shared memory segment from ‘a’ to ‘b’ with the server executable, and pressed enter at line 60. The interesting things that happens is that the read watchpoint is triggered TWO times. This should be three times (!!).

There is a way to make the now invisible read visible again, using the following command in gdb:

(gdb) watch *0x7ffff7ff8000
Hardware watchpoint 2: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Yes, that’s a WRITE watchpoint. After adding a write watchpoint, this is how the output looks like when I change the letter from ‘b’ to ‘c’:

Hardware watchpoint 2: *0x7ffff7ff8000

Old value = 98
New value = 99
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040071e in main ()
keystroke: c
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040073c in main ()
waiting for enter

There we have the tree reads visible again, however, one is ‘disguised’ as a write. To conclude: it seems gdb is not able to make a distinction between reads and writes for watchpoints if the watchpoint memory location is changed outside of the executable active in the debugger. Gdb seems to determine memory access a write if the value found is different from the last it was read.

Bonus information, not applicable to Oracle:
When you have the source code of an application, you can make debugging even better by adding in debug information for gdb in the executable. This is done by adding the ‘-g’ flag to the cc command:

$ cc -g client.c -o client

With the debug information available, it’s easier to debug:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...done.
(gdb) tbreak 1
Temporary breakpoint 1 at 0x40067c: file client.c, line 1.
(gdb) run
Starting program: /home/oracle/src/shm_test/client

Temporary breakpoint 1, main () at client.c:22
22	    key = 666;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info local
shmid = 0
key = 0
shm = 0x400590 "1\355I\211\321^H\211\342H\203\344\360PTI\307\300P\a@"
s = 0x7fffffffe550 "\001"
p = 0 '\000'
(gdb) awatch s
Hardware access (read/write) watchpoint 2: s
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Old value = 0x7fffffffe550 "\001"
New value = 0x7ffff7ff8000 "a"
main () at client.c:45
45	    printf("client started. ^C to quit.\n");
(gdb) c
Continuing.
client started. ^C to quit.
waiting for enter

Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040070f in main () at client.c:50
50	       if( p != *s ) {
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040071b in main () at client.c:51
51	          printf("keystroke: %c\n", *s);
(gdb) c
Continuing.
keystroke: a
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x0000000000400739 in main () at client.c:52
52	          p = *s;
(gdb) c
Continuing.
waiting for enter

Above you see starting up the executable with gdb at line 1. After gdb is started, I first set a temporal breakpoint on line 1 (of the source code, shown on line 12 above), to have gdb ‘enter’ the executable and stop when I run it. If I wouldn’t do that, the program would ‘wait’ in the getchar() function, which is a function in libc, and the libc library would be my scope, line 19 tells that I have not loaded debug symbols for glibc. After setting the breakpoint, I run the executable, which immediately breaks on the temporal breakpoint (line 14).

Now that the executable is active, I can list the variables used using the info local command (line 20). Also, I can simply set a watchpoint on the variable s (line 26), instead of having to search the memory address myself. When the watchpoint is triggered, it now shows the line number in the source file, and the actual source code on that line (lines 34, 35, 44, 45, etc).

The intention of this blogpost is to show the Oracle wait time granularity and the Oracle database time measurement granularity. One of the reasons for doing this, is the Oracle database switched from using the function gettimeofday() up to version 11.2 to clock_gettime() to measure time.

This switch is understandable, as gettimeofday() is a best guess of the kernel of the wall clock time, while clock_gettime(CLOCK_MONOTONIC,…) is an monotonic increasing timer, which means it is more precise and does not have the option to drift backward, which gettimeofday() can do in certain circumstances, like time adjustments via NTP.

The first thing I wanted to proof, is the switch of the gettimeofday() call to the clock_gettime() call. This turned out not to be as simple as I thought.

Because a lot of applications and executables need timing information, which is traditionally done via the gettimeofday() call, the Linux kernel maintainers understood the importance of making this call as performant as possible. Calling a (hardware) clock means you request information from a resource on your computer system. Requesting something like that requires a program to switch to kernel mode first. In order to lower the resources and time needed to get time information, the Linux kernel includes a “trick” to get that information, which is known as a virtual system call or vsyscall. Essentially this means this information is provided in userspace, so there are lesser resources needed, and there is no need to switch to kernel mode. James Morle has an excellent article describing this, this line is a link to that. By staying in userspace, the gettimeofday() and clock_gettime() calls are “userland” calls, and do not show up when using “strace” to see system calls of a process executing.

However I said it wasn’t as easy as I thought. I was looking into this, and thought making the vsyscalls visible by echoing “0” in /proc/sys/kernel/vsyscall64. However, I am working with kernel version 3.8.13 for doing this part of the research….which does not have /proc/sys/kernel/vsyscall64, which means I can’t turn off the vsyscall optimisation and make both gettimeofday() and clock_gettime() visible as systemcall.

Searching for kernel.vsyscall64 on the internet I found out that with early versions Linux kernel version 3 vsyscall64 has been removed. This means I can’t use a simple switch to flip to make the calls visible. So, instead of going straight to the thing I wanted to research, I got stuck in doing the necessary preparing and essential preliminary investigation for it.

Can I do it in another way? Yes, this can be done using gdb, the GNU debugger. Start up a foreground (Oracle database) session, and fetch the process ID of that session and attach to it with gdb:

gdb -p PID
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...etc...
(gdb)

Now simply break on gettimeofday and clock_gettime, and make gdb notify you it encountered the call. First 11.2.0.4:

(gdb) break gettimeofday
Breakpoint 1 at 0x332fc9c7c0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>printf "gettimeofday\n"
>continue
>end
(gdb) break clock_gettime
Breakpoint 2 at 0x3330803e10
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "clock_gettime\n"
>continue
>end
(gdb) 

You can save this gdb macro to time.gdb by executing “save breakpoints time.gdb”. Now execute “c” (continue) and enter, to make the process you attached to running again. Execute something very simple, like:

SQL> select * from dual;

This results in Oracle 11.2.0.4:

(gdb) c
Continuing.
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
gettimeofday
...etc...

That is expected, we already knew the Oracle database is executing the gettimeofday function a lot. Now let’s do exactly the same, but with Oracle version 12.1.0.2. If you saved the breakpoints and macro’s, you can attach to an Oracle 12.1.0.2 foreground process with gdb and execute ‘source time.gdb’ to set the breakpoints and macro’s. When the ‘select * from dual’ is executed in this version of the database, it looks like this:

(gdb) c
Continuing.
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
gettimeofday
clock_gettime
clock_gettime
clock_gettime
clock_gettime
clock_gettime
...etc...

It is clearly (mostly) executing the clock_gettime() function.

The clock_gettime() function can use a variety of time sources. If you read the manpage of clock_gettime you will see that the first argument is the clock source. You can see the clock sources in the kernel source file Linux/include/uapi/linux/time.h, which shows:

/*
 * The IDs of the various system clocks (for POSIX.1b interval timers):
 */
 #define CLOCK_REALTIME                  0
 #define CLOCK_MONOTONIC                 1
 #define CLOCK_PROCESS_CPUTIME_ID        2
 #define CLOCK_THREAD_CPUTIME_ID         3
 #define CLOCK_MONOTONIC_RAW             4
 ...

The first argument of clock_gettime is the type of clock, so if I remove the macro with clock_gettime, execution stops when clock_gettime is called:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
        silent
        printf "clock_gettime\n"
        c
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000332fc9c7c0 <gettimeofday>
	breakpoint already hit 2 times
        silent
        printf "gettimeofday\n"
        c
2       breakpoint     keep y   0x0000003330803e10 <clock_gettime>
	breakpoint already hit 23 times
(gdb) c
Continuing.

Now execute something in the sqlplus session. What will happen in the gdb session is:

Breakpoint 2, 0x0000003330803e10 in clock_gettime () from /lib64/librt.so.1
(gdb)

Now look up the first argument of the call:

(gdb) print $rdi
$1 = 1

So Oracle is using CLOCK_MONOTONIC. Not the point of this article, but this means Oracle database time measurement is granular on the microsecond layer.

Now let’s look how much time the Oracle wait interface takes itself. The Oracle wait interface is using the functions kslwtbctx() (kernel service layer wait begin context) and kslwtectx() (kernel service layer wait end context) to start and stop measuring a wait event. Please mind that instead of looking at the time the wait interface provides, this means looking at the time that is taken executing in the kslwtbctx() and kslwtectx() functions. This can be done using systemtap:

global kslwtbctx_time, kslwtectx_time, kslwtbctx_count=0, kslwtbctx_tot=0, kslwtectx_count=0, kslwtectx_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	if ( pid() == target() ) {
		kslwtbctx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx").return {
	if ( pid() == target() ) {
		printf("kslwtbctx: %12d\n", local_clock_ns()-kslwtbctx_time)
		kslwtbctx_tot+=local_clock_ns()-kslwtbctx_time
		kslwtbctx_count++
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	if ( pid() == target() ) {
		kslwtectx_time=local_clock_ns()
	}
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx").return {
	if ( pid() == target() ) {
		printf("kslwtectx: %12d\n", local_clock_ns()-kslwtectx_time)
		kslwtectx_tot+=local_clock_ns()-kslwtectx_time
		kslwtectx_count++
	}
}
probe end {
	printf("\nkslwtbctx: avg %12d\nkslwtectx: avg %12d\n",kslwtbctx_tot/kslwtbctx_count,kslwtectx_tot/kslwtectx_count)
}

This systemtap script has a ‘begin probe’, which executes once the systemtap script starts running. I simply print ‘Begin.’ with a newline. The idea is that it prompts me once the systemtap script is actually running.

Then there is a (userspace) process based probe for the oracle process. There are two probes for both the kslwtbctx and the kslwtectx function in the oracle process. The first one (.function(“kslwtbctx”)) fires when the function is entered, the second one (.function(“kslwtbctx”).return) fires when the function has ended.

The ‘if ( pid() == target() )’ function filters all the invocations and returns of the probed functions for the PID set by “-x PID” parameter. Otherwise any invocation of the probed function by any process would be picked up.

The entering probe simply records the time in nanoseconds in a variable. The returning probe subtracts the previous recorded time from the current time, which means the time between entering and returning is shown. Also, the returning probe adds the time the function took to another variable, and counts the number of times the return probe has fired.

The end probe shows the total time spend in each of the two functions, divided by the number of times the return probe was fired. This way the average time spend in the two functions is calculated. As you will see, the time spend in the function varies.

When this is executed against an Oracle foreground session, this is how it looks like:

# stap -x 2914 wait_interface.stap
Begin.
kslwtectx:         9806
kslwtbctx:         3182
kslwtectx:         1605
kslwtbctx:         1311
kslwtectx:         4200
kslwtbctx:         1126
kslwtectx:         1014
kslwtbctx:          840
kslwtectx:         4402
kslwtbctx:         2636
kslwtectx:         2023
kslwtbctx:         1586
^C
kslwtbctx: avg         2165
kslwtectx: avg         4305

The time measured is in nanoseconds. The average wait interface overhead is roughly 6 microseconds including systemtap overhead on my system.

The obvious thought you might have, is: “why is this important?”. Well, this is actually important, because the 6us dictates what the wait interface should measure, and what it should not measure. What I mean to say, is that anything that is called inside the Oracle database for which the time spend is in the same ballpark as the wait interface overhead or lower, should not be measured by the wait interface, because the measurement would influence the overall performance in a negative way.

A good example of this are latch gets. The Oracle database does not instrument individual latch gets via the wait interface, but rather exposes waiting for a latch via the wait interface when a process has spun for it (a latch is a spinlock), and decides to go to sleep (on a semaphore) waiting to get woken once the latch becomes available.

However, using systemtap we can actually measure latch gets! Here is a simple script to measure the latch gets for non-shared latches via the kslgetl() function:

global kslgetl_time, kslgetl_count=0, kslgetl_tot=0

probe begin {
	printf("Begin.\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl") {
	if ( pid() == target() )
		kslgetl_time=local_clock_ns()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslgetl").return {
	if ( pid() == target() ) {
		printf("kslgetl: %12d\n", local_clock_ns()-kslgetl_time)
		kslgetl_tot+=local_clock_ns()-kslgetl_time
		kslgetl_count++
	}
}
probe end {
	printf("\nkslgetl: avg %12d\n",kslgetl_tot/kslgetl_count)
}

This systemtap script follows the exact same structure and logic of the previous systemtap script.

This is how this looks like on my system when executed against a database foreground session executing ‘select * from dual’:

# stap -x 2914 kslgetl.stap
Begin.
kslgetl:         3363
kslgetl:          786
kslgetl:          744
kslgetl:          782
kslgetl:          721
kslgetl:          707
kslgetl:         1037
kslgetl:          728
kslgetl:          711
kslgetl:          736
kslgetl:          719
kslgetl:          714
kslgetl:         1671
kslgetl:          929
kslgetl:          968
kslgetl:          919
kslgetl:          883
kslgetl:          869
kslgetl:         3030
kslgetl:          750
^C
kslgetl: avg         1362

As you can see, the measured average time spend inside the kslgetl function is 1.3us on my system, which includes systemtap overhead. Clearly the time for taking a latch is less than the wait interface takes, which means not instrumenting the kslgetl() function in the wait interface is a sensible thing.

This means that with the current state of the wait interface, it does not make sense to measure a lot more very fine grained events, even though the timer can time on microsecond granularity. Please mind I am not saying that it does not make sense to detail the response time, I think with modern computer systems with a lot of memory the Oracle database can run more and more without needing to wait for things like disk IOs. This means modern database servers can spend a lot of time just running on CPU, making it hard to understand in what routines the time is spend. Tuning on CPU execution requires an insight into where time is spend. The only option to understand how CPU time in Oracle is composited, is by using external (operating system based) tools like perf and flame graphs to detail CPU time. It would be great if an option would exist in the database to detail on CPU time.

I guess everybody who is working with Oracle databases and has been involved with Oracle Exadata in any way knows about smartscans. It is the smartscan who makes the magic happen of full segment scans with sometimes enormously reduced scan times. The Oracle database does smartscans which something that is referred to as ‘offloading’. This is all general known information.

But how does that work? I assume more people are like me, and are anxious to understand how that exactly works. But the information on smartscans is extremely scarce. Of course there is the Oracle public material, which looks technical, but is little/nothing more than marketing. On My Oracle Support, I can’t find anything on the inner working. Even in the ‘Expert Oracle Exadata’ book (which I still regard as the best source of Exadata related information) there is no material on the mechanics of smartscans.

I’ve written a couple of articles on smartscans, of which this article already lays some groundwork, it describes the phases I could see with the available information at that time: oss_open, oss_ioctl followed by oss_wait and oss_cread followed by oss_wait. This is actually a summary of a smartscan, but a very brief one. In this article I described that a smartscan can only happen with a full segment scan (kdstf* functions, Oracle’s fast full scan routines) and if it chooses to use direct path (kcbld* functions, direct path loader) access, which is actually mandatory to get smart scans.

The following investigation is done on an Exadata X2-2 Quarter rack, with Image version: 12.1.1.1.0.131219, and database version 12.1.0.1.3.

In order to get more understand on smartscans, we can use Oracle’s new debugging syntax. The part we are going to look at is called ‘KXD’:

SYS@db12c2 AS SYSDBA> oradebug doc component kxd

  KXD			       Exadata specific Kernel modules (kxd)
    KXDAM		       Exadata Disk Auto Manage (kxdam)
    KCFIS		       Exadata Predicate Push (kcfis)
    NSMTIO		       Trace Non Smart I/O (nsmtio)
    KXDBIO		       Exadata Block level Intelligent Operations (kxdbio)
    KXDRS		       Exadata Resilvering Layer (kxdrs)
    KXDOFL		       Exadata Offload (kxdofl)
    KXDMISC		       Exadata Misc (kxdmisc)

In order to get the tracing of the Exadata (database-) kernel modules, along with regular sql tracing with waits (to understand when something is happening, use the following events:

FRITS@db12c2 > alter session set events 'trace[kxd.*]';

Session altered.

FRITS@db12c2 > alter session set events 'sql_trace level 8';

Session altered.

Now execute a SQL that does smartscans. I’ve made sure table ‘T’ is big enough to invoke a direct path full table scan:

FRITS@db12c2 > select count(*) from t;

  COUNT(*)
----------
   1000000

Now take a peek in the trace file! The first part is normal looking:

PARSING IN CURSOR #139755290955216 len=22 dep=0 uid=201 oct=3 lid=201 tim=1593707470857 hv=2763161912 ad='224d632b8' sqlid='cyzznbykb509s'
select count(*) from t
END OF STMT
PARSE #139755290955216:c=2000,e=2417,p=0,cr=1,cu=0,mis=1,r=0,dep=0,og=1,plh=2966233522,tim=1593707470856
EXEC #139755290955216:c=0,e=25,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2966233522,tim=1593707470920
WAIT #139755290955216: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=14 tim=1593707470968
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 215 name|mode=1263468550 2=131242 0=2 obj#=14 tim=1593707471374
WAIT #139755290955216: nam='reliable message' ela= 1035 channel context=10126085744 channel handle=10164799536 broadcast message=10179010104 obj#=14 tim=1593707472530
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 108 name|mode=1263468550 2=131242 0=1 obj#=14 tim=1593707472684
WAIT #139755290955216: nam='enq: KO - fast object checkpoint' ela= 101 name|mode=1263468545 2=131242 0=2 obj#=14 tim=1593707472829

We see the parsing of the simple select statement, and the execution, which yields some waits which are always there (the sqlnet message wait), and then some ‘enq: KO – fast object checkpoint’ waits, indicating a checkpoint, which is a sign of a direct path read.

The next part is interesting, because this is the smartscan-specific tracing:

Caching: Global context initialized 0x7f1b50ca6d20
kcfis_alloc_so 0x24d8994d8
In kcfis initialize: new init: app_state: 0x7f1b50ca67b0 app_type: 1
kcfis rcv update : op: 7 val: 1 so_numses 1 ovhdmem 0 mdmem 0 bufmem 0
kcfis_reinitialize: initializing queues
Set work des: global_ctx: 0x7f1b50ca6d20 app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
Automem enabled: app_state: 0x7f1b50ca67b0, mmwds: 0x22446a1a8
No match found for mmwds. Allocated wds 0x7f1b50ca6768, mmwds 0x22446a1a8
Cache version is 1 start cache version is 1
kcfis rcv update : op: 1 val: 1496 so_numses 1 ovhdmem 0 mdmem 1496 bufmem 0
kcfis rcv update : op: 1 val: 68 so_numses 1 ovhdmem 0 mdmem 1564 bufmem 0
oss_state->oss_context is 0x109d2db0
kcfis rcv update : op: 3 val: 69656 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 0
kcfis_initialize done

What we see here is the kcfis layer initialising memory. It’s interesting to see where the allocations are done. In general, on this system, the allocations in the 0x7f1bxxxxxxxx are in the PGA heap, kcfis_alloc_so/0x24d8994d8 in the SGA, and the mmwds/0x22446a1a8 is in the SGA too.
This information can be obtained by dumping heaps. Dumping the heaps at level 7 will show sga, session, pga, call and uga heaps. Another way to get insight into a memory locations is using Tanel Poder’s fcha script (Find Chunk Address). Please mind that if you are reading the kxd trace file and want to look up the addresses in a second window in a sqlplus / as sysdba session, this will allow you to see the SGA chunks, but probably not the PGA chunks, because these are private to the traced session.

Let’s get on to the next chunk of trace lines:

In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=(nil)
kcfis_translate: source: 1
kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188499 blkcnt: 13 rdba: 55714451 Fno: 13 Bno: 1188499 reqid=1 cver=1 source=1
OSSIPC:SKGXP:[109be210.0]{0}: (25797 <- 13487)SKGXPDOAINVALCON: connection 0x109c8c30 admno 0x5ac93f1d scoono 0x321bdad5 acconn 0x63cd0678 getting closed. inactive: 0
OSSIPC:SKGXP:[109be210.9]{obj}: SKGXPCNH: 0x109c8390 SKGXPCON_OPEN (2) sconno 321bdad4 accono 1206dea3 admno 75f4dfcd ospid 13487 ANT
OSSIPC:SKGXP:[109be210.10]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.11]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.12]{obj}:        SSKGXPT 0x109c83d0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 22774 lerr 0
OSSIPC:SKGXP:[109be210.13]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.14]{obj}:        SSKGXPT 0x109c84a0 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 51931 lerr 0
OSSIPC:SKGXP:[109be210.15]{obj}:   next seqno 32768 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
OSSIPC:SKGXP:[109be210.16]{obj}: SKGXPCNH: 0x109c8c30 SKGXPCON_CLOSED (1) sconno 321bdad5 accono 63cd0678 admno 7b54ac1c ospid 13487 ANT
OSSIPC:SKGXP:[109be210.17]{obj}:   cookie [04030201010100001f2f65030b000000:..........e.....(16)]
OSSIPC:SKGXP:[109be210.18]{obj}:   Remote admin port
OSSIPC:SKGXP:[109be210.19]{obj}:        SSKGXPT 0x109c8c70 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 29832 lerr 0
OSSIPC:SKGXP:[109be210.20]{obj}:   Remote data port
OSSIPC:SKGXP:[109be210.21]{obj}:        SSKGXPT 0x109c8d40 flags 0x2 { WRITE } sockno 12 IP 192.168.12.8 RDS 65116 lerr 0
OSSIPC:SKGXP:[109be210.22]{obj}:   next seqno 32765 last ack 32763 credits 2 total credits 2 ertt 16 resends on con 0
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188499, 13) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723752448, 106496) Mirr_num: 0 reqid=1 cver=1 source=1
kcfis_get_disk_for_translation: Appliance 192.168.12.8/ does not exist
kcfis rcv update : op: 5 val: 8192 so_numses 1 ovhdmem 69656 mdmem 1564 bufmem 8192
kcfis_open_appliance: 
throttle: initialized for appliance 0x7f1b5078b3a8
kcfis_find_appliance_fd: appliance fd not found. appliance 192.168.12.8/ Cached appliance fd count 0
kcfis_open_appliance_fd: 0x7f1b5078b3a8
WAIT #139755290955216: nam='cell smart table scan' ela= 120 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707489569
Predicate device intelligent IO opened. fd 5
kcfis_init_appliance_fd: appliance fd 0x7f1b50da2b88 initialized for appliance 0x7f1b5078b3a8
kcfis_add_disk: Adding disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 under appliance = 0x7f1b5078b3a8
initialize disk for disk o/192.168.12.8/DATA_CD_06_enkcel04
kcfis_initialize_disk_fd: Disk initialized. appliance: 192.168.12.8/ disk: o/192.168.12.8/DATA_CD_06_enkcel04 fd: 4 disknumber: 1 incarnation: 6 prev_disknumber: 0 num_inits: 1 init_cache_ver: 1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188499, numblks = 13, disk_handle = 4, diskoffs = 513723752448, len = 106496, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a000 reqid=1 cver=1 source=1
Default: calc numbufs mem 1048576
Final: calc numbufs mem 1048576 buflen 1048576
Num buffers: 1 buf per appliance: 1 num active appliance: 1
Appliance 0x7f1b5078b3a8 active. Active count 1

There’s a lot to see here. The next step in doing a smartscan is the translation of the data dictionary information on the segment to be smartscanned to cell server and grid disk extents. The data dictionary information is shown in line 3; ‘kcfis_get_new_request’, in line 4; ‘kcfis_get_translation:before’ the session tries to translate the data dictionary information to cell server and grid disk. Because the kcfis context is just initialised, there is no cell related information yet. For that reason, the information is requested from the cell server (the OSSIPC:SKGXP lines). Please mind this request will send all disk related information to the kcfis context of the process. With this information, the process can make the translation, as can be seen in line 20: ‘kcfis_get_translation:after’, it shows where the extent is located in the well known exadata notation: ‘o/cell ip/grid disk name’, together with offset and chunk length. Next the disk needs to be initialised (a disk is a combination of cell server or ‘appliance’ and grid disk), and given a hash value, as is indicated by ‘kcfis_get_disk_for_translation’ in line 21. However, in order for a disk to be initialised, the cell server or appliance must be initialised too. That is what the process is indicating in line 21: Appliance 192.168.12.8/ does not exist. The appliance (cell server) is initialised/opened, which is what is shown in lines 23-29. The wait here is initialising a connection with the cell server, waiting for an acknowledgement. Now the appliance is initialised, the disk is initialised, as indicated in lines 30-32. Next line 33 shows the translation which was started earlier new is finally been done. Line 34-37 show something about the buffering which seems to be arranged per appliance, and is 1MB.

Now that the process has initialised the appliance and the disk, the next translation is done. The translations are done per extent of the segment, and the disk is depended on the placing of the extent. Please mind the maximum size of the allocation is depended on the AU (allocation unit) size of ASM, which is set to 4MB by default with Exadata. If a non-initialised appliance is encountered, it is initialised and opened, and if a non-initialised disk is encountered, this is initialised.

The translation looks like this if everything is initialised:

kcfis_get_new_request: obtained new piece to translate:fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 
kcfis_get_translation:before: kcfis_req: fob: 0x24901a2a0 startblk: 1188513 blkcnt: 15 rdba: 55714465 Fno: 13 Bno: 1188513 reqid=2 cver=1 source=1
kcfis_get_translation:after: Translation (disk,off,len of (0x24901a2a0, 1188513, 15) to (o/192.168.12.8/DATA_CD_06_enkcel04, 513723867136, 122880) Mirr_num: 0 reqid=2 cver=1 source=1
Translated fields: disk = 0x7f1b50c953f0, blkno = 1188513, numblks = 15, disk_handle = 4, diskoffs = 513723867136, len = 122880, path_asmname = o/192.168.12.8/DATA_CD_06_enkcel04, disk_num = 1, req_element = 0x7f1b50d0a220 reqid=2 cver=1 source=1

Another thing which is important to notice is the ‘reqid’, which obviously means ‘request id’. This process is repeated until the complete segment is translated into requests.

Once the requests are translated, the next step in the smartscan is to send (‘push’) the requests to the appliances. This apparently is called a ‘payload map’.

kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160
kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d0a000
Pushing request : disknumber = 1, offset = 513723752448, len = 106496 rdba: 55714451 version 0 reqid=1 cver=1
disk=0x7f1b507908e0 state=1

The first line is the start of the pushing of the payload maps. The next line shows a specific appliance being chosen. What is shown next is a line showing ‘disk’ and the hash value of the disk. At the start of a maps push, the disk lines are followed by two lines saying ‘trans_req_element’ and ‘Pushing request’. These two lines probably are some kind of state object for the request, and the actual pushing of the request. Here we see the request id back which we saw in the translation phase.

This is repeated, until some of the disk lines are starting to get followed immediately by another disk line:

disk=0x7f1b50cfdf80 state=1
trans_req_element = 0x7f1b50d12c40
Pushing request : disknumber = 7, offset = 513817985024, len = 1032192 rdba: 55716738 version 0 reqid=67 cver=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
trans_req_element = 0x7f1b50d2ffa0
Pushing request : disknumber = 9, offset = 513890353152, len = 4161536 rdba: 55729668 version 0 reqid=174 cver=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1

Further down in the trace file, the trans_req_element and Pushing request lines are becoming scarce:

disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
disk=0x7f1b50c953f0 state=1
trans_req_element = 0x7f1b50d31920
Pushing request : disknumber = 1, offset = 498446925824, len = 4161536 rdba: 41208836 version 0 reqid=186 cver=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1

Inspection of these lines show that the process is going through a strict sequence of disks of that appliance, and picks up one request per disk which (obviously) belongs to that disk. If the requests are not evenly divided between the disks, some disks will have all the requests already pushed to that disk, while other disks still need additional requests. In that case, the disk which already have their requests pushed will not get a request, so no trans_req_element/Pushing request combination. The process goes through this until all the requests for that appliance are pushed.

After the push of all the requests for that appliance, the following is happening:

kcfis_create_maps_payload. alloc_len 4088 num maps 55
throttle: mappayload: maps_to_push 7 iosize_being_pushed 150953984
kcfis_metadata_payload_len: app_state 0x7f1b50ca67b0 appliance 0x7f1b5078b3a8 payload_len 5968 payload_hdr_len 96 sessiondata_payload_len 144
metadata_payload_len 1536 fmetadata_payload_len 100 maps_len 3992 exthdr_len 8 planpayload_len 48 oflgrppayload_plen 40
kcfis_create_metadata_payload. appliance 0x7f1b5078b3a8 payload 0x7f1b50da3190 payload_memlen 5968 maps 0x7f1b50d07fe0 mapslen 3992
kcfis_create_metadata_payload: pushing sessiondata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing capability payload: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: dop: 1kcfis_create_metadata_payload: pushing metadata: appliance 0x7f1b5078b3a8
kcfis_create_metadata_payload: pushing fast metadata: appliance 0x7f1b5078b3a8
kcfis_push: pushing metadata to appliance 0x7f1b5078b3a8. metadata 0x7f1b50da3190
kcfis_issue_ioctl: payload_type 1
WAIT #139755290955216: nam='cell smart table scan' ela= 178 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707571673
Ioctl completed. Payload type 1
Ioctl quarantine response 1 for appliance 0x7f1b5078b3a8
appliance 0x7f1b5078b3a8 : cellsrv pid: 13487: predicate: /box/predicate735745

The first line shows 55 maps have been pushed to the appliance. The other lines are various memory locations which are needed for gathering the reads which will be send back by the appliances. Further things which seem important are line 11, which issues a ioctl (IO control) request to the appliance, and waits for acknowledgement. Mind the wait is always ‘cell smart table scan’. Line 15 shows this request gets a predicate, which is ‘/box/predicate735745’.

This is repeated for every appliance.

Then the next thing happens:

kcfis_create_maps_payload. appliance 0x7f1b5078b3a8 num_disks 12
throttle: allowing map push: appliance 0x7f1b5078b3a8, maps_to_push 7
disk=0x7f1b50c953f0 state=1
disk=0x7f1b507908e0 state=1
disk=0x7f1b507903f0 state=1
disk=0x7f1b50d04fc0 state=1
disk=0x7f1b50cfe960 state=1
disk=0x7f1b50cfdf80 state=1
disk=0x7f1b50cfc930 state=1
disk=0x7f1b50cc8118 state=1
disk=0x7f1b50cc7ae0 state=1
disk=0x7f1b50cc5ab0 state=1
disk=0x7f1b50cc3a80 state=1
disk=0x7f1b50cc30a0 state=1
kcfis_create_maps_payload. alloc_len 200 num maps 0

The maps payload push is done again for all the appliances, without any request being pushed. The last line confirms no maps/requests having been pushed: num maps 0. However, there is one line which hasn’t been there before: line 2 ‘throttle’, more specifically important in this line is ‘allowing map push’, the previous throttle during ‘kcfis_create_maps_payload’ had the remark ‘mappayload’.

This means that at this point the physical extents to be scanned on the appliances (cell servers) have been identified, translated to appliance, grid disk, offset and size, and the requests for these extents have been send to the appliances. The last snippet actually means that the appliance is notified to start preparing for sending results back.

After the appliances have been notified, memory is initialised again.

Default: calc numbufs mem 4194304
Final: calc numbufs mem 4194304 buflen 1048576
Alloc buffer: target_freebufs 4 allocated_freebufs 0
Get additional mem: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768
Starting work area: app_state: 0x7f1b50ca67b0 wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Started work area: wds: 0x7f1b50ca6768 mmwds 0x22446a1a8
Get additional mem for pga_aggregate_target: max 4195552 min 1048888, wds 0x7f1b50ca6768 mmwds 0x22446a1a8
cur size 0
Change req: expected size 4196352 cur size 0 max_mem (KB) 4098
Max memory allocation ok: max 4195552, expected 4196352, cur 0
Memlen allowed 4195552 io_buflen 1048576 chunk_len 1048888
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
kcfis_alloc_readmem_chunk: sz=1048888
incr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 4212028
kcfis rcv update : op: 5 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 4244796
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 4212028
Calling oss_cread: appliance 0x7f1b5078b3a8 app_buffer: 0x7f1b50659000 databuf: 0x7f1b50559000 buflen: 1048576 (posted to 192.168.12.8/)
appliance 0x7f1b5078b3a8 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50c94e78 app_buffer: 0x7f1b50549000 databuf: 0x7f1b50449000 buflen: 1048576 (posted to 192.168.12.11/)
appliance 0x7f1b50c94e78 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d059a0 app_buffer: 0x7f1b50439000 databuf: 0x7f1b50339000 buflen: 1048576 (posted to 192.168.12.9/)
appliance 0x7f1b50d059a0 total creads 1 new creads 1 read seqno 0 pending reads 0
Calling oss_cread: appliance 0x7f1b50d02948 app_buffer: 0x7f1b50a7a000 databuf: 0x7f1b5097a000 buflen: 1048576 (posted to 192.168.12.10/)
appliance 0x7f1b50d02948 total creads 1 new creads 1 read seqno 0 pending reads 0
kcfis wait: buf: 0x7f1b50549000 app_state: 0x7f1b50ca67b0 err: (0) Success
kcfis wait: buf: 0x7f1b50659000 app_state: 0x7f1b50ca67b0 err: (0) Success
WAIT #139755290955216: nam='cell smart table scan' ela= 59 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707578411
kcfis_push: num-appliances 4. payload_size 0x7f1b50da3130 ioctl_issued 0x7f1b50da3108 results 0x7f1b50da3068 payload_type 0x7f1b50da3160

First memory areas are initialised (lines 1-24), then we see lines showing ‘Calling oss_cread’. The oss_cread call is the call to the appliances to start sending a resultset back. Please mind that despite the calls addressing specific extents, this is a smartscan, so resultsets are send back instead of Oracle blocks. Also, since this is exadata using the iDB/RDS protocol over infiniband, the appliances can use RDMA to send the results back, which means the cells can fill the memory in the server process’ memory directly.

After oss_cread being called, the ‘kcfis_create_maps_payload’ routine (shown in the snippet above the last snippet) being executed to every appliance apparently to indicate all the disks being enabled, possibly trying to throttle activity, and/or to indicate requests will be called from this session. This seems to be repeated for every roundtrip during the entire smartscan for all the appliances that are still needed.

Whenever a result (result set) is ready to be processed, the following sequence happens:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

This shows the reap of result set returned by an appliance. Line 5-6 show appliance specific information. Line 7 is showing important information; ‘nelem’ shows the amount of extents (called ‘elements’ in this context) for which the result or results are returned. Of course ‘nelem’ means ‘number of elements’. In this case it’s 1 (resultset from a specific element/extent). Line 15 shows the actual extent from which the result set came back, because the reqid is exposed, the reqid was defined during the translation phase. This snippet ends with a WAIT line (again: all the waits are ‘cell smart table scan’). I consider this a cyclic process: first the ‘kcfis_create_maps_payload’, then calling oss_cread for one or multiple cells, then a wait, or the above processing of results for one or multiple appliances, and a wait.

I’ve created this snippet to be as simple as possible, in real life result sets of multiple appliances could be reaped (in my case I had to remove a second result). The processing of the result set is done in a few stages, so a resultset is not processed per appliance, but the processing stages are done for all the result sets of all the appliances.

Also this example shows only one request in the reaped result. There can be multiple requests (reqid’s/extents) returned.

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50439000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50d059a0 read seqno 1 pending reads 1
appliance 0x7f1b50d059a0 total creads 0 re-adjusted: read seqno 1 pending reads 1
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50439000, databuf: 0x7f1b50339000, nelem: 2
Dump of memory from 0x00007F1B5033B000 to 0x00007F1B5033B040
7F1B5033B000 0000A23C 02748C02 5BB31713 0402099A  [<.....t....[....]
7F1B5033B010 00988672 001E0001 0000007E 0000007E  [r.......~...~...]
7F1B5033B020 0000007E 00000000 00000000 00000000  [~...............]
7F1B5033B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50339080
kcfis_validate_translation: request 0x7f1b50d0b980
Req completed is : err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1 block_id=41192450
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b5033b000 afn 9 rdba 41192450 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 41192450 good: 1
Request 0x7f1b50d0b980 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b503390c8
kcfis_validate_translation: request 0x7f1b50d114e0
Req completed is : err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1 block_id=53758466
 flags = 0
kcfis_oss_block_verify: bp: 0x7f1b50342e98 afn 12 rdba 53758466 dlen: 32408 blksz: 8192 nblks: 3
kcfis_oss_block_verify: corrupt checkcb: rdba: 53758466 good: 1
Request 0x7f1b50d114e0 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 465244798976, len = 1032192 data_len = 32408 bufoff = 8192 version = 0 reqid=13 cver=1
memptr (nil) len 0, blockid = 41192450
kcfis_get_next_data: elem = 1, err = 0, disknumber = 6, off = 498161696768, len = 1032192 data_len = 32408 bufoff = 40600 version = 0 reqid=56 cver=1
memptr 0x7f1b5033b000 len 32408, blockid = 53758466
kcfis_get_next_data: dptr: 0x7f1b5033b000 len: 64816 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50439000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50439000
WAIT #139755290955216: nam='cell smart table scan' ela= 10 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707616790

(example with ‘nelem’=2)

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50549000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 1 pending reads 0
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50549000, databuf: 0x7f1b50449000, nelem: 1
Dump of memory from 0x00007F1B5044B000 to 0x00007F1B5044B040
7F1B5044B000 0000A23C 03344401 5BB31709 0402099A  [<....D4....[....]
7F1B5044B010 00986F3F 001E0001 0000000F 0000000F  [?o..............]
7F1B5044B020 0000000F 00000000 00000000 00000000  [................]
7F1B5044B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50449080
kcfis_validate_translation: request 0x7f1b50d0aee0
Req completed is : err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1 block_id=53756929
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5044b000 afn 12 rdba 53756929 dlen: 3992 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 53756929 good: 1
Request 0x7f1b50d0aee0 done
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b5078b3a8 read seqno 1 pending reads 0
appliance 0x7f1b5078b3a8 total creads 0 re-adjusted: read seqno 1 pending reads 0
throttle: received: maps_to_push 7 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 1
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03522293 5BB316ED 0402099A  [<...."R....[....]
7F1B5055B010 009849CB 001E0001 0000000D 0000000D  [.I..............]
7F1B5055B020 0000000D 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d0a000
Req completed is : err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1 block_id=55714451
 flags = 4
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55714451 dlen: 3480 blksz: 8192 nblks: 0
kcfis_oss_block_verify: corrupt checkcb: rdba: 55714451 good: 1
Request 0x7f1b50d0a000 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 513723752448, len = 106496 data_len = 3480 bufoff = 8192 version = 0 reqid=1 cver=1
memptr (nil) len 0, blockid = 55714451
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 3480 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 1, off = 562871410688, len = 122880 data_len = 3992 bufoff = 8192 version = 0 reqid=8 cver=1
memptr (nil) len 0, blockid = 53756929
kcfis_get_next_data: dptr: 0x7f1b5044b000 len: 3992 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50549000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50549000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707579132

(example of result sets returned of two appliances, for which the returned results are processed)

Also, when the extent to be scanned is larger, the result set is processed in multiple steps:

kcfis reaped i/o: app_state: 0x7f1b50ca67b0
kcfis reaped i/o: buf: 0x7f1b50659000 err: (0) Success
Returning non-pt payload
appliance 0x7f1b50c94e78 read seqno 2 pending reads 1
appliance 0x7f1b50c94e78 total creads 0 re-adjusted: read seqno 2 pending reads 1
throttle: received: maps_to_push 0 total 14
kcfis_process_completed_buffer: App Buffer 0x7f1b50659000, databuf: 0x7f1b50559000, nelem: 3
Dump of memory from 0x00007F1B5055B000 to 0x00007F1B5055B040
7F1B5055B000 0000A23C 03523080 5BB3178D 0402099A  [<....0R....[....]
7F1B5055B010 0098FBC9 001E0001 00000080 00000080  [................]
7F1B5055B020 00000080 00000000 00000000 00000000  [................]
7F1B5055B030 00000000 00000000 00000000 00000000  [................]
kcfis_process_completed_buffer: 0 elem: 0x7f1b50559080
kcfis_validate_translation: request 0x7f1b50d145c0
Req completed is : err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1 block_id=55718016
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1191936 nblk: 128 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650649796608 sz: 1048576 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d114e0, preqid=79 reqid=79 cver=1
kcfis_validate_translation: REQ3: splitting from bottom
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a2a0 sblk: 1192192 nblk: 256 aubyteoffs 2097152 disk: o/192.168.12.11/DATA_CD_05_enkcel07 off: 650651893760 sz: 2097152 mnum: 0 res: 1 parent req: 0x7f1b50d145c0, req 0x7f1b50d0b980, preqid=79 reqid=79 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5055b000 afn 13 rdba 55718016 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 55718016 good: 1
Request 0x7f1b50d145c0 done
kcfis_process_completed_buffer: 1 elem: 0x7f1b505590c8
kcfis_validate_translation: request 0x7f1b50d13d40
Req completed is : err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1 block_id=41195392
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a520 sblk: 3446272 nblk: 384 aubyteoffs 0 disk: o/192.168.12.11/DATA_CD_06_enkcel07 off: 533536440320 sz: 3145728 mnum: 0 res: 1 parent req: 0x7f1b50d13d40, req 0x7f1b50d145c0, preqid=75 reqid=75 cver=1
kcfis_oss_block_verify: bp: 0x7f1b50563098 afn 9 rdba 41195392 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 41195392 good: 1
Request 0x7f1b50d13d40 done
kcfis_process_completed_buffer: 2 elem: 0x7f1b50559110
kcfis_validate_translation: request 0x7f1b50d13f60
Req completed is : err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1 block_id=59912064
 flags = 0
kcfis_validate_translation: REQ2: splitting from top
kcfis_do_kf_trans_and_queue_in_push1: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_do_kf_trans_and_queue_in_push4: discard: 0 fob: 0x24901a008 sblk: 1191428 nblk: 380 aubyteoffs 32768 disk: o/192.168.12.11/DATA_CD_07_enkcel07 off: 655070625792 sz: 3112960 mnum: 0 res: 1 parent req: 0x7f1b50d13f60, req 0x7f1b50d13d40, preqid=76 reqid=76 cver=1
kcfis_oss_block_verify: bp: 0x7f1b5056b130 afn 14 rdba 59912064 dlen: 32920 blksz: 8192 nblks: 4
kcfis_oss_block_verify: corrupt checkcb: rdba: 59912064 good: 1
Request 0x7f1b50d13f60 done
Returning non-pt payload
kcfis_get_next_data: elem = 0, err = 0, disknumber = 6, off = 650650845184, len = 1048576 data_len = 32920 bufoff = 8192 version = 0 reqid=79 cver=1
memptr (nil) len 0, blockid = 55718016
kcfis_get_next_data: elem = 1, err = 0, disknumber = 4, off = 533539586048, len = 1048576 data_len = 32920 bufoff = 41112 version = 0 reqid=75 cver=1
memptr 0x7f1b5055b000 len 32920, blockid = 41195392
kcfis_get_next_data: elem = 2, err = 0, disknumber = 5, off = 655073738752, len = 1048576 data_len = 32920 bufoff = 74032 version = 0 reqid=76 cver=1
memptr 0x7f1b5055b000 len 65840, blockid = 59912064
kcfis_get_next_data: dptr: 0x7f1b5055b000 len: 98760 err_code: 0
In kcfis read: app_state: 0x7f1b50ca67b0 being_ret=0x7f1b50659000
Returning non-pt payload
kcfis_read: OSS I/O: freeing buffer=0x7f1b50659000
WAIT #139755290955216: nam='cell smart table scan' ela= 9 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707638819

(example with nelem=3, with bigger extents, for which the result is splitted)

Once the result is reaped for a certain appliance, a new oss_cread call must be done in order for another result to be pushed to the server process. Request (reqid or extents) are not sent back in order.

Once the requests are exhausted for a certain appliance, the appliance is excluded from the ‘kcfis_create_maps_payload’ procedure.

Once all smartscan is complete, the memory areas are cleaned up, and the sessions are closed. Apparently a session or some session state information is kept per grid disk, which are all closed (closing of one appliance is shown in this snippet):

kcfis_read DONE - ret NULL
kcfis_finalize: app_state 0x7f1b50ca67b0 permflags 0x9000 clnflags 0x1f
Appliance 0x7f1b5078b3a8 in-active. Active count 3
Appliance 0x7f1b50c94e78 in-active. Active count 2
Appliance 0x7f1b50d059a0 in-active. Active count 1
Appliance 0x7f1b50d02948 in-active. Active count 0
Caching: session at the end of scan 1
Work area gc: 0x7f1b50ca6768, app_state: 0x7f1b50ca67b0
Freeing read buffer chunk (GC) 0x7f1b50448eb0. count 4
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 3159021
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 3191789
Freeing read buffer chunk (GC) 0x7f1b50338eb0. count 3
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 2106014
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 2138782
Freeing read buffer chunk (GC) 0x7f1b50558eb0. count 2
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 1053007
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 1085775
Freeing read buffer chunk (GC) 0x7f1b50979eb0. count 1
decr_kcfis_mem_wds: wds 0x7f1b50ca6768 mmwds 0x22446a1a8 size 1053007 refcnt 1 memsize 0
kcfis rcv update : op: 6 val: 1053007 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 32768
Freeing memory to top level heap: num_freed 4
Set workarea size: app_state: 0x7f1b50ca67b0 kcfis wds 0x7f1b50ca6768 mmwds: 0x22446a1a8, global_ctx: 0x7f1b50ca6d20, size: 0
Work area cleanup start: global_ctx: 0x7f1b50ca6d20
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 app_state: 0x7f1b50ca67b0 refcnt: 0
Close work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8 refcnt: 0
Closed work area: kcfis wds 0x7f1b50ca6768, mmwds: 0x22446a1a8
Deallocating kcfis wds: 0x7f1b50ca6768
Cleanup work area for app_state: 0x7f1b50ca67b0
kcfis_finalize_cached_sessions: global_ctx 0x7f1b50ca6d20
Caching: in kcfis_finalize_cached_sessions global ctx 0x7f1b50ca6d20 total cached 1 cached in ctx 1
Deallocating session. app state 0x7f1b50ca67b0 num cached sessions 0
In kcfis_deallocate_session: app_state: 0x7f1b50ca67b0
KCFIS: [NSMTIO]:SQL for this (non)Smart I/O session is: 
select count(*) from t
kcfis_cache_appliance_fd: Cached appliance fd 0x7f1b50da2b88  for appliance 0x7f1b5078b3a8
Appliance 0x7f1b5078b3a8 was not active. Active count 0
kcfis_deallocate_session: Freeing disk 0x7f1b50c953f0 name = o/192.168.12.8/DATA_CD_06_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507908e0 name = o/192.168.12.8/DATA_CD_01_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b507903f0 name = o/192.168.12.8/DATA_CD_05_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50d04fc0 name = o/192.168.12.8/DATA_CD_10_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfe960 name = o/192.168.12.8/DATA_CD_03_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfdf80 name = o/192.168.12.8/DATA_CD_09_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cfc930 name = o/192.168.12.8/DATA_CD_00_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc8118 name = o/192.168.12.8/DATA_CD_11_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc7ae0 name = o/192.168.12.8/DATA_CD_08_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc5ab0 name = o/192.168.12.8/DATA_CD_02_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc3a80 name = o/192.168.12.8/DATA_CD_04_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis_deallocate_session: Freeing disk 0x7f1b50cc30a0 name = o/192.168.12.8/DATA_CD_07_enkcel04 state = 0 appliance = 0x7f1b5078b3a8
kcfis rcv update : op: 6 val: 8192 so_numses 1 ovhdmem 139312 mdmem 1564 bufmem 24576

And the appliances are notified the smartscan has ended:

kcfis_close_all_appliance_fds. Cache count 4
kcfis_close_appliance_fd: appliance_fd 0x7f1b50da2b88
WAIT #139755290955216: nam='cell smart table scan' ela= 241 cellhash#=3249924569 p2=0 p3=0 obj#=61458 tim=1593707978094
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5db00
WAIT #139755290955216: nam='cell smart table scan' ela= 237 cellhash#=822451848 p2=0 p3=0 obj#=61458 tim=1593707978356
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5df58
WAIT #139755290955216: nam='cell smart table scan' ela= 280 cellhash#=674246789 p2=0 p3=0 obj#=61458 tim=1593707978668
kcfis_close_appliance_fd: appliance_fd 0x7f1b51f5e3b0
WAIT #139755290955216: nam='cell smart table scan' ela= 242 cellhash#=1034800054 p2=0 p3=0 obj#=61458 tim=1593707978935

Summary.
This blogpost tries to summarise (..) the different steps in an Exadata smartscan. One of the most important things which this shows is that anything which is done, is covered by a single wait event (‘cell smart table scan’). In other words: profiling this wait event tells you little to nothing about what is actually happening, except that a smartscan is being processed. In other words: if you get high ‘cell smart table scan’ waits, either for all the waits or for a few, the only way to pinpoint what it is the wait is showing waiting time is executing this trace again, and see which step it is. Of course you can pinpoint if the waiting is happening for a specific cell/appliance by looking at cellhash# in the wait line.

Disclaimer:
Please mind I tried to use my knowledge on Oracle and Exadata processing together with the information the trace provided to build this description. If you encounter anything which is incorrect, please comment on this post, and I try to get it fixed. No bits where harmed during testing.

%d bloggers like this: