Archive

Tag Archives: debugger

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manualis to think of Pin as a JIT (just in time) compiler, where the compiler does not take byte code (as JIT compilation does with Java), but the executable of the process pin is executed against. This means pin inserts itself into the process’ execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see this process an oracle foreground (rows 1-3), after which we see a lot of stuff that pin pushed into the process’ address space.

For this reason, it’s vitally important to use and run pin as the same user as the process you want to run pin against. The way pin works is that, upon execution of pin, the pin executable inserts itself into the process’ address space, gains control and then tries to load necessary libraries. If it can’t find these libraries, it will send a SIGKILL to the process, effectively stopping it!

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This will output a lot of stuff on your screen and compiles everything in the tools directory.

Now that we have seen an introduction, let’s use a Pin tool to do something useful! One such useful is the tool ‘DebugTrace’. First obtain the process id (pid) from an Oracle server foreground process. Then execute the pin tool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so

Upon execution, nothing is returned, because the majority of the things are happening in the process pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let’s look what this produces! This is how my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it look rather big (remember I only ran ‘select * from dual’)!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a ‘return underflow’, which is because these functions have been called before the trace was put on the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called and first two arguments of the function.
Returning: the address and function where the return is executed, and the returncode.
Tailcall: this is a function that is called as the last instruction in the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).

Please mind currently I am not aware of a way to remove the pin framework and tool from a process address space. Also mind that because of how this works (a lot of intercepting in the execution flow), it will slow down the process significantly. Use this tool and the above described techniques at your own risk.

(warning: this is a rather detailed technical post on the internal working of the Oracle database’s commit interactions between the committing foreground processes and the log writer)

After the Trivadis Performance days I was chatting to Jonathan Lewis. I presented my Profiling the log writer and database writer presentation, in which I state the foreground (user/server) process looks at the commit SCN in order to determine if its logbuffer contents are written to disk by the logwriter(s). Jonathan suggested looking deeper into this matter, because looking at the commit SCN might not the way it truly works.

The reasoning is the foreground process flushes its log information from its private strand into the public logbuffer, and as such only needs to keep track of the log blocks it allocated in the public logbuffer to see if these are written. Different processes can allocate different blocks in the public log buffer, which potentially do not have to be written all in SCN order. In other words: it could be hard to figure out on write time what the exact on-disk SCN of the redo is.

In order to verify this, let’s first recap what I tell about how a foreground process behaves on commit:

This is a schematic drawing of a foreground process committing.
There are two timelines, one of the foreground process, and one of the logwriter.
The foreground process commits. The commit copies the changes (redo vectors) from the process’ private strand into the public logbuffer (not shown, all indicated with “commit;”), after which it signals (“posts”) the logwriter via the semctl() system call.

When idle, the logwriter sleeps in a systemcall called semtimedop() for 3 seconds after which it performs some “household” tasks (measure time and resource usage for example) then calls semtimedop() again. When signalled, or when it finds contents in the public log buffer, the log writer writes the public log buffer contents to the online redo log file(s) via AIO if available. This is indicated by the io_submit() and io_getevents() calls.

In the meantime, after executing semctl(), the foreground process enters the function kcrf_commit_force(), or the kcrf_commit_force_int() function with Oracle 12. Inside this function, it calls the function kcscur3() three times. If the log writer was able to write the public log buffer contents before kcscur3() has been executed 3 times, the foreground will not enter the wait event ‘log file sync’, and not sleep on semtimedop() in order to wait for being signalled by the log writer, which is why I put it in a gray bar (indicating “optional”). Also, in that case, because the foreground process is not awaiting to be signalled by the logwriter, the log writer does not execute semctl(). The log writer knows what process(es) it needs to signal via a list that administers posters and waiters, which is called the “post-wait queue”, which is not externalised. If the log writer did not write the public log buffer contents fast enough, the foreground process registers it started waiting (in the aforementioned post-wait queue), and starts sleeping in semtimedop() for 100ms, after which it executes kcscur3() two times, do some household actions like measuring time and resource usage, and then calls semtimedop() again. This repeats until the process is receiving a message from the log writer via semctl().

Starting from Oracle version 11.2.0.3, the underscore parameter “_use_adaptive_log_file_sync” is set to TRUE, which enables a feature called “adaptive log file sync”. This settings means the database can and will (!) switch between post-wait mode, which has just been described, and a mode called ‘polling’. Polling means the processes that requested the log writer to write will not wait until they are posted by the log writer, but look at the log writer write progress, and continue after the log writer has written their specific log blocks from the public log buffer.

This is how that schematically looks like:

As you can see, it is quite alike the post wait mode, only the sleeping is done using nanosleep() instead of semtimedop() and there is no semctl() on the log writer timeline. The foreground process issues the same kcscur3() functions, but because these measure SCN progress, the foreground can determine if its public log buffer contents have been written to disk, and stop waiting and continue processing.

I made the conclusion that the kcscur3() function is used to determine the commit SCN based on profiling the function call sequence and logic reasoning. Encouraged by Jonathan’s reasoning, let’s try to see if we can dig deeper.

The first thing we need to do, is see if we can obtain more information on the kcscur3() function. One way of doing that, is investigating the function’s arguments. This can be done without source code or debug symbols, because the arguments of a function are passed via CPU registers on X86_64 Linux: first to fourth arguments are in registers RDI, RSI, RDX, RCX.

A way to do this, is insert data into a table, then attach to this process using gdb, and execute the following gdb macro:

break kcscur3
commands
silent
printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
continue
end

After continuing the foreground via gdb, gdb shows kcscur3 is executed 6 times:

kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Interesting! But what are these numbers? Values? Addresses?
In order to understand if these arguments mean anything, let’s first get the addresses for the general shared memory area’s. This can be done using ipc command with oradebug:

SYS@v12102 AS SYSDBA> oradebug setmypid
Statement processed.
SYS@v12102 AS SYSDBA> oradebug ipc
IPC information written to the trace file
SYS@v12102 AS SYSDBA> @lt

@lt is a script to look at the current active trace file. The relevant contents of this trace file are the shared memory area’s:

Handle:            0x113c2940 `/u01/app/oracle/product/12.1.0.2/dbhome_1v12102'
 Dump of unix-generic realm handle `/u01/app/oracle/product/12.1.0.2/dbhome_1v12102', flags = 00000000
  key 604726764 actual_key 604726764 num_areas 4 num_subareas 4
  primary shmid: 112590854 primary sanum 3 version 3
  deferred alloc: FALSE (0) def_post_create: FALSE (0) exp_memlock: 2050M
 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002ca788 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2 112492547 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cb000 0000000000400000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000077000000 Minimum Subarea size 01000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0 112525316 0x00000061000000 0x00000061000000 0x00000061000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000077000000 0000000077000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008d35000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1 112558085 0x000000d8000000 0x000000d8000000 0x000000d8000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008d35000 0000000008e00000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000003000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3 112590854 0x000000e1000000 0x000000e1000000 0x000000e1000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000003000 0000000000003000 default       readwrite

We see the 4 shared area’s and their memory address:
– Fixed size, start address 0x60000000, size 0x2cb000
– Variable size, start address 0x61000000, size 0x77000000
– Redo buffers, start address 0xd8000000, size 0x8d35000
– Skgm overhead, start 0x1e000000, size 0x3000

If we combine this information with the kcscur3() arguments, we see that the first argument points to the ‘fixed size’ area, in other words: the fixed SGA. The fixed SGA variables metadata is listed in x$ksmfsv. Please mind this view does not list all the contents of the fixed SGA, for example the latches are in the fixed SGA too.

We got 3 addresses which are inside the fixed SGA on calling commit: 0x6001fbb0, 0x60027c68 and 0x60027c98. Let’s see if we can find them in x$ksmfsv:

SYS@v12102 AS SYSDBA> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv
2  where to_number('6001fbb0','XXXXXXXX')
3  between to_number(ksmfsadr,'XXXXXXXXXXXXXXXX') and to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')+ksmfssiz-1;

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcsgscn_                                                         000000006001FBB0         48

SYS@v12102 AS SYSDBA> c/6001fbb0/60027c68/
  2* where to_number('60027c68','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

SYS@v12102 AS SYSDBA> c/60027c68/60027c98/
  2* where to_number('60027c98','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

The first fixed SGA address, 0x6001fbb0, points to a variable called kcsgscn_. Some sources on the internet report this is the current SCN. (Kernel Cache System Global SCN?)
Let’s test this! First I set a watchpoint on 0x6001fbb0 in gdb, then query v$database.current_scn:

(gdb) awatch *0x6001fbb0
SYS@v12102 AS SYSDBA> select current_scn from v$database;

This triggers the watchpoint!

Hardware access (read/write) watchpoint 1: *0x6001fbb0

Old value = 11278094
New value = 11278163
0x000000000cc77983 in kcscur3 ()

Ah! So when querying the current_scn, it uses the kcscur3() function too, and apparently, kcscur3() can change a value too (we see an old value, and the new value). I press ‘c’ and enter to let the debugger continue the program it debugs. In fact, I have to do this multiple times in the functions kcscur3(), kcsgssn(), kcsgcsn() and kcsgbsn. The watchpoint shows memory address 0x6001fbb0 is left with a number exactly one higher than is shown in the query ‘select current_scn from v$database’.

The second and third fixed SGA addresses, 0x60027c68 and 0x60027c98, both point to a variable called kcrfsg_. Actually, they do not point to the starting address of the variable, but rather to “somewhere” in this variable. First let’s see what value is stored at these addresses in the variable kcrfsg_ using gdb:

(gdb) x/dw 0x60027c68
0x60027c68:	11278915
(gdb) x/dw 0x60027c98
0x60027c98:	11278917

These are large numbers, which are highly likely to be SCNs of some sort.

The variable kcrfsg_, which is quite probably a c “struct” (a variable composed of multiple variables, alike records in a table) is linked with the x$ view x$kcrfws:

SYS@v12102 AS SYSDBA> select addr from x$kcrfws;

ADDR
----------------
0000000060027C38

The address reported is 8 bits into kcrfsg_. The x$kcrfws view is only used by the v$ view v$xstream_capture, and because of that it is reported to have something to do with replication. That is incorrect. My current assumption is x$kcrfws means Kernel Cache Redo Write Status.

In order to figure out which field in x$kcrfws is linked to which memory address (0x60027c68 and 0x60027c98), I use gdb once again, and use a watchpoint on a memory address. Oradebug also provides this functionality, but it doesn’t work in my setup (mprotect error 22). I attach gdb to a SYSDBA session, and execute:

(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 1: *0x60027c68

Then query the fields in x$kcrfws one by one. It turns out, memory address 0x60027c68 is accessed for the fields lwn_scn_bas and lwn_scn_wrp, and memory address 0x60027c98 is accessed for the fields on_disk_scn_bas and on_disk_scn_wrp.

So, what do we know now? It becomes apparent Oracle uses the kcscur3() function for reading SCN values. The function seems to be dynamic and can be used for multiple locations holding different types of SCNs. We witnessed reading the instance current SCN, the on disk SCN and the LWN SCN.

The on disk SCN (x$kcrfws.on_disk_scn_(bas|wrp)) seems to be a registration of up to which SCN is written by the logwriter, and the LWN SCN (x$kcrfws.lwn_scn_(bas|wrp)) is a registration of up to which SCN is in the current log write number (LWN). The log write number (LWN) seems to be a number appointed to groups of redo blocks for writing them in batch.

This information is needed to make more sense of how my foreground works. In order to make the tracing of the foreground more meaningful, we need to add a break on semctl() to understand when all redo vectors are copied into the public log buffer, and the foreground actually starts waiting on the log writer, and peeks at its progress. It is also handy to add breaks to semtimedop() and nanosleep(), so we know what logwrite mode is in use:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000000cc77970 <kcscur3>
	breakpoint already hit 6 times
        silent
        printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
        continue
(gdb) break semctl
Breakpoint 2 at 0x3bfdaeb030
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "semctl\n"
>continue
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3bfdaeb060
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>silent
>printf "semtimedop\n"
>continue
>end
(gdb) break nanosleep
Breakpoint 4 at 0x3bfde0ef90
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>printf "nanosleep\n"
>continue
>end
(gdb) disable
(gdb)

I disabled all breakpoints (disable command), and do the insert. After the insert, the enable command enables all breakpoints, and watch what the function call sequence:

(gdb) c
Continuing.
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
semctl
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

So, after a foreground process has written its change vectors into the public log buffer and requests the log writer to write using semctl(), it doesn’t poll a single location for a SCN, but requests the on disk SCN and the current SCN! It also seemed my log writer was speedy enough to write, because there is no nanosleep() nor semtimedop() call.

It seems we need to manipulate the logwriter too, in order to get the call sequences I showed in my slides. In order to do that, I open another terminal session, and attached a debugger session to the lgwr too. At this point I should point out I made sure my Oracle (12.1.0.2) database was using the single lgwr process and not the log writer workers by setting the “_use_single_log_writer” undocumented parameter to true. (do NOT do this on a production database; both setting an undocumented parameter without the blessing of oracle support, nor attach gdb to a log writer process!!)

I added a break to the io_getevents_0_4 function, which is the function for reaping asynchronous submitted IO, and then disable the breakpoint:

break io_getevents_0_4
disable
continue

I also disabled the breakpoints in the other debugger session attached to the foreground session, and insert another record. After the insertion, enable the breakpoints in both gdb sessions, and enter commit. Now the log writer debugger session will break on io_getevents_0_4, and the foreground debugger session should show a repeating pattern:

(gdb) enable
(gdb) c
Continuing.
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
semctl
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
...

The nanosleep and kcscur3 calls for address 60027c98 and 60027c68 continue to be executed. After continuing the log writer debugger session, the foreground debugger session shows:

kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Bingo! We see the nanosleep() call, so we are in polling mode, not in post-wait mode. And because we artificially made the log writer stop from progressing , the foreground process is reading the on-disk SCN and LWN SCN, then calls nano sleep, then scans the on-disk and LWN SCNs again, etcetera, until the on-disk SCN gets higher than the foreground process commit SCN. Interestingly, it seems that once kcscur3() on address 0x60027c98 was able to identify the log writer progressed beyond it’s commit SCN, the next kcscur3() call does not read address 0x60027c68, but instead reads address 0x6001fbb0, alias the current SCN.

Okay, so this points to the on-disk SCN being responsible for commit progress. If this is actually true, we should now take a look at the log writer and see if the log writer indeed changes the on-disk SCN at memory location 0x60027c98 after the write. For this I can use the memory watch facility of the debugger again:

(gdb) awatch *0x60027c98
Hardware access (read/write) watchpoint 2: *0x60027c98
(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 3: *0x60027c68
(gdb) c
Continuing.

This will quite probably immediately break on access of these memory areas. The log writer is looking at these memory area’s a lot. Here’s a snippet of the logwriter approximately during the commit of the foreground:

(gdb) c
Continuing.
Hardware access (read/write) watchpoint 3: *0x60027c68

Old value = 11174890
New value = 11174892
0x000000000cc77ac4 in kcsnew3 ()
(gdb) c
Continuing.

Breakpoint 1, io_getevents_0_4 (ctx=0x7f165de70000, min_nr=2, nr=128, events=0x7ffc696611b8, timeout=0x7ffc696621b8)
    at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x000000000cc77983 in kcscur3 ()
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x0000000002d70bb9 in kcsadj3 ()
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Old value = 11174890
New value = 11174892
0x0000000002d70c13 in kcsadj3 ()

First we see a function called kcsnew3() change the value at 0x60027c68. Then we see the breakpoint at io_getevents() indicating it looks for submitted IOs to return, in other words: the log writer was asked to write something or found something to write. After the write we see the log writer reading the memory location 0x60027c98 using the kcscur3() function, just as our foreground process does. Then we find a function called kcsadj3() which first reads the on-disk SCN location 0x60027c98, and then changing it. This indicates the log writer is increasing the value of the on-disk SCN in 0x60027c98 after it has written using the function kcsadj3(), a safe bet is Kernel Cache Service Adjust, which the foreground in polling mode is reading to understand if the log writer has written the contents the process put in the public log buffer.

Of course this still isn’t the ultimate proof, I break at a few, arbitrary functions, there can be all kinds of other things going on. This means that still the foreground process can just keep the list of log buffer blocks and really use that, and only use the SCN values for verification, because I am looking at only a small subset of the functions it is executing.

However, there is a way to test this!
=> Another word of caution, the next manipulation can render your database into a smoking pile of bits and bytes! <=
The log writer mode needs to be polling. Also, for simplicity, in Oracle 12 “_use_single_log_writer” must be set to true, to be sure the log writer is performing all the tasks, and not log writer slaves.

In order to proof that a foreground session is truly looking at 0x60027c98 for the on-disk SCN to understand if the log writer has written its log buffer contents, we let the log writer stop before it has adjusted the on-disk SCN. That can be accomplished by attaching to the log writer with gdb, and break on io_getevents_0_4:

break io_getevents_0_4
disable
continue

Now the breakpoint is set, disabled, and execution is resumed.

The next thing to do, is go to the sqlplus foreground session and insert a row of data. Once the insert is done, go to the debugger session, stop execution, enter “enable”, and continue. The debugger resumed execution again, but now will break execution once it encounter io_getevents. Now go back to the foreground sqlplus session and enter commit.

This will trigger the log writer to write, and break execution because of the breakpoint. Because the log writer is stopped the foreground session will appear to be “hanging”. In fact what the foreground session is doing is sleeping in nanosleep(), then executing kcscur3() to look at the on-disk and LWN SCNs, then sleeping in nanosleep() again, etc.

This next part is experimental!
Because the log writer is stopped at the io_getevents() call, it didn’t execute kcsadj3() yet to update the on-disk SCN in 0x60027c98. What we will do, is read the current value in 0x60027c98, and increase the value using gdb. If the foreground is truly only waiting for the value at 0x60027c98, increasing the value at this memory address should give back the prompt of the sqlplus session, because it thinks the commit fully succeeded because the log writer performed all its functions.
First query the on-disk SCN:

(gdb) x/dw 0x60027c98
0x60027c98:	11175042

So, the on-disk SCN value is 11175042. Mind you the sqlplus session appears hanging, but really is “polling” the on-disk SCN, and the log writer has stopped executing, because of the breakpoint.

Let’s see if increasing the on-disk SCN makes the sqlplus session think the commit succeeded. Please mind I randomly increased the value from above by changing the 7 to 8 on the 5th position.

(gdb) set {int}0x60027c98 = 11185042

Voila! the foreground process continues and returns the prompt, indicating it is thinking the log writer has finished!

This means that Oracle uses the log writer on-disk SCN for processes to determine if their log buffer contents have been written to disk.

When the log writer is in post-wait mode, this mechanism is in place too, but the foreground process, once waiting, needs to wait for the log writer to post it via semctl(). There is one exception to this, which is true for both the polling and the post-wait mechanisms: right after a foreground process signalled the log writer to write, it will call kcscur3() looking for the on-disk SCN and if it finds the on-disk SCN beyond its commit SCN, there will not be a wait triggered. In all other cases the process will register a wait “log file sync”.

Conclusion
The kcscur3() function is a function to read and update SCNs. There are multiple SCNs, like current SCN, on-disk SCN and LWN SCN, which all can be read with kcscur3(). Also kcscur3() is not exclusive to commit, when querying the current_scn field in v$database, this function is used too.

When a process commits, right after it signalled the log writer, it checks for the log writer progress via the on-disk SCN. If the log writer was fast enough to have the contents flushed to disk and the on-disk SCN updated, the process will not wait for ‘log file sync’ at all.

Both log writer modi, post-wait and polling, look for the log writer write progress via the on-disk SCN. Both modi will show the wait ‘log file sync’ if the above “fast log writer exception” didn’t happen.

When the log writer is in post-wait mode, a process without the “fast log writer exception” will register itself in the post-wait queue and sleep on a semaphore waiting to be posted by the log writer, and register it is waiting for event ‘log file sync’. Still, it does keep track the on-disk SCN (twice) with a check of the LWN SCN in between, via calls to kcscur3(). After being signalled by the log writer, it again verifies the on-disk SCN and looks for the current SCN, after which it continues.

When the log writer is in polling mode, a process without the “fast log writer exception” will register a waiting for event ‘log file sync’. Now it sleeps in nanosleep() and only executes two checks: the on-disk SCN and the LWN SCN, via calls to kcscur3(). Once the log writer progressed the on-disk SCN beyond the process’ commit SCN, it will continue.

Oracle DBAs who are so old that they remember the days before Oracle 11.2 probably remember the tuning efforts for latches. I can still recall the latch number for cache buffers chains from the top of my head: number 98. In the older days this was another number, 157.

But it seems latches have become less of a problem in the modern days of Oracle 11.2 and higher. Still, when I generate heavy concurrency I can see some latch waits. (I am talking about you and SLOB mister Closson).

I decided to look into latches on Oracle 12.1.0.2 instance on Oracle Linux 7. This might also be a good time to go through how you think they work for yourself, it might be different than you think or have been taught.

In order to understand how latching works, I searched for Oracle related traces. I could find event 10005, but it did not return anything latching related. My current understanding is that event 10005 is Oracle KST tracing, for which the results are put in X$TRACE.

Luckily, I could get a great headstart by studying the work of Andrey Nikolaev. However, that work seems to be strictly Solaris based.

I decided to take a look on how this works on Linux. In order to do this, I did setup a system for the specific purpose of this test. Disclaimer: The techniques below are for educational purposes only, and should never be done on a real database!

The work of Andrey shows kslgetl() as the overal latching function, which does:
kslgetl
– sskgslgf (immediate latch get)
– kslges (wait latch get)
— kskthbwt
— kslwlmod (setup wait list)
— sskgslgf (immediate latch get)
— skgpwwait (sleep latch get)
— semop

In order to do predictable latch gets and generate latch misses, in a very predictable way, I used the method that Andrey shows on his website (which he attributed to Tanel), which is using oradebug call to call the latch get function. In order to get latch waits, you need at least two processes doing something with a latch, one holding the latch, and another one requesting it. In order to facilitate this, I setup two sqlplus / as sysdba sessions.

Taking a latch manually can be done using the kslgetl or the ksl_get_shared_latch functions. Freeing a latch is done using the kslfre function. The kslfre function just takes the latch address as argument. The kslgetl and ksl_get_shared_latch functions take the following arguments:
1-latch address
2-immediate get (0 means yes, 1 means no)
3-where (X$KSLLW.INDX)
4-why (X$KSLWSC.INDX)
5-mode (8=shared,16=exclusive; only for ksl_get_shared_latch function)

Immediate mode get for a non shared latch.
An immediate mode get just tries to fetch a latch once. In order to manually do an immediate latch get, I done:

Session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 1

This means session 1 has gotten latch 0x60023a80. I gotten the latch in willing to wait mode, but that does not really matter here. The session did get the latch.

If you want to check this, simply use V$LATCHHOLDER view to verify (in another session):

SQL> select * from v$latchholder;

        PID        SID LADDR            NAME                                                                   GETS     CON_ID
---------- ---------- ---------------- ---------------------------------------------------------------- ---------- ----------
        38        134 0000000060023A80 cache table scan latch                                                  709          0

Please mind that with a latch manually gotten, you need to free the latch before you try to do anything else with your session, otherwise you encounter a (non critical) ORA-600. Freeing a latch is done using oradebug call kslfre and one argument: the latch address.

Session 2:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 0

The ‘Function returned 0’ means the immediate latch get failed. As you can see this was an immediate get because the second argument is 0.

In order to understand which functions are involved, I first used the perf record linux utility. However, because the immediate get does not spin, and a latch get by all means has speed/low overhead as one of the principal design criterions, I could not see the function.

This meant I needed to go to one of the tools I have used extensively in the past: gdb (the GNU debugger). You need to attach to the Oracle database server shared process locally on the database server. Here is what I did:

# gdb -p 4600
...
(gdb) set pagination off
(gdb) rbreak ^ksl.*
...
Breakpoint 262 at 0x8ea7b0
<function, no debug info> ksl_event_stats_rollup;
(gdb) commands 1-262
type commands for breakpoint(s) 1-262, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

The first gdb function turns off having to press enter for every full screen of output of gdb, the second function breaks on all functions in the oracle executable that start with ‘ksl’. The commands command creates commands that are executed in gdb if breakpoints 1-262 are encountered, which is ‘c’: continue.

Now, with the debugger set, I executed the kslgetl function again:

Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 247, 0x000000000c8e10a0 in kslws_check_waitstack ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 242, 0x000000000c8ddcb0 in kslgetl ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()

It is important to understand Oracle does a lot of other stuff outside the latch get via kslgetl. Most of the stuff above are functions which start with kslwt, which is the Oracle wait interface. A couple of times a shared latch is taken (as can be seen by the function ksl_get_shared_latch), and freed (kslfre). The important part here is: kslgetl is executed once, and did not go into any other function to try to get the latch.

Getting a latch in willing to wait mode for a non shared latch.
Now let’s do something a bit more exciting: getting a taken latch in willing to wait mode. The first session can do exactly the same, just take the latch. The second session needs to be changed a little bit to indicate it is willing to wait:

SQL> oradebug cell kslgetl 0x60023a80 1 0 2442

This will call additional functions. In order to understand what these functions are, I used perf record, perf report and perf script.

I was able to create a smaller, more specific gdb script to see what is going on:

break kslgetl
  commands
    silent
    printf "kslgetl laddr:%x, willing:%d, where:%d, why:%d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslges
  commands
    silent
    printf "kslges %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslwlmod
  commands
    silent
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break skgpwwait
  commands
    silent
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break sskgpwwait
  commands
    silent
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break semop
  commands
    silent
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end

This is how the gdb output looks like when the latch get in willing to wait mode is executed:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
kslwlmod 13311368, -1780327896, 1610758784, 1
skgpwwait 13311608, -1767360, -1780326976, 0
sskgpwwait 13311608, -1767360, -1780326976, 0
semop 360451, 13310840, 1, -1

Interestingly, if the latch is not taken, this is how the latch get sequence looks like:

kslgetl laddr:60023a80, willing:1, where:0, why:2442

In other words, for getting a non shared latch in willing to wait mode:

1-the function kslgetl is called, which tries to fetch the latch.
If the latch can be taken, the function returns, if not:
2-the function kslges (kernel service latch get spinning) is called, which supposedly also tries to take the same latch.
If the latch still can not be taken, the next function is:
3-the function kslwlmod (kernel service latch waiting list modify) is entered.
In this function the process registers itself as waiting in the post/wait list.
4-the function skgpwwait (system kernel generic post/wait wait) is entered.
This function sets up the waiting for the process so it can be posted.
5-the function sskgpwwait (system system kernel generic post/wait wait)
My current understanding is the ‘ss’ function contain the platform specific code for database functions.
6-the (operating system) function semop (semaphore operation) is called.
This will make the process sleep waiting on a semaphore (operating system signalling mechanism). This way, the process will not be runnable on the CPU unless the semaphore is posed.

The information that is missing here, is the spinning. The earlier work of Andrey Nikolaev showed that in the Solaris port of the database, a distinct function (sskgslgf [immediate] and sskgslspin [spin]) was used to get the latch, which made it easy to count.

Some searching around revealed that a CPU register reveals this information. Add this to the above gdb script:

break *0xc29b51
  commands
    silent
    printf " kslges loop: %d\n", $ecx
    c
  end

And try to get a non shared taken latch in willing to wait mode:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
 kslges loop: 19999
 kslges loop: 19998
...
 kslges loop: 1
 kslges loop: 0
kslwlmod 1208568840, -1780327896, 1610758784, 1
skgpwwait 1208569080, -1281169344, -1780326976, 0
sskgpwwait 1208569080, -1281169344, -1780326976, 0
semop 360451, 1208568312, 1, -1

So…this reveals that getting a non shared latch in willing to wait mode will spin 10*_spin_count. In other words: not _spin_count, which is 2000 by default. What is even more interesting, is for the described type of latch, there is no (short) timed sleep done; the kslges function spins 10*_spin_count times for the latch, then puts itself on the waiting list, and goes to sleep on a semaphore. The way the latching mechanism works has been described (by Oracle AFAIK) as that a process will spin for _spin_count times trying to get a latch, then goes to sleep for some time, and then spins trying to get the latch, goes to sleep, etc.

I strace’d the process holding the latch to see if it is doing a semctl systemcall to signal the waiting process if the latch is freed, and indeed that is what is happening. This is different from the “old days” where processes spinning on latches (cache buffers chains latches most of the time) were really hammering the system and would eat up a lot of the CPU slices trying to get a latch.

Part of the classic latching problem, specifically the cache buffers chains latching problem, quite probably is mitigated by having shared latches, which were introduced in Oracle in different versions for different ports.

Watch out for a second blogpost where I do the same investigation for shared latches.