Archive

Tag Archives: C

This blog is about the oracle database wait event ‘reliable message’. It should be noted that normally, this should not be a prominent wait, and if it does so, the most logical cause would be something that is not working as it should, either by oversubscribing something or simply because of a bug.

The reliable message note on My Oracle Support provides a good description:
https://support.oracle.com/epmos/faces/DocContentDisplay?_afrLoop=34595019638128&id=69088.1&_afrWindowMode=0&_adf.ctrl-state=15ief27plw_97

It does tell something about what a reliable message is:
– A message send using the KSR intra-instance broadcast service.
– The message publisher waits on this wait event until all subscribers have consumed the message.
– The publisher waits for up to one second and then retests if all subscribers have consumed the message.

So the first thing to establish is that a reliable message means that the publisher of the message does wait for all subscribers of the channel it published its message on to have finished consuming the message. That is what ‘reliable’ means with reliable message: all subscribers are guaranteed to have consumed the message. Not all messages in the Oracle executable are ‘reliable’; if you take the log writer notification during redo generation in kcrfw_redo_gen_ext>kcrf_commit_force_int for example, it will only signal the log writer if it isn’t signalled before in the first place, and if it does, it’s just a nudge for the log writer to start its task, there is nothing stateful about the message and in the message itself, like an SCN to write up to.

The second thing to think about is that the KSR/reliable message mechanism to my knowledge itself does not suffer from issues or widespread bugs. This means that, for what I’ve seen (for what that’s worth :-D), the KSR code for reliable messaging is unlikely to be an issue itself if you see this event. It is probably the function of the message that the KSR code is trying to publish on the channel to its subscribers that is causing the reliable message wait.

For this to dive deeper into, I first need to restate what I’ve said at the beginning: in order to sensibly investigate long times spend in a wait event, any wait event really, the very first thing you need to look for is if the system the database is running on is functioning in a normal way. In other words: if the issue you see being represented by a wait event is not an issue outside of the database, for which the database has made your work easier because it has timed the manifestation of the issue outside of the database for you, instead of being an issue inside the database.

For the reliable message to be investigated, you need to understand the logic of reliable messaging in the oracle database. Let me take a truncate table as an example; a ‘truncate table ‘ requires the current blocks to be written to disk. To do that, it sends a reliable message to the checkpointer via the ‘kcbo’ channel, for which the description is ‘obj broadcast channel’ channel.

If you freeze the checkpointer process, it will not be able to do anything, and thus it cannot consume the message that is send by the process that executes a truncate. One way of freezing a process is to attach to it with a debugger, like gdb. Of course this only ever should be done in a test environment and not in production, because freezing (background) processes can have all kinds of bad side effects, and it will almost certainly influence anything else that is done.

If you look at the truncating process, you will see the wait event ‘reliable message’. Take the p1 value from the waiting process and execute the following SQL:

-- replace <p1> with the p1 value of the reliable message wait event
select des.id_ksrcdes, des.name_ksrcdes
from x$ksrcctx ctx, x$ksrcdes des 
where to_number(ctx.addr,'XXXXXXXXXXXXXXXX') = <p1> 
and ctx.name_ksrcctx = des.indx;

This shows the channel and the description of the channel the reliable message is waiting for:

ID_KSRCDES                        NAME_KSRCDES
--------------------------------- ---------------------------------------------
kcbo                              obj broadcast channel

To understand more you can use the x$ksrchdl view and link these with the publishers and subscribers of the channel context:

-- inspired by and largely taken from Tanel Poder's channels2.sql script (https://github.com/tanelpoder/tpt-oracle/blob/master/channel2.sql)
-- replace <p1> with the p1 value of the reliable message wait event
select case when bitand(c.flags_ksrchdl,1)=1 then 'PUB ' end || 
       case when bitand(c.flags_ksrchdl,2)=2 then 'SUB ' end || 
       case when bitand(c.flags_ksrchdl,16)=16 then 'INA ' end flags,
       s.sid,
       p.program,
       cd.name_ksrcdes channel_name, 
       cd.id_ksrcdes, 
       c.ctxp_ksrchdl
from   x$ksrchdl c , 
       v$process p, 
       v$session s, 
       x$ksrcctx ctx, 
       x$ksrcdes cd
where 1=1
and to_number(c.ctxp_ksrchdl,'XXXXXXXXXXXXXXXX') = <p1>
and s.paddr(+)=c.owner_ksrchdl 
and p.addr(+)=c.owner_ksrchdl
and c.ctxp_ksrchdl=ctx.addr
and cd.indx=ctx.name_ksrcctx
;

In my case, this shows the following information:

FLAGS          SID PROGRAM                                                      CHANNEL_NAME                             ID_KSRCDES CTXP_KSRCHDL
------------ ----- ------------------------------------------------------------ ---------------------------------------- ---------- ----------------
SUB            258 oracle@oracle-database.local (CKPT)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            141 oracle@oracle-database.local (M001)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            142 oracle@oracle-database.local (W005)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            145 oracle@oracle-database.local (W007)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            396 oracle@oracle-database.local (TNS V1-V3)                     obj broadcast channel                    kcbo       000000007AAB3020

The important bit here is that this channel; “kcbo” has 4 publishers (“my” session, and the background processes M001, W005 and W007) and one subscriber: CKPT, the checkpoint processes. In this case, the information about the process we are waiting for could be guessed, not only because I caused the waiting situation myself, but because the “final_blocking_session” field in v$session would point you to it. But in the case of multiple subscribers, you can use this query to find all the processes that are currently participating in the channel.

A subscribing session essentially does 3 things of importance to the reliable message wait:
1. It finds a published message in the channel it is subscribed to, and starts acting on the message.
2. It performs the task or tasks that should be done as a subscriber to that channel.
3. It ‘consumes’ the message, leading to end the reliable message wait event for the publisher.

In the case of a truncate, the foreground process executing the truncate finds the handle, adds itself as publisher to the handle, adds the message and then posts the subscriber(s) using ‘semop’ to activate them. Then it goes into a function called ‘ksrpubwait_ctx’, and waits for the subscribers to do their work in the ‘reliable message’ wait event. The wait is implemented as a semaphore wait (semtimedop) with a 1 second timeout, as the MOS note indicates.

The subscriber, which is the checkpointer process, will be woken from its semtimedop “sleep” if idle, and goes off performing its work cycle. One of the calls in the checkpointer work cycle is ksb_act_run>ksb_act_run_int>kcbo_subch, which performs the task of inspecting the kcbo “KSR channel” to see if work is published.

If so, it performs the task of setting the RO enqueues and add a checkpoint to the checkpoint queue of type 10: object reuse checkpoint and post the logwriter, after which the checkpointer is done, which means it will post the publisher.

Once the publisher is posted, which is the foreground, it will stop waiting for the ‘reliable message’ wait event, and continue on working. Depending on the speed of all the processes and lots of other things, the next thing that happens might be that the foreground now will try to uniquely obtain the RO enqueues and will not be able to do that, or the background processes might have finished their work and released the RO enqueues. But these dealings are not the point of this blogpost.

What I do hope you understand when you’ve reached this part, is that the ‘reliable message’ event is part of the database infrastructure that serves different functions, and that you should look into the specific function for which the KSR infrastructure is used instead of the KSR infrastructure itself.

Advertisement

Warning! This is a post about Oracle database internals for internals lovers and researchers. For normal, functional administration, this post serves no function. The post shows a little tool I created which consists of a small database I compiled with Oracle database C function names and a script to query it. The reason that keeping such a database makes sense in the first place, is because the Oracle C functions for the Oracle database are setup in an hierarchy based on the function name. This means you can deduct what part of the execution you are in by looking at the function name; for example ‘kslgetl’ means kernel service lock layer, get latch.

To use this, clone git repository at https://gitlab.com/FritsHoogland/ora_functions.git

Use the ‘sf.sh’ (search function) script to query the functions. The script uses sqlite3, use the sqlite rpm package to add this (on linux), or use the packaging method of your platform.

This is how to use sf.sh :

$ ./sf.sh kglGetMutex
kglGetMutex -- kernel generic lock management
---

This shows the sf.sh script could find the first 3 letters (underlined), which probably mean kernel generic lock management. It couldn’t find ‘GetMutex’ however that is so self-explanatory that it doesn’t need annotation.

This is how a full function annotation looks like:

$ ./sf.sh qercoFetch
qercoFetch -- query execute rowsource count fetch
----------

Here the full function is underlined, which means the entire function is found.

If you found explanations for Oracle database code locations, or found an error in the annotation (I made assumptions here and there), please send them to frits.hoogland@gmail.com, or react to this post, so I can add or update it.

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manualis to think of Pin as a JIT (just in time) compiler, where the compiler does not take byte code (as JIT compilation does with Java), but the executable of the process pin is executed against. This means pin inserts itself into the process’ execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see this process an oracle foreground (rows 1-3), after which we see a lot of stuff that pin pushed into the process’ address space.

For this reason, it’s vitally important to use and run pin as the same user as the process you want to run pin against. The way pin works is that, upon execution of pin, the pin executable inserts itself into the process’ address space, gains control and then tries to load necessary libraries. If it can’t find these libraries, it will send a SIGKILL to the process, effectively stopping it!

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This will output a lot of stuff on your screen and compiles everything in the tools directory.

Now that we have seen an introduction, let’s use a Pin tool to do something useful! One such useful is the tool ‘DebugTrace’. First obtain the process id (pid) from an Oracle server foreground process. Then execute the pin tool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so

Upon execution, nothing is returned, because the majority of the things are happening in the process pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let’s look what this produces! This is how my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it look rather big (remember I only ran ‘select * from dual’)!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a ‘return underflow’, which is because these functions have been called before the trace was put on the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called and first two arguments of the function.
Returning: the address and function where the return is executed, and the returncode.
Tailcall: this is a function that is called as the last instruction in the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).

Please mind currently I am not aware of a way to remove the pin framework and tool from a process address space. Also mind that because of how this works (a lot of intercepting in the execution flow), it will slow down the process significantly. Use this tool and the above described techniques at your own risk.

Recently I was investigating the inner working of Oracle. One of the things that is fundamental to the Oracle database is the SCN (system change number). SCNs are used to synchronise changes in the database. There is one source for SCNs in every instance (kcbgscn; the global or current SCN in the fixed SGA), and there are multiple tasks for which Oracle keeps track of synchronisation using SCNs. A few of these tasks for which Oracle stores and uses SCNs to keep track of progress are on disk SCN and lwn SCN.

This blogpost is about some oddities I found when using gdb (the GNU debugger) to watch memory locations of a running Oracle database. This should not be done on a production instance, and is purely for research purposes. Only use the methods mentioned in this article if you are absolutely sure what you are doing, and/or if you are using an Oracle instance that can be crashed and can be restored.

The situation
I was researching the ‘on disk scn’ in a 11.2.0.2 database. I used the following (simplified) gdb macro:

break kcrf_commit_force
  commands
    print $rip
    c
  end
awatch *0x60022659
  commands
    c
  end

This macro breaks on the function kcrf_commit_force essentially to understand the different phases of a commit, and uses the ‘awatch’ function (access watch; which means this watchpoint fires at both reading and writing) to look at memory location 0x60022659. That memory location is the on disk SCN for this instance.

Upon executing a commit after an insert, I saw the following output:

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$1 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Old value = 898
New value = 899
0x0000000008fb5ad1 in kcscur3 ()

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$2 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()

What this shows, is the following:
– line 1 and 9: the kcrf_commit_force function is called.
– line 11, 15 and 19: the memory location *0x60022659 is read, resulting in the value 899.
– line 3: the memory location *0x60022659 is written, it contained 898, and the new value is 899.

This greatly puzzled me. As far as I understand the way Oracle works, a server process that issues ‘commit’ is supposed to READ the on disk SCN, it does not write the on disk SCN, because the only process that changes the on disk SCN is the logwriter. However, quite clearly gdb tells me the server process is writing it. Further consideration is that IF the server process is writing it, where does this information come from?

After some further investigation, I came to the conclusion that the only way to prove to myself that what I am seeing is correct, is to create a program myself mimicking the behaviour witnessed above to truly understand what is going on, because I don’t believe what I am seeing is correct.

I created two little c programs. One that starts up, creates a shared memory segment, writes a letter into the shared memory segment and then waits for a keystroke, and puts the keystroke into the same location, and one that reads the same shared memory segment, then waits for enter to pressed, and reads the shared memory segment again, and outputs if the letter read from the shared memory is different. I called the first one ‘server’, and the second one ‘client’. This way, I can attach gdb to the client executable, which very much like an Oracle server process, reads a shared memory location which I can change at will.

The code can be found on GitHub: shared memory test source. Please mind this is simple proof of concept code that is not exhaustive in error checking and other sanity checks in any way. The source files can be compiled and linked in the following way:

$ cc server.c -o server
$ cc client.c -o client

The server executable can be run in the following way:

$ ./server
Type a letter. q quits.

This creates the shared memory segment, and puts the ‘a’ character in the variable c (line 42, c = ‘a’). As long as the c variable is not ‘q’, it writes the c variable into shared memory segment using the pointer s (*s = c) , and then waits for user input using the scanf function.

Once the shared memory segment is created and written into, the client program can be started:

$ ./client
client started. ^C to quit.
waiting for enter

The first time enter is pressed it will output ‘keystroke: a’ because the p variable was not initialised, and will be assigned the ‘a’ character (p = *s), the second time enter is pressed, ‘keystroke:’ is not printed, because the p variable and the memory location the pointer *s points to both contain ‘a’. What is vital, is that the client program NEVER writes, only reads the shared memory location, in three occasions: in line 50:

if( p != *s) {

In line 51:

printf("keystroke: %c\n", *s);

And in line 52:

p = *s;

Okay. So at this point we got a program that mimics what I think an Oracle server process is doing too on commit time (reading shared memory), which is the program ‘client’, and we got a program which creates and can manipulate the shared memory location the client is reading from, which is ‘server’. Startup the server program (./server), and startup the client using gdb:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb)

This is different than attaching to a process using gdb -p PID, but attaching to a process using ‘-p’ is the only way that I am aware of to use gdb on Oracle database processes. Now that we started gdb with the client program, we must first run it in order to see the program’s runtime environment:

(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter

Now invoke gdb (using control-C) and look at the memory mappings to see where the shared memory segment is:

(gdb) info proc mappings
process 3069
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]

Line 19 contains the shared memory segment, at memory address 0x7ffff7ff8000. To do the same as watching the on disk SCN location in Oracle, put an access watchpoint at 0x7ffff7ff8000:

(gdb) awatch *0x7ffff7ff8000
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now press enter, and we see the shared memory location being accessed three times:

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

If you take the source code, you’ll see that these are the following lines of code at rows 50-52:

       if( p != *s ) {
          printf("keystroke: %c\n", *s);
          p = *s;

Every line reads the shared memory location via the pointer *s. It’s also clear it’s a read because the value in the memory location (97) is displayed once.

Now go over to the server side, and press ‘b’ and enter. Now the shared memory segment contains ‘b’. Next, go back to the client program, and press enter:

waiting for enter

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

Hey! This shows what looks like a write with the first watchpoint, because the memory location is touched, and it reports the value at the location being changed. However, because we know the source code, we know we only read the shared memory location with the client program, not write to it! So what this shows in this case is that gdb knew what the previous value was of the memory location was, and detected that it was changed.

To see how a true write looks like, you can startup the server executable with gdb too, and set an ‘awatch’ watchpoint on the shared memory location exactly the same way as shown with the client program above. This is how that looks like:

(gdb) c
Continuing.
b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x00000000004006e9 in main ()

So conclusion here is that you can’t see the difference between a write to a memory location or a change of value when reading when using an ‘awatch’ watchpoint.

Now that we understand how ‘awatch’ works, would there be a way to make the distinction between reading and writing? Outside of ‘awatch’ (access watchpoint), there’s the ‘watch’ command, which is a watchpoint on writing to a memory location and ‘rwatch’, which is a watchpoint on reading a memory location. This is how the output of ‘rwatch’ looks like in the client program:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7b23670 in __read_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info proc map
process 3517
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]
(gdb) rwatch *0x7ffff7ff8000
Hardware read watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

At line 1 gdb is started with the ‘client’ program. Once inside gdb, the executable is run (line 12). Once it’s running we interrupt in order to get to gdb (line 16). In gdb, the info proc map command is used (line 20) to look at the memory mappings. From the memory mappings, we see address 0x7ffff7ff8000 is the shared memory segment. A read watch is set and a c (continue) command is added to the read watchpoint (line 46 and 48). Then the execution is continued with the c command (line 53).

Now we are running in the executable again. After continuing, I pressed enter to go through the loop in the client executable, and the watchpoint is triggered three times (lines 56, 60 and 68), as we did see earlier, for source code lines 50, 51 and 52.

Now I changed the letter in the shared memory segment from ‘a’ to ‘b’ with the server executable, and pressed enter at line 60. The interesting things that happens is that the read watchpoint is triggered TWO times. This should be three times (!!).

There is a way to make the now invisible read visible again, using the following command in gdb:

(gdb) watch *0x7ffff7ff8000
Hardware watchpoint 2: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Yes, that’s a WRITE watchpoint. After adding a write watchpoint, this is how the output looks like when I change the letter from ‘b’ to ‘c’:

Hardware watchpoint 2: *0x7ffff7ff8000

Old value = 98
New value = 99
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040071e in main ()
keystroke: c
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040073c in main ()
waiting for enter

There we have the tree reads visible again, however, one is ‘disguised’ as a write. To conclude: it seems gdb is not able to make a distinction between reads and writes for watchpoints if the watchpoint memory location is changed outside of the executable active in the debugger. Gdb seems to determine memory access a write if the value found is different from the last it was read.

Bonus information, not applicable to Oracle:
When you have the source code of an application, you can make debugging even better by adding in debug information for gdb in the executable. This is done by adding the ‘-g’ flag to the cc command:

$ cc -g client.c -o client

With the debug information available, it’s easier to debug:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...done.
(gdb) tbreak 1
Temporary breakpoint 1 at 0x40067c: file client.c, line 1.
(gdb) run
Starting program: /home/oracle/src/shm_test/client

Temporary breakpoint 1, main () at client.c:22
22	    key = 666;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info local
shmid = 0
key = 0
shm = 0x400590 "1\355I\211\321^H\211\342H\203\344\360PTI\307\300P\a@"
s = 0x7fffffffe550 "\001"
p = 0 '\000'
(gdb) awatch s
Hardware access (read/write) watchpoint 2: s
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Old value = 0x7fffffffe550 "\001"
New value = 0x7ffff7ff8000 "a"
main () at client.c:45
45	    printf("client started. ^C to quit.\n");
(gdb) c
Continuing.
client started. ^C to quit.
waiting for enter

Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040070f in main () at client.c:50
50	       if( p != *s ) {
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040071b in main () at client.c:51
51	          printf("keystroke: %c\n", *s);
(gdb) c
Continuing.
keystroke: a
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x0000000000400739 in main () at client.c:52
52	          p = *s;
(gdb) c
Continuing.
waiting for enter

Above you see starting up the executable with gdb at line 1. After gdb is started, I first set a temporal breakpoint on line 1 (of the source code, shown on line 12 above), to have gdb ‘enter’ the executable and stop when I run it. If I wouldn’t do that, the program would ‘wait’ in the getchar() function, which is a function in libc, and the libc library would be my scope, line 19 tells that I have not loaded debug symbols for glibc. After setting the breakpoint, I run the executable, which immediately breaks on the temporal breakpoint (line 14).

Now that the executable is active, I can list the variables used using the info local command (line 20). Also, I can simply set a watchpoint on the variable s (line 26), instead of having to search the memory address myself. When the watchpoint is triggered, it now shows the line number in the source file, and the actual source code on that line (lines 34, 35, 44, 45, etc).

strace is a linux utility to profile system calls. Using strace you can see the system calls that a process executes, in order to investigate the inner working or performance. In my presentation about multiblock reads I put the text ‘strace lies’. This is NOT correct. My current understanding is that strace does show every system call made by an executable. So…why did I make that statement? (editorial note: this article dives into the inner working of Linux AIO)

During the hotsos symposium in Dallas I was chatting with Tanel Poder, and he asked me to look a little bit more into the linux io_getevents() call and strace, because there might be an optimisation which keeps the call from truly issuing a system call, which means strace could be right. We started thinking about it a bit, and came to the conclusion it should be possible for the linux AIO code to cut the corner and peek at the IOs before executing the io_getevents system call (as a spoiler: because the IO context is in userspace).

So, what to do to investigate this? Well, let’s just look at how it works. The Oracle executable executes io_getevents_0_4() in order to do the system call io_getevents(). The function io_getevents_0_4() comes from libaio (the linux asynchronous IO library). After a small search, it appears libaio has a git source code repository, so we can peek into the source code directly from our browser!

If you browse to the source tree, you see the file io_getevents.c. If you click on it, you see the contents of this file, which has the function io_getevents_0_4() in it. This is a very simple function (actual function source code):

int io_getevents_0_4(io_context_t ctx, long min_nr, long nr, struct io_event * events, struct timespec * timeout)
{
	struct aio_ring *ring;
	ring = (struct aio_ring*)ctx;
	if (ring==NULL || ring->magic != AIO_RING_MAGIC)
		goto do_syscall;
	if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
		if (ring->head == ring->tail)
			return 0;
	}
	
do_syscall:	
	return __io_getevents_0_4(ctx, min_nr, nr, events, timeout);
}

If you look at line 7, you see ‘if (timeout!=NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0)’. In other words: if timeout (struct) is set to any value (not NULL), and if the timeout->tv_sec (the seconds portion of the timeout struct) is set to 0 and if the timeout->nsec (the nanoseconds portion of the timeout struct) is set to 0, we enter this function. Once in the function, we look at the struct ring, which is defined as a struct aio_ring from the pointer ctx which is passed to the function io_getevents_0_4(); the first argument. If ring->head is the same as ring-> tail, in other words: if the ring (buffer) is empty, we cut the corner and return 0, without executing the system call. In any other case, the function __io_getevents_0_4() is executed, which executes the system call.

A way to check if this truly is happening, is using the gdb ‘catch syscall’ functionality. In my investigation, I executed ‘break io_getevents_0_4’, which breaks on the userland portion of the io_getevents() function, ‘catch syscall io_getevents’, which breaks when the system call truly is executed, and ‘break io_submit’ to understand which getevents are executed in what number. I setup a testcase with a sqlplus session with the server process throttled to 1 IO per second (see my article on using cgroups to throttle IO), attached to the server process with gdb, and executed the following commands:

break io_submit
commands
silent
printf "io_submit\n"
c
end
break io_getevents_0_4
commands
silent
printf "io_getevents_0_4-libaio\n"
c 
end
catch syscall io_getevents
commands
silent
printf "io_getevents-syscall\n"
c
end

Next I executed a SQL which did a direct path full table scan. This is the result:

io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall
io_submit
io_submit
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents_0_4-libaio
io_getevents-syscall
io_getevents-syscall

If you recall what is in the about multiblock reads presentation: after the io_submit ‘phase’, Oracle executes up to 4 io_getevents() calls non-blocking to look for IO. In this case you see the calls being done in user land, but not making it to the system call, because of the shortcut in the io_getevents_0_4() code. After 4 times, Oracle executes io_getevents() with timeout set to 600 seconds, which makes the call truly execute a system call. Please mind that ‘catch syscall’ triggers twice (as can be seen from the two ‘io_getevents-syscall’ in the above example), but is in reality only 1 system call. This proves the working of the code of the io_getevents_0_4() function we looked into, and the reason why I thought the strace utility lied.

%d bloggers like this: