Archive

Tag Archives: debugging

This blog is about the oracle database wait event ‘reliable message’. It should be noted that normally, this should not be a prominent wait, and if it does so, the most logical cause would be something that is not working as it should, either by oversubscribing something or simply because of a bug.

The reliable message note on My Oracle Support provides a good description:
https://support.oracle.com/epmos/faces/DocContentDisplay?_afrLoop=34595019638128&id=69088.1&_afrWindowMode=0&_adf.ctrl-state=15ief27plw_97

It does tell something about what a reliable message is:
– A message send using the KSR intra-instance broadcast service.
– The message publisher waits on this wait event until all subscribers have consumed the message.
– The publisher waits for up to one second and then retests if all subscribers have consumed the message.

So the first thing to establish is that a reliable message means that the publisher of the message does wait for all subscribers of the channel it published its message on to have finished consuming the message. That is what ‘reliable’ means with reliable message: all subscribers are guaranteed to have consumed the message. Not all messages in the Oracle executable are ‘reliable’; if you take the log writer notification during redo generation in kcrfw_redo_gen_ext>kcrf_commit_force_int for example, it will only signal the log writer if it isn’t signalled before in the first place, and if it does, it’s just a nudge for the log writer to start its task, there is nothing stateful about the message and in the message itself, like an SCN to write up to.

The second thing to think about is that the KSR/reliable message mechanism to my knowledge itself does not suffer from issues or widespread bugs. This means that, for what I’ve seen (for what that’s worth :-D), the KSR code for reliable messaging is unlikely to be an issue itself if you see this event. It is probably the function of the message that the KSR code is trying to publish on the channel to its subscribers that is causing the reliable message wait.

For this to dive deeper into, I first need to restate what I’ve said at the beginning: in order to sensibly investigate long times spend in a wait event, any wait event really, the very first thing you need to look for is if the system the database is running on is functioning in a normal way. In other words: if the issue you see being represented by a wait event is not an issue outside of the database, for which the database has made your work easier because it has timed the manifestation of the issue outside of the database for you, instead of being an issue inside the database.

For the reliable message to be investigated, you need to understand the logic of reliable messaging in the oracle database. Let me take a truncate table as an example; a ‘truncate table ‘ requires the current blocks to be written to disk. To do that, it sends a reliable message to the checkpointer via the ‘kcbo’ channel, for which the description is ‘obj broadcast channel’ channel.

If you freeze the checkpointer process, it will not be able to do anything, and thus it cannot consume the message that is send by the process that executes a truncate. One way of freezing a process is to attach to it with a debugger, like gdb. Of course this only ever should be done in a test environment and not in production, because freezing (background) processes can have all kinds of bad side effects, and it will almost certainly influence anything else that is done.

If you look at the truncating process, you will see the wait event ‘reliable message’. Take the p1 value from the waiting process and execute the following SQL:

-- replace <p1> with the p1 value of the reliable message wait event
select des.id_ksrcdes, des.name_ksrcdes
from x$ksrcctx ctx, x$ksrcdes des 
where to_number(ctx.addr,'XXXXXXXXXXXXXXXX') = <p1> 
and ctx.name_ksrcctx = des.indx;

This shows the channel and the description of the channel the reliable message is waiting for:

ID_KSRCDES                        NAME_KSRCDES
--------------------------------- ---------------------------------------------
kcbo                              obj broadcast channel

To understand more you can use the x$ksrchdl view and link these with the publishers and subscribers of the channel context:

-- inspired by and largely taken from Tanel Poder's channels2.sql script (https://github.com/tanelpoder/tpt-oracle/blob/master/channel2.sql)
-- replace <p1> with the p1 value of the reliable message wait event
select case when bitand(c.flags_ksrchdl,1)=1 then 'PUB ' end || 
       case when bitand(c.flags_ksrchdl,2)=2 then 'SUB ' end || 
       case when bitand(c.flags_ksrchdl,16)=16 then 'INA ' end flags,
       s.sid,
       p.program,
       cd.name_ksrcdes channel_name, 
       cd.id_ksrcdes, 
       c.ctxp_ksrchdl
from   x$ksrchdl c , 
       v$process p, 
       v$session s, 
       x$ksrcctx ctx, 
       x$ksrcdes cd
where 1=1
and to_number(c.ctxp_ksrchdl,'XXXXXXXXXXXXXXXX') = <p1>
and s.paddr(+)=c.owner_ksrchdl 
and p.addr(+)=c.owner_ksrchdl
and c.ctxp_ksrchdl=ctx.addr
and cd.indx=ctx.name_ksrcctx
;

In my case, this shows the following information:

FLAGS          SID PROGRAM                                                      CHANNEL_NAME                             ID_KSRCDES CTXP_KSRCHDL
------------ ----- ------------------------------------------------------------ ---------------------------------------- ---------- ----------------
SUB            258 oracle@oracle-database.local (CKPT)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            141 oracle@oracle-database.local (M001)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            142 oracle@oracle-database.local (W005)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            145 oracle@oracle-database.local (W007)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            396 oracle@oracle-database.local (TNS V1-V3)                     obj broadcast channel                    kcbo       000000007AAB3020

The important bit here is that this channel; “kcbo” has 4 publishers (“my” session, and the background processes M001, W005 and W007) and one subscriber: CKPT, the checkpoint processes. In this case, the information about the process we are waiting for could be guessed, not only because I caused the waiting situation myself, but because the “final_blocking_session” field in v$session would point you to it. But in the case of multiple subscribers, you can use this query to find all the processes that are currently participating in the channel.

A subscribing session essentially does 3 things of importance to the reliable message wait:
1. It finds a published message in the channel it is subscribed to, and starts acting on the message.
2. It performs the task or tasks that should be done as a subscriber to that channel.
3. It ‘consumes’ the message, leading to end the reliable message wait event for the publisher.

In the case of a truncate, the foreground process executing the truncate finds the handle, adds itself as publisher to the handle, adds the message and then posts the subscriber(s) using ‘semop’ to activate them. Then it goes into a function called ‘ksrpubwait_ctx’, and waits for the subscribers to do their work in the ‘reliable message’ wait event. The wait is implemented as a semaphore wait (semtimedop) with a 1 second timeout, as the MOS note indicates.

The subscriber, which is the checkpointer process, will be woken from its semtimedop “sleep” if idle, and goes off performing its work cycle. One of the calls in the checkpointer work cycle is ksb_act_run>ksb_act_run_int>kcbo_subch, which performs the task of inspecting the kcbo “KSR channel” to see if work is published.

If so, it performs the task of setting the RO enqueues and add a checkpoint to the checkpoint queue of type 10: object reuse checkpoint and post the logwriter, after which the checkpointer is done, which means it will post the publisher.

Once the publisher is posted, which is the foreground, it will stop waiting for the ‘reliable message’ wait event, and continue on working. Depending on the speed of all the processes and lots of other things, the next thing that happens might be that the foreground now will try to uniquely obtain the RO enqueues and will not be able to do that, or the background processes might have finished their work and released the RO enqueues. But these dealings are not the point of this blogpost.

What I do hope you understand when you’ve reached this part, is that the ‘reliable message’ event is part of the database infrastructure that serves different functions, and that you should look into the specific function for which the KSR infrastructure is used instead of the KSR infrastructure itself.

Advertisement

This post is the result of a question that I got after presenting a session about Oracle database mutexes organised by ITOUG, as a response to the conference cancellations because of COVID-19. Thank Gianni Ceresa for asking me!

The library cache provides shared cursors and execution plans. Because they are shared, sessions can take advantage of the work of previous sessions of creating these. However, by having these shared, access needs to be regulated not to have sessions overwrite each other’s work. This is done by mutexes.

The question I got was (this is a paraphrased from my memory): ‘when using pluggable databases, could a session in one pluggable database influence performance of a session in another pluggable database’?

The answer I gave was that I didn’t test a multi-tenant scenario, but because the library cache is shared between the pluggable databases, it should be possible for a session in one pluggable database to block another session in another pluggable database.

So let’s test it!

I used an Oracle version 20.2 database, which automatically gives you a pluggable database. In fact, even when you don’t specify you want a multi-tenant database, it will create one. This is as expected and documented.

I created an instance called o202, and two pluggable databases, PDB1 and PDB2. Yes, I am a creative genius.

I logged in to PDB1 and executed a ‘select 8 from dual’.

SQL> show con_id

CON_ID
------------------------------
3

SQL> select 8 from dual;

         8
----------
         8

Now using another session in the root container, I dumped the library cache at level 16 to see how that looks like for the ‘select 8 from dual’ cursor:

SQL> show con_id

CON_ID
------------------------------
1

SQL> alter session set events 'immediate trace name library_cache level 16';

Session altered.

(don’t do this on a live environment!)
These are snippets of certain parts of the dump:

Bucket: #=100089 Mutex=0x71f7f8d8(627065225216, 6, 0, 6)
  LibraryHandle:  Address=0x67bcca90 Hash=372d86f9 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
    ObjectName:  Name=select 8 from dual
      FullHashValue=f18bf11763dd069341c61ce6372d86f9 Namespace=SQL AREA(00) Type=CURSOR(00) ContainerId=1 ContainerUid=1 Identifier=925730553 OwnerIdn=9
...
    Concurrency:  DependencyMutex=0x67bccb40(0, 1, 0, 0) Mutex=0x67bccbe0(146, 22, 0, 6)
...
        Child:  childNum='0'
          LibraryHandle:  Address=0x6e6b3fd8 Hash=0 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
            Name:  Namespace=SQL AREA(00) Type=CURSOR(00) ContainerId=3
...
            Concurrency:  DependencyMutex=0x6e6b4088(0, 0, 0, 0) Mutex=0x67bccbe0(146, 22, 0, 6)

This is bucket 100089, with its mutex. The counter for this mutex is the second argument, which is 6.
The library cache handle on the next line is the parent handle. The sql text and the hash value sit with the parent handle.
I am not sure how to interpret the “ContainerId” here, it says 1 (indicating the root container), while this SQL is only executed in container 3, which is shown above.
After the first ‘…’, the two mutexes in the parent handle can be seen: the dependency mutex, and the parent handle mutex.
After the second ‘…’, child number 0 is visible. Here the container is specified from which this child was actually executed.
After the third ‘…’, the concurrency information for the child is shown. The important bit is the dependency mutex is independent/unique, whilst the regular mutex is the same/shared from the parent.

Now what would happen if I execute the exact same SQL (select 8 from dual) in another container; PDB2 alias container id 4?

SQL> show con_id

CON_ID
------------------------------
4

SQL> select 8 from dual;

         8
----------
         8

And look at bucket 100089 again (using the dump of the library cache at level 16):

Bucket: #=100089 Mutex=0x73f81e58(8589934592, 4, 0, 6)
  LibraryHandle:  Address=0x656e2638 Hash=372d86f9 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
    ObjectName:  Name=select 8 from dual
      FullHashValue=f18bf11763dd069341c61ce6372d86f9 Namespace=SQL AREA(00) Type=CURSOR(00) ContainerId=1 ContainerUid=1 Identifier=925730553 OwnerIdn=9
...
    Concurrency:  DependencyMutex=0x656e26e8(0, 2, 0, 0) Mutex=0x656e2788(2, 31, 0, 6)
...
        Child:  childNum='0'
          LibraryHandle:  Address=0x656e0ed8 Hash=0 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
            Name:  Namespace=SQL AREA(00) Type=CURSOR(00) ContainerId=3
...
            Concurrency:  DependencyMutex=0x656e0f88(0, 0, 0, 0) Mutex=0x656e2788(2, 31, 0, 6)
...
        Child:  childNum='1'
          LibraryHandle:  Address=0x655fba58 Hash=0 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
            Name:  Namespace=SQL AREA(00) Type=CURSOR(00) ContainerId=4
...
            Concurrency:  DependencyMutex=0x655fbb08(0, 0, 0, 0) Mutex=0x656e2788(2, 31, 0, 6)

Executing an identical SQL statement (which is a cursor object in the library cache) in another PDB means the same library cache hash bucket is used, and container id is one of the sharing criteria for a child. Because the statement in both cases has a different container id, the cursors can not be shared. This is very logical if you think about it: both containers essentially are logically isolated databases, and therefore there is no way to tell if the tables, access rights, data, really anything is the same, so the SQL is parsed again for the other container.

This has some implications. Because of identical SQL, it hashes to the same value, which mean both containers need access to the same bucket to find the pointer to the parent handle. Both containers also need access to the same parent and the same heap 0 to obtain the child list to find their child. Both the bucket and the handle have a single mutex that serializes access.

Access to the parent heap 0 is done shared, and remains shared if a compatible child can be found.

However, as soon as a new child needs to be created for the same SQL, it will create the new child handle pinned in exclusive mode and insert it into the hash table/child list in the parent’s heap 0. The new child is pinned in exclusive mode, because it still needs to allocate the child heaps and create the child information. This new child is the newest entry, and therefore is the first one any session would find if it scans for a compatible child.

At this point multi-tenancy makes a difference:
– If a session in the same container starts parsing and follows the procedure of hashing the SQL text, obtaining the bucket, finding the parent handle and then scan the child list for a compatible child, it will find the child in exclusive mode for this container, and waits for the child creation to finish, waiting on the event: ‘cursor: pin S wait on X’.
– If a session in ANOTHER container starts parsing and follows the procedure of hashing the SQL text, obtaining the bucket, finding the parent handle and then scan the child list for a compatible child, it will find the child in exclusive mode, but because it has a different container id, it will skip the child, and continue to scan the child list for compatible children. The scanning means pinning each child that potentially can be compatible, which means has the same container id, in shared mode, and by doing that either finds a compatible child, or if it doesn’t, create one itself.

So what does that mean; the conclusion:
The library cache hash table, the parent cursor and the child list/hash table are shared between all containers. These require unique (library cache hash table/handle) or shared (child list/hash table) access for each database process for usage, but the actual time it is held in these modes is so short that it is negligible.
The main reason for waiting for a cursor/library cache entry is a child pinned in exclusive mode during creation (‘cursor: pin S wait on X’).
When an identical statement is executed in another container, it will skip a child that is created or in the process of being created in another container, even when it’s pinned in exclusive mode.

This means that the multi-tenant option in my opinion does not significantly increase the risk of waiting because of library cache concurrency, specifically because it can skip child cursors pinned exclusively in another container.

Update april 4: I tested having a child from one container being pinned in X mode and running the same SQL in another container in 12.1.0.2.0 (base release of 12.1.0.2) and it works exactly the same as described above. So whilst the library cache hash table and parent handles are shared, the cursor children are specific to a container and do not lock each other out.

Recently, Franck Pachot tweeted the following:

This is a very clever way of using Brendan Gregg’s flame graphs for Oracle database process flame graphs and using my ora_functions project database.

However, the awk script uses a full match using f[$2]:
– f is an array with the function name annotations filled using {f[$1]=$2;next}.
– // means to only work on lines that have in it.
– The lines have the oracle function stuck to , for example: kcbgtcr.
– The first sub() function (substitute); which is sub(“”,”& “) places a space after “” so it looks like: kcbgtcr; “&” means “the first argument of the sub function”.
– FS=” ” means the field separator is a space.
– The next sub() function; which is sub(“”,””f[$2]”&”) substitutes on the pattern, and uses the f array using the second field ($2) of the input, which now is the oracle function thanks to the field separator and the previous sub() function.

But…the function names in the database/csv file are build up on layers. I done this deliberately to avoid having to type all the layer names for a function when entering new ones, but more importantly so I can actually identify the code layers in the oracle executable. Because of the layers, I don’t need the full function descriptions, I can still get an understanding what is going on by looking in what layer a function is called.

To get the layers in the function names, I added a script in my ora_functions repository called ‘annotate_flamegraph.awk’, which takes the function name csv file actually exactly like Franck did, but now builds up the annotation of every function in an existing flamegraph. It simply takes an existing flamegraph SVG file with oracle database functions as an input, and outputs the annotated version on STDOUT, so this is how it’s run to get an annotated SVG:

$ ./annotate_flamegraph.awk dt_11856.txt_1.fold.svg > dt_11856.txt_1.fold.annotated.svg

This is how it looks like (please mind these are snapshots from flamegraph output in a browser):

(click on the graph to see it in the original size)
If you hoover over a row in the flamegraph, it will show the full information at the bottom row, in this case I hoovered over ttcpip.
If you click a row, it will zoom in to the row. The next picture is a good way to show that the function layers and the stack make it obvious what is going on, while most of the functions do not have the full annotation:

(click on the graph to see it in the original size)

Again, thanks to Franck for the inspiration. I hope this will be useful for investigating what oracle is doing!

Recently I was investigating the inner working of Oracle. One of the things that is fundamental to the Oracle database is the SCN (system change number). SCNs are used to synchronise changes in the database. There is one source for SCNs in every instance (kcbgscn; the global or current SCN in the fixed SGA), and there are multiple tasks for which Oracle keeps track of synchronisation using SCNs. A few of these tasks for which Oracle stores and uses SCNs to keep track of progress are on disk SCN and lwn SCN.

This blogpost is about some oddities I found when using gdb (the GNU debugger) to watch memory locations of a running Oracle database. This should not be done on a production instance, and is purely for research purposes. Only use the methods mentioned in this article if you are absolutely sure what you are doing, and/or if you are using an Oracle instance that can be crashed and can be restored.

The situation
I was researching the ‘on disk scn’ in a 11.2.0.2 database. I used the following (simplified) gdb macro:

break kcrf_commit_force
  commands
    print $rip
    c
  end
awatch *0x60022659
  commands
    c
  end

This macro breaks on the function kcrf_commit_force essentially to understand the different phases of a commit, and uses the ‘awatch’ function (access watch; which means this watchpoint fires at both reading and writing) to look at memory location 0x60022659. That memory location is the on disk SCN for this instance.

Upon executing a commit after an insert, I saw the following output:

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$1 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Old value = 898
New value = 899
0x0000000008fb5ad1 in kcscur3 ()

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$2 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()

What this shows, is the following:
– line 1 and 9: the kcrf_commit_force function is called.
– line 11, 15 and 19: the memory location *0x60022659 is read, resulting in the value 899.
– line 3: the memory location *0x60022659 is written, it contained 898, and the new value is 899.

This greatly puzzled me. As far as I understand the way Oracle works, a server process that issues ‘commit’ is supposed to READ the on disk SCN, it does not write the on disk SCN, because the only process that changes the on disk SCN is the logwriter. However, quite clearly gdb tells me the server process is writing it. Further consideration is that IF the server process is writing it, where does this information come from?

After some further investigation, I came to the conclusion that the only way to prove to myself that what I am seeing is correct, is to create a program myself mimicking the behaviour witnessed above to truly understand what is going on, because I don’t believe what I am seeing is correct.

I created two little c programs. One that starts up, creates a shared memory segment, writes a letter into the shared memory segment and then waits for a keystroke, and puts the keystroke into the same location, and one that reads the same shared memory segment, then waits for enter to pressed, and reads the shared memory segment again, and outputs if the letter read from the shared memory is different. I called the first one ‘server’, and the second one ‘client’. This way, I can attach gdb to the client executable, which very much like an Oracle server process, reads a shared memory location which I can change at will.

The code can be found on GitHub: shared memory test source. Please mind this is simple proof of concept code that is not exhaustive in error checking and other sanity checks in any way. The source files can be compiled and linked in the following way:

$ cc server.c -o server
$ cc client.c -o client

The server executable can be run in the following way:

$ ./server
Type a letter. q quits.

This creates the shared memory segment, and puts the ‘a’ character in the variable c (line 42, c = ‘a’). As long as the c variable is not ‘q’, it writes the c variable into shared memory segment using the pointer s (*s = c) , and then waits for user input using the scanf function.

Once the shared memory segment is created and written into, the client program can be started:

$ ./client
client started. ^C to quit.
waiting for enter

The first time enter is pressed it will output ‘keystroke: a’ because the p variable was not initialised, and will be assigned the ‘a’ character (p = *s), the second time enter is pressed, ‘keystroke:’ is not printed, because the p variable and the memory location the pointer *s points to both contain ‘a’. What is vital, is that the client program NEVER writes, only reads the shared memory location, in three occasions: in line 50:

if( p != *s) {

In line 51:

printf("keystroke: %c\n", *s);

And in line 52:

p = *s;

Okay. So at this point we got a program that mimics what I think an Oracle server process is doing too on commit time (reading shared memory), which is the program ‘client’, and we got a program which creates and can manipulate the shared memory location the client is reading from, which is ‘server’. Startup the server program (./server), and startup the client using gdb:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb)

This is different than attaching to a process using gdb -p PID, but attaching to a process using ‘-p’ is the only way that I am aware of to use gdb on Oracle database processes. Now that we started gdb with the client program, we must first run it in order to see the program’s runtime environment:

(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter

Now invoke gdb (using control-C) and look at the memory mappings to see where the shared memory segment is:

(gdb) info proc mappings
process 3069
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]

Line 19 contains the shared memory segment, at memory address 0x7ffff7ff8000. To do the same as watching the on disk SCN location in Oracle, put an access watchpoint at 0x7ffff7ff8000:

(gdb) awatch *0x7ffff7ff8000
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now press enter, and we see the shared memory location being accessed three times:

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

If you take the source code, you’ll see that these are the following lines of code at rows 50-52:

       if( p != *s ) {
          printf("keystroke: %c\n", *s);
          p = *s;

Every line reads the shared memory location via the pointer *s. It’s also clear it’s a read because the value in the memory location (97) is displayed once.

Now go over to the server side, and press ‘b’ and enter. Now the shared memory segment contains ‘b’. Next, go back to the client program, and press enter:

waiting for enter

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

Hey! This shows what looks like a write with the first watchpoint, because the memory location is touched, and it reports the value at the location being changed. However, because we know the source code, we know we only read the shared memory location with the client program, not write to it! So what this shows in this case is that gdb knew what the previous value was of the memory location was, and detected that it was changed.

To see how a true write looks like, you can startup the server executable with gdb too, and set an ‘awatch’ watchpoint on the shared memory location exactly the same way as shown with the client program above. This is how that looks like:

(gdb) c
Continuing.
b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x00000000004006e9 in main ()

So conclusion here is that you can’t see the difference between a write to a memory location or a change of value when reading when using an ‘awatch’ watchpoint.

Now that we understand how ‘awatch’ works, would there be a way to make the distinction between reading and writing? Outside of ‘awatch’ (access watchpoint), there’s the ‘watch’ command, which is a watchpoint on writing to a memory location and ‘rwatch’, which is a watchpoint on reading a memory location. This is how the output of ‘rwatch’ looks like in the client program:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7b23670 in __read_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info proc map
process 3517
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]
(gdb) rwatch *0x7ffff7ff8000
Hardware read watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

At line 1 gdb is started with the ‘client’ program. Once inside gdb, the executable is run (line 12). Once it’s running we interrupt in order to get to gdb (line 16). In gdb, the info proc map command is used (line 20) to look at the memory mappings. From the memory mappings, we see address 0x7ffff7ff8000 is the shared memory segment. A read watch is set and a c (continue) command is added to the read watchpoint (line 46 and 48). Then the execution is continued with the c command (line 53).

Now we are running in the executable again. After continuing, I pressed enter to go through the loop in the client executable, and the watchpoint is triggered three times (lines 56, 60 and 68), as we did see earlier, for source code lines 50, 51 and 52.

Now I changed the letter in the shared memory segment from ‘a’ to ‘b’ with the server executable, and pressed enter at line 60. The interesting things that happens is that the read watchpoint is triggered TWO times. This should be three times (!!).

There is a way to make the now invisible read visible again, using the following command in gdb:

(gdb) watch *0x7ffff7ff8000
Hardware watchpoint 2: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Yes, that’s a WRITE watchpoint. After adding a write watchpoint, this is how the output looks like when I change the letter from ‘b’ to ‘c’:

Hardware watchpoint 2: *0x7ffff7ff8000

Old value = 98
New value = 99
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040071e in main ()
keystroke: c
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040073c in main ()
waiting for enter

There we have the tree reads visible again, however, one is ‘disguised’ as a write. To conclude: it seems gdb is not able to make a distinction between reads and writes for watchpoints if the watchpoint memory location is changed outside of the executable active in the debugger. Gdb seems to determine memory access a write if the value found is different from the last it was read.

Bonus information, not applicable to Oracle:
When you have the source code of an application, you can make debugging even better by adding in debug information for gdb in the executable. This is done by adding the ‘-g’ flag to the cc command:

$ cc -g client.c -o client

With the debug information available, it’s easier to debug:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...done.
(gdb) tbreak 1
Temporary breakpoint 1 at 0x40067c: file client.c, line 1.
(gdb) run
Starting program: /home/oracle/src/shm_test/client

Temporary breakpoint 1, main () at client.c:22
22	    key = 666;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info local
shmid = 0
key = 0
shm = 0x400590 "1\355I\211\321^H\211\342H\203\344\360PTI\307\300P\a@"
s = 0x7fffffffe550 "\001"
p = 0 '\000'
(gdb) awatch s
Hardware access (read/write) watchpoint 2: s
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Old value = 0x7fffffffe550 "\001"
New value = 0x7ffff7ff8000 "a"
main () at client.c:45
45	    printf("client started. ^C to quit.\n");
(gdb) c
Continuing.
client started. ^C to quit.
waiting for enter

Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040070f in main () at client.c:50
50	       if( p != *s ) {
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040071b in main () at client.c:51
51	          printf("keystroke: %c\n", *s);
(gdb) c
Continuing.
keystroke: a
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x0000000000400739 in main () at client.c:52
52	          p = *s;
(gdb) c
Continuing.
waiting for enter

Above you see starting up the executable with gdb at line 1. After gdb is started, I first set a temporal breakpoint on line 1 (of the source code, shown on line 12 above), to have gdb ‘enter’ the executable and stop when I run it. If I wouldn’t do that, the program would ‘wait’ in the getchar() function, which is a function in libc, and the libc library would be my scope, line 19 tells that I have not loaded debug symbols for glibc. After setting the breakpoint, I run the executable, which immediately breaks on the temporal breakpoint (line 14).

Now that the executable is active, I can list the variables used using the info local command (line 20). Also, I can simply set a watchpoint on the variable s (line 26), instead of having to search the memory address myself. When the watchpoint is triggered, it now shows the line number in the source file, and the actual source code on that line (lines 34, 35, 44, 45, etc).

This blogpost is about how to print the system call arguments of a system call which is caught with ‘catch’ or ‘break’ in gdb. The reason for this blogpost is I spend quite some time on searching for this, and working around this, so writing it in a blogpost might help others who spend (some of) their time in the gdb debugger, and encounter the same issue.

When you break on a system call in gdb, it will show you something like this:

Breakpoint 2, semctl () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

All fine, but most of the people who break on something want to know what the arguments of the call are! We are actually breaking on what is called a “syscall wrapper” (explanation in link). This means we do not have the arguments of the system call nicely shown on screen, as we would like (will, I do!).

An example of a system call which does show it’s arguments nice and dandy on screen is “io_submit ()”:

Breakpoint 3, io_submit (ctx=0x7f42badba000, nr=1, iocbs=0x7fff0cc0f30) at io_submit.c:23
23      io_syscall(in, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

This allows me to look into the arguments, like for example:

(gdb) print *iocbs[0]

But now back to the main problem: for some system calls, when I break on them, I end up in the system call wrapper, not showing any arguments. After some time, I found this answer on stackoverflow. Which is exactly what I needed: the system call arguments are passed via registers!!

Let me show you how this works. I was investigating the system call “nanosleep()” recently. In order to understand its arguments, first issue “man nanosleep” to understand the arguments:

NAME
      nanosleep - high-resolution sleep

SYNOPSIS
      #include <time.h>

      int nanosleep(const struct timespec *req, struct timespec *rem);

So, when nanosleep is called, it passes the pointers to two structs of the type “timespec”. The first one is a constant (it determines the specifics of the call), the second one is used if the nanosleep was interrupted with a signal, which means the remaining time is written in it. For the sake of understanding what is happening, I really only want to know what is in the first struct (*req).

So this is how I setup a break on nanosleep() in gdb, and print the argument I want to see:

(gdb) break nanosleep
Breakpoint 2 at 0x32e0e0ef10: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s), 2, one per line.
End with a line saying just "end".
>print (struct timespec) *$rdi
>c
>end
(gdb) c
Continuing.

Breakpoint 2, nanosleep () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
$1 = {tv_sec = 0, tv_nsec = 867779000}

There you go: we got the argument of a system call (that I was interested in) that was “hidden” by the linux system call wrapper.

%d bloggers like this: