Archive

Tag Archives: debug

Warning! This is a post about Oracle database internals for internals lovers and researchers. For normal, functional administration, this post serves no function. The post shows a little tool I created which consists of a small database I compiled with Oracle database C function names and a script to query it. The reason that keeping such a database makes sense in the first place, is because the Oracle C functions for the Oracle database are setup in an hierarchy based on the function name. This means you can deduct what part of the execution you are in by looking at the function name; for example ‘kslgetl’ means kernel service lock layer, get latch.

To use this, clone git repository at https://gitlab.com/FritsHoogland/ora_functions.git

Use the ‘sf.sh’ (search function) script to query the functions. The script uses sqlite3, use the sqlite rpm package to add this (on linux), or use the packaging method of your platform.

This is how to use sf.sh :

$ ./sf.sh kglGetMutex
kglGetMutex -- kernel generic lock management
---

This shows the sf.sh script could find the first 3 letters (underlined), which probably mean kernel generic lock management. It couldn’t find ‘GetMutex’ however that is so self-explanatory that it doesn’t need annotation.

This is how a full function annotation looks like:

$ ./sf.sh qercoFetch
qercoFetch -- query execute rowsource count fetch
----------

Here the full function is underlined, which means the entire function is found.

If you found explanations for Oracle database code locations, or found an error in the annotation (I made assumptions here and there), please send them to frits.hoogland@gmail.com, or react to this post, so I can add or update it.

Advertisements

This post is about the decision the Oracle database engine makes when it is using a full segment scan approach. The choices the engine has is to store the blocks that are physically read in the buffercache, or read the blocks into the process’ PGA. The first choice is what I refer to as a ‘buffered read’, which places the block in the database buffercache so the process itself and other processes can bypass the physical read and use the block from the cache, until the block is evicted from the cache. The second choice is what is commonly referred to as ‘direct path read’, which places the blocks physically read into the process’ PGA, which means the read blocks are stored for only a short duration and is not shared with other processes.

There are some inherent performance aspects different between a buffered and a direct path read. A buffered read can only execute a single physical read request for a single range of blocks, wait for that request to finish, fetch and process the result of the physical read request after which it can execute the next physical read request. So there is maximum of one outstanding IO for multiple (adjacent) Oracle blocks. A direct path read works differently, it submits two physical IO requests, each for a distinct range of Oracle blocks asynchronously, after which it waits one or more IOs to finish. If an IO is returned, it is processed, and an IO for another range of Oracle blocks is submitted to restore the number of IOs in flight to two. If the database engine determines (based upon a non-disclosed mechanism) that enough resources are available it can increase the amount of IO physical IO requests in flight up to 32. Other differences include a maximum for the total size of the IO request, which is 1MB for buffered requests, and 32MB for direct path requests (which is achieved by setting db_file_multiblock_read_count to 4096).

At this point should be clear that there are differences between buffered and direct path reads, and when full segment scans switch from direct path reads to buffered reads it could mean a significant performance difference. On top of this, if your database is using Exadata storage, this decision between buffered reads and direct path reads is even more important. Only once the decision for direct path reads has been made, an Exadata smartscan can be executed. I have actually witnessed cases where a mix of partitioning and HCC lead to the situation that the partitions were so small that a direct path read was not chosen, which meant a smartscan was not considered anymore, meaning that instead of the cells decompressing the compressed blocks all in parallel, the process now had to fetch them and do the decompression on the database layer.

There have been some posts on the circumstances of the decision. However, I have seen none that summarise the differences for the different versions. In order to investigate the differences between the different Oracle versions, I created a git repository at gitlab: https://gitlab.com/FritsHoogland/table_scan_decision. You can easily use the repository by cloning it: git clone https://gitlab.com/FritsHoogland/table_scan_decision.git, which will create a table_scan_decision directory in the current working directory.

Oracle version 11.2.0.2.12
Please mind this version is very old, and SHOULD NOT BE USED ANYMORE because it’s not an actively supported version. However, I do use this version, because this version has different behaviour than the versions that follow.

First determine the small table threshold of the database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531

Let’s create tables just below and just over 1531 blocks/small table threshold:

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1531 1531
...
    BLOCKS
----------
      1664

So the small table threshold is 1531, this means that an internal statistic that is used for determining using the direct path mechanism, medium table threshold will be approximately 1531*5=7655. Let’s create tables just below and just over that number of blocks:

TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

For the other versions, trace event ‘nsmtio’ can be used to learn how the decision is made. However, this trace event does not exist in Oracle version 11.2.0.2. The workaround is to just execute a SQL trace and interpret the wait events. For a full table scan, the wait events ‘db file scattered read’ means a buffered read is done, and wait events ‘direct path read’ means a direct path read was done (obviously).

TS@test > alter session set events 'sql_trace level 8';
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1531';
TS@test > select count(*) from table_1531;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that in my case, with Oracle version 11.2.0.2, the switching point is at 5 times _small_table_threshold.

Oracle 11.2.0.3.15
This version too should NOT BE USED ANYMORE because it is not in active support. This too is for reference.
Small table threshold for this database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531

With the small table threshold being 1531, the medium table threshold should be approximately 1531*5=7655.

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1440 1440
...
    BLOCKS
----------
      1536
TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

Flush buffer cache and set trace events, and test the scans. By doing that I ran into something peculiar with the ‘nsmtio’ event in this version (11.2.0.3 with the latest PSU). This event does exist for this version (which you can validate by running ‘oradebug doc component’), however, it does not yield any output. This means I have to revert to the previous method of running sql_trace at level 8 and interpret the wait events.

TS@test > alter session set events 'trace[nsmtio]:sql_trace level 8'; -- no NSMTIO lines, only sql_trace!
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1440';
TS@test > select count(*) from table_1440;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that with Oracle version 11.2.0.3, the direct path read switching point seems to have moved from 5 times small table threshold to small table threshold itself.

Oracle 11.2.0.4.170718
This version is in active support!
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1538

With the small table threshold being 1538, the medium table threshold should be approximately 1538*5=7690.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1538 1538
...
    BLOCKS
----------
      1664
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7690 7690
...
    BLOCKS
----------
      7808

Flush buffer cache and set trace events, and test the scans.

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7693 blocks),
-- main event: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1538';
SQL> select count(*) from table_1538;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14422, objn: 14422
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14423, objn: 14423
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7690';
SQL> select count(*) from table_7690;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14424, objn: 14424
-- main event: direct path read

This shows that with Oracle version 11.2.0.4, the direct path read switching is at small table threshold, which was changed starting from 11.2.0.3.

Oracle version 12.1.0.2.170718
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1440

SQL>

With small table threshold being 1440, the medium table threshold is approximately 1440*5=7200.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7203 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20489, objn: 20489
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20490, objn: 20490
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20491, objn: 20491
-- main events: direct path read

This is in line with the switch in version 11.2.0.3 to small table threshold as the switching point between buffered reads and direct path reads.

Oracle 12.2.0.1.170814
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1444

SQL>

With small table threshold being 1444, the medium table threshold is approximately 1444*5=7220.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1504 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 7048 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 22502, objn: 22502
-- main events: direct path read

Hey! With 12.2.0.1 the direct path read switching point reverted back to pre-11.2.0.3 behaviour of switching on 5 times small table threshold instead of small table threshold itself.

Update!
Re-running my tests shows differences in the outcome between buffered and direct path reads. My current diagnosis is that the scan type determination uses a step based approach:

– The first determination of size is done with ‘NSMTIO: kcbism’ (kcb is medium). If islarge is set to 1, it means the segment is bigger than STT. If islarge is set to 0 it means the segment is smaller than STT, and the segment will be read buffered, and the line ‘qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown in the NSMTIO output.

– The next line is 'NSMTIO: kcbimd' (kcb is medium determination?) It shows the size of the segment (nblks), STT (kcbstt), MTT (kcbpnb) and is_large, which in my tests always is set to 0. Here, there are 4 options that I could find:

1) Segment size between STT and MTT and a buffered read is executed.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, which probably is externalised in the line 'NSMTIO: kcbcmt1'. The outcome can be a buffered read, for which the line 'qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown.

2) Segment size between STT and MTT and the direct path code path is chosen.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, probably externalised in the line 'NSMTIO: kcbcmt1'. If the costing determines it would be beneficial to use a direct path mechanism, it seems it switches to the direct path with cache determination code, which is also used for any table scan that is smaller than VLOT. Because of switching to that code, it will determine if the segment is bigger than VLOT: 'NSMTIO: kcbivlo', which of course in this case isn't true. Then, it will show the line 'NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]'

3) Segment size bigger than MTT but smaller than VLOT.
If the segment is between MTT and VLOT, the Oracle engine does not apply the costing mechanism (which is means the kcbcmt1 line is not shown). It will determine if the segment is bigger than VLOT ('NSMTIO: kcbivlo'), and then show 'NSMTIO: qertbFetch:[MTT VLOT]’, and there is no kcbdpc to analyse choosing doing a buffered or direct path read.

4) Segment size bigger than VLOT.
If the segment is bigger than VLOT, the Oracle engine execute the functions kcbimd and kcbivlo, the NSMTIO line for kcbivlo will show is_large 1 to indicate it’s a very large object (VLOT by default is ‘500’, which is 5 times the total number of buffers in the buffer cache. The qertbFetch line will say ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’, and there is no kcbdpc to analyse choosing doing a buffered or direct path read.

In the cases where ‘NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]' is shown, which is either a segment between STT and MTT which switched to this code path, or between MTT and VLOT, the code will apply a second determination and potential switching point from buffered to direct path or vice versa, which is shown with the line 'kcbdpc' (kcb direct path check). The outcome can be:

– NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ] to indicate it will use a buffered read.
– NSMTIO: kcbdpc:DirectRead to indicate it will use a direct path read.

I have verified the above 'decision tree' in 11.2.0.2, 11.2.0.3, 11.2.0.4, 12.1.0.2 and 12.2.0.1. It all seems to work this way consistently. I derived this working by looking at the NSMTIO tracing of 12.2, and then gone back in version. You will see that going lower in versions, there is lesser (nsmtio) tracing output; 11.2.0.4 does show way lesser information, for example, it does not show the kcbcmt1 line, and of course 11.2.0.3 and 11.2.0.2 do not show NSMTIO lines altogether. In order to verify the working, I used gdb and quite simply breaked on the kcbism, kcbimd, kcbcmt1, kcbivlo and kcbdpc functions in the versions where this information was missing in the trace.

Still, at the kcbcmt1 point:
– 11.2.0.2 seems to quite consistently take MTT as the direct path switching point.
– 11.2.0.3-12.1.0.2 seem to quite consistently take STT as the direct path switching point.
– 12.2.0.1 varies.

Conclusion.
This article first explained the differences between buffered and direct path reads, and why this is important, and that it is even more important with Exadata for smartscans.

The next part shows how to measure the switching point. The most important message from this blog article is that starting from 11.2.0.3 up to 12.1.0.2 the direct path read switching point is small table threshold, and with Oracle database version 12.2.0.1, the direct path switching point is changed back to pre-11.2.0.3 behaviour which means 5 times the small table threshold of the instance.
The next part shows measurements of the switching point. The addition shows that between STT and MTT there is a cost based decision to go direct path or buffered path. Once the direct path is chosen, it still can go buffered if the majority of the blocks are in the cache.

If you look closely at the output of the nsmtio lines for version 11.2.0.3-12.1.0.1 for tables that had a size between small table threshold and medium table threshold, it seemed a bit weird, because the nsmtio trace said ‘[MTT < OBJECT_SIZE < VLOT]', which to me means that Oracle detected the object size to be between medium table threshold and very large object threshold, which was not true. I can't tell, but it might be a bug that is solved for measuring the wrong size.
The text description in the NSMTIO qertbFetch line is bogus, it simply is a code path; ‘[- STT < OBJECT_SIZE < MTT]' means it's a buffered read, and could be chosen when < STT or in between STT and MTT, '[MTT < OBJECT_SIZE < VLOT]' means it's a direct path read, and could be chosen when in between STT and MTT or MTT and VLOT.

I added the scripts and examples of the tracing events so you can measure this yourself in your environment.

In my previous post, I introduced Intel Pin. If you are new to pin, please follow this link to my previous post on how to set it up and how to run it.

One of the things you can do with Pin, is profile memory access. Profiling memory access using the pin tool ‘pinatrace’ is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

The pid is a pid of an oracle database foreground process. Now execute something in the session you attached pin to and you find the ‘pinatrace’ output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind memory access generates A LOT of information! The above 11MB is what a ‘select * from dual’ generates (!)

This is how the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the function location, the second field is R or W (reading or writing obviously), the third field is the memory location read or written the fourth field is the amount of bits read and the fifth field is prefetched memory.

The function that is used can be looked up using the addr2line linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of that is Mahmoud Hatem’s blog on tracing access to a memory location. Wouldn’t it be nice if we can change the functions addresses to function names, and the memory addresses to named memory locations whenever possible?

That’s where I created the pinatrace annotate oracle tool for. This is a little scriptset that contains scripts to generate memory information from the instance, after which the instruction pointer addresses and memory locations of a pinatrace.out file generated by pinatrace are translated to function names and memory area names. Let’s have a look what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

The usefulness of this is limited in this form. The only thing I could derive is that big numbers in the memory access column (‘0x00007fff6fd2ac60’) are probably PGA related, and the numbers between roughly 0x000000006000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So, now you can see the reason I picked a seemingly arbitrary range of lines actually was because that range is the memory accesses of the ksl_get_shared_latch function. This annotated version show a shared latch get for the ‘session idle bit’ latch. It’s also visible the function uses PGA memory, some of it annotated, some of it not, and that most of the shared pool access is for the latch (a latch essentially is a memory range with the function of serialising access to a resource), which is in the shared pool because it’s a child latch. It’s also visible memory belonging to X$KSUPR is read and written (X$KSUPR is the table responsible for V$PROCESS, the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).

Why are a lot of the assumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? Well, PGA memory allocations are very transient of nature. Because a PGA memory snapshot is made at a certain point in time, this snapshot represents the memory layout of that moment, which has a high probability of having memory deallocated and freed to the operating system. A lot of the SGA/shared pool allocations on the other hand have the intention of re-usability, and thus are not freed immediately after usage, which gives the SGA memory snapshot a good chance of capturing a lot of the memory allocations.

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell, it might not work in other shells like ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script outputs to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt) or if you want to look and redirect to a file: | tee file.txt.

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manualis to think of Pin as a JIT (just in time) compiler, where the compiler does not take byte code (as JIT compilation does with Java), but the executable of the process pin is executed against. This means pin inserts itself into the process’ execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see this process an oracle foreground (rows 1-3), after which we see a lot of stuff that pin pushed into the process’ address space.

For this reason, it’s vitally important to use and run pin as the same user as the process you want to run pin against. The way pin works is that, upon execution of pin, the pin executable inserts itself into the process’ address space, gains control and then tries to load necessary libraries. If it can’t find these libraries, it will send a SIGKILL to the process, effectively stopping it!

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This will output a lot of stuff on your screen and compiles everything in the tools directory.

Now that we have seen an introduction, let’s use a Pin tool to do something useful! One such useful is the tool ‘DebugTrace’. First obtain the process id (pid) from an Oracle server foreground process. Then execute the pin tool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so

Upon execution, nothing is returned, because the majority of the things are happening in the process pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let’s look what this produces! This is how my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it look rather big (remember I only ran ‘select * from dual’)!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a ‘return underflow’, which is because these functions have been called before the trace was put on the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called and first two arguments of the function.
Returning: the address and function where the return is executed, and the returncode.
Tailcall: this is a function that is called as the last instruction in the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).

Please mind currently I am not aware of a way to remove the pin framework and tool from a process address space. Also mind that because of how this works (a lot of intercepting in the execution flow), it will slow down the process significantly. Use this tool and the above described techniques at your own risk.

It’s general knowledge that the Oracle database is ACID compliant, and that SCNs or ‘system change numbers’ are at the heart of this mechanism. This blogpost dives into the details of how the Oracle engine uses these numbers.

Oracle database version 12.1.0.2.161018
Operating system version: OL 7.2, kernel: 4.1.12-61.1.14.el7uek.x86_64 (UEK4)

Redo generation
Whenever DML is executed, redo is generated in the form of ‘change vectors’. These change vectors are copied into the redo buffer as part of the transaction, during the transaction. The function that performs this action is called ‘kcrfw_copy_cv()’. This can be derived by watching the foreground process perform memory copy into the memory area of the redo buffer.

In order to do this, you first need to find the memory area of the redo buffer. This can be done by executing ‘oradebug setmypid’ and ‘oradebug ipc’ as sysdba, and examine the resulting trace file:

 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1  4784130 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008600000 default       readwrite

This shows the redo buffers are allocated between 0x96400000 and 0x9E934000 (actual address and actual address + segment size).

The function Oracle uses to copy memory is called “_intel_fast_memcpy()”. This probably is an optimised version of the standard C memcpy() function. What is important, is that the arguments of memcpy function:

$ man memcpy
NAME        
       memcpy - copy memory area
SYNOPSIS    
       #include <string.h>
       void *memcpy(void *dest, const void *src, size_t n);

So the first argument of memcpy is a pointer to the destination, the second argument is a pointer to the source, and the third argument is the size of the memory area to be copied.

In order to find the function that uses memcpy() to copy into the redo buffers area, we can use the following gdb macro:

break _intel_fast_memcpy if $rdi > 0x96400000 && $rdi < 0x9E934000
commands
silent
printf "_intel_fast_memcpy dest %lx, source %lx, size %d\n", $rdi, $rsi, $rdx
frame 1
c
end

The macro creates a conditional breakpoint on the function _intel_fast_memcpy(). The condition is that the CPU register RDI (first argument, destination) should be higher than 0x96400000 and lower than 0x9E934000, which is the memory area the redo log buffers are mapped to. If the breakpoint and its conditions are triggered, the function name is printed, together with the arguments to memcpy, and ‘frame 1’ to list the function that called _intel_fast_memcpy().

To execute this macro, startup an Oracle foreground session, and attach to the session with gdb, and execute the macro. Next, execute an insert statement. In my case, I created a very simple table called ‘T’ with one field, T, varchar2(1). Please mind that you need to watch out for deferred segment creation, so ideally you first need to create the table, and insert a few rows to take care of that.

This is how the gdb side looks like when executing an insert statement:

_intel_fast_memcpy dest 96817454, source 7ffc9fcc9598, size 36
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817478, source 7ffc9fcc9658, size 32
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817498, source 7faf6ad677f0, size 44
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174c4, source 7ffc9fcc9678, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174d8, source 7ffc9fcc95e4, size 76
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817524, source 7faf6ad67888, size 8
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 9681752c, source 7faf6ad67760, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817540, source 7faf6ad69ff8, size 40
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817568, source 7faf6ad6a070, size 24
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817580, source 7faf6ad69f60, size 49
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968175b4, source 89753568, size 1
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()

The destination of the memcpy() action is the logbuffers, the break conditions made sure of that. The source of all except for the last memcpy() is the PGA (visible by the high number), the last memcpy() copies from the SGA. The memory address points to an SQLA component (cursor) in the SGA.

Okay, at this point we established the redo of a DML operation is put in the redo buffers by a function called kcrfw_copy_cv(). Also please mind that modern Oracle versions do put redo into the redo buffers during doing DML, not after it!

To move further on the topic of transactions, let’s look at when the current SCN is fetched and when kcrfw_copy_cv() is called. To do this, I use the following gdb macro:

break kcrfw_copy_cv
commands
silent
printf "kcrfw_copy_cv\n"
c
end
awatch *0x6001fbb0
commands
c
end

When running this gdb macro against an Oracle session running an insert statement, the following output is generated:

Hardware access (read/write) watchpoint 2: *0x6001fbb0

Old value = 347661
New value = 347666
0x000000000ccbdfb3 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbeb04 in kcsgssn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbdfb3 in kcscur3 ()
kcrfw_copy_cv
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()

Actually, I was surprised by this output. I suspected a call to generate an unique SCN for this change. However, the calls/functions which are visible are:
– kcscur3: a function to read a SCN (the global SCN, but is also used to read other SCNs, like the LWN SCN and the on-disk SCN).
– kcsgrsn: kernel cache service get recent SCN, for reading the global SCN.
– kcsgcsn: kernel cache service get current SCN, for reading the global SCN.
– kcsgssn: kernel cache service get snapshot SCN, for reading the global SCN.
All of these functions do READ a SCN value, but do not modify it. In order to update a SCN, the function kcsgbsn (kernel cache service get batched SCN) should be called. There is a memory write shown (row 1-5), but that really is a read, it is an anomaly of watchpoints with memory changed by other processes.

So, conclusion at this point is that DML (in this case an insert statement) just reads the global SCN, and does not seem to update it.

This got me thinking: if an insert statement just reads the global SCN, instead of generating an unique SCN, it should be possible that the changes (redo vectors actually) of DML executed simultaneously in different sessions on different segments share the same SCN? It sounds counter intuitive to what I have been taught about Oracle.

This can be tested in the following way:
1 – Have two foreground sessions in a different schema execute an insert statement at exactly the same time. This can be done in iTerm2 using the ‘broadcast input’ option, which allows input to be executed in multiple panes at the same time. The command executed in both pane’s is ‘insert into t values (‘a’);’.

2 – Have gdb attached to the two sessions, having an access watchpoint at the global SCN, to be able to identify the SCN that both the session used during executing the insert, and a breakpoint at kcrfw_copy_cv() to understand which SCN it uses. I used the following gdb macro:

break kcrfw_copy_cv
  commands
    silent
    printf "kcrfw_copy_cv\n"
    c
  end
awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    c
  end

3 – To validate what is visible in gdb, dump the current logfile and find the SCN as shown in gdb. The current logfile can be dumped to trace using the following PL/SQL procedure:

----------------------------------------
-- This script will dump the current redolog file.
-- If the log file is big, this might take some time.
--
--
-- Original Author : Riyaj Shamsudeen
-- Modifications by: Frits Hoogland
-- No implied or explicit warranty !
----------------------------------------
set serveroutput on size 1000000
declare
  v_sqltext varchar2(255);
begin
 select 'alter system dump logfile '||chr(39)||member||chr(39) into v_sqltext
  from v$log lg, v$logfile lgfile
  where lg.group# = lgfile.group# and
  lg.sequence# = ( select sequence# from v$log where status='CURRENT' )
  and rownum <2;
  dbms_output.put_line ('Executing :'||v_sqltext);
  execute immediate v_sqltext;
end;
/

In step 2, I got the following gdb output on both the inserts:

(1st session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860
kcrfw_copy_cv

(2nd session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860
kcrfw_copy_cv

This is as expected based on our observation, because the insert statement only reads the global SCN. Now let’s dump the current redologfile as shown in step 3 above, and see if we can find SCN 5c860.
Here are the redo records for both changes:

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.0010 LEN: 0x01a8 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.01b8 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

Watch the SCN from the gdb session and at the line with SCN and the SUBSCN value being the same. Also watch the RBA (redo block address) being different.

Both redo records contain 3 changes:

(1st session)
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0005c801 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:32 AFN:3 DBA:0x000000f9 OBJ:4294967295 SCN:0x0000.0005c800 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600c OBJ:20455 SCN:0x0000.0005c857 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

(2nd session)
CHANGE #1 CON_ID:0 TYP:0 CLS:35 AFN:3 DBA:0x00000110 OBJ:4294967295 SCN:0x0000.0005c7b2 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:36 AFN:3 DBA:0x00000895 OBJ:4294967295 SCN:0x0000.0005c78c SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600e OBJ:20455 SCN:0x0000.0005c858 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

These 3 transactions are consistent with an insert of a single row, this is visible by the opcodes:
5.2=update rollback segment header
5.1=undo block
11.2=insert row piece

This means the DML of the two inserts have exactly the same SCN! A logic thought that you might come up with is: but how is Oracle then able to apply these in the correct order, which is one of the fundamental reasons for having SCNs? For that, it’s important to realise this is just the change to a block, but not a committed transaction yet. In order to get a full understanding of how the SCN mechanism works for at least a single row insert, we need to change the gdb macro from the test above:

awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    if $rip >= 0xccbe340 && $rip <= 0xccbe7b0
      printf "kcsgbsn - SCN: %x (@%x)\n", *0x6001fbb0, $rip
    end
    c
  end
break kcrfw_copy_cv
  commands
    silent
    printf "kcrfw_copy_cv\n"
    c
  end

The function kcsgbsn means: kernel cache service get batched SCN. This function DOES update the global SCN(!). However, this function does not get called during the insert. It does get called when commit is called (or rollback).

So, in order take another step in understanding SCN handling by Oracle, let’s use the updated version of the gdb macro, and insert another row into the table in two sessions, and then also commit:

(1st session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcrfw_copy_cv
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcrfw_copy_cv

(2nd session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcrfw_copy_cv
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcsgbsn - SCN: 7e39c (@ccbe63a)
kcrfw_copy_cv

In both sessions we see the insert (lines 2-4 and 11-13), and then the commit (lines 5-8 and 14-18).
During the insert, we see both sessions using the same SCN (7e39b), which can be verified by dumping the current redologfile:

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.0010 LEN: 0x01c0 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.01d0 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

And we see kcsgbsn() being called. As has been documented on multiple places, the current value of the global SCN is read, and then advanced (get and advance). This can be seen in lines 6 and 7; 0x7e39b is read, and then advanced to 0x7e39c. However, to be honest to my surprise, the second session does the same, as seen in lines 15-17, it advances from 0x7e39b to 0x7e39c too. However it rereads the SCN one additional time. The surprise is the sessions seem to have the same SCN for the commit. Validation from the dump of the current redologfile:

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.0174 LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.01dc LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:19 AFN:3 DBA:0x00000090 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

Yes, both commits (visible by the opcode (OP) 5.4) have the same SCN. At this point I think it’s fitting to theorise why this is happening. I think this is happening and makes sense because the table is protected for consistency at another level in the database, which is the locking mechanisms (TX locks for rows, TM locks for segments), which guarantees that two transactions at the same time do not compromise data consistency. And because of that, it’s not a problem to batch them with the same SCN.

Conclusion
Changes in the database are synchronised by system change numbers, also known as SCNs. SCNs are the way Oracle serialises changes to data. However, this post shows that the Oracle engine tries to maximise the number of changes inside a SCN, and as such a SCN is not an highly granular serialisation mechanism. This is not a problem, since Oracle’s transactional protection mechanisms (like TX and TM locks) do work alongside SCNs to make sure no conflicting changes can happen at the same time. The changes itself do just read and pick the current global SCN from the instance, a commit of these changes does increase the global SCN, but as little as possible. You might find multiple changes in the same SCN, and you might even find commits with the same SCNs.

Bonus topic: finding nemokcscur3.
In case you wondered what this little macro means, and how got the memory addresses:

awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    c
  end

awatch is a memory access watch function. The commands executed when this watch triggers are:
-silent (do not display output)
-if $rip >= 0xccbdfa0 && $rip <= 0xccbe010 (if the instruction pointer contains a value higher or equal to 0xccbdfa0 and the instruction pointer contains a value lower or equal to 0xccbe010)
-printf "kcscur3 – SCN: %x\n", *0x6001fbb0 (print some stuff and print (in hexadecimal format) where the value the pointer 0x6001fbb0 points to)
The text 'kcscur3' in the printf spoils that these addresses are actually the memory addresses where the function kcscur3 is located. How to find these? The starting address can be found in a variety of ways, for example in gdb: 'info function kcscur3', or in the shell using nm: 'nm oracle | grep kcscur3'. It is a bit harder to find the ending address of a function. However, it's still quite simple doable with nm: 'nm -S oracle | grep kcscur3':

[root@o12102fs bin]# nm -S oracle | grep kcscur3
000000000ccbdfa0 0000000000000070 T kcscur3

The first column is the starting memory address of the kcscur3 function, the second column is the length (in hexadecimal) of the function code. 0xccbdfa0+0x70=0xccbe010.

Update oktober 24, 2016:
Roger noted the following: Minor correction, Oracle doesn’t do DML directly, it first generates change vectors which are then applied to the local database which is effectively running in recovery. This way Oracle uniquely guarantees recovery on the primary or physical standby will always do exactly the same thing as the original DML.

Stefan noted the following: Hey Frits. Great work – as always. However i am not quite sure about your kcrfw_copy_cv() PGA interpretation for private redo strands. AFAIK private redo strands reside in permanent memory allocations (in SGA) and you should see memory copy operations between these memory chunks and the public redo buffer(s) on commit. Your GDB breakpoint should only kick in under these circumstances as anything else is handled in private redo (at least on your mentioned database version) and no memory operations on redo buffer. I also modeled a short test case with 12.1.0.1 on OEL 6.7 to verify my theory/assumption and the result was as expected – not quite sure about your observed and mentioned PGA to redo log buffer copy operations then.

I should add that when I run snapper on the insert in my database, the statistics ‘IMU- failed to get a private strand’ and ‘IMU pool not allocated’ are increased, giving a hint no private strands are used.

During investigating I ran once again into statistics in the Oracle database that still provide a useful details, but the actual naming of the statistic is describing a situation that in reality does not exist anymore. The statistics I am talking about are ‘calls to kcmgcs’, ‘calls to kcmgrs’, ‘calls to kcmgas’ and ‘calls to get snapshot scn: kcmgss’.

Disclaimer: this is research. Any of these techniques potentially can crash your instance or leave your database in a corrupted state. Test the techniques used in this article severely before applying it in an actual situation. Use at your own risk.

Back to the ‘calls to’ statistics. To see what I mean here, you can look up the functions in symbol table in the Oracle executable. There are several ways to do that, one way is using gdb:

(gdb) info function ^kcmg
All functions matching regular expression "^kcmg":

Non-debugging symbols:
0x0000000002d12410  kcmgtsUTC
0x0000000002d12be0  kcmgts
0x0000000002d139a0  kcmgtsRSL
0x000000000cc924b0  kcmglscn
0x000000000cc92560  kcmgtsf

See? There are no functions kcmgcs, kcmgrs, kcmgas and kcmgss. Are these statistics of any use then? I think so, although the only really relevant statistic probably is ‘calls to kcmgas’ to indicate an approximate count of the number of calls for a new SCN (gas probably means ‘get and advance’). The meaning of them is documented (at least up to version 12.1) in the online Oracle documentation, Statistics Description:
– ‘calls to get snapshot scn: kcmgss’: Number of times a snapshot system change number (SCN) was allocated. The SCN is allocated at the start of a transaction.
– ‘calls to kcmgas’: Number of calls to routine kcmgas to get a new SCN
– ‘calls to kcmgcs’: Number of calls to routine kcmgcs to get a current SCN
– ‘calls to kcmgrs’: Number of calls to routine kcsgrs to get a recent SCN
(I think ‘routine kcsgrs’ is a typo or a partial improvement and should be either kcmgrs to stay in line with the other (wrong) descriptions, or it should be kcsgrsn (added ‘n’) to be correct)

Would there be a way to find out how this now works? We can’t work from the function names, since I have just proven the functions do not exist anymore (I have no sufficient old enough database to prove this actually, feel free to test this on your old 7/8i/9i/10g database!). However, we can work this out when starting from the other end as seen from what the functions are supposed to do. The functions are supposed to provide a SCN for the session to work with. Since we know the source of SCN values in the SGA (kcsgscn, the global SCN), we can watch which functions do call this variable for a new SCN value or a reference value!

Obtain the address of kcsgscn:

SQL> select ksmfsnam, ksmfsadr from x$ksmfsv
2  where ksmfsnam = 'kcsgscn_';

KSMFSNAM                                                         KSMFSADR
---------------------------------------------------------------- ----------------
kcsgscn_                                                         000000006001FBB0

Now put a access watchpoint on the kcsgscn address for a session using gdb:

$ gdb -p PID
...
(gdb) awatch *0x6001fbb0
Hardware access (read/write) watchpoint 1: *0x6001fbb0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now let’s test something very simple, like a ‘select * from dual’:

Hardware access (read/write) watchpoint 1: *0x6001fbb0

Old value = 331907
New value = 332367
0x000000000cca0f74 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0423 in kcscur3 ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0f74 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0423 in kcscur3 ()
Hardware access (read/write) watchpoint 1: *0x6001fbb0

Value = 332367
0x000000000cca0c97 in kcsgcsn ()

That’s my session accessing address 0x6001fbb0, alias the global SCN, eight times.
The functions that access 0x6001fbbo in this case are kcsgssn, kcscur3 and kcsgcsn.

When we look at the session statistics of the above executed SQL (please mind there is a difference between the first execution and subsequent executions), and focus on the ‘calls to’ statistics this is what this shows (partial output from Tanel’s snapper):

130, SYS       , STAT, calls to kcmgcs                                           ,             4,
130, SYS       , STAT, calls to get snapshot scn: kcmgss                         ,             2,

If carefully combine the snapper output with the function names the watchpoint gives, it becomes apparent that the amount of the statistic with kcmgcs is equal to the times the kcsgcsn function was called, and the amount of the statistic with kcmgss is equal to the times the kcsgssn function was called.

This is only a very simple example, when you try different SQL’s, especially DML, you will notice a lot of calls to the global SCN are made, because the metadata is protected by SCN’s too. I assume Oracle kept the statistics names the same for the sake of not breaking products and scripts using these statistic names. However in reality the functions using the global SCN are different from how the database truly works.

Recently I was investigating the inner working of Oracle. One of the things that is fundamental to the Oracle database is the SCN (system change number). SCNs are used to synchronise changes in the database. There is one source for SCNs in every instance (kcbgscn; the global or current SCN in the fixed SGA), and there are multiple tasks for which Oracle keeps track of synchronisation using SCNs. A few of these tasks for which Oracle stores and uses SCNs to keep track of progress are on disk SCN and lwn SCN.

This blogpost is about some oddities I found when using gdb (the GNU debugger) to watch memory locations of a running Oracle database. This should not be done on a production instance, and is purely for research purposes. Only use the methods mentioned in this article if you are absolutely sure what you are doing, and/or if you are using an Oracle instance that can be crashed and can be restored.

The situation
I was researching the ‘on disk scn’ in a 11.2.0.2 database. I used the following (simplified) gdb macro:

break kcrf_commit_force
  commands
    print $rip
    c
  end
awatch *0x60022659
  commands
    c
  end

This macro breaks on the function kcrf_commit_force essentially to understand the different phases of a commit, and uses the ‘awatch’ function (access watch; which means this watchpoint fires at both reading and writing) to look at memory location 0x60022659. That memory location is the on disk SCN for this instance.

Upon executing a commit after an insert, I saw the following output:

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$1 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Old value = 898
New value = 899
0x0000000008fb5ad1 in kcscur3 ()

Breakpoint 1, 0x0000000008fb3a72 in kcrf_commit_force ()
$2 = (void (*)()) 0x8fb3a72 <kcrf_commit_force+4>
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x60022659

Value = 899
0x0000000008fb5ad1 in kcscur3 ()

What this shows, is the following:
– line 1 and 9: the kcrf_commit_force function is called.
– line 11, 15 and 19: the memory location *0x60022659 is read, resulting in the value 899.
– line 3: the memory location *0x60022659 is written, it contained 898, and the new value is 899.

This greatly puzzled me. As far as I understand the way Oracle works, a server process that issues ‘commit’ is supposed to READ the on disk SCN, it does not write the on disk SCN, because the only process that changes the on disk SCN is the logwriter. However, quite clearly gdb tells me the server process is writing it. Further consideration is that IF the server process is writing it, where does this information come from?

After some further investigation, I came to the conclusion that the only way to prove to myself that what I am seeing is correct, is to create a program myself mimicking the behaviour witnessed above to truly understand what is going on, because I don’t believe what I am seeing is correct.

I created two little c programs. One that starts up, creates a shared memory segment, writes a letter into the shared memory segment and then waits for a keystroke, and puts the keystroke into the same location, and one that reads the same shared memory segment, then waits for enter to pressed, and reads the shared memory segment again, and outputs if the letter read from the shared memory is different. I called the first one ‘server’, and the second one ‘client’. This way, I can attach gdb to the client executable, which very much like an Oracle server process, reads a shared memory location which I can change at will.

The code can be found on GitHub: shared memory test source. Please mind this is simple proof of concept code that is not exhaustive in error checking and other sanity checks in any way. The source files can be compiled and linked in the following way:

$ cc server.c -o server
$ cc client.c -o client

The server executable can be run in the following way:

$ ./server
Type a letter. q quits.

This creates the shared memory segment, and puts the ‘a’ character in the variable c (line 42, c = ‘a’). As long as the c variable is not ‘q’, it writes the c variable into shared memory segment using the pointer s (*s = c) , and then waits for user input using the scanf function.

Once the shared memory segment is created and written into, the client program can be started:

$ ./client
client started. ^C to quit.
waiting for enter

The first time enter is pressed it will output ‘keystroke: a’ because the p variable was not initialised, and will be assigned the ‘a’ character (p = *s), the second time enter is pressed, ‘keystroke:’ is not printed, because the p variable and the memory location the pointer *s points to both contain ‘a’. What is vital, is that the client program NEVER writes, only reads the shared memory location, in three occasions: in line 50:

if( p != *s) {

In line 51:

printf("keystroke: %c\n", *s);

And in line 52:

p = *s;

Okay. So at this point we got a program that mimics what I think an Oracle server process is doing too on commit time (reading shared memory), which is the program ‘client’, and we got a program which creates and can manipulate the shared memory location the client is reading from, which is ‘server’. Startup the server program (./server), and startup the client using gdb:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb)

This is different than attaching to a process using gdb -p PID, but attaching to a process using ‘-p’ is the only way that I am aware of to use gdb on Oracle database processes. Now that we started gdb with the client program, we must first run it in order to see the program’s runtime environment:

(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter

Now invoke gdb (using control-C) and look at the memory mappings to see where the shared memory segment is:

(gdb) info proc mappings
process 3069
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]

Line 19 contains the shared memory segment, at memory address 0x7ffff7ff8000. To do the same as watching the on disk SCN location in Oracle, put an access watchpoint at 0x7ffff7ff8000:

(gdb) awatch *0x7ffff7ff8000
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now press enter, and we see the shared memory location being accessed three times:

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

If you take the source code, you’ll see that these are the following lines of code at rows 50-52:

       if( p != *s ) {
          printf("keystroke: %c\n", *s);
          p = *s;

Every line reads the shared memory location via the pointer *s. It’s also clear it’s a read because the value in the memory location (97) is displayed once.

Now go over to the server side, and press ‘b’ and enter. Now the shared memory segment contains ‘b’. Next, go back to the client program, and press enter:

waiting for enter

Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x0000000000400712 in main ()
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

Hey! This shows what looks like a write with the first watchpoint, because the memory location is touched, and it reports the value at the location being changed. However, because we know the source code, we know we only read the shared memory location with the client program, not write to it! So what this shows in this case is that gdb knew what the previous value was of the memory location was, and detected that it was changed.

To see how a true write looks like, you can startup the server executable with gdb too, and set an ‘awatch’ watchpoint on the shared memory location exactly the same way as shown with the client program above. This is how that looks like:

(gdb) c
Continuing.
b
Hardware access (read/write) watchpoint 1: *0x7ffff7ff8000

Old value = 97
New value = 98
0x00000000004006e9 in main ()

So conclusion here is that you can’t see the difference between a write to a memory location or a change of value when reading when using an ‘awatch’ watchpoint.

Now that we understand how ‘awatch’ works, would there be a way to make the distinction between reading and writing? Outside of ‘awatch’ (access watchpoint), there’s the ‘watch’ command, which is a watchpoint on writing to a memory location and ‘rwatch’, which is a watchpoint on reading a memory location. This is how the output of ‘rwatch’ looks like in the client program:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...(no debugging symbols found)...done.
(gdb) run
Starting program: /home/oracle/src/shm_test/client
client started. ^C to quit.
waiting for enter
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7b23670 in __read_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info proc map
process 3517
cmdline = '/home/oracle/src/shm_test/client'
cwd = '/home/oracle/src/shm_test'
exe = '/home/oracle/src/shm_test/client'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000          0                              /home/oracle/src/shm_test/client
            0x600000           0x601000     0x1000          0                              /home/oracle/src/shm_test/client
      0x7ffff7a48000     0x7ffff7bd2000   0x18a000          0                     /lib64/libc-2.12.so
      0x7ffff7bd2000     0x7ffff7dd2000   0x200000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd2000     0x7ffff7dd6000     0x4000   0x18a000                     /lib64/libc-2.12.so
      0x7ffff7dd6000     0x7ffff7dd8000     0x2000   0x18e000                     /lib64/libc-2.12.so
      0x7ffff7dd8000     0x7ffff7ddc000     0x4000          0
      0x7ffff7ddc000     0x7ffff7dfc000    0x20000          0                     /lib64/ld-2.12.so
      0x7ffff7ff0000     0x7ffff7ff3000     0x3000          0
      0x7ffff7ff6000     0x7ffff7ff8000     0x2000          0
      0x7ffff7ff8000     0x7ffff7ff9000     0x1000          0                       /SYSV0000029a (deleted)
      0x7ffff7ff9000     0x7ffff7ffa000     0x1000          0
      0x7ffff7ffa000     0x7ffff7ffb000     0x1000          0                           [vdso]
      0x7ffff7ffb000     0x7ffff7ffd000     0x2000    0x1f000                     /lib64/ld-2.12.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x21000                     /lib64/ld-2.12.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000          0
      0x7ffffffea000     0x7ffffffff000    0x15000          0                           [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]
(gdb) rwatch *0x7ffff7ff8000
Hardware read watchpoint 1: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040071e in main ()
keystroke: a
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 97
0x000000000040073c in main ()
waiting for enter

Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040071e in main ()
keystroke: b
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 98
0x000000000040073c in main ()
waiting for enter

At line 1 gdb is started with the ‘client’ program. Once inside gdb, the executable is run (line 12). Once it’s running we interrupt in order to get to gdb (line 16). In gdb, the info proc map command is used (line 20) to look at the memory mappings. From the memory mappings, we see address 0x7ffff7ff8000 is the shared memory segment. A read watch is set and a c (continue) command is added to the read watchpoint (line 46 and 48). Then the execution is continued with the c command (line 53).

Now we are running in the executable again. After continuing, I pressed enter to go through the loop in the client executable, and the watchpoint is triggered three times (lines 56, 60 and 68), as we did see earlier, for source code lines 50, 51 and 52.

Now I changed the letter in the shared memory segment from ‘a’ to ‘b’ with the server executable, and pressed enter at line 60. The interesting things that happens is that the read watchpoint is triggered TWO times. This should be three times (!!).

There is a way to make the now invisible read visible again, using the following command in gdb:

(gdb) watch *0x7ffff7ff8000
Hardware watchpoint 2: *0x7ffff7ff8000
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Yes, that’s a WRITE watchpoint. After adding a write watchpoint, this is how the output looks like when I change the letter from ‘b’ to ‘c’:

Hardware watchpoint 2: *0x7ffff7ff8000

Old value = 98
New value = 99
0x0000000000400712 in main ()
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040071e in main ()
keystroke: c
Hardware read watchpoint 1: *0x7ffff7ff8000

Value = 99
0x000000000040073c in main ()
waiting for enter

There we have the tree reads visible again, however, one is ‘disguised’ as a write. To conclude: it seems gdb is not able to make a distinction between reads and writes for watchpoints if the watchpoint memory location is changed outside of the executable active in the debugger. Gdb seems to determine memory access a write if the value found is different from the last it was read.

Bonus information, not applicable to Oracle:
When you have the source code of an application, you can make debugging even better by adding in debug information for gdb in the executable. This is done by adding the ‘-g’ flag to the cc command:

$ cc -g client.c -o client

With the debug information available, it’s easier to debug:

$ gdb client
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/oracle/src/shm_test/client...done.
(gdb) tbreak 1
Temporary breakpoint 1 at 0x40067c: file client.c, line 1.
(gdb) run
Starting program: /home/oracle/src/shm_test/client

Temporary breakpoint 1, main () at client.c:22
22	    key = 666;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) info local
shmid = 0
key = 0
shm = 0x400590 "1\355I\211\321^H\211\342H\203\344\360PTI\307\300P\a@"
s = 0x7fffffffe550 "\001"
p = 0 '\000'
(gdb) awatch s
Hardware access (read/write) watchpoint 2: s
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Old value = 0x7fffffffe550 "\001"
New value = 0x7ffff7ff8000 "a"
main () at client.c:45
45	    printf("client started. ^C to quit.\n");
(gdb) c
Continuing.
client started. ^C to quit.
waiting for enter

Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040070f in main () at client.c:50
50	       if( p != *s ) {
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x000000000040071b in main () at client.c:51
51	          printf("keystroke: %c\n", *s);
(gdb) c
Continuing.
keystroke: a
Hardware access (read/write) watchpoint 2: s

Value = 0x7ffff7ff8000 "a"
0x0000000000400739 in main () at client.c:52
52	          p = *s;
(gdb) c
Continuing.
waiting for enter

Above you see starting up the executable with gdb at line 1. After gdb is started, I first set a temporal breakpoint on line 1 (of the source code, shown on line 12 above), to have gdb ‘enter’ the executable and stop when I run it. If I wouldn’t do that, the program would ‘wait’ in the getchar() function, which is a function in libc, and the libc library would be my scope, line 19 tells that I have not loaded debug symbols for glibc. After setting the breakpoint, I run the executable, which immediately breaks on the temporal breakpoint (line 14).

Now that the executable is active, I can list the variables used using the info local command (line 20). Also, I can simply set a watchpoint on the variable s (line 26), instead of having to search the memory address myself. When the watchpoint is triggered, it now shows the line number in the source file, and the actual source code on that line (lines 34, 35, 44, 45, etc).

%d bloggers like this: