I was recently investigating how to gather performance data on (oracle) linux servers and came across Performance Co-Pilot (PCP). I had come across this product regularly in the past, but it seemed somewhat abstract to me, and I never ran into any actual usage. Besides, we have sar for linux performance data, and for the Oracle database we have oswatcher (and its exadata cousin exawatcher) and TFA, right? How wrong I was.

First let me explain a few things.

There is no magic involved in performance measurement and tuning. What I mean by this is that most performance gathering frameworks use the same performance data sources as the regular tools do. There are tools that can go beyond the regular data sources, like systemtap (which can actually be used as a source for PCP, but that is beyond the scope of this article) and perf, to name a few. This means that whatever tool you use, like mpstat, iostat, top or sar, they all display the same cpu information. It might be disguised in a different way, but really it's all the same information.

This means it's not about the information itself, but about how you can interact with historical data. The majority of the performance tools simply do not have a facility to store measurements. That's perfectly fine, and it was never a design criterion when they were written. This is true for tools like mpstat, iostat and top, to name a few. However, in a lot of cases having only volatile data is a real pain. Of course you can save the output to a file, but this very quickly becomes a challenge once you have a few files of sar, iostat, etc.

This is where sar comes in. Sar is installed on linux via the sysstat rpm package, and actually does store historical data! Sar stores its data in /var/log/sa, and keeps sar archives based on the HISTORY setting in /etc/sysconfig/sysstat. In the past I have seen this set to 8; a check on a current oracle linux 6 machine shows HISTORY set to 28. Also, in the past the maximum retention was 31 days, because the day-of-month number would overwrite the same day of the previous month. That issue seems to be solved too: the comments in /etc/sysconfig/sysstat say that if the value is set higher than 28, files are kept in month-based directories.
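For reference, this is the relevant setting (an excerpt; the value and comment wording are examples, other settings in the file are omitted):

# /etc/sysconfig/sysstat (excerpt)
# Number of days to keep sar archives; values above 28 result in
# month-based directories being used.
HISTORY=28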

The historical data sar stores consists of measurements taken at a ten-minute interval. The reason for this is how sa1 (a helper utility for storing sar data) is invoked, which is done via a cron entry in /etc/cron.d/sysstat:

# Run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
# 0 * * * * root /usr/lib64/sa/sa1 600 6 &
# Generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A

If you speak cron, you see that the second line instructs cron to run sa1, taking one sample every ten minutes (which is put in /var/log/sa/saNN), and that the last line runs sa2, which creates a (text) report called /var/log/sa/sarNN and does maintenance based on the HISTORY setting.

This is brilliant when you are coming from having no data at all. In fact, I have used this in the past, and still use it with success, to look through measurements for anomalies. However, it's not very granular: it's per ten minutes. Also, getting data means extracting it from the /var/log/sa/saNN file (sar -f /var/log/sa/saNN -s hh:mm:ss -e hh:mm:ss, followed by switches for data sources, like cpu usage (-u), swapping in/out (-W), etc.). More frequently I simply ask for the /var/log/sa/sarNN (text) report, which contains all sar measurements.
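For example, to get the CPU figures between 09:00 and 10:00 from the archive of day 13 (file name and times are just examples):

$ sar -u -f /var/log/sa/sa13 -s 09:00:00 -e 10:00:00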

So far we are all happy. Performance data can be queried ad hoc, and there's a history with ten-minute samples. But what about when you are testing things on your system? In most of the benchmarks and proof of concepts I am involved in, we run things and actually want data that is more granular than ten minutes. In fact, a single test might not even take ten minutes and thus not be visible in the sar history at all! There is a solution for that too: set up sar, iostat, mpstat, etc., each writing its output to a file in the background, then run whatever you want to run, and kill the performance collectors afterwards. Sounds great, but if you've ever tried that, it becomes a bit complex quite quickly.
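To give an idea, here is a minimal sketch of such an ad-hoc collection setup (the directory, interval and collector switches are examples; pick whatever suits your test):

#!/bin/bash
# start collectors in the background, each writing to its own file
mkdir -p /tmp/perfrun && cd /tmp/perfrun
sar 5           > sar.out    2>&1 &
iostat -x 5     > iostat.out 2>&1 &
mpstat -P ALL 5 > mpstat.out 2>&1 &

# ... run the actual test here ...
sleep 60   # placeholder for the test workload

# stop all collectors started by this script
kill $(jobs -p)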

This is where Performance Co-Pilot (PCP) comes in. PCP is an open-source tool that is available in the 'latest' repository of oracle linux 6 and 7, which means it can simply be installed using yum. Once installed, you can extract the very same data the earlier mentioned tools get from a linux system, using the pcp tools:

CPU information:

[opc@deploy1 ~]$ pmrep :sar-u
               %user       %nice        %sys     %iowait      %steal       %idle
13:00:41         N/A         N/A         N/A         N/A         N/A         N/A
13:00:42        0.47        0.00        0.11        0.00        0.00       99.41
13:00:43        0.96        0.00        0.12        0.00        0.00       98.89
13:00:44        0.21        0.00        0.06        0.00        0.00       99.72
13:00:45        1.15        0.00        0.17        0.00        0.00       98.65
13:00:46        0.03        0.00        0.01        0.00        0.00       99.96
13:00:47        1.22        0.00        0.14        0.00        0.00       98.58
13:00:48        0.58        0.00        0.18        0.00        0.00       99.23
13:00:49        0.98        0.00        0.12        0.00        0.00       98.85
13:00:50        0.18        0.00        0.04        0.00        0.00       99.77
13:00:51        1.15        0.00        0.18        0.00        0.00       98.65
13:00:52        0.07        0.00        0.01        0.00        0.00       99.91
13:00:53        1.19        0.00        0.15        0.00        0.00       98.63

Memory information:

[opc@deploy1 ~]$ pmrep :mem
              free(mb)   bufhead(mb)  pagecache(mb  kernelstack(      slab(mb)  pagetables(m    mapped(mb)      anon(mb)       shm(mb)   swapuse(mb)  hugefree(mb)  hugeused(mb)
13:08:33           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A
13:08:34        237199           320         17056            16          1113            23            62          1048            27             0             0             0
13:08:35        237258           320         17056            16          1113            24            62           991            27             0             0             0
13:08:36        237200           320         17056            16          1113            24            62          1046            27             0             0             0
13:08:37        237255           320         17056            16          1113            24            62           991            27             0             0             0
13:08:38        237212           320         17056            16          1113            24            62          1034            27             0             0             0

Network:

[opc@deploy1 ~]$ pmrep :nw-pkt
          pck/in  pck/in  pck/in  err/in  err/in  err/in  drp/in  drp/in  drp/in  pck/out  pck/out  pck/out  err/out  err/out  err/out  drp/out  drp/out  drp/out
            eth0    eth1      lo    eth0    eth1      lo    eth0    eth1      lo     eth0     eth1       lo     eth0     eth1       lo     eth0     eth1       lo
13:08:58     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A
13:08:59  10.972   0.000  29.924   0.000   0.000   0.000   0.000   0.000   0.000   16.957    0.000   29.924    0.000    0.000    0.000    0.000    0.000    0.000
13:09:00   0.996   0.000  103.63   0.000   0.000   0.000   0.000   0.000   0.000    0.996    0.000  103.627    0.000    0.000    0.000    0.000    0.000    0.000
13:09:01   0.996   0.000  29.892   0.000   0.000   0.000   0.000   0.000   0.000    0.996    0.000   29.892    0.000    0.000    0.000    0.000    0.000    0.000
13:09:02   0.996   0.000  94.653   0.000   0.000   0.000   0.000   0.000   0.000    0.996    0.000   94.653    0.000    0.000    0.000    0.000    0.000    0.000

Disk:

[opc@deploy1 ~]$ pmrep :disk-iops
          riops  wiops  iops
            sda    sda   sda
13:09:23    N/A    N/A   N/A
13:09:24      0      0     0
13:09:25      0      0     0
13:09:26      0      0     0
13:09:27      0      3     3
13:09:28      0      0     0

As you can see, all these statistics were provided by a utility called 'pmrep' (reporter utility). They were fetched at runtime. I can also go back in time when I use the archive that the pcp logger creates, using -S (start time):

[opc@deploy1 ~]$ pmrep -a /var/log/pcp/pmlogger/deploy1/20170113.00.10 -S @-1min :sar-u
               %user       %nice        %sys     %iowait      %steal       %idle
13:22:36         N/A         N/A         N/A         N/A         N/A         N/A
13:22:37        0.65        0.00        0.11        0.00        0.00       99.22
13:22:38        0.65        0.00        0.11        0.00        0.00       99.22
13:22:39        0.65        0.00        0.11        0.00        0.00       99.22
13:22:40        0.65        0.00        0.11        0.00        0.00       99.22
13:22:41        0.65        0.00        0.11        0.00        0.00       99.22
13:22:42        0.65        0.00        0.11        0.00        0.00       99.22
13:22:43        0.65        0.00        0.11        0.00        0.00       99.22
13:22:44        0.65        0.00        0.11        0.00        0.00       99.22
13:22:45        0.65        0.00        0.11        0.00        0.00       99.22
13:22:46        0.65        0.00        0.11        0.00        0.00       99.22
13:22:47        0.65        0.00        0.11        0.00        0.00       99.22
13:22:48        0.65        0.00        0.11        0.00        0.00       99.22

The archive is stored in /var/log/pcp/pmlogger/hostname/YYYYMMDD.00.10 for the current day; older archives are also stored as YYYYMMDD, but without '.00.10' appended. Above, the syntax for using an archive and specifying a relative time is shown (-S for start, then a number and a time unit like sec, min, hour). Of course you can specify absolute times too, like '-S @13:22 -T @13:25'. -T is the end time.

Did you notice the measurements read from the archive above are per second? The metrics are actually logged once per minute (this is specified in /var/lib/pcp/config/pmlogger/config.default); the pmrep tool interpolates the per-minute values to per-second output. I am not really happy about that, but you can use '-u' to disable this interpolation.
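For example, to report an absolute window from the same archive without the per-second interpolation (assuming the :sar-u report definition mentioned at the end of this post):

$ pmrep -u -a /var/log/pcp/pmlogger/deploy1/20170113.00.10 -S @13:22 -T @13:25 :sar-u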

When doing an investigation, you often want all data from a certain point in time, because quite often you don't know upfront what exactly you are going to encounter. This is also true for troubleshooting, proof of concept testing, performance analysis, benchmarking, etc. You want to be able to do multidimensional analysis after the investigated event, possibly even offline. This is a point where PCP truly shines. It is easy to create an archive of only the relevant time window using the 'pmlogextract' utility:

[opc@deploy1 ~]$ pmlogextract -S 12:30 -T 12:40 /var/log/pcp/pmlogger/deploy1/20170113.00.10 this

The above pmlogextract command reads the (current, the one with '.00.10' attached) archive /var/log/pcp/pmlogger/deploy1/20170113.00.10 and creates the archive 'this' with all measurements between 12:30 and 12:40. A PCP archive consists of three files:

[opc@deploy1 ~]$ ls -ls this*
236 -rw-rw-r--. 1 opc opc 238212 Jan 13 14:05 this.0
  4 -rw-rw-r--. 1 opc opc    192 Jan 13 14:05 this.index
 20 -rw-rw-r--. 1 opc opc  18456 Jan 13 14:05 this.meta

This archive is endian-independent, and can be transferred to your laptop for analysis (provided you have the PCP executables on it, of course). Please mind an archive contains a subset of all available performance metrics, in order to save space. My default installation has 1620 metrics available, of which 280 are stored in the archive.
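The metric counts can be checked with the pminfo utility, which lists metric names either from the live pmcd or from an archive (-a):

$ pminfo | wc -l            # metrics known to the local pmcd
$ pminfo -a this | wc -l    # metrics present in the extracted archive 'this'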

But it doesn't stop here. PCP not only installs a logger daemon (pmlogger) and a couple of agents to extract data (pmdaroot, pmdaproc, pmdaxfs, pmdalinux), but also the 'pmcd' daemon (pcp collector daemon), listening on port 44321. This means you can fetch metrics over the network:

[opc@deploy1 ~]$ pmrep -h oracleserver.local :mem
              free(mb)   bufhead(mb)  pagecache(mb  kernelstack(      slab(mb)  pagetables(m    mapped(mb)      anon(mb)       shm(mb)   swapuse(mb)  hugefree(mb)  hugeused(mb)
15:25:18           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A
15:25:19        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:20        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:21        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:22        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:23        507244           165          5436            19           499           165          1267           817          3015             0             0             0
15:25:24        507244           165          5436            19           499           165          1267           817          3015             0             0             0

This is my memory overview report fetching current data from the server 'oracleserver.local'! Sadly, fetching historical data with pmrep via the 'pmcd' daemon is currently not supported.

The last point I want to draw attention to is that gathering PCP performance data on a central host is built into PCP via the logger functionality. Provided the collection host can reach the remote host on tcp/44321, it's really simple (example for the above mentioned oracleserver.local host). This is all that needs to be done on the collection host; nothing needs to be done on the host that provides the data:

[root@deploy1 ~]# echo 'oracleserver.local 	n   n	PCP_LOG_DIR/pmlogger/oracleserver.local	-r -T24h10m -c config.oracleserver.local' >> /etc/pcp/pmlogger/control
[root@deploy1 ~]# service pmlogger restart

The installation of PCP from the oracle linux 6 or 7 repository is really simple (provided you have enabled the 'latest' repository):

[root@deploy1 ~]# yum install pcp pcp-system-tools
[root@deploy1 ~]# chkconfig pmcd on
[root@deploy1 ~]# chkconfig pmlogger on
[root@deploy1 ~]# service pmcd start
[root@deploy1 ~]# service pmlogger start

PS: I got the message 'Error: Failed to create avahi client: Daemon not running' when executing 'service pmcd start', but it does not seem to impact using PCP.

A final remark: the pmrep reports (:sar-u, :mem, :nw-pkt, :disk-iops) mentioned above are NOT reports that are in the default installation of PCP. Create a file named pmrep.conf in the current working directory of the process executing pmrep, and use the following gist to fill the file with my report definitions. It's actually really easy to create a report, and there are plenty of examples among the ones I created. If the pmrep.conf file is not in the current working directory, it needs to be specified using -c /path/to/pmrep.conf.
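To give an idea of what such a report definition looks like, here is a rough sketch of a pmrep.conf entry (the metric name is a standard PCP memory metric; the exact per-metric formatting fields, and whether additional options are needed, should be checked against pmrep.conf(5)):

# pmrep.conf sketch: a simple memory report, invoked as 'pmrep :mem-free'
[mem-free]
header = yes
unitinfo = no
timestamp = yes
mem.util.free = free(kb)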

In my previous post, I introduced Intel Pin. If you are new to Pin, please see that post for how to set it up and how to run it.

One of the things you can do with Pin is profile memory access. Profiling memory access using the pin tool 'pinatrace' is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

The pid is the pid of an oracle database foreground process. Now execute something in the session you attached pin to, and you will find the 'pinatrace' output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind memory access generates A LOT of information! The above 11MB is what a ‘select * from dual’ generates (!)

This is what the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the address of the instruction performing the access, the second field is R or W (read or write, obviously), the third field is the memory location read or written, the fourth field is the number of bytes read or written, and the fifth field is the value that is read or written.

The function that is used can be looked up using the addr2line linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of such usage is Mahmoud Hatem's blog on tracing access to a memory location. Wouldn't it be nice if we could translate the function addresses to function names, and the memory addresses to named memory locations whenever possible?
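The function addresses can at least be translated in bulk, because addr2line reads addresses from standard input when none are given on the command line. A rough sketch (the oracle binary path is the one used above):

$ grep -v '^#' pinatrace.out | awk '{ sub(":", "", $1); print $1 }' | sort -u | \
  addr2line -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle

That covers the function names; the memory locations, however, need information from the instance itself.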

Translating both is what I created the pinatrace annotate oracle tool for. This is a small set of scripts that generates memory information from the instance, after which the instruction pointer addresses and memory locations in a pinatrace.out file are translated to function names and memory area names. Let's have a look at what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

The usefulness of this in its raw form is limited. The only thing I could derive is that the big numbers in the memory address column (like '0x00007fff6fd2ac60') are probably PGA related, and that the numbers roughly between 0x000000006000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So now you can see that the reason I picked a seemingly arbitrary range of lines was that this range contains the memory accesses of the ksl_get_shared_latch function. This annotated version shows a shared latch get for the 'session idle bit' latch. It is also visible that the function uses PGA memory (some of it annotated, some of it not), and that most of the shared pool access is for the latch itself (a latch essentially is a memory range with the function of serialising access to a resource), which lives in the shared pool because it is a child latch. Also visible is that memory belonging to X$KSUPR is read and written (X$KSUPR is the table underlying V$PROCESS; the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).

Why are a lot of the presumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? Well, PGA memory allocations are very transient in nature. Because the PGA memory snapshot is made at a certain point in time, it represents the memory layout of that moment, and by then there is a high probability that memory has already been deallocated and freed to the operating system. A lot of the SGA/shared pool allocations, on the other hand, are intended to be reusable and thus are not freed immediately after use, which gives the SGA memory snapshot a good chance of capturing most of the memory allocations.

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell; it might not work in other shells like ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script writes to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt), or pipe it to tee if you want to watch and save at the same time (| tee file.txt). A consolidated example of the whole sequence follows below.
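Put together, the whole sequence looks roughly like this (a sketch; the csv file names are the ones the scripts above produce):

$ sqlplus / as sysdba @0_get_pga_detail.sql
$ sqlplus / as sysdba @1_generate_memory_ranges.sql
$ sqlplus / as sysdba @2_generate_memory_ranges_xtables.sql
$ sqlplus / as sysdba @3_generate_memory_ranges_pga.sql
$ ls memory_ranges*.csv
memory_ranges.csv  memory_ranges_pga.csv  memory_ranges_xtables.csv
$ ./pinatrace_annotate_oracle.sh pinatrace.out | tee pinatrace_annotated.txt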

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manual is to think of Pin as a JIT (just in time) compiler, where the compiler does not take bytecode (as JIT compilation does with Java), but the executable of the process Pin is run against. This means Pin inserts itself into the process' execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see that this process is an oracle foreground process (the first three lines), after which we see a lot of stuff that Pin pushed into the process' address space.

For this reason, it’s vitally important to use and run pin as the same user as the process you want to run pin against. The way pin works is that, upon execution of pin, the pin executable inserts itself into the process’ address space, gains control and then tries to load necessary libraries. If it can’t find these libraries, it will send a SIGKILL to the process, effectively stopping it!

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This outputs a lot of stuff on your screen and compiles everything in the tools directory.

Now that we have seen an introduction, let's use a Pin tool to do something useful! One such useful tool is 'DebugTrace'. First obtain the process id (pid) of an Oracle server foreground process. Then execute the pin tool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so

Upon execution, nothing is returned, because the majority of the things are happening in the process pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let's look at what this produces! This is what my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it looks rather big (remember I only ran 'select * from dual')!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a ‘return underflow’, which is because these functions have been called before the trace was put on the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called, and the first two arguments of the function.
Returning: the address and function where the return is executed, and the returncode.
Tailcall: this is a function that is called as the last instruction in the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).

Please mind that I am currently not aware of a way to remove the Pin framework and tool from a process' address space. Also mind that because of how this works (a lot of intercepting in the execution flow), it will slow down the attached process significantly. Use this tool and the techniques described above at your own risk.

This blogpost is about the Oracle redo log structures and redo efficiency in modern Oracle databases. A lot of subtle things have changed surrounding redo (starting from Oracle 10, actually) which have gone fairly unnoticed. One place where the changes have gone unnoticed is the Oracle documentation: the description of redo in it is accurate for Oracle 9, not for how it works in Oracle 10 or today in Oracle 12.1.0.2.

My test environment is a virtual machine with Oracle Linux 7.2 and Oracle 12.1.0.2.161018, and a “bare metal” server running Oracle Linux 6.7 and Oracle 12.1.0.2.160419. Versions are important, as things can change between versions.

Multi threaded redo and dynamic strands
One of the prominent changes to the redo mechanism is what is referred to as 'multi threaded redo' or 'multiple log buffers'. The way the redo structures are usually described is a single redo log buffer in the SGA to which change vectors are written (in a circular fashion), protected by a latch (the redo allocation latch). That has not been an accurate description since Oracle 9.2.

Starting from Oracle 9.2, multiple log buffers can be, and most probably are, created. At least in Oracle 12.1 you get 2 redo log buffers. Each of these buffers is protected by its own redo allocation latch. The benefit of multiple log buffers is that copying redo into the log buffer is not strictly serialised. However, this has some consequences too. One of the consequences is that redo could potentially not be written in strict SCN order by the log writer or its worker processes.

You can see the log buffers in the view x$kcrfstrand:

SYS@testdb AS SYSDBA> select strand_size_kcrfa from x$kcrfstrand where ptr_kcrf_pvt_strand = hextoraw(0) and pnext_buf_kcrfa_cln != hextoraw(0);

STRAND_SIZE_KCRFA
-----------------
         67108864
         67108864

Both values added together exactly matches the size of the log buffer:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: log_buffer
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%log_buffer%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden:
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('','Y',3,2))

NAME                         VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
---------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
log_buffer                   134217728                                                              FALSE    FALSE      TRUE

The parameters that handle dynamic redo strands are “hidden”:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: log_parallelism
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%log_parallelism%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden: Y
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('Y','Y',3,2))

NAME                            VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_log_parallelism_dynamic        TRUE                                                                   TRUE     FALSE      FALSE
_log_parallelism_max            2                                                                      TRUE     FALSE      FALSE

The mechanisms can further be investigated using fixed SGA variables:

SYS@testdb AS SYSDBA> oradebug setmypid
Statement processed.
SYS@testdb AS SYSDBA> oradebug dumpvar sga kcrf_max_strands
uword kcrf_max_strands_ [0600283C8, 0600283CC) = 00000002
SYS@testdb AS SYSDBA> oradebug dumpvar sga kcrf_actv_strands
uword kcrf_actv_strands_ [0600283E0, 0600283E4) = 00000001

Note: the output of oradebug is hexadecimal (!). Of course here it doesn’t matter since the values are smaller than 10.
This shows the maximum number of public redo strands is 2 (kcrf_max_strands), as we have seen with x$kcrfstrand, and the number of active strands is 1 (kcrf_actv_strands), which makes sense in my test database, since I am the sole user of the database.
The maximum number of public redo strands seems to be at least 2; some limited research shows the value is set by the formula CPU_COUNT/16 for systems with higher CPU counts. Some sources report that the value of _log_parallelism_max will be reduced to CPU_COUNT if it is set higher manually.
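A quick way to see what that would give on a particular system (a sketch only, assuming the CPU_COUNT/16 rule with a floor of 2 as described above; the exact rounding Oracle uses is not documented):

SYS@testdb AS SYSDBA> select to_number(value) as cpu_count,
                             greatest(2, ceil(to_number(value) / 16)) as expected_max_strands
                      from v$parameter
                      where name = 'cpu_count';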

When I count the number of redo allocation latches, I see something odd:

SYS@testdb AS SYSDBA> select count(*) from v$latch_children where name = 'redo allocation';

  COUNT(*)
----------
	20

A redo allocation latch protects a log strand. I just showed that my system actually has two public log strands. Why do I have 20 redo allocation latches if my system has a fixed number of public redo buffers? This finding leads us to private redo strands.

Private strands
Besides the multi threaded public redo buffers, Oracle introduced another optimisation for redo: private strands. Private strands optimise redo generation by letting a session create the change vectors for a transaction directly in a private redo buffer in the SGA, instead of generating the change vectors in the PGA and later copying them into a public redo strand. Some restrictions apply to its use, although I am not aware of a concrete list of restrictions and limitations (the closest thing to such a list is the table x$ktiff shown further below with in-memory undo, which is directly connected to private strands). Whenever a private strand can not be used, Oracle reverts to the 'classic' mechanism of redo generation using the public redo log buffer. Each private strand is also protected by a redo allocation latch, just like the public strands.

The private strands SGA buffers are allocated at startup time:

SYS@testdb AS SYSDBA> select pool, name, bytes from v$sgastat where name like 'private strands';

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  private strands               2451456

It seems the number of private strands allocated from the SGA (and, as a consequence, the number of redo allocation latches on top of the number of public strands) is dependent on the transactions database parameter: it takes 10% (_log_private_parallelism_mul) of the transactions value, and multiplies that by 129KB plus 4KB overhead per strand:

SYS@testdb AS SYSDBA> select trunc(value * KSPPSTVL / 100) * (129+4) * 1024 "private strands"
                      from (select value from v$parameter where name = 'transactions') a,
                           (select val.KSPPSTVL
                            from sys.x$ksppi nam, sys.x$ksppsv val
                            where nam.indx = val.indx AND nam.ksppinm = '_log_private_parallelism_mul') b;

private strands
---------------
        2451456

However, that is not all that is in play for private strands. Once the private strands are allocated, and after every log switch, the number of usable private strands is determined by the size of the current online redologfile minus the size of the log buffer, multiplied by the 5% factor (_log_private_mul):

SYS@testdb AS SYSDBA> select trunc(((select bytes from v$log where status = 'CURRENT') - (select to_number(value) from v$parameter where name = 'log_buffer'))*
                             (select to_number(val.KSPPSTVL)
                              from sys.x$ksppi nam, sys.x$ksppsv val
                              where nam.indx = val.indx AND nam.ksppinm = '_log_private_mul') / 100 / 66560)
                             as "private strands"
                       from dual;

private strands
---------------
            -22

Yes, that is a negative number. Let’s look at the number of private strands in this instance:

SYS@testdb AS SYSDBA> select indx,strand_size_kcrfa from x$kcrfstrand where last_buf_kcrfa = '00';

no rows selected

The instance takes the lesser of the two calculations above as the number of private strands. That explains why my test instance does not have any private strands! The size of my log buffer is 128M, while the size of my online redologfiles is 100M. This shows that sizing the online redologfiles too small can lead to inefficiencies (!!).

In-Memory Undo
Another mechanism that has been changed in order to optimise redo generation is in-memory undo. In-memory undo is directly tied to private redo strands. The idea of in-memory undo is to store the undo (read-consistency related) part of a transaction in an in-memory undo buffer until it is committed or rolled back. The in-memory undo pool can be seen in the shared pool using:

SYS@testdb AS SYSDBA> select pool, name, bytes from v$sgastat where name = 'KTI-UNDO';

POOL         NAME                                                    BYTES
------------ -------------------------------------------------- ----------
shared pool  KTI-UNDO                                              1274976

The in-memory undo pool is externalised through x$ktifp. This view shows the in-memory buffers and their sizes:

SYS@testdb AS SYSDBA> select ktifpno, ktifppsi from x$ktifp;

   KTIFPNO   KTIFPPSI
---------- ----------
         0      65535
...
        17      65535

The number of in-memory undo buffers is determined by the transactions parameter:

SYS@testdb AS SYSDBA> show parameter transactions

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
transactions                         integer     189
transactions_per_rollback_segment    integer     5

The number of buffers defaults to 10% of the transactions value. Every in-memory undo buffer is protected by an in-memory undo latch:

SYS@testdb AS SYSDBA> select count(*) from v$latch_children where name like 'In memory undo latch';

  COUNT(*)
----------
        18
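A quick cross-check of the 10% rule (a sketch, assuming a simple trunc of 10% of the transactions parameter; with transactions at 189 this gives the 18 buffers and latches seen above):

SYS@testdb AS SYSDBA> select trunc(to_number(value) / 10) as expected_imu_buffers
                      from v$parameter
                      where name = 'transactions';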

In-memory undo can be switched on and off at the instance level or at the session level using the '_in_memory_undo' parameter, and is true/on by default. Switching off in-memory undo switches off private strands too.

In fact some of these optimisations are announced in the alert log file during startup of an instance:

IMODE=BR
ILAT =18

BR means batched redo mode, which means the private strands feature is turned on (I found that with the too-small online redologfiles described above, which caused no private strands to be made available, the database still announces 'BR').
ILAT is the number of in-memory undo pools and latches.

Because in-memory undo is directly tied to private redo strands, the usage of private strands and in-memory undo can be monitored using the view x$ktiff:

SYS@testdb AS SYSDBA> select ktiffcat, ktiffflc from x$ktiff;

KTIFFCAT															   KTIFFFLC
-------------------------------------------------------------------------------------------------------------------------------- ----------
Undo pool overflow flushes														  0
Stack cv flushes															  1
Multi-block undo flushes														  0
Max. chgs flushes															  0
NTP flushes																  0
Contention flushes															  0
Redo pool overflow flushes														  0
Logfile space flushes															  0
Multiple persistent buffer flushes													  0
Bind time flushes															  0
Rollback flushes															  2
Commit flushes															       1694
Recursive txn flushes															  0
Redo only CR flushes															  0
Ditributed txn flushes															  0
Set txn use rbs flushes 														  0
Bitmap state change flushes														  0
Presumed commit violation														  0
Securefile direct-write lob update flushes												  0
Unknown 																  0
Unknown 																  0

This view shows the reasons for flushing the in-memory undo buffers.

Throw-away undo
The reason the database mentions 'IMODE' is that there is actually another mode besides 'BR': 'TUA'. The 'TUA' abbreviation is a bit weird (Throw Undo Away vs. Throw-away undo?), but it means the database is set to 'throw-away undo mode'. This is accomplished by setting the undo_retention parameter to a value of 3 or lower, and setting the undocumented parameter _imu_pools to a value greater than 3. Once this is done and the instance is restarted, it is reported in the alert.log file as follows:

IMODE=TUA
ILAT =189

The purpose of throw-away undo mode is to throw away undo instead of applying it (the name already hints at that :-)). By enabling throw-away undo, some of the actions normally executed are skipped. These actions include applying undo (not all undo is skipped), which reduces block changes, reads, writes, buffer activity and redo activity. By enabling this mode, and thus not performing some of the actions that are normally done, there are consequences: it significantly increases the risk of ORA-1555 snapshot too old errors. Also, a lot of database features are not compatible with 'TUA mode': flashback database, cluster_database=TRUE and supplemental logging, to name a few. However, it looks to me like a mode designed for a specific purpose: being able to do massive loads or changes at a faster rate, accepting the consequences. I would add that this mode should always be a temporary setting, and once the desired action is completed, 'TUA' mode should be reverted back to normal behaviour by resetting _imu_pools and increasing undo_retention back to your business transaction needs.
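For completeness, enabling it would look roughly like this (a sketch only; _imu_pools is an undocumented parameter, the value of 189 is just an example, and this should not be done outside a disposable test system):

SYS@testdb AS SYSDBA> alter system set undo_retention = 3 scope = spfile;
SYS@testdb AS SYSDBA> alter system set "_imu_pools" = 189 scope = spfile;
SYS@testdb AS SYSDBA> shutdown immediate
SYS@testdb AS SYSDBA> startup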

Conclusion
The aim of this blogpost is to describe private strands, in-memory undo and throw-away undo. The name of the blogpost is actually a hat-tip mainly to the paper by Stephan Haisley, which covers most of what is written here. So in that sense, it is a 'redo' of his paper, although I tested all of this on an Oracle 12.1.0.2.161018 database instance.

Private strands and in-memory undo reduce the work needed when changes are applied to the database. You might want to make sure you are using these features, especially since too small online redologfiles can prevent private strands from being made available. I come across a lot of databases which have inappropriately sized online redologfiles.

Throw-away undo mode, or TUA mode, is a mode which can reduce the time spent on bulk actions. However, everything comes at a price: for TUA mode it means the database loses some of its durability properties. Still, if it is used for importing huge amounts of data, or for doing mass changes without any session concurrency, it could be a welcome time saver.

References:
Hellodba
Stephan Haisley’s 2008 presentation ‘Redo and Undo Optimisations in Oracle 10g’
Tanel Põder’s presentation ‘Performance and Scalability Improvements in Oracle 10g and 11g’
Jonathan Lewis' write-ups about private strands and 'Oracle Core: Essential Internals for DBAs and Developers'.
Twitter: @piontekdd, @martinberx, @westendwookie
Thanks: Mauro Pagano

This article is about the internals of how the Oracle database handles transactions. In this case the communication mechanism of foreground sessions to the logwriter process is examined. The tests in this article have been executed using the following versions:
– Oracle database 12.1.0.2.161018
– Oracle linux 7.2, kernel 4.1.12-61.1.14.el7uek.x86_64 (UEK4)

In my previous article, 'Transactions and SCNs', I talked about redo generation, among other things. When no changes have been made to the way a commit works (parameters commit_logging, commit_wait and commit_write), a commit waits for its change vectors to be written to disk by either the logwriter or the logwriter worker processes. In order for the logwriter or its worker processes to write these change vectors as quickly as possible, a foreground process needs to signal the logwriter. This is independent of the two logwriter modes, post/wait and polling, because those describe how the logwriter (or its helper processes) signals the foreground process after writing.

Whenever a foreground process needs its change vectors written, it can signal the logwriter to write them out. This can happen not only during a commit, but also while doing DML. Most people are aware that the technical implementation for triggering the logwriter is the ‘semctl()’ system call. Typically, an idle logwriter process is waiting on the wait event ‘rdbms ipc message’ at the Oracle level, which is the instrumentation of the process waiting in the system call ‘semtimedop()’. This way, the logwriter process is off cpu until the semaphore is triggered or semtimedop() times out.
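This can be observed with strace; a minimal sketch (the process IDs are placeholders, and this has to be run as root or the oracle software owner on the database server):

# watch the logwriter sleep in and return from semtimedop()
strace -e trace=semtimedop -p <lgwr pid>
# watch a foreground process post the logwriter
strace -e trace=semctl -p <foreground pid>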

By executing semctl() against the semaphore identifier and semaphore number the logwriter process is sleeping on, the logwriter is made runnable and starts working.

In order to prove this, this small gdb script can be used:

break kcrfw_copy_cv
commands
silent
print $rip
c
end
break kcrf_commit_force_int
commands
silent
print $rip
c
end
break semctl
commands
silent
print $rip
c
end
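As a side note, a quick way to run such a script is to find the server process id of your foreground session, attach gdb to it and load the script. A minimal sketch (the file name lgwr.gdb and the query are just illustrations):

-- in the foreground session: find its server process id
select spid from v$process p, v$session s where p.addr = s.paddr and s.sid = sys_context('userenv','sid');

# on the database server, as root or the oracle software owner
gdb -p <spid> -x lgwr.gdb
(gdb) c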

When this macro is run in a gdb session attached to an Oracle foreground process, and the following commands are executed:

insert into t values ('a');
commit;

The following output is generated:

$1 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$2 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$3 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$4 = (void (*)()) 0x7fc43b2189d0 <semctl>
$5 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

Row 1: this is the redo generated by the insert statement copied into the public logbuffer.
Row 2: this is the redo generated by the commit statement copied into the public logbuffer.
Row 3/4: entering the first ‘kcrf_commit_force_int’ phase, where the logwriter is semctl’ed.
Row 5: entering the second ‘kcrf_commit_force_int’ phase, where the session waits for the logwriter to finish writing.

Sounds logical, right?

However, when testing this thesis with two sessions doing the insert and commit at exactly the same time (using the broadcast input option in iTerm2), I got the following output:

(first session)
$6 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$7 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$8 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$9 = (void (*)()) 0x7fc43b2189d0 <semctl>
$10 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

(second session)
$1 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$2 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$3 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$4 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

I guess it’s easy to spot that the call to semctl is missing for the second session. What is happening here?

The first thing to do is investigate how semctl is called. The way to do that is by changing the commands to execute when gdb breaks on semctl, and getting a backtrace using the ‘bt’ command:

(gdb) info break 3
Num     Type           Disp Enb Address            What
3       breakpoint     keep y   0x00007fc43b2189d0 <semctl>
	breakpoint already hit 2 times
        silent
        print $rip
        c
(gdb) commands 3
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>silent
>print $rip
>bt
>c
>end
(gdb) c
Continuing.

Now execute the insert and the commit again, look at the backtrace:

$11 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$12 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$13 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$14 = (void (*)()) 0x7fc43b2189d0 <semctl>
#0  0x00007fc43b2189d0 in semctl () from /lib64/libc.so.6
#1  0x000000000cdf00a5 in sskgpwpost ()
#2  0x000000000cdedcff in skgpwpost ()
#3  0x000000000c94466f in kslpsprns ()
#4  0x000000000c936553 in kskpthr ()
#5  0x000000000c9442af in ksl_post_proc ()
#6  0x000000000ccc0b28 in ksasnd ()
#7  0x000000000ccc17ff in ksbasend ()
#8  0x000000000ccbc13f in kcrf_commit_force_int ()
#9  0x000000000ccb730a in kcrfw_redo_gen_ext ()
#10 0x000000000c9df1c8 in kcbchg1_main ()
#11 0x000000000c9ddc05 in kcbchg1 ()
#12 0x000000000c99421f in ktucmt ()
#13 0x000000000c975dce in ktcCommitTxn_new ()
#14 0x000000000c97c147 in ktcCommitTxn ()
#15 0x000000000ce504f9 in ktdcmt ()
#16 0x000000000cc44917 in k2lcom ()
#17 0x00000000029cacda in k2send ()
#18 0x000000000c97f87c in xctCommitTxn ()
#19 0x000000000270191f in kksExecuteCommand ()
#20 0x000000000cb6d678 in opiexe ()
#21 0x00000000021cf90e in kpoal8 ()
#22 0x000000000cb60d1d in opiodr ()
#23 0x000000000cd8f4db in ttcpip ()
#24 0x0000000001b96d5c in opitsk ()
#25 0x0000000001b9b721 in opiino ()
#26 0x000000000cb60d1d in opiodr ()
#27 0x0000000001b92eab in opidrv ()
#28 0x000000000269e4d1 in sou2o ()
#29 0x0000000000ba8d9a in opimai_real ()
#30 0x00000000026a8c1c in ssthrdmain ()
#31 0x0000000000ba8cdc in main ()
$15 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

This shows that inside kcrf_commit_force_int, these functions are called which lead up to semctl:

#0  0x00007fc43b2189d0 in semctl () from /lib64/libc.so.6
#1  0x000000000cdf00a5 in sskgpwpost ()
#2  0x000000000cdedcff in skgpwpost ()
#3  0x000000000c94466f in kslpsprns ()
#4  0x000000000c936553 in kskpthr ()
#5  0x000000000c9442af in ksl_post_proc ()
#6  0x000000000ccc0b28 in ksasnd ()
#7  0x000000000ccc17ff in ksbasend ()

This is starting to make sense: ksb is kernel service background processes, ksa is kernel service asynchronous messages. The next thing to do is execute the simultaneous commit again, break on these functions, and see how many of them are called:

(gdb) break sskgpwpost
Breakpoint 4 at 0xcdf0050
(gdb) break skgpwpost
Breakpoint 5 at 0xcdedce0
(gdb) break kslpsprns
Breakpoint 6 at 0xc944310
(gdb) break kskpthr
Breakpoint 7 at 0xc9364d0
(gdb) break ksl_post_proc
Breakpoint 8 at 0xc944270
(gdb) break ksasnd
Breakpoint 9 at 0xccc0a40
(gdb) break ksbasend
Breakpoint 10 at 0xccc1760
(gdb) commands 4-10
Type commands for breakpoint(s) 4-10, one per line.
End with a line saying just "end".
>silent
>print $rip
>c
>end
(gdb) c
Continuing.

This is what the output looks like:

(first session)
$1 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$2 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$3 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$4 = (void (*)()) 0xccc1760 <ksbasend>
$5 = (void (*)()) 0xccc0a40 <ksasnd>
$6 = (void (*)()) 0xc944270 <ksl_post_proc>
$7 = (void (*)()) 0xc9364d0 <kskpthr>
$8 = (void (*)()) 0xc944310 <kslpsprns>
$9 = (void (*)()) 0xcdedce0 <skgpwpost>
$10 = (void (*)()) 0xcdf0050 <sskgpwpost>
$11 = (void (*)()) 0x7f06f6f4a9d0 <semctl>
$12 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
(second session)
$1 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$2 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$3 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$4 = (void (*)()) 0xccc1760 <ksbasend>
$5 = (void (*)()) 0xccc0a40 <ksasnd>
$6 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$7 = (void (*)()) 0xccc1760 <ksbasend>
$8 = (void (*)()) 0xccc0a40 <ksasnd>

What this shows is that the first session starts the ‘dive’ from the kcrf_commit_force_int function towards the semctl system call, and the second session starts the same dive, but stops in the ksasnd function. After that, the second session enters the second phase of kcrf_commit_force_int, starts the same sequence of calls for semctl’ing the logwriter, and ends in ksasnd once again. The first session did do the entire dive to the semctl call, and remembers this when kcrf_commit_force_int is called the second time, so it does not dive again.

So how does this actually work in the functions ksbasend and ksasnd? The next step is to look at how the function ksasnd is called. For that, change the breakpoint for the ksasnd function to print the arguments of the function. Please mind I do not know the number of arguments; I normally print the first four arguments (which are in the cpu registers RDI, RSI, RDX and RCX, in that order) as a starting point. The reason for picking the ksasnd function is that there is a high probability (although it truly is a guess) that ksasnd means kernel service asynchronous messages send, and thus could give some indication of why one session does and the other does not continue calling the functions up to semctl. This is how that is done:

(gdb) info break 9
Num     Type           Disp Enb Address            What
9       breakpoint     keep y   0x000000000ccc0a40 <ksasnd>
	breakpoint already hit 1 time
        silent
        print $rip
        c
(gdb) commands 9
Type commands for breakpoint(s) 9, one per line.
End with a line saying just "end".
>printf "ksasnd %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
>silent
>c
>end

When another insert/commit is executed, this gives the following output:

$29 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$30 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$31 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$32 = (void (*)()) 0xccc1760 <ksbasend>
ksasnd 95dafd00, 2, 85, 18
$33 = (void (*)()) 0xc944270 <ksl_post_proc>
$34 = (void (*)()) 0xc9364d0 <kskpthr>
$35 = (void (*)()) 0xc944310 <kslpsprns>
$36 = (void (*)()) 0xcdedce0 <skgpwpost>
$37 = (void (*)()) 0xcdf0050 <sskgpwpost>
$38 = (void (*)()) 0x7f06f6f4a9d0 <semctl>
$39 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

Since KSA probably is kernel service asynchronous messages, one of the numbers could be related to the messages listed in x$messages. Upon some investigation it turns out 0x85 (decimal 133) is quite likely an x$messages number:

SYS@testdb AS SYSDBA> select indx, description, dest from x$messages where indx = 133;

      INDX DESCRIPTION               DEST
---------- ------------------------- ----------
       133 Redo writer post action   LGWR

That is what the session is doing: posting the logwriter for action! Just to settle what this means: activating the logwriter using the semctl system call is not the only thing that happens. Actually, the main thing happening is that the foreground process sends a specific request (message) to the logwriter, asking it to take action. The semctl call is just the way to make the process active.

At this point I got stuck on the function calls. However, there are more ways to see what is going on. I started to do some sampling using ‘perf’. Investigating the perf samples resulted in finding an additional function that is called during insert and commit: ksaamb (which probably means kernel service asynchronous (messaging) allocate message buffer). Also, memory profiling revealed a memory location that is read during the dive towards semctl: 0x60027e20.
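A minimal sketch of the sampling part (the pid is a placeholder; the exact perf options may vary per version, this is just generic call-graph sampling of the foreground process while the insert and commit run):

perf record -g -p <foreground pid> -- sleep 30
perf report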

The address 0x60027e20 lies within the fixed SGA variable kcrfsg_:

SQL> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv
  2 where to_number('60027e20','XXXXXXXX')
  3 between to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')
  4 and to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')+ksmfssiz-1;

KSMFSNAM	     KSMFSADR	      KSMFSSIZ
-------------------- ---------------- --------
kcrfsg_ 	     0000000060027C30	  1608

However, it is not externalised in x$kcrfws; running an access watch in gdb and selecting the entire x$kcrfws table does not trigger the watchpoint.

With these two newly found items, let’s continue the investigation:
– revert the breakpoint at ksasnd to just print it was triggered (print $rip).
– add a breakpoint for ksaamb in the same fashion as the other breakpoints.
– add an access watchpoint on address 0x60027e20 that continues when triggered:

(gdb) awatch *0x60027e20
Hardware access (read/write) watchpoint 1: *0x60027e20
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end

When this is run with two sessions at the same time (doing insert into and commit), the pieces fall into place:

(session 1)
$33 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$34 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$35 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$36 = (void (*)()) 0xccc1760 <ksbasend>
$37 = (void (*)()) 0xccc0a40 <ksasnd>
$38 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()
$39 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$40 = (void (*)()) 0xccc1760 <ksbasend>
$41 = (void (*)()) 0xccc0a40 <ksasnd>
$42 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()

(session 2)
$9 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$10 = (void (*)()) 0xccba840 <kcrfw_copy_cv>
$11 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>
$12 = (void (*)()) 0xccc1760 <ksbasend>
$13 = (void (*)()) 0xccc0a40 <ksasnd>
$14 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 1
New value = 0
0x000000000ccc0c50 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0ca1 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0d68 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 0
New value = 1
0x000000000ccc0d7c in ksaamb ()
$15 = (void (*)()) 0xc944270 <ksl_post_proc>
$16 = (void (*)()) 0xc9364d0 <kskpthr>
$17 = (void (*)()) 0xc944310 <kslpsprns>
$18 = (void (*)()) 0xcdedce0 <skgpwpost>
$19 = (void (*)()) 0xcdf0050 <sskgpwpost>
$20 = (void (*)()) 0x7f7320c8c9d0 <semctl>
$21 = (void (*)()) 0xccbb770 <kcrf_commit_force_int>

In this case, both sessions act exactly the same until the ksaamb function. In the ksaamb function, both sessions read memory address 0x60027e20. The first session reads the value 1, the second session reads 0 (which was a 1 previously, which is the reason gdb prints an old and a new value).

It seems that 0 indicates there are no messages pending for the logwriter, which is why the second session continues to run in the ksaamb function, while the first session, reading 1, cancels the action of executing semctl. In the second session the address 0x60027e20 is read a few more times, until it actually writes the value 1 into it.

To expand a bit on how this works, you can add latch gets and frees:

(gdb) break kslgetl
Breakpoint 13 at 0xc938af0
(gdb) commands
Type commands for breakpoint(s) 13, one per line.
End with a line saying just "end".
>silent
>printf "kslgetl %x\n", $rdi
>c
>end
(gdb) break kslfre
Breakpoint 14 at 0xc9397a0
(gdb) commands
Type commands for breakpoint(s) 14, one per line.
End with a line saying just "end".
>silent
>printf "kslfre %x\n", $rdi
>c
>end

This will show all the latch gets during both insert into and commit. However the interesting bit is:

(session 1)
$61 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()

(session 2)
$6 = (void (*)()) 0xccc0bf0 <ksaamb>
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0c50 in ksaamb ()
kslgetl 6000e660
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0ca1 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0d68 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 0
New value = 1
0x000000000ccc0d7c in ksaamb ()
kslfre 6000e660

The first session reads the value 1 and bails out. The second session reads the value 0 and continues. It then takes a latch at address 0x6000e660, and eventually sets the value to 1, and frees the latch at address 0x6000e660.

Probably most readers of this blog can already guess the latch name, but for completeness’ sake:

SYS@testdb AS SYSDBA> select addr, name from v$latch where to_number(addr,'XXXXXXXXXXXXXXXX') = to_number('6000e660','XXXXXXXX');

ADDR		 NAME
---------------- ----------------------------------------------------------------
000000006000E660 messages

The messages latch!

Conclusion
Whenever a session needs its redo to be written by the logwriter, it makes sure the redo is available for the logwriter using the function kcrfw_copy_cv, and then triggers the logwriter. When profiling system calls, it looks like the trigger simply is waking up the logwriter using the system call semctl. Upon deeper investigation, the way it works is that a message is actually sent to the logwriter, explicitly asking it to take action on the Oracle level. The construction of the logwriter sleeping in semtimedop and a process triggering it using semctl is an implementation detail.

A session that needs to trigger the logwriter uses the Oracle ksa (kernel service asynchronous messages) layer of functions. The function ksaamb (kernel service asynchronous messages allocate message buffer) reads a variable (0x60027e20) in the fixed SGA to make sure this work is not already being done. A value of 0 means no other process is doing it, a value of 1 means another session is already working on signalling the logwriter. If 1 is encountered, the session stops trying to signal the logwriter, and “hopes” the other session triggers the logwriter and that the logwriter will write its redo. However, to be absolutely sure, it will try to signal the logwriter once again in the second “phase” of the kcrf_commit_force_int execution (which is where the session waits for the logwriter to finish writing its redo).

Setting the fixed SGA variable is protected by a messages latch.

The essence of this article is that you might not always see a semctl call where you expect it. This is actually an optimisation, which makes sense once you understand the implementation details.

Recently I was asked to analyse the security impact of the snmp daemon on a recent Exadata. This system was running Exadata image version 12.1.2.1.3. This blog article gives you an overview of a lot of the things that surround snmp and security.

First of all, what packages are installed that do something with snmp? A list can be obtained in the following way:

# rpm -qa | grep snmp
net-snmp-utils-5.5-54.0.1.el6_7.1.x86_64
net-snmp-libs-5.5-54.0.1.el6_7.1.x86_64
net-snmp-5.5-54.0.1.el6_7.1.x86_64
sas_snmp-14.02-0103.x86_64

Essentially the usual net-snmp packages and a package called ‘sas_snmp’.

The next important thing is how the firewall is configured. However, by default the firewall on the Exadata compute nodes is turned off:

# iptables -L -v
Chain INPUT (policy ACCEPT 437M packets, 216G bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 343M packets, 748G bytes)
 pkts bytes target     prot opt in     out     source               destination

So if there is something running that listens on a network port and that could benefit ‘attackers’, there is no firewall to stop them.
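To double-check whether the iptables service is configured to start at all, a quick sketch for Oracle Linux 6 (which is what these compute nodes run, as the el6 packages above show):

# service iptables status
# chkconfig --list iptables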

Next obvious question is what snmp processes are actually running:

# ps -ef |grep snmp
root       7088      1  0 Aug16 ?        00:51:32 /usr/sbin/snmpd -LS0-6d -Lf /dev/null -p /var/run/snmpd.pid
root      33443      1  0 03:14 ?        00:00:49 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf
root      33454  33443  0 03:14 ?        00:00:00 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf

The snmpd process is the net-snmp snmp daemon. However, there are two additional processes running with the name ‘snmp’ in them: one started by init, and a process that this process has spawned. The name ‘lsi_mrdsnmpagent’ probably means LSI MegaRaid SNMP agent. That gives a fair hint that this process is doing something snmp related specifically for the LSI MegaRaid adapter, which is the disk controller.

Are there any open ports related to snmp processes?

# netstat -anp | grep snmp
tcp        0      0 127.0.0.1:199               0.0.0.0:*                   LISTEN      7088/snmpd
udp        0      0 0.0.0.0:161                 0.0.0.0:*                               7088/snmpd
udp        0      0 0.0.0.0:22917               0.0.0.0:*                               7088/snmpd

1. tcp port 199
This is support for the SMUX protocol (RFC 1227) to communicate with SMUX-based subagents. This is a deprecated feature in favour of AgentX. It is considered a bug (https://bugzilla.redhat.com/show_bug.cgi?id=110931) that the daemon still uses this port. However, the port is opened on localhost (127.0.0.1) only and as such is not reachable from outside of the machine, which means it is not a direct security problem.
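This can be verified from another machine the same way the other ports are checked below (a sketch, using the anonymised IP address from those examples). Since the daemon binds this port to 127.0.0.1 only and there is no firewall interfering, nmap should report the port as closed:

$ nmap -Pn -p 199 311.1.1.1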

2. udp port 161
This is the default snmpd port. This port is open to the outside world on the compute node, which can be seen from the address 0.0.0.0 in the local address column above. The port being open can be verified using another machine and the ‘nmap’ tool:

$ sudo nmap -Pn -sU -p 161 311.1.1.1
Password:

Starting Nmap 6.47 ( http://nmap.org ) at 2016-10-26 15:00 CEST
Nmap scan report for 311.1.1.1
Host is up (0.087s latency).
PORT    STATE SERVICE
161/udp open  snmp

The status ‘open’ shows this udp port does respond to requests.

3. udp port 22917 (in this case; this port number is random)
This is a random port that gets used because of the trapsink directive set in /etc/snmp/snmpd.conf. A trap sink is the destination for snmp traps that get triggered. Although this udp port is in use, it does not respond to network traffic:

$ sudo nmap -Pn -sU -p 22917 311.1.1.1
Password:

Starting Nmap 6.47 ( http://nmap.org ) at 2016-10-26 15:22 CEST
Nmap scan report for 311.1.1.1
Host is up.
PORT      STATE         SERVICE
22917/udp open|filtered unknown

The status ‘open|filtered’ means nmap received no response at all; in other words, this udp port does not respond to requests.

Now let’s look at what the actual configuration file of the snmp daemon looks like on Exadata. The configuration file is /etc/snmp/snmpd.conf:

snmp daemon configuration file:
trapcommunity public
trapsink 127.0.0.1 public
rocommunity public 127.0.0.1
rwcommunity public 127.0.0.1

access  RWGroup         ""      any       noauth    exact all all all
com2sec snmpclient      127.0.0.1               public
group   RWGroup                 v1                              snmpclient

pass .1.3.6.1.4.1.4413.4.1 /usr/bin/ucd5820stat
pass .1.3.6.1.4.1.3582 /usr/sbin/lsi_mrdsnmpmain

syscontact Root <root@localhost> (configure /etc/snmp/snmp.local.conf)
syslocation Unknown (edit /etc/snmp/snmpd.conf)

view    all             included      .1                80

The snmpd.conf file shows:
– trapsink destination (127.0.0.1, localhost) and community string (public).
– the ro and rw communities are set to ‘public 127.0.0.1’
In general it is advised to change the community strings to something unique to avoid them being easily guessable. However, in this case there’s also a source restriction following the ro and rw community, which is: 127.0.0.1. This means snmp access is restricted to localhost.
This can be verified by running snmpwalk from another machine:

$ snmpwalk -v 2c -c public 311.1.1.1
Timeout: No Response from 311.1.1.1

This means there is no way to communicate to the snmp daemon from outside of the machine. We can see from the snmp daemon configuration file that access to the snmp daemon is limited to localhost.
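If you do want to follow the general advice and not use the default ‘public’ community string, a minimal sketch of what that could look like in /etc/snmp/snmpd.conf is shown below. The community string is just an example, and be careful: the Exadata management software relies on this configuration, so test such a change thoroughly.

rocommunity MyUn1queString 127.0.0.1
rwcommunity MyUn1queString 127.0.0.1
trapcommunity MyUn1queString
trapsink 127.0.0.1 MyUn1queString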

It seems the LSI megaraid snmp agent works together with snmpd:

root      33443      1  0 03:14 ?        00:01:01 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf
root      33454  33443  0 03:14 ?        00:00:00 /usr/sbin/lsi_mrdsnmpagent -c /etc/snmp/snmpd.conf

Obviously it reads snmpd.conf (-c /etc/snmp/snmpd.conf as seen above), but it also has a configuration file of its own. This configuration file of the ‘lsi_mrdsnmpagent’ process can be seen when looking at the files inside the sas_snmp rpm package (rpm -ql sas_snmp), but the main evidence that it is using the file can be derived from looking at the open file descriptors of the lsi_mrdsnmpagent process:

# ls -ls /proc/$(pgrep -f lsi_mrdsnmpagent | head -1)/fd
total 0
0 lr-x------. 1 root root 64 Oct 26 03:14 0 -> /dev/null
0 lr-x------. 1 root root 64 Oct 26 03:14 1 -> /etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf
0 l-wx------. 1 root root 64 Oct 26 03:14 2 -> /var/log/cellos/cron_daily_cellos.stderr (deleted)
0 lrwx------. 1 root root 64 Oct 26 03:14 3 -> socket:[2923149143]
0 l-wx------. 1 root root 64 Oct 26 03:14 4 -> /var/log/cellos/cellos.log (deleted)
0 l-wx------. 1 root root 64 Oct 26 03:14 5 -> /var/log/cellos/cellos.trc (deleted)
0 lr-x------. 1 root root 64 Oct 26 03:14 6 -> /etc/snmp/snmpd.conf
0 lr-x------. 1 root root 64 Oct 26 03:14 7 -> /etc/redhat-release
0 lr-x------. 1 root root 64 Oct 26 03:14 8 -> /dev/megaraid_sas_ioctl_node
0 lr-x------. 1 root root 64 Oct 26 03:14 9 -> pipe:[2919419375]

File descriptor 1 points to ‘/etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf’! Let’s look inside that configuration file:

# cat /etc/lsi_mrdsnmp/sas/sas_TrapDestination.conf
#################################################
# Agent Service needs the IP addresses to sent trap
# The trap destination may be specified in this file or
# using snmpd.conf file. Following indicators can be set
# on "TrapDestInd" to instruct the agent to pick the IPs
# as the destination.
# 1 - IPs only from snmpd.conf
# 2 - IPs from this file only
# 3 - IPs from both the files
#################################################
TrapDestInd 3
#############Trap Destination IP##################
# Add port no after IP address with no space after
# colon to send the SNMP trap message to custom port.
# Community is to be mentioned after IP. If no community
# is mentioned, default SNMP community 'public' shall be
# used. 'trapcommunity' token is also used in snmpd.conf.
# Alternatively, you can also use trapsink command
# in snmpd.conf to send the SNMP trap message to
# custom port, else default SNMP trap port '162' shall
# be used.
127.0.0.1	public
# 145.147.201.88:1234	public
# 145.146.180.20:3061	testComm
127.0.0.1:8162 public

It is a configuration file that works alongside the snmpd.conf configuration. What is important to see is ‘TrapDestInd’, which is set to ‘3’, which means that traps are sent to the trap destinations set in the snmpd.conf file AND those set in the sas_TrapDestination.conf file. Two trap destinations are defined in this file: 127.0.0.1 with community string public, which means a trap is sent to the default snmp trap port (udp 162), but the most interesting thing here is that there’s also a trap sent to 127.0.0.1 at port 8162. That is a port number I do not know off the top of my head!

However, it’s simple to find out. The first thing to check is to see what process is running at port 8162:

# netstat -anp | grep 8162
udp        0      0 :::8162                     :::*                                    15233/java

That’s a java process! Let’s grep the process number to see if the full command line gives more clues as to what this java process is:

# ps -ef | grep 15233
dbmsvc    15233  15136  0 Aug16 ?        05:32:25 /usr/java/jdk1.7.0_80/bin/java -client -Xms256m -Xmx512m -XX:CompileThreshold=8000 -XX:PermSize=128m -XX:MaxPermSize=256m -Dweblogic.Name=msServer -Djava.security.policy=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server/lib/weblogic.policy -XX:-UseLargePages -XX:ParallelGCThreads=8 -Dweblogic.ListenPort=7878 -Djava.security.egd=file:/dev/./urandom -Xverify:none -da -Dplatform.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3 -Dwls.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server -Dweblogic.home=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/wlserver_10.3/server -Dweblogic.management.discover=true -Dwlw.iterativeDev= -Dwlw.testConsole= -Dwlw.logErrorsToConsole= -Dweblogic.ext.dirs=/opt/oracle/dbserver_12.1.2.1.3.151021/dbms/deploy/wls/patch_wls1036/profiles/default/sysext_manifest_classpath weblogic.Server

That’s java running weblogic, with the name ‘msServer’. That is part of the daemons that serve dbmcli (like the daemons that service cellcli on the cells)!

This actually makes sense. The daemons that manage the database server fetch hardware status information and hardware failures from the BMC using the IPMI device (/dev/ipmi0). However, the LSI MegaRaid adapter can not provide its status that way. So in order for the management daemons to keep track of events on the LSI MegaRaid adapter (hardware issues), a daemon that works together with the snmp daemon is set up, which sends snmp traps if something occurs. The management daemon has set up a port that listens for these traps.

I do not know if the community string is processed by the management daemon that processes the trap. However, the port number on which the daemon listens for traps is defined in ‘/opt/oracle/dbserver/dbms/deploy/config/cellinit.ora’ with the directive BMC_SNMP_PORT.
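A quick way to verify this (a sketch; on this system it should show the value 8162 found earlier with netstat):

# grep BMC_SNMP_PORT /opt/oracle/dbserver/dbms/deploy/config/cellinit.ora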

Conclusion
As far as I can see, the reason the snmp daemon is running is to be able to run the LSI MegaRaid SNMP agent process, so it can send traps to the compute node’s management daemons. Since most Exadata compute nodes do not have the firewall enabled, udp port 161 is exposed. The settings of the snmp daemon itself limit access to localhost.

It’s general knowledge that the Oracle database is ACID compliant, and that SCNs or ‘system change numbers’ are at the heart of this mechanism. This blogpost dives into the details of how the Oracle engine uses these numbers.

Oracle database version 12.1.0.2.161018
Operating system version: OL 7.2, kernel: 4.1.12-61.1.14.el7uek.x86_64 (UEK4)

Redo generation
Whenever DML is executed, redo is generated in the form of ‘change vectors’. These change vectors are copied into the redo buffer as part of the transaction, during the transaction. The function that performs this action is called ‘kcrfw_copy_cv()’. This can be derived by watching the foreground process perform memory copies into the memory area of the redo buffer.

In order to do this, you first need to find the memory area of the redo buffer. This can be done by executing ‘oradebug setmypid’ and ‘oradebug ipc’ as sysdba, and examining the resulting trace file.
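A minimal sketch of what that looks like in SQL*Plus (oradebug tracefile_name is just a convenient way to find the generated trace file):

SQL> oradebug setmypid
SQL> oradebug ipc
SQL> oradebug tracefile_name

The relevant part of the trace file looks like this: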

 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1  4784130 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008600000 default       readwrite

This shows the redo buffers are allocated between 0x96400000 and 0x9E934000 (the actual address and the actual address plus the total size).

The function Oracle uses to copy memory is called “_intel_fast_memcpy()”. This probably is an optimised version of the standard C memcpy() function. What is important are the arguments of the memcpy function:

$ man memcpy
NAME        
       memcpy - copy memory area
SYNOPSIS    
       #include <string.h>
       void *memcpy(void *dest, const void *src, size_t n);

So the first argument of memcpy is a pointer to the destination, the second argument is a pointer to the source, and the third argument is the size of the memory area to be copied.

In order to find the function that uses memcpy() to copy into the redo buffers area, we can use the following gdb macro:

break _intel_fast_memcpy if $rdi > 0x96400000 && $rdi < 0x9E934000
commands
silent
printf "_intel_fast_memcpy dest %lx, source %lx, size %d\n", $rdi, $rsi, $rdx
frame 1
c
end

The macro creates a conditional breakpoint on the function _intel_fast_memcpy(). The condition is that the CPU register RDI (first argument, destination) should be higher than 0x96400000 and lower than 0x9E934000, which is the memory area the redo log buffers are mapped to. If the breakpoint and its conditions are triggered, the function name is printed, together with the arguments to memcpy, and ‘frame 1’ to list the function that called _intel_fast_memcpy().

To execute this macro, start up an Oracle foreground session, attach to the session with gdb, and execute the macro. Next, execute an insert statement. In my case, I created a very simple table called ‘T’ with one field, T, varchar2(1). Please mind that you need to watch out for deferred segment creation, so ideally you first create the table and insert a few rows to take care of that.

This is what the gdb side looks like when executing an insert statement:

_intel_fast_memcpy dest 96817454, source 7ffc9fcc9598, size 36
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817478, source 7ffc9fcc9658, size 32
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817498, source 7faf6ad677f0, size 44
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174c4, source 7ffc9fcc9678, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174d8, source 7ffc9fcc95e4, size 76
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817524, source 7faf6ad67888, size 8
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 9681752c, source 7faf6ad67760, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817540, source 7faf6ad69ff8, size 40
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817568, source 7faf6ad6a070, size 24
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817580, source 7faf6ad69f60, size 49
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968175b4, source 89753568, size 1
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()

The destination of the memcpy() action is the logbuffer, the break conditions made sure of that. The source of all but the last memcpy() is the PGA (visible by the high address numbers), the last memcpy() copies from the SGA. That memory address points to an SQLA component (cursor) in the SGA.

Okay, at this point we have established that the redo of a DML operation is put into the redo buffers by a function called kcrfw_copy_cv(). Also please mind that modern Oracle versions put redo into the redo buffers while doing the DML, not after it!

To move further on the topic of transactions, let’s look at when the current SCN is fetched and when kcrfw_copy_cv() is called. To do this, I use the following gdb macro:

break kcrfw_copy_cv
commands
silent
printf "kcrfw_copy_cv\n"
c
end
awatch *0x6001fbb0
commands
c
end

When running this gdb macro against an Oracle session running an insert statement, the following output is generated:

Hardware access (read/write) watchpoint 2: *0x6001fbb0

Old value = 347661
New value = 347666
0x000000000ccbdfb3 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbeb04 in kcsgssn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbdfb3 in kcscur3 ()
kcrfw_copy_cv
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()

Actually, I was surprised by this output. I expected a call to generate a unique SCN for this change. However, the calls/functions which are visible are:
– kcscur3: a function to read a SCN (the global SCN, but is also used to read other SCNs, like the LWN SCN and the on-disk SCN).
– kcsgrsn: kernel cache service get recent SCN, for reading the global SCN.
– kcsgcsn: kernel cache service get current SCN, for reading the global SCN.
– kcsgssn: kernel cache service get snapshot SCN, for reading the global SCN.
All of these functions READ an SCN value, but do not modify it. In order to update an SCN, the function kcsgbsn (kernel cache service get batched SCN) should be called. There is a memory write shown at the start of the output (the old/new value reported for kcscur3), but that really is a read; the value was changed by another process, which is an anomaly of how watchpoints report memory that is changed by other processes.

So, the conclusion at this point is that DML (in this case an insert statement) just reads the global SCN, and does not seem to update it.

This got me thinking: if an insert statement just reads the global SCN instead of generating a unique SCN, shouldn’t it be possible for the changes (change vectors, actually) of DML executed simultaneously in different sessions on different segments to share the same SCN? It sounds counter-intuitive to what I have been taught about Oracle.

This can be tested in the following way:
1 – Have two foreground sessions in different schemas execute an insert statement at exactly the same time. This can be done in iTerm2 using the ‘broadcast input’ option, which allows input to be executed in multiple panes at the same time. The command executed in both panes is ‘insert into t values (‘a’);’.

2 – Have gdb attached to the two sessions, having an access watchpoint at the global SCN, to be able to identify the SCN that both the session used during executing the insert, and a breakpoint at kcrfw_copy_cv() to understand which SCN it uses. I used the following gdb macro:

break kcrfw_copy_cv
  commands
    silent
    printf "kcrfw_copy_cv\n"
    c
  end
awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    c
  end

3 – To validate what is visible in gdb, dump the current logfile and find the SCN as shown in gdb. The current logfile can be dumped to trace using the following PL/SQL procedure:

----------------------------------------
-- This script will dump the current redolog file.
-- If the log file is big, this might take some time.
--
--
-- Original Author : Riyaj Shamsudeen
-- Modifications by: Frits Hoogland
-- No implied or explicit warranty !
----------------------------------------
set serveroutput on size 1000000
declare
  v_sqltext varchar2(255);
begin
 select 'alter system dump logfile '||chr(39)||member||chr(39) into v_sqltext
  from v$log lg, v$logfile lgfile
  where lg.group# = lgfile.group# and
  lg.sequence# = ( select sequence# from v$log where status='CURRENT' )
  and rownum <2;
  dbms_output.put_line ('Executing :'||v_sqltext);
  execute immediate v_sqltext;
end;
/
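The dump ends up in the session’s trace file; a quick way to find it (a sketch) is:

SQL> select value from v$diag_info where name = 'Default Trace File';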

In step 2, I got the following gdb output on both the inserts:

(1st session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860
kcrfw_copy_cv

(2nd session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860
kcrfw_copy_cv

This is as expected based on our observation, because the insert statement only reads the global SCN. Now let’s dump the current redologfile as shown in step 3 above, and see if we can find SCN 5c860.
Here are the redo records for both changes:

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.0010 LEN: 0x01a8 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.01b8 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

Compare the SCN from the gdb sessions with the SCN in both redo records: the SCN and even the SUBSCN values are the same. Also note that the RBA (redo block address) is different.

Both redo records contain 3 changes:

(1st session)
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0005c801 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:32 AFN:3 DBA:0x000000f9 OBJ:4294967295 SCN:0x0000.0005c800 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600c OBJ:20455 SCN:0x0000.0005c857 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

(2nd session)
CHANGE #1 CON_ID:0 TYP:0 CLS:35 AFN:3 DBA:0x00000110 OBJ:4294967295 SCN:0x0000.0005c7b2 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:36 AFN:3 DBA:0x00000895 OBJ:4294967295 SCN:0x0000.0005c78c SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600e OBJ:20455 SCN:0x0000.0005c858 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

These 3 changes are consistent with an insert of a single row, which is visible by the opcodes:
5.2=update rollback segment header
5.1=undo block
11.2=insert row piece

This means the DML of the two inserts has exactly the same SCN! A logical thought that you might come up with is: but how is Oracle then able to apply these in the correct order, which is one of the fundamental reasons for having SCNs? For that, it’s important to realise this is just the change to a block, not a committed transaction yet. In order to get a full understanding of how the SCN mechanism works for at least a single row insert, we need to change the gdb macro from the test above:

awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    if $rip >= 0xccbe340 && $rip <= 0xccbe7b0
      printf "kcsgbsn - SCN: %x (@%x)\n", *0x6001fbb0, $rip
    end
    c
  end
break kcrfw_copy_cv
  commands
    silent
    printf "kcrfw_copy_cv\n"
    c
  end

The function kcsgbsn means: kernel cache service get batched SCN. This function DOES update the global SCN(!). However, this function does not get called during the insert. It does get called when commit is called (or rollback).

So, in order to take another step in understanding SCN handling by Oracle, let’s use the updated version of the gdb macro, insert another row into the table in two sessions, and then also commit:

(1st session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcrfw_copy_cv
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcrfw_copy_cv

(2nd session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcrfw_copy_cv
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcsgbsn - SCN: 7e39c (@ccbe63a)
kcrfw_copy_cv

In both sessions we see the insert (the first two kcscur3 reads followed by kcrfw_copy_cv), and then the commit (another kcscur3 read, the kcsgbsn calls and the second kcrfw_copy_cv).
During the insert, we see both sessions using the same SCN (7e39b), which can be verified by dumping the current redologfile:

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.0010 LEN: 0x01c0 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.01d0 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

And we see kcsgbsn() being called. As has been documented in multiple places, the current value of the global SCN is read, and then advanced (get and advance). This can be seen in the two kcsgbsn lines of the first session: 0x7e39b is read, and then advanced to 0x7e39c. However, to my honest surprise, the second session does the same, as its kcsgbsn lines show: it advances from 0x7e39b to 0x7e39c too, although it rereads the SCN one additional time. The surprise is that both sessions seem to have the same SCN for the commit. Validation from the dump of the current redologfile:

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.0174 LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.01dc LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:19 AFN:3 DBA:0x00000090 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

Yes, both commits (visible by the opcode (OP) 5.4) have the same SCN. At this point I think it’s fitting to theorise why this is happening. I think this happens and makes sense because the table is protected for consistency at another level in the database, which is the locking mechanisms (TX locks for rows, TM locks for segments), which guarantee that two transactions at the same time do not compromise data consistency. And because of that, it’s not a problem to batch them with the same SCN.

Conclusion
Changes in the database are synchronised by system change numbers, also known as SCNs. SCNs are the way Oracle serialises changes to data. However, this post shows that the Oracle engine tries to maximise the number of changes inside an SCN, and as such an SCN is not a highly granular serialisation mechanism. This is not a problem, since Oracle’s transactional protection mechanisms (like TX and TM locks) work alongside SCNs to make sure no conflicting changes can happen at the same time. The changes themselves just read and pick up the current global SCN of the instance; a commit of these changes does increase the global SCN, but as little as possible. You might find multiple changes with the same SCN, and you might even find commits with the same SCN.

Bonus topic: finding kcscur3.
In case you wondered what this little macro means, and how I got the memory addresses:

awatch *0x6001fbb0
  commands
    silent
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    end
    c
  end

awatch is a memory access watch function. The commands executed when this watchpoint triggers are:
-silent (do not display the default output)
-if $rip >= 0xccbdfa0 && $rip <= 0xccbe010 (if the instruction pointer contains a value higher than or equal to 0xccbdfa0 and lower than or equal to 0xccbe010)
-printf "kcscur3 – SCN: %x\n", *0x6001fbb0 (print some text and print, in hexadecimal format, the value that the pointer 0x6001fbb0 points to)
The text 'kcscur3' in the printf spoils that these addresses are actually the memory addresses where the function kcscur3 is located. How to find these? The starting address can be found in a variety of ways, for example in gdb: 'info function kcscur3', or in the shell using nm: 'nm oracle | grep kcscur3'. It is a bit harder to find the ending address of a function. However, it is still quite easily done with nm: 'nm -S oracle | grep kcscur3':

[root@o12102fs bin]# nm -S oracle | grep kcscur3
000000000ccbdfa0 0000000000000070 T kcscur3

The first column is the starting memory address of the kcscur3 function, the second column is the length (in hexadecimal) of the function code. 0xccbdfa0+0x70=0xccbe010.
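For those who want to check that arithmetic quickly, a trivial shell one-liner (nothing Oracle specific):

$ printf "0x%x\n" $((0xccbdfa0 + 0x70))
0xccbe010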

Update October 24, 2016:
Roger noted the following: Minor correction, Oracle doesn’t do DML directly, it first generates change vectors which are then applied to the local database which is effectively running in recovery. This way Oracle uniquely guarantees recovery on the primary or physical standby will always do exactly the same thing as the original DML.

Stefan noted the following: Hey Frits. Great work – as always. However i am not quite sure about your kcrfw_copy_cv() PGA interpretation for private redo strands. AFAIK private redo strands reside in permanent memory allocations (in SGA) and you should see memory copy operations between these memory chunks and the public redo buffer(s) on commit. Your GDB breakpoint should only kick in under these circumstances as anything else is handled in private redo (at least on your mentioned database version) and no memory operations on redo buffer. I also modeled a short test case with 12.1.0.1 on OEL 6.7 to verify my theory/assumption and the result was as expected – not quite sure about your observed and mentioned PGA to redo log buffer copy operations then.

I should add that when I run snapper on the insert in my database, the statistics ‘IMU- failed to get a private strand’ and ‘IMU pool not allocated’ are increased, giving a hint no private strands are used.
