This is a run through of a performance investigation into Oracle tablespace encryption. These are the versions this test was performed on:

$ cat /etc/oracle-release
Oracle Linux Server release 6.8
$ /u01/app/oracle/product/ lspatches
24315824;Database PSU, Oracle JavaVM Component (OCT2016)
24006101;Database Patch Set Update : (24006101)

In this test I created an encrypted tablespace:

SQL> create tablespace is_encrypted datafile size 10m autoextend on next 10m encryption default storage(encrypt);

(this assumes you have setup a master encryption key already)
And I created an encrypted simple heap table with one row:

SQL> create table t_is_encrypted ( f1 varchar2(100) ) tablespace is_encrypted;
SQL> insert into t_is_encrypted values ('AAAAAAAAAA');

Also, in order to do some performance tests, I created a simple heap table with a lot of rows:

SQL> create table t_encrypted_large ( id number, random_string varchar2( 100 ) ) tablespace is_encrypted;
SQL> begin
        for counter in 1..32000000 loop
                insert into t_encrypted_large values ( counter, dbms_random.string('l',50) );
        end loop;

Let me first go back to basics a little and explain how tablespace encryption works: tablespace encryption sets the ‘E’ flag in the cache header of blocks of a segment which contain data (block type# 6, ‘trans data’), which means the block will be UNencrypted in the buffercache, but will be encrypted once the block is persisted to disk. If a block is read from disk into the cache, it means it must be decrypted first. Once a block is in the cache and decrypted, it can be read and used without the overhead of encryption, because it is not encrypted anymore.

A special case arises when direct path reads are used by Oracle. A direct path read means the blocks are read from disk into the process’ PGA, so there is no involvement of the buffer cache. This means that direct path reads ALWAYS must decrypt the data blocks with tablespace encryption, which means it will suffer the overhead of decryption. This inherent overhead applies to any direct path action (which means both direct reads and direct path writes).

Let’s first do a simple count(*) on table t_encrypted_large, and collect a wait profile. I am doing this with Tanel Poder’s snapper. First session, which collects the profile:

SQL> var snapper refcursor
SQL> @snapper all,begin 1 1 793

Now with the begin snapshot ready, head over to the other session, which will scan the t_encrypted_large table:

SQL> select count(*) from t_encrypted_large;


Now end the snapper snapshot and look at where the time was spend:

SQL> @snapper all,end 1 1 793
    SID, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
    793, SYS       , TIME, parse time elapsed                                        ,           140,     5.32us,      .0%, [          ],          ,           ,
    793, SYS       , TIME, PL/SQL execution elapsed time                             ,           113,     4.29us,      .0%, [          ],          ,           ,
    793, SYS       , TIME, DB CPU                                                    ,       4912502,   186.66ms,    18.7%, [@@        ],          ,           ,
    793, SYS       , TIME, sql execute elapsed time                                  ,       4915777,   186.78ms,    18.7%, [##        ],          ,           ,
    793, SYS       , TIME, DB time                                                   ,       4916441,   186.81ms,    18.7%, [##        ],          ,           ,       2.66 % unaccounted time
    793, SYS       , WAIT, Disk file operations I/O                                  ,            69,     2.62us,      .0%, [          ],         2,        .08,     34.5us average wait
    793, SYS       , WAIT, db file sequential read                                   ,           484,    18.39us,      .0%, [          ],         3,        .11,   161.33us average wait
    793, SYS       , WAIT, db file scattered read                                    ,           220,     8.36us,      .0%, [          ],         1,        .04,      220us average wait
    793, SYS       , WAIT, direct path read                                          ,          1641,    62.35us,      .0%, [          ],         4,        .15,   410.25us average wait
    793, SYS       , WAIT, SQL*Net message to client                                 ,             2,      .08us,      .0%, [          ],         3,        .11,      .67us average wait
    793, SYS       , WAIT, SQL*Net message from client                               ,      20701153,   786.56ms,    78.7%, [WWWWWWWW  ],         3,        .11,       6.9s average wait

The time spend executing SQL is 186.78ms, and during that time, 186.66ms was spend on CPU (DB CPU). The waits acknowledge this, the time in IO related waits is measured in microseconds. In case you wondered if this output is sane, this system is using local NVMe disks, and you see the full table scan flipping from buffered multiblock IO (db file scattered read) to direct path multiblock IO (direct path read). The conclusion at this point is that as good as all time was spend running on CPU. Of course this might be different if IO has higher latency (random IO latency of rotating media is between 5-12ms, reading from a SAN cache normally has a latency of around 1ms).

It is not possible to get a breakdown of where (ON) CPU time is spend inside the database. There are session counters, but these do not depict time. However, on the operating system level there is a useful tool: perf. In my experience, perf output is only really useful when the hardware PMU events are available from the operating system. This can be tested by specifying the event to use for perf. If you have the PMU events available, perf stat -e cycles sleep 1 will return something like:

$ perf stat -e cycles sleep 1

 Performance counter stats for 'sleep 1':

           736,481      cycles

       1.001210217 seconds time elapsed

If you don’t have PMU events available (for you are running in a virtual machine for example, however some virtualisation products can give access to the PMU events), it will look like:

$ perf stat -e cycles sleep 1

 Performance counter stats for 'sleep 1':

   <not supported>      cycles

       1.001025828 seconds time elapsed

Back to the original purpose of this blogpost: the performance of tablespace. Let’s get a breakdown of the functions the time is spend on doing the ‘select count(*) from t_encrypted_large’. This is done by first executing perf record in one session with the process ID specified of the server process, which then executes the ‘select count(*) from t_encrypted_large’. Once the SQL is ready, the perf session can be stopped (ctrl-c), and the contents of the gathered perf database can be reported. This is how the recording is done:

$ perf record -g -p 57531

Now the ‘select count(*) from t_encrypted_large’ is executed. Once the result comes back in the sqlplus session, go to the perf session and ctrl-c it:

^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.928 MB (5114 samples) ]

Now that the perf data is collected, display a summary of the data:

$ perf report --sort comm --max-stack 2
# To display the header info, please use --header/--header-only options.
# Samples: 5K of event 'cycles'
# Event count (approx.): 15445982213
# Children      Self  Command
# ........  ........  ...............
   100.00%   100.00%  oracle_57531_ao
            |--27.44%-- kdr4chk
            |--15.91%-- kdb4chk1
            |--10.14%-- cipher_loop_p3
            |--8.56%-- msort_with_tmp
            |--5.84%-- kdrrea2
            |--3.98%-- memcpy
            |--3.12%-- kd4_ent_cmp
            |--2.13%-- kdrrrh2
            |--1.99%-- kdstf00000010000100kmP
            |--1.53%-- __intel_ssse3_rep_memcpy
            |--1.42%-- __intel_new_memset
            |--1.40%-- blks_loop_p3
            |--1.14%-- y8_ExpandRijndaelKey
            |--1.06%-- kdbulk
            |--0.88%-- ownGetReg
            |--0.62%-- kghalf
             --12.86%-- [...]

What this shows, is 100% of the samples are from the process ‘oracle_57531_ao’, 27.44% of the samples were spend in a function called ‘kdr4chk’, 15.91% in a function called ‘kdb4chk1’, and so on.

Just by looking at the function names, these do not strike me as functions that do decryption. Function names like ‘cipher_loop_p3’ and ‘y8_ExpandRijndaelKey’ more look like functions doing some of the heavy lifting for decryption. In fact, the function names the most time is spend in (kdr4chk and kdb4chk1) seem to indicate some kind of checking.

A first thought is to look at the database parameters for block checking, and turn them off, and redo the test.

SQL> show parameter db_block_check

------------------------------------ ----------- ------------------------------
db_block_checking		     string	 OFF
db_block_checksum		     string	 FALSE

In fact, I already turned db block checking and db block checksum off. So it’s not the checking governed by these parameters, so what can it be then?

The best way I could come up with is getting a trace of the exact functions the process is executing. However, the number of C functions an executable goes through is typically gigantic. So for that reason, doing an investigation of the C functions, it’s best to use a very small table, which is ‘t_is_encrypted’, which contains only one row, in order to minimise the processing.

However, there is no tooling by default on Linux that gives you the ability to trace the exact C functions a process is executing. perf, which was just used SAMPLES execution. However, the intel pin tools give you the options to create a C function execution overview. Please use the link to see how you can download and link pin tools, and then create a debugtrace of the execution of ‘select * from t_is_encrypted’ for yourself. If you are interested in the findings based on the debugtrace, read on!

First of all, understanding the functions requires some experience. I don’t think there is any workaround for gaining the knowledge, in order to learn how to interprent them you just have to study them and spending time on it to work it out.

The output of debugtrace on ‘select * from t_is_encrypted’ (which fetches the one row in the table from an encrypted tablespace) looks like this:

         3 @@@ return underflow
Return 0x00007f225b355818 /lib64/ returns: 0x151
        14 @@@ return underflow
Return 0x000000000cf1036d /u01/app/oracle/product/ returns: 0
        31 @@@ return underflow
Return 0x000000000cf102c0 /u01/app/oracle/product/ returns: 0
       119 @@@ return underflow
Return 0x000000000cef19dd /u01/app/oracle/product/ returns: 0
       122 @@@ return underflow
Return 0x000000000cef164a /u01/app/oracle/product/ returns: 0
       152 Call 0x000000000cf00486 /u01/app/oracle/product/ -> 0x0000000005b2bdb0 /u01/app/oracle/product/, 0x7f225a965da0, ...)
       156 Tailcall 0x0000000005b2bdbe /u01/app/oracle/product/ -> 0x0000000005b2bda0 /u01/app/oracle/product/, 0x7f225a965da0, ...)
       158 | Call 0x0000000005b2bda1 /u01/app/oracle/product/ -> 0x0000000005b32990 /u01/app/oracle/product/, 0x7f225a965da0, ...)
       172 | Return 0x0000000005b34d77 /u01/app/oracle/product/ returns: 0x7ffe4ed10685
       174 Return 0x0000000005b2bda7 /u01/app/oracle/product/ returns: 0x7ffe4ed10685
       197 @@@ return underflow

That is not very well readable. I created some simple sed filters to make the output more readable:

cat debugtrace.out | sed 's/^\ *\([0-9]*\)\(\ .*\)Call.*-> 0x.*:\(.*\)/\2> \3/' | sed 's/^\ *\([0-9]*\)\(\ .*\)Return.*:\(.*returns: .*\)/\2< \3/' | sed 's/^\ *\([0-9]*\)\(\ .*\)Tailcall.*-> 0x.*:\(.*\)/\2<> \3/' | grep -v underflow

If you use that on the same file, it gets better, with the aim to show the function call flow:

Return 0x00007f225b355818 /lib64/ returns: 0x151
Return 0x000000000cf1036d /u01/app/oracle/product/ returns: 0
Return 0x000000000cf102c0 /u01/app/oracle/product/ returns: 0
Return 0x000000000cef19dd /u01/app/oracle/product/ returns: 0
Return 0x000000000cef164a /u01/app/oracle/product/ returns: 0
 > _intel_fast_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
 <> _intel_fast_memcpy.P(0x7ffe4ed10685, 0x7f225a965da0, ...)
 | > __intel_ssse3_rep_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
 | < __intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffe4ed10685
 < _intel_fast_memcpy.P+0x000000000007 returns: 0x7ffe4ed10685
Return 0x000000000cf004ed /u01/app/oracle/product/ returns: 0
 > kslwtectx(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | > sltrgftime64(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | | > sltrgftime64_cgt(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | | | > clock_gettime@plt(0x1, 0x7ffe4ed10230, ...)
 | | | | > clock_gettime(0x1, 0x7ffe4ed10230, ...)
 | | | | < clock_gettime+0x000000000054 returns: 0
 | | | < clock_gettime+0x00000000006a returns: 0
 | | < sltrgftime64_cgt+0x000000000058 returns: 0x68724a87136
 | < sltrgftime64+0x00000000005c returns: 0x68724a87136
 | > kslwt_end_snapshot(0x2c674c080, 0x2c674c080, ...)
 | < kslwt_end_snapshot+0x000000000235 returns: 0x2c674b128
 | > kslwt_update_stats_int(0x2c674c080, 0x2c674c080, ...)
 | | > kews_update_wait_time(0x6, 0x53696ef, ...)
 | | < kews_update_wait_time+0x00000000002b returns: 0x1
 | < kslwt_update_stats_int+0x000000000314 returns: 0x1b0
 | > kskthewt(0x68724a87136, 0x184, ...)
 | < kskthewt+0x0000000000d7 returns: 0x23
 < kslwtectx+0x0000000003c0 returns: 0xe923

The rows that start with ‘Return’ are functions that were already entered when debugtrace started, and thus debugtrace could not determine the level, and reports an underflow in the output when the process returns from them.
The rows that start with ‘>’ mean the function after it is entered. The first two arguments are shown.
The rows that start with ‘<‘ mean the function after it is returned from. The return code is shown.
The rows that start with ‘<>’ mean a ‘tail call’ was executed, which is function that is executed as part of the return of another function.

To further emphasise this example, the above trace start with the server process waiting on input. On linux, this is done using a read function (the __read_nocancel function comes from the libpthread operating system library), which is a place a server process is expected to wait and is actually in an oracle wait event (SQL*Net message from client). It becomes apparent the process was in a wait event because the code for ending a wait event (kslwtectx) is executed after the __read_nocancel function. A word of warning: in this output, kslwtectx seems to be the beginning (left most) function. In reality, this is not true, there are many more functions, but the process has not returned to them so they are not known at that point of the trace. For the function a waiting database server process is sleeping in (__read_nocancel), this is sntpread>ntpfprd>nsbasic_brc>nsbrecv>nioqrc>opikndf2>opitsk>opiino>opiodr>opidrv>sou2o>optima_real>ssthrdmain>main.

Okay, now let’s look into the relevant output of the ‘select * from t_is_encrypted’. If you wade through the trace, it’s best to first go the the line with for ‘< kcbz_table_scan_read'. The kcbz_table_scan_read function performs a buffered multiblock read. '<' means the return, and you need to go up from there to find the first important bit. Here’s the important bit:

 | | | | | | | | | | | > ksfd_clear_pgaobjsc(0, 0x2b, ...)
 | | | | | | | | | | | < ksfd_clear_pgaobjsc+0x000000000023 returns: 0
 | | | | | | | | | | | > _intel_fast_memcpy(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x1a5ad6000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x1a5ad6000
 | | | | | | | | | | | > _intel_fast_memcpy(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x182808000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x182808000
 | | | | | | | | | | | > _intel_fast_memcpy(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x178ed0000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x178ed0000
 | | | | | | | | | | | > _intel_fast_memcpy(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x12927c000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x12927c000
 | | | | | | | | | | | > _intel_fast_memcpy(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x15411c000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x15411c000
 | | | | | | | | | | | > kcbz_free_table_scan_read_buffer(0x2b1115000, 0xa000, ...)
 | | | | | | | | | | | | > kslgetl(0x60023a80, 0x1, ...)
 | | | | | | | | | | | | < kslgetl+0x000000000144 returns: 0x1
 | | | | | | | | | | | | > kslfre(0x60023a80, 0x8, ...)
 | | | | | | | | | | | | < kslfre+0x000000000125 returns: 0
 | | | | | | | | | | | < kcbz_free_table_scan_read_buffer+0x0000000000f3 returns: 0
 | | | | | | | | | | < kcbz_table_scan_read+0x000000000664 returns: 0x5

Going up from kcbz_table_scan_read return, we see a call to kcbz_free_table_scan_read_buffer, which gets and frees a latch (‘cache table scan latch’). Above that we essentially see the table scan read buffer contents being copied into the buffer cache per block using an optimised version of the memcpy system call. Luckily, debugtrace shows the first two arguments of a call, for memcpy this is a pointer to the destination as first argument, and a pointer to the source as second. If you look closely to the memcpy arguments, you see that the destination is random (“scattered” through the buffer cache, which is where ‘db file scattered read’ gets its name from), and the source is starting at 0x2b1115000 and increases by exactly 0x2000. 0x2000 equals 8192, which is the size of the blocks. The reason for starting here is the blocks are further processed in the buffer cache buffers in order they are copied in kcbz_table_scan_read.

The first buffer copied is 0x1a5ad6000. The next sequence of functions:

 | | | | | | | | | | > kcbzvb(0x1a5ad6000, 0x5, ...)
 | | | | | | | | | | | > kcbhvbo(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | < kcbhvbo+0x0000000000cc returns: 0
 | | | | | | | | | | < kcbzvb+0x000000000084 returns: 0x1
 | | | | | | | | | | > kcbztek_trace_blk(0xd5e15b8, 0, ...)
 | | | | | | | | | | < kcbztek_trace_blk+0x00000000001d returns: 0x1a5fce498
 | | | | | | | | | | > _setjmp@plt(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | <> __sigsetjmp(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | <> __sigjmp_save(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | < __sigjmp_save+0x000000000010 returns: 0
 | | | | | | | | | | > kcbz_encdec_tbsblk(0x5, 0x38b, ...)

Next is a function ‘kcbzvb’ which probably is kernel cache buffers space management (Z) verify buffer. Then kcbztek_trace_blk. My current thinking is this function reads the block to see if the encryption flag is set in the cache header, and it hasn’t been decrypted already. When a block is is not encrypted, or is encrypted and already decrypted, kcbz_encdec_tbsblk is not called; in this case, the block is encrypted and not decrypted yet, so we see the function kcbz_encdec_tbsblk being called. A lot of function calls inside the kcbz_encdec_tbsblk are done which strongly indicate having something to do with keys and encryption, like kcbtse_get_tbskey, kcbtse_get_tbskey_info and a lot of functions starting with kgce (kgce_init, kgce_set_mode, kgce_set_key) most notably the function kgce_dec, which probably is kernel generic encryption – decryption. Here is how that function returns and what it is executes next (still inside the function kcbz_encdec_tbsblk):

 | | | | | | | | | | | < kgce_dec+0x0000000000a6 returns: 0
 | | | | | | | | | | | > _intel_fast_memcpy(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x0000000025e8 returns: 0x1a5ad6014
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x1a5ad6014
 | | | | | | | | | | | > kcbhxoro(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | | > sxorchk(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | | < sxorchk+0x0000000000e6 returns: 0x46dd
 | | | | | | | | | | | < kcbhxoro+0x000000000013 returns: 0x46dd

The kgce_dec functions returns. It is my assumption the decryption of the block is done. Next there is another memcpy function. The arguments of memcpy are *dest, *src. The first address is the same address we saw earlier, but at 0x1a5ad6014 instead of 0x1a5ad6000, so 14 bytes past the start of the block. The source is 0x7f225a96c0f0. In most cases these high numbers are PGA memory addresses. So what I think happens here, is the block was copied into the buffer, then kcbztek_trace_blk found the encryption flag and the block encrypted, so the kcbz_encdec_tbsblk function has been called to decrypt the block, which is done in PGA, and then the decrypted version of the block is copied into the buffer overwriting the encrypted version.

The next function kcbhxoro probably is a function to check the block. This must mean the checking is done on the unencrypted block contents. Please mind we are still in kcbz_encdec_tbsblk!

Continuing the trace we see:

 | | | | | | | | | | | > kcbchk_ctx(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | > ktbdbchk(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | | > kdgchk(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | | | > kd4chk(0x1a5ad6000, 0x7fffffc7, ...)
 | | | | | | | | | | | | | | | > kdb4chk1(0x1a5ad6014, 0, ...)
 | | | | | | | | | | | | | | | | > kd4_alloc_ok(0xc9cee30, 0xc9cee80, ...)
 | | | | | | | | | | | | | | | | | > kcbtmal(0, 0x4, ...)
 | | | | | | | | | | | | | | | | | | > kghalf(0x7f225f080c80, 0x7f225f085980, ...)
 | | | | | | | | | | | | | | | | | | | > _intel_fast_memset(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | <> _intel_fast_memset.J(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | > __intel_memset(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | < __intel_memset+0x00000000031e returns: 0x7f225ef1ffb0
 | | | | | | | | | | | | | | | | | | | < _intel_fast_memset.J+0x000000000007 returns: 0x7f225ef1ffb0
 | | | | | | | | | | | | | | | | | | < kghalf+0x00000000061f returns: 0x7f225ef1ffb0

These are the apparent check functions! We also see it takes the buffer address as first argument. This is where perf measured the majority of the time being spend (kdr1chk, not in this snippet, and kdb4chk1)! The last noteworthy things in the trace for handling this block are:

 | | | | | | | | | | | < kcbchk_ctx+0x0000000000ff returns: 0
 | | | | | | | | | | | > kgce_fini(0x7f225f080c80, 0x7ffe4ed03b98, ...)
 | | | | | | | | | | | | > skgce_end_ipp_f(0x7f225f080c80, 0x7ffe4ed03b98, ...)
 | | | | | | | | | | | | < skgce_end_ipp_f+0x000000000014 returns: 0
 | | | | | | | | | | | < kgce_fini+0x000000000037 returns: 0
 | | | | | | | | | | < kcbz_encdec_tbsblk+0x000000001f36 returns: 0

Here we see the return from the kcbchk_ctx call, which were identified above as ‘check functions’, then another call to a function in the encryption family (kgce_fini), which clearly indicates it’s stopping the encryption/decryption functions (“fini”), and then returning from the kcbz_encdec_tbsblk function. There are some functions executed after kcbz_encdec_tbsblk, like kcbz_check_objd_typ. A little further you will see the whole sequence starting again for the next block with the function kcbzvb.

This makes me wonder. We saw the block being read from disk into the table_scan_read_buffer, from the read buffer into the cache, then being decrypted, then being checked. I turned db_block_checking and db_block_checksum off, why would there be a need for checking? But let’s not get carried away, I made some assumptions. I am right in my assumptions?

If you look at the logical flow it’s:
– table scan read buffer -> buffer cache buffer (kcbz_table_scan_read>_intel_fast_memcpy)
– identify if block needs decrypting (kcbz_encdec_tbsblk>kcbztek_trace_blk)
– decrypt buffer cache buffer (kcbz_encdec_tbsblk>kcgce_dec)
– check buffer cache buffer (kcbz_encdec_tbsblk>kcbchk_ctx)

If I can identify the buffer, then stop the execution at kcbztek_trace_blk to look at the buffer to check for it being encrypted, then stop at kcbchk_ctx and look at the same buffer again, and see if it is decrypted, my assumption that the kgce_dec function is doing the decryption and that the checking is based on the unencrypted block contents are correct.

In order to do this, we can use gdb and have the function arguments printed of memcpy, and break at the function kcbztek_trace_blk, print the buffer and then continue and break at kcbchk_ctx and print the buffer again. This is how I created my gdb macro:

break _intel_fast_memcpy
    printf "_intel_fast_memcpy dest: %lx, src: %lx, size: %d\n", $rdi, $rsi, $rdx
break kcbztek_trace_blk
    print $rip
break kcbchk_ctx
    print $rip

Now create a database session, flush the buffer cache (alter system flush buffer_cache) to make sure the small table t_is_encrypted is read from disk again, and thus decrypted, and attach to it with gdb and source the macro:

$ gdb -p 57531 -x tde_verification.gdb
(gdb) set pagi off
(gdb) c

I turned pagination off (set pagi off) because there is a lot of memcpy going on, otherwise gdb will pause execution if the screen is filled, and used the continue (c) command, so we can use the sqlplus session again. Now execute ‘select * from t_is_encrypted’, this is what it looks like in the debugger:

_intel_fast_memcpy dest: 7f85093a7f0a, src: 7ffd589f6690, size: 7
_intel_fast_memcpy dest: 7f85093a7f11, src: 7ffd589f6624, size: 4
_intel_fast_memcpy dest: 7f85093a7f15, src: 7ffd589f6628, size: 4
_intel_fast_memcpy dest: 7f85093a7f19, src: 7ffd589f662c, size: 4
_intel_fast_memcpy dest: 7f85093a7f1d, src: 7ffd589f6630, size: 4
_intel_fast_memcpy dest: 7f85093bac08, src: 7ffd589f3238, size: 336
_intel_fast_memcpy dest: 7ffd589f0e30, src: 2b7f89c20, size: 456
_intel_fast_memcpy dest: 7ffd589f3238, src: 7f85093bac08, size: 336
$1 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>

This is actually reading the segment header, so we continue (c).

(gdb) c
_intel_fast_memcpy dest: 7f8508bf86a8, src: 7f8508bf8060, size: 712
_intel_fast_memcpy dest: 7f8508bf86c0, src: 7f8508bf8078, size: 264
_intel_fast_memcpy dest: 7f85093bac08, src: 7ffd589eb308, size: 336
_intel_fast_memcpy dest: 7ffd589e8f00, src: 2b7f89c20, size: 456
_intel_fast_memcpy dest: 7ffd589eb308, src: 7f85093bac08, size: 336
_intel_fast_memcpy dest: 1a5576000, src: 2b1115000, size: 8192
_intel_fast_memcpy dest: ee6b8000, src: 2b1117000, size: 8192
_intel_fast_memcpy dest: 12eb54000, src: 2b1119000, size: 8192
_intel_fast_memcpy dest: 200c56000, src: 2b111b000, size: 8192
_intel_fast_memcpy dest: 1a570e000, src: 2b111d000, size: 8192
$2 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>

Here we see kcbztek_trace_again, and we see memcpy executed 5 times with a size of 8192. These are the blocks copied from the table scan read buffer! The buffer we are going to look at is 0x1a5576000. In order to examine the block, we are going to print 8192 bytes starting from 0x1a5576000:

(gdb) x/8192bx 0x1a5576000
0x1a5576000:	0x06	0xa2	0x00	0x00	0x8b	0x03	0x00	0x00
0x1a5576008:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x01	0x16
0x1a5576010:	0x3a	0xf2	0x00	0x00	0x30	0x7f	0xf4	0x80
0x1a5576018:	0x5f	0xf1	0xd5	0xe9	0xd0	0x5a	0xeb	0x59
0x1a5576020:	0x5c	0x7c	0x90	0x87	0x21	0x46	0x44	0xf3
0x1a5576028:	0x66	0xa2	0x6a	0x3d	0xd4	0x97	0x6d	0x1e
0x1a5576030:	0xab	0xf8	0x06	0xd4	0x17	0x5c	0xc9	0x1a
0x1a5576038:	0xe6	0x86	0xf4	0x5a	0x20	0xbe	0x8d	0x22
0x1a5576040:	0x0c	0x2a	0x29	0x2b	0xa1	0x2a	0x3b	0x86
0x1a5576048:	0x4b	0x31	0x13	0xa7	0x4c	0xcd	0x4c	0xa4
0x1a5576050:	0xf1	0x1a	0xa3	0x59	0xf0	0x70	0x31	0xd5
0x1a5577fb8:	0xf8	0x24	0x90	0x74	0xa3	0x23	0x65	0xf6
0x1a5577fc0:	0xee	0x76	0x6c	0xf5	0x8b	0xaf	0x23	0x7e
0x1a5577fc8:	0xc4	0x5b	0xbd	0x5f	0x26	0xb8	0x5e	0x3b
0x1a5577fd0:	0x65	0x96	0x48	0x88	0xb3	0x37	0x10	0x15
0x1a5577fd8:	0x3f	0xd8	0x62	0x54	0x4f	0xa3	0xf3	0x60
0x1a5577fe0:	0xfc	0x2e	0x9d	0xf4	0xc6	0x49	0xbc	0x4f
0x1a5577fe8:	0x85	0x3b	0x8a	0xd1	0xa3	0xc8	0x61	0x72
0x1a5577ff0:	0xfd	0xb8	0xe1	0x2e	0x65	0x92	0x0d	0x70
0x1a5577ff8:	0x50	0xd8	0x34	0x06	0x01	0x06	0xf4	0x9a

I can’t tell from looking at the hexadecimal numbers if this block is encrypted or not. However, this block is supposed to have a single row containing ‘AAAAAAAAAA’, which means there is supposed to be a lot of empty space. I can’t find a repeating value, and the entire block contains values. This might be the result of the block being encrypted. Now, if my assumption is correct, if we continue and then when execution breaks on kcbchk_ctx, the same buffer should look entirely different, because it then has been decrypted by kgce_dec:

(gdb) c
_intel_fast_memcpy dest: 7ffd589eb3f0, src: 2a5e2ced8, size: 240
_intel_fast_memcpy dest: 7ffd589ea860, src: e7ef774, size: 15
_intel_fast_memcpy dest: 7ffd589eaa80, src: 7ffd589eb3c8, size: 16
_intel_fast_memcpy dest: 1a5576014, src: 7f8508ccc908, size: 8168
$3 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
(gdb) x/8192bx 0x1a5576000
0x1a5576000:	0x06	0xa2	0x00	0x00	0x8b	0x03	0x00	0x00
0x1a5576008:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x01	0x06
0x1a5576010:	0xdd	0x46	0x00	0x00	0x01	0x00	0x00	0x00
0x1a5576018:	0xc3	0x7b	0x01	0x00	0xed	0x9a	0x4e	0x00
0x1a5576020:	0x00	0x00	0x00	0x00	0x02	0x00	0x32	0x00
0x1a5576028:	0x88	0x03	0x00	0x00	0x07	0x00	0x1d	0x00
0x1a5576030:	0xd0	0x01	0x00	0x00	0x87	0x09	0x00	0x00
0x1a5576038:	0x90	0x03	0x27	0x00	0x01	0x20	0x00	0x00
0x1a5576040:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x00	0x00
0x1a5576048:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576050:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576058:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576060:	0x00	0x00	0x00	0x00	0x00	0x01	0x01	0x00
0x1a5576068:	0xff	0xff	0x14	0x00	0x8a	0x1f	0x76	0x1f
0x1a5576070:	0x76	0x1f	0x00	0x00	0x01	0x00	0x8a	0x1f
0x1a5576078:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576080:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576088:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fa8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fb0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fb8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fc0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fc8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fd0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fd8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fe0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fe8:	0x00	0x00	0x00	0x00	0x00	0x00	0x2c	0x01
0x1a5577ff0:	0x01	0x0a	0x41	0x41	0x41	0x41	0x41	0x41
0x1a5577ff8:	0x41	0x41	0x41	0x41	0x01	0x06	0xf4	0x9a

After continuing, we see a couple of memcpy’s, after which the debugger breaks execution when kcbchk_ctx is called. We execute the examine command again, and now the block looks way different. Way more zero’d locations, the place of the different headers is visible because of the zero’s in between, and starting from 0x1a5576078 it’s all zero until 0x1a5577fe8. We see a repeating value 0x41. Is that my ‘A’?

(gdb) p (char) 0x41
$2 = 65 'A'

Yes! These are the 10 A’s!

So, this means that before we enter the kcbchk_ctx function inside kcbz_encdec_tbsblk the block is actually already decrypted, and kcbchk_ctx performs something that takes a lot of the time, which seems to be checking, based on the function naming. Because the block is already decrypted when we enter kcbchk_ctx, it seems to me that the checking step is technically optional.

Further background analysis
I stated that with tablespace encryption the block is decrypted when read into the cache, that non-data blocks are not encrypted and that once the block is decrypted in the cache, it can be read without any encryption overhead because it is not encrypted (the block being decrypted in the cache is actually documented by Oracle). This actually can be proven with a debugger macro:

break pread64
    print $rip
break kgce_dec
    print $rip
break kcbchk_ctx
    print $rip
break kcbz_encdec_tbsblk
    print $rip
break kcbztek_trace_blk
    print $rip
break kdr4chk
    print $rip
break kcbgtcr
    print $rip

If you flush the buffer cache, and then attach to a foreground session with gdb and source the above macro, you get the following output:

$55 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$56 = (void (*)()) 0x7fd2c72b8100 <pread64>
$57 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$58 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$59 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$60 = (void (*)()) 0x7fd2c72b8100 <pread64>
$61 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$62 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$63 = (void (*)()) 0x3923750 <kgce_dec>
$64 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$65 = (void (*)()) 0xcdd26a0 <kdr4chk>
$66 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$67 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$68 = (void (*)()) 0x3923750 <kgce_dec>
$69 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$70 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$71 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$72 = (void (*)()) 0x3923750 <kgce_dec>
$73 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$74 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$75 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$76 = (void (*)()) 0x3923750 <kgce_dec>
$77 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$78 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$79 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$80 = (void (*)()) 0x3923750 <kgce_dec>
$81 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$82 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$83 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$84 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$85 = (void (*)()) 0xc9d9a20 <kcbgtcr>

Rows 1-4: First we see kcbgtcr (kernel cache buffers get consistent read); as I have made clear in the past, all (normal) reads in the oracle databases start off as a logical read; but since we just flushed the cache, the segment header needs to be read from disk, which explains the pread64. After the single block read, the function kcbztek_trace_blk is executed, but because this is a non-data block, no encryption is applied to this block, so there is no decryption to be done.
Rows 5-11: Then the multiblock read is initiated, which is executed via a pread64. After the multibock read, the blocks are put in the buffer cache. For the first block, we see the kcbztek_trace_blk function being executed, and because this block is encrypted, the function kcbz_encdec_tbsblk is invoked, and then the actual decryption in the function kgce_dec. After performing decryption, the apparent checking function kcbchk_ctx is called, and inside that function kdr4chk is executed (which checks an individual row, we got one row).
Rows 12-15,16-19,20-23,24-27: The other blocks that are read do not contain any data, however since they are data blocks, you see kcbztek_trace_blk which finds the encryption flag, after which kcbz_encdec_tbsblk is called, which calls kgce_dec, and then kcbchk_ctx. kdr4chk is not called, because there are no rows.

However, what is interesting is what will be called when we execute the same full table scan again over the 5 blocks now that they are in the buffer cache:

$86 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$87 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$88 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$89 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$90 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$91 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$92 = (void (*)()) 0xc9d9a20 <kcbgtcr>

Exactly how this is documented it would work! The foreground process now can just use the consistent read mechanism to read the blocks in the cache, there is no encryption codepath called. Please mind that when you change blocks in the cache, the changing does not get any additional overhead too, because it works 100% the same as a non-encrypted block. However, what does get a performance penalty is writing the blocks to disk (!). The database writer needs to encrypt the blocks before it can store them on disk! This means that DML intensive databases that switch to tablespace encryption, might need a bigger buffercache because there is more time and resources needed to write them to disk, which might lead to more dirty buffers in the cache, and thus lesser buffers available for other usage.

Also mind that encryption functions of tablespace encryption get called as part of the IO codepath:

#0  0x00000000071d5320 in kcbz_encdec_tbsblk ()
#1  0x00000000012e8e7c in kcbzibmlt ()
#2  0x00000000012dd9d3 in kcbzib ()
#3  0x000000000c9dbda9 in kcbgtcr ()
#4  0x000000000c9c0ac5 in ktrget2 ()
#5  0x000000000ccbc786 in kdstgr ()
#6  0x000000000cd0a0e9 in qertbFetch ()
#7  0x000000000cb6e42b in opifch2 ()
#8  0x00000000021e49f3 in kpoal8 ()
#9  0x000000000cb75a5d in opiodr ()
#10 0x000000000cda421b in ttcpip ()
#11 0x0000000001baba3c in opitsk ()
#12 0x0000000001bb0401 in opiino ()
#13 0x000000000cb75a5d in opiodr ()
#14 0x0000000001ba7b8b in opidrv ()
#15 0x00000000026b31b1 in sou2o ()
#16 0x0000000000bbda7a in opimai_real ()
#17 0x00000000026bd8fc in ssthrdmain ()
#18 0x0000000000bbd9bc in main ()

1. This is the master encryption routine.
2. This is the kernel cache space management insert block multiblock routine.
3. This is the kernel cache space management insert block routine.
4. This is the consistent read call kcbgtcr.
5. This is the kernel transaction layer buffer get routine.
6. This is the kernel data scan table get rows (full table scan).
7. This is the SQL execution rowsource tablescan fetch routine.
In other words: the decryption is in the IO codepath, or: is executed as part of the IO routines.

Update april 4, 2016:
Further investigation showed the checking (entering the kcbchk_ctx function) CAN be eliminated when the undocumented parameter “_disable_block_checking” is set to FALSE.
There is a bug logged for this issue: 12943305 – decryption performance can be improved, and has a fix. This is reported to be fixed in 12.2, and a patches available for the different PSUs.


Recently I was applying the data dictionary part from an (exadata bundle) patch and ran into the following errors:

ORA-24324: service handle not initialized
ORA-24323: value not allowed
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1001 (oinstall), current egid = 1002 (dba)

This was very weird, I had just started the instance using sqlplus. This is a database that is normally started by Oracle clusterware, but for the sake of quickly patching the database, I started it manually. Another reason I had done it that way is the (infamous) OJVM patch was part of the patching, which for the version I applied needs the database in upgrade mode, which means I had to turn the cluster_database parameter to false temporarily.

Back to the story: I ran datapatch on my manually (sqlplus) started instance, and gotten the above mentioned error. The quick solution was to stop the instance (killing the pmon process will do that), and start the instance again. A thorough look through the alert.log file and trace files generated during startup did not show any error anymore. This issue showed up after successful startup previously, so I kept a close eye on the instance for some time, but it didn’t appeared again. Problem solved, but what did happen?

When looking at the error messages, there are two lines mentioning ‘egid’ which means effective group id. Also, two values for groups are mentioned: 1001, which is the oinstall group, and 1002, which is the dba group. The last line is even more clear actually, it says the startup group id is oinstall, but the current group id dba.

The line ‘ORA-27140: attach to post/wait facility failed’ is actually the root cause. The line ‘ORA-27302: failure occurred at: skgpwinit6’ tells the exact function, and skgpwinit6 probably can be deciphered as ‘System Kernel Generic PostWait INITialisation’. But what does ‘post/wait’ mean? My Oracle Support has a nice description in note ‘TECH: Unix Semaphores and Shared Memory Explained (Doc ID 15566.1)’. Essentially, post/wait is the Oracle side of using the operating system system V semaphore facility. The issue here is the group id set for the semaphores does not align with the group id for this oracle database server process, and is rejected by the operating system (line ‘ORA-27300: OS system dependent operation:invalid_egid failed with status: 1’); linux error 1 is (errno.h) EPERM 1 /* Operation not permitted */, so declined because of permissions.

The next question obviously is: how did this happen? The instance was started by me, in the same linux session, and then running something else that connects to the instance (datapatch) suddenly errors out.

This has to do with Oracle clusterware, ASM disk devices and (potential) role separation. In order for the Oracle database server processes to be able to access and use the local ASM disk devices, it must have the group id set of the ASM disk devices. When role separation is setup, which means the clusterware uses a different user id than the database software, the disk devices have ownership to the clusterware user id, and the group id will be the group set as OSASM during installation, which tends to be set as ‘asmadmin’ in such cases.

Because the disk devices need to have permissions set to 660 (read and write for both the owner and the group), the group set with the devices can be used to use the devices. With role separation, the user id of the database processes is different from the clusterware processes, so for the database processes to be able to use the disk devices, it needs to have membership of the set OSASM group.

Now, the actual root case of this issue is, for the clusterware to make sure the databases can startup using the disk devices, it will set the OSASM group as the group id for the database oracle executable (in the database home!) whenever clusterware is invoked to startup an instance. Because the oracle executable has SUID and SGID bits set (rwsr-s–x), this could mean the group of newly created processes suddenly changes from the previous group id to the OSASM group if clusterware is used to startup an instance from the same home a database instance has been started up earlier without clusterware.

How is the group set on the oracle executable of the database home?
Some simple testing shows the current primary group of the user performing linking of the oracle executable is used as the group of the oracle executable (# means executed as root, $ means executed as oracle):

# groupadd test_group
# id oracle
uid=54321(oracle) gid=54321(oinstall) groups=54321(oinstall),54322(dba)
# user mod -a -G test_group
# id oracle
uid=54321(oracle) gid=54321(oinstall) groups=54321(oinstall),54322(dba),54323(test_group)
# su - oracle
$ . oraenv
ORACLE_SID = [oracle] ? o12102
The Oracle base has been set to /u01/app/oracle
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ioracle
$ ls -ls /u01/app/oracle/product/
316476 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 11:27 /u01/app/oracle/product/
$ exit
# usermod -g test_group -G dba,oinstall oracle
# su - oracle
$ id
uid=54321(oracle) gid=54323(test_group) groups=54323(test_group),54321(oinstall),54322(dba)
$ . oraenv
ORACLE_SID = [oracle] ? o12102
The Oracle base has been set to /u01/app/oracle
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ioracle
$ ls -ls /u01/app/oracle/product/
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 /u01/app/oracle/product/

I first used my current settings, which shows oinstall as primary group, and when I relink the oracle executable the group set with the oracle executable is oinstall. This is shown in lines 5-15. Next, I switch the primary group of the oracle user to test_group and execute linking again. Now the group of the oracle executable is test_group.

If I startup a database with the current settings, it will create semaphores with test_group as the current/effective group:

$ sqlplus / as sysdba
SQL> startup
Database opened.
SQL> exit
$ sysresv -l o12102

IPC Resources for ORACLE_SID "o12102" :
557059  	0x017b8a6c
Oracle Instance alive for sid "o12102"
$ ipcs -si 557059

Semaphore Array semid=557059
uid=54321	 gid=54321	 cuid=54321	 cgid=54323
mode=0640, access_perms=0640
nsems = 104
otime = Wed Mar 15 12:50:51 2017
ctime = Wed Mar 15 12:50:51 2017
semnum     value      ncount     zcount     pid
0          0          0          0          17097
1          4869       0          0          17097
2          10236      0          0          17097
3          32760      0          0          17097

Lines 1-4: startup the instance o12102 using sqlplus.
Lines 5-13: use the sysresv utility to find the semaphore array that the instance o12102 is using. The semaphore array id is 557059.
Lines 14-26: the current group id of the semaphore array of the instance o12102 is 54323 (cgid).

Before we go on and involve clusterware, let’s replay the scenario (changing the group of the oracle executable) manually outside of clusterware to see if we can get the same behaviour:

$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ chgrp oinstall oracle
$ chmod 6751 oracle
 ls -ls oracle
316472 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 12:06 oracle

Please mind the database currently is running (otherwise there would be no semaphore array above).
In order to change the group correctly, not only the group needs to be reset (line 3), but also the SUID and SGID bits must be set again, these are lost when the group is changed. Setting the SUID and SGID bits is done in line 4.

Now try to logon again as sysdba:

$ sqlplus / as sysdba

SQL*Plus: Release Production on Wed Mar 15 13:02:25 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


That actually succeeds! This is because principal access is arranged by membership of the OSDBA group (which I have set to dba, of which the oracle user is a member). However, if I try to shutdown the instance, I get the messages regarding post/wait:

SQL> shutdown immediate
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 54323 (test_group), current
egid = 54321 (oinstall)

Can I solve this issue? Yes, quite simply by changing the group of the oracle executable back to the group the database was startup (plus the SUID and SGID bits, obviously):

$ chgrp test_group oracle
$ chmod 6751 oracle
$ sqlplus / as sysdba
SQL> select * from dual;


Voila! It’s corrected again.

Now let’s keep our database group setting in mind (test_group), and involve clusterware. Because I got the o12102 instance already started, let’s see when the group of the database oracle executable changes, because the OSASM group of clusterware is set to oinstall:

$ srvctl status database -d o12102
Database is running.
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ srvctl stop database -d o12102
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ srvctl start database -d o12102
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 12:06 oracle

So, it is really only when an instance is started using clusterware which changes the group to the OSASM group of the oracle executable in the database home.

The group as set by clusterware can be changed by:
– unlocking the clusterware home ($ORACLE_HOME/crs/install/ -unlock ( for SIHA) as root.
– changing the $ORACLE_HOME/rdbms/lib/config.c entries for .Lasm_string: .string “GROUP HERE” and #define SS_ASM_GRP “GROUP HERE” in the clusterware home, and then relink (make -f $ORACLE_HOME/rdbms/lib/config.o; relink all) as the owner of clusterware.
– lock the clusterware home again ($ORACLE_HOME/crs/install/ -patch ( for SIHA) as root.
==> Please mind that group for the ASM devices needs to be changed accordingly if you change the ASM group.

Recently, I was trying to setup TDE. Doing that I found out the Oracle provided documentation isn’t overly clear, and there is a way to do it in pre-Oracle 12, which is done using ‘alter system’ commands, and a new-ish way to do it in Oracle 12, using ‘administer key management’ commands. I am using version, so decided to use the ‘administer key management’ commands. This blogpost is about an exception which I see is encountered in the Januari 2017 (170117) PSU of the Oracle database, which is NOT happening in Oracle 12.2 (no PSU’s for Oracle 12.2 at the time of writing) and Oracle April 2016 and October 2016 PSU’s.

In order to test the wallet functionality for TDE, I used the following commands:

SQL> select status, wrl_parameter from v$encryption_wallet;


SQL> !mkdir /u01/app/oracle/admin/test/wallet

SQL> administer key management create keystore '/u01/app/oracle/admin/test/wallet' identified by "this_is_the_keystore_password";

keystore altered.

SQL> administer key management set keystore open identified by "this_is_the_keystore_password";

keystore altered.

SQL> administer key management set key identified by "this_is_the_keystore_password" with backup;
administer key management set key identified by "this_is_the_keystore_password" with backup
ERROR at line 1:
ORA-28374: typed master key not found in wallet

SQL> select status, wrl_parameter from v$encryption_wallet;


SQL> administer key management set keystore open identified by "this_is_the_keystore_password";

keystore altered.

SQL> select status, wrl_parameter from v$encryption_wallet;


Line 1-10: The DB_UNIQUE_NAME of the instance is ‘test’, and therefore the default wallet location is /u01/app/oracle/admin/test/wallet (ORACLE_BASE/admin/DB_UNIQUE_NAME/wallet). The wallet directory doesn’t exist by default, so I created it (line 10).
Line 12: Here the keystore/wallet is created with a password.
Line 16: After the wallet is created without auto-login, the wallet must be opened using the ‘set keystore open’ command.
Line 20: After the wallet has been created, it does not contain a master key. This is done using the ‘set key’ command. However, this throws an ORA-28374 error.
Line 26: After an error involving the wallet has occurred, the wallet closes.
Line 35: The wallet can simply be opened using the earlier used ‘set keystore open’ command.
Line 39: This is where the surprise is: after opening, the master key “magically” appeared (visible by the status ‘OPEN’, without a master key this would be ‘OPEN_NO_MASTER_KEY’).

I yet have to start creating encrypted table spaces. There might be more surprises, I can’t tell at this moment because I didn’t try it. However, once I discovered this oddity, I talked to my colleague Matt who gave me his own runbook for enabling TDE, which turned out to be the exact same list of commands as I compiled, however he did not encounter the ORA-28374 which I did. I tested the same sequence of commands on, (October 2016) and (April 2016) and there the ORA-28374 was not raised during execution of the ‘set key’ command.

Reading through My Oracle Support note Master Note For Transparent Data Encryption ( TDE ) (Doc ID 1228046.1), I found the following text:

All the versions after


As of If the key associated with the SYSTEM, SYSAUX or UNDO tablespaces is not present in the wallet you cannot associate a new master key with the database (i.e. you cannot activate that master key for the database) unless you set a hidden parameter :

SQL> administer key management use key ‘AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA’ identified by “welcome1” with backup;
administer key management use key ‘AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA’ identified by “welcome1” with backup
ERROR at line 1:
ORA-28374: typed master key not found in wallet

alter system set “_db_discard_lost_masterkey”=true;

SQL> administer key management use key ‘AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA’ identified by “welcome1” with backup;

The heading and first line read weird, the heading indicates the paragraph is about ‘all the versions after’ (which to me means 12.2), and the first line in the paragraph says ‘as of’, which very clearly says this is about version and higher. However, a little further it shows the exact error (ORA-28374) I encountered, and explains that if a current key is used in the data dictionary (mind data dictionary, not wallet), you must set “_db_discard_lost_masterkey” to true before you can create and use another master key for a wallet if you start over (wipe or move the wallet directory).

This makes sense to me now! I tried dropping and creating new wallets in my current 170117 PSU instance, and only tried creating an encryption wallet in a brand new freshly created instance. So if I would have EXACTLY done the same in the instances with the other PSU’s, which is repeatedly create and drop a wallet for TDE, I would have encountered the same ORA-28374 error. Well…I see this as a safety mechanism, be it not a very obvious one, not exuberant documented, and probably causing more grief than it would save if you run into the need the change the master key.

When sifting through a sql_trace file from Oracle version 12.2, I noticed a new wait event: ‘PGA memory operation’:

WAIT #0x7ff225353470: nam='PGA memory operation' ela= 16 p1=131072 p2=0 p3=0 obj#=484 tim=15648003957

The current documentation has no description for it. Let’s see what V$EVENT_NAME says:

SQL> select event#, name, parameter1, parameter2, parameter3, wait_class 
  2  from v$event_name where name = 'PGA memory operation';

------ ------------------------------------- ---------- ---------- ---------- ---------------
   524 PGA memory operation                                                   Other

Well, that doesn’t help…

Let’s look a bit deeper then, if Oracle provides no clue. Let’s start with the strace and sql_trace combination. For the test, I am doing a direct path full table scan on a table. Such a scan must allocate a buffer for the results (direct path reads do not go into the buffercache, table contents are scanned to the PGA and processed from there).

TS@fv122b2 > alter session set events 'sql_trace level 8';

Session altered.

Now use strace to look at the system calls in another session:

# strace -e write=all -e all -p 9426
Process 9426 attached

Now execute ‘select count(*) from t2’. The output is rather verbose, but the important bits are:

io_submit(140031772176384, 1, {{data:0x7f5ba941ffc0, pread, filedes:257, buf:0x7f5ba91cc000, nbytes:106496, offset:183590912}}) = 1
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000
lseek(7, 0, SEEK_CUR)                   = 164639
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 112) = 112
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 37 38  30 20 70 31 3d 32 30 39   ela= 780 p1=209 |
 | 00040  37 31 35 32 20 70 32 3d  31 31 31 34 31 31 32 20  7152 p2=1114112  |
 | 00050  70 33 3d 30 20 6f 62 6a  23 3d 32 32 38 33 33 20  p3=0 obj#=22833  |
 | 00060  74 69 6d 3d 31 39 35 31  37 30 32 30 35 36 36 0a  tim=19517020566. |
munmap(0x7f5ba8fbd000, 2097152)         = 0
munmap(0x7f5ba91bd000, 2097152)         = 0
mmap(0x7f5ba949d000, 65536, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x2ce000) = 0x7f5ba949d000
lseek(7, 0, SEEK_CUR)                   = 183409
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 100) = 100
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 35 39  32 20 70 31 3d 30 20 70   ela= 592 p1=0 p |
 | 00040  32 3d 30 20 70 33 3d 30  20 6f 62 6a 23 3d 32 32  2=0 p3=0 obj#=22 |
 | 00050  38 33 33 20 74 69 6d 3d  31 39 35 32 30 36 33 33  833 tim=19520633 |
 | 00060  36 37 34 0a                                       674.             |

Okay, we can definitely say the mmap() and munmap() system calls seem to be related, which makes sense if you look a the name of the wait event. Let’s look a bit more specific using a systemtap script:

global wait_event_nr=524
probe begin {

probe process("/u01/app/oracle/product/").function("kskthbwt") {
	if ( pid() == target() && register("rdx") == wait_event_nr )
		printf("kskthbwt - %d\n", register("rdx"))
probe process("/u01/app/oracle/product/").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr )
		printf("kskthewt - %d\n", register("rsi"))
probe syscall.mmap2 {
	if ( pid() == target() )
		printf(" mmap, addr %x, size %d, protection %d, flags %d, fd %i, offset %d ", u64_arg(1), u64_arg(2), int_arg(3), int_arg(4), s32_arg(5), u64_arg(6))
probe syscall.mmap2.return {
	if ( pid() == target() )
		printf("return value: %x\n", $return)
probe syscall.munmap {
	if ( pid() == target() )
		printf(" munmap, addr %x, size %d\n", u64_arg(1), u64_arg(2))

Short description of this systemtap script:
Lines 6-9: This probe is triggered once the function kskthbwt is called. This is one of the functions which are executed when the wait interface is called. The if function on line 7 checks if the process specified with -x with the systemtap executable is the process calling this function, and if the register rdx contains the wait event number. This way all other waits are discarded. If the wait event is equal to wait_event_nr, which is set to the wait event number 524, which is ‘PGA memory operation’, the printf() function prints kskthbwt and the wait event number. This is simply to indicate the wait has started.
Lines 10-13: This probe does exactly the same as the previous probe, except the function is kskthewt, which is one of the functions called when the ending of a wait event is triggered.
Line 14-17: This is a probe that is triggered when the mmap2() system call is called. Linux actually uses the second version of the mmap call. Any call to mmap() is silently executed as mmap2(). Inside the probe, the correct process is selected, and the next line simply prints “mmap” and the arguments of mmap, which I picked from the CPU registers. I do not print a newline.
Line 18-21: This is a return probe of the mmap2() system call. The function of this probe is to pick up the return code of the system call. For mmap2(), the return code is the address of the memory area mapped by the kernel for the mmap2() call.
Line 22-25: This is a probe on munmap() system call, which frees mmap’ed memory to the operating system.
Please mind there are no accolades following the if statements, which means the code executed when the if is true is one line following the if. Systemtap and C are not indention sensitive (like python), I indented for the sake of clarity.

I ran the above systemtap script against my user session and did a ‘select count(*) from t2’ again:

# stap -x 9426 mmap.stp
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 750 return value: 7f5ba91bd000
 mmap, addr 7f5ba91bd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba91bd000
kskthewt - 524
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1262 return value: 7f5ba8fbd000
 mmap, addr 7f5ba8fbd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba8fbd000
kskthewt - 524
kskthbwt - 524
 munmap, addr 7f5ba8fbd000, size 2097152
 munmap, addr 7f5ba91bd000, size 2097152
kskthewt - 524

This makes it quite clear! The event ‘PGA memory operation’ is called when mmap() and munmap() are called. Which are calls to allocate and free memory for a process. The file descriptor (fd) value is set to -1, which means no file is mapped, but anonymous memory.

Another interesting thing is shown: first mmap is called with no address given, which makes the kernel pick a memory location. This memory location is then used for a second mmap call at the same memory address. The obvious question for this is: why mmap two times?

To answer that, we need to look at the flags of the two calls. Here is an example:

mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000

The first mmap call asks the kernel for a chunk of memory. PROT_READ and PROT_WRITE mean the memory should allow reading and writing. MAP_PRIVATE means it’s not public/shared, which is logical for Oracle PGA memory. MAP_ANONYMOUS means the memory allocation is not backed by a file, so just an allocation of contiguous memory. MAP_NORESERVE means no swap space is reserved for the allocation. This means this first mapping is essentially just a reservation of the memory range, no physical memory pages are allocated.

The next mmap call maps inside the memory allocated with the first mmap call. This seems strange at first. If you look closely at the flags, you see that MAP_NORESERVE is swapped for MAP_FIXED. The reason for this strategy to make it easier for the Oracle database to allocate the memory allocations inside a contiguous chunk of (virtual) memory.

The first mmap call allocates a contiguous (virtual) memory area, which is really only a reservation of a memory range. No memory is truly allocated, hence MAP_NORESERVE. However, it does guarantee the memory region to be available. The next mmap allocates a portion of the allocated range. There is no MAP_NORESERVE which means this allocation is catered for for swapping in the case of memory shortage. This mapping does use a specific address, so Oracle can use pointers to refer to the contents, because it is certain of the memory address. Also, the MAP_FIXED flag has a side effect, which is used here: any memory mapping done to the address range is silently unmapped from the first (“throw away”) mapping.

Let’s look a bit deeper into the wait event information. For this I changed the probe for function kskthewt in the systemtap script in the following way:

probe process("/u01/app/oracle/product/").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr ) {
		ksuse = register("r13")-4672
		ksuseopc = user_uint16(ksuse + 2098)
		ksusep1 = user_uint64(ksuse + 2104)
		ksusep2 = user_uint64(ksuse + 2112)
		ksusep3 = user_uint64(ksuse + 2120)
		ksusetim = user_uint32(ksuse + 2128)
		printf("kskthewt - wait event#: %u, wait_time:%u, p1:%lu, p2:%lu, p3:%lu\n", ksuseopc, ksusetim, ksusep1, ksusep2, ksusep3)

When running a ‘select count(*) from t2’ again on a freshly started database with a new process with the changed mmap.stp script, this is how the output looks like:

kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 753 return value: 7f1562330000
 mmap, addr 7f1562330000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562330000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:1114112, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:28, p1:2097152, p2:1114112, p3:0

This looks like the size of memory allocated with the first mmap call for the PGA memory reservation is put in p1, and the size of the allocation of the second “real” memory allocation is put in p2 of the ‘PGA memory operation’ event. One thing that does look weird, is the memory is not unmapped/deallocated (this is a full execution of a SQL, allocated buffers must be deallocated?

Let’s look what happens when I execute the same SQL again:

kskthbwt - 524
 munmap, addr 7f1562130000, size 2097152
 mmap, addr 7f15623b0000, size 589824, protection 0, flags 16434, fd -1, offset 881 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:253, p1:0, p2:0, p3:0
kskthbwt - 524
 mmap, addr 7f15623b0000, size 589824, protection 3, flags 50, fd -1, offset 0 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:35, p1:589824, p2:0, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:0, p3:0

Ah! It looks like some memory housekeeping is not done during the previous execution, but is left for the next execution, the execution starts with munmap(), followed by a mmap() call. The first munmap() call deallocates 2 megabyte memory chunk. The next mmap() call is different from the other mmap() calls we have seen so far; we have seen a “throw away”/reservation mmap() call with the memory address set to 0 to let the operating system pick an address for the requested memory chunk, and a mmap() call to truly allocate the reserved memory for usage, which had a memory address set. The mmap() call following munmap() has a memory address set. However, protection is set to 0; this means PROT_NONE, which means the mapped memory can not be read and written. Also the flags number is different, flags 16434 translates to MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE. As part of releasing PGA memory, it seems some memory is reserved. The wait event parameters are all zero. When p1, p2 and p3 are all zero, it seems to indicate munmap() is called. As we just have seen, memory could be reserved. Also, when p1/2/3 are all zero there is no way to tell how much memory is freed, nor which memory allocation.

The next wait is the timing of a single mmap() call. Actually, the mmap() call allocates the previous mmaped memory, but now with protection set to 3 (PROT_READ|PROT_WRITE), which means the memory is actually usable. The p1 value is the amount of memory mmaped.

The last wait is a familiar one, it is the mmap() call with memory address set to zero, as reservation, and another mmap() call to allocate memory inside the previous “reserved” memory. However, the p1/2/4 values are now NOT set in the same way as we saw earlier: only p1 is non zero, indicating the size of the first mmap() call. Previously, p1 and p2 were set to the sizes of both mmap() calls.

With Oracle version 12.2 there is a new wait event ‘PGA memory operation’. This event indicates memory is allocated or de-allocated. Until now I only saw the system calls mmap() and munmap() inside the ‘PGA memory operation’.

To me, ‘cloud computing’ is renting a compute resource to perform a task. In order to use that compute resource, you need to instruct it to do something, which is typically done via the network. If the task the compute resource needs to fulfil is being an application server or being a client or both in the case of an application server that uses an Oracle database, the network latency between the client of the database and the database server is a critical property.

I think so far everybody is with me. If we zoom in to the network, it becomes more difficult, and *very* easy to make wrong assumptions. Let me explain. A network, but really any connection between processing and a resource, has two DIFFERENT properties that I see getting mixed up consistently. These are:
* Latency: the time it takes for a signal or (network) packet to travel from the client to the server, or the time it takes to travel from the client to the server and back.
* Bandwidth: the amount of data that can be transported from the client to the server in a certain time.

How do you determine the latency of a network? Probably the most people respond with ‘use ping’. This is how that looks like:

[user@oid1 ~]$ ping -c 3 lsh1
PING lsh1 (x.x.x.x) 56(84) bytes of data.
64 bytes from lsh1: icmp_seq=1 ttl=62 time=680 ms
64 bytes from lsh1: icmp_seq=2 ttl=62 time=0.304 ms
64 bytes from lsh1: icmp_seq=3 ttl=62 time=0.286 ms

The question I often ask myself is: what is that we see actually? How does this work?
In order to answer that question, the tcpdump tool can answer that question. Using tcpdump, you can capture the network packets on which the ping utility based the above outcome. The ‘-ttt’ option calculates the time between each arrived packet:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host lsh1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1 > lsh1: ICMP echo request, id 35879, seq 1, length 64
00:00:00.680289 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 1, length 64
00:00:00.319614 IP oid1 > lsh1: ICMP echo request, id 35879, seq 2, length 64
00:00:00.000287 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 2, length 64
00:00:01.000180 IP oid1 > lsh1: ICMP echo request, id 35879, seq 3, length 64
00:00:00.000269 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 3, length 64

So, ping works by sending a packet (ICMP echo request) requesting a reply (ICMP echo reply) from the remote server, and measure the time it takes to get that reply. Great, quite simple, isn’t it? However, the biggest issue I see this is using a protocol that is not used for sending regular data (!). Most application servers I encounter send data using TCP (transmission control protocol), the traffic ping sends are sent using a protocol called ICMP (internet control message protocol). Especially in the cloud, which means (probably) a lot of the infrastructure is shared, ICMP might be given different priority than TCP traffic, which you quite probably are using when the application on your cloud virtual machine is running. For those of you who haven’t looked into the network side of the IT landscape, you can priorise protocols and even specific ports, throttle traffic and you can even terminate it. In fact, a sensible protected (virtual) machine in the cloud will not respond to ICMP echo requests in order to protected it from attacks.

So, what would be a more sensible approach then? A better way would be to use the same protocol and port number that your application is going to use. This can be done using a tool called hping. Using that tool, you can craft your own packet with the protocol and flags you want. In the case of Oracle database traffic that would be the TCP protocol, port 1521 (it can be any port number, 1521 is the default port). This is how you can do that. In order to mimic starting a connection, the S (SYN) flag is set (-S), one packet is send (-c 1) to port 1521 (-p 1521).

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip

What this does is best investigated with tcpdump once again. The server this is executed against can respond in two ways (three actually). When you send this to TCP port 1521 where a listener (or any other daemon that listens on that port) is listening, this is the response:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [S], seq 1436552830, win 512, length 0
00:00:00.001229 IP db01-vip.ncube-lm > oid1.kjtsiteserver: Flags [S.], seq 2397022511, ack 1436552831, win 14600, options [mss 1460], length 0
00:00:00.000023 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [R], seq 1436552831, win 0, length 0

This is a variation of the classic TCP three way handshake:
1. A TCP packet is sent with the SYN flag set to indicate starting a (client to server) connection.
2. A TCP packet is sent back with SYN flag set to indicate starting a (server to client) connection, and the first packet is acknowledged.
3. This is where the variation is, normally an acknowledgement would be sent of the second packet to establish a two way connection, but in order to stop the communication a packet is sent with the RST (reset) flag set.

However, this is if a process is listening on the port. This is how that looks like when there is no process listening on port 1521:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.vsamredirector > db01.ncube-lm: Flags [S], seq 1975471906, win 512, length 0
00:00:00.001118 IP db01.ncube-lm > oid1.vsamredirector: Flags [R.], seq 0, ack 1975471907, win 0, length 0

This means that if a connection is initiated to a port on which no process is listening (port status ‘closed’), there is communication between the client and the server. This is why firewalls were invented!
1. A TCP packet is sent with the SYN flag set to indicate starting a connection.
2. A TCP packet is sent back to with the RST (reset) flag set to indicate no connection is possible.

The third option, when port 1521 is firewalled on the server, simply means only the first packet (from client to server with the SYN flag set) is sent and no response is coming back.

Okay, let’s pick up the performance aspect again. This hping command:

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip
HPING db01-vip (eth0 x.x.x.x): S set, 40 headers + 0 data bytes
len=44 ip=db01-vip ttl=57 DF id=0 sport=1521 flags=SA seq=0 win=14600 rtt=1.2 ms

Says the roundtrip time is 1.2ms. If we look at the network packets and timing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.mmcal > db01-vip.ncube-lm: Flags [S], seq 1289836562, win 512, length 0
00:00:00.001113 IP db01-vip.ncube-lm > oid1.mmcal: Flags [S.], seq 2504750542, ack 1289836563, win 14600, options [mss 1460], length 0
00:00:00.000016 IP oid1.mmcal > db01-vip.ncube-lm: Flags [R], seq 1289836563, win 0, length 0

It becomes apparent that the 1.2ms time hping reports is the time it takes for the remote server to send back the SYN+ACK package in the TCP three way handshake.

So does that mean that if we take a number of measurements (let’s say 100, or 1000) to have a statistically significant number of measurements we can establish my TCP roundtrip time and then know how fast my connection will be (outside of all the other variables inherent to the internet and potential noisy neighbours to name a few)?

Oracle provides a way to generate and measure SQL-Net traffic in My Oracle Support note: Measuring Network Capacity using oratcptest (Doc ID 2064368.1). This note provides a jar file which contains server and client software, and is aimed at dataguard, but is useful to measure SQL-Net network latency. I have looked at the packets oratcptest generates, and they mimic SQL-Net quite well.

Let’s see if we can redo the test above to measure pure network latency. First on the database server side, setup the server:

[user@db01m ~]$ java -jar oratcptest.jar -server db01 -port=1521

And then on the client side run the client using the same oratcptest jar file:

java -jar oratcptest.jar db01 -mode=sync -length=0 -duration=1s -interval=1s -port=1521

The important bits are -mode=sync (client packet must be acknowledged before sending another packet) and -length=0 (network traffic contains no payload). This is the result:

[Requesting a test]
	Message payload        = 0 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(07:34:42) The server is ready.
                        Throughput                 Latency
(07:34:43)          0.017 Mbytes/s                0.670 ms
(07:34:43) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.017 Mbytes/s
	             Avg. latency = 0.670 ms

If you look at the hping roundtrip time (1.2ms) and the oratcptest roundtrip time (0.7ms) clearly this is different! If you just look at the numbers (1.2 versus 0.7) it might seem like the oratcptest time is only measuring client to server traffic instead of the whole roundtrip? For this too it’s good to use tcpdump once again and look what oratcptest actually is doing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.63602 > db01.ncube-lm: Flags [S], seq 2408800085, win 17920, options [mss 8960,sackOK,TS val 3861246405 ecr 0,nop,wscale 7], length 0
00:00:00.001160 IP db01.ncube-lm > oid1.63602: Flags [S.], seq 2178995555, ack 2408800086, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:00:00.000015 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 1, win 140, length 0
00:00:00.023175 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 1:145, ack 1, win 140, length 144
00:00:00.000520 IP db01.ncube-lm > oid1.63602: Flags [.], ack 145, win 123, length 0
00:00:00.000951 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 1:145, ack 145, win 123, length 144
00:00:00.000008 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 145, win 149, length 0
00:00:00.018839 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 145:157, ack 145, win 149, length 12
00:00:00.000563 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 145:149, ack 157, win 123, length 4
00:00:00.000358 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 157:169, ack 149, win 149, length 12
00:00:00.000486 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 149:153, ack 169, win 123, length 4
00:00:00.000100 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 169:181, ack 153, win 149, length 12
00:00:00.000494 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 153:157, ack 181, win 123, length 4
00:00:00.000192 IP oid1.63586 > db01.ncube-lm: Flags [P.], seq 18181:18193, ack 6157, win 149, length 12
00:00:00.000447 IP db01.ncube-lm > oid1.63586: Flags [P.], seq 6157:6161, ack 18193, win 123, length 4
00:00:00.006696 IP oid1.63586 > db01.ncube-lm: Flags [F.], seq 18193, ack 6161, win 149, length 0
00:00:00.000995 IP db01.ncube-lm > oid1.63586: Flags [F.], seq 6161, ack 18194, win 123, length 0
00:00:00.000012 IP oid1.63586 > db01.ncube-lm: Flags [.], ack 6162, win 149, length 0

If you look at rows 4, 5 and 6 you see the typical TCP three-way handshake. What is nice to see, is that the actual response or roundtrip time for the packet from the server on line 5 actually took 1.1ms, which is what we have measured with hping! At lines 7-10 we see there is a packet send from the client to the server which is ACK’ed and a packet send from the server to the client which is ACK’ed. If you add the ‘-A’ flag to tcpdump you can get the values in the packet printed as characters, which shows the client telling the server how it wants to perform the test and the server responding with the requested settings. This is all a preparation for the test.

Starting from line 11, there is a strict repeating sequence of the client sending a packet of length 12, ACK’ing the previous received packet, and then the server responding with a packet of length 4 ACK’ing its previous received packet. This is the actual performance test! This means that the setting ‘-duration=1s -interval=1s’ does not mean it sends one packet, it actually means it’s continuously sending packets for the duration of 1 second. Also another flag is showing: the P or PSH (push) flag. This flag means the kernel/tcpip-stack understands all data to transmit is provided from ‘userland’, and now must be sent immediately, and instructs the receiving side to process it immediately in order to bring it to the receiving userland application as soon as possible too.

Lines 20-22 show how the connection is closed by sending a packet with a FIN flag, which is done for both the client to the server and the server to the client, and because it’s TCP, these need to be ACK’ed, which is why you see a trailing packet without a flag set, only ACK’ing the FIN packet.

The conclusion so far is that for real usable latency calculations you should not use a different protocol (so whilst ICMP (ping) does give an latency indication it should really only be used as an indicator), and that you should measure doing the actual work, not meta-transactions like the TCP three way handshake. Probably because of the PSH flag, the actual minimal latency for SQL-Net traffic is lower than ping and hping showed.

Wait a minute…did you notice the ‘actual minimal latency’? So far we only have been sending empty packets, which means we measured how fast a packet can travel from client to server and back. In reality, you probably want to send actual data back and forth, don’t you? That is something that we actually have not measured yet!

Let’s do actual Oracle transactions. For the sake of testing network latency, we can use Swingbench to execute SQL. This is how that is done:

[user@oid1 bin]$ cd ~/sw/swingbench/bin
[user@oid1 bin]$ ./charbench -c ../configs/stresstest.xml -u soe -p soe -uc 1 -rt 00:01
Author  :	 Dominic Giles
Version :

Results will be written to results.xml.
Hit Return to Terminate Run...

Time		Users	TPM	TPS

8:22:56 AM      1       14450   775

Please mind I am using 1 user (-uc 1) and a testing time of 1 minute (-rt 00:01), which should be longer when you are doing real testing. As a reminder, I am using 1 session because I want to understand the latency, not the bandwidth! In order to understand if the network traffic looks the same as oratcptest.jar, I can use tcpdump once again. Here is a snippet of the traffic:

00:00:00.000106 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5839:5852, ack 5986, win 272, length 13
00:00:00.000491 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 5986:6001, ack 5852, win 330, length 15
00:00:00.000234 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5852:6003, ack 6001, win 272, length 151
00:00:00.000562 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6001:6077, ack 6003, win 330, length 76
00:00:00.000098 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6003:6016, ack 6077, win 272, length 13
00:00:00.000484 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6077:6092, ack 6016, win 330, length 15
00:00:00.000238 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6016:6159, ack 6092, win 272, length 143
00:00:00.000591 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6092:6425, ack 6159, win 330, length 333

The important bit is this shows the same single packet traffic client to server and back as we saw oratcptest generated, however now with varying packet size (which is logical, different SQL statements are sent to the database), the PSH bit is set, which also is the same as oratcptest generated.

Let’s assume this is a real-life workload. In order to measure and calculate differences in performance between different networks, we need the average packet length. This can be done with a tool called tcpstat (this link provides the EL6 version). In my case I have only one application using a database on this server, so I can just filter on port 1521 to measure my SQL-Net traffic:

[user@oid1 ~]$ sudo tcpstat -i eth0 -o "Packet/s=%p\tmin size: %m\tavg size: %a\tmax size: %M\tstddev: %d\n" -f 'port 1521'
Packet/s=2526.40	min size: 53	avg size: 227.76	max size: 1436	stddev: 289.21
Packet/s=2531.40	min size: 53	avg size: 229.79	max size: 1432	stddev: 291.22
Packet/s=2634.20	min size: 53	avg size: 229.59	max size: 1432	stddev: 293.38
Packet/s=2550.00	min size: 53	avg size: 234.11	max size: 1435	stddev: 296.77
Packet/s=2486.80	min size: 53	avg size: 232.24	max size: 1436	stddev: 293.16

In case you wondered why tcpstat reports a minimum length of 53 and tcpdump (a little up in the article) of 13; tcpstat reports full packet length including packet, protocol and frame headers, tcpdump in this case reports the payload length.

Now we can execute oratcptest.jar again, but with a payload size set that matches the average size that we measured, I have taken 250 as payload size:

[user@oid1 ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(09:39:47) The server is ready.
                        Throughput                 Latency
(09:39:48)          0.365 Mbytes/s                0.685 ms
(09:39:48) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.365 Mbytes/s
	             Avg. latency = 0.685 ms

As you can see, there is a real modest increase in average latency going from 0.670ms to 0.685ms.

In order to test the impact of network latency let’s move the oratcptest client to the server, to get the lowest possible latency. Actually, this is very easy, because the oratcptest.jar file contains both the client and the server, so all I need to do is logon to the server where I started the oratcptest.jar file in server mode, and run it in client mode:

[user@db01m ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(14:49:29) The server is ready.
                        Throughput                 Latency
(14:49:30)         12.221 Mbytes/s                0.020 ms
(14:49:30) Test finished.
	       Socket send buffer = 26010 bytes
	          Avg. throughput = 11.970 Mbytes/s
	             Avg. latency = 0.021 ms

Wow! The roundtrip latency dropped from 0.685ms to 0.021ms! Another test using oratcptest.jar using a true local network connection (with Linux being virtualised using Xen/OVM) shows a latency of 0.161ms.

These are the different network latency figures measured with oratcptest using a payload size that equals my average network payload size:
– Local only RTT: 0.021
– Local network RTT: 0.161
– Different networks RTT: 0.685

If I take swingbench and execute the ‘stresstest’ run local, on a machine directly connected via the local network and across different networks (think cloud), and now measure TPS (transactions per second), I get the following figures:
– Local only TPS: 2356
– Local network TPS: 1567
– Different networks TPS: 854

Do these figures make sense?
– Local only: Time not in network transit per second: 1000-(0.021*2356)=950.524; approximate average time spend on query: 950.523/2356=0.40ms
– Local network: 1000-(0.161*1567)=747.713/1567=0.48ms
– Different networks: 1000-(0.685*854)=415.010/854=0.49ms
It seems that this swingbench test spends roughly 0.40-0.50ms on processing, the difference in transactions per second seem to be mainly caused by the difference in network latency.

This second blogpost on Performance Co Pilot or PCP in short is about visualisation. In case you haven’t read the first part, here it is, which describes how it works, why you should use it, and how you can install it.

One way of visualising PCP is using the pmchart utility. The pmchart utility is installed via the pcp-gui package (yum install pcp-gui). The pmchart utility uses X to display a window and draw lines, bar graphs, area graphs, etc. in a sense alike the performance manager on Microsoft Windows. You can select the individual performance statistics PCP measures, which can be viewed and investigated with pminfo, for example kernel.all.cpu.user:

# pminfo -dT kernel.all.cpu.user

    Data Type: 64-bit unsigned int  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: counter  Units: millisec
total user CPU time from /proc/stat for all CPUs, including guest CPU time

Use ‘pminfo’ without any argument to get a list of available statistics.

When I tried the pmchart utility on OL7.2, I needed to install the dejavu-sans fonts in order not to get black squares as letters in the pmchart window (yum install dejavu-sans-fonts). Also, there are pre-created ‘views’ for filesystem, load, memory, etc. (file, open view). This is very nice, it requires you to log on to a host to investigate with X tunnelled to your system (or have VNC setup), start pmchart and select metrics or open a view, and then record them. It doesn’t seem to be possible to use the archives pmlogger creates for historical analysis, nor it doesn’t seem to be able to save the recorded data, only the graphics that are created based on the selected statistics or using a view as a picture.
Note: upon reading the documentation, it turns out it is possible to use archives. This means you can pick up an archive from a machine and load it using pmchart locally.

However, there are different ways of investigating a host via PCP statistics… The pcp-webapi package provides exactly what the name suggests: a deamon that externalises PCP data via a web based (REST) API. Here’s how that installed:
There is a caveat: the pcp-webapi package is in the ‘optional latest’ repository, which might not be enabled. You can use ‘yum repolist all’ to see all configured repositories.

# yum install --enablerepo="ol7_optional_latest" pcp-webapi

After installing, you need to enable automatic startup (using systemctl in OL7) and start it:

# systemctl enable pmwebd
# systemctl start pmwebd

At this point I need to point out that starting pmwebd means a daemon starts at port 44323 for HTTP requests, which does not provide any authentication. If the host you are using is directly connected to the internet, you should protect this port and do not allow access. The best to use this port is using a tunnel.

This, however, just enables the API, it doesn’t provide anything that displays data useful for a person. That is done when you add the pcp-webjs package:

# yum install --enablerepo="ol7_optional_latest" pcp-webjs

This installs 3 webapps: vector, graphite and grafana.

Vector ( is a web based application that is written in javascript that fetches current data from the pcp webapi and creates graphs from it. You can use the vector app that is installed as part of the pcp-webjs package together with PCP, or start it up from somewhere else, as an example, the documentation mentions putting the app in an Amazon S3 bucket. If it’s started from somewhere else, you need to point vector to the PCP webapi of the machine you want to investigate because your browser, which runs vector, needs to have access to port 44323 to fetch data from the webapi. To use vector that is installed with PCP using the pcp-webjs package, point your browser to http://hostname:44323/vector and specify the hostname or ip address at “Hostname”. This is how that looks like:
Vector displays PCP measured statistics about CPU, memory, network, disk and even about containers, with an high resolution (2 seconds by default).

Grafana ( installed with the webjs package is a web based application that takes an approach opposite to vector, it fetches the data from the archives created by pmlogger and displays the statistics stored in these. The default dashboard displays the 1 minute load average, network I/O, disk reads and writes, memory statistics and filesystem fullness. The different colours in the graphs over time indicate different PCP archives being used, because by default PCP archives store data per day. To use grafana, point your browser to http://hostname:44323/grafana. This is how that looks like:

This blog post is about two things: one how you can monitor who is bringing you database up and down (there is a twist at the end!) and two how you can very conveniently do that with aggregated logs in a browser with a tool called ‘Kibana’, which is the K in ELK.

What is the ‘ELK stack’?
The ELK stack gets it’s name from Elasticsearch, Logstash and Kibana.
– Elasticsearch is an open source search engine based on Apache Lucene, which provides a distributed, multitenant-capable full-text search engine with a http web interface and schema-free JSON documents.
– Logstash is a fully configurable open source data processing pipeline that can receive data from a multiple sources simultaneously, transform it and output it based on the output plugin, which is the elastic search plugin in this blogpost but could be anything from STDOUT, an unix pipe, a file, a file in CSV, HTTP, email, IRC, Jira, graphite, kafka, mongodb, nagios, S3, SolR, … really whatever you want.
– Kibana is an open source data visualisation plugin for Elasticsearch.
When looking at Kibana, it quite much looks like the splunk interface.

Installing the ELK stack.
Installing the ELK stack in a basic way is easy. In this blogpost I will install everything on the same host, everything being the ELK stack and an Oracle database installation. In reality you should have a log gatherer on every host (called ‘filebeat’) and a dedicated host which runs the rest of the stack (logstash, elasticsearch and kibana). The below install actions were executed on a Linux 64 bit host running Oracle Linux 6.8.
In order to make the installation really easy, I use the yum repository of the elastic company, this is how to set that up (all done as root, ‘#’ indicates root):

# rpm --import
# vi /etc/yum.repos.d/elastic.repo
name=Elastic repository for 5.x packages

Install elasticsearch:

# yum install java-1.8.0-openjdk
# yum install elasticsearch
# chkconfig --add elasticsearch
# service elasticsearch start

Install logstash:

# yum install logstash

Configure logstash input and output:

# vi /etc/logstash/conf.d/input.conf
input {
  beats {
    port => 5044
# vi /etc/logstash/conf.d/output.conf
output {
  elasticsearch {
    hosts => "localhost:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"

Verify the logstash config files:

# sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash -t
Sending Logstash's logs to /var/log/logstash which is now configured via
Configuration OK

If you see the ‘Configuration OK’ message, it means logstash could interprent the configuration files. It does not mean it will all work as desired, there could be runtime issues.
Now let’s start logstash. Logstash uses upstart (meaning a startup script in /etc/init) instead of the legacy startup mechanism using the chkconfig and service utilities.

# initctl start logstash

The last part of the data pipeline is ‘filebeat’. There are and could be multiple input products, in this blogpost I use ‘filebeat’, which keeps track of logfiles.

# yum install filebeat
# chkconfig --add filebeat

We are going to look into linux and oracle auditing. So we need to keep track of a couple of files:
– /var/log/secure: this is the default linux logfile which contains all kinds of authentication messages, as defined in /etc/rsyslog.conf (authpriv.* /var/log/secure).
– /u01/app/oracle/admin/*/adump/*.aud: this is the default place where the oracle database stores it’s audit files. These audit files provide what is called ‘mandatory auditing’, and includes at least connections to the instance with administrator privilege, database startup and database shutdown. The default is a normal text based logfile, it could be set to XML.
– /var/log/audit/audit.log: this is the logfile of the linux kernel based audit facility. This is actually a lesser known hidden gem in Linux, and provides audit information from the Linux kernel.

These files need to be configured in filebeat, in the file: /etc/filebeat/filebeat.yml. As the extension of the file indicates, this is a file organised in YAML syntax. The best way to configure the file is to move the file, and create your own file with your desired configuration. First of all we add the output, which is logstash in our case. Please mind the default configuration of filebeat is direct output to elasticsearch, which means we don’t have an option to enrich the data!

# mv /etc/filebeat/filebeat.yml /etc/filebeat/filebeat.yml.orig
# vi /etc/filebeat/filebeat.yml
  hosts: ["localhost:5044"]

Please mind the two spaces in front of ‘hosts’, which is mandatory for a YAML document!
Next up we add the files to monitor in the configuration file. The linux based logfiles are easy:

- input_type: log
    - /var/log/secure
  document_type: secure

- input_type: log
    - /var/log/audit/audit.log
  document_type: audit

One thing to notice is that a type is set for each file (which is really just a name for the file filebeat monitors), which makes it able to find data from these specific files later on. Now the Oracle audit file:

- input_type: log
    - /u01/app/oracle/admin/*/adump/*.aud
  document_type: oracle_audit
    pattern: '^[A-Za-z]{3} [A-Za-z]{3} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}'
    negate: true
    match: after

This looks a bit more complicated. The reason for the complication is the multiline specification. An Oracle database audit file contains a timestamp, after which the audit data is written; it looks like this:

Thu Jan 19 13:44:12 2017 +00:00
LENGTH : '198'
ACTION :[49] 'ALTER DATABASE OPEN /* db agent *//* {0:0:476} */'
CLIENT USER:[6] 'oracle'
STATUS:[1] '0'
DBID:[10] '2622783786'

The important things at this time: the ‘pattern’ keyword specifies the timestamp, you can see you can match it with the timestamp, and all the following data needs to be processed together, this is a single record, written over multiple lines. ‘negate: true’ means that anything that does not fit the pattern needs to be added to this piece of data, ‘match: after’ means that this is added after the pattern is matched.

Now that filebeat is setup, we can start the filebeat daemon:

# service filebeat start

The last component is kibana:

# yum install kibana
# chkconfig --add kibana
# service kibana start

Now that we’ve set the entire pipeline up, a next thing to do is to configure logstash to enrich the data. Here’s the how it’s done for the Oracle database audit file:

# vi /etc/logstash/conf.d/oracle-audit.conf
filter {
  if [type] == "oracle_audit" {
    grok {
      match => { "message" => "^%{DAY} %{MONTH:M} %{MONTHDAY:d} %{HOUR:h}:%{MINUTE:m}:%{SECOND:s} %{YEAR:y}" }
      add_tag => [ "grok", "oracle_audit" ]
    grok {
      match => { "message" => "ACTION :\[[0-9]*\] '(?<ora_audit_action>.*)'.*DATABASE USER:\[[0-9]*\] '(?<ora_audit_dbuser>.*)'.*PRIVILEGE :\[[0-9]*\] '(?<ora_audit_priv>.*)'.*CLIENT USER:\[[0-9]*\] '(?<ora_audit_osuser>.*)'.*CLIENT TERMINAL:\[[0-9]*\] '(?<ora_audit_term>.*)'.*STATUS:\[[0-9]*\] '(?<ora_audit_status>.*)'.*DBID:\[[0-9]*\] '(?<ora_audit_dbid>.*)'" }
    grok {
      match => { "source" => [ ".*/[a-zA-Z0-9_#$]*_[a-z0-9]*_(?<ora_audit_derived_pid>[0-9]*)_[0-9]*\.aud" ] }
    mutate {
      add_field => { "ts" => "%{y}-%{M}-%{d} %{h}:%{m}:%{s}" }
    date {
      locale => "en"
      match => [ "ts", "YYYY-MMM-dd HH:mm:ss" ]
    mutate {
      remove_field => [ "ts", "y", "M", "d", "h", "m", "s" ]

It’s beyond the scope of this article to go through every detail, but as you can see we apply a filter. Everything in this filter takes place for the type “oracle_audit”, which is set by filebeat. The next thing we encounter a couple of times is ‘grok’s’. The term grok comes from the Robert Heinlein science-fiction novel ‘Stranger in a Strange land’. Essentially, a grok with logstash means you specify a pattern, for which the actions are applied if the specified pattern matches. The first grok looks for the date pattern for which extra fields are created (M,d,h,m,s, after the colon) in the field ‘message’, and adds a tag (a word in the tags field for the record that is created). The second grok also looks in the ‘message’ field, and specifies text (ACTION for example), some other characters and then (?.*) is visible. This is a custom pattern, for which the field name to be created is in between < and > and is followed by a pattern. This grok line (including all the patterns) creates fields for all the Oracle audit fields in the audit file! The next grok picks up the PID from the filename of the logfile (the filename is in a field ‘source’), and the two mutates create and destroy a new field ts which is used for the date, and date specifies the date/time with the data flowing through logstash is filled with the date and time in the ts field, instead of the time filebeat picked up the data and sent it through logstash. Please mind that if you add (or change) configuration in a logstash configuration file, you need to restart logstash.

We are all set now! Last words on this configuration: kibana and elasticsearch by default do not require authentication. Do not expose the ports of these products to the internet! I am using a tunnel to the kibana website, which runs on port 5601. It’s very easy to ssh into the machine running the ELK stack using ssh user@machine -L 5601:localhost:5601, which creates a port on localhost:5601 on my machine at home (-L = local), for which communication is tunnelled to localhost:5601 on the remote machine (the localhost in the ssh line example is an address on the machine you ssh in to, this could also be another server which is only visible from the machine you ssh into.

First let’s login to the machine, and see what information is revealed with /var/log/secure:
You get this screen when you goto kibana at port 5601, enter: ‘type: secure’ in the search bar to display data of the type secure (which is what is set with document_type: secure in filebeat.yml), and login to the machine where filebeat is watching the /var/log/secure file. As you can see, you get two lines from the ssh deamon, one indicating something’s done with pam (pam_unix), and one line which indicates it authenticated via a public key for user ops from an ip address (which is anonymised) at port 39282 via ssh2.

With a lot of cloud providers you get a user which has public key authentication setup (which you saw above), after which you need to sudo to for example the oracle user. In a lot of companies, you get a personalised account to log on to servers, after which you need to sudo to oracle. In both cases you need to use sudo to become the user that you need to administer, for example oracle. This is what sudo generates in the /var/log/secure file:
The secure log displays sudo was invoked by the user opc, on TTY pts/1 and the command executed via sudo was ‘/bin/su – oracle’.

Now that I have became oracle using sudo, I set the environment of my database using oraenv and started up a database. Now go over to kibana, and issued a search for ‘type: oracle_audit’. This is how that looks like:
Now if you look at what the audit record provides, the only things that provide something useful for the purpose of investigating who did stop or start a database are ACTION and CLIENT TERMINAL (I assume the database is stopped and started by the ‘oracle’ user). Now change the ‘selected fields’ in kibana and add the (dynamically created!) fields: ora_audit_action, ora_audit_term and ora_audit_derived_pid, and remove message. This is how that looks like:
The important thing to look for here is the ora_audit_action ‘startup’, then look at the ora_audit_derived_pid, and two rows down we see terminal ‘pts/1’ was the terminal on which this was entered.

Now that we know the terminal, we can add in searching in the message field for the secure type. Enter ‘type: oracle_audit OR (type: secure AND message: “*pts/1*”)’ in the search bar.
Okay, this works. But it’s far from perfect. In fact, it only works if the username of the session doing the sudo is the only session with that username, otherwise if there is more than one session it can be any of these sessions doing the sudo, since there is nothing more than the username. This also means that if there is a direct logon to the oracle user, there is no way to identify a session with a TTY, and thus database startup and shutdown are completely anonymous, there’s no way to link a specific session to that action outside of probably the oracle user and a TTY which can not be linked to anything like for example an ip address.

Is there a better way? Yes! We can also use the linux, kernel based, auditing service, which is on by default. This service keeps a log file at /var/log/secure/secure.log, and gives way more granular auditing events than the /var/log/secure log. Linux audit generates a lot of diverse types of rows, so it’s actually not easy to grok them, but in order to understand which session executed a startup or shutdown, the only audit row that is important for this specific use case is an audit type called ‘CRED_ACQ’. The grok for this type looks like this:

# vi /etc/logstash/conf.d/linux-audit.conf
filter {
  if [type] == "audit" {
    grok {
        match => { "message" => ""type=%{WORD:audit_type} msg=audit\(%{NUMBER:audit_epoch}:%{NUMBER:audit_counter}\): pid=%{NUMBER:audit_pid} uid=%{NUMBER:audit_uid} auid=%{NUMBER:audit_auid} ses=%{NUMBER:audit_ses} msg='op=%{NOTSPACE:audit_op} ((acct=\"%{GREEDYDATA:audit_acct}\")|(id=%{NUMBER:audit_id})|acct=%{BASE16NUM:audit_acct}) exe=\"%{GREEDYDATA:audit_exe}\" hostname=%{NOTSPACE:audit_hostname} addr=%{NOTSPACE:audit_addr} terminal=%{NOTSPACE:audit_terminal} res=%{NOTSPACE:audit_res}'" }
        add_tag => [ "grok", "audit" ]
    date {
      locale => en
      match => [ "audit_epoch", "UNIX" ]

This grok matches the CREDIT_ACQ audit type which we will use to trace back the session via the audit log. Another nicety of this logstash configuration is the audit records time using an epoch timestamp, which logstash can translate back to a human readable timestamp. Once this is in place, log in again and use sudo to switch to oracle (or log in directly as oracle, it doesn’t really matter that much now!), and search in kibana for: ‘type: oracle_audit OR (type: audit AND audit_type: CRED_ACQ)’. Now get the relevant fields; remove ‘message’, and add: audit_hostname, audit_acct, audit_ses, audit_terminal, ora_audit_term, ora_audit_derived_pid, ora_audit_action. This probably returns a log of rows, now scroll (“back in time”) and search for the startup or shutdown command, and then follow the trail:
Startup points to (oracle server process) PID 17748, which was instantiated by a session using by pts/1 (two rows down), one row further down we see the audit information which shows pts/1, which is connected to audit_ses 4230. The audit_ses number is a number that sticks with a session, regardless of using sudo. If you follow down number 4230, you see multiple rows of audit_ses 4230, some of them with root, which is typical for sudo switching from one user to another. The final row shows the user logging in with it’s ip address. In other words: using the linux kernel audit facility, you can get all available information!

Okay, all happy now? Are you sure? Now for the twist!

Whenever you use RAC, or use ASM, or use both, or you are using the grid infra single instance as a framework to track your your listener(s) and database(s) and start and stop them automatically, you can still stop and start an instance directly using sqlplus, but in most cases you will be using the grid infrastructure crsctl or srvctl commands. When the grid infrastructure crsctl and srvctl commands are used, this is how the Oracle database audit information looks like:
As you can see, because the cluster ware brought the database down, there is no terminal associated with the shutdown. So the above mentioned way of first searching for startup and shutdown in the oracle audit information, finding the associated terminal, and then tracing it through the audit records can NOT be used whenever the Oracle cluster ware is used, because a grid infrastructure deamon is actually stopping and starting the database, and the grid infrastructure does not keep any information (that I am aware of) about which client invoked a grid infrastructure command. I guess a lot of auditors will be very unhappy about this.

Now the good news: you can solve this issue very easy. The downside is it requires additional configuration of the linux auditing. The solution is to put an ‘execution watch’ on srvctl and crsctl; this is how this is done:

# auditctl -w /u01/app/ -p x -k oracrs
# auditctl -w /u01/app/ -p x -k oracrs

In order to validate the working, I started a database using srvctl, and searched for: ‘(type: oracle_audit AND ora_audit_action: STARTUP) OR (type: audit AND message: key=\”oracrs\”)’. This is how that looks like:
As you can see, there’s the Oracle database record indicating the startup of the database, and a little while back in time there’s the linux audit row indicating the execution of the srvctl executable. Once you are at that point, you can using the earlier mentioned way of using the audit_ses number to trace the session execution, including sudo and ip address at logon time.

%d bloggers like this: