A performance deep dive into tablespace encryption

This is a run-through of a performance investigation into Oracle tablespace encryption. These are the versions this test was performed on:

$ cat /etc/oracle-release
Oracle Linux Server release 6.8
$ /u01/app/oracle/product/12.1.0.2/dbhome_1/OPatch/opatch lspatches
24315824;Database PSU 12.1.0.2.161018, Oracle JavaVM Component (OCT2016)
24006101;Database Patch Set Update : 12.1.0.2.161018 (24006101)

In this test I created an encrypted tablespace:

SQL> create tablespace is_encrypted datafile size 10m autoextend on next 10m encryption default storage(encrypt);

(this assumes you have already set up a master encryption key)
And I created a simple heap table in the encrypted tablespace with one row:

SQL> create table t_is_encrypted ( f1 varchar2(100) ) tablespace is_encrypted;
SQL> insert into t_is_encrypted values ('AAAAAAAAAA');

Also, in order to do some performance tests, I created a simple heap table with a lot of rows:

SQL> create table t_encrypted_large ( id number, random_string varchar2( 100 ) ) tablespace is_encrypted;
SQL> begin
        for counter in 1..32000000 loop
                insert into t_encrypted_large values ( counter, dbms_random.string('l',50) );
        end loop;
        commit;
end;
/

Let me first go back to basics a little and explain how tablespace encryption works: tablespace encryption sets the ‘E’ flag in the cache header of the blocks of a segment that contain data (block type# 6, ‘trans data’), which means the block is UNencrypted in the buffer cache, but is encrypted once the block is persisted to disk. If a block is read from disk into the cache, it must be decrypted first. Once a block is in the cache and decrypted, it can be read and used without the overhead of encryption, because it is not encrypted anymore.
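
To make that mechanism concrete, here is a minimal toy model in Python (purely illustrative: the `decrypt` function, the XOR "cipher" and the `disk`/`buffer_cache` dicts are hypothetical stand-ins, not Oracle code):

```python
# Toy model of tablespace encryption: blocks live encrypted on disk, are
# decrypted once when read into the buffer cache, and are served from the
# cache unencrypted afterwards. XOR stands in for the real AES cipher.
KEY = 0x5A

def decrypt(block):
    # placeholder for the real decryption; XOR is purely illustrative
    return bytes(b ^ KEY for b in block)

disk = {1: bytes(b ^ KEY for b in b"row data")}  # block stored encrypted
buffer_cache = {}
decryptions = 0

def read_block(block_nr):
    global decryptions
    if block_nr not in buffer_cache:            # buffered read from disk:
        buffer_cache[block_nr] = decrypt(disk[block_nr])  # decrypt once
        decryptions += 1
    return buffer_cache[block_nr]               # cache hits skip decryption

read_block(1)
read_block(1)
print(decryptions)  # 1: only the initial read from disk paid the cost
```

A direct path read, by contrast, would call the equivalent of `decrypt` on every execution, because nothing is kept in the shared cache.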

A special case arises when direct path reads are used by Oracle. A direct path read means the blocks are read from disk into the process’ PGA, so there is no involvement of the buffer cache. This means that with tablespace encryption, direct path reads ALWAYS must decrypt the data blocks, and thus always suffer the decryption overhead. This inherent overhead applies to any direct path action (both direct path reads and direct path writes).

Let’s first do a simple count(*) on table t_encrypted_large, and collect a wait profile. I am doing this with Tanel Poder’s snapper. First session, which collects the profile:

SQL> var snapper refcursor
SQL> @snapper all,begin 1 1 793

Now with the begin snapshot ready, head over to the other session, which will scan the t_encrypted_large table:

SQL> select count(*) from t_encrypted_large;

  COUNT(*)
----------
  32000000

Now end the snapper snapshot and look at where the time was spent:

SQL> @snapper all,end 1 1 793
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    793, SYS       , TIME, parse time elapsed                                        ,           140,     5.32us,      .0%, [          ],          ,           ,
    793, SYS       , TIME, PL/SQL execution elapsed time                             ,           113,     4.29us,      .0%, [          ],          ,           ,
    793, SYS       , TIME, DB CPU                                                    ,       4912502,   186.66ms,    18.7%, [@@        ],          ,           ,
    793, SYS       , TIME, sql execute elapsed time                                  ,       4915777,   186.78ms,    18.7%, [##        ],          ,           ,
    793, SYS       , TIME, DB time                                                   ,       4916441,   186.81ms,    18.7%, [##        ],          ,           ,       2.66 % unaccounted time
    793, SYS       , WAIT, Disk file operations I/O                                  ,            69,     2.62us,      .0%, [          ],         2,        .08,     34.5us average wait
    793, SYS       , WAIT, db file sequential read                                   ,           484,    18.39us,      .0%, [          ],         3,        .11,   161.33us average wait
    793, SYS       , WAIT, db file scattered read                                    ,           220,     8.36us,      .0%, [          ],         1,        .04,      220us average wait
    793, SYS       , WAIT, direct path read                                          ,          1641,    62.35us,      .0%, [          ],         4,        .15,   410.25us average wait
    793, SYS       , WAIT, SQL*Net message to client                                 ,             2,      .08us,      .0%, [          ],         3,        .11,      .67us average wait
    793, SYS       , WAIT, SQL*Net message from client                               ,      20701153,   786.56ms,    78.7%, [WWWWWWWW  ],         3,        .11,       6.9s average wait

The time spent executing SQL is 186.78ms, and during that time, 186.66ms was spent on CPU (DB CPU). The wait events confirm this: the time in IO-related waits is measured in microseconds. In case you wondered whether this output is sane: this system is using local NVMe disks, and you see the full table scan flipping from buffered multiblock IO (db file scattered read) to direct path multiblock IO (direct path read). The conclusion at this point is that as good as all time was spent running on CPU. Of course this might be different if IO has higher latency (random IO latency of rotating media is between 5 and 12ms, reading from a SAN cache normally has a latency of around 1ms).
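
As a quick back-of-the-envelope check on the snapper deltas (simple arithmetic on the numbers above, not snapper output):

```python
# Sanity check on the snapper deltas: how much of DB time was on CPU,
# and how much was IO wait?
db_time  = 4916441            # us, 'DB time' delta
db_cpu   = 4912502            # us, 'DB CPU' delta
io_waits = 484 + 220 + 1641   # us: sequential read + scattered read + direct path read

print(f"CPU share of DB time: {db_cpu / db_time:.1%}")   # 99.9%
print(f"IO share of DB time: {io_waits / db_time:.3%}")  # 0.048%
```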

It is not possible to get a breakdown of where (ON) CPU time is spent inside the database. There are session counters, but these do not measure time. However, on the operating system level there is a useful tool: perf. In my experience, perf output is only really useful when the hardware PMU events are available from the operating system. This can be tested by specifying the event for perf to use. If you have the PMU events available, perf stat -e cycles sleep 1 will return something like:

$ perf stat -e cycles sleep 1

 Performance counter stats for 'sleep 1':

           736,481      cycles

       1.001210217 seconds time elapsed

If you don’t have PMU events available (because you are running in a virtual machine, for example; some virtualisation products can expose the PMU events), it will look like:

$ perf stat -e cycles sleep 1

 Performance counter stats for 'sleep 1':

   <not supported>      cycles

       1.001025828 seconds time elapsed

Back to the original purpose of this blogpost: the performance of tablespace encryption. Let’s get a breakdown of the functions the time is spent in when doing the ‘select count(*) from t_encrypted_large’. This is done by first executing perf record in one session, specifying the process ID of the server process that will execute the ‘select count(*) from t_encrypted_large’. Once the SQL is finished, the perf session can be stopped (ctrl-c), and the contents of the gathered perf database can be reported. This is how the recording is done:

$ perf record -g -p 57531

Now the ‘select count(*) from t_encrypted_large’ is executed. Once the result comes back in the sqlplus session, go to the perf session and ctrl-c it:

^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.928 MB perf.data (5114 samples) ]

Now that the perf data is collected, display a summary of the data:

$ perf report --sort comm --max-stack 2
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 5K of event 'cycles'
# Event count (approx.): 15445982213
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_57531_ao
            |--27.44%-- kdr4chk
            |--15.91%-- kdb4chk1
            |--10.14%-- cipher_loop_p3
            |--8.56%-- msort_with_tmp
            |--5.84%-- kdrrea2
            |--3.98%-- memcpy
            |--3.12%-- kd4_ent_cmp
            |--2.13%-- kdrrrh2
            |--1.99%-- kdstf00000010000100kmP
            |--1.53%-- __intel_ssse3_rep_memcpy
            |--1.42%-- __intel_new_memset
            |--1.40%-- blks_loop_p3
            |--1.14%-- y8_ExpandRijndaelKey
            |--1.06%-- kdbulk
            |--0.88%-- ownGetReg
            |--0.62%-- kghalf
             --12.86%-- [...]

What this shows is that 100% of the samples are from the process ‘oracle_57531_ao’, 27.44% of the samples were spent in a function called ‘kdr4chk’, 15.91% in a function called ‘kdb4chk1’, and so on.

Just by looking at the function names, these do not strike me as functions that do decryption. Function names like ‘cipher_loop_p3’ and ‘y8_ExpandRijndaelKey’ look more like functions doing some of the heavy lifting for decryption. In fact, the functions the most time is spent in (kdr4chk and kdb4chk1) seem to indicate some kind of checking.

A first thought is to look at the database parameters for block checking, turn them off, and redo the test.

SQL> show parameter db_block_check

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_block_checking		     string	 OFF
db_block_checksum		     string	 FALSE

In fact, I had already turned db block checking and db block checksum off. So it’s not the checking governed by these parameters; what can it be then?

The best way I could come up with is getting a trace of the exact functions the process executes. However, the number of C functions an executable goes through is typically gigantic. For that reason, when investigating the C functions, it’s best to use a very small table to minimise the processing: that is ‘t_is_encrypted’, which contains only one row.

However, there is no tooling by default on Linux that gives you the ability to trace the exact C functions a process is executing; perf, which was just used, only SAMPLES execution. However, the intel pin tools give you the option to create a C function execution overview. Please use the link to see how you can download and link pin tools, and then create a debugtrace of the execution of ‘select * from t_is_encrypted’ for yourself. If you are interested in the findings based on the debugtrace, read on!

First of all, understanding the functions requires some experience. I don’t think there is any workaround for gaining that knowledge; in order to learn how to interpret them, you just have to study them and spend time working them out.

The output of debugtrace on ‘select * from t_is_encrypted’ (which fetches the one row in the table from an encrypted tablespace) looks like this:

         3 @@@ return underflow
Return 0x00007f225b355818 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x151
        14 @@@ return underflow
Return 0x000000000cf1036d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cf102c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cef19dd /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cef164a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000cf00486 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b2bdb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
       156 Tailcall 0x0000000005b2bdbe /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b2bda0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffe4ed10685, 0x7f225a965da0, ...)
       158 | Call 0x0000000005b2bda1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b32990 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
       172 | Return 0x0000000005b34d77 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffe4ed10685
       174 Return 0x0000000005b2bda7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffe4ed10685
       197 @@@ return underflow

That is not very well readable. I created some simple sed filters to make the output more readable:

cat debugtrace.out | sed 's/^\ *\([0-9]*\)\(\ .*\)Call.*-> 0x.*:\(.*\)/\2> \3/' | sed 's/^\ *\([0-9]*\)\(\ .*\)Return.*:\(.*returns: .*\)/\2< \3/' | sed 's/^\ *\([0-9]*\)\(\ .*\)Tailcall.*-> 0x.*:\(.*\)/\2<> \3/' | grep -v underflow
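
If sed is not your thing, the same filtering can be sketched in Python (a hedged equivalent of the pipeline above, tested only against the line shapes shown in this post):

```python
import re

def simplify(line):
    """Rewrite a debugtrace Call/Tailcall/Return line into the indented
    call-flow form the sed pipeline produces; drop underflow markers."""
    if 'underflow' in line:
        return None
    for pattern, marker in (
            (r'^ *\d+( .*?)Tailcall .*-> 0x[0-9a-f]+ .*:(.+)', '<>'),
            (r'^ *\d+( .*?)Call .*-> 0x[0-9a-f]+ .*:(.+)',     '>'),
            (r'^ *\d+( .*?)Return .*:(.*returns: .*)',         '<')):
        m = re.match(pattern, line)
        if m:
            return f"{m.group(1)}{marker} {m.group(2)}"
    return line  # 'Return' lines without a level number pass through unchanged

line = ("       158 | Call 0x0000000005b2bda1 /u01/app/oracle/product/12.1.0.2/"
        "dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> "
        "0x0000000005b32990 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/"
        "oracle:__intel_ssse3_rep_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)")
print(simplify(line))
# " | > __intel_ssse3_rep_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)"
```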

If you use that on the same file, it gets better, with the aim to show the function call flow:

Return 0x00007f225b355818 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x151
Return 0x000000000cf1036d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
Return 0x000000000cf102c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
Return 0x000000000cef19dd /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
Return 0x000000000cef164a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
 > _intel_fast_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
 <> _intel_fast_memcpy.P(0x7ffe4ed10685, 0x7f225a965da0, ...)
 | > __intel_ssse3_rep_memcpy(0x7ffe4ed10685, 0x7f225a965da0, ...)
 | < __intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffe4ed10685
 < _intel_fast_memcpy.P+0x000000000007 returns: 0x7ffe4ed10685
Return 0x000000000cf004ed /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
 > kslwtectx(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | > sltrgftime64(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | | > sltrgftime64_cgt(0x7ffe4ed103c0, 0x7f225a965da1, ...)
 | | | > clock_gettime@plt(0x1, 0x7ffe4ed10230, ...)
 | | | | > clock_gettime(0x1, 0x7ffe4ed10230, ...)
 | | | | < clock_gettime+0x000000000054 returns: 0
 | | | < clock_gettime+0x00000000006a returns: 0
 | | < sltrgftime64_cgt+0x000000000058 returns: 0x68724a87136
 | < sltrgftime64+0x00000000005c returns: 0x68724a87136
 | > kslwt_end_snapshot(0x2c674c080, 0x2c674c080, ...)
 | < kslwt_end_snapshot+0x000000000235 returns: 0x2c674b128
 | > kslwt_update_stats_int(0x2c674c080, 0x2c674c080, ...)
 | | > kews_update_wait_time(0x6, 0x53696ef, ...)
 | | < kews_update_wait_time+0x00000000002b returns: 0x1
 | < kslwt_update_stats_int+0x000000000314 returns: 0x1b0
 | > kskthewt(0x68724a87136, 0x184, ...)
 | < kskthewt+0x0000000000d7 returns: 0x23
 < kslwtectx+0x0000000003c0 returns: 0xe923

The rows that start with ‘Return’ are functions that were already entered when debugtrace started, so debugtrace could not determine their level, and it reports an underflow in the output when the process returns from them.
The rows that start with ‘>’ mean the function after it is entered. The first two arguments are shown.
The rows that start with ‘<’ mean the function after it is returned from. The return code is shown.
The rows that start with ‘<>’ mean a ‘tail call’ was executed, which is a function executed as part of the return of another function.

To further explain this example: the above trace starts with the server process waiting on input. On linux, this is done using a read function (the __read_nocancel function comes from the libpthread operating system library), which is a place a server process is expected to wait, and it is actually in an oracle wait event (SQL*Net message from client). It becomes apparent the process was in a wait event because the code for ending a wait event (kslwtectx) is executed after the __read_nocancel function. A word of warning: in this output, kslwtectx seems to be the outermost (left-most) function. In reality this is not true; there are many more functions, but the process has not returned to them yet, so they are not known at that point of the trace. The full stack for the function a waiting database server process is sleeping in (__read_nocancel) is sntpread>ntpfprd>nsbasic_brc>nsbrecv>nioqrc>opikndf2>opitsk>opiino>opiodr>opidrv>sou2o>optima_real>ssthrdmain>main.

Okay, now let’s look into the relevant output of the ‘select * from t_is_encrypted’. If you wade through the trace, it’s best to first go to the line with ‘< kcbz_table_scan_read’. The kcbz_table_scan_read function performs a buffered multiblock read. ‘<’ means the return, and you need to go up from there to find the first important bit. Here’s the important bit:

 | | | | | | | | | | | > ksfd_clear_pgaobjsc(0, 0x2b, ...)
 | | | | | | | | | | | < ksfd_clear_pgaobjsc+0x000000000023 returns: 0
 | | | | | | | | | | | > _intel_fast_memcpy(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x1a5ad6000, 0x2b1115000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x1a5ad6000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x1a5ad6000
 | | | | | | | | | | | > _intel_fast_memcpy(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x182808000, 0x2b1117000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x182808000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x182808000
 | | | | | | | | | | | > _intel_fast_memcpy(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x178ed0000, 0x2b1119000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x178ed0000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x178ed0000
 | | | | | | | | | | | > _intel_fast_memcpy(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x12927c000, 0x2b111b000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x12927c000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x12927c000
 | | | | | | | | | | | > _intel_fast_memcpy(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x15411c000, 0x2b111d000, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x00000000243e returns: 0x15411c000
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x15411c000
 | | | | | | | | | | | > kcbz_free_table_scan_read_buffer(0x2b1115000, 0xa000, ...)
 | | | | | | | | | | | | > kslgetl(0x60023a80, 0x1, ...)
 | | | | | | | | | | | | < kslgetl+0x000000000144 returns: 0x1
 | | | | | | | | | | | | > kslfre(0x60023a80, 0x8, ...)
 | | | | | | | | | | | | < kslfre+0x000000000125 returns: 0
 | | | | | | | | | | | < kcbz_free_table_scan_read_buffer+0x0000000000f3 returns: 0
 | | | | | | | | | | < kcbz_table_scan_read+0x000000000664 returns: 0x5

Going up from the kcbz_table_scan_read return, we see a call to kcbz_free_table_scan_read_buffer, which gets and frees a latch (‘cache table scan latch’). Above that we essentially see the table scan read buffer contents being copied into the buffer cache per block, using an optimised version of the memcpy function. Luckily, debugtrace shows the first two arguments of a call; for memcpy these are a pointer to the destination as first argument, and a pointer to the source as second. If you look closely at the memcpy arguments, you see that the destinations are random (“scattered” through the buffer cache, which is where ‘db file scattered read’ gets its name from), and the source starts at 0x2b1115000 and increases by exactly 0x2000. 0x2000 equals 8192, which is the block size. The blocks are further processed in the buffer cache buffers in the order in which they were copied by kcbz_table_scan_read.
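
You can verify the stride directly from the five memcpy source addresses in the trace above (simple arithmetic on the trace values):

```python
# Source addresses of the five block copies out of the table scan read
# buffer, taken from the debugtrace output above.
sources = [0x2b1115000, 0x2b1117000, 0x2b1119000, 0x2b111b000, 0x2b111d000]

# Difference between each consecutive pair of source addresses:
strides = [b - a for a, b in zip(sources, sources[1:])]
print([hex(s) for s in strides])  # every step is 0x2000 = 8192 bytes
```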

The first buffer copied is 0x1a5ad6000. The next sequence of functions:

 | | | | | | | | | | > kcbzvb(0x1a5ad6000, 0x5, ...)
 | | | | | | | | | | | > kcbhvbo(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | < kcbhvbo+0x0000000000cc returns: 0
 | | | | | | | | | | < kcbzvb+0x000000000084 returns: 0x1
 | | | | | | | | | | > kcbztek_trace_blk(0xd5e15b8, 0, ...)
 | | | | | | | | | | < kcbztek_trace_blk+0x00000000001d returns: 0x1a5fce498
 | | | | | | | | | | > _setjmp@plt(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | <> __sigsetjmp(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | <> __sigjmp_save(0x7ffe4ed04ed0, 0, ...)
 | | | | | | | | | | < __sigjmp_save+0x000000000010 returns: 0
 | | | | | | | | | | > kcbz_encdec_tbsblk(0x5, 0x38b, ...)

Next is a function ‘kcbzvb’, which probably is kernel cache buffers space management (Z) verify buffer. Then kcbztek_trace_blk. My current thinking is this function reads the block to see if the encryption flag is set in the cache header, and whether it has been decrypted already. When a block is not encrypted, or is encrypted and already decrypted, kcbz_encdec_tbsblk is not called; in this case the block is encrypted and not yet decrypted, so we see the function kcbz_encdec_tbsblk being called. Inside kcbz_encdec_tbsblk, a lot of function calls are done which strongly indicate having something to do with keys and encryption, like kcbtse_get_tbskey, kcbtse_get_tbskey_info and a lot of functions starting with kgce (kgce_init, kgce_set_mode, kgce_set_key), most notably the function kgce_dec, which probably is kernel generic encryption – decryption. Here is how that function returns and what executes next (still inside the function kcbz_encdec_tbsblk):

 | | | | | | | | | | | < kgce_dec+0x0000000000a6 returns: 0
 | | | | | | | | | | | > _intel_fast_memcpy(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | <> _intel_fast_memcpy.P(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | | > __intel_ssse3_rep_memcpy(0x1a5ad6014, 0x7f225a96c0f0, ...)
 | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x0000000025e8 returns: 0x1a5ad6014
 | | | | | | | | | | | < _intel_fast_memcpy.P+0x000000000007 returns: 0x1a5ad6014
 | | | | | | | | | | | > kcbhxoro(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | | > sxorchk(0x1a5ad6000, 0x2000, ...)
 | | | | | | | | | | | | < sxorchk+0x0000000000e6 returns: 0x46dd
 | | | | | | | | | | | < kcbhxoro+0x000000000013 returns: 0x46dd

The kgce_dec function returns. It is my assumption the decryption of the block is now done. Next there is another memcpy function. The arguments of memcpy are *dest, *src. The first address is the same address we saw earlier, but at 0x1a5ad6014 instead of 0x1a5ad6000, so 0x14 (20) bytes past the start of the block. The source is 0x7f225a96c0f0. In most cases these high numbers are PGA memory addresses. So what I think happens here is: the block was copied into the buffer, then kcbztek_trace_blk found the encryption flag and the block encrypted, so the kcbz_encdec_tbsblk function was called to decrypt the block, which is done in the PGA, and then the decrypted version of the block is copied into the buffer, overwriting the encrypted version.

The next function kcbhxoro probably is a function to check the block. This must mean the checking is done on the unencrypted block contents. Please mind we are still in kcbz_encdec_tbsblk!
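
Judging by the name, sxorchk looks like an XOR-based checksum routine. Here is a sketch of the general idea only (this is not Oracle's actual algorithm, and the 0x46dd value above need not reproduce with it):

```python
def xor_checksum(block):
    """Fold a block into a 16-bit value by XOR-ing it two bytes at a time;
    a sketch of what an sxorchk-style routine might do."""
    chk = 0
    for i in range(0, len(block), 2):
        chk ^= int.from_bytes(block[i:i + 2], 'little')
    return chk

block = bytes(8192)                  # an all-zero block checks to 0
print(hex(xor_checksum(block)))      # 0x0
tampered = b'\x01' + block[1:]       # flipping a single bit changes the sum
print(hex(xor_checksum(tampered)))   # 0x1
```

The point of such a fold is that it is very cheap to compute, yet any single-bit corruption changes the result.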

Continuing the trace we see:

 | | | | | | | | | | | > kcbchk_ctx(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | > ktbdbchk(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | | > kdgchk(0x1a5ad6000, 0, ...)
 | | | | | | | | | | | | | | > kd4chk(0x1a5ad6000, 0x7fffffc7, ...)
 | | | | | | | | | | | | | | | > kdb4chk1(0x1a5ad6014, 0, ...)
 | | | | | | | | | | | | | | | | > kd4_alloc_ok(0xc9cee30, 0xc9cee80, ...)
 | | | | | | | | | | | | | | | | | > kcbtmal(0, 0x4, ...)
 | | | | | | | | | | | | | | | | | | > kghalf(0x7f225f080c80, 0x7f225f085980, ...)
 | | | | | | | | | | | | | | | | | | | > _intel_fast_memset(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | <> _intel_fast_memset.J(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | > __intel_memset(0x7f225ef1ffb0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | < __intel_memset+0x00000000031e returns: 0x7f225ef1ffb0
 | | | | | | | | | | | | | | | | | | | < _intel_fast_memset.J+0x000000000007 returns: 0x7f225ef1ffb0
 | | | | | | | | | | | | | | | | | | < kghalf+0x00000000061f returns: 0x7f225ef1ffb0

These are the apparent check functions! We also see they take the buffer address as first argument. This is where perf measured the majority of the time being spent (kdr4chk, not in this snippet, and kdb4chk1)! The last noteworthy things in the trace for handling this block are:

 | | | | | | | | | | | < kcbchk_ctx+0x0000000000ff returns: 0
 | | | | | | | | | | | > kgce_fini(0x7f225f080c80, 0x7ffe4ed03b98, ...)
 | | | | | | | | | | | | > skgce_end_ipp_f(0x7f225f080c80, 0x7ffe4ed03b98, ...)
 | | | | | | | | | | | | < skgce_end_ipp_f+0x000000000014 returns: 0
 | | | | | | | | | | | < kgce_fini+0x000000000037 returns: 0
 | | | | | | | | | | < kcbz_encdec_tbsblk+0x000000001f36 returns: 0

Here we see the return from the kcbchk_ctx call, which was identified above as a ‘check function’, then another call to a function in the encryption family (kgce_fini), which clearly indicates it’s stopping the encryption/decryption functions (“fini”), and then returning from the kcbz_encdec_tbsblk function. There are some functions executed after kcbz_encdec_tbsblk, like kcbz_check_objd_typ. A little further you will see the whole sequence starting again for the next block with the function kcbzvb.

This makes me wonder. We saw the block being read from disk into the table_scan_read_buffer, from the read buffer into the cache, then being decrypted, then being checked. I turned db_block_checking and db_block_checksum off, so why would there be a need for checking? But let’s not get carried away; I made some assumptions. Am I right in my assumptions?

If you look at the logical flow it’s:
– table scan read buffer -> buffer cache buffer (kcbz_table_scan_read>_intel_fast_memcpy)
– identify if the block needs decrypting (kcbztek_trace_blk)
– decrypt buffer cache buffer (kcbz_encdec_tbsblk>kgce_dec)
– check buffer cache buffer (kcbz_encdec_tbsblk>kcbchk_ctx)

If I can identify the buffer, stop execution at kcbztek_trace_blk to look at the buffer and confirm it is encrypted, then stop at kcbchk_ctx, look at the same buffer again and see that it is decrypted, then my assumptions that the kgce_dec function does the decryption and that the checking is based on the unencrypted block contents are correct.

In order to do this, we can use gdb: have the function arguments of memcpy printed, break at the function kcbztek_trace_blk, print the buffer, then continue, break at kcbchk_ctx, and print the buffer again. This is the gdb macro I created:

break _intel_fast_memcpy
  commands
    silent
    printf "_intel_fast_memcpy dest: %lx, src: %lx, size: %d\n", $rdi, $rsi, $rdx
    c
  end
break kcbztek_trace_blk
  commands
    silent
    print $rip
  end
break kcbchk_ctx
  commands
    silent
    print $rip
  end

Now create a database session, flush the buffer cache (alter system flush buffer_cache) to make sure the small table t_is_encrypted is read from disk again, and thus decrypted, then attach gdb to its server process and source the macro:

$ gdb -p 57531 -x tde_verification.gdb
...
(gdb) set pagi off
(gdb) c
Continuing.

I turned pagination off (set pagi off) because there is a lot of memcpy going on; otherwise gdb will pause execution whenever the screen is filled. Then I used the continue (c) command so we can use the sqlplus session again. Now execute ‘select * from t_is_encrypted’; this is what it looks like in the debugger:

_intel_fast_memcpy dest: 7f85093a7f0a, src: 7ffd589f6690, size: 7
_intel_fast_memcpy dest: 7f85093a7f11, src: 7ffd589f6624, size: 4
_intel_fast_memcpy dest: 7f85093a7f15, src: 7ffd589f6628, size: 4
_intel_fast_memcpy dest: 7f85093a7f19, src: 7ffd589f662c, size: 4
_intel_fast_memcpy dest: 7f85093a7f1d, src: 7ffd589f6630, size: 4
_intel_fast_memcpy dest: 7f85093bac08, src: 7ffd589f3238, size: 336
_intel_fast_memcpy dest: 7ffd589f0e30, src: 2b7f89c20, size: 456
_intel_fast_memcpy dest: 7ffd589f3238, src: 7f85093bac08, size: 336
$1 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>

This is actually reading the segment header, so we continue (c).

(gdb) c
Continuing.
_intel_fast_memcpy dest: 7f8508bf86a8, src: 7f8508bf8060, size: 712
_intel_fast_memcpy dest: 7f8508bf86c0, src: 7f8508bf8078, size: 264
_intel_fast_memcpy dest: 7f85093bac08, src: 7ffd589eb308, size: 336
_intel_fast_memcpy dest: 7ffd589e8f00, src: 2b7f89c20, size: 456
_intel_fast_memcpy dest: 7ffd589eb308, src: 7f85093bac08, size: 336
_intel_fast_memcpy dest: 1a5576000, src: 2b1115000, size: 8192
_intel_fast_memcpy dest: ee6b8000, src: 2b1117000, size: 8192
_intel_fast_memcpy dest: 12eb54000, src: 2b1119000, size: 8192
_intel_fast_memcpy dest: 200c56000, src: 2b111b000, size: 8192
_intel_fast_memcpy dest: 1a570e000, src: 2b111d000, size: 8192
$2 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>

Here we see kcbztek_trace_blk again, and we see memcpy executed 5 times with a size of 8192. These are the blocks copied from the table scan read buffer! The buffer we are going to look at is 0x1a5576000. In order to examine the block, we print 8192 bytes starting from 0x1a5576000:

(gdb) x/8192bx 0x1a5576000
0x1a5576000:	0x06	0xa2	0x00	0x00	0x8b	0x03	0x00	0x00
0x1a5576008:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x01	0x16
0x1a5576010:	0x3a	0xf2	0x00	0x00	0x30	0x7f	0xf4	0x80
0x1a5576018:	0x5f	0xf1	0xd5	0xe9	0xd0	0x5a	0xeb	0x59
0x1a5576020:	0x5c	0x7c	0x90	0x87	0x21	0x46	0x44	0xf3
0x1a5576028:	0x66	0xa2	0x6a	0x3d	0xd4	0x97	0x6d	0x1e
0x1a5576030:	0xab	0xf8	0x06	0xd4	0x17	0x5c	0xc9	0x1a
0x1a5576038:	0xe6	0x86	0xf4	0x5a	0x20	0xbe	0x8d	0x22
0x1a5576040:	0x0c	0x2a	0x29	0x2b	0xa1	0x2a	0x3b	0x86
0x1a5576048:	0x4b	0x31	0x13	0xa7	0x4c	0xcd	0x4c	0xa4
0x1a5576050:	0xf1	0x1a	0xa3	0x59	0xf0	0x70	0x31	0xd5
...snip...
0x1a5577fb8:	0xf8	0x24	0x90	0x74	0xa3	0x23	0x65	0xf6
0x1a5577fc0:	0xee	0x76	0x6c	0xf5	0x8b	0xaf	0x23	0x7e
0x1a5577fc8:	0xc4	0x5b	0xbd	0x5f	0x26	0xb8	0x5e	0x3b
0x1a5577fd0:	0x65	0x96	0x48	0x88	0xb3	0x37	0x10	0x15
0x1a5577fd8:	0x3f	0xd8	0x62	0x54	0x4f	0xa3	0xf3	0x60
0x1a5577fe0:	0xfc	0x2e	0x9d	0xf4	0xc6	0x49	0xbc	0x4f
0x1a5577fe8:	0x85	0x3b	0x8a	0xd1	0xa3	0xc8	0x61	0x72
0x1a5577ff0:	0xfd	0xb8	0xe1	0x2e	0x65	0x92	0x0d	0x70
0x1a5577ff8:	0x50	0xd8	0x34	0x06	0x01	0x06	0xf4	0x9a

I can’t tell from looking at the hexadecimal numbers whether this block is encrypted or not. However, this block is supposed to contain a single row with ‘AAAAAAAAAA’, which means there should be a lot of empty space. I can’t find a repeating value, and the entire block is filled with values. This might be the result of the block being encrypted. Now, if my assumption is correct, when we continue and execution breaks on kcbchk_ctx, the same buffer should look entirely different, because it will then have been decrypted by kgce_dec:
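
A crude way to quantify “this looks encrypted” is to count zero bytes: a near-empty 8KB heap block is dominated by long runs of zeros, while ciphertext looks uniformly random and contains almost none. A sketch of the idea (the threshold and the two stand-in buffers below are arbitrary, chosen just for illustration; they are not the real block contents):

```python
import os

def looks_encrypted(block, zero_fraction_threshold=0.05):
    """Heuristic: ciphertext has almost no zero bytes (about 1 in 256);
    a near-empty heap block is dominated by them."""
    return block.count(0) / len(block) < zero_fraction_threshold

# stand-ins for the two buffer dumps in this post:
ciphertext_like = os.urandom(8192)                   # dense, no structure
plaintext_like  = bytes(7680) + os.urandom(512)      # mostly zeros + header/row

print(looks_encrypted(ciphertext_like))  # True
print(looks_encrypted(plaintext_like))   # False
```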

(gdb) c
Continuing.
_intel_fast_memcpy dest: 7ffd589eb3f0, src: 2a5e2ced8, size: 240
_intel_fast_memcpy dest: 7ffd589ea860, src: e7ef774, size: 15
_intel_fast_memcpy dest: 7ffd589eaa80, src: 7ffd589eb3c8, size: 16
_intel_fast_memcpy dest: 1a5576014, src: 7f8508ccc908, size: 8168
$3 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
(gdb) x/8192bx 0x1a5576000
0x1a5576000:	0x06	0xa2	0x00	0x00	0x8b	0x03	0x00	0x00
0x1a5576008:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x01	0x06
0x1a5576010:	0xdd	0x46	0x00	0x00	0x01	0x00	0x00	0x00
0x1a5576018:	0xc3	0x7b	0x01	0x00	0xed	0x9a	0x4e	0x00
0x1a5576020:	0x00	0x00	0x00	0x00	0x02	0x00	0x32	0x00
0x1a5576028:	0x88	0x03	0x00	0x00	0x07	0x00	0x1d	0x00
0x1a5576030:	0xd0	0x01	0x00	0x00	0x87	0x09	0x00	0x00
0x1a5576038:	0x90	0x03	0x27	0x00	0x01	0x20	0x00	0x00
0x1a5576040:	0xf4	0x9a	0x4e	0x00	0x00	0x00	0x00	0x00
0x1a5576048:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576050:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576058:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576060:	0x00	0x00	0x00	0x00	0x00	0x01	0x01	0x00
0x1a5576068:	0xff	0xff	0x14	0x00	0x8a	0x1f	0x76	0x1f
0x1a5576070:	0x76	0x1f	0x00	0x00	0x01	0x00	0x8a	0x1f
0x1a5576078:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576080:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5576088:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
...snip...
0x1a5577fa8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fb0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fb8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fc0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fc8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fd0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fd8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fe0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x1a5577fe8:	0x00	0x00	0x00	0x00	0x00	0x00	0x2c	0x01
0x1a5577ff0:	0x01	0x0a	0x41	0x41	0x41	0x41	0x41	0x41
0x1a5577ff8:	0x41	0x41	0x41	0x41	0x01	0x06	0xf4	0x9a

After continuing, we see a couple of memcpy’s, after which the debugger breaks execution when kcbchk_ctx is called. We execute the examine command again, and now the block looks entirely different: there are far more zeroed locations, the different headers are recognisable because of the zeros in between, and the block is all zeros from 0x1a5576078 up to 0x1a5577fe8. Near the end we see a repeating value 0x41. Is that my ‘A’?

(gdb) p (char) 0x41
$2 = 65 'A'

Yes! These are the 10 A’s!
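A rough way to quantify the difference between the two dumps is byte entropy: ciphertext is close to uniformly random (nearly 8 bits per byte), while a mostly empty decrypted block is dominated by zeros. A minimal sketch (random bytes stand in for the encrypted block image, a mostly-zero buffer for the decrypted one; the row-piece bytes are modeled after the tail of the dump above):

```python
import collections, math, os

def shannon_entropy(buf: bytes) -> float:
    """Bits of entropy per byte: ~8.0 for ciphertext, near 0 for sparse blocks."""
    counts = collections.Counter(buf)
    n = len(buf)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Stand-in for the encrypted 8 KiB block image: random bytes.
encrypted_like = os.urandom(8192)

# Stand-in for the decrypted block: almost all zeros, one short row piece
# containing the ten 'A's (0x41) near the end, as in the dump above.
decrypted_like = bytearray(8192)
decrypted_like[-20:-4] = b"\x2c\x01\x01\x0aAAAAAAAAAA\x01\x06"

print(round(shannon_entropy(encrypted_like), 2))        # close to 8.0
print(round(shannon_entropy(bytes(decrypted_like)), 2)) # close to 0.0
```

The near-maximal entropy of the first dump versus the near-zero entropy of the second is a stronger indicator than eyeballing the hex output.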

So, this means that before we enter the kcbchk_ctx function inside kcbz_encdec_tbsblk, the block has already been decrypted; kcbchk_ctx performs something that takes a lot of the time and, based on the function name, appears to be checking. Because the block is already decrypted when we enter kcbchk_ctx, the checking step seems to me to be technically optional.

Further background analysis
I stated that with tablespace encryption the block is decrypted when read into the cache, that non-data blocks are not encrypted, and that once the block is decrypted in the cache, it can be read without any encryption overhead because it is no longer encrypted (the block being decrypted in the cache is actually documented by Oracle). This can actually be demonstrated with a debugger macro:

break pread64
  commands
    silent
    print $rip
    c
  end
break kgce_dec
  commands
    silent
    print $rip
    c
  end
break kcbchk_ctx
  commands
    silent
    print $rip
    c
  end
break kcbz_encdec_tbsblk
  commands
    silent
    print $rip
    c
  end
break kcbztek_trace_blk
  commands
    silent
    print $rip
    c
  end
break kdr4chk
  commands
    silent
    print $rip
    c
  end
break kcbgtcr
  commands
    silent
    print $rip
    c
  end

If you flush the buffer cache, and then attach to a foreground session with gdb and source the above macro, you get the following output:

$55 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$56 = (void (*)()) 0x7fd2c72b8100 <pread64>
$57 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$58 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$59 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$60 = (void (*)()) 0x7fd2c72b8100 <pread64>
$61 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$62 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$63 = (void (*)()) 0x3923750 <kgce_dec>
$64 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$65 = (void (*)()) 0xcdd26a0 <kdr4chk>
$66 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$67 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$68 = (void (*)()) 0x3923750 <kgce_dec>
$69 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$70 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$71 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$72 = (void (*)()) 0x3923750 <kgce_dec>
$73 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$74 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$75 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$76 = (void (*)()) 0x3923750 <kgce_dec>
$77 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$78 = (void (*)()) 0xca1f7f0 <kcbztek_trace_blk>
$79 = (void (*)()) 0x71d5320 <kcbz_encdec_tbsblk>
$80 = (void (*)()) 0x3923750 <kgce_dec>
$81 = (void (*)()) 0xc9ceed0 <kcbchk_ctx>
$82 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$83 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$84 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$85 = (void (*)()) 0xc9d9a20 <kcbgtcr>

Rows 1-4: First we see kcbgtcr (kernel cache buffers get consistent read); as I have made clear in the past, all (normal) reads in the Oracle database start off as a logical read; but since we just flushed the cache, the segment header needs to be read from disk, which explains the pread64. After the single block read, the function kcbztek_trace_blk is executed, but because this is a non-data block, no encryption is applied to it, so there is no decryption to be done.
Rows 5-11: Then the multiblock read is initiated, which is executed via a pread64. After the multiblock read, the blocks are put in the buffer cache. For the first block, we see the kcbztek_trace_blk function being executed, and because this block is encrypted, the function kcbz_encdec_tbsblk is invoked, and then the actual decryption in the function kgce_dec. After performing decryption, the apparent checking function kcbchk_ctx is called, and inside that function kdr4chk is executed (which checks an individual row; we have one row).
Rows 12-15, 16-19, 20-23, 24-27: The other blocks that are read do not contain any data; however, since they are data blocks, you see kcbztek_trace_blk finding the encryption flag, after which kcbz_encdec_tbsblk is called, which calls kgce_dec and then kcbchk_ctx. kdr4chk is not called, because there are no rows.
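The per-block logic deduced from this trace can be sketched as a toy model. The function names mirror the Oracle symbols seen above, but the control flow is my interpretation of the trace, not Oracle’s actual code:

```python
# Toy model of the functions called when a block is read from disk,
# as deduced from the gdb trace above (an interpretation, not Oracle code).
def functions_called_on_read(block):
    trace = ["kcbztek_trace_blk"]              # inspect the cache header flags
    if block["type"] == "trans data" and block["encrypted"]:
        trace.append("kcbz_encdec_tbsblk")     # en/decryption wrapper
        trace.append("kgce_dec")               # the actual decryption
        trace.append("kcbchk_ctx")             # post-decryption checking
        trace.extend("kdr4chk" for _ in range(block["rows"]))  # per-row check
    return trace

# Segment header (non-data block): no decryption.
print(functions_called_on_read({"type": "segment header", "encrypted": False, "rows": 0}))
# Data block with one row: the full chain, including kdr4chk.
print(functions_called_on_read({"type": "trans data", "encrypted": True, "rows": 1}))
# Empty data block: the chain without kdr4chk.
print(functions_called_on_read({"type": "trans data", "encrypted": True, "rows": 0}))
```

This reproduces the three patterns in the trace: the segment header (rows 1-4), the block holding the single row (rows 5-11), and the empty data blocks (rows 12-27).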

However, what is interesting is what will be called when we execute the same full table scan again over the 5 blocks now that they are in the buffer cache:

$86 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$87 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$88 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$89 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$90 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$91 = (void (*)()) 0xc9d9a20 <kcbgtcr>
$92 = (void (*)()) 0xc9d9a20 <kcbgtcr>

This is exactly how it is documented to work! The foreground process can now simply use the consistent read mechanism to read the blocks from the cache; no encryption codepath is called. Please mind that changing blocks in the cache does not incur any additional overhead either, because it works 100% the same as for a non-encrypted block. What does get a performance penalty, however, is writing the blocks to disk (!). The database writer needs to encrypt the blocks before it can store them on disk! This means that DML-intensive databases that switch to tablespace encryption might need a bigger buffer cache, because more time and resources are needed to write blocks to disk, which may lead to more dirty buffers in the cache and thus fewer buffers available for other usage.
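The lifecycle described above can be condensed into a minimal cache model. This is an illustration of the observed behaviour, not Oracle’s implementation: decryption happens once, on the physical read; subsequent logical reads and in-cache changes touch no crypto code; encryption happens only when the database writer writes a dirty block:

```python
# Minimal buffer-cache model of the observed behaviour (illustrative only).
class BufferCacheModel:
    def __init__(self, disk):
        self.disk = disk          # dba -> encrypted on-disk block image
        self.cache = {}           # dba -> decrypted in-memory block image
        self.decryptions = 0
        self.encryptions = 0

    def read(self, dba):
        if dba not in self.cache:     # physical read: decrypt once (kgce_dec)
            self.decryptions += 1
            self.cache[dba] = self.disk[dba]
        return self.cache[dba]        # logical read: no crypto involved

    def change(self, dba, data):
        self.cache[dba] = data        # changing a cached block: no crypto

    def dbwr_write(self, dba):
        self.encryptions += 1         # DBWR must encrypt before the write
        self.disk[dba] = self.cache[dba]

bc = BufferCacheModel({1: "ciphertext"})
bc.read(1); bc.read(1); bc.read(1)    # one physical read, two logical reads
print(bc.decryptions)                 # -> 1
bc.change(1, "new data")
bc.dbwr_write(1)
print(bc.encryptions)                 # -> 1
```

Three reads of the same block cost exactly one decryption, and the crypto cost of DML surfaces only at write time, which is why the buffer cache sizing remark above matters.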

Also mind that the tablespace encryption functions are called as part of the IO codepath:

#0  0x00000000071d5320 in kcbz_encdec_tbsblk ()
#1  0x00000000012e8e7c in kcbzibmlt ()
#2  0x00000000012dd9d3 in kcbzib ()
#3  0x000000000c9dbda9 in kcbgtcr ()
#4  0x000000000c9c0ac5 in ktrget2 ()
#5  0x000000000ccbc786 in kdstgr ()
#6  0x000000000cd0a0e9 in qertbFetch ()
#7  0x000000000cb6e42b in opifch2 ()
#8  0x00000000021e49f3 in kpoal8 ()
#9  0x000000000cb75a5d in opiodr ()
#10 0x000000000cda421b in ttcpip ()
#11 0x0000000001baba3c in opitsk ()
#12 0x0000000001bb0401 in opiino ()
#13 0x000000000cb75a5d in opiodr ()
#14 0x0000000001ba7b8b in opidrv ()
#15 0x00000000026b31b1 in sou2o ()
#16 0x0000000000bbda7a in opimai_real ()
#17 0x00000000026bd8fc in ssthrdmain ()
#18 0x0000000000bbd9bc in main ()

#0: kcbz_encdec_tbsblk is the master encryption routine.
#1: kcbzibmlt is the kernel cache space management insert block multiblock routine.
#2: kcbzib is the kernel cache space management insert block routine.
#3: kcbgtcr is the consistent read call.
#4: ktrget2 is the kernel transaction layer buffer get routine.
#5: kdstgr is the kernel data scan table get rows routine (full table scan).
#6: qertbFetch is the SQL execution rowsource tablescan fetch routine.
In other words: the decryption is in the IO codepath; it is executed as part of the IO routines.

Update April 4, 2016:
Further investigation showed the checking (entering the kcbchk_ctx function) CAN be eliminated when the undocumented parameter “_disable_block_checking” is set to FALSE.
There is a bug logged for this issue: 12943305 (“decryption performance can be improved”), which has a fix. This is reported to be fixed in 12.2, and patches are available for the different 12.1.0.2 PSUs.
