Archive

Tag Archives: oracle

There are many situations where you want to use a very specific configuration of the Oracle database, for example when a client has an issue and is still on EL5, or gets disk errors on a filesystem that is ext3, or is using ASM and gets weird IO patterns. Other examples are: you want to test the newest PSU to see if responds differently to an issue you are working on, or you want to test a combination of the Oracle database version 11.2.0.3 and grid infrastructure 12.1.0.2.

Of course you can just go and install a virtual machine, install all the different bits and pieces. Doing so manually kills vast amounts of time. By doing that, you will end up with a lot of virtual machines, for which at a certain point in time you have to make a decision to remove some of these.

Also a lot of people use a (virtual) machine with a couple of database versions installed, and test on these. In that case you sometimes have to ignore details like filesystemASM, or specific PSU level, it’s hard to keep that updated, but when a client case is in a lower version, in general you don’t go back in PSU level (although not impossible). One thing I ran into frequently is that it’s easy to get caught in side effects because of changes and settings made for earlier test cases (often underscore parameters).

This blogpost introduces my project ‘vagrant-builder’ which allows you to build a virtual machine with Oracle and optionally clusterware installed in any version you specify. The provisioning will download all software and patches (except for the 12.2.0.1 media, which needs to be provided in the ‘files’ directory) fully automatic for you. These are the options:

Linux version:
Oracle linux version 5, 6 or 7 (limited by boxes build by the box-cutter project).
The Actual versions currently existing are ol5.11, ol6.6/7/8, ol7.0/1/2/3. I am awaiting the boxcutter project to produce ol6.9 and ol7.4.

Filesystems:
Filesystem types for u01 and for oradata (when no ASM is used): xfs, ext4, ext3.

Kernel:
Oracle linux 5: latest redhat kernel, latest UEK2 kernel.
Oracle linux 6: any exadata kernel version (if made available on public-yum), latest redhat/UEK2/UEK3/UEK4 kernel.
Oracle linux 7: latest redhat kernel, latest UEK3 or UEK 4 kernel.

ASM:
No ASM install.
12.2.0.1 no patches, PSU: 170620, 170718, 170814
12.1.0.2 no patches, PSU: 1, 2, 3, 4, 5, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.4 no patches, PSU: 1, 2, 3, 4, 5, 6, 7, 8, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814

Database:
No database install.
12.2.0.1 no patches, PSU: 170620, 170718, 170814
12.1.0.2 no patches, PSU: 1, 2, 3, 4, 5, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.4 no patches, PSU: 1, 2, 3, 4, 5, 6, 7, 8, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.3 PSU 15 only.
11.2.0.2 PSU 12 only.

Database:
By specifying a database name, a database will be created with that name. Of course the dictionary part of the patching will be applied to the database!

How does this work? This works using the combination of the following pieces of software:
– Virtualbox
– Vagrant
– Ansible
Plus the vagrant-builder repository: https://gitlab.com/FritsHoogland/vagrant-builder

If you don’t have Virtualbox, Vagrant or Ansible installed, follow the installation procedure in this blog article; it’s a bit older, so versions of the software components will be higher, you should simply install the latest versions. There is quite an important caveat (sadly): Ansible in principle does not run on windows. You can made it working on windows by using Cygwin, but officially it doesn’t support windows. If you can get the provisioning using Ansible to fully work on windows please share how you did that.

Once you got all the software components installed, another thing you might want to do first is to move your default virtual box directory to a place where you got enough space to hold virtual machines.

Then, clone the vagrant-builder repository into a directory (git clone https://gitlab.com/FritsHoogland/vagrant-builder.git myvm, for example), go into that directory and edit the Vagrantfile to set:
– hostonly_network_ip_address
– mos username & password
– database_name (if you want a database)
– linux (choose one by removing the hash sign in front of it)
– kernel
– asm_version (set a version if you want clusterware “siha” and ASM, if a database_version is set and asm_version is empty, you get a filesystem based database)
– database_version (set a version if you want the database software to be installed)
– vm_cpus (number of CPUs visible/made available to the VM)
– vm_memory (amount of memory made available ot the VM)
– vm_hostname (if you want multiple VMs, you need multiple vm_hostnames set!)
– perl_l4cache_workaround (if you got a newer CPU with a level 4 cache, set this to Y (yes), otherwise set this to N (no))

Save the changes, and startup the virtual machine: ‘vagrant up’. This will pull the operating system image, add a disk for the database, startup linux, setup and configure linux, download the database and grid software version (except for version 12.2.0.1, for which the installation media needs to be staged in the files dictory), install it, download the patches, install these and create a database, without manual intervention.

Advertisements

This post is about the decision the Oracle database engine makes when it is using a full segment scan approach. The choices the engine has is to store the blocks that are physically read in the buffercache, or read the blocks into the process’ PGA. The first choice is what I refer to as a ‘buffered read’, which places the block in the database buffercache so the process itself and other processes can bypass the physical read and use the block from the cache, until the block is evicted from the cache. The second choice is what is commonly referred to as ‘direct path read’, which places the blocks physically read into the process’ PGA, which means the read blocks are stored for only a short duration and is not shared with other processes.

There are some inherent performance aspects different between a buffered and a direct path read. A buffered read can only execute a single physical read request for a single range of blocks, wait for that request to finish, fetch and process the result of the physical read request after which it can execute the next physical read request. So there is maximum of one outstanding IO for multiple (adjacent) Oracle blocks. A direct path read works differently, it submits two physical IO requests, each for a distinct range of Oracle blocks asynchronously, after which it waits one or more IOs to finish. If an IO is returned, it is processed, and an IO for another range of Oracle blocks is submitted to restore the number of IOs in flight to two. If the database engine determines (based upon a non-disclosed mechanism) that enough resources are available it can increase the amount of IO physical IO requests in flight up to 32. Other differences include a maximum for the total size of the IO request, which is 1MB for buffered requests, and 32MB for direct path requests (which is achieved by setting db_file_multiblock_read_count to 4096).

At this point should be clear that there are differences between buffered and direct path reads, and when full segment scans switch from direct path reads to buffered reads it could mean a significant performance difference. On top of this, if your database is using Exadata storage, this decision between buffered reads and direct path reads is even more important. Only once the decision for direct path reads has been made, an Exadata smartscan can be executed. I have actually witnessed cases where a mix of partitioning and HCC lead to the situation that the partitions were so small that a direct path read was not chosen, which meant a smartscan was not considered anymore, meaning that instead of the cells decompressing the compressed blocks all in parallel, the process now had to fetch them and do the decompression on the database layer.

There have been some posts on the circumstances of the decision. However, I have seen none that summarise the differences for the different versions. In order to investigate the differences between the different Oracle versions, I created a git repository at gitlab: https://gitlab.com/FritsHoogland/table_scan_decision. You can easily use the repository by cloning it: git clone https://gitlab.com/FritsHoogland/table_scan_decision.git, which will create a table_scan_decision directory in the current working directory.

Oracle version 11.2.0.2.12
Please mind this version is very old, and SHOULD NOT BE USED ANYMORE because it’s not an actively supported version. However, I do use this version, because this version has different behaviour than the versions that follow.

First determine the small table threshold of the database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531

Let’s create tables just below and just over 1531 blocks/small table threshold:

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1531 1531
...
    BLOCKS
----------
      1664

So the small table threshold is 1531, this means that an internal statistic that is used for determining using the direct path mechanism, medium table threshold will be approximately 1531*5=7655. Let’s create tables just below and just over that number of blocks:

TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

For the other versions, trace event ‘nsmtio’ can be used to learn how the decision is made. However, this trace event does not exist in Oracle version 11.2.0.2. The workaround is to just execute a SQL trace and interpret the wait events. For a full table scan, the wait events ‘db file scattered read’ means a buffered read is done, and wait events ‘direct path read’ means a direct path read was done (obviously).

TS@test > alter session set events 'sql_trace level 8';
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1531';
TS@test > select count(*) from table_1531;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that in my case, with Oracle version 11.2.0.2, the switching point is at 5 times _small_table_threshold.

Oracle 11.2.0.3.15
This version too should NOT BE USED ANYMORE because it is not in active support. This too is for reference.
Small table threshold for this database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531

With the small table threshold being 1531, the medium table threshold should be approximately 1531*5=7655.

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1440 1440
...
    BLOCKS
----------
      1536
TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

Flush buffer cache and set trace events, and test the scans. By doing that I ran into something peculiar with the ‘nsmtio’ event in this version (11.2.0.3 with the latest PSU). This event does exist for this version (which you can validate by running ‘oradebug doc component’), however, it does not yield any output. This means I have to revert to the previous method of running sql_trace at level 8 and interpret the wait events.

TS@test > alter session set events 'trace[nsmtio]:sql_trace level 8'; -- no NSMTIO lines, only sql_trace!
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1440';
TS@test > select count(*) from table_1440;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that with Oracle version 11.2.0.3, the direct path read switching point seems to have moved from 5 times small table threshold to small table threshold itself.

Oracle 11.2.0.4.170718
This version is in active support!
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1538

With the small table threshold being 1538, the medium table threshold should be approximately 1538*5=7690.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1538 1538
...
    BLOCKS
----------
      1664
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7690 7690
...
    BLOCKS
----------
      7808

Flush buffer cache and set trace events, and test the scans.

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7693 blocks),
-- main event: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1538';
SQL> select count(*) from table_1538;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14422, objn: 14422
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14423, objn: 14423
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7690';
SQL> select count(*) from table_7690;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14424, objn: 14424
-- main event: direct path read

This shows that with Oracle version 11.2.0.4, the direct path read switching is at small table threshold, which was changed starting from 11.2.0.3.

Oracle version 12.1.0.2.170718
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1440

SQL>

With small table threshold being 1440, the medium table threshold is approximately 1440*5=7200.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7203 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20489, objn: 20489
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20490, objn: 20490
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20491, objn: 20491
-- main events: direct path read

This is in line with the switch in version 11.2.0.3 to small table threshold as the switching point between buffered reads and direct path reads.

Oracle 12.2.0.1.170814
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1444

SQL>

With small table threshold being 1444, the medium table threshold is approximately 1444*5=7220.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1504 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 7048 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 22502, objn: 22502
-- main events: direct path read

Hey! With 12.2.0.1 the direct path read switching point reverted back to pre-11.2.0.3 behaviour of switching on 5 times small table threshold instead of small table threshold itself.

Update!
Re-running my tests shows differences in the outcome between buffered and direct path reads. My current diagnosis is that the scan type determination uses a step based approach:

– The first determination of size is done with ‘NSMTIO: kcbism’ (kcb is medium). If islarge is set to 1, it means the segment is bigger than STT. If islarge is set to 0 it means the segment is smaller than STT, and the segment will be read buffered, and the line ‘qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown in the NSMTIO output.

– The next line is 'NSMTIO: kcbimd' (kcb is medium determination?) It shows the size of the segment (nblks), STT (kcbstt), MTT (kcbpnb) and is_large, which in my tests always is set to 0. Here, there are 4 options that I could find:

1) Segment size between STT and MTT and a buffered read is executed.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, which probably is externalised in the line 'NSMTIO: kcbcmt1'. The outcome can be a buffered read, for which the line 'qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown.

2) Segment size between STT and MTT and the direct path code path is chosen.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, probably externalised in the line 'NSMTIO: kcbcmt1'. If the costing determines it would be beneficial to use a direct path mechanism, it seems it switches to the direct path with cache determination code, which is also used for any table scan that is smaller than VLOT. Because of switching to that code, it will determine if the segment is bigger than VLOT: 'NSMTIO: kcbivlo', which of course in this case isn't true. Then, it will show the line 'NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]'

3) Segment size bigger than MTT but smaller than VLOT.
If the segment is between MTT and VLOT, the Oracle engine does not apply the costing mechanism (which is means the kcbcmt1 line is not shown). It will determine if the segment is bigger than VLOT ('NSMTIO: kcbivlo'), and then show 'NSMTIO: qertbFetch:[MTT VLOT]’, and there is no kcbdpc to analyse choosing doing a buffered or direct path read.

4) Segment size bigger than VLOT.
If the segment is bigger than VLOT, the Oracle engine execute the functions kcbimd and kcbivlo, the NSMTIO line for kcbivlo will show is_large 1 to indicate it’s a very large object (VLOT by default is ‘500’, which is 5 times the total number of buffers in the buffer cache. The qertbFetch line will say ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’, and there is no kcbdpc to analyse choosing doing a buffered or direct path read.

In the cases where ‘NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]' is shown, which is either a segment between STT and MTT which switched to this code path, or between MTT and VLOT, the code will apply a second determination and potential switching point from buffered to direct path or vice versa, which is shown with the line 'kcbdpc' (kcb direct path check). The outcome can be:

– NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ] to indicate it will use a buffered read.
– NSMTIO: kcbdpc:DirectRead to indicate it will use a direct path read.

I have verified the above 'decision tree' in 11.2.0.2, 11.2.0.3, 11.2.0.4, 12.1.0.2 and 12.2.0.1. It all seems to work this way consistently. I derived this working by looking at the NSMTIO tracing of 12.2, and then gone back in version. You will see that going lower in versions, there is lesser (nsmtio) tracing output; 11.2.0.4 does show way lesser information, for example, it does not show the kcbcmt1 line, and of course 11.2.0.3 and 11.2.0.2 do not show NSMTIO lines altogether. In order to verify the working, I used gdb and quite simply breaked on the kcbism, kcbimd, kcbcmt1, kcbivlo and kcbdpc functions in the versions where this information was missing in the trace.

Still, at the kcbcmt1 point:
– 11.2.0.2 seems to quite consistently take MTT as the direct path switching point.
– 11.2.0.3-12.1.0.2 seem to quite consistently take STT as the direct path switching point.
– 12.2.0.1 varies.

Conclusion.
This article first explained the differences between buffered and direct path reads, and why this is important, and that it is even more important with Exadata for smartscans.

The next part shows how to measure the switching point. The most important message from this blog article is that starting from 11.2.0.3 up to 12.1.0.2 the direct path read switching point is small table threshold, and with Oracle database version 12.2.0.1, the direct path switching point is changed back to pre-11.2.0.3 behaviour which means 5 times the small table threshold of the instance.
The next part shows measurements of the switching point. The addition shows that between STT and MTT there is a cost based decision to go direct path or buffered path. Once the direct path is chosen, it still can go buffered if the majority of the blocks are in the cache.

If you look closely at the output of the nsmtio lines for version 11.2.0.3-12.1.0.1 for tables that had a size between small table threshold and medium table threshold, it seemed a bit weird, because the nsmtio trace said ‘[MTT < OBJECT_SIZE < VLOT]', which to me means that Oracle detected the object size to be between medium table threshold and very large object threshold, which was not true. I can't tell, but it might be a bug that is solved for measuring the wrong size.
The text description in the NSMTIO qertbFetch line is bogus, it simply is a code path; ‘[- STT < OBJECT_SIZE < MTT]' means it's a buffered read, and could be chosen when < STT or in between STT and MTT, '[MTT < OBJECT_SIZE < VLOT]' means it's a direct path read, and could be chosen when in between STT and MTT or MTT and VLOT.

I added the scripts and examples of the tracing events so you can measure this yourself in your environment.

This post shows you how to install Rundeck with the Ansible plugin on Centos 7. The installation is done with nginx as the web server and using SSL with a self signed certificate. Please read the Ansible installation script, and modify anything that should be different for your situation. You will be amazed how well readable the installation script is!

Rundeck is a web based user interface that allows you to run commands against a group of hosts. Rundeck has an ansible plugin. Using that plugin, it could perform the similar functionality as Ansible Tower (commercial product) or Semaphore (open source).

After a fresh installation of Centos 7, do the following as root:

yum -y localinstall https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install ansible git
git clone https://gitlab.com/FritsHoogland/install_rundeck.git

This adds the EPEL (extra packages for Enterprise Linux) repository to Centos, which contains ansible and git. The next yum command installs ansible (the orchestration software) and git (distributed version control). The git clone command pulls the ansible orchestration script from my install_rundeck git repository at gitlab.

It is important to open the install_rundeck/install_rundeck.yml script and modify anything that is different in your situation. The public hostname will most likely be different than the 192.168.66.16 in the script. You might want to change the locality information with the certificate generation (unless you live in Lutjebroek like me :-). If you do a corporate installation, you might want to skip this step altogether and get a certificate pair signed by your company’s CA.

Please mind I ran into issues with ansible inventories:
– The hosts in the inventory need to have ansible run against them to pick up their properties and become visible in rundeck in the nodes tab. For being able to have ansible run against the hosts in the inventory, the host need to exist in the ssh known hosts file of the rundeck user, otherwise ansible can’t run and the host or hosts will not be visible in rundeck. The solution is to log in as the rundeck user and logon to the hosts in the inventory once manually and accept the host or hosts. From a security perspective it’s understandable that you careful need to assess the hosts to trust, but from an automation standpoint this is annoying. Outside of essentially filling out the known hosts file as I described, there are several other workarounds.
– I created an ansible inventory file in the rundeck project directory and entered the hosts in it. Rundeck picked up the hosts (after solving the above point they became visible in the nodes tab), however when executing something using ansible via rundeck it would say ‘[WARNING]: provided hosts list is empty, only localhost is available’. This means ansible was not pointed explicitly to an inventory, so it used the default one. In order to solve this, I symlinked my (rundeck) project inventory to the /etc/ansible/hosts to make it centrally available. Apparently, using a central inventory for ansible using the plugin is by design. I would rather have rundeck generate an inventory per execution, and pointing to it when the plugin executes ansible.

Now install rundeck:

ansible-playbook install_rundeck/install_rundeck.yml

Done!

Actually, this is a follow up post from my performance deep dive into tablespace encryption. After having investigated how tablespace encryption works, this blogpost is looking at the other encryption option, column encryption. A conclusion that can be shared upfront is that despite they basically perform the same function, the implementation and performance consequences are quite different.

Column encryption gives you the ability to choose to encrypt per individual column, that’s kind of obvious. However, having to choose which columns to encrypt is what I see as the biggest downside of this encryption option. In most cases, especially with boxed applications, it is quite hard to try to figure out which columns you exactly want to encrypt in order to protect your sensitive data. Which columns do exactly contain your primary sensitive data, and which columns do contain secondary sensitive data (data derived from sensitive data). Do you, when you have to apply encryption, know what EXACTLY is defined as sensitive data, and what isn’t? I bet there isn’t a clear technical description.

A logical reaction then would be ‘couldn’t I then just encrypt all columns’? Well, that is what tablespace encryption is for, isn’t it? To summarise this: I do think the correct use of column encryption in reality is hard to implement and this very limited in usefulness, in most cases tablespace encryption should be used.

Okay…for this test I created a table with two columns, of which one is encrypted:

SQL> create table column_encryption (id number, a varchar2(10) encrypt);
SQL> insert into column_encryption values (1, 'AAAAAAAAAA');
SQL> commit;

The same table, but without encryption:

SQL> create table no_column_encryption (id number, a varchar2(10) );
SQL> insert into no_column_encryption values (1, 'AAAAAAAAAA');
SQL> commit;

And the same table with a lot of rows:

SQL> create table column_encryption_large (id number, a varchar2(10) encrypt);
SQL> begin
 	     for counter in 1..32000000 loop
 		     insert into column_encryption_large values ( counter, dbms_random.string('l',10) );
 	     end loop;
 end;
/

Let’s follow the path of the previous TDE post, and profile the execution of a SQL on the big table to see the impact of column encryption. The first test is a ‘select count(*) from column_encryption_large’ in one session, and ‘perf record -g -p PID’ in another. If you need more explanation on how to run it, please look at the previous blogpost. This is the output of ‘perf report –sort comm –max-stack 2’:

# perf report --sort comm --max-stack 2
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 1418165467
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |--29.21%-- kdstf00000010000100kmP
            |--12.58%-- kdbulk
            |--3.32%-- gup_pte_range
            |--2.58%-- kdst_fetch0
            |--2.54%-- kcbgtcr
            |--2.25%-- __blk_bios_map_sg
            |--2.21%-- kcbhvbo
            |--2.18%-- unlock_page
            |--1.98%-- ktrgcm
            |--1.93%-- do_direct_IO
            |--1.86%-- kcbldrget
            |--1.52%-- kcoapl

This shows IO related functions, both Oracle and operating system level; kdstf is kernel data scan table full for example, gup_pte_range, do_direct_IO, unlock_page and __blk_bios_map_sg are Linux kernel functions. Most notably there are no encryption related functions, which is a big difference with tablespace encryption!
This is actually very logical if you understand the differences between column encryption and tablespace encryption. First let’s look at a block dump from a data block from segment in an encrypted tablespace:

Block dump from cache:
Dump of buffer cache at level 4 for pdb=0 tsn=5 rdba=907
Block dump from disk:
Encrypted block <5, 907> content will not be dumped. Dumping header only.
buffer tsn: 5 rdba: 0x0000038b (1024/907)
scn: 0x0.4e9af4 seq: 0x01 flg: 0x16 tail: 0x9af40601
frmt: 0x02 chkval: 0xf23a type: 0x06=trans data

Yes…you read that right: the block is encrypted, so it will not be dumped. Luckily, you can set the undocumented parameter “_sga_clear_dump” to true to make Oracle dump the block:

SQL> alter session set "_sga_clear_dump"=true;
SQL> alter system dump datafile 5 block 907;

This will make Oracle dump the block. The dump will show the decrypted version of the tablespace level encrypted block:

Block header dump:  0x0000038b
 Object id on Block? Y
 seg/obj: 0x17bc3  csc: 0x00.4e9aed  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x388 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0007.01d.000001d0  0x00000987.0390.27  --U-    1  fsc 0x0000.004e9af4
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x0000038b
data_block_dump,data header at 0x7f140f335374
===============
tsiz: 0x1f98
hsiz: 0x14
pbl: 0x7f140f335374
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f8a
avsp=0x1f76
tosp=0x1f76
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f8a
block_row_dump:
tab 0, row 0, @0x1f8a
tl: 14 fb: --H-FL-- lb: 0x1  cc: 1
col  0: [10]  41 41 41 41 41 41 41 41 41 41
end_of_block_dump

For the count(*), there is no need to read the data, the only thing needed is to read the row directory to fetch the number of rows (row 19). However, to do that, the block must be decrypted.

Now look at a block dump of a column encrypted data block:

Block header dump:  0x0000032b
 Object id on Block? Y
 seg/obj: 0x1821d  csc: 0x00.676d7e  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x328 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x000a.007.000078a9  0x00000117.2246.07  --U-    1  fsc 0x0000.00676d7f
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x0000032b
data_block_dump,data header at 0x7f140f333264
===============
tsiz: 0x1f98
hsiz: 0x14
pbl: 0x7f140f333264
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f5d
avsp=0x1f49
tosp=0x1f49
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f5d
block_row_dump:
tab 0, row 0, @0x1f5d
tl: 59 fb: --H-FL-- lb: 0x1  cc: 2
col  0: [ 2]  c1 02
col  1: [52]
 fd e0 87 66 55 f7 e6 43 de be 31 f6 71 4f 7f 4e f1 75 fb 88 98 9d 13 ed 8e
 cb 69 02 bc 29 51 bd 21 ea 22 04 6b 70 e9 ec 01 9d d6 e4 5a 84 01 1d 90 b0
 e9 01
end_of_block_dump

The block and the row directory can be read normally without any need for decryption. The only thing encrypted is the column (“a”). That perfectly explains the absence of any functions that indicate decryption, because there isn’t any decryption taking place!

Now let’s rewrite the SQL to touch the data, and thus involve decryption: ‘select avg(length(a)) from column_encryption_large’. This way the row needs to be decrypted and read. This is how the output of a perf recording looks like:

# perf report --sort comm --max-stack 2
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 65K of event 'cycles'
# Event count (approx.): 229042607170
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |--24.73%-- ztchsh1h
            |--14.91%-- ztchsh1n
            |--6.10%-- y8_ExpandRijndaelKey
            |--5.90%-- ownGetReg
            |--5.50%-- __intel_ssse3_rep_memcpy
            |--4.99%-- ztchsh1f
            |--4.28%-- ztcxi
            |--2.60%-- ipp_is_GenuineIntel
            |--1.88%-- _intel_fast_memcpy
            |--1.74%-- _intel_fast_memcpy.P
            |--1.52%-- kspgip
            |--1.16%-- kgce_init

The functions starting with ‘ztc’ are probably related to security (“zecurity”), and also probably related to decryption. The function name “y8_ExpandRijndaelKey” is clearly related to encryption. When you look up the function address of “ownGetReg”, it’s close to the “y8_ExpandRijndaelKey” function. The last group of functions are memcpy related functions, that seems consistent with decrypting: moving data.

On the performance side, it’s clear that the majority of the time is spend in the functions ztchsh1h and ztchsh1n. In order to understand more about these functions, let’s expand the stack:

# perf report --sort comm
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 65K of event 'cycles'
# Event count (approx.): 229035032972
#
# Children      Self  Command
# ........  ........  ...............
#
   100.00%   100.00%  oracle_6919_aob
            |
            |--25.01%-- ztchsh1h
            |          |
            |          |--99.63%-- ztchsh1n
            |          |          |
            |          |          |--50.85%-- ztchsh1f
            |          |          |          ztchf
            |          |          |          ztcxf
            |          |          |          ztcx
            |          |          |          kztsmohmwl
            |          |          |          kztsmhmwl
            |          |          |          kzekmetc
            |          |          |          kzecsqen
            |          |          |          kzecctex
            |          |          |          evaopn2
            |          |          |          evaopn2
            |          |          |          qesaAggNonDistSS
            |          |          |          kdstf00001010000000km
            |          |          |          kdsttgr
            |          |          |          qertbFetch
            |          |          |          qergsFetch
            |          |          |          opifch2
            |          |          |          kpoal8
------------------------------------------------------
            |--14.90%-- ztchsh1n
            |          |
            |          |--85.25%-- ztchsh1f
            |          |          ztchf
            |          |          ztcxf
            |          |          ztcx
            |          |          kztsmohmwl
            |          |          kztsmhmwl
            |          |          kzekmetc
            |          |          kzecsqen
            |          |          kzecctex
            |          |          evaopn2
            |          |          evaopn2
            |          |          qesaAggNonDistSS
            |          |          kdstf00001010000000km
            |          |          kdsttgr
            |          |          qertbFetch
            |          |          qergsFetch
            |          |          opifch2
            |          |          kpoal8

I fetched the stack of the two functions in which the most time was spend. The most important thing to see is that the encryption now takes place as part of processing the fetched data (qesaAggNonDistSS probably has something to do with aggregating data, evaopn2 probably is a function to evaluate operands) rather than performing the (logical) IO; mind the absence of the kcbgtcr function.

The reason for doing the decryption during operand evaluation rather than during doing the IO is because the data is stored encrypted in the block, and thus also in the buffer cache. So during IO time, there is no need to decrypt anything. This also has another rather important consequence: for every row that has an encrypted column that is processed, decryption needs to take place. There does not seem to be any caching of the decrypted value for column encryption, which is logical from a security point of view, but has a severe performance consequence.

When doing a pin tools debugtrace on the above SQL for the processing of a single row (the table ‘column_encryption’, rather than ‘column_encryption_large’), applying the sed filters, and then grepping for a selective set of functions, this is how the processing looks like:

 | | | | > qergsFetch(0x294512030, 0x7f871c9fa2f0, ...)
 | | | | | > qeaeAvg(0x7f8717ce9968, 0xe, ...)
 | | | | | < qeaeAvg+0x000000000063 returns: 0  | | | | | > qertbFetch(0x294512178, 0x7f871ca08a68, ...)
 | | | | | | | | | | > kcbgtcr(0x7ffe2f9b3ae0, 0, ...)
 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > pread64@plt(0x100, 0x1f428c000, ...)
 | | | | | | | | | | < kcbgtcr+0x000000003221 returns: 0x1f428c014  | | | | | | | | | | | | | | > kcbgtcr(0x7ffe2f9b35d0, 0, ...)
 | | | | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1f428c014  | | | | | | > kdsttgr(0x7f871c9f9918, 0, ...)
 | | | | | | | > kdstf00001010000000km(0x7f871c9f9918, 0, ...)
 | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | > pread64@plt(0x100, 0x2b1115000, ...)
 | | | | | | | | | | | < kcbgtcr+0x000000003221 returns: 0x1e4aa6014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1e4aa6076
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1e4aa6076  | | | | | | | | > qesaAggNonDistSS(0x7ffe2f9b45d0, 0x7fff, ...)
 | | | | | | | | | > evaopn2(0x294511ef0, 0x294512030, ...)
 | | | | | | | | | | > evaopn2(0x294511e68, 0x294512030, ...)
 | | | | | | | | | | | | | | | | | | | > ztchsh1n(0x7ffe2f9b1ef8, 0x11c4e8d0, ...)
 | | | | | | | | | | | | | | | | | | | > ztchsh1f(0x7ffe2f9b1ef8, 0x7ffe2f9b3100, ...)
 --> 168 times in total of ztchsh1n or ztchsh1f
 | | | | | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2c0  | | | | | | | | | | > evalen(0x294511ef0, 0x7f871c9fa2c0, ...)
 | | | | | | | | | | < evalen+0x000000000147 returns: 0x2
 | | | | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2d0  | | | | | | | | | > qeaeAvg(0x7f8717ce9968, 0xb, ...)
 | | | | | | | | | < qeaeAvg+0x000000000063 returns: 0x7f8717ce99c9
 | | | | | | | | < qesaAggNonDistSS+0x000000000193 returns: 0x7fff  | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1dec30014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1dec30072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1dec30072  | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1deca4014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1deca4072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1deca4072  | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1e4be0014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1e4be0072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1e4be0072  | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | | | > kcbgtcr(0x7f871c9f9930, 0, ...)
 | | | | | | | | | | | < kcbgtcr+0x0000000009a1 returns: 0x1dedb2014
 | | | | | | | | | < kdst_fetch0+0x0000000004d0 returns: 0x1dedb2072
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0x1dedb2072  | | | | | | | | > kdst_fetch(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | > kdst_fetch0(0x1, 0x7f871c9f9918, ...)
 | | | | | | | | | < kdst_fetch0+0x0000000011c9 returns: 0
 | | | | | | | | < kdst_fetch+0x000000000048 returns: 0
 | | | | | | | < kdstf00001010000000km+0x00000000035d returns: 0x7fff
 | | | | | | < kdsttgr+0x00000000085f returns: 0x7fff
 | | | | | < qertbFetch+0x000000001301 returns: 0x7fff  | | | | | > qeaeAvg(0x7f8717ce9968, 0x294511f78, ...)
 | | | | | < qeaeAvg+0x000000000063 returns: 0x2  | | | | | | > evaopn2(0x294511f78, 0, ...)
 | | | | | | < evaopn2+0x0000000002dc returns: 0x7f871c9fa2e0
 | | | | < qergsFetch+0x000000000f25 returns: 0

This is how the explain plan of the ‘select avg(length(a)) from column_encryption’ SQL:

----------------------------------------------------------------------------------------
| Id  | Operation	   | Name	       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |		       |       |       |     3 (100)|	       |
|   1 |  SORT AGGREGATE    |		       |     1 |    53 |	    |	       |
|   2 |   TABLE ACCESS FULL| COLUMN_ENCRYPTION |     1 |    53 |     3	 (0)| 00:00:01 |
----------------------------------------------------------------------------------------

If you look back to the grepped debugtrace, and the execution plan:
Line 1: the sort aggregate rowsource (qergsFetch).
Line 4: the table acces full (qertbFetch).
Line 5: here a logical read (kcbgtcr) is issued, and because the block didn’t exist in the cache, was physically read (line 6: pread64). This is the segment header, the “real” scan of data blocks has not been started yet.
Line 11: this is the ultra fast full table scan (kdstf00001010000000km). My guess is this function is a full table scan function with certain decisions hard coded, instead of choices made on runtime, so the in-CPU execution prediction runs into less branch mispredictions.
Line 12: this is the part of the full table scan for fetching (loading) the data (kdst_fetch). What is special here is that a multiblock read is done, the kcbgtcr function triggers a single physical read for multiple logical blocks, which are later fetched per block (kdst_fetch and kcbgtcr functions starting from line 32, 38, etc).
Line 19: this function executes row based functions and aggregates the results per block/fetch (qesaAggNonDistSS).
Line 20: as part of fetching the row and executing functions, the row value is evaluated first (evaopn2).
Line 21/22: here the column is decrypted (made visible by the ztchsh1n/ztchsh1f functions, not necessarily the decryption functions theirselves).
Line 26/29: here probably the length (evalen) and average (qeaeAvg) row functions are executed.
Line 32: the next block is processed, but no rows are found, and thus no need to execute rowsource (qe*) functions afterwards.

So, what do we know at this point regarding column encryption?
– Columns that are encrypted are stored encrypted in the block in the buffer cache.
– Which means they have to be decrypted every time the column values are read, which is different from tablespace encryption, for which a block is encrypted, and is decrypted whenever a block is read into the buffer cache.
– Functions related to column encryption specifically (different functions than seen with tablespace encryption) take roughly 40% of the time in my case.

Can the time spend on column decryption be optimised?
There are multiple ways you can change the way Oracle applies column encryption. There are four encryption types: 3DES168, AES128, AES192 and AES256. The default is AES192.
Here are query timings of doing a select avg(length(a)) from TABLE on my “large” table with 32 million rows:

3DES168 4:53
AES256 1:09
AES192 1:06
AES128 1:03

A way to optimise column encryption is to specify not to use an extra verification by specifying ‘nomac’ at the encryption definition of the column. This saves space (by default, extra space is used for the message abstract that is used by the verification for every column). These are the timings with the ‘nomac’ addition to disable encryption verification:

3DES168 3:59
AES256 0:26
AES192 0:23
AES128 0:22

This shows a significant reduction of time. However, if no encryption at all is applied to the column, the query timing is 0:03.

Internals background information
The functions ztchsh1n/ztchsh1 are related to verification (again, if you read the tablespace encryption blogpost, where the most time consuming functions were verification too). Once ‘nomac’ is specified with the encryption definition of the column, the ztchsh* function vanish, and the top time consuming functions are y8_ExpandRijndaelKey and ownGetReg, which clearly are directly related to decryption. The performance gain of ‘nomac’ is lesser with 3DES168 encryption.

Conclusion
I think tablespace encryption is the encryption method of choice for a normal implementation. In most cases it will be too much work to exactly figure out which columns to encrypt. If you still consider column encryption, you also should be aware that the column value is stored encrypted in the block and (as a consequence) in the cache. Every use of the encrypted column involves encryption or decryption, for which the overhead is significant, even with ‘nomac’ specified to disable (additional) verification.

Recently, I was trying to setup TDE. Doing that I found out the Oracle provided documentation isn’t overly clear, and there is a way to do it in pre-Oracle 12, which is done using ‘alter system’ commands, and a new-ish way to do it in Oracle 12, using ‘administer key management’ commands. I am using version 12.1.0.2.170117, so decided to use the ‘administer key management’ commands. This blogpost is about an exception which I see is encountered in the Januari 2017 (170117) PSU of the Oracle database, which is NOT happening in Oracle 12.2 (no PSU’s for Oracle 12.2 at the time of writing) and Oracle 12.1.0.2 April 2016 and October 2016 PSU’s.

In order to test the wallet functionality for TDE, I used the following commands:

SQL> select status, wrl_parameter from v$encryption_wallet;

STATUS
------------------------------
WRL_PARAMETER
--------------------------------------------------------------------------------
NOT_AVAILABLE
/u01/app/oracle/admin/test/wallet

SQL> !mkdir /u01/app/oracle/admin/test/wallet

SQL> administer key management create keystore '/u01/app/oracle/admin/test/wallet' identified by "this_is_the_keystore_password";

keystore altered.

SQL> administer key management set keystore open identified by "this_is_the_keystore_password";

keystore altered.

SQL> administer key management set key identified by "this_is_the_keystore_password" with backup;
administer key management set key identified by "this_is_the_keystore_password" with backup
*
ERROR at line 1:
ORA-28374: typed master key not found in wallet

SQL> select status, wrl_parameter from v$encryption_wallet;

STATUS
------------------------------
WRL_PARAMETER
--------------------------------------------------------------------------------
CLOSED
/u01/app/oracle/admin/test/wallet

SQL> administer key management set keystore open identified by "this_is_the_keystore_password";

keystore altered.

SQL> select status, wrl_parameter from v$encryption_wallet;

STATUS
------------------------------
WRL_PARAMETER
--------------------------------------------------------------------------------
OPEN
/u01/app/oracle/admin/test/wallet

Notes:
Line 1-10: The DB_UNIQUE_NAME of the instance is ‘test’, and therefore the default wallet location is /u01/app/oracle/admin/test/wallet (ORACLE_BASE/admin/DB_UNIQUE_NAME/wallet). The wallet directory doesn’t exist by default, so I created it (line 10).
Line 12: Here the keystore/wallet is created with a password.
Line 16: After the wallet is created without auto-login, the wallet must be opened using the ‘set keystore open’ command.
Line 20: After the wallet has been created, it does not contain a master key. This is done using the ‘set key’ command. However, this throws an ORA-28374 error.
Line 26: After an error involving the wallet has occurred, the wallet closes.
Line 35: The wallet can simply be opened using the earlier used ‘set keystore open’ command.
Line 39: This is where the surprise is: after opening, the master key “magically” appeared (visible by the status ‘OPEN’, without a master key this would be ‘OPEN_NO_MASTER_KEY’).

I yet have to start creating encrypted table spaces. There might be more surprises, I can’t tell at this moment because I didn’t try it. However, once I discovered this oddity, I talked to my colleague Matt who gave me his own runbook for enabling TDE, which turned out to be the exact same list of commands as I compiled, however he did not encounter the ORA-28374 which I did. I tested the same sequence of commands on 12.2.0.1, 12.1.0.2.161018 (October 2016) and 12.1.0.2.160419 (April 2016) and there the ORA-28374 was not raised during execution of the ‘set key’ command.

Update!
Reading through My Oracle Support note Master Note For Transparent Data Encryption ( TDE ) (Doc ID 1228046.1), I found the following text:

All the versions after 12.1.0.2

=====================

As of 12.1.0.2 If the key associated with the SYSTEM, SYSAUX or UNDO tablespaces is not present in the wallet you cannot associate a new master key with the database (i.e. you cannot activate that master key for the database) unless you set a hidden parameter :

SQL> administer key management use key ‘AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA’ identified by “welcome1” with backup;
administer key management use key ‘AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA’ identified by “welcome1” with backup
*
ERROR at line 1:
ORA-28374: typed master key not found in wallet

alter system set “_db_discard_lost_masterkey”=true;

SQL> administer key management use key ‘AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA’ identified by “welcome1” with backup;

The heading and first line read weird, the heading indicates the paragraph is about ‘all the versions after 12.1.0.2’ (which to me means 12.2), and the first line in the paragraph says ‘as of 12.1.0.2’, which very clearly says this is about version 12.1.0.2 and higher. However, a little further it shows the exact error (ORA-28374) I encountered, and explains that if a current key is used in the data dictionary (mind data dictionary, not wallet), you must set “_db_discard_lost_masterkey” to true before you can create and use another master key for a wallet if you start over (wipe or move the wallet directory).

This makes sense to me now! I tried dropping and creating new wallets in my current 170117 PSU instance, and only tried creating an encryption wallet in a brand new freshly created instance. So if I would have EXACTLY done the same in the instances with the other PSU’s, which is repeatedly create and drop a wallet for TDE, I would have encountered the same ORA-28374 error. Well…I see this as a safety mechanism, be it not a very obvious one, not exuberant documented, and probably causing more grief than it would save if you run into the need the change the master key.

When sifting through a sql_trace file from Oracle version 12.2, I noticed a new wait event: ‘PGA memory operation’:

WAIT #0x7ff225353470: nam='PGA memory operation' ela= 16 p1=131072 p2=0 p3=0 obj#=484 tim=15648003957

The current documentation has no description for it. Let’s see what V$EVENT_NAME says:

SQL> select event#, name, parameter1, parameter2, parameter3, wait_class 
  2  from v$event_name where name = 'PGA memory operation';

EVENT# NAME                                  PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
------ ------------------------------------- ---------- ---------- ---------- ---------------
   524 PGA memory operation                                                   Other

Well, that doesn’t help…

Let’s look a bit deeper then, if Oracle provides no clue. Let’s start with the strace and sql_trace combination. For the test, I am doing a direct path full table scan on a table. Such a scan must allocate a buffer for the results (direct path reads do not go into the buffercache, table contents are scanned to the PGA and processed from there).

TS@fv122b2 > alter session set events 'sql_trace level 8';

Session altered.

Now use strace to look at the system calls in another session:

# strace -e write=all -e all -p 9426
Process 9426 attached
read(9,

Now execute ‘select count(*) from t2’. The output is rather verbose, but the important bits are:

io_submit(140031772176384, 1, {{data:0x7f5ba941ffc0, pread, filedes:257, buf:0x7f5ba91cc000, nbytes:106496, offset:183590912}}) = 1
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000
lseek(7, 0, SEEK_CUR)                   = 164639
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 112) = 112
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 37 38  30 20 70 31 3d 32 30 39   ela= 780 p1=209 |
 | 00040  37 31 35 32 20 70 32 3d  31 31 31 34 31 31 32 20  7152 p2=1114112  |
 | 00050  70 33 3d 30 20 6f 62 6a  23 3d 32 32 38 33 33 20  p3=0 obj#=22833  |
 | 00060  74 69 6d 3d 31 39 35 31  37 30 32 30 35 36 36 0a  tim=19517020566. |
...
munmap(0x7f5ba8fbd000, 2097152)         = 0
munmap(0x7f5ba91bd000, 2097152)         = 0
mmap(0x7f5ba949d000, 65536, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x2ce000) = 0x7f5ba949d000
lseek(7, 0, SEEK_CUR)                   = 183409
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 100) = 100
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 35 39  32 20 70 31 3d 30 20 70   ela= 592 p1=0 p |
 | 00040  32 3d 30 20 70 33 3d 30  20 6f 62 6a 23 3d 32 32  2=0 p3=0 obj#=22 |
 | 00050  38 33 33 20 74 69 6d 3d  31 39 35 32 30 36 33 33  833 tim=19520633 |
 | 00060  36 37 34 0a                                       674.             |

Okay, we can definitely say the mmap() and munmap() system calls seem to be related, which makes sense if you look a the name of the wait event. Let’s look a bit more specific using a systemtap script:

global wait_event_nr=524
probe begin {
	printf("begin.\n")
}

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthbwt") {
	if ( pid() == target() && register("rdx") == wait_event_nr )
		printf("kskthbwt - %d\n", register("rdx"))
}
probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr )
		printf("kskthewt - %d\n", register("rsi"))
}
probe syscall.mmap2 {
	if ( pid() == target() )
		printf(" mmap, addr %x, size %d, protection %d, flags %d, fd %i, offset %d ", u64_arg(1), u64_arg(2), int_arg(3), int_arg(4), s32_arg(5), u64_arg(6))
}
probe syscall.mmap2.return {
	if ( pid() == target() )
		printf("return value: %x\n", $return)
}
probe syscall.munmap {
	if ( pid() == target() )
		printf(" munmap, addr %x, size %d\n", u64_arg(1), u64_arg(2))
}

Short description of this systemtap script:
Lines 6-9: This probe is triggered once the function kskthbwt is called. This is one of the functions which are executed when the wait interface is called. The if function on line 7 checks if the process specified with -x with the systemtap executable is the process calling this function, and if the register rdx contains the wait event number. This way all other waits are discarded. If the wait event is equal to wait_event_nr, which is set to the wait event number 524, which is ‘PGA memory operation’, the printf() function prints kskthbwt and the wait event number. This is simply to indicate the wait has started.
Lines 10-13: This probe does exactly the same as the previous probe, except the function is kskthewt, which is one of the functions called when the ending of a wait event is triggered.
Line 14-17: This is a probe that is triggered when the mmap2() system call is called. Linux actually uses the second version of the mmap call. Any call to mmap() is silently executed as mmap2(). Inside the probe, the correct process is selected, and the next line simply prints “mmap” and the arguments of mmap, which I picked from the CPU registers. I do not print a newline.
Line 18-21: This is a return probe of the mmap2() system call. The function of this probe is to pick up the return code of the system call. For mmap2(), the return code is the address of the memory area mapped by the kernel for the mmap2() call.
Line 22-25: This is a probe on munmap() system call, which frees mmap’ed memory to the operating system.
Please mind there are no accolades following the if statements, which means the code executed when the if is true is one line following the if. Systemtap and C are not indention sensitive (like python), I indented for the sake of clarity.

I ran the above systemtap script against my user session and did a ‘select count(*) from t2’ again:

# stap -x 9426 mmap.stp
begin.
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 750 return value: 7f5ba91bd000
 mmap, addr 7f5ba91bd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba91bd000
kskthewt - 524
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1262 return value: 7f5ba8fbd000
 mmap, addr 7f5ba8fbd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba8fbd000
kskthewt - 524
kskthbwt - 524
 munmap, addr 7f5ba8fbd000, size 2097152
 munmap, addr 7f5ba91bd000, size 2097152
kskthewt - 524

This makes it quite clear! The event ‘PGA memory operation’ is called when mmap() and munmap() are called. Which are calls to allocate and free memory for a process. The file descriptor (fd) value is set to -1, which means no file is mapped, but anonymous memory.

Another interesting thing is shown: first mmap is called with no address given, which makes the kernel pick a memory location. This memory location is then used for a second mmap call at the same memory address. The obvious question for this is: why mmap two times?

To answer that, we need to look at the flags of the two calls. Here is an example:

mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000

The first mmap call asks the kernel for a chunk of memory. PROT_READ and PROT_WRITE mean the memory should allow reading and writing. MAP_PRIVATE means it’s not public/shared, which is logical for Oracle PGA memory. MAP_ANONYMOUS means the memory allocation is not backed by a file, so just an allocation of contiguous memory. MAP_NORESERVE means no swap space is reserved for the allocation. This means this first mapping is essentially just a reservation of the memory range, no physical memory pages are allocated.

The next mmap call maps inside the memory allocated with the first mmap call. This seems strange at first. If you look closely at the flags, you see that MAP_NORESERVE is swapped for MAP_FIXED. The reason for this strategy to make it easier for the Oracle database to allocate the memory allocations inside a contiguous chunk of (virtual) memory.

The first mmap call allocates a contiguous (virtual) memory area, which is really only a reservation of a memory range. No memory is truly allocated, hence MAP_NORESERVE. However, it does guarantee the memory region to be available. The next mmap allocates a portion of the allocated range. There is no MAP_NORESERVE which means this allocation is catered for for swapping in the case of memory shortage. This mapping does use a specific address, so Oracle can use pointers to refer to the contents, because it is certain of the memory address. Also, the MAP_FIXED flag has a side effect, which is used here: any memory mapping done to the address range is silently unmapped from the first (“throw away”) mapping.

Let’s look a bit deeper into the wait event information. For this I changed the probe for function kskthewt in the systemtap script in the following way:

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr ) {
		ksuse = register("r13")-4672
		ksuseopc = user_uint16(ksuse + 2098)
		ksusep1 = user_uint64(ksuse + 2104)
		ksusep2 = user_uint64(ksuse + 2112)
		ksusep3 = user_uint64(ksuse + 2120)
		ksusetim = user_uint32(ksuse + 2128)
		printf("kskthewt - wait event#: %u, wait_time:%u, p1:%lu, p2:%lu, p3:%lu\n", ksuseopc, ksusetim, ksusep1, ksusep2, ksusep3)
	}
}

When running a ‘select count(*) from t2’ again on a freshly started database with a new process with the changed mmap.stp script, this is how the output looks like:

kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 753 return value: 7f1562330000
 mmap, addr 7f1562330000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562330000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:1114112, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:28, p1:2097152, p2:1114112, p3:0

This looks like the size of memory allocated with the first mmap call for the PGA memory reservation is put in p1, and the size of the allocation of the second “real” memory allocation is put in p2 of the ‘PGA memory operation’ event. One thing that does look weird, is the memory is not unmapped/deallocated (this is a full execution of a SQL, allocated buffers must be deallocated?

Let’s look what happens when I execute the same SQL again:

kskthbwt - 524
 munmap, addr 7f1562130000, size 2097152
 mmap, addr 7f15623b0000, size 589824, protection 0, flags 16434, fd -1, offset 881 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:253, p1:0, p2:0, p3:0
kskthbwt - 524
 mmap, addr 7f15623b0000, size 589824, protection 3, flags 50, fd -1, offset 0 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:35, p1:589824, p2:0, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:0, p3:0

Ah! It looks like some memory housekeeping is not done during the previous execution, but is left for the next execution, the execution starts with munmap(), followed by a mmap() call. The first munmap() call deallocates 2 megabyte memory chunk. The next mmap() call is different from the other mmap() calls we have seen so far; we have seen a “throw away”/reservation mmap() call with the memory address set to 0 to let the operating system pick an address for the requested memory chunk, and a mmap() call to truly allocate the reserved memory for usage, which had a memory address set. The mmap() call following munmap() has a memory address set. However, protection is set to 0; this means PROT_NONE, which means the mapped memory can not be read and written. Also the flags number is different, flags 16434 translates to MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE. As part of releasing PGA memory, it seems some memory is reserved. The wait event parameters are all zero. When p1, p2 and p3 are all zero, it seems to indicate munmap() is called. As we just have seen, memory could be reserved. Also, when p1/2/3 are all zero there is no way to tell how much memory is freed, nor which memory allocation.

The next wait is the timing of a single mmap() call. Actually, the mmap() call allocates the previous mmaped memory, but now with protection set to 3 (PROT_READ|PROT_WRITE), which means the memory is actually usable. The p1 value is the amount of memory mmaped.

The last wait is a familiar one, it is the mmap() call with memory address set to zero, as reservation, and another mmap() call to allocate memory inside the previous “reserved” memory. However, the p1/2/4 values are now NOT set in the same way as we saw earlier: only p1 is non zero, indicating the size of the first mmap() call. Previously, p1 and p2 were set to the sizes of both mmap() calls.

Conclusion:
With Oracle version 12.2 there is a new wait event ‘PGA memory operation’. This event indicates memory is allocated or de-allocated. Until now I only saw the system calls mmap() and munmap() inside the ‘PGA memory operation’.

To me, ‘cloud computing’ is renting a compute resource to perform a task. In order to use that compute resource, you need to instruct it to do something, which is typically done via the network. If the task the compute resource needs to fulfil is being an application server or being a client or both in the case of an application server that uses an Oracle database, the network latency between the client of the database and the database server is a critical property.

I think so far everybody is with me. If we zoom in to the network, it becomes more difficult, and *very* easy to make wrong assumptions. Let me explain. A network, but really any connection between processing and a resource, has two DIFFERENT properties that I see getting mixed up consistently. These are:
* Latency: the time it takes for a signal or (network) packet to travel from the client to the server, or the time it takes to travel from the client to the server and back.
* Bandwidth: the amount of data that can be transported from the client to the server in a certain time.

How do you determine the latency of a network? Probably the most people respond with ‘use ping’. This is how that looks like:

[user@oid1 ~]$ ping -c 3 lsh1
PING lsh1 (x.x.x.x) 56(84) bytes of data.
64 bytes from lsh1: icmp_seq=1 ttl=62 time=680 ms
64 bytes from lsh1: icmp_seq=2 ttl=62 time=0.304 ms
64 bytes from lsh1: icmp_seq=3 ttl=62 time=0.286 ms

The question I often ask myself is: what is that we see actually? How does this work?
In order to answer that question, the tcpdump tool can answer that question. Using tcpdump, you can capture the network packets on which the ping utility based the above outcome. The ‘-ttt’ option calculates the time between each arrived packet:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host lsh1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1 > lsh1: ICMP echo request, id 35879, seq 1, length 64
00:00:00.680289 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 1, length 64
00:00:00.319614 IP oid1 > lsh1: ICMP echo request, id 35879, seq 2, length 64
00:00:00.000287 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 2, length 64
00:00:01.000180 IP oid1 > lsh1: ICMP echo request, id 35879, seq 3, length 64
00:00:00.000269 IP lsh1 > oid1: ICMP echo reply, id 35879, seq 3, length 64

So, ping works by sending a packet (ICMP echo request) requesting a reply (ICMP echo reply) from the remote server, and measure the time it takes to get that reply. Great, quite simple, isn’t it? However, the biggest issue I see this is using a protocol that is not used for sending regular data (!). Most application servers I encounter send data using TCP (transmission control protocol), the traffic ping sends are sent using a protocol called ICMP (internet control message protocol). Especially in the cloud, which means (probably) a lot of the infrastructure is shared, ICMP might be given different priority than TCP traffic, which you quite probably are using when the application on your cloud virtual machine is running. For those of you who haven’t looked into the network side of the IT landscape, you can priorise protocols and even specific ports, throttle traffic and you can even terminate it. In fact, a sensible protected (virtual) machine in the cloud will not respond to ICMP echo requests in order to protected it from attacks.

So, what would be a more sensible approach then? A better way would be to use the same protocol and port number that your application is going to use. This can be done using a tool called hping. Using that tool, you can craft your own packet with the protocol and flags you want. In the case of Oracle database traffic that would be the TCP protocol, port 1521 (it can be any port number, 1521 is the default port). This is how you can do that. In order to mimic starting a connection, the S (SYN) flag is set (-S), one packet is send (-c 1) to port 1521 (-p 1521).

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip

What this does is best investigated with tcpdump once again. The server this is executed against can respond in two ways (three actually). When you send this to TCP port 1521 where a listener (or any other daemon that listens on that port) is listening, this is the response:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [S], seq 1436552830, win 512, length 0
00:00:00.001229 IP db01-vip.ncube-lm > oid1.kjtsiteserver: Flags [S.], seq 2397022511, ack 1436552831, win 14600, options [mss 1460], length 0
00:00:00.000023 IP oid1.kjtsiteserver > db01-vip.ncube-lm: Flags [R], seq 1436552831, win 0, length 0

This is a variation of the classic TCP three way handshake:
1. A TCP packet is sent with the SYN flag set to indicate starting a (client to server) connection.
2. A TCP packet is sent back with SYN flag set to indicate starting a (server to client) connection, and the first packet is acknowledged.
3. This is where the variation is, normally an acknowledgement would be sent of the second packet to establish a two way connection, but in order to stop the communication a packet is sent with the RST (reset) flag set.

However, this is if a process is listening on the port. This is how that looks like when there is no process listening on port 1521:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.vsamredirector > db01.ncube-lm: Flags [S], seq 1975471906, win 512, length 0
00:00:00.001118 IP db01.ncube-lm > oid1.vsamredirector: Flags [R.], seq 0, ack 1975471907, win 0, length 0

This means that if a connection is initiated to a port on which no process is listening (port status ‘closed’), there is communication between the client and the server. This is why firewalls were invented!
1. A TCP packet is sent with the SYN flag set to indicate starting a connection.
2. A TCP packet is sent back to with the RST (reset) flag set to indicate no connection is possible.

The third option, when port 1521 is firewalled on the server, simply means only the first packet (from client to server with the SYN flag set) is sent and no response is coming back.

Okay, let’s pick up the performance aspect again. This hping command:

[user@oid1 ~]$ sudo hping -S -c 1 -p 1521 db01-vip
HPING db01-vip (eth0 x.x.x.x): S set, 40 headers + 0 data bytes
len=44 ip=db01-vip ttl=57 DF id=0 sport=1521 flags=SA seq=0 win=14600 rtt=1.2 ms

Says the roundtrip time is 1.2ms. If we look at the network packets and timing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01-vip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.mmcal > db01-vip.ncube-lm: Flags [S], seq 1289836562, win 512, length 0
00:00:00.001113 IP db01-vip.ncube-lm > oid1.mmcal: Flags [S.], seq 2504750542, ack 1289836563, win 14600, options [mss 1460], length 0
00:00:00.000016 IP oid1.mmcal > db01-vip.ncube-lm: Flags [R], seq 1289836563, win 0, length 0

It becomes apparent that the 1.2ms time hping reports is the time it takes for the remote server to send back the SYN+ACK package in the TCP three way handshake.

So does that mean that if we take a number of measurements (let’s say 100, or 1000) to have a statistically significant number of measurements we can establish my TCP roundtrip time and then know how fast my connection will be (outside of all the other variables inherent to the internet and potential noisy neighbours to name a few)?

Oracle provides a way to generate and measure SQL-Net traffic in My Oracle Support note: Measuring Network Capacity using oratcptest (Doc ID 2064368.1). This note provides a jar file which contains server and client software, and is aimed at dataguard, but is useful to measure SQL-Net network latency. I have looked at the packets oratcptest generates, and they mimic SQL-Net quite well.

Let’s see if we can redo the test above to measure pure network latency. First on the database server side, setup the server:

[user@db01m ~]$ java -jar oratcptest.jar -server db01 -port=1521

And then on the client side run the client using the same oratcptest jar file:

java -jar oratcptest.jar db01 -mode=sync -length=0 -duration=1s -interval=1s -port=1521

The important bits are -mode=sync (client packet must be acknowledged before sending another packet) and -length=0 (network traffic contains no payload). This is the result:

[Requesting a test]
	Message payload        = 0 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(07:34:42) The server is ready.
                        Throughput                 Latency
(07:34:43)          0.017 Mbytes/s                0.670 ms
(07:34:43) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.017 Mbytes/s
	             Avg. latency = 0.670 ms

If you look at the hping roundtrip time (1.2ms) and the oratcptest roundtrip time (0.7ms) clearly this is different! If you just look at the numbers (1.2 versus 0.7) it might seem like the oratcptest time is only measuring client to server traffic instead of the whole roundtrip? For this too it’s good to use tcpdump once again and look what oratcptest actually is doing:

[user@oid1 ~]$ sudo tcpdump -ttt -i any host db01
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
00:00:00.000000 IP oid1.63602 > db01.ncube-lm: Flags [S], seq 2408800085, win 17920, options [mss 8960,sackOK,TS val 3861246405 ecr 0,nop,wscale 7], length 0
00:00:00.001160 IP db01.ncube-lm > oid1.63602: Flags [S.], seq 2178995555, ack 2408800086, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:00:00.000015 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 1, win 140, length 0
00:00:00.023175 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 1:145, ack 1, win 140, length 144
00:00:00.000520 IP db01.ncube-lm > oid1.63602: Flags [.], ack 145, win 123, length 0
00:00:00.000951 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 1:145, ack 145, win 123, length 144
00:00:00.000008 IP oid1.63602 > db01.ncube-lm: Flags [.], ack 145, win 149, length 0
00:00:00.018839 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 145:157, ack 145, win 149, length 12
00:00:00.000563 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 145:149, ack 157, win 123, length 4
00:00:00.000358 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 157:169, ack 149, win 149, length 12
00:00:00.000486 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 149:153, ack 169, win 123, length 4
00:00:00.000100 IP oid1.63602 > db01.ncube-lm: Flags [P.], seq 169:181, ack 153, win 149, length 12
00:00:00.000494 IP db01.ncube-lm > oid1.63602: Flags [P.], seq 153:157, ack 181, win 123, length 4
...
00:00:00.000192 IP oid1.63586 > db01.ncube-lm: Flags [P.], seq 18181:18193, ack 6157, win 149, length 12
00:00:00.000447 IP db01.ncube-lm > oid1.63586: Flags [P.], seq 6157:6161, ack 18193, win 123, length 4
00:00:00.006696 IP oid1.63586 > db01.ncube-lm: Flags [F.], seq 18193, ack 6161, win 149, length 0
00:00:00.000995 IP db01.ncube-lm > oid1.63586: Flags [F.], seq 6161, ack 18194, win 123, length 0
00:00:00.000012 IP oid1.63586 > db01.ncube-lm: Flags [.], ack 6162, win 149, length 0

If you look at rows 4, 5 and 6 you see the typical TCP three-way handshake. What is nice to see, is that the actual response or roundtrip time for the packet from the server on line 5 actually took 1.1ms, which is what we have measured with hping! At lines 7-10 we see there is a packet send from the client to the server which is ACK’ed and a packet send from the server to the client which is ACK’ed. If you add the ‘-A’ flag to tcpdump you can get the values in the packet printed as characters, which shows the client telling the server how it wants to perform the test and the server responding with the requested settings. This is all a preparation for the test.

Starting from line 11, there is a strict repeating sequence of the client sending a packet of length 12, ACK’ing the previous received packet, and then the server responding with a packet of length 4 ACK’ing its previous received packet. This is the actual performance test! This means that the setting ‘-duration=1s -interval=1s’ does not mean it sends one packet, it actually means it’s continuously sending packets for the duration of 1 second. Also another flag is showing: the P or PSH (push) flag. This flag means the kernel/tcpip-stack understands all data to transmit is provided from ‘userland’, and now must be sent immediately, and instructs the receiving side to process it immediately in order to bring it to the receiving userland application as soon as possible too.

Lines 20-22 show how the connection is closed by sending a packet with a FIN flag, which is done for both the client to the server and the server to the client, and because it’s TCP, these need to be ACK’ed, which is why you see a trailing packet without a flag set, only ACK’ing the FIN packet.

The conclusion so far is that for real usable latency calculations you should not use a different protocol (so whilst ICMP (ping) does give an latency indication it should really only be used as an indicator), and that you should measure doing the actual work, not meta-transactions like the TCP three way handshake. Probably because of the PSH flag, the actual minimal latency for SQL-Net traffic is lower than ping and hping showed.

Wait a minute…did you notice the ‘actual minimal latency’? So far we only have been sending empty packets, which means we measured how fast a packet can travel from client to server and back. In reality, you probably want to send actual data back and forth, don’t you? That is something that we actually have not measured yet!

Let’s do actual Oracle transactions. For the sake of testing network latency, we can use Swingbench to execute SQL. This is how that is done:

[user@oid1 bin]$ cd ~/sw/swingbench/bin
[user@oid1 bin]$ ./charbench -c ../configs/stresstest.xml -u soe -p soe -uc 1 -rt 00:01
Author  :	 Dominic Giles
Version :	 2.5.0.971

Results will be written to results.xml.
Hit Return to Terminate Run...

Time		Users	TPM	TPS

8:22:56 AM      1       14450   775

Please mind I am using 1 user (-uc 1) and a testing time of 1 minute (-rt 00:01), which should be longer when you are doing real testing. As a reminder, I am using 1 session because I want to understand the latency, not the bandwidth! In order to understand if the network traffic looks the same as oratcptest.jar, I can use tcpdump once again. Here is a snippet of the traffic:

...
00:00:00.000106 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5839:5852, ack 5986, win 272, length 13
00:00:00.000491 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 5986:6001, ack 5852, win 330, length 15
00:00:00.000234 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 5852:6003, ack 6001, win 272, length 151
00:00:00.000562 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6001:6077, ack 6003, win 330, length 76
00:00:00.000098 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6003:6016, ack 6077, win 272, length 13
00:00:00.000484 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6077:6092, ack 6016, win 330, length 15
00:00:00.000238 IP oid1.50553 > db01-vip.ncube-lm: Flags [P.], seq 6016:6159, ack 6092, win 272, length 143
00:00:00.000591 IP db01-vip.ncube-lm > oid1.50553: Flags [P.], seq 6092:6425, ack 6159, win 330, length 333
...

The important bit is this shows the same single packet traffic client to server and back as we saw oratcptest generated, however now with varying packet size (which is logical, different SQL statements are sent to the database), the PSH bit is set, which also is the same as oratcptest generated.

Let’s assume this is a real-life workload. In order to measure and calculate differences in performance between different networks, we need the average packet length. This can be done with a tool called tcpstat (this link provides the EL6 version). In my case I have only one application using a database on this server, so I can just filter on port 1521 to measure my SQL-Net traffic:

[user@oid1 ~]$ sudo tcpstat -i eth0 -o "Packet/s=%p\tmin size: %m\tavg size: %a\tmax size: %M\tstddev: %d\n" -f 'port 1521'
Packet/s=2526.40	min size: 53	avg size: 227.76	max size: 1436	stddev: 289.21
Packet/s=2531.40	min size: 53	avg size: 229.79	max size: 1432	stddev: 291.22
Packet/s=2634.20	min size: 53	avg size: 229.59	max size: 1432	stddev: 293.38
Packet/s=2550.00	min size: 53	avg size: 234.11	max size: 1435	stddev: 296.77
Packet/s=2486.80	min size: 53	avg size: 232.24	max size: 1436	stddev: 293.16

In case you wondered why tcpstat reports a minimum length of 53 and tcpdump (a little up in the article) of 13; tcpstat reports full packet length including packet, protocol and frame headers, tcpdump in this case reports the payload length.

Now we can execute oratcptest.jar again, but with a payload size set that matches the average size that we measured, I have taken 250 as payload size:

[user@oid1 ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(09:39:47) The server is ready.
                        Throughput                 Latency
(09:39:48)          0.365 Mbytes/s                0.685 ms
(09:39:48) Test finished.
	       Socket send buffer = 11700 bytes
	          Avg. throughput = 0.365 Mbytes/s
	             Avg. latency = 0.685 ms

As you can see, there is a real modest increase in average latency going from 0.670ms to 0.685ms.

In order to test the impact of network latency let’s move the oratcptest client to the server, to get the lowest possible latency. Actually, this is very easy, because the oratcptest.jar file contains both the client and the server, so all I need to do is logon to the server where I started the oratcptest.jar file in server mode, and run it in client mode:

[user@db01m ~]$ java -jar oratcptest.jar db01 -mode=sync -length=250 -duration=1s -interval=1s -port=1521
[Requesting a test]
	Message payload        = 250 bytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = NO
	Statistics interval    = 1 second
	Test duration          = 1 second
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)

(14:49:29) The server is ready.
                        Throughput                 Latency
(14:49:30)         12.221 Mbytes/s                0.020 ms
(14:49:30) Test finished.
	       Socket send buffer = 26010 bytes
	          Avg. throughput = 11.970 Mbytes/s
	             Avg. latency = 0.021 ms

Wow! The roundtrip latency dropped from 0.685ms to 0.021ms! Another test using oratcptest.jar using a true local network connection (with Linux being virtualised using Xen/OVM) shows a latency of 0.161ms.

These are the different network latency figures measured with oratcptest using a payload size that equals my average network payload size:
– Local only RTT: 0.021
– Local network RTT: 0.161
– Different networks RTT: 0.685

If I take swingbench and execute the ‘stresstest’ run local, on a machine directly connected via the local network and across different networks (think cloud), and now measure TPS (transactions per second), I get the following figures:
– Local only TPS: 2356
– Local network TPS: 1567
– Different networks TPS: 854

Do these figures make sense?
– Local only: Time not in network transit per second: 1000-(0.021*2356)=950.524; approximate average time spend on query: 950.523/2356=0.40ms
– Local network: 1000-(0.161*1567)=747.713/1567=0.48ms
– Different networks: 1000-(0.685*854)=415.010/854=0.49ms
It seems that this swingbench test spends roughly 0.40-0.50ms on processing, the difference in transactions per second seem to be mainly caused by the difference in network latency.

%d bloggers like this: