One of the most important configuration settings for running an Oracle database is making appropriate use of memory. Sizing the memory regions too small leads to increased IO; sizing them too big leads to inefficient use of memory and an increase in memory latency, most notably because of swapping.

On Linux, there is a fair amount of memory information available. However, it is not obvious how to use that information, which frequently leads to inefficient use of memory, especially in today’s world of consolidation.

The information about Linux server memory usage is available in /proc/meminfo, and looks like this:

$ cat /proc/meminfo
MemTotal:        3781616 kB
MemFree:          441436 kB
MemAvailable:    1056584 kB
Buffers:             948 kB
Cached:           625888 kB
SwapCached:            0 kB
Active:           500096 kB
Inactive:         447384 kB
Active(anon):     320860 kB
Inactive(anon):     8964 kB
Active(file):     179236 kB
Inactive(file):   438420 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       1048572 kB
SwapFree:        1048572 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:        320644 kB
Mapped:           127900 kB
Shmem:              9180 kB
Slab:              45244 kB
SReclaimable:      26616 kB
SUnreclaim:        18628 kB
KernelStack:        3312 kB
PageTables:         6720 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1786356 kB
Committed_AS:     767908 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       13448 kB
VmallocChunk:   34359721984 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
CmaTotal:          16384 kB
CmaFree:               4 kB
HugePages_Total:    1126
HugePages_Free:     1126
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       65472 kB
DirectMap2M:     4128768 kB

No matter how experienced you are, it’s not easy to get a good overview just by fetching this information. The point is that these figures are not individual memory areas which you can simply add up to arrive at the total memory used by Linux. Not even all figures are in kB (kilobyte): the HugePages values are in number of pages.

In fact, there is no absolute truth that I can find that gives a definite overview. Here is a description of what I think are the relevant memory statistics in /proc/meminfo:

MemFree: memory not being used, which should be low after a certain amount of time. Linux strives to use as much memory as possible for something useful, most notably as cache. If this number remains high, memory is being used ineffectively.
KernelStack: memory being used by the Linux kernel for kernel stacks.
Slab: memory being used by the kernel for caching data structures.
SwapCached: memory being used as a cache for memory pages being swapped in and out.
Buffers: memory being used as an IO buffer for disk blocks, not page caching, and should be relatively low.
PageTables: memory used for virtual to physical memory address translation.
Shmem: memory allocated as small pages shared memory.
Cached: memory used for caching pages.
Mapped: memory allocated for mapping a file into an address space.
AnonPages: memory allocated for mapping memory that is not backed by a file (“anonymous”).
Hugepagesize: the size for huge pages blocks. Valid choices with current modern intel Xeon CPUs are 2M or 1G. The Oracle database can only use 2M HugePages on linux.
HugePages_Total: total number of pages explicitly allocated as huge pages memory.
HugePages_Rsvd: number of huge pages that have been reserved for an allocation but are not yet actually allocated (and thus are still reported as free).
HugePages_Free: total number of pages available as huge pages memory, includes HugePages_Rsvd.

Based on information in several blogposts and experimenting with the figures, I came up with this formula to get an overview of used memory. This is not an all-conclusive formula, but my tests so far get me within 5% of what Linux is reporting as MemTotal.

Warning: the text: “Another thing is that in most cases when the system has swapped out, ‘Cached’ (minus ‘Shmem’ and ‘Mapped’) gets negative, which I currently can’t explain.” is not true anymore!
By dividing Cached memory into Shmem and Cached+Mapped memory, there is no negative value anymore! I can’t find a way to make a distinction between true ‘Cached’ memory, meaning pages cached without any process attached purely for the sake of reusing them so they do not need to be physically read again, and true Mapped pages, meaning pages that are mapped into a process address space. I know there is a value ‘Mapped’, but I can’t work out reliably how to make the distinction between cache and true mapped. Maybe there even isn’t one.

This is that formula:

MemFree
+ KernelStack
+ Buffers
+ PageTables
+ AnonPages
+ Slab
+ SwapCached
+ Cached - Shmem
+ Shmem
+ HugePages used (HugePages_Total-HugePages_Free)*Hugepagesize
+ HugePages rsvd (HugePages_Rsvd*Hugepagesize)
+ HugePages free (HugePages_Free-HugePages_Rsvd)*Hugepagesize
-------------------------------------------------------------
= Approximate total memory usage
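
As an illustration only (this is not the memstat.sh script), the formula can be sketched in Python. The function names parse_meminfo and approximate_total_kb are made up for this sketch; the field names are the ones from /proc/meminfo shown earlier. The one thing to watch is that the HugePages_* fields are page counts and must be multiplied by Hugepagesize to get kB.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Values are in kB, except the HugePages_* counters, which are page counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue  # skip lines that are not "Name: value [unit]"
        values[name.strip()] = int(rest.split()[0])
    return values


def approximate_total_kb(m):
    """Apply the formula from the text; returns the approximate total in kB."""
    hp_size = m["Hugepagesize"]  # kB per huge page
    hp_used = (m["HugePages_Total"] - m["HugePages_Free"]) * hp_size
    hp_rsvd = m["HugePages_Rsvd"] * hp_size
    hp_free = (m["HugePages_Free"] - m["HugePages_Rsvd"]) * hp_size
    return (
        m["MemFree"] + m["KernelStack"] + m["Buffers"] + m["PageTables"]
        + m["AnonPages"] + m["Slab"] + m["SwapCached"]
        + (m["Cached"] - m["Shmem"]) + m["Shmem"]
        + hp_used + hp_rsvd + hp_free
    )
```

On a Linux system, approximate_total_kb(parse_meminfo(open("/proc/meminfo").read())) gives the figure to compare with MemTotal. Applied to the /proc/meminfo listing earlier in this post, the sum comes to 3750240 kB against a MemTotal of 3781616 kB, a difference of less than 1%.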

In order to make this easy to use, I wrote a shell script that applies this formula to your Linux system, available on gitlab: https://gitlab.com/FritsHoogland/memstat.git. You can use the script to get a (quite wide) overview every 3 seconds by running ./memstat.sh, or you can get an overview of the current situation by running ./memstat.sh --oneshot.

This is what a --oneshot looks like on my test system (which is quite a small VM):

$ ./memstat.sh --oneshot
Free                 773932
Shmem                  2264
Mapped+Cached        359712
Anon                 209364
Pagetables            28544
KernelStack            4256
Buffers                   0
Slab                  63716
SwpCache              21836
HP Used             2023424
HP Rsvd               75776
HP Free              206848
Unknown               11944 (  0%)
Total memory        3781616
---------------------------
Total swp           1048572
Used swp             200852 ( 19%)

There is a lot to say about linux memory management. One important thing to realise is that when a system is running low on memory, it will not show as ‘Free’ declining towards zero. Linux will keep a certain amount of memory for direct use, dictated by ‘vm.min_free_kbytes’ as the absolute minimum.

In general, the ‘Cached’ pages (not Shmem pages at first) will be made available under memory pressure, since the Linux page cache really is only caching for potential performance benefit; there is no process directly attached to ‘Cached’ pages. Please mind that my experiments showed no reliable way to make a distinction between true ‘Mapped’ pages, meaning pages which are in use as memory mapped files, and true ‘Cached’ pages, meaning disk pages (blocks) sized 4KB which are kept in memory for the sake of reusing them, not directly related to a process.

Once the number of page cache pages gets low, and there still is a need for available pages, pages from the other categories start getting moved to swap. This excludes huge pages, even if they are not used! The way pages are considered is based on an ageing mechanism. This works quite well for light memory pressure for a short amount of time.

In fact, this works so well that the default eagerness of the kernel to swap (vm.swappiness, 60 by default, although I have seen 30 as a default value too; 0=not eager to swap, 100=maximal swap eagerness) seems appropriate on most systems, even ones with strict performance requirements. When swappiness is set (too) low, the kernel will try to avoid swapping for as long as possible, meaning that once there is no way around it, it probably needs to swap multiple pages at once, leading to noticeable delays, while paging out single pages further in advance has a hardly noticeable overhead.

However, please mind there is no way around consistent memory pressure! If memory in active use exceeds the physically available memory, physical memory has to be shared, at the cost of active memory pages being swapped to disk, for which processes have to wait.

To show the impact of memory pressure, and how hard it is to understand that from looking at the memory pages, let me show you an example. I ran ‘memstat.sh’ in one session, and the command ‘memhog’ (part of the numactl rpm package) in another. My virtual machine has 4G of memory, and has an Oracle database running which has the SGA allocated in huge pages.

First I started memstat.sh, then ran ‘memhog 1g’, which allocates 1 gigabyte of memory and then releases it. This is the memstat output:

$ ./memstat.sh
          Free          Shmem  Mapped+Cached           Anon     Pagetables    KernelStack        Buffers           Slab       SwpCache        HP Used        HP Rsvd        HP Free        Unknown   %
         42764         435128         495656         387872          40160           4608             96          38600             24              0              0        2306048          30660   0
         42616         435128         495656         388076          40256           4640             96          38572             24              0              0        2306048          30504   0
         42988         435128         495656         388088          40256           4640             96          38576             24              0              0        2306048          30116   0
         42428         435128         495700         388108          40264           4640             96          38600             24              0              0        2306048          30580   0
        894424         320960          99456          12704          40256           4640              0          35496          42352              0              0        2306048          25280   0
        775188         321468         141804          79160          40260           4640              0          35456          70840              0              0        2306048           6752   0
        698636         324248         201476          95044          40264           4640              0          35400          64744              0              0        2306048          11116   0
        686452         324264         202388         107056          40260           4640              0          35392          66076              0              0        2306048           9040   0
        682452         324408         204496         108504          40264           4640              0          35388          65636              0              0        2306048           9780   0

You can see memstat taking some measurements, then memhog is run, which quickly allocates 1G and releases it. This happens between the fourth and fifth measurement, where Free jumps from 42428 to 894424. First of all the free memory: once the process has allocated all the memory, it stops running, which means the memory is freed. Any private memory allocation mapped into the (now exited) process address space that is backed by a physical page is returned to the operating system as free, because it has effectively become available. So, counter-intuitive as it may seem, stopping a process that allocated a lot of non-shared (!) memory results in a lot of free memory being available.

As I indicated, ‘Cached’ memory is the first to be released to provide memory pages for direct use. Mapped+Cached contains this together with Mapped memory. The number of pages used by Mapped and Cached is drastically reduced by swapping. ‘Anon’ pages are significantly reduced too, which means they are swapped to the swap device, and ‘Shmem’ is reduced too, which also means swapped to the swap device, but far less than ‘Mapped+Cached’ and ‘Anon’. ‘KernelStack’ and ‘Pagetables’ hardly changed and ‘Slab’ decreased somewhat. ‘SwpCache’ actually grew, which makes sense because that is related to the swapping that took place.

The main thing I wanted to point out is that between the time of no memory pressure (the first four measurements) and past memory pressure (the measurements after that), there is no direct memory statistic showing whether a system is doing okay or has suffered. The only thing that directly indicates memory pressure is active swapping in and out, which can be seen with sar -W (pswpin/s and pswpout/s) or the vmstat si/so columns, which are not shown here.

Even after the memory pressure has passed, the majority of the pages on my system that were swapped out are still swapped out. Linux memory management had swapped out a lot of pages to facilitate the 1G allocation, which was freed and returned as free memory immediately after it had been allocated:

$ ./memstat.sh --oneshot
...
Total memory       3781616
--------------------------
Total swp          1048572
Used swp            407228 ( 38%)

This underlines an important Linux memory management principle: only do something if there is an immediate, direct need. My system now has no memory pressure anymore, but 38% of my swap is still allocated. Only when these pages are needed are they paged back in. This underlines the fact that swapping cannot and should not be measured by looking at the amount of swap used: a significant amount of swap being used only indicates that memory pressure has occurred in the past. The only way to detect that swapping is taking place is by looking at the actual current number of pages being swapped in and out.

If you see (very) low numbers of pages being swapped out without pages being swapped in at the same time, it’s the swappiness setting that causes pages that have not been used for some time to be moved out to the swap device. This is not a problem. If you see pages being swapped in without pages being swapped out at the same time, it means pages that were swapped out, either because of past memory pressure or proactive paging due to swappiness, are being read back in, which is not a problem either. Again, only if pages are actively being swapped both in and out at the same time, or if the rate is very high, is there a memory problem. The swapping is actually helping you not to fail because of memory not being available at all.
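
The reasoning above can be written down as a small Python sketch. It assumes two samples of the cumulative pswpin and pswpout counters (these are exposed in /proc/vmstat), taken a known number of seconds apart. The function names and the high_rate threshold of 1000 pages per second are made up for illustration; what counts as "very high" depends on the system.

```python
def swap_rates(before, after, interval_s):
    """Return (pages swapped in per second, pages swapped out per second).

    `before` and `after` are dicts holding the cumulative `pswpin` and
    `pswpout` counters, sampled `interval_s` seconds apart.
    """
    swapped_in = (after["pswpin"] - before["pswpin"]) / interval_s
    swapped_out = (after["pswpout"] - before["pswpout"]) / interval_s
    return swapped_in, swapped_out


def classify(swapped_in, swapped_out, high_rate=1000.0):
    """Classify swap activity; `high_rate` is an arbitrary example threshold."""
    if swapped_in == 0 and swapped_out == 0:
        return "no swapping"
    both_directions = swapped_in > 0 and swapped_out > 0
    if both_directions or max(swapped_in, swapped_out) >= high_rate:
        return "memory pressure"
    if swapped_out > 0:
        return "proactive page-out (harmless)"
    return "paging back in (harmless)"
```

Note that the amount of swap in use does not appear anywhere in this sketch: only the current rates are considered.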

Recently I was applying the data dictionary part from an (exadata bundle) patch and ran into the following errors:

ORA-24324: service handle not initialized
ORA-24323: value not allowed
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1001 (oinstall), current egid = 1002 (dba)

This was very weird, I had just started the instance using sqlplus. This is a database that is normally started by Oracle clusterware, but for the sake of quickly patching the database, I started it manually. Another reason I had done it that way is the (infamous) OJVM patch was part of the patching, which for the version I applied needs the database in upgrade mode, which means I had to turn the cluster_database parameter to false temporarily.

Back to the story: I ran datapatch on my manually (sqlplus) started instance, and got the above-mentioned error. The quick solution was to stop the instance (killing the pmon process will do that) and start the instance again. A thorough look through the alert.log file and trace files generated during startup did not show any error anymore. This issue had shown up after a successful startup previously, so I kept a close eye on the instance for some time, but it didn’t appear again. Problem solved, but what did happen?

When looking at the error messages, there are two lines mentioning ‘egid’, which means effective group id. Also, two values for groups are mentioned: 1001, which is the oinstall group, and 1002, which is the dba group. The last line is even clearer: it says the startup group id is oinstall, but the current group id is dba.

The line ‘ORA-27140: attach to post/wait facility failed’ is actually the root cause. The line ‘ORA-27302: failure occurred at: skgpwinit6’ tells the exact function, and skgpwinit6 probably can be deciphered as ‘System Kernel Generic PostWait INITialisation’. But what does ‘post/wait’ mean? My Oracle Support has a nice description in note ‘TECH: Unix Semaphores and Shared Memory Explained (Doc ID 15566.1)’. Essentially, post/wait is the Oracle side of using the operating system’s System V semaphore facility. The issue here is that the group id set for the semaphores does not align with the group id of this Oracle database server process, so the operation is rejected by the operating system (line ‘ORA-27300: OS system dependent operation:invalid_egid failed with status: 1’); Linux error 1 is (errno.h) EPERM 1 /* Operation not permitted */, so it is declined because of permissions.
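
If you do not have errno.h at hand, the status number can also be looked up with a few lines of Python. This is just a convenience sketch; the variable name status is only there to tie it to the "status: 1" in the ORA-27300 line.

```python
import errno
import os

status = 1  # the "status: 1" reported in the ORA-27300 line

# Symbolic name for the error number, as defined in errno.h.
print(errno.errorcode[status])

# Message text as reported by the C library for this error number.
print(os.strerror(status))
```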

The next question obviously is: how did this happen? The instance was started by me, in the same linux session, and then running something else that connects to the instance (datapatch) suddenly errors out.

This has to do with Oracle clusterware, ASM disk devices and (potential) role separation. In order for the Oracle database server processes to be able to access and use the local ASM disk devices, they must have the group id of the ASM disk devices set. When role separation is set up, which means the clusterware uses a different user id than the database software, the disk devices are owned by the clusterware user id, and the group id will be the group set as OSASM during installation, which tends to be set as ‘asmadmin’ in such cases.

Because the disk devices need to have permissions set to 660 (read and write for both the owner and the group), group membership is sufficient to use the devices. With role separation, the user id of the database processes is different from that of the clusterware processes, so for the database processes to be able to use the disk devices, they need to have membership of the configured OSASM group.

Now, the actual root cause of this issue is that, for the clusterware to make sure the databases can start up using the disk devices, it will set the OSASM group as the group of the database oracle executable (in the database home!) whenever clusterware is invoked to start up an instance. Because the oracle executable has the SUID and SGID bits set (rwsr-s--x), this could mean the group of newly created processes suddenly changes from the previous group id to the OSASM group if clusterware is used to start up an instance from the same home from which a database instance has been started up earlier without clusterware.

How is the group set on the oracle executable of the database home?
Some simple testing shows the current primary group of the user performing linking of the oracle executable is used as the group of the oracle executable (# means executed as root, $ means executed as oracle):

# groupadd test_group
# id oracle
uid=54321(oracle) gid=54321(oinstall) groups=54321(oinstall),54322(dba)
# usermod -a -G test_group oracle
# id oracle
uid=54321(oracle) gid=54321(oinstall) groups=54321(oinstall),54322(dba),54323(test_group)
# su - oracle
$ . oraenv
ORACLE_SID = [oracle] ? o12102
The Oracle base has been set to /u01/app/oracle
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk ioracle
...
$ ls -ls /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
316476 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 11:27 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
===
$ exit
# usermod -g test_group -G dba,oinstall oracle
# su - oracle
$ id
uid=54321(oracle) gid=54323(test_group) groups=54323(test_group),54321(oinstall),54322(dba)
$ . oraenv
ORACLE_SID = [oracle] ? o12102
The Oracle base has been set to /u01/app/oracle
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk ioracle
...
$ ls -ls /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle

I first used my current settings, which shows oinstall as primary group, and when I relink the oracle executable the group set with the oracle executable is oinstall. This is shown in lines 5-15. Next, I switch the primary group of the oracle user to test_group and execute linking again. Now the group of the oracle executable is test_group.

If I startup a database with the current settings, it will create semaphores with test_group as the current/effective group:

$ sqlplus / as sysdba
SQL> startup
...
Database opened.
SQL> exit
$ sysresv -l o12102

IPC Resources for ORACLE_SID "o12102" :
...
Semaphores:
ID		KEY
557059  	0x017b8a6c
Oracle Instance alive for sid "o12102"
$ ipcs -si 557059

Semaphore Array semid=557059
uid=54321	 gid=54321	 cuid=54321	 cgid=54323
mode=0640, access_perms=0640
nsems = 104
otime = Wed Mar 15 12:50:51 2017
ctime = Wed Mar 15 12:50:51 2017
semnum     value      ncount     zcount     pid
0          0          0          0          17097
1          4869       0          0          17097
2          10236      0          0          17097
3          32760      0          0          17097
...

Lines 1-4: startup the instance o12102 using sqlplus.
Lines 5-13: use the sysresv utility to find the semaphore array that the instance o12102 is using. The semaphore array id is 557059.
Lines 14-26: the current group id of the semaphore array of the instance o12102 is 54323 (cgid).
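
To check this without reading the ipcs output by eye, the id line can be picked apart with a regular expression. This is a minimal Python sketch; the function names semaphore_ids and created_by_group are made up, and the input is assumed to look like the ipcs -si output shown above.

```python
import re


def semaphore_ids(ipcs_output):
    """Extract uid, gid, cuid and cgid from `ipcs -si <semid>` output."""
    pairs = re.findall(r"\b(c?[ug]id)=(\d+)", ipcs_output)
    return {key: int(value) for key, value in pairs}


def created_by_group(ipcs_output, gid):
    """True if the semaphore array reports `gid` as its cgid."""
    return semaphore_ids(ipcs_output)["cgid"] == gid
```

For the output above, semaphore_ids returns cgid 54323, which is test_group, while gid is 54321 (oinstall).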

Before we go on and involve clusterware, let’s replay the scenario (changing the group of the oracle executable) manually outside of clusterware to see if we can get the same behaviour:

$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ chgrp oinstall oracle
$ chmod 6751 oracle
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 12:06 oracle

Please mind the database currently is running (otherwise there would be no semaphore array above).
In order to change the group correctly, not only does the group need to be reset (line 3), but the SUID and SGID bits must also be set again, because these are lost when the group is changed. Setting the SUID and SGID bits is done in line 4.
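
For reference, the relation between the octal mode 6751 and the rwsr-s--x string in the ls output can be shown with Python's stat module. This is only a sketch of the permission bits; the helper name has_suid_and_sgid is made up, and no file is touched.

```python
import stat

SUID_SGID = stat.S_ISUID | stat.S_ISGID


def has_suid_and_sgid(mode):
    """True if both the set-user-ID and set-group-ID bits are present."""
    return (mode & SUID_SGID) == SUID_SGID


# Mode of a regular file after `chmod 6751`.
print(stat.filemode(stat.S_IFREG | 0o6751))
# The same permissions once the SUID and SGID bits have been dropped.
print(stat.filemode(stat.S_IFREG | 0o0751))
```

For a real file, has_suid_and_sgid(os.stat(path).st_mode) (with os imported) tells whether the bits survived a group change.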

Now try to logon again as sysdba:

$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Wed Mar 15 13:02:25 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected.

That actually succeeds! This is because principal access is arranged by membership of the OSDBA group (which I have set to dba, of which the oracle user is a member). However, if I try to shutdown the instance, I get the messages regarding post/wait:

SQL> shutdown immediate
ERROR:
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 54323 (test_group), current
egid = 54321 (oinstall)

Can I solve this issue? Yes, quite simply by changing the group of the oracle executable back to the group the database was started up with (plus the SUID and SGID bits, obviously):

$ chgrp test_group oracle
$ chmod 6751 oracle
$ sqlplus / as sysdba
...
SQL> select * from dual;

D
-
X

Voila! It’s corrected again.

Now let’s keep our database group setting in mind (test_group), and involve clusterware. Because I got the o12102 instance already started, let’s see when the group of the database oracle executable changes, because the OSASM group of clusterware is set to oinstall:

$ srvctl status database -d o12102
Database is running.
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ srvctl stop database -d o12102
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ srvctl start database -d o12102
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 12:06 oracle

So it really is only when an instance is started using clusterware that the group of the oracle executable in the database home is changed to the OSASM group.

The group as set by clusterware can be changed by:
- unlocking the clusterware home ($ORACLE_HOME/crs/install/rootcrs.sh -unlock, or roothas.sh for SIHA) as root.
- changing the $ORACLE_HOME/rdbms/lib/config.c entries for .Lasm_string: .string "GROUP HERE" and #define SS_ASM_GRP "GROUP HERE" in the clusterware home, and then relinking (make -f ins_rdbms.mk $ORACLE_HOME/rdbms/lib/config.o; relink all) as the owner of clusterware.
- locking the clusterware home again ($ORACLE_HOME/crs/install/rootcrs.sh -patch, or roothas.sh for SIHA) as root.
==> Please mind that group for the ASM devices needs to be changed accordingly if you change the ASM group.

Recently, I was trying to set up TDE. Doing that, I found out the documentation provided by Oracle isn’t overly clear: there is a way to do it in pre-Oracle 12, which uses ‘alter system’ commands, and a new-ish way to do it in Oracle 12, using ‘administer key management’ commands. I am using version 12.1.0.2.170117, so I decided to use the ‘administer key management’ commands. This blogpost is about an exception which I encountered in the January 2017 (170117) PSU of the Oracle database, and which does NOT happen in Oracle 12.2 (no PSUs for Oracle 12.2 at the time of writing) or in the Oracle 12.1.0.2 April 2016 and October 2016 PSUs.

In order to test the wallet functionality for TDE, I used the following commands:

SQL> select status, wrl_parameter from v$encryption_wallet;

STATUS
------------------------------
WRL_PARAMETER
--------------------------------------------------------------------------------
NOT_AVAILABLE
/u01/app/oracle/admin/test/wallet

SQL> !mkdir /u01/app/oracle/admin/test/wallet

SQL> administer key management create keystore '/u01/app/oracle/admin/test/wallet' identified by "this_is_the_keystore_password";

keystore altered.

SQL> administer key management set keystore open identified by "this_is_the_keystore_password";

keystore altered.

SQL> administer key management set key identified by "this_is_the_keystore_password" with backup;
administer key management set key identified by "this_is_the_keystore_password" with backup
*
ERROR at line 1:
ORA-28374: typed master key not found in wallet

SQL> select status, wrl_parameter from v$encryption_wallet;

STATUS
------------------------------
WRL_PARAMETER
--------------------------------------------------------------------------------
CLOSED
/u01/app/oracle/admin/test/wallet

SQL> administer key management set keystore open identified by "this_is_the_keystore_password";

keystore altered.

SQL> select status, wrl_parameter from v$encryption_wallet;

STATUS
------------------------------
WRL_PARAMETER
--------------------------------------------------------------------------------
OPEN
/u01/app/oracle/admin/test/wallet

Notes:
Line 1-10: The DB_UNIQUE_NAME of the instance is ‘test’, and therefore the default wallet location is /u01/app/oracle/admin/test/wallet (ORACLE_BASE/admin/DB_UNIQUE_NAME/wallet). The wallet directory doesn’t exist by default, so I created it (line 10).
Line 12: Here the keystore/wallet is created with a password.
Line 16: After the wallet is created without auto-login, the wallet must be opened using the ‘set keystore open’ command.
Line 20: After the wallet has been created, it does not contain a master key. This is done using the ‘set key’ command. However, this throws an ORA-28374 error.
Line 26: After an error involving the wallet has occurred, the wallet closes.
Line 35: The wallet can simply be opened using the earlier used ‘set keystore open’ command.
Line 39: This is where the surprise is: after opening, the master key “magically” appeared (visible by the status ‘OPEN’, without a master key this would be ‘OPEN_NO_MASTER_KEY’).

I have yet to start creating encrypted tablespaces. There might be more surprises; I can’t tell at this moment because I didn’t try it. However, once I discovered this oddity, I talked to my colleague Matt, who gave me his own runbook for enabling TDE. It turned out to be the exact same list of commands as I had compiled, but he did not encounter the ORA-28374 which I did. I tested the same sequence of commands on 12.2.0.1, 12.1.0.2.161018 (October 2016) and 12.1.0.2.160419 (April 2016), and there the ORA-28374 was not raised during execution of the ‘set key’ command.

Update!
Reading through My Oracle Support note Master Note For Transparent Data Encryption ( TDE ) (Doc ID 1228046.1), I found the following text:

All the versions after 12.1.0.2

=====================

As of 12.1.0.2 If the key associated with the SYSTEM, SYSAUX or UNDO tablespaces is not present in the wallet you cannot associate a new master key with the database (i.e. you cannot activate that master key for the database) unless you set a hidden parameter :

SQL> administer key management use key 'AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' identified by "welcome1" with backup;
administer key management use key 'AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' identified by "welcome1" with backup
*
ERROR at line 1:
ORA-28374: typed master key not found in wallet

alter system set "_db_discard_lost_masterkey"=true;

SQL> administer key management use key 'AUQukK/ZR0/iv26nuN9vIqcAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' identified by "welcome1" with backup;

The heading and first line read oddly: the heading indicates the paragraph is about ‘all the versions after 12.1.0.2’ (which to me means 12.2), while the first line in the paragraph says ‘as of 12.1.0.2’, which very clearly says this is about version 12.1.0.2 and higher. However, a little further on it shows the exact error (ORA-28374) I encountered, and explains that if a current key is used in the data dictionary (mind: data dictionary, not wallet), you must set “_db_discard_lost_masterkey” to true before you can create and use another master key for a wallet if you start over (wipe or move the wallet directory).

This makes sense to me now! I tried dropping and creating new wallets in my current 170117 PSU instance, and only tried creating an encryption wallet in a brand new, freshly created instance. So if I had done EXACTLY the same in the instances with the other PSUs, which is repeatedly creating and dropping a wallet for TDE, I would have encountered the same ORA-28374 error. Well… I see this as a safety mechanism, albeit not a very obvious one, not exuberantly documented, and probably causing more grief than it would save if you run into the need to change the master key.

When sifting through a sql_trace file from Oracle version 12.2, I noticed a new wait event: ‘PGA memory operation’:

WAIT #0x7ff225353470: nam='PGA memory operation' ela= 16 p1=131072 p2=0 p3=0 obj#=484 tim=15648003957

The current documentation has no description for it. Let’s see what V$EVENT_NAME says:

SQL> select event#, name, parameter1, parameter2, parameter3, wait_class 
  2  from v$event_name where name = 'PGA memory operation';

EVENT# NAME                                  PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
------ ------------------------------------- ---------- ---------- ---------- ---------------
   524 PGA memory operation                                                   Other

Well, that doesn’t help…

Let’s look a bit deeper then, if Oracle provides no clue. Let’s start with the strace and sql_trace combination. For the test, I am doing a direct path full table scan on a table. Such a scan must allocate a buffer for the results (direct path reads do not go into the buffercache, table contents are scanned to the PGA and processed from there).

TS@fv122b2 > alter session set events 'sql_trace level 8';

Session altered.

Now use strace to look at the system calls in another session:

# strace -e write=all -e all -p 9426
Process 9426 attached
read(9,

Now execute ‘select count(*) from t2’. The output is rather verbose, but the important bits are:

io_submit(140031772176384, 1, {{data:0x7f5ba941ffc0, pread, filedes:257, buf:0x7f5ba91cc000, nbytes:106496, offset:183590912}}) = 1
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000
lseek(7, 0, SEEK_CUR)                   = 164639
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 112) = 112
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 37 38  30 20 70 31 3d 32 30 39   ela= 780 p1=209 |
 | 00040  37 31 35 32 20 70 32 3d  31 31 31 34 31 31 32 20  7152 p2=1114112  |
 | 00050  70 33 3d 30 20 6f 62 6a  23 3d 32 32 38 33 33 20  p3=0 obj#=22833  |
 | 00060  74 69 6d 3d 31 39 35 31  37 30 32 30 35 36 36 0a  tim=19517020566. |
...
munmap(0x7f5ba8fbd000, 2097152)         = 0
munmap(0x7f5ba91bd000, 2097152)         = 0
mmap(0x7f5ba949d000, 65536, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x2ce000) = 0x7f5ba949d000
lseek(7, 0, SEEK_CUR)                   = 183409
write(7, "WAIT #0x7f5ba9596310: nam='PGA m"..., 100) = 100
 | 00000  57 41 49 54 20 23 30 78  37 66 35 62 61 39 35 39  WAIT #0x7f5ba959 |
 | 00010  36 33 31 30 3a 20 6e 61  6d 3d 27 50 47 41 20 6d  6310: nam='PGA m |
 | 00020  65 6d 6f 72 79 20 6f 70  65 72 61 74 69 6f 6e 27  emory operation' |
 | 00030  20 65 6c 61 3d 20 35 39  32 20 70 31 3d 30 20 70   ela= 592 p1=0 p |
 | 00040  32 3d 30 20 70 33 3d 30  20 6f 62 6a 23 3d 32 32  2=0 p3=0 obj#=22 |
 | 00050  38 33 33 20 74 69 6d 3d  31 39 35 32 30 36 33 33  833 tim=19520633 |
 | 00060  36 37 34 0a                                       674.             |

Okay, we can definitely say the mmap() and munmap() system calls seem to be related, which makes sense if you look at the name of the wait event. Let’s look a bit more specifically using a systemtap script:

global wait_event_nr=524
probe begin {
	printf("begin.\n")
}

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthbwt") {
	if ( pid() == target() && register("rdx") == wait_event_nr )
		printf("kskthbwt - %d\n", register("rdx"))
}
probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr )
		printf("kskthewt - %d\n", register("rsi"))
}
probe syscall.mmap2 {
	if ( pid() == target() )
		printf(" mmap, addr %x, size %d, protection %d, flags %d, fd %i, offset %d ", u64_arg(1), u64_arg(2), int_arg(3), int_arg(4), s32_arg(5), u64_arg(6))
}
probe syscall.mmap2.return {
	if ( pid() == target() )
		printf("return value: %x\n", $return)
}
probe syscall.munmap {
	if ( pid() == target() )
		printf(" munmap, addr %x, size %d\n", u64_arg(1), u64_arg(2))
}

Short description of this systemtap script:
Lines 6-9: This probe is triggered once the function kskthbwt is called. This is one of the functions which are executed when the wait interface is called. The if statement on line 7 checks whether the process calling this function is the process specified with -x on the stap command line, and whether the register rdx contains the wait event number. This way all other waits are discarded. If the wait event is equal to wait_event_nr, which is set to wait event number 524 (‘PGA memory operation’), the printf() function prints kskthbwt and the wait event number. This is simply to indicate the wait has started.
Lines 10-13: This probe does exactly the same as the previous probe, except the function is kskthewt, which is one of the functions called when the ending of a wait event is triggered.
Lines 14-17: This is a probe that is triggered when the mmap2() system call is called. Linux actually uses the second version of the mmap call. Any call to mmap() is silently executed as mmap2(). Inside the probe, the correct process is selected, and the next line simply prints “mmap” and the arguments of mmap, which I picked from the CPU registers. I do not print a newline.
Lines 18-21: This is a return probe of the mmap2() system call. The function of this probe is to pick up the return code of the system call. For mmap2(), the return code is the address of the memory area mapped by the kernel for the mmap2() call.
Lines 22-25: This is a probe on the munmap() system call, which releases mmap’ed memory back to the operating system.
Please mind there are no curly braces following the if statements, which means the code executed when the if is true is the single line following the if. Systemtap and C are not indentation sensitive (unlike python); I indented for the sake of clarity.

I ran the above systemtap script against my user session and did a ‘select count(*) from t2’ again:

# stap -x 9426 mmap.stp
begin.
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 750 return value: 7f5ba91bd000
 mmap, addr 7f5ba91bd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba91bd000
kskthewt - 524
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1262 return value: 7f5ba8fbd000
 mmap, addr 7f5ba8fbd000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f5ba8fbd000
kskthewt - 524
kskthbwt - 524
 munmap, addr 7f5ba8fbd000, size 2097152
 munmap, addr 7f5ba91bd000, size 2097152
kskthewt - 524

This makes it quite clear! The event ‘PGA memory operation’ is raised when mmap() and munmap() are called, which are the calls to allocate and free memory for a process. The file descriptor (fd) value is set to -1, which means no file is mapped; this is anonymous memory.
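The systemtap output prints the protection and flags arguments as plain numbers. As a minimal sketch, the following Python decodes them into the symbolic names strace shows. The bit values are the ones I know for Linux on x86_64 and are hard-coded here as an assumption; the function names are made up for this example.

```python
# Bit values for Linux x86_64 (an assumption of this sketch; other
# platforms use different numbers for some of these flags).
PROT_BITS = {0x1: "PROT_READ", 0x2: "PROT_WRITE", 0x4: "PROT_EXEC"}
MAP_BITS = {
    0x01: "MAP_SHARED",
    0x02: "MAP_PRIVATE",
    0x10: "MAP_FIXED",
    0x20: "MAP_ANONYMOUS",
    0x4000: "MAP_NORESERVE",
}


def decode(value, names, zero_name):
    """Turn a bitmask into a NAME|NAME string, keeping unknown bits as hex."""
    parts = [name for bit, name in sorted(names.items()) if value & bit]
    leftover = value & ~sum(names)  # bits this table does not know about
    if leftover:
        parts.append(hex(leftover))
    return "|".join(parts) if parts else zero_name


def decode_prot(value):
    return decode(value, PROT_BITS, "PROT_NONE")


def decode_flags(value):
    return decode(value, MAP_BITS, "0")


if __name__ == "__main__":
    # The values seen in the systemtap output above.
    for prot, flags in ((3, 16418), (3, 50)):
        print(prot, decode_prot(prot), flags, decode_flags(flags))
```

Running it against the two mmap lines above gives the same flag names as the strace output shown earlier.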

Another interesting thing is shown: first mmap is called with no address given, which makes the kernel pick a memory location. This memory location is then used for a second mmap call at the same memory address. The obvious question for this is: why mmap two times?

To answer that, we need to look at the flags of the two calls. Here is an example:

mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x4ee000) = 0x7f5ba8fbd000
mmap(0x7f5ba8fbd000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5ba8fbd000

The first mmap call asks the kernel for a chunk of memory. PROT_READ and PROT_WRITE mean the memory should allow reading and writing. MAP_PRIVATE means it’s not public/shared, which is logical for Oracle PGA memory. MAP_ANONYMOUS means the memory allocation is not backed by a file, so just an allocation of contiguous memory. MAP_NORESERVE means no swap space is reserved for the allocation. This means this first mapping is essentially just a reservation of the memory range, no physical memory pages are allocated.

The next mmap call maps inside the memory allocated with the first mmap call. This seems strange at first. If you look closely at the flags, you see that MAP_NORESERVE is swapped for MAP_FIXED. The reason for this strategy is to make it easier for the Oracle database to place its memory allocations inside a contiguous chunk of (virtual) memory.

The first mmap call allocates a contiguous (virtual) memory area, which is really only a reservation of a memory range. No memory is truly allocated, hence MAP_NORESERVE. However, it does guarantee the memory region to be available. The next mmap allocates a portion of the reserved range. There is no MAP_NORESERVE, which means this allocation is catered for swapping in the case of memory shortage. This mapping does use a specific address, so Oracle can use pointers to refer to the contents, because it is certain of the memory address. Also, the MAP_FIXED flag has a side effect, which is used here: any existing mapping in that address range (here, part of the first “throw away” mapping) is silently unmapped.
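To make the two-step pattern tangible, here is a small sketch that repeats the same two mmap() calls from Python via ctypes, using the sizes from the trace. This is not how Oracle does it internally; it only replays the system calls. It assumes Linux on x86_64 (the flag values are hard-coded), and the helper names are invented for this example.

```python
import ctypes

# Linux x86_64 constants, hard-coded (an assumption of this sketch).
PROT_READ, PROT_WRITE = 0x1, 0x2
MAP_PRIVATE, MAP_FIXED, MAP_ANONYMOUS, MAP_NORESERVE = 0x02, 0x10, 0x20, 0x4000
MAP_FAILED = ctypes.c_void_p(-1).value

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]


def checked_mmap(addr, size, prot, flags):
    """Anonymous mmap() (fd -1, offset 0) that raises on failure."""
    result = libc.mmap(addr, size, prot, flags, -1, 0)
    if result is None or result == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "mmap failed")
    return result


def reserve_then_commit(reserve_size=2097152, commit_size=1114112):
    # Step 1: reserve an address range; the kernel picks the address.
    reserved = checked_mmap(None, reserve_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE)
    # Step 2: map the usable part at exactly that address with MAP_FIXED,
    # silently replacing the start of the reservation.
    committed = checked_mmap(reserved, commit_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS)
    # Touch the memory to show it is usable.
    ctypes.memset(committed, 0x41, commit_size)
    first_byte = ctypes.string_at(committed, 1)
    # One munmap() over the whole range releases both mappings.
    libc.munmap(reserved, reserve_size)
    return reserved, committed, first_byte


if __name__ == "__main__":
    reserved, committed, first_byte = reserve_then_commit()
    print(hex(reserved), hex(committed), first_byte)
```

Running this under strace should show the same pair of mmap() calls followed by a single munmap() of the 2 megabyte range.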

Let’s look a bit deeper into the wait event information. For this I changed the probe for function kskthewt in the systemtap script in the following way:

probe process("/u01/app/oracle/product/12.2.0.0.2/dbhome_1/bin/oracle").function("kskthewt") {
	if ( pid() == target() && register("rsi") == wait_event_nr ) {
		ksuse = register("r13")-4672
		ksuseopc = user_uint16(ksuse + 2098)
		ksusep1 = user_uint64(ksuse + 2104)
		ksusep2 = user_uint64(ksuse + 2112)
		ksusep3 = user_uint64(ksuse + 2120)
		ksusetim = user_uint32(ksuse + 2128)
		printf("kskthewt - wait event#: %u, wait_time:%u, p1:%lu, p2:%lu, p3:%lu\n", ksuseopc, ksusetim, ksusep1, ksusep2, ksusep3)
	}
}

When running a ‘select count(*) from t2’ again on a freshly started database with a new process with the changed mmap.stp script, this is what the output looks like:

kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 753 return value: 7f1562330000
 mmap, addr 7f1562330000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562330000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:1114112, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:28, p1:2097152, p2:1114112, p3:0

It looks like the size of the memory allocated with the first mmap call for the PGA memory reservation is put in p1, and the size of the second, “real” memory allocation is put in p2 of the ‘PGA memory operation’ event. One thing that does look weird is that the memory is not unmapped/deallocated (this is a full execution of a SQL statement, so shouldn’t the allocated buffers be deallocated?).

Let’s look what happens when I execute the same SQL again:

kskthbwt - 524
 munmap, addr 7f1562130000, size 2097152
 mmap, addr 7f15623b0000, size 589824, protection 0, flags 16434, fd -1, offset 881 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:253, p1:0, p2:0, p3:0
kskthbwt - 524
 mmap, addr 7f15623b0000, size 589824, protection 3, flags 50, fd -1, offset 0 return value: 7f15623b0000
kskthewt - wait event#: 524, wait_time:35, p1:589824, p2:0, p3:0
kskthbwt - 524
 mmap, addr 0, size 2097152, protection 3, flags 16418, fd -1, offset 1265 return value: 7f1562130000
 mmap, addr 7f1562130000, size 1114112, protection 3, flags 50, fd -1, offset 0 return value: 7f1562130000
kskthewt - wait event#: 524, wait_time:30, p1:2097152, p2:0, p3:0

Ah! It looks like some memory housekeeping is not done during the previous execution, but is left for the next execution: the execution starts with munmap(), followed by a mmap() call. The first munmap() call deallocates a 2 megabyte memory chunk. The next mmap() call is different from the other mmap() calls we have seen so far. Until now we saw a “throw away”/reservation mmap() call with the memory address set to 0, to let the operating system pick an address for the requested memory chunk, and a mmap() call with a memory address set, to truly allocate the reserved memory for usage. The mmap() call following munmap() has a memory address set. However, protection is set to 0; this means PROT_NONE, which means the mapped memory cannot be read or written. The flags number is different too: flags 16434 translates to MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE. So as part of releasing PGA memory, it seems some memory is reserved. The wait event parameters are all zero. When p1, p2 and p3 are all zero, it seems to indicate munmap() is called, although, as we have just seen, memory can be reserved in the same wait as well. Also, when p1/2/3 are all zero there is no way to tell how much memory is freed, nor which memory allocation.

The next wait is the timing of a single mmap() call. Actually, the mmap() call allocates the previous mmaped memory, but now with protection set to 3 (PROT_READ|PROT_WRITE), which means the memory is actually usable. The p1 value is the amount of memory mmaped.

The last wait is a familiar one: it is the mmap() call with the memory address set to zero, as reservation, and another mmap() call to allocate memory inside the previous “reserved” memory. However, the p1/2/3 values are now NOT set in the same way as we saw earlier: only p1 is non-zero, indicating the size of the first mmap() call. Previously, p1 and p2 were set to the sizes of both mmap() calls.

Conclusion:
With Oracle version 12.2 there is a new wait event ‘PGA memory operation’. This event indicates memory is allocated or de-allocated. Until now I only saw the system calls mmap() and munmap() inside the ‘PGA memory operation’.
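If you want to pull these waits out of a sql_trace file yourself, a minimal Python sketch could look like the one below. It only understands the key=number pairs that this event shows (p1, p2, p3, obj#, tim); waits with named parameters containing spaces are not handled. The function names are made up for this example.

```python
import re

# Matches the fixed head of a WAIT line; the remainder holds key=value pairs.
WAIT_RE = re.compile(
    r"^WAIT #(?P<cursor>\S+): nam='(?P<name>[^']*)' ela= *(?P<ela>\d+) (?P<rest>.*)$"
)
PAIR_RE = re.compile(r"(\S+)=(-?\d+)")


def parse_wait(line):
    """Return a dict for a WAIT line, or None for any other trace line."""
    match = WAIT_RE.match(line.strip())
    if match is None:
        return None
    wait = {
        "cursor": match.group("cursor"),
        "name": match.group("name"),
        "ela": int(match.group("ela")),
    }
    for key, value in PAIR_RE.findall(match.group("rest")):
        wait[key] = int(value)
    return wait


def pga_memory_waits(lines):
    """Yield only the 'PGA memory operation' waits from trace lines."""
    for line in lines:
        wait = parse_wait(line)
        if wait is not None and wait["name"] == "PGA memory operation":
            yield wait


if __name__ == "__main__":
    sample = ("WAIT #0x7ff225353470: nam='PGA memory operation' ela= 16 "
              "p1=131072 p2=0 p3=0 obj#=484 tim=15648003957")
    for wait in pga_memory_waits([sample]):
        print(wait["ela"], wait["p1"], wait["p2"], wait["p3"])
```

Given what we saw above, a wait with p1, p2 and p3 all zero is a candidate munmap(), and a non-zero p1 points to an mmap() of that size.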

In my previous post, I introduced Intel Pin. If you are new to pin, please follow this link to my previous post on how to set it up and how to run it.

One of the things you can do with Pin is profile memory access. Profiling memory access using the pin tool ‘pinatrace’ is done in the following way:

$ cd ~/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 12284 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

The pid is a pid of an oracle database foreground process. Now execute something in the session you attached pin to and you find the ‘pinatrace’ output in $ORACLE_HOME/dbs:

$ ls -l $ORACLE_HOME/dbs
total 94064
-rw-rw----. 1 oracle oinstall     1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--. 1 oracle oinstall     2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall       57 Nov  5 09:42 inittestdb.ora
-rw-r-----. 1 oracle oinstall       24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall     7680 Nov  5 09:41 orapwtestdb
-rw-r--r--  1 oracle oinstall 10552584 Nov 17 06:36 pinatrace.out

Please mind memory access generates A LOT of information! The above 11MB is what a ‘select * from dual’ generates (!)

This is what the file looks like:

$ head pinatrace.out
#
# Memory Access Trace Generated By Pin
#
0x00007f85c63fe218: R 0x00007fff6fd2c4c8  8          0xcefb615
0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c
0x000000000cefb621: R 0x00007fff6fd2c4d0  8     0x7f85c5bebd96
0x000000000cefb625: R 0x00007fff6fd2c4d8  8     0x7f85c5bebd96
0x000000000cefb62c: R 0x00007fff6fd2c4e0  8     0x7fff6fd2c570
0x000000000cefb62d: R 0x00007fff6fd2c4e8  8          0xcefb54e

The first field is the instruction pointer address (the location inside a function), the second field is R or W (reading or writing, obviously), the third field is the memory location read or written, the fourth field is the size of the access in bytes, and the fifth field is the value that was read or written.
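Because the file is plain whitespace-separated text, it is easy to process with a script. Below is a minimal Python sketch that splits a pinatrace line into its five columns. The dictionary keys (ip, op, addr, size, value) are names I picked for this example, not something pinatrace defines.

```python
def parse_pinatrace_line(line):
    """Split one pinatrace.out line into its columns.

    Returns None for comment lines (starting with '#') and empty lines.
    The trailing column is stored as 'value'; it is None when absent.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split()
    return {
        "ip": int(parts[0].rstrip(":"), 16),   # instruction pointer address
        "op": parts[1],                        # 'R' or 'W'
        "addr": int(parts[2], 16),             # memory location accessed
        "size": int(parts[3]),                 # access size, printed in decimal
        "value": int(parts[4], 16) if len(parts) > 4 else None,
    }


if __name__ == "__main__":
    sample = "0x000000000cefb61e: W 0x00007fff6fd2c4f8  8              0x12c"
    record = parse_pinatrace_line(sample)
    print(hex(record["ip"]), record["op"], hex(record["addr"]),
          record["size"], hex(record["value"]))
```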

The function that is used can be looked up using the addr2line linux utility:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x000000000cefb61e
sntpread at ??:?

I looked up the second address from the pinatrace.out file above, and that address belongs to the function sntpread. There is no additional information available for this function (‘at ??:?’). If the address is not available in the oracle executable, a ‘??’ is displayed:

$ addr2line -p -f -e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle 0x00007f85c63fe218
?? ??:0

The pinatrace.out file is usable if you know the exact instruction pointer address or the memory location. However, that usage is fairly limited. An example of that is Mahmoud Hatem’s blog on tracing access to a memory location. Wouldn’t it be nice if we could change the function addresses to function names, and the memory addresses to named memory locations whenever possible?

That’s what I created the pinatrace annotate oracle tool for. This is a little set of scripts that generate memory information from the instance, after which the instruction pointer addresses and memory locations of a pinatrace.out file generated by pinatrace are translated to function names and memory area names. Let’s have a look at what that means. This is a snippet of a pinatrace.out file:

0x000000000c967e46: R 0x0000000095f69910  8         0x95fcf6b0
0x000000000c967e4d: W 0x00007fff6fd2b2b8  8          0xc967e52
0x000000000c937b32: W 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937b3a: W 0x00007fff6fd2b278  8                0xe
0x000000000c937b41: W 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937b45: W 0x00007fff6fd2b270  8                0x1
0x000000000c937b49: W 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937b4d: R 0x0000000095fcf6bc  2               0x12
0x000000000c937b52: W 0x00007fff6fd2b288  8              0x2c4
0x000000000c937b59: W 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937b60: W 0x00007fff6fd2b2a0  4               0x73
0x000000000c937b6b: W 0x00007fff6fd2b2a8  4                0x1
0x000000000c937b6e: R 0x00007f85ca1db280  8     0x7f85ca1db280
0x000000000c937b77: R 0x000000000d0a40e4  4                0x1
0x000000000c937b84: R 0x00007f85ca1d43c8  8         0x95dc0e20
0x000000000c937b92: R 0x0000000095dc10b0  8                  0
0x000000000c937ba2: R 0x0000000095fcf6c0  4                0x1
0x000000000c937ba9: R 0x0000000095dc10e0  4                  0
0x000000000c937baf: R 0x000000000cfbe644  4            0x1cffe
0x000000000c937bbc: W 0x0000000095dc10b0  8         0x95fcf6b0
0x000000000c937bc5: R 0x0000000095fcf6b0  8                  0
0x000000000c937bc5: W 0x0000000095fcf6b0  8                0x1
0x000000000c937bca: W 0x00007fff6fd2b260  8                  0
0x000000000c937be1: R 0x00007f85ca1d4290  8     0x7f85ca1a9ca0
0x000000000c937bec: R 0x00007f85ca1ab1c0  4                0x3
0x000000000c937bf3: W 0x0000000095dc0faa  2                0x3
0x000000000c937bf9: R 0x00007f85ca1d43e0  8         0x95f68ea8
0x000000000c937c09: R 0x0000000095f69470  2                  0
0x000000000c937c16: W 0x0000000095dc0fac  2                  0
0x000000000c937c1e: R 0x0000000095dc10e0  4                  0
0x000000000c937c1e: W 0x0000000095dc10e0  4                0x2
0x000000000c937c24: W 0x0000000095dc0fa0  8         0x95fcf6b0
0x000000000c937c28: W 0x0000000095dc0fa8  2                0x8
0x000000000c937c2e: R 0x000000006000a9d8  4                0x1
0x000000000c937c3b: R 0x00007fff6fd2b298  8         0x95f68ea8
0x000000000c937c3f: R 0x00007fff6fd2b2a0  4               0x73
0x000000000c937c42: W 0x0000000095fcf6c8  8         0x95f68ea8
0x000000000c937c46: W 0x0000000095fcf6c4  4               0x73
0x000000000c937c4a: R 0x00007fff6fd2b2a8  4                0x1
0x000000000c937c50: R 0x0000000095fcf6b8  4              0x83e
0x000000000c937c50: W 0x0000000095fcf6b8  4              0x83f
0x000000000c937c5a: W 0x0000000095dc10b0  8                  0
0x000000000c937c65: R 0x00007f85ca1d71d6  1                  0
0x000000000c937c76: R 0x00007fff6fd2b270  8                0x1
0x000000000c937c7a: R 0x00007fff6fd2b290  8          0xd8f898c
0x000000000c937c7e: R 0x00007fff6fd2b288  8              0x2c4
0x000000000c937c82: R 0x00007fff6fd2b280  8     0x7f85ca1db280
0x000000000c937c86: R 0x00007fff6fd2b278  8                0xe
0x000000000c937c8d: R 0x00007fff6fd2b2b0  8     0x7fff6fd2bdb0
0x000000000c937c8e: R 0x00007fff6fd2b2b8  8          0xc967e52

The usefulness of this is limited in this form. The only thing I could derive is that big numbers in the memory access column (‘0x00007fff6fd2ac60’) are probably PGA related, and the numbers between roughly 0x0000000060000000 and 0x0000000095dc0fd0 are probably SGA related. After running the annotate tool, it looks like this:

ksl_get_shared_latch:W:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:W:0x00007fff6fd2b278():8
ksl_get_shared_latch:W:0x00007fff6fd2b298():8
ksl_get_shared_latch:W:0x00007fff6fd2b270():8
ksl_get_shared_latch:W:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x0000000095fcf6bc(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):2
ksl_get_shared_latch:W:0x00007fff6fd2b288():8
ksl_get_shared_latch:W:0x00007fff6fd2b290():8
ksl_get_shared_latch:W:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x00007f85ca1db280(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x000000000d0a40e4():4
ksl_get_shared_latch:R:0x00007f85ca1d43c8(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6c0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:R:0x000000000cfbe644():4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6b0(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x00007fff6fd2b260():8
ksl_get_shared_latch:R:0x00007f85ca1d4290(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x00007f85ca1ab1c0(pga|Other, pga heap, kgh stack pga|Other, pga heap, free memory pga|Other, pga heap, permanent memory):4
ksl_get_shared_latch:W:0x0000000095dc0faa(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x00007f85ca1d43e0(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):8
ksl_get_shared_latch:R:0x0000000095f69470(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:W:0x0000000095dc0fac(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc10e0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALOW):4
ksl_get_shared_latch:W:0x0000000095dc0fa0(shared pool|permanent memor,duration 1,cls perm):8
ksl_get_shared_latch:W:0x0000000095dc0fa8(shared pool|permanent memor,duration 1,cls perm):2
ksl_get_shared_latch:R:0x000000006000a9d8(fixed sga|var:kslf_stats_):4
ksl_get_shared_latch:R:0x00007fff6fd2b298():8
ksl_get_shared_latch:R:0x00007fff6fd2b2a0():4
ksl_get_shared_latch:W:0x0000000095fcf6c8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):8
ksl_get_shared_latch:W:0x0000000095fcf6c4(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:R:0x00007fff6fd2b2a8():4
ksl_get_shared_latch:R:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095fcf6b8(shared pool|permanent memor,duration 1,cls perm shared pool|(child)latch:session idle bit):4
ksl_get_shared_latch:W:0x0000000095dc10b0(shared pool|permanent memor,duration 1,cls perm shared pool|X$KSUPR.KSLLALAQ):8
ksl_get_shared_latch:R:0x00007f85ca1d71d6(pga|Other, pga heap, permanent memory pga|Other, top call heap, free memory):1
ksl_get_shared_latch:R:0x00007fff6fd2b270():8
ksl_get_shared_latch:R:0x00007fff6fd2b290():8
ksl_get_shared_latch:R:0x00007fff6fd2b288():8
ksl_get_shared_latch:R:0x00007fff6fd2b280():8
ksl_get_shared_latch:R:0x00007fff6fd2b278():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b0():8
ksl_get_shared_latch:R:0x00007fff6fd2b2b8():8

So, now you can see that the reason I picked a seemingly arbitrary range of lines is that this range contains the memory accesses of the ksl_get_shared_latch function. This annotated version shows a shared latch get for the ‘session idle bit’ latch. It’s also visible that the function uses PGA memory, some of it annotated, some of it not, and that most of the shared pool access is for the latch (a latch essentially is a memory range with the function of serialising access to a resource), which is in the shared pool because it’s a child latch. It’s also visible that memory belonging to X$KSUPR is read and written (X$KSUPR is the table responsible for V$PROCESS; the fields KSLLALAQ and KSLLALOW are not externalised in V$PROCESS).

Why are a lot of the assumed PGA addresses (the ones like 0x00007fff6fd2b2b8) not annotated? Well, PGA memory allocations are very transient in nature. Because a PGA memory snapshot is made at a certain point in time, this snapshot represents the memory layout of that moment, and there is a high probability that memory used during the trace has since been deallocated and freed to the operating system. A lot of the SGA/shared pool allocations on the other hand have the intention of re-usability, and thus are not freed immediately after usage, which gives the SGA memory snapshot a good chance of capturing a lot of the memory allocations.
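The core idea of the annotation, looking up an address in a list of named memory ranges, can be sketched in a few lines of Python. This is not the code of the tool itself, which is a bash script; the function names and the example ranges below, including their boundaries, are invented purely for illustration.

```python
def labels_for(addr, ranges):
    """Return the names of all ranges containing addr, space separated.

    ranges is a list of (start, end, name) tuples with end exclusive.
    Ranges may overlap, so one address can get several labels.
    """
    return " ".join(name for start, end, name in ranges if start <= addr < end)


def annotate(function, op, addr, size, ranges):
    """Format one access in the function:op:address(labels):size layout."""
    return f"{function}:{op}:0x{addr:016x}({labels_for(addr, ranges)}):{size}"


# Hypothetical ranges; the boundaries are made up for this example only.
EXAMPLE_RANGES = [
    (0x60000000, 0x60100000, "fixed sga (example)"),
    (0x95000000, 0x96000000, "shared pool (example)"),
    (0x95FCF6B0, 0x95FCF6D0, "latch (example)"),
]

if __name__ == "__main__":
    # An address inside two overlapping example ranges.
    print(annotate("ksl_get_shared_latch", "R", 0x95FCF6BC, 2, EXAMPLE_RANGES))
    # An address outside every range gets empty parentheses.
    print(annotate("ksl_get_shared_latch", "W", 0x7FFF6FD2B2B0, 8, EXAMPLE_RANGES))
```

An address that falls outside every known range simply gets empty parentheses, which is what happens to the transient PGA addresses discussed above.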

Get the pinatrace oracle annotate tool via github: git clone https://github.com/FritsHoogland/pinatrace_annotate_oracle.git

Please mind this tool uses the bash shell; it might not work in other shells like ksh.

How to use the tool?
– Use pin with the pinatrace.so tool, as described above. Move the pinatrace.out file from $ORACLE_HOME/dbs to the directory with the pinatrace_annotate_oracle.sh script.
Immediately after the trace has been generated (!), execute the following scripts using sqlplus as SYSDBA:
– 0_get_pga_detail.sql (this lists the sessions in the database and requires you to specify the oracle PID of the session)
– 1_generate_memory_ranges.sql
– 2_generate_memory_ranges_xtables.sql
– 3_generate_memory_ranges_pga.sql
This results in the following files: memory_ranges.csv, memory_ranges_pga.csv and memory_ranges_xtables.csv.
Now execute the annotate script:
– ./pinatrace_annotate_oracle.sh pinatrace.out
The script outputs to STDOUT, so if you want to save the annotation, redirect it to a file (> file.txt) or if you want to look and redirect to a file: | tee file.txt.

I hope this tool is useful for your research. If you know a memory area described in the data dictionary that is not included, please drop me a message with the script, then I’ll include it.

This blogpost is an introduction to Intel’s Pin dynamic instrumentation framework. Pin and the pintools were brought to my attention by Mahmoud Hatem in his blogpost Tracing Memory access of an oracle process: Intel PinTools. The Pin framework provides an API that abstracts instruction-set specifics (on the CPU layer). Because this is a dynamic binary instrumentation tool, it requires no recompiling of source code. This means we can use it with programs like the Oracle database executable.
The Pin framework download comes with a set of pre-created tools called ‘Pintools’. Some of these tools are really useful for Oracle investigation and research.

Pin works in a very sophisticated way. The description in the Pin manual is to think of Pin as a JIT (just in time) compiler, where the compiler does not take byte code (as JIT compilation does with Java), but the executable of the process that pin is run against. This means pin inserts itself into the process’ execution. This can be seen when looking at the memory map of such a process:

$ cat /proc/29595/maps | grep -e pin-3.0 -e oracle
00400000-1098a000 r-xp 00000000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10b8a000-10bac000 r--p 1058a000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
10bac000-10e05000 rw-p 105ac000 fb:02 68469986                           /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
7fa01e04c000-7fa01e31d000 r-xp 00000000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e51c000-7fa01e530000 r--p 002d0000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e530000-7fa01e531000 rw-p 002e4000 fb:04 67152845                   /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/lib-ext/libpin3dwarf.so
7fa01e533000-7fa01e71a000 r-xp 00000000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91a000-7fa01e91c000 r--p 001e7000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e91c000-7fa01e91e000 rw-p 001e9000 fb:04 954267                     /home/oracle/pin/pin-3.0-76991-gcc-linux/source/tools/SimpleExamples/obj-intel64/calltrace.so
7fa01e946000-7fa01e9c9000 r-xp 00000000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9ca000-7fa01e9cc000 r--p 00083000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9cc000-7fa01e9ce000 rw-p 00085000 fb:04 136702                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libc-dynamic.so
7fa01e9d6000-7fa01ea04000 r-xp 00000000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea04000-7fa01ea05000 r--p 0002d000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea05000-7fa01ea06000 rw-p 0002e000 fb:04 136694                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libm-dynamic.so
7fa01ea06000-7fa01eac4000 r-xp 00000000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac5000-7fa01eac8000 r--p 000be000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eac8000-7fa01eaca000 rw-p 000c1000 fb:04 136696                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/runtime/pincrt/libstlport-dynamic.so
7fa01eacb000-7fa01edb7000 r-xp 00000000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01edb8000-7fa01ee24000 r--p 002ec000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee24000-7fa01ee25000 rw-p 00358000 fb:04 100663633                  /home/oracle/pin/pin-3.0-76991-gcc-linux/extras/xed-intel64/lib/libxed.so
7fa01ee80000-7fa01f385000 r-xp 00000000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f385000-7fa01f38d000 r--p 00504000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f38d000-7fa01f394000 rw-p 0050c000 fb:04 136689                     /home/oracle/pin/pin-3.0-76991-gcc-linux/intel64/bin/pinbin
7fa01f54e000-7fa01f6d4000 r-xp 00000000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so
7fa01f6d4000-7fa01f8d3000 ---p 00186000 fb:02 212053053                  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpkavx12.so

Here you see this process is an oracle foreground process (rows 1-3), after which we see a lot of stuff that pin pushed into the process’ address space.
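To get a quick overview of how much address space each file takes in such a listing, the maps lines can be summed per path. A minimal Python sketch, with function names of my own choosing, assuming the standard /proc/PID/maps column layout:

```python
from collections import defaultdict


def parse_maps_line(line):
    """Return (start, end, perms, path) for one /proc/PID/maps line."""
    fields = line.split(None, 5)           # path is optional, hence maxsplit
    start_text, end_text = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return int(start_text, 16), int(end_text, 16), fields[1], path


def mapped_bytes_per_path(lines):
    """Sum the size of all mappings per backing file (or '' for anonymous)."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        start, end, _perms, path = parse_maps_line(line)
        totals[path] += end - start
    return dict(totals)


if __name__ == "__main__":
    import sys
    # Usage: python maps_sum.py < /proc/29595/maps
    for path, size in sorted(mapped_bytes_per_path(sys.stdin).items()):
        print(f"{size:>14} {path}")
```

Please mind this counts virtual address space, not resident memory.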

For this reason, it’s vitally important to use and run pin as the same user as the process you want to run pin against. The way pin works is that, upon execution of pin, the pin executable inserts itself into the process’ address space, gains control and then tries to load necessary libraries. If it can’t find these libraries, it will send a SIGKILL to the process, effectively stopping it!

In order to obtain pin, go to the pin homepage, downloads, linux and select the ‘gcc compiler kit’. This is a zipped tarball. Upload that to a linux server as the database owner, usually ‘oracle’. I created a directory ‘pin’ in which I put the tarball, and extracted it (tar xzf). The next step is to compile the tools that come with pin:

$ cd pin-3.0-76991-gcc-linux/source/tools
$ make

This will output a lot of stuff on your screen and compiles everything in the tools directory.

Now that we have seen an introduction, let’s use a Pin tool to do something useful! One such useful tool is ‘DebugTrace’. First obtain the process id (pid) from an Oracle server foreground process. Then execute the pin tool against this process:

$ pwd
/home/oracle/pin/pin-3.0-76991-gcc-linux
$ ./pin -pid 2407 -t source/tools/DebugTrace/obj-intel64/debugtrace.so
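
Because running pin as the wrong user can kill the target process, it can help to wrap the check and the command in a small script. This is a minimal sketch in Python and not part of pin: the function names ‘owner_matches’ and ‘pin_attach_command’ are made up for illustration, and the check assumes that on Linux the /proc/<pid> directory is owned by the user the process runs as. Only the -pid and -t flags shown above are used.

```python
import os


def owner_matches(pid, proc_root="/proc"):
    """Return True when the current effective user owns /proc/<pid>.

    Assumption: the owner of the /proc/<pid> directory is the user the
    target process runs as. If the kernel reports another owner, this
    returns False, which errs on the side of not attaching.
    """
    proc_dir = os.path.join(proc_root, str(pid))
    return os.stat(proc_dir).st_uid == os.geteuid()


def pin_attach_command(pin_home, pid, tool_path):
    """Build the argument list for attaching a pintool to a running process."""
    return [os.path.join(pin_home, "pin"), "-pid", str(pid), "-t", tool_path]


# Example with the values used in this post; the command is only printed.
cmd = pin_attach_command(
    "/home/oracle/pin/pin-3.0-76991-gcc-linux",
    2407,
    "source/tools/DebugTrace/obj-intel64/debugtrace.so",
)
print(" ".join(cmd))
```

A caller would only hand the list to subprocess.run() after owner_matches(pid) returned True.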

Upon execution, nothing is returned, because the majority of the things are happening in the process pin is run against (2407 in this case). If no parameters are specified, the pintool will create a file in the current working directory of the process it is run against. For an oracle foreground process, the current working directory is $ORACLE_HOME/dbs:

$ ls -l /proc/2407/cwd
lrwxrwxrwx 1 oracle oinstall 0 Nov 17 01:45 /proc/2407/cwd -> /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs
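
The same lookup can be scripted to predict where the trace file will land. This is a small sketch in Python; the function name ‘trace_output_path’ is hypothetical, and the default file name debugtrace.out is simply the one the DebugTrace tool produced in my test.

```python
import os


def trace_output_path(pid, filename="debugtrace.out", proc_root="/proc"):
    """Return the path where a pintool started without parameters writes its output.

    The tool writes into the current working directory of the target
    process, which Linux exposes as the symlink /proc/<pid>/cwd.
    """
    cwd = os.readlink(os.path.join(proc_root, str(pid), "cwd"))
    return os.path.join(cwd, filename)
```

For pid 2407 in this post, this resolves to debugtrace.out inside the dbs directory shown above.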

For the sake of testing this, I ran ‘select * from dual’ in the oracle database foreground process.

Let’s look at what this produces! This is what my dbs directory looks like:

$ ls -ltr
total 2332
-rw-r--r--. 1 oracle oinstall    2992 Feb  3  2012 init.ora
-rw-r-----. 1 oracle oinstall      24 Nov  5 09:32 lkTESTDB
-rw-r-----. 1 oracle oinstall    7680 Nov  5 09:41 orapwtestdb
-rw-r-----. 1 oracle oinstall      57 Nov  5 09:42 inittestdb.ora
-rw-rw----. 1 oracle oinstall    1544 Nov 16 09:40 hc_testdb.dat
-rw-r--r--  1 oracle oinstall 2361460 Nov 17 01:49 debugtrace.out

You can see the debugtrace.out file is what is produced by the pin tool, and it looks rather big (remember I only ran ‘select * from dual’)!

Let’s look into the file:

$ less debugtrace.out
         3 @@@ return underflow
Return 0x00007f706b664218 /lib64/libpthread.so.0:__read_nocancel+0x00000000000f returns: 0x12c
        14 @@@ return underflow
Return 0x000000000cefb62d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sntpread+0x00000000003d returns: 0
        31 @@@ return underflow
Return 0x000000000cefb580 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:ntpfprd+0x0000000000c0 returns: 0
       119 @@@ return underflow
Return 0x000000000cedcc9d /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbasic_brc+0x00000000032d returns: 0
       122 @@@ return underflow
Return 0x000000000cedc90a /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nsbrecv+0x00000000005a returns: 0
       152 Call 0x000000000ceeb746 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x000000000276 -> 0x0000000005b170d0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       156 Tailcall 0x0000000005b170de /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy+0x00000000000e -> 0x0000000005b170c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P(0x7ffeef95c305, 0x7f706ae51da0, ...)
       158 | Call 0x0000000005b170c1 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000001 -> 0x0000000005b1dcb0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy(0x7ffeef95c305, 0x7f706ae51da0, ...)
       172 | Return 0x0000000005b20097 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:__intel_ssse3_rep_memcpy+0x0000000023e7 returns: 0x7ffeef95c305
       174 Return 0x0000000005b170c7 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:_intel_fast_memcpy.P+0x000000000007 returns: 0x7ffeef95c305
       197 @@@ return underflow
Return 0x000000000ceeb7ad /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:nioqrc+0x0000000002dd returns: 0
       213 Call 0x000000000cb5d8b0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:opikndf2+0x000000000410 -> 0x000000000c940560 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx(0x7ffeef95c040, 0x7f706ae51da1, ...)
       268 | Call 0x000000000c94066e /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:kslwtectx+0x00000000010e -> 0x000000000cd585c0 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64(0x7ffeef95c040, 0x7f706ae51da1, ...)
       278 | | Call 0x000000000cd5863b /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64+0x00000000007b -> 0x000000000b13ef90 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt(0x7ffeef95c040, 0x7f706ae51da1, ...)
       286 | | | Call 0x000000000b13efa3 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:sltrgftime64_cgt+0x000000000013 -> 0x0000000000ba6860 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle:.plt+0x000000001360(0x1, 0x7ffeef95beb0, ...)
       293 | | | | Call 0x00007f706b188dcb /lib64/libc.so.6:__clock_gettime+0x00000000001b -> 0x00007ffeef9ab8a0 clock_gettime(0x1, 0x7ffeef95beb0, ...)
...etc...

I think this is rather exciting! This is a FULL function call trace, indented by call depth!
The first lines in the trace are returns, and these returns produce a ‘return underflow’, because these functions were called before the trace was put on the process.
This trace shows:
Calling: the address and function where the function call is made, the address and function that is called, and the first two arguments of the function.
Returning: the address and function where the return is executed, and the return value.
Tailcall: this is a function that is called as the last instruction in the current function. This means it returns as part of the current function, which is shown by not indenting this call further (call, tailcall and return are all indented at the same level).
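
To work with a trace of this size, the three line types can be picked apart with a script. The following is a rough sketch in Python that I derived from the sample lines above only, not from the DebugTrace source: the names ‘parse_line’ and ‘function_name’ are made up, the meaning of the leading number is not interpreted, and lines that do not start with Call, Tailcall or Return (such as ‘@@@ return underflow’) are skipped.

```python
import re

# Optional leading number, then one "| " per call depth level, then the event.
EVENT_RE = re.compile(
    r"^\s*(?:(?P<count>\d+)\s+)?"
    r"(?P<bars>(?:\|\s)*)"
    r"(?P<kind>Call|Tailcall|Return)\s+"
    r"(?P<rest>.*)$"
)


def function_name(location):
    """Extract 'func' from '0xADDR /path/image:func+0xOFFSET(args)' or '0xADDR func(args)'."""
    without_args = location.split("(", 1)[0]      # drop the argument list, if any
    target = without_args.split()[-1]             # drop the leading address
    name = target.rsplit(":", 1)[-1]              # drop the image path, if any
    return name.split("+0x", 1)[0]                # drop the offset, if any


def parse_line(line):
    """Return a dict describing a Call, Tailcall or Return line, or None for other lines."""
    m = EVENT_RE.match(line)
    if m is None:
        return None
    kind, rest = m.group("kind"), m.group("rest")
    event = {"kind": kind, "depth": m.group("bars").count("|")}
    if kind == "Return":
        site, _, value = rest.partition(" returns: ")
        event["function"] = function_name(site)
        event["value"] = value.strip()
    else:
        caller, _, callee = rest.partition(" -> ")
        event["function"] = function_name(caller)
        event["target"] = function_name(callee)
    return event
```

Feeding every line of debugtrace.out through parse_line and counting the ‘target’ values, for example, gives a quick overview of which functions are called most often.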

Please mind that I am currently not aware of a way to remove the pin framework and tool from a process’ address space. Also mind that because of how this works (a lot of intercepting in the execution flow), it will slow down the process significantly. Use this tool and the above described techniques at your own risk.

This blogpost is about the Oracle redo log structures and redo efficiency in modern Oracle databases. A lot of subtle things surrounding redo have changed (starting from Oracle 10), and these changes have gone fairly unnoticed. One place where they have gone unnoticed is the Oracle documentation: its description of redo is accurate for Oracle 9, not for how it works in Oracle 10 or today in Oracle 12.1.0.2.

My test environment is a virtual machine with Oracle Linux 7.2 and Oracle 12.1.0.2.161018, and a “bare metal” server running Oracle Linux 6.7 and Oracle 12.1.0.2.160419. Versions are important, as things can change between versions.

Multi threaded redo and dynamic strands
One of the prominent changes to the redo mechanism is what is referred to as ‘multi threaded redo’ or ‘multiple log buffers’. The way the redo structures are described most of the time is the redo log buffer being in the SGA as a single buffer to which change vectors are written (in a circular fashion), protected by a latch (redo allocation latch). That description has not been accurate since Oracle 9.2.

Starting from Oracle 9.2, multiple log buffers can be and most probably are created. At least in Oracle 12.1 you get 2 redo log buffers. Each of these buffers is protected by a redo allocation latch. The benefit of multiple log buffers is that copying redo into the log buffer is not strictly serialised. However, this has some consequences too. One of the consequences is that redo could potentially not be written in strict SCN order by the log writer or its worker processes.

You can see the log buffers in the view x$kcrfstrand:

SYS@testdb AS SYSDBA> select strand_size_kcrfa from x$kcrfstrand where ptr_kcrf_pvt_strand = hextoraw(0) and pnext_buf_kcrfa_cln != hextoraw(0);

STRAND_SIZE_KCRFA
-----------------
         67108864
         67108864

Both values added together exactly match the size of the log buffer:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: log_buffer
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%log_buffer%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden:
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('','Y',3,2))

NAME                         VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
---------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
log_buffer                   134217728                                                              FALSE    FALSE      TRUE

The parameters that handle dynamic redo strands are “hidden”:

SYS@testdb AS SYSDBA> @parms
Enter value for parameter: log_parallelism
old  20: where name like nvl('%&parameter%',name)
new  20: where name like nvl('%log_parallelism%',name)
Enter value for isset:
old  21: and upper(isset) like upper(nvl('%&isset%',isset))
new  21: and upper(isset) like upper(nvl('%%',isset))
Enter value for show_hidden: Y
old  22: and flag not in (decode('&show_hidden','Y',3,2))
new  22: and flag not in (decode('Y','Y',3,2))

NAME                            VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_log_parallelism_dynamic        TRUE                                                                   TRUE     FALSE      FALSE
_log_parallelism_max            2                                                                      TRUE     FALSE      FALSE

The mechanisms can further be investigated using fixed SGA variables:

SYS@testdb AS SYSDBA> oradebug setmypid
Statement processed.
SYS@testdb AS SYSDBA> oradebug dumpvar sga kcrf_max_strands
uword kcrf_max_strands_ [0600283C8, 0600283CC) = 00000002
SYS@testdb AS SYSDBA> oradebug dumpvar sga kcrf_actv_strands
uword kcrf_actv_strands_ [0600283E0, 0600283E4) = 00000001

Note: the output of oradebug is hexadecimal (!). Of course here it doesn’t matter since the values are smaller than 10.
This shows the maximum number of public redo strands is 2 (kcrf_max_strands), as we have seen with x$kcrfstrand, and the number of active strands is 1 (kcrf_actv_strands), which makes sense in my test database, since I am the sole user of the database.
The maximum number of public redo strands seems to be at least 2; some limited research shows the value is set by the formula CPU_COUNT/16 for systems with higher CPU counts. Some sources report that the value for _log_parallelism_max will be reduced to CPU_COUNT if it is set higher manually.
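
Since the oradebug output is hexadecimal, larger values need converting before they can be compared with the numbers from the views. Below is a minimal sketch in Python for single-value lines in the format shown above; the function name ‘parse_dumpvar’ is made up for this example, and other dumpvar output formats are not handled.

```python
def parse_dumpvar(line):
    """Parse a line like 'uword kcrf_max_strands_ [0600283C8, 0600283CC) = 00000002'.

    Returns the variable name (without the trailing underscore) and the
    value converted from hexadecimal to an integer.
    """
    left, _, right = line.rpartition("=")
    name = left.split()[1].rstrip("_")
    return name, int(right.strip(), 16)


print(parse_dumpvar("uword kcrf_max_strands_ [0600283C8, 0600283CC) = 00000002"))
print(parse_dumpvar("uword kcrf_actv_strands_ [0600283E0, 0600283E4) = 00000001"))
```

A value printed by oradebug as 00000012 would come out as 18, which is where reading the output as decimal would go wrong.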

When I count the number of redo allocation latches, I see something odd:

SYS@testdb AS SYSDBA> select count(*) from v$latch_children where name = 'redo allocation';

  COUNT(*)
----------
        20

A redo allocation latch protects a log strand. I just showed my system has actually two public log strands. Why do I have 20 redo allocation latches if my system has a fixed number of public redo buffers? This finding leads to private redo strands.

Private strands
Besides multi threaded public redo buffers, Oracle introduced another optimisation for redo: private strands. Private strands optimise redo generation by letting a session create the change vectors for a transaction directly in a private redo buffer in the SGA, instead of generating the change vectors in the PGA and later copying these to the public redo strand. Some restrictions apply to their use, although I am not aware of a concrete list of restrictions and limitations (the closest thing to such a list would be the table x$ktiff shown further below with in-memory undo, which is directly connected to private strands). Whenever a private strand can not be used, Oracle reverts to the ‘classic’ mechanism of redo generation using the public redo log buffer. Each private strand is also protected by a redo allocation latch, just like the public strands.

The private strands SGA buffers are allocated at startup time:

SYS@testdb AS SYSDBA> select pool, name, bytes from v$sgastat where name like 'private strands';

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  private strands               2451456

It seems the number of private strands allocated from the SGA (and as a consequence the number of redo allocation latches on top of the public strand number) depends on the transactions database parameter: Oracle takes 10% (_log_private_parallelism_mul) of transactions as the number of strands, and each strand takes 129KB plus 4KB overhead:

SYS@testdb AS SYSDBA> select trunc(value * KSPPSTVL / 100) * (129+4) * 1024 "private strands"
                      from (select value from v$parameter where name = 'transactions') a,
                           (select val.KSPPSTVL
                            from sys.x$ksppi nam, sys.x$ksppsv val
                            where nam.indx = val.indx AND nam.ksppinm = '_log_private_parallelism_mul') b;

private strands
---------------
        2451456
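
The arithmetic behind this query can be checked by hand. The following is a small worked example in Python using the numbers from my test instance (transactions is 189, as shown further below); the function names are made up, and the 10%, 129KB and 4KB figures are the ones from the query above.

```python
def allocated_private_strands(transactions, parallelism_mul_pct=10):
    """Number of private strands: a percentage of transactions, truncated."""
    return transactions * parallelism_mul_pct // 100


def private_strands_bytes(strands, strand_kb=129, overhead_kb=4):
    """Shared pool bytes needed for the given number of private strands."""
    return strands * (strand_kb + overhead_kb) * 1024


strands = allocated_private_strands(189)
print(strands)                         # 18 private strands
print(private_strands_bytes(strands))  # 2451456 bytes, as reported by v$sgastat
print(strands + 2)                     # 20: private strands plus the 2 public strands
```

The last number matches the 20 redo allocation latches counted earlier: one latch per private strand plus one per public strand.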

However, that is not all that is in play for private strands. Once the private strands are allocated, and after every logswitch, the number of usable private strands is determined by taking the size of the current online redologfile minus the size of the log buffer, multiplying that by the 5% multiplier (_log_private_mul), and dividing the result by a strand size of 66560 bytes:

SYS@testdb AS SYSDBA> select trunc(((select bytes from v$log where status = 'CURRENT') - (select to_number(value) from v$parameter where name = 'log_buffer'))*
                             (select to_number(val.KSPPSTVL)
                              from sys.x$ksppi nam, sys.x$ksppsv val
                              where nam.indx = val.indx AND nam.ksppinm = '_log_private_mul') / 100 / 66560)
                             as "private strands"
                       from dual;

private strands
---------------
            -22

Yes, that is a negative number. Let’s look at the number of private strands in this instance:

SYS@testdb AS SYSDBA> select indx,strand_size_kcrfa from x$kcrfstrand where last_buf_kcrfa = '00';

no rows selected

The instance takes the lesser of the two calculations above for the number of private strands. That explains why my test instance does not have any private strands! The size of my log buffer is 128M, the size of my online redologfiles is 100M. This shows that sizing the online redologfiles too small can lead to inefficiencies (!!).
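
To see how the size of the online redologfiles plays into this, the second calculation can be reproduced outside the database. This is a sketch in Python under the assumptions stated in this post (5% multiplier, 66560 bytes per strand, the instance taking the lesser of the two numbers); the function names are made up, and treating a negative result as zero strands is my interpretation of what the instance does.

```python
import math

MB = 1024 * 1024


def usable_private_strands(redo_log_bytes, log_buffer_bytes,
                           private_mul_pct=5, strand_bytes=66560):
    """Usable private strands based on the current online redologfile size."""
    space = (redo_log_bytes - log_buffer_bytes) * private_mul_pct / 100
    return math.trunc(space / strand_bytes)


def effective_private_strands(allocated, usable):
    """Lesser of the two calculations; a negative result is treated as none (assumption)."""
    return max(0, min(allocated, usable))


usable = usable_private_strands(100 * MB, 128 * MB)
print(usable)                                 # -22, as in the query above
print(effective_private_strands(18, usable))  # 0: no private strands
```

With these formulas, any online redologfile that is not larger than the log buffer leaves no room for private strands.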

In-Memory Undo
Another mechanism that has been changed in order to optimise generating redo is in-memory undo. Actually in-memory undo is directly tied to private redo strands. The idea of in-memory undo is to store the undo (read consistent related) part of a transaction in an in-memory undo buffer until it is committed or rolled back. The in-memory undo pool can be seen in the shared pool using:

SYS@testdb AS SYSDBA> select pool, name, bytes from v$sgastat where name = 'KTI-UNDO';

POOL         NAME                                                    BYTES
------------ -------------------------------------------------- ----------
shared pool  KTI-UNDO                                              1274976

The in-memory undo pool is externalised through x$ktifp. This view shows the in-memory buffers and their sizes:

SYS@testdb AS SYSDBA> select ktifpno, ktifppsi from x$ktifp;

   KTIFPNO   KTIFPPSI
---------- ----------
         0      65535
...
        17      65535

The number of in-memory undo buffers is determined by the transactions parameter:

SYS@testdb AS SYSDBA> show parameter transactions

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
transactions                         integer     189
transactions_per_rollback_segment    integer     5

The number of buffers defaults to 10% of the transactions value. Every in-memory undo buffer is protected by an in-memory undo latch:

SYS@testdb AS SYSDBA> select count(*) from v$latch_children where name like 'In memory undo latch';
        18

In-memory undo can be switched on and off at the instance-wide level or at the session level using the ‘_in_memory_undo’ parameter, and is true/on by default. Switching off in-memory undo switches off private strands too.

In fact some of these optimisations are announced in the alert log file during startup of an instance:

IMODE=BR
ILAT =18

BR means batched redo mode, which means the private strands feature is turned on (I found that with the too small online redologfiles described above, which caused no private strands to be made available, the database still announces ‘BR’).
ILAT means number of in-memory pools and latches.

Because in-memory undo is directly tied to private redo strands, the usage of private strands and in-memory undo can be monitored using the view x$ktiff:

SYS@testdb AS SYSDBA> select ktiffcat, ktiffflc from x$ktiff;

KTIFFCAT                                     KTIFFFLC
------------------------------------------ ----------
Undo pool overflow flushes                          0
Stack cv flushes                                    1
Multi-block undo flushes                            0
Max. chgs flushes                                   0
NTP flushes                                         0
Contention flushes                                  0
Redo pool overflow flushes                          0
Logfile space flushes                               0
Multiple persistent buffer flushes                  0
Bind time flushes                                   0
Rollback flushes                                    2
Commit flushes                                   1694
Recursive txn flushes                               0
Redo only CR flushes                                0
Ditributed txn flushes                              0
Set txn use rbs flushes                             0
Bitmap state change flushes                         0
Presumed commit violation                           0
Securefile direct-write lob update flushes          0
Unknown                                             0
Unknown                                             0

This view shows the reasons for flushing the in-memory undo buffers.

Throw-away undo
The reason that the database mentions ‘IMODE’ is that there is actually another mode besides ‘BR’: ‘TUA’. The ‘TUA’ abbreviation is a bit weird (Throw Undo Away vs. Throw-away undo?), but it means the database is set to ‘throw-away undo mode’. This is accomplished by setting the undo_retention parameter to a value of 3 or lower, and setting the undocumented parameter _imu_pools to a value greater than 3. Once this is done, and the instance is restarted, it will be reported in the alert.log file as follows:

IMODE=TUA
ILAT =189

The purpose of throw-away undo mode is to throw away undo instead of applying it (the name already hints at that :-)). By enabling throw-away undo, some of the actions normally executed are not done anymore, such as applying undo (not all undo is skipped), which reduces block changes, reads, writes, buffer activity and redo activity. Not performing these actions has consequences: it significantly increases the risk of ORA-1555 snapshot too old errors. Also, a lot of features of the database are not compatible with ‘TUA mode’: flashback database, cluster_database=TRUE and supplemental logging, to name a few. However, it looks to me like a mode designed for a specific purpose: being able to do massive loads or changes at a faster rate, with some consequences. I would add that this mode should always be a temporary setting, and once the desired action is completed, the ‘TUA’ mode should be reverted back to normal behaviour by resetting _imu_pools and increasing undo_retention back to your business transaction needs.

Conclusion
The aim of this blogpost is to describe private strands, in-memory undo and throw-away undo. The name of the blogpost is actually a hat-tip to the paper from Stephan Haisley, which covers most of what is written here. So in that sense, it is a ‘redo’ of his paper, although I tested all of this on an Oracle version 12.1.0.2.161018 database instance.

Private strands and in-memory undo reduce the work needed when changes are applied to the database. You might want to make sure you are using these features, especially since too small online redologfiles can prevent private strands from being made available. I come across a lot of databases which have inappropriately sized online redologfiles.

Throw-away undo mode, or TUA mode, is a mode which can reduce the time spent on bulk actions. However, anything comes at a price. For TUA mode it means the database loses some of its durability properties. However, if this is used for importing huge amounts of data, or doing mass changes without any session concurrency, this could be a welcome time saver.

References:
Hellodba
Stephan Haisley’s 2008 presentation ‘Redo and Undo Optimisations in Oracle 10g’
Tanel Põder’s presentation ‘Performance and Scalability Improvements in Oracle 10g and 11g’
Jonathan Lewis’ write-ups about private strands and ‘Oracle Core essential internals for DBAs and Developers’.
Twitter: @piontekdd, @martinberx, @westendwookie
Thanks: Mauro Pagano
