This blogpost is a follow-up of a previous blogpost, with the difference that the installation method described here is much easier: it uses an Ansible playbook to do most of the installation.

1. Install git and ansible via EPEL:

# yum -y localinstall https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# yum -y install ansible git

2. Clone my ‘prometheus_node_exp_grafana_install’ repository:

# git clone https://gitlab.com/FritsHoogland/prometheus_node_exp_grafana_install.git

3. Run the prometheus.yml playbook to install prometheus, node_exporter and grafana:

# cd prometheus_node_exp_grafana_install
# ansible-playbook prometheus.yml

Now all you need to do is set up grafana (listening on port 3000) to use prometheus as its data source.
If you are interested, my memory usage dashboard can be imported using dashboard id 2747.
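
Before setting up grafana, you can quickly verify that all three components respond. A minimal check, assuming the default ports and reasonably recent versions (these health endpoints are built into prometheus and grafana themselves, they are not something the playbook adds):

# curl -s http://localhost:9090/-/healthy
# curl -s http://localhost:9100/metrics | head -5
# curl -s http://localhost:3000/api/health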

In fact, it seems node_exporter has renamed the operating system statistics, so the old Linux memory dashboard didn’t work anymore. I updated it, and uploaded a new version of the memory usage dashboard.

This blogpost is a look into a bug in the wait interface that I have reported to Oracle a few times. I verified all versions from Oracle 11.2 up to 18.2.0.0.180417 on Linux x86_64; the bug is present in all of these versions. The bug is that the wait event ‘db file async I/O submit’ does not time anything when ASM is used; only when using a filesystem does this wait event essentially time the io_submit system call. All tests are done on Linux x86_64, Oracle Linux 7.4, with database and grid version 18.2.0.0.180417.

So what?
You might not have seen this wait event before; that’s perfectly possible, because this wait event is unique to the database writer. So does this wait event matter?

When the Oracle database engine is set to use asynchronous I/O, and when it makes sense to use asynchronous I/O (!), the engine uses the combination of io_submit() to issue I/O requests to the operating system and, when it needs to, io_getevents() to reap the finished I/O requests. In general (so not consistently), the engine does not time io_submit, which is a non-blocking call; it only times waiting for I/O requests with io_getevents(), which is reported as a wait event in the IO wait event class. A lot of ‘%parallel%’ IO related wait events time such asynchronous IO calls.

So why would the engine then time io_submit() for the database writer?
Well, io_submit() is not a blocking call, UNLESS the device queue to which the requests are submitted is full. This is why the developers of the database writer code decided to implement a wait event for io_submit, which has not been done for any other process.

To understand why this makes sense, a little knowledge about database writer internals is necessary. When blocks are dirtied in the cache and these blocks are checkpointed later on, they must be written to disk. The number of blocks to be written, and therefore the number of writes, can get high very quickly. The way this is processed is quite interesting (simplified obviously, and when using a filesystem):

a) the database writer picks up a batch of blocks needing writing, for up to 128 IO requests.
b) that batch is submitted, timed by ‘db file async I/O submit’
c) a blocking io_getevents call is issued, timed by ‘db file parallel write’, to wait for the IOs to finish. The interesting thing specifically for the database writer is that the minimal number of IOs to wait for is very low (a few IOs, up to 25-75% of the IOs as the number gets bigger). Any finished IO will be picked up here; however, it’s perfectly possible IOs are still active after this step. In fact, I think it’s deliberately made that way.
d) if any IO requests are still pending, a nonblocking, non-wait-event-timed io_getevents call is issued to pick up any finished IOs.
e) if any blocks still need writing for which no IO requests have been submitted, go to a).
f) if at this point IO requests are still pending, go to c).

This means that the database writer can submit huge numbers of IO requests, and keep on doing that, much more than any other process, because it does not need to wait for all IOs to finish. So if there is one process that is likely to run into a blocking io_submit call, it is the database writer.
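
If you want to see this cycle happen, strace can show the calls live. A sketch, assuming an instance named ‘test’, so the database writer process is called ora_dbw0_test (adjust to your instance name):

# strace -e trace=io_submit,io_getevents -p $(pgrep -f ora_dbw0_test)

During a checkpoint, this shows bursts of io_submit calls interleaved with io_getevents calls reaping the finished requests.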

When using a database without ASM, the above wait timing is exactly what happens. A function call graph of io_submit for the database writer when the database uses a filesystem looks like this:

 | | | | | > kslwtbctx(0x7ffc55eb3e60, 0x8b4, ...)
 | | | | | | > sltrgftime64(0x6c2f4288, 0x6bbe3ca0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eb3400, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffc55eb3400, ...)
 | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | <> kslwait_timeout_centi_to_micro(0x7fffffff, 0x19183e92, ...)
 | | | | | | <> kskthbwt(0x19c37b0d8, 0xb3, ...)
 | | | | | | <> kslwt_start_snapshot(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | < kslwt_start_snapshot+0x0000000000d0 returns: 0x6c2f4ae8
 | | | | | <> ksfdgo(0x800, 0, ...)
 | | | | | | > ksfd_skgfqio(0x7fc304483f78, 0x9, ...)
 | | | | | | | > skgfqio(0x7fc3091fddc0, 0x7fc304483f78, ...)
 | | | | | | | | > skgfrvldtrq(0x7fc304483f78, 0x9, ...)
 | | | | | | | | <> sltrgftime64(0x2000, 0x7fc3043772b0, ...)
 | | | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eae3b0, ...)
 | | | | | | | | | | > clock_gettime(0x1, 0x7ffc55eae3b0, ...)
 | | | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | | | <> skgfr_lio_listio64(0x7fc3091fddc0, 0x1, ...)
 | | | | | | | | | > io_submit@plt(0x7fc302992000, 0x115, ...)
 | | | | | | | | | < io_submit+0x000000000007 returns: 0x115
 | | | | | | | | < skgfr_lio_listio64+0x000000000131 returns: 0
 | | | | | | | < skgfqio+0x00000000035e returns: 0
 | | | | | | < ksfd_skgfqio+0x0000000001f5 returns: 0
 | | | | | <> kslwtectx(0x7ffc55eb3e60, 0x7fc304483f78, ...)
 | | | | | | > sltrgftime64(0x7ffc55eb3e60, 0x7fc304483f78, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eb33e0, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffc55eb33e0, ...)
 | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | <> kslwt_end_snapshot(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | > kslwh_enter_waithist_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | <> kslwtrk_enter_wait_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x6bcaa180
 | | | | | | <> kslwt_update_stats_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | > kews_update_wait_time(0x9, 0x8f54, ...)
 | | | | | | | <> ksucpu_wait_update(0x9, 0x8f54, ...)
 | | | | | | | < ksucpu_wait_update+0x000000000036 returns: 0x6bd658b0
 | | | | | | <> kskthewt(0x19c38402c, 0xb3, ...)
 | | | | | | < kskthewt+0x0000000005b1 returns: 0x30

The second argument to kskthbwt and kskthewt, 0xb3, is the wait event number, which can be looked up:

select event#, name from v$event_name where event# = to_number('b3','xx');
    EVENT# NAME
---------- ----------------------------------------------------------------
       179 db file async I/O submit

Now on to the actual purpose of this blog post: the same situation, but now when ASM is used. When ASM is used, there is a significant increase in the call stack, which means more code is executed. This may sound strange at first, but it’s very logical if you give it some thought: when using ASM, the Oracle database talks to raw devices, which means that the functionality a filesystem normally performs must be performed in some way, since it is implemented in ASM. This is done in several additional layers in the database code.

Let’s look at a backtrace of io_submit of the database writer when using a filesystem:

#0  0x00007f22bdb36690 in io_submit () from /lib64/libaio.so.1
#1  0x0000000004832ef0 in skgfr_lio_listio64 ()
#2  0x000000001238b7ce in skgfqio ()
#3  0x0000000011d5c3ad in ksfd_skgfqio ()
#4  0x0000000011d57fce in ksfdgo ()
#5  0x0000000000d9f21c in ksfdaio ()
#6  0x00000000039c4a5e in kcfisio ()
#7  0x0000000001d836ec in kcbbdrv ()
#8  0x000000001222fac5 in ksb_act_run_int ()
#9  0x000000001222e792 in ksb_act_run ()
#10 0x0000000003b8b9ce in ksbabs ()
#11 0x0000000003baa161 in ksbrdp ()
#12 0x0000000003fbaed7 in opirip ()
#13 0x00000000026ecaa0 in opidrv ()
#14 0x00000000032904cf in sou2o ()
#15 0x0000000000d681cd in opimai_real ()
#16 0x000000000329d2a1 in ssthrdmain ()
#17 0x0000000000d680d3 in main ()

If you want to follow the call sequence, a backtrace/stacktrace must be read from the bottom up.
ksb = kernel service background processes
kcf = kernel cache file management
ksfd = kernel service functions disk IO
skgf = o/s dependent kernel generic file
I hope you recognise the logical layers that are necessary for doing the I/O.

Now look at a backtrace of io_submit of the database writer when using ASM:

#0  0x00007f22bdb36690 in io_submit () from /lib64/libaio.so.1
#1  0x0000000004832ef0 in skgfr_lio_listio64 ()
#2  0x000000001238b7ce in skgfqio ()
#3  0x0000000011d5c3ad in ksfd_skgfqio ()
#4  0x0000000011d57fce in ksfdgo ()
#5  0x0000000000d9f21c in ksfdaio ()
#6  0x000000000755c1a8 in kfk_ufs_async_io ()
#7  0x0000000001455fb2 in kfk_submit_io ()
#8  0x00000000014551a8 in kfk_io1 ()
#9  0x0000000001450b3e in kfk_transitIO ()
#10 0x000000000143c450 in kfioSubmitIO ()
#11 0x000000000143bbaa in kfioRequestPriv ()
#12 0x000000000143b160 in kfioRequest ()
#13 0x000000000136f6bd in ksfdafRequest ()
#14 0x000000000137311a in ksfdafGo ()
#15 0x0000000011d58179 in ksfdgo ()
#16 0x0000000000d9f269 in ksfdaio ()
#17 0x00000000039c4a5e in kcfisio ()
#18 0x0000000001d836ec in kcbbdrv ()
#19 0x000000001222fac5 in ksb_act_run_int ()
#20 0x000000001222e792 in ksb_act_run ()
#21 0x0000000003b8b9ce in ksbabs ()
#22 0x0000000003baa161 in ksbrdp ()
#23 0x0000000003fbaed7 in opirip ()
#24 0x00000000026ecaa0 in opidrv ()
#25 0x00000000032904cf in sou2o ()
#26 0x0000000000d681cd in opimai_real ()
#27 0x000000000329d2a1 in ssthrdmain ()
#28 0x0000000000d680d3 in main ()

Essentially, a couple of layers are added to facilitate ASM: ksfdaf, kfio and kfk.
So the logical sequence becomes:
ksb = kernel service background processes
kcf = kernel cache file management
ksfd = kernel service functions disk IO
ksfdaf = kernel service functions disk IO ASM files
kfio = kernel automatic storage management translation I/O layer
kfk = kernel automatic storage management KFK

ksfd = kernel service functions disk IO
skgf = o/s dependent kernel generic file

Now to give an overview of the function call sequence, I simply need to cut out a lot of functions because otherwise it would be unreadable.

 | | | | | > ksfdgo(0x806, 0x35b4, ...)
 | | | | | | > ksfdafGo(0x806, 0x35b4, ...)
 | | | | | | | > ksfdafRequest(0x7ffcc7d845a0, 0x10f, ...)
 | | | | | | | | > kfioRequest(0x7ffcc7d845a0, 0x10f, ...)
 | | | | | | | | | > _setjmp@plt(0x7ffcc7d821d8, 0x10f, ...)
 | | | | | | | | | <> __sigsetjmp(0x7ffcc7d821d8, 0, ...)
 | | | | | | | | | <> __sigjmp_save(0x7ffcc7d821d8, 0, ...)
 | | | | | | | | | <> kfioRequestPriv(0x7ffcc7d845a0, 0x10f, ...)
...
 | | | | | | | | | | | | | | | | > ksfdgo(0x188, 0x35c5, ...)
 | | | | | | | | | | | | | | | | | > ksfd_skgfqio(0x7f4232709f78, 0x9, ...)
 | | | | | | | | | | | | | | | | | | > skgfqio(0x7f4237483dc0, 0x7f4232709f78, ...)
 | | | | | | | | | | | | | | | | | | | > skgfrvldtrq(0x7f4232709f78, 0x9, ...)
 | | | | | | | | | | | | | | | | | | | <> sltrgftime64(0x2000, 0x7f4230b61c98, ...)
 | | | | | | | | | | | | | | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d7bb10, ...)
 | | | | | | | | | | | | | | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d7bb10, ...)
 | | | | | | | | | | | | | | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | | | | | | | | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | | | | | | | | | | | | | | <> skgfr_lio_listio64(0x7f4237483dc0, 0x1, ...)
 | | | | | | | | | | | | | | | | | | | | > io_submit@plt(0x7f4230ad8000, 0x112, ...)
 | | | | | | | | | | | | | | | | | | | | < io_submit+0x000000000007 returns: 0x112
 | | | | | | | | | | | | | | | | | | | < skgfr_lio_listio64+0x000000000131 returns: 0
 | | | | | | | | | | | | | | | | | | < skgfqio+0x00000000035e returns: 0
 | | | | | | | | | | | | | | | | | < ksfd_skgfqio+0x0000000001f5 returns: 0
 | | | | | | | | | | | | | | | | < ksfdgo+0x000000000135 returns: 0
...
 | | | | | | | | | < kfioRequestPriv+0x000000000224 returns: 0
 | | | | | | | | < kfioRequest+0x000000000251 returns: 0
 | | | | | | | < ksfdafRequest+0x0000000003c8 returns: 0
 | | | | | | < ksfdafGo+0x000000000081 returns: 0x1
 | | | | | <> kslwtbctx(0x7ffcc7d86f60, 0x7f4232709f38, ...)
 | | | | | | > sltrgftime64(0x6da39e68, 0x6d2f5bc0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d86500, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d86500, ...)
 | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | <> kslwait_timeout_centi_to_micro(0x7fffffff, 0x14cb3fcb, ...)
 | | | | | | <> kskthbwt(0x2b0f3fe06, 0xb3, ...)
 | | | | | | <> kslwt_start_snapshot(0x6da3b118, 0x6da3b118, ...)
 | | | | | | < kslwt_start_snapshot+0x0000000000d0 returns: 0x6da3a6c8
 | | | | | <> ksfdgo(0x808, 0, ...)
 | | | | | <> kslwtectx(0x7ffcc7d86f60, 0x9, ...)
 | | | | | | > sltrgftime64(0x7ffcc7d86f60, 0x9, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d864e0, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d864e0, ...)
 | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | <> kslwt_end_snapshot(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | > kslwh_enter_waithist_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | <> kslwtrk_enter_wait_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x6dacf1e8
 | | | | | | <> kslwt_update_stats_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | > kews_update_wait_time(0x9, 0xd02, ...)
 | | | | | | | <> ksucpu_wait_update(0x9, 0xd02, ...)
 | | | | | | | < ksucpu_wait_update+0x000000000036 returns: 0x6db40f70
 | | | | | | <> kskthewt(0x2b0f40b08, 0xb3, ...)
 | | | | | | < kskthewt+0x0000000005b1 returns: 0x30
 | | | | | <> ksfdafCopyWaitCtx(0x7ffcc7d86f60, 0xb3, ...)
 | | | | | | > _intel_fast_memcpy(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | | <> _intel_fast_memcpy.P(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | | <> __intel_ssse3_rep_memcpy(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x00000000242e returns: 0x7ffcc7d86f60
 | | | | | < ksfdafCopyWaitCtx+0x000000000038 returns: 0x7ffcc7d86f60
 | | | | < ksfdaio+0x00000000055f returns: 0x7ffcc7d86f60

Now let's test whether the wait event 'db file async I/O submit' actually times io_submit() when ASM is used. First attach to the database writer with oradebug, and enable SQL trace including wait events, so the wait events are written to its trace file:

SQL> oradebug setorapname dbw0
Oracle pid: 18, Unix process pid: 3617, image: oracle@o182-fs.local (DBW0)
SQL> oradebug event sql_trace wait=true
Statement processed.

Then go to the trace directory, and tail the database writer trace file.
Next, attach to the database writer with gdb, break on the io_submit call, and have the debugger perform a 'sleep 1' (sleep for one second) on every hit. This should add 1000000 microseconds to the wait time if the wait event times the function we put the breakpoint on.

(gdb) break io_submit
Breakpoint 1 at 0x7f336b986690
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end

Now continue the database writer, and execute a checkpoint (alter system checkpoint), and look at the wait events:

WAIT #0: nam='db file async I/O submit' ela= 2 requests=11 interrupt=0 timeout=0 obj#=-1 tim=15801301770
WAIT #0: nam='db file parallel write' ela= 5077 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=15801306930

Well, it’s clear nothing has timed the one second we added, right? (the time in the wait event is at ‘ela’, which is in microseconds)

For the sake of completeness, and to validate this test method, let's add the sleep to io_getevents (io_getevents_0_4) to see if 'db file parallel write' does show the extra time we add to the system call, because 'db file parallel write' is supposed to time io_getevents():

(gdb) dis 1
(gdb) break io_getevents_0_4
Breakpoint 2 at 0x7f336b986650
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end

Continue the database writer again, and execute a checkpoint:

WAIT #0: nam='db file async I/O submit' ela= 1 requests=22 interrupt=0 timeout=0 obj#=-1 tim=15983030322
WAIT #0: nam='db file parallel write' ela= 1003978 requests=2 interrupt=0 timeout=2147483647 obj#=-1 tim=15984034336

Yay! There we got the artificial waiting time!

Based on this, I can only come to the conclusion that the wait event ‘db file async I/O submit’ does not perform any actual timing of the io_submit system call when ASM is used with the Oracle database.

This blogpost is about using ansible vault. Vault is a way to protect sensitive information in ansible scripts by encrypting it. The motivation for this blogpost is the lack of a description that made sense to me of what the possibilities for using vault are, and of how to use the vault options in playbooks.

The basic way ansible vault works is as follows: when ansible-playbook reads a yaml file and encounters $ANSIBLE_VAULT;1.1;AES256, indicating that ansible vault was used to encrypt the directly following lines, it uses a password to decrypt them, and then uses the decrypted version in memory only. This way secrets can be hidden from being visible. Obviously, the password allows decryption, so the password must be provided in order for ansible-playbook to decrypt it.

The original use of vault is to encrypt an entire yaml file. As of Ansible version 2.3, ansible allows the encryption of single values in a yaml file.

1. File encryption
A very silly example of ansible-vault, which shouldn’t be used (!) is this:

$ echo "---
- hosts: localhost
  vars:
    a: aaa
    b: bbb
  tasks:

  - name: execute a shell command
    shell: >
      ls" > test1.yml

When this runs, it will execute on localhost, set two ansible variables (a and b) and then use the shell module to execute “ls”. Again, this is just an example, and silly in itself. This is how it’s run:

$ ansible-playbook test1.yml
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

This can be encrypted using ansible vault in the following way:

$ ansible-vault encrypt test1.yml
New Vault password:
Confirm new Vault password:
Encryption successful

Now the yaml file is entirely encrypted:

$ cat test1.yml
$ANSIBLE_VAULT;1.1;AES256
32663536343532643864353135643431373830333236636230613264383531336562323530386632
3933353561316530333538363032633731336135396434630a626631303839646161363432396230
32336130633264343766393562656138363132393835646137633065393739336234336463336138
6564613730333436650a336537326233343566633761383436356136666533326561346530613566
64316336393036346538656338303734623161386435373138656532663737666662633765656438
37303666393939356161393134666130303435333463636630386434626335383164663761383035
35613437356361366665633931346664386535393936396637363561386164613337373562313264
63333530353263363437386536393765393930653234353733313433643637333932353638353339
31623366376566613766366236356163303837666664646465363439313265383562656637623266
6463663065383030626232643365623437666332313631376262

It still can be run, but you need to specify the password. In this case it asks to enter a password:

$ ansible-playbook test1.yml --ask-vault-pass
Vault password:

PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Of course this is not practical. The most common and probable way you want to use this is to have variables containing secret information, like passwords, tokens, private keys, personal data, etc., encrypted, while the playbook itself stays readable, so you can still keep it in a version control tool like git and use functionality like diff!

The first thing to do is move the variables out of the playbook into a dedicated place. In this case I have chosen to assign them per host, in a directory called ‘host_vars’, and for localhost in ‘host_vars/localhost’:

$ cat host_vars/localhost/vars.yml
---
a: aaa
b: bbb

And then of course remove them from the playbook itself:

$ cat test2.yml
---
- hosts: localhost
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"

If this is run, it will pick up the variables (for which I put in a debug task):

$ ansible-playbook test2.yml
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

Okay, now let's say I need to encrypt the variable 'a'. I can bluntly encrypt vars.yml, and that will work; however, then you can't see where a variable comes from. The solution to that is 'indirect referencing'. This is done by adding a second file to host_vars/localhost called 'vault.yml', which will be the encrypted file. In the case of encrypting variable a, do the following:
1. Change the vars.yml file to assign {{ vault_a }} to a:

$ cat host_vars/localhost/vars.yml
---
a: "{{ vault_a }}"
b: bbb

2. Define the value of a to vault_a in vault.yml:

$ cat host_vars/localhost/vault.yml
---
vault_a: aaa

3. Encrypt vault.yml:

$ ansible-vault encrypt host_vars/localhost/vault.yml
New Vault password:
Confirm New Vault password:
Encryption successful

Now it can be used by specifying the password again:

$ ansible-playbook test2.yml --ask-vault-pass
Vault password:
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

This should not be a surprise; as I said, ansible will recognise an encrypted file and decrypt it automatically. However, if you looked carefully, you saw something else that is important to realise: I encrypted the variable a containing 'aaa', and it was revealed to me by the debug statement. I think this is reasonable, because it has been explicitly programmed to do so. You should write your playbooks in such a way that they do not reveal any information that you encrypted previously.
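
As a side note: once a file is encrypted, the ansible-vault executable also provides subcommands to work with it without leaving a decrypted copy behind (password prompts omitted):

$ ansible-vault view host_vars/localhost/vault.yml
$ ansible-vault edit host_vars/localhost/vault.yml
$ ansible-vault rekey host_vars/localhost/vault.yml
$ ansible-vault decrypt host_vars/localhost/vault.yml

'view' decrypts to the terminal only, 'edit' opens the decrypted content in your editor and re-encrypts it on saving, 'rekey' changes the vault password, and 'decrypt' permanently removes the encryption.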

2. Variable encryption
If you have larger sets of variables to be encrypted, the above method makes sense. However, if you want to encrypt only a few variables, it's probably simpler to encrypt just those values. The ansible-vault executable provides a facility for that called 'encrypt_string'. The way this works is that you specify the variable name and value with ansible-vault, which then outputs the encrypted value, which you need to put in a playbook yourself. For example, for the 'a' variable with the value 'aaa':

$ ansible-vault encrypt_string --name 'a' 'aaa'
New Vault password:
Confirm New Vault password:
a: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          37363334323763633562313934656631396431636139653036333036356636326163623238626132
          6138343534396232643638323830326462383831366165630a393335303939663138613437633835
          65396461383565316135316339383035626262343664643563373631366338393361623230666634
          3637363232313935310a323564663863336565623366633838616337393366313831396637386432
          3035
Encryption successful

Now use this as input to a playbook:

---
- hosts: localhost
  vars:
    a: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          37363334323763633562313934656631396431636139653036333036356636326163623238626132
          6138343534396232643638323830326462383831366165630a393335303939663138613437633835
          65396461383565316135316339383035626262343664643563373631366338393361623230666634
          3637363232313935310a323564663863336565623366633838616337393366313831396637386432
          3035
    b: bbb
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"

3. Vault passwords
The next thing to look at is how to specify the decryption password. One way or another, you need a password to decrypt. The first option is to specify the password on the CLI, which is what the previous examples did by specifying '--ask-vault-pass' as an argument to ansible-playbook. This is a good option for running playbooks ad-hoc.

If you want playbooks to run from a facility like ansible tower, rundeck or semaphore, you need to specify the password non-interactively, in a file. This is often implemented as a hidden file, like '.vault_pass.txt'. Of course this means the password is readable if you have command line access. The main idea here is that this gives you an option to encrypt sensitive data in a playbook and store it in a version control system, while NOT storing the password (file) in version control.

$ echo "mypassword" > .vault_pass.txt
$ ansible-playbook --vault-password-file .vault_pass.txt test2.yml
PLAY [localhost] **************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] ************************************************************************************************************************************************
changed: [localhost]

TASK [debug] ******************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP ********************************************************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

There is a caveat here too, which I ran into because my removable HDD enforces mode 777 on my files, which means the execute bit is set:

$ ansible-playbook --vault-password-file .ttt test2.yml
 [WARNING]: Error in vault password file loading (default): Problem running vault password script /Volumes/Samsung_T3/ansible/.ttt ([Errno 8] Exec format error). If this
is not a script, remove the executable bit from the file.

ERROR! Problem running vault password script /Volumes/Samsung_T3/ansible/.ttt ([Errno 8] Exec format error). If this is not a script, remove the executable bit from the file.

The message is self-explanatory: if ansible finds the specified password file to be executable, it will execute it. In my case, I turned the password file into a little shell script that outputs the password:

#!/bin/bash
echo "mypassword"

3.a. Vault-id for passwords
Since ansible version 2.4, the official way to use encryption is by using vault-id. Vault-id is more or less the same as the above usage; however, it adds a feature that could be important: the option to use multiple vault passwords in the same playbook run. The way this works is quite simple: ansible tries all the passwords it has been given until it finds one that works. Please mind vault-id only works with encrypted files, not with encrypted variables! Below is an example of the use of vault-id with multiple passwords:

$ echo "---
- hosts: localhost
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}" > test3.yml
$ echo '---
a: "{{ vault_a }}"
b: "{{ vault_b }}"' > host_vars/localhost/vars.yml
$ echo "password_a" > .a.vault_pass.txt
$ echo "password_b" > .b.vault_pass.txt
$ echo "---
vault_a: aaa" > host_vars/localhost/vault_a.yml
$ echo "---
vault_b: bbb" > host_vars/localhost/vault_b.yml
$ ansible-vault --vault-id a@.a.vault_pass.txt encrypt host_vars/localhost/vault_a.yml
Encryption successful
$ ansible-vault --vault-id b@.b.vault_pass.txt encrypt host_vars/localhost/vault_b.yml
Encryption successful
$ ansible-playbook --vault-id a@.a.vault_pass.txt --vault-id b@.b.vault_pass.txt test3.yml
PLAY [localhost] **************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] ************************************************************************************************************************************************
changed: [localhost]

TASK [debug] ******************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP ********************************************************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

In a blogpost introducing the vagrant builder suite I explained what the suite can do, and its principal use: automating the installation of the Oracle database software and the creation of a database on a virtual machine, using vagrant together with ansible and virtualbox.

This blogpost shows how to use that suite for automating the installation of the Oracle database software and the creation of a database directly on a linux server, using only ansible, without vagrant and virtualbox.

The suite has been updated with all the PSUs up to current (180417; April 2018) for 11.2.0.4, 12.1.0.2 and 12.2.0.1, and now includes Oracle 18. Please mind that for Oracle versions 12.2 and 18 you need to provide the installation media yourself, because these are not downloadable as a patch. I really hate that Oracle does not provide the installation media for these versions in an automatically downloadable way. This is a good opportunity for Oracle to show it's listening and provide a solution, especially because it wants to be a player in the developer/devops field.

The regular use of this suite with vagrant/ansible/virtualbox is to clone the vagrant-builder repository, set the variables in the Vagrantfile, and issue 'vagrant up', which will fetch an O/S image from the vagrant cloud, create an extra disk for Oracle, create a VM specification in virtualbox, boot up the VM, and then run ansible to do the full configuration of Linux for installing the Oracle database, install and patch the Oracle database software, and create a database. Without any human intervention.

Currently, the local/direct install scripts support redhat (actually, a redhat version that manifests itself as ‘Red Hat Enterprise Linux’ or ‘RedHat’ in the ansible fact ‘ansible_distribution’) or oracle linux (‘OracleLinux’), version 7 only. You can get all the facts that ansible gathers on the current host by executing ‘ansible localhost -m setup’.
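
For example, to display only the distribution related facts, use the setup module's filter argument:

$ ansible localhost -m setup -a 'filter=ansible_distribution*'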

To use it natively on a linux system, you must first make sure the operating system provides a /u01 directory with enough space for the Oracle software install and the database. A rough indication would be more than 20G; I regularly use 40G, which allows me to use the database and create tablespaces. It doesn't matter whether it's only a directory or a mount point. Actually, the only thing local_install.yml does with it is change the ownership of the /u01 directory to oracle once that user is created.
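
A quick check of the directory and its available space (the mkdir is only needed if /u01 does not exist yet):

$ sudo mkdir -p /u01
$ df -h /u01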

The first thing to do is install ansible and git. Ansible uses python, and python comes installed with all recent RHEL compatible distributions (as an example, 'yum' uses python too). In my opinion, installing a few additional python packages should not be a problem. If your environment is highly standardised and these kinds of installs are not allowed, you should use the scripts in the "regular" ansible way, which I will detail at the end.

1. Install git
Oracle linux 7 has git in the base repository; I assume this means the same is true for RHEL 7. That means that when the server has the base repository, the latest repository, or a company repository (for version controlling the packages) configured, git can simply be installed using yum:

# yum install git

2. Install ansible
One way of installing ansible is using easy_install and pip. This will get you a recent ansible version. The simplest way would be using an EPEL repository; however, packages in the linux distribution repositories are not updated very frequently in my experience.

# yum install python-setuptools
# easy_install pip
# pip install ansible
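
Verify the installation afterwards (the version you get will differ):

# ansible --version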

3. Clone the vagrant-builder suite
This CLI example clones the vagrant-builder suite into the 'builder' directory; you can name the directory any way you want. The best way of doing this is using a normal (meaning non-root) user that has password-less, unlimited sudo rights. Cloud environments come with such a user by default: for the oracle cloud this is 'opc', for amazon this is 'ec2-user' (and for vagrant boxes this is 'vagrant'). The ansible script uses sudo to execute as root, creates the oracle user, and then uses sudo to execute as oracle.

$ git clone https://gitlab.com/FritsHoogland/vagrant-builder.git builder

4. Run local_install.yml
Now the automatic installation components are all set up. The next thing to do is go into the builder/ansible directory and edit the local_install.yml file:

---
- hosts: localhost
  become: true
  vars:
  - mosuser:
  - mospass:
  - oracle_base: /u01/app/oracle
  - database_name:
  - global_password: oracle
  - db_create_file_dest: /u01/app/oracle/oradata
  - database_version:
  - asm_version: ""
  - stage_directory: /u01/stage
...

Line 5/6: fill out your MOS details. These are needed to download the patches, and the installation media that are delivered as patches.
Line 7: this is the default value for the ORACLE_BASE.
Line 8: you need a database name if you want to create a database. If you set it to empty (“”) no database will be created.
Line 9: this sets this password for all oracle database accounts.
Line 10: this variable sets the place for all the database files, db_create_file_dest.
Line 11: this sets the database software version to install. Ansible will determine which files to download for that version. The Oracle 12.2 installation media must be placed in the builder/ansible/files directory. Look in the Vagrantfile for the specifications of the available versions.
Line 12: the grid software is currently not installed, but the variable needs to be specified in order to have the facts set up. Facts are ansible variables.
Line 13: this is the directory in which all files are staged. It will be created at the beginning of a role, and removed at the end.

After this has been filled out, run the playbook as a user that is allowed to execute sudo without specifying a password:

$ ansible-playbook local_install.yml

Ansible will read the playbook, see that no inventory (list of hosts) is specified and that the hosts specifier is set to localhost, and then run locally.

Just like with its original use with vagrant, this will download all the necessary software directly from Oracle using the MOS credentials, with the exception of the installation media of Oracle 12.2.0.1 and Oracle 18.0.0.0, because these can not be downloaded via the CLI as far as I know, and therefore have to be provided in the ansible/files directory.

If you want to speed up the build, or do not have a good enough network connection, you can put the installation media in the files directory anyway; the ansible script will look in the files directory for them. The patches for performing patching (all MOS downloads are patches, of which some are actually installation media) are always downloaded from MOS.

Using ansible non-locally.
If you are not allowed to install additional software on the oracle server, there is another way. However, if that is the reason for using it, it means you are essentially being asked to install the oracle software manually. Regardless of what is allowed, you have to make changes to the operating system to facilitate running oracle anyway, and apparently the environment is not that automated, so such a restriction would be a bit silly.

The non-local way is running ansible the way it is normally used: a server with ansible installed uses ssh to run the ansible playbooks on one or more remote servers, which only need python installed, which EL6 and EL7 have by default. The local_install.yml script can be modified very easily to be used in this way: the hosts specification must be changed to 'all', or to a name given to a group of servers in the inventory file, as sketched below. A full treatment however is beyond the scope of this article.
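
A minimal sketch of what this could look like (the host names and the group name here are made up for illustration):

$ cat inventory
[dbservers]
server1.example.com
server2.example.com

$ grep hosts local_install.yml
- hosts: dbservers

$ ansible-playbook -i inventory local_install.yml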

However, if you need to perform database software installations and database creations regularly in an environment, it would make sense to use a centralised server to perform these actions, instead of setting it up on every distinct server.

Bonus material.
I added another script in the ansible directory, db_management.yml, which performs database creation and removal (and the installation of slob). It follows the same pattern as local_install.yml: fill out the variables and, unique to this script, uncomment the action you want it to perform, then run it.

Starting from Oracle 12, in a default configured database, there are more log writer processes than just the well-known 'LGWR' process itself: the 'LGnn' processes:

$ ps -ef | grep test | grep lg
oracle   18048     1  0 12:50 ?        00:00:13 ora_lgwr_test
oracle   18052     1  0 12:50 ?        00:00:06 ora_lg00_test
oracle   18056     1  0 12:50 ?        00:00:00 ora_lg01_test

These are the log writer worker processes, of which the minimal number is equal to the number of public redo strands. Worker processes are assigned to a group, and a group is assigned to a public redo strand. The number of worker processes in a group depends on the undocumented parameter "_max_log_write_parallelism", which is one by default.

The actual usage of the worker processes depends in the first place on the value of the undocumented parameter "_use_single_log_writer", whose default value is 'ADAPTIVE'. This means the instance switches automatically between 'single log writer mode', which is the traditional way of the LGWR process handling everything the log writer functionality needs to do, and 'scalable log writer mode', in which the log writer functionality presumably uses the log writer worker processes.

Other values for "_use_single_log_writer" are 'TRUE' to fix 'single log writer mode', or 'FALSE' to fix 'scalable log writer mode'.

I assume most readers of this blog will know that the master log writer's idle work cycle is sleeping on a semaphore (semtimedop()) under the wait event 'rdbms ipc message' for 3 seconds, then performing some "housekeeping", after which it sleeps again, repeating this small cycle of sleeping and housekeeping. For the log writer worker processes this looks different, as the wait event information of the log writer worker processes shows:

135,59779,@1    14346                    DEDICATED oracle@memory-presentation.local (LGWR)	    time:1909.44ms,event:rdbms ipc message,seq#:292
48,34282,@1     14350                    DEDICATED oracle@memory-presentation.local (LG00)	    time:57561.85ms,event:LGWR worker group idle,seq#:150
136,24935,@1    14354                    DEDICATED oracle@memory-presentation.local (LG01)	    time:112785.66ms,event:LGWR worker group idle,seq#:74
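
(For reference: similar wait information can be queried from v$session and v$process; a sketch, not the script that produced the output above:)

$ sqlplus -s / as sysdba <<EOF
select s.sid, p.spid, p.pname, s.event, s.seconds_in_wait
from v\$session s, v\$process p
where s.paddr = p.addr and p.pname in ('LGWR','LG00','LG01');
EOF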

The master log writer process (LGWR) had been sleeping for 1.9s when I queried the database, and it will sleep for at most 3 seconds, then do some work and sleep again. However, the log writer worker processes had been sleeping for much longer: LG00 for 57.6s and LG01 for 112.8s, and in a different event: 'LGWR worker group idle'. How is this implemented? Let's look!

$ strace -p $(pgrep lg01)
strace: Process 14354 attached
semtimedop(360448, [{27, -1, 0}], 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
semtimedop(360448, [{27, -1, 0}], 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

I used strace on the LG01 process, and it's still doing what most idle background processes do: sleeping on a semaphore for 3 seconds at a time. But it does not end its wait like LGWR does; the event the log writer worker processes are waiting in keeps on being timed.
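
Pin's DebugTrace example tool can be used to look deeper into what happens inside the wait. A sketch of attaching it to the worker process; the Pin kit location ($PIN_HOME) and the tool path within the kit are assumptions that differ per Pin version:

$ $PIN_HOME/pin -pid $(pgrep lg01) -t $PIN_HOME/source/tools/DebugTrace/obj-intel64/debugtrace.so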

The debugtrace output of the idle cycle shows the following:

 | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff
 | | > __errno_location(0x38000, 0x7ffce278c328, ...)
 | | | > fthread_self(0x38000, 0x7ffce278c328, ...)
 | | | < fthread_self+0x000000000024 returns: 0
 | | < __errno_location+0x000000000010 returns: 0x7f7e930a26a0
 | < sskgpwwait+0x00000000014e returns: 0
 < skgpwwait+0x0000000000e0 returns: 0
 > ksuSdiInProgress(0x19e80, 0x19e80, ...)
 < ksuSdiInProgress+0x000000000035 returns: 0
 > sltrgftime64(0x19e80, 0x19e80, ...)
 | > clock_gettime@plt(0x1, 0x7ffce278c3a0, ...)
 | | > clock_gettime(0x1, 0x7ffce278c3a0, ...)
 | | < clock_gettime+0x000000000069 returns: 0
 | < clock_gettime+0x00000000003a returns: 0
 < sltrgftime64+0x00000000004c returns: 0x19c253f3ff
 > kslwo_getcbk(0xa2, 0xd80fa62, ...)
 < kslwo_getcbk+0x000000000017 returns: 0
 > kgslwait_last_waitctx_time_waited_usecs(0x7f7e930a29a0, 0x6dfd01c0, ...)
 < kgslwait_last_waitctx_time_waited_usecs+0x000000000045 returns: 0x25e5e80
 > kskiorm(0x6d1854a8, 0, ...)
 < kskiorm+0x00000000001e returns: 0
 > kfias_iswtgon_ksfd(0x6d1854a8, 0, ...)
 < kfias_iswtgon_ksfd+0x00000000002b returns: 0
 > kxdbio_has_work(0x7ffce278c3c4, 0x6003d010, ...)
 < kxdbio_has_work+0x000000000027 returns: 0
 > skgpwwait(0x7ffce278c630, 0x7f7e930a7ca0, ...)
 | > kslwait_conv_wait_time(0x2dc6c0, 0x7f7e930a7ca0, ...)
 | < kslwait_conv_wait_time+0x000000000027 returns: 0x2dc6c0
 | > sskgpwwait(0x7ffce278c630, 0x7f7e930a7ca0, ...)
 | | > semtimedop(0x38000, 0x7ffce278c328, ...)
 | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff

And a full stack trace of a log writer worker look like this:

$ pstack $(pgrep lg01)
#0  0x00007feda8eaebda in semtimedop () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000010f9cca6 in sskgpwwait ()
#2  0x0000000010f9a2e8 in skgpwwait ()
#3  0x0000000010a66995 in ksliwat ()
#4  0x0000000010a65d25 in kslwaitctx ()
#5  0x00000000031fb4d0 in kcrfw_slave_queue_remove ()
#6  0x00000000031fad2a in kcrfw_slave_group_main ()
#7  0x00000000012160fa in ksvrdp_int ()
#8  0x000000000370d99a in opirip ()
#9  0x0000000001eb034a in opidrv ()
#10 0x0000000002afedf1 in sou2o ()
#11 0x0000000000d0547a in opimai_real ()
#12 0x0000000002b09b31 in ssthrdmain ()
#13 0x0000000000d05386 in main ()

If you combine the pstack backtrace and the debugtrace information, you see that the idle cycle does not leave the 'ksliwat' function, so the wait event is never ended. Quickly looking at the other functions: the process reads the system clock (sltrgftime64), updates some information (kgslwait_last_waitctx_time_waited_usecs) and then performs some proactive IO checks (kskiorm, kfias_iswtgon_ksfd, kxdbio_has_work), after which it calls the post/wait based functions to sleep on the semaphore again.

The conclusion so far is that the log writer workers do perform a 3 second sleep just like the master log writer; however, the wait event 'LGWR worker group idle' is not interrupted the way 'rdbms ipc message' is for the master log writer. This means the wait time of this event for each worker process indicates the last time the worker process actually performed something. The next logical question then is: what do the log writer worker processes actually perform? Do they entirely take over the master log writer functionality, or do they work together with the master log writer?

In order to fully understand the next part, it is very beneficial to read up on how the log writer works in 'single log writer' mode, where the master log writer handles the idle and work cycle itself:
https://fritshoogland.wordpress.com/2018/02/20/a-look-into-into-oracle-redo-part-4-the-log-writer-null-write/
https://fritshoogland.wordpress.com/2018/02/27/a-look-into-oracle-redo-part-5-the-log-writer-writing/

If you want to perform this investigation yourself, make sure the database is in 'scalable log writer' mode by setting "_use_single_log_writer" to FALSE. This is exactly what I did to make sure the log writes below are done in 'scalable log writer' mode.
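
A sketch of fixing the mode; this is an undocumented parameter, so only do this on a test system, and set it via the spfile and restart the instance to be sure it takes effect:

SQL> alter system set "_use_single_log_writer"=FALSE scope=spfile;
SQL> shutdown immediate
SQL> startup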

Now let's first apply some logic. Above, the idle cycle of a log writer worker process is shown. Based on the 'log writer null write' blog post, we know that the log writer advances the LWN and on-disk SCN every 3 seconds. Clearly, the log writer worker processes do not do that, so that must mean the master log writer is still performing that function. That also makes very much sense: it doesn't matter for scalability whether the master log writer or a worker process advances the LWN and on-disk SCN, because nothing is waiting on it. Plus, if the master log writer performs most of its functions just like in 'single log writer' mode, the change to scalable mode means no change for client processes: any committing process must semop() the log writer to start writing.

Let’s look at the relevant debugtrace output of the master log writer in scalable log writer mode:

 | > kcrfw_redo_write_driver(0, 0, ...)
 | | > kcrfw_handle_member_write_errors(0, 0, ...)
 | | < kcrfw_handle_member_write_errors+0x000000000020 returns: 0x600161a0
 | | > kcmgtsf(0, 0, ...)
 | | | > sltrgatime64(0, 0, ...)
 | | | | > sltrgftime64(0, 0, ...)
 | | | | | > clock_gettime@plt(0x1, 0x7fff1fe13010, ...)
 | | | | | | > clock_gettime(0x1, 0x7fff1fe13010, ...)
 | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | < sltrgftime64+0x00000000004c returns: 0x53747fe42
 | | | < sltrgatime64+0x00000000003e returns: 0x155d4fd
 | | < kcmgtsf+0x00000000032f returns: 0x3a182314
 | | > kcrfw_slave_adaptive_updatemode(0, 0x600161a0, ...)
 | | < kcrfw_slave_adaptive_updatemode+0x000000000080 returns: 0x7efe34d1f760
 | | > kcrfw_defer_write(0, 0x600161a0, ...)
 | | < kcrfw_defer_write+0x000000000038 returns: 0x7efe34d1f760
 | | > kcrfw_slave_queue_find(0, 0x600161a0, ...)
 | | < kcrfw_slave_queue_find+0x0000000000f1 returns: 0
 | | > kcrfw_slave_queue_setpreparing(0, 0x1, ...)
 | | < kcrfw_slave_queue_setpreparing+0x000000000021 returns: 0
 | | > kcrfw_slave_group_switchpic(0, 0x1, ...)
 | | < kcrfw_slave_group_switchpic+0x000000000050 returns: 0x699b4508
 | | > skgstmGetEpochTs(0, 0x1, ...)
 | | | > gettimeofday@plt(0x7fff1fe13070, 0, ...)
 | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | < skgstmGetEpochTs+0x000000000049 returns: 0x20debfd6192e5
 | | > kcsnew3(0x600113b8, 0x7fff1fe13228, ...)
 | | | > kcsnew8(0x600113b8, 0x7fff1fe13070, ...)
 | | | | > kslgetl(0x60049800, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > kslfre(0x60049800, 0x1, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | < kcsnew8+0x000000000117 returns: 0
 | | | > ub8_to_kscn_impl(0x66c3c7, 0x7fff1fe13228, ...)
 | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | < kcsnew3+0x00000000006f returns: 0x8000
 | | > ktfwtsm(0x3a182314, 0x7fff1fe13228, ...)
 | | | > kcmgtsf(0x2, 0x7fff1fe13228, ...)
 | | | | > sltrgatime64(0x2, 0x7fff1fe13228, ...)
 | | | | | > sltrgftime64(0x2, 0x7fff1fe13228, ...)
 | | | | | | > clock_gettime@plt(0x1, 0x7fff1fe12fe0, ...)
 | | | | | | | > clock_gettime(0x1, 0x7fff1fe12fe0, ...)
 | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | < sltrgftime64+0x00000000004c returns: 0x537484a6d
 | | | | < sltrgatime64+0x00000000003e returns: 0x155d511
 | | | < kcmgtsf+0x0000000001b2 returns: 0x3a182314
 | | | > kcmtdif(0x3a182314, 0x3a182314, ...)
 | | | < kcmtdif+0x00000000001b returns: 0
 | | | > ksl_get_shared_latch_int(0x60050340, 0x6ddb1408, ...)
 | | | < ksl_get_shared_latch_int+0x00000000016b returns: 0x1
 | | <> kslfre(0x60050340, 0x66c3c7, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kcn_stm_write(0x7fff1fe13228, 0x66c3c7, ...)
 | | | > kstmgetsectick(0x7fff1fe13228, 0x66c3c7, ...)
 | | | < kstmgetsectick+0x00000000003a returns: 0x5ae4c494
 | | | > ksl_get_shared_latch_int(0x6004ee40, 0x6ddb1408, ...)
 | | | < ksl_get_shared_latch_int+0x00000000016b returns: 0x1
 | | <> kslfre(0x6004ee40, 0x2244, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kcrfw_redo_write_initpic(0x699b4508, 0x7fff1fe13228, ...)
 | | | > kscn_to_ub8_impl(0x7fff1fe13228, 0x7fff1fe13228, ...)
 | | | < kscn_to_ub8_impl+0x00000000003e returns: 0x66c3c7
 | | < kcrfw_redo_write_initpic+0x0000000000dc returns: 0x3a182314
 | | > kscn_to_ub8_impl(0x7fff1fe13228, 0, ...)
 | | < kscn_to_ub8_impl+0x00000000003e returns: 0x66c3c7
 | | > kcrfw_gather_lwn(0x7fff1fe13268, 0x699b4508, ...)
 | | | > kslgetl(0x6abe4538, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7fff1fe13268, 0, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x6abe4538, 0x17d5f, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | | > kslgetl(0x6abe45d8, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7fff1fe13268, 0x1, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x6abe45d8, 0x137, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | < kcrfw_gather_lwn+0x00000000065c returns: 0xffffffff
 | | > krsh_trace(0x1000, 0x200, ...)
 | | < krsh_trace+0x00000000005d returns: 0
 | | > kspgip(0x71e, 0x1, ...)
 | | < kspgip+0x00000000023f returns: 0
 | | > kcrfw_slave_queue_setpreparing(0, 0, ...)
 | | < kcrfw_slave_queue_setpreparing+0x000000000021 returns: 0
 | | > kcrfw_slave_queue_flush_internal(0x1, 0, ...)
 | | < kcrfw_slave_queue_flush_internal+0x0000000000d7 returns: 0x1
 | | > kcrfw_do_null_write(0, 0, ...)
 | | | > kcrfw_slave_phase_batchdo(0, 0, ...)
 | | | | > kcrfw_slave_phase_enter(0, 0x9b, ...)
 | | | | < kcrfw_slave_phase_enter+0x000000000449 returns: 0
 | | | <> kcrfw_slave_phase_exit(0, 0x9b, ...)
 | | | < kcrfw_slave_phase_exit+0x00000000035a returns: 0
 | | | > kcrfw_post(0, 0, ...)
 | | | | > kcrfw_slave_single_getactivegroup(0, 0, ...)
 | | | | < kcrfw_slave_single_getactivegroup+0x000000000047 returns: 0x6a9a0718
 | | | | > kspGetInstType(0x1, 0x1, ...)
 | | | | | > vsnffe_internal(0x19, 0x1, ...)
 | | | | | | > vsnfprd(0x19, 0x1, ...)
 | | | | | | < vsnfprd+0x00000000000f returns: 0x8
 | | | | | | > kfIsASMOn(0x19, 0x1, ...)
 | | | | | | <> kfOsmInstanceSafe(0x19, 0x1, ...)
 | | | | | | < kfOsmInstanceSafe+0x000000000031 returns: 0
 | | | | | < vsnffe_internal+0x0000000000a7 returns: 0
 | | | | | > kspges(0x115, 0x1, ...)
 | | | | | < kspges+0x00000000010f returns: 0
 | | | | < kspGetInstType+0x0000000000b1 returns: 0x1
 | | | | > kcrfw_slave_phase_enter(0x1, 0x9b, ...)
 | | | | < kcrfw_slave_phase_enter+0x00000000006f returns: 0x9b
 | | | | > kcscu8(0x60016290, 0x7fff1fe12f98, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcsaj8(0x60016290, 0x7fff1fe12f38, ...)
 | | | | < kcsaj8+0x0000000000dc returns: 0x1
 | | | | > kcrfw_slave_phase_exit(0x1, 0x9b, ...)
 | | | | < kcrfw_slave_phase_exit+0x00000000008e returns: 0
 | | | | > kslpsemf(0x97, 0, ...)
 | | | | | > ksl_postm_init(0x7fff1fe0ac30, 0x7fff1fe12c50, ...)
 | | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | < kslpsemf+0x0000000006b5 returns: 0x1f
 | | | | > kcrfw_slave_barrier_nonmasterwait(0x6a9a0720, 0x4, ...)
 | | | | < kcrfw_slave_barrier_nonmasterwait+0x000000000035 returns: 0x600161a0
 | | | < kcrfw_post+0x000000000c1c returns: 0xd3
 | | < kcrfw_do_null_write+0x0000000000b2 returns: 0xd3
 | < kcrfw_redo_write_driver+0x000000000535 returns: 0xd3

The extra functions executed when the instance is set to scalable log writer mode, or when adaptive mode has put the instance in scalable log writer mode, are largely the kcrfw_slave_* functions, along with kcrfw_defer_write. This means that the difference between the modes is minimal when there are no writes; outside of a few extra functions, the log writer does exactly the same.

The absence of any spectacular changes in the behaviour of the log writer in scalable log writer mode when there are no writes hints at what the actual change of scalable mode is: how writing is handled. In single log writer mode, the log writer process is likely to spend most of its time on writing the change vectors into the online redologfiles, and maybe, if you have a bad application (!) with large numbers of committing processes, semop()-ing the foreground sessions comes second, because every process needs to be semop()-ed individually. These two functions, along with some other functionality, are exactly what the log writer worker processes perform.

This means that foreground processes do nothing different in scalable log writer mode: they signal (semop) the master log writer, which investigates the public redo strands, and if the master log writer finds change vectors to write, it assigns log writer worker processes to perform the write. The log writer worker process then semop()s the foreground sessions to indicate the redo has been written when the instance is in post/wait mode, or does not semop() when the instance is in polling mode.

This is the entire function flow of a write when the instance is in scalable log writer mode:

 | > kcrfw_slave_queue_insert(0, 0xd3, ...)
 | | > kcrfw_slave_group_setcurrsize(0, 0, ...)
 | | < kcrfw_slave_group_setcurrsize+0x0000000001d1 returns: 0x1
 | | > _intel_fast_memcpy(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | <> _intel_fast_memcpy.P(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | <> __intel_ssse3_rep_memcpy(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x6a9a05f8
 | | > kcrfw_slave_group_postall(0, 0xf0, ...)
 | | | > ksvgcls(0, 0xf0, ...)
 | | | < ksvgcls+0x000000000021 returns: 0
 | | | > ksl_post_proc(0x6ddb32f0, 0, ...)
 | | | <> kskpthr(0x6ddb32f0, 0, ...)
 | | | <> kslpsprns(0x6ddb32f0, 0, ...)
 | | | | > ksl_update_post_stats(0x6ddb32f0, 0, ...)
 | | | | | > dbgtTrcData_int(0x7f464c0676c0, 0x2050031, ...)
 | | | | | | > dbgtBucketRedirect(0x7f464c0676c0, 0x7ffdae335338, ...)
 | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
 | | | | | | > dbgtIncInMemTrcRedirect(0x7f464c0676c0, 0x6fa, ...)
 | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
 | | | | | | > skgstmGetEpochTs(0x7f464c0676c0, 0x6fa, ...)
 | | | | | | | > gettimeofday@plt(0x7ffdae334e40, 0, ...)
 | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x20e067375b55d
 | | | | | | > dbgtrRecAllocate(0x7f464c0676c0, 0x7ffdae3352e0, ...)
 | | | | | | | > dbgtrPrepareWrite(0x7f464c0676c0, 0x65accba0, ...)
 | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x4
 | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
 | | | | | | > _intel_fast_memcpy(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | <> _intel_fast_memcpy.P(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | <> __intel_ssse3_rep_memcpy(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x000000002030 returns: 0x65acda30
 | | | | | | > dbgtrRecEndSegment(0x7f464c0676c0, 0x7ffdae3352e0, ...)
 | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
 | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
 | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
 | | | | > skgpwpost(0x7ffdae335480, 0x7f464c0acca0, ...)
 | | | | <> sskgpwpost(0x7ffdae335480, 0x7f464c0acca0, ...)
 | | | | | > semop@plt(0xc0000, 0x7ffdae335410, ...)
 | | | | | < semop+0x00000000000f returns: 0
 | | | | < sskgpwpost+0x00000000009a returns: 0x1
 | | | < kslpsprns+0x0000000001c3 returns: 0
 | | < kcrfw_slave_group_postall+0x0000000000a8 returns: 0
 | < kcrfw_slave_queue_insert+0x0000000001b6 returns: 0x667bc540

After the instance has established there are change vectors in kcrfw_gather_lwn, in single log writer mode the function kcrfw_redo_write is called, which calls kcrfw_do_write to handle the writing, and kslpslf to semop any waiting processes, among other things. In scalable log writer mode, kcrfw_slave_queue_insert is called instead, which assigns work to the worker processes, after which kcrfw_slave_group_postall is called to semop one or more worker processes.

The worker processes are sleeping on a semaphore, and if a worker process gets signalled, it exits the kcrfw_slave_queue_remove function, ends its wait event, and calls kcrfw_redo_write, just like the master log writer would in single log writer mode, which includes doing the write (kcrfw_do_write) and posting the foregrounds (kslpslf): exactly the same functions.

Conclusion.
The adaptive scalable log writer feature has been silently introduced with Oracle 12, although a lot of the functionality it uses has been available in some form in earlier versions. It is a fully automatic feature which turns itself on and off based on heuristics. The purpose of this article is to explain how it works and what it is doing. Essentially, all the functionality that surrounds a log writer write has been moved to worker processes, which means that work can be done in parallel by multiple processes, whilst all the work outside of the write itself, which is not performance critical, is left with the master log writer.
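
If you want to verify or influence the mode on a test system, the adaptive behaviour is steered by an underscore parameter; a sketch as SYSDBA (if my memory serves me right the parameter is _use_single_log_writer with the values TRUE, FALSE and ADAPTIVE, but treat that as an assumption and verify it on your version):

SQL> select a.ksppinm, b.ksppstvl from x$ksppi a, x$ksppcv b where a.indx = b.indx and a.ksppinm = '_use_single_log_writer';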

I have gotten some requests to provide an overview of the redo series of blogposts I am currently running. Here it is:

https://fritshoogland.wordpress.com/2018/01/29/a-look-into-oracle-redo-part-1-redo-allocation-latches/
https://fritshoogland.wordpress.com/2018/02/05/a-look-into-oracle-redo-part-2-the-discovery-of-the-kcrfa-structure/
https://fritshoogland.wordpress.com/2018/02/12/a-look-into-oracle-redo-part-3-the-log-writer-work-cycle-overview/
https://fritshoogland.wordpress.com/2018/02/20/a-look-into-into-oracle-redo-part-4-the-log-writer-null-write/
https://fritshoogland.wordpress.com/2018/02/27/a-look-into-oracle-redo-part-5-the-log-writer-writing/
https://fritshoogland.wordpress.com/2018/03/05/a-look-into-oracle-redo-part-6-oracle-post-wait-commit-and-the-on-disk-scn/
https://fritshoogland.wordpress.com/2018/03/19/a-look-into-oracle-redo-part-7-adaptive-log-file-sync/
https://fritshoogland.wordpress.com/2018/03/26/a-look-into-oracle-redo-part-8-generate-redo/
https://fritshoogland.wordpress.com/2018/04/03/a-look-into-oracle-redo-part-9-commit/
https://fritshoogland.wordpress.com/2018/04/09/a-look-into-oracle-redo-part-9a-commit-concurrency-considerations/
https://fritshoogland.wordpress.com/2018/04/16/a-look-into-oracle-redo-part-10-commit_wait-and-commit_logging/

Private redo strands, In memory undo and throw away undo: https://fritshoogland.wordpress.com/2016/11/15/redo-a-blogpost/

During the investigations for my previous blogpost about what happens during a commit and when the data becomes available, I used breaks in gdb (GNU debugger) at various places in the execution of an insert and a commit, to see what is visible to other sessions during the various stages of execution of the commit.

However, I did find something else, which is very logical, but easily overlooked: at certain moments access to the table is blocked/serialised in order to let a session make changes to blocks belonging to the table, or to peripheral blocks like undo, for the sake of consistency. These are changes made at the physical layer of an Oracle segment; the logical model of Oracle says that writers don’t block readers.

The first question is whether this is a big issue. Well, for that you should simply look at any normal usage of an Oracle database. I don’t believe these serialisation moments are an issue, because the majority of my clients does not experience serialisation problems.

However, this can become an issue if database processes can not run at will. What that means is that if processes are randomly stopped from execution by the operating system or hypervisor, a database process could be stopped at one of these ‘serialisation points’, which means access for all other processes remains blocked until the process is made running again, giving it the opportunity to release the exclusive access to the resource.

So what can prevent a database process from running at will?
– CPU oversubscription. If more processes are running than CPUs are available (think about cores and threads, and what these mean for your CPU architecture), the operating system needs to decide which processes it wants to run, and which it wants to have wait. No operating system scheduler understands or can know when an Oracle database process is running in a critical part of code, and therefore should not be stopped.
– Memory oversubscription. Whenever your system starts actively swapping, it is moving active pages onto the swap device, and reading active pages back from the swap device for the processes that demand them. This activates additional tasks, and makes processes get stuck waiting for memory to become available at all kinds of points during execution. Often, swapping is the beginning of the end, meaning the services on the server stop functioning altogether for multiple reasons, like the Out Of Memory killer, or simply clients of a process timing out.
– Oversubscription of CPU or memory using virtualisation. As a troubleshooter, I would say this is actually worse than the ‘simple’ oversubscription mentioned in the two earlier points, because typically only the virtual machine is visible, and the virtual machine itself is not oversubscribed: the oversubscription takes place at a layer invisible to the virtual machine, which then inexplicably performs unpredictably. It is simply not okay to oversubscribe any machine that needs to run something that is latency sensitive like the Oracle database, unless you fully understand all the trade-offs that come with it, and you have the ability to detect when it starts to occur.

Whenever database processes can not run at will, you will see waits you didn’t see when processes could run at will. Typical waits in that case are (not an exhaustive list):
– any latch waits
– buffer busy waits
– any cursor: pin waits
Please mind these waits are not unique to this issue; any of these waits can occur for other reasons too.

Let me show a couple of examples in which the wait is artificially created by stopping execution at a point where this serialisation takes place, thereby mimicking getting put off CPU due to load, using the earlier insert and commit execution:

a) latch: cache buffers chains
Whenever a process needs to read or modify a database buffer, it needs to ‘pin’ the buffer. Such a pin is also known as a buffer handle or a buffer state object. The pin, which is a memory area, must be obtained from a freelist and attached to the buffer header in order to pin the buffer. A pin is obtained using the kcbzgs (kernel cache buffers helper functions get state object) function, which calls the kssadf_numa_intl (kernel service state objects add from freelist numa internal) function which initialises the state object.

The function that performs the actual pin of a block is not a single one; pinning a block is done in multiple functions. Two of them are kcbgtcr (kernel cache buffers get consistent read) and kcbgcur (kernel cache buffers get current). Under normal circumstances, concurrency for a consistent read of a buffer (kcbgtcr) does not lead to any blocking, because the latch (cache buffers chains) can be taken in shared mode, and the pinning is done in ‘fast mode’ (my interpretation):

> kcbgtcr(0x7f88a8b79db0, 0, ...)
| > kscn_to_ub8_impl(0x7f88a8b88d4c, 0x6baefdc0, ...)
| < kscn_to_ub8_impl+0x00000000003e returns: 0x287562
| > ktrexf(0x7fff998123f8, 0x7f88ae5d7f30, ...)
| | > ktrEvalBlockForCR(0x7f88a8b88d4c, 0x7f88ae5d7f30, ...)
| | < ktrEvalBlockForCR+0x0000000000f4 returns: 0x1
| < ktrexf+0x0000000000a4 returns: 0x9
| > kcbzgsf(0x1, 0x7fff998120b8, ...)
| | > kssadf_numa_intl(0x36, 0x6ddc0b50, ...)
| | | > kpdbIdToUid(0, 0x6ddc0b50, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x6bf84908
| < kcbzgsf+0x0000000001a0 returns: 0x6bf84908
| > kcbz_fp_cr(0x8dff9478, 0x6bf84998, ...)
| < kcbz_fp_cr+0x000000000089 returns: 0
| > kscn_to_ub8_impl(0x7f88a8b88d70, 0x1, ...)
| < kscn_to_ub8_impl+0x00000000003e returns: 0
< kcbgtcr+0x000000000b38 returns: 0x8df94014

Inside the kcbgtcr function, kcbzgsf (kernel cache buffers helper functions get state object fast) is called, and after that, the state object is pinned to the block in kcbz_fp_cr (kernel cache buffers helper functions fast pin for consistent read). Do you see the absence of a ksl_get_shared_latch function to serialise adding this pin? Actually, the cache buffers chains latch is gotten, but it’s done inside the kcbgtcr function: the latch is gotten in shared mode before ktrexf (kernel transaction redo examine block fast), and released after the kcbz_fp_cr function. Multiple processes can pin the block for consistent read without serialising or blocking each other, because the cache buffers chains latch is a shared latch, and it is taken in shared mode.

Why do I show this? Well, in the “old” days, which is Oracle 8.1, Oracle did not use shared latches (at least on the platforms I worked on; if my memory serves me right, shared latches were introduced starting from Oracle 9), which meant that for any logical read the cache buffers chains latch had to be obtained exclusively. This could get painful for very frequently visited blocks, like index root blocks, under high concurrency, and additional to that, because there was no post/wait messaging for latch waits but only spinning, this meant lots of processes waiting for latch number 98, and CPU usage going through the roof.

Back to today. Below is shown how the function calls look when DML is done. When a buffer change is done (data is changed), the buffer has to be obtained in current mode, which is done using the kcbgcur function:

> kcbgcur(0x7ffda5957c70, 0x2, ...)
| > kcbzgs(0x1, 0x2, ...)
| | > kssadf_numa_intl(0x36, 0x7a6086c8, ...)
| | | > kpdbIdToUid(0, 0x7a6086c8, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x7867c9c0
| < kcbzgs+0x000000000178 returns: 0x7867c9c0
| > kti_is_imu(0x7867ca50, 0xb0f71018, ...)
| | > ktcgtx(0x76f29d50, 0xb0f71018, ...)
| | < ktcgtx+0x00000000001e returns: 0x76f29d50
| < kti_is_imu+0x000000000039 returns: 0
< kcbgcur+0x0000000009fb returns: 0xb0302014

This looks reasonably similar to the previous call overview, which showed kcbgtcr; however, the kcbzgsf function changed to kcbzgs, so minus the ‘f’ for fast, and a few other functions are missing. Another interesting thing (not visible) is that the cache buffers chains latch is obtained after the kcbzgs function that obtains the buffer state object, and the latch is obtained with the special bit 0x1000000000000000 set, to indicate the shared latch is taken in non-shared mode. After the kti_is_imu function the pin is attached to the buffer and the latch is freed.

To make things a little more complicated, a buffer can be gotten in current mode while the cache buffers chains latch is still taken in shared mode. This is how that looks:

> kcbgcur(0x7ffda5955620, 0x1, ...)
| > kcbzgs(0x1, 0x1, ...)
| | > kssadf_numa_intl(0x36, 0x7a6086c8, ...)
| | | > kpdbIdToUid(0, 0x7a6086c8, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x7867c9c0
| < kcbzgs+0x000000000178 returns: 0x7867c9c0
| > kcbz_fp_shr(0xb0fa3cb8, 0x7867ca50, ...)
| < kcbz_fp_shr+0x000000000085 returns: 0x7f3011d82f40
< kcbgcur+0x0000000009fb returns: 0xb07a0014

The function kcbzgs without ‘f’ is called, but the cache buffers chains latch is gotten in shared mode (not visible; done in the kcbgcur function after kcbzgs), and there’s the function kcbz_fp_shr (kernel cache buffers helper function fast pin shared) to pin the buffer in shared mode.

In order to be allowed to change a buffer, the cache buffers chains latch must be taken in non-shared mode, to change the buffer header to indicate the buffer is busy, and to guarantee only the process itself can access it. Obtaining the cache buffers chains latch in non-shared mode means that access to the hash bucket (multiple hash buckets, actually) to which the buffer is linked is blocked until the latch is freed. Please mind the period that the latch is held here is very short.
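
If you want an impression of how much cache buffers chains latch contention a system actually experiences, the cumulative latch statistics can be queried; a simple sketch using the standard v$latch columns:

SQL> select gets, misses, sleeps from v$latch where name = 'cache buffers chains';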

However, this blocking scenario can be easily replayed:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of session 1’s process and attach to it with gdb.
3. session 2: break on the kti_is_imu function (break kti_is_imu) and continue.
4. session 1: insert a row into a table (insert into test values (‘a’);). The breakpoint will fire in gdb, stopping execution.
5. session 3: startup another database foreground session.
6. session 3: select the table on which the insert just broke (select * from test;). This will hang.
7. session 4: startup a SYSDBA session, and obtain the wait state of the foreground sessions:

SQL> @who active

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -------------
59,53450,@1     28166    SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:98
102,53150,@1    28083    TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:66867.83ms,event:ON CPU:SQL,seq#:33
101,6637,@1     28186    TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:344013.82ms,event:latch: cache buffers chains,p1:1811473056,p2:289,p3:24599 > Blocked by (inst:sid): 1:102,seq#:39

SID 101 is waiting on the cache buffers chains latch, because SID 102 is holding it. SID 102 shows ON CPU, because we broke execution outside of a wait event, so Oracle thinks it is busy executing, while it is actually stopped by gdb.

Again, this is a wait (latch: cache buffers chains) that is unlikely to be an issue, and if it is, you either ran into a bug, or it is only the way a much bigger problem (oversubscription, bad application design) is showing itself.
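
For completeness, the gdb side of steps 2 and 3 in this runbook looks something like this (the PID is obviously specific to your system):

$ gdb -p <pid of session 1's server process>
(gdb) break kti_is_imu
(gdb) continue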

b) buffer busy waits
The name of the wait event ‘buffer busy wait’ is self explanatory; the technical implementation is far less known. What technically happens during a buffer busy wait is that a session needs to read a certain buffer, and in order to read it, it performs the usual actions of selecting the correct cache buffers chains bucket, following the double linked list of buffer headers to find the best buffer, and then obtaining a buffer state object. The critical part for running into the buffer busy wait is when the session checks the state of the buffer via the buffer header and finds the change state (X$BH.CSTATE) to be higher than 0, indicating the block is currently being changed. At this point the session calls kcbzwb (kernel cache buffers helper function wait for buffer), and must wait for the buffer state to be changed back and the buffer therefore to become accessible, which is timed by the buffer busy wait event. When the change state is reverted to 0, the buffer is accessible again.
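
A quick way to catch buffers in a changed state on a running system is to query the CSTATE column directly; a sketch, to be run as SYSDBA, using the same X$BH columns that appear in the query further down:

SQL> select dbablk, class, cstate from x$bh where cstate > 0;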

Once a buffer is pinned in current mode using kcbgcur and kcbzgs, the buffer is still fully accessible for all sessions. Only once the database truly starts to change the buffer, which is done in kcbchg1_main, is the cache buffers chains latch taken non-shared, the CSTATE field in the buffer header changed to 1, and the latch freed again. My current investigations show that the CSTATE is changed to 2 after the kcopcv and kcrfw_copy_cv functions, so when the redo has been copied into the public redo strand, and set to 4 after the changes have been applied to the buffers, which is done in kcbapl. At the end of the kcbchg1_main function, when all the changes are done, the cache buffers chains latch is taken non-shared again, the CSTATE field is set to 0, and the latch is freed to enable access to the buffer again.

68801:kcbchg1_main+2536:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):R:8:0/0()
68802:kcbchg1_main+2536:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):W:8:0x1000000000000022/1152921504606847010()
..
68897:kcbchg1_main+3318:0x00000000967eddda(Variable Size(pgsz:2048k)|+901701082 buffer header|9668A000+66 ):W:1:0x1/1()
..
68939:kcbchg1_main+3715:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):R:8:0x1000000000000022/1152921504606847010()
68940:kcbchg1_main+3715:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):W:8:0/0()

Above you see a cache buffers chains latch being obtained non shared (the special bit 0x1000000000000000 is set); a little later the buffer header is updated, setting the value 1 at offset 66. Once this happens, it is visible in the CSTATE column of X$BH. A little further on, the latch is freed.
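
As a small worked example of that latch value: 0x1000000000000022 is the special bit 0x1000000000000000 bitwise OR-ed with 0x22, and 0x22 is 34 decimal, which presumably identifies the holding process (that last interpretation is my assumption).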

Why am I showing all of this? Well, there is quite some work that is done while the buffer is in a changed state (CSTATE>0) and therefore inaccessible. In normal situations this is not a problem, because despite all the work, this is done very quickly and is therefore hardly noticeable or measurable, so there is no or very little time spent in buffer busy waits in most databases.

However… if processes do get stuck randomly, it’s possible that a process gets stuck while making changes to a buffer, which means any additional access to the buffer results in buffer busy waits. Also mind that a change to a table buffer requires (at least; in the most simple case) three buffers to be changed: the buffer containing the table data, the buffer containing the undo segment transaction table, and the buffer containing the actual undo. A commit requires one buffer change: the undo segment transaction table.

Let me show you an example:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of session 1’s process and attach to it with gdb.
3. session 2: break on ktugur and kcopcv and continue.
4. session 3: startup a database foreground session.
5. session 1: insert a row into a table: insert into test values (‘a’);
6. session 2: gdb breaks execution of session 1 because it encountered ktugur.
7. session 3: select * from test; this runs unblocked.
8. session 4: startup a database foreground session as sysdba.
9. session 4: select * from v$lock; this will hang.

This one wasn’t obvious; I found it while doing the tests. When a foreground session is in ktugur, it is creating the change vectors (kernel transaction undo generate undo and redo), and it holds two buffers in current mode:

SQL> l
  1  select 	ts.name tablespace,
  2  	dbablk,
  3  	class,
  4  	state,
  5  	decode(state,0,'free',1,'xcur',2,'scur',3,'cr', 4,'read',5,'mrec',6,'irec',7,'write',8,'pi', 9,'memory',10,'mwrite',11,'donated', 12,'protected',  13,'securefile', 14,'siop',15,'recckpt', 16, 'flashfree',  17, 'flashcur', 18, 'flashna') state_decoded,
  6  	mode_held,
  7  	changes,
  8  	cstate,
  9  	flag,
 10  	flag2,
 11  	rflag,
 12  	sflag,
 13  	lru_flag,
 14  	dirty_queue
 15  from 	  x$bh b
 16  	, v$tablespace ts
 17  where 	1=1
 18  and	not (us_nxt = us_prv and wa_nxt = wa_prv and to_number(wa_nxt,'xxxxxxxxxxxxxxxx') = to_number(us_nxt,'xxxxxxxxxxxxxxxx') + 16)
 19* and	b.ts#=ts.ts#
SQL> /

TABLESPACE                         DBABLK      CLASS      STATE STATE_DECO  MODE_HELD    CHANGES     CSTATE       FLAG      FLAG2      RFLAG      SFLAG   LRU_FLAG DIRTY_QUEUE
------------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -----------
UNDOTBS1                              128         17          1 xcur                2          1          0    2097152          0          0          0          0           0
TS                                  24597          1          1 xcur                2          1          0    2097153          0          0          0          0           0

The thing that was not obvious at first is that the blocks have CSTATE 0, so they can be accessed and read for a consistent read. Then I looked at the wait and its p1/2/3:

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -------------------------------------------------------
100,262,@1      6274     SYS             DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:2519.48ms,event:buffer busy waits,p1:3,p2:128,p3:17 > Blocked by (inst:sid): 1:59,seq#:476
142,15103,@1    6695     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:47115.23ms,event:SQL*Net message from client,seq#:226
12,25165,@1     7036     SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:3165
59,31601,@1     5891     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:31292.51ms,event:ON CPU:SQL,seq#:111

Above, SID 100 is session 4; it hangs in a buffer busy wait. The datafile is #3, which is the undo tablespace UNDOTBS1, and the buffer it is waiting for is block 128. Why does SID 100 need to wait for the buffer, while another session (session 3 in the above runbook) can run without getting blocked, for a buffer that is held in exactly the same state (xcur and CSTATE zero)?

The answer can be gotten by executing pstack on the waiting process (or by attaching with gdb and obtaining a backtrace):

(gdb) bt
#0  0x00007fd6a085bbda in semtimedop () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000010f9cca6 in sskgpwwait ()
#2  0x0000000010f9a2e8 in skgpwwait ()
#3  0x0000000010a66995 in ksliwat ()
#4  0x0000000010a65d25 in kslwaitctx ()
#5  0x00000000015c5e5e in kcbzwb ()
#6  0x0000000010b3748e in kcbgcur ()
#7  0x0000000010ab9b17 in ktuGetUsegHdr ()
#8  0x00000000014075d3 in ktcxbcProcessInPDB ()
#9  0x0000000001406de2 in ktcxbc ()
#10 0x0000000010ea8781 in qerfxFetch ()
#11 0x0000000010bb919c in rwsfcd ()
#12 0x00000000035ae618 in qeruaFetch ()
#13 0x00000000035ad837 in qervwFetch ()
#14 0x0000000010bb919c in rwsfcd ()
#15 0x0000000010ea1528 in qerhnFetch ()
#16 0x0000000010bb919c in rwsfcd ()
#17 0x0000000010ea1528 in qerhnFetch ()
#18 0x0000000010c9eeb3 in opifch2 ()
#19 0x000000000268c3de in kpoal8 ()
#20 0x0000000010ca5b3d in opiodr ()
#21 0x0000000010f37a29 in ttcpip ()
#22 0x0000000001eb4674 in opitsk ()
#23 0x0000000001eb914d in opiino ()
#24 0x0000000010ca5b3d in opiodr ()
#25 0x0000000001eb04ed in opidrv ()
#26 0x0000000002afedf1 in sou2o ()
#27 0x0000000000d05577 in opimai_real ()
#28 0x0000000002b09b31 in ssthrdmain ()
#29 0x0000000000d05386 in main ()

The important bit is at #6: kcbgcur (kernel cache buffers get buffer in current mode)! In other words: the block is gotten in current mode while another session is holding the block in current mode, which means it has to wait. The function kcbzwb means kernel cache buffers helper function wait for buffer.

I really wondered why the undo segment buffer is gotten in current mode; this is a read, there is no intention to change anything. The current educated guess is that this is a way to be sure the buffer contains the current state of the transactions across all nodes in a RAC cluster. So this might be a clever optimisation to be absolutely sure the current state of the undo segment transaction table is gotten.
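
As a side note: ktuGetUsegHdr in the backtrace suggests that file 3, block 128 is an undo segment header, which can be cross-checked in the dictionary; a sketch, assuming the usual meaning of the file_id and block_id columns of dba_rollback_segs:

SQL> select segment_name from dba_rollback_segs where file_id = 3 and block_id = 128;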

10. session 2: continue. This will break on kcopcv.
11. session 3: select * from test; this will hang.

Now we have two hanging sessions waiting for the wait event buffer busy wait:

SQL> @who active

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -----------------
100,262,@1      6274     SYS             DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:120790ms,event:buffer busy waits,p1:3,p2:128,p3:35 > Blocked by (inst:sid): 1:59,seq#:501
142,15103,@1    6695     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:1175715.03ms,event:buffer busy waits,p1:4,p2:24597,p3:1 > Blocked by (inst:sid): 1:59,seq#:250
12,25165,@1     7036     SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:3309
59,31601,@1     5891     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:9544.97ms,event:ON CPU:SQL,seq#:116

SID 100 still hangs in wait event buffer busy wait for file 3, block 128, but now the session doing the full table scan (SID 142) also hangs in wait event buffer busy wait, for file 4, block 24597. Let’s look at the buffers held in current mode again:

SQL> @pinned_blocks

TABLESPACE                         DBABLK      CLASS      STATE STATE_DECO  MODE_HELD    CHANGES     CSTATE       FLAG      FLAG2      RFLAG      SFLAG   LRU_FLAG DIRTY_QUEUE
------------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -----------
UNDOTBS1                              128         35          1 xcur                2          0          1    2097156          0          0          0          0           0
UNDOTBS1                             3127         36          1 xcur                2          0          1    2097156          0          0          0          0           0
TS                                  24597          1          1 xcur                2          0          1    2097157          0          0          0          0           0

Because the insert progressed to kcopcv, the CSTATE of all three buffers involved in the insert (the undo segment transaction table, the undo segment undo entry, and the buffer which holds the inserted row) is set to 1. This means the two buffer busy waits actually have two different reasons: the select * from v$lock still waits to get a buffer in current mode, and the select * from test, which does a consistent read, is now waiting for the CSTATE to be set back to 0. In order to resolve the hanging, go to session 2 and continue. This will break again on kcopcv, and again if you continue, because there are 3 blocks involved in this insert. Another way to make the insert finish is to disable all breakpoints (disable command) or to quit gdb.

Just for the fun of it, here is another example of how a stuck session can escalate: a process that gets stuck executing kcbapl (kernel cache buffers apply a change to a buffer). This scenario can be replayed using the following steps:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of session 1’s process and attach to it with gdb.
3. session 2: break on kcbapl and continue.
4. session 3: startup a database foreground session.
5. session 3: insert a row into a table: insert into test values (‘a’);
6. session 1: insert a row into a table: insert into test values (‘a’);
7. session 2: gdb breaks execution of session 1 because it encountered kcbapl.
8. session 3: commit; this will hang.

Can you reason why session 3 is hanging? Both sessions are doing inserts into the same table… Give it a thought and replay it yourself.

Explanation: the breakpoint stopped execution before the changes were actually made. CSTATE for the buffer is set to 1, and it will get set to 2 inside kcbapl to indicate the change vectors have been copied. But the important bit is that a redo copy latch was taken before executing kcbapl. The commit in session 3 sent a message to the logwriter asynchronously asking it to write (first invocation of kcrf_commit_force_int), then freed the logical locking of the insert, and then invoked kcrf_commit_force_int again to wait for the logwriter to finish writing and get notified. However, the logwriter needs to obtain all the redo copy latches (it actually reads a metadata structure that indicates whether the redo copy latches are free or not), at which point it needs to wait, because session 1 is stuck holding a redo copy latch. Not only does this show how dangerous it is when sessions can’t run, it also shows how dangerous it can be to run gdb against a production database: it can be quite hard to figure out what the implications of attaching gdb to a process will be. So the actual outcome of the above is that session 3 waits in the event ‘log file sync’, and the logwriter is waiting in ‘LGWR wait for redo copy’.
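
If you replay this scenario, the redo copy latch that session 1 is stuck holding should be visible while everything hangs; a sketch using the standard v$latchholder view:

SQL> select pid, sid, name from v$latchholder;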

Summary
The Oracle database is carefully set up to be able to handle concurrency. In most normal situations you will see no, to maybe very little, time spent in wait events that indicate concurrency, like latches or buffer busy waits. If an Oracle database is on a machine that has more active processes than available CPU threads or cores (this depends on the hardware and CPU architecture; for Intel based architectures you should lean towards cores, for SPARC you can lean more towards threads), the operating system needs to make choices, and will do that trying to give all processes an equal share. This means that processes will be put off CPU in order to give all processes a fair share of CPU run time. Also mind other oversubscription possibilities, like using more memory than physically available, and oversubscription through virtualisation, which is handing out more VCPUs than CPU threads, or handing out more memory than physically available. Because you can’t look outside your virtual machine, virtualisation oversubscription issues are often harder to troubleshoot.

Oracle really tries to minimise the amount of time spent while holding a non-shared latch. Therefore, with modern Oracle databases it’s unlikely that latch waits show up, even if a system is (not too heavily) oversubscribed. If you do have a latching problem, it’s most likely that you have another problem which causes it.

The most well documented reason for buffer busy waits is a buffer being written to persistent media, during which the buffer is marked busy for the duration of the write. Another documented reason is getting a buffer in a mode that is not compatible with the mode it is held in. Having a buffer in current mode will block another request for the buffer in current mode. But having a buffer pinned in current mode does not block consistent read access; it’s only the actual changing of the buffer that requires the buffer to be marked busy. This state is externalised in the CSTATE column of the view X$BH, which shows fields of the buffer header struct that is used to manage the buffer cache buffers. Because there is quite some work to be done when a buffer is changed, and during these changes the buffer is not usable (‘busy’) for other processes, a buffer busy wait is more likely to occur when oversubscription is taking place.

Thanks to Jonathan Lewis for explaining how to get the buffer header addresses.
Thanks to Abel Macias for brainstorming.
