Tag Archives: redo

Starting from Oracle 12, in a default configured database, there are more log writer processes than the well known ‘LGWR’ process itself, which are the ‘LGnn’ processes:

$ ps -ef | grep test | grep lg
oracle   18048     1  0 12:50 ?        00:00:13 ora_lgwr_test
oracle   18052     1  0 12:50 ?        00:00:06 ora_lg00_test
oracle   18056     1  0 12:50 ?        00:00:00 ora_lg01_test

These are the log writer worker processes, for which the minimal amount is equal to the amount public redo strands. Worker processes are assigned to a group, and the group is assigned to a public redo strand. The amount of worker processes in the group is dependent on the undocumented parameter “_max_log_write_parallelism”, which is one by default.

The actual usage of the worker processes is dependent in the first place on the value of the undocumented parameter “_use_single_log_writer”, for which the default value is ‘ADAPTIVE’, which means it’s switching automatically between ‘single log writer mode’, which is the traditional way of the LGWR process handling everything that the log writer functionality needs to do, and the ‘scalable log writer mode’, which means the log writer functionality is presumably using the log writer worker processes.

Other values for “_use_single_log_writer” are ‘TRUE’ to set ‘single log writer mode’, or ‘FALSE’ to set ‘scalable log writer mode’ fixed.

I assume most readers of this blog will know that the master log writer idle work cycle is sleeping on a semaphore (semtimedop()) under the wait event ‘rdbms ipc message’ for 3 seconds, then performs some “housekeeping”, after which it’ll sleep again repeating the small cycle of sleeping and housekeeping. For the log writer worker processes, this looks different if you look at the wait event information of the log writer worker processes:

135,59779,@1    14346                    DEDICATED oracle@memory-presentation.local (LGWR)	    time:1909.44ms,event:rdbms ipc message,seq#:292
48,34282,@1     14350                    DEDICATED oracle@memory-presentation.local (LG00)	    time:57561.85ms,event:LGWR worker group idle,seq#:150
136,24935,@1    14354                    DEDICATED oracle@memory-presentation.local (LG01)	    time:112785.66ms,event:LGWR worker group idle,seq#:74

The master log writer process (LGWR) has been sleeping for 1.9s when I queried the database, and it will sleep for 3 seconds, and then do some work and sleep again. However, the log writer worker processes have been sleeping for much longer: LG00 for 57.6s and LG01 for 112.8s, and the event is different: ‘LGWR worker group idle’. How is this implemented? Let’s look!

$ strace -p $(pgrep lg01)
strace: Process 14354 attached
semtimedop(360448, [{27, -1, 0}], 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
semtimedop(360448, [{27, -1, 0}], 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

I used strace on the LG01 process, and it’s still doing the same as most idle background processes are doing: sleeping on a semaphore for 3 seconds. But, it does not end its wait like LGWR does, the event the log writer worker processes are waiting in keeps on being timed.

Using a pin tools debugtrace shows the following:

 | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff
 | | > __errno_location(0x38000, 0x7ffce278c328, ...)
 | | | > fthread_self(0x38000, 0x7ffce278c328, ...)
 | | | < fthread_self+0x000000000024 returns: 0
 | | < __errno_location+0x000000000010 returns: 0x7f7e930a26a0
 | < sskgpwwait+0x00000000014e returns: 0
 < skgpwwait+0x0000000000e0 returns: 0
 > ksuSdiInProgress(0x19e80, 0x19e80, ...)
 < ksuSdiInProgress+0x000000000035 returns: 0
 > sltrgftime64(0x19e80, 0x19e80, ...)
 | > clock_gettime@plt(0x1, 0x7ffce278c3a0, ...)
 | | > clock_gettime(0x1, 0x7ffce278c3a0, ...)
 | | < clock_gettime+0x000000000069 returns: 0
 | < clock_gettime+0x00000000003a returns: 0
 < sltrgftime64+0x00000000004c returns: 0x19c253f3ff
 > kslwo_getcbk(0xa2, 0xd80fa62, ...)
 < kslwo_getcbk+0x000000000017 returns: 0
 > kgslwait_last_waitctx_time_waited_usecs(0x7f7e930a29a0, 0x6dfd01c0, ...)
 < kgslwait_last_waitctx_time_waited_usecs+0x000000000045 returns: 0x25e5e80
 > kskiorm(0x6d1854a8, 0, ...)
 < kskiorm+0x00000000001e returns: 0
 > kfias_iswtgon_ksfd(0x6d1854a8, 0, ...)
 < kfias_iswtgon_ksfd+0x00000000002b returns: 0
 > kxdbio_has_work(0x7ffce278c3c4, 0x6003d010, ...)
 < kxdbio_has_work+0x000000000027 returns: 0
 > skgpwwait(0x7ffce278c630, 0x7f7e930a7ca0, ...)
 | > kslwait_conv_wait_time(0x2dc6c0, 0x7f7e930a7ca0, ...)
 | < kslwait_conv_wait_time+0x000000000027 returns: 0x2dc6c0
 | > sskgpwwait(0x7ffce278c630, 0x7f7e930a7ca0, ...)
 | | > semtimedop(0x38000, 0x7ffce278c328, ...)
 | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff

And a full stack trace of a log writer worker look like this:

$ pstack $(pgrep lg01)
#0  0x00007feda8eaebda in semtimedop () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000010f9cca6 in sskgpwwait ()
#2  0x0000000010f9a2e8 in skgpwwait ()
#3  0x0000000010a66995 in ksliwat ()
#4  0x0000000010a65d25 in kslwaitctx ()
#5  0x00000000031fb4d0 in kcrfw_slave_queue_remove ()
#6  0x00000000031fad2a in kcrfw_slave_group_main ()
#7  0x00000000012160fa in ksvrdp_int ()
#8  0x000000000370d99a in opirip ()
#9  0x0000000001eb034a in opidrv ()
#10 0x0000000002afedf1 in sou2o ()
#11 0x0000000000d0547a in opimai_real ()
#12 0x0000000002b09b31 in ssthrdmain ()
#13 0x0000000000d05386 in main ()

If you combine the pstack backtrace and the debugtrace information, you see that the idle cycle does not leave the ‘ksliwat’ function, so the wait event is not finished. Quickly looking at the other functions, it’s easy to spot it reads the system clock (sltrgftime64), updates some information (kgslwait_last_waitctx_time_waited_usecs) and then performs some proactive IO checks (kskiorm, kfias_iswtgon_ksfd, kxdbio_has_work) after which it calls the post/wait based functions to setup the semaphore again.

Conclusion so far is the log writer workers do perform a 3 second sleep just like the master log writer, however the wait event ‘LGWR worker group idle’ is not interrupted like ‘rdbms ipc message’ is for the master log writer. This means the wait time for the event for each worker process indicates the last time the worker process actually performed something. A next logical question then is: but what do the log writer worker processes perform? Do they entirely take over the master log writer functionality, or do they work together with the master log writer?

In order to fully understand the next part, it is very beneficial to read up on how the log writer works in ‘single log writer’ mode, where the master log writer handling the idle and work cycle itself:

If you want to perform this investigation yourself, make sure the database is in ‘scalable log writer’ mode, by setting “_use_single_log_writer” to FALSE. This is exactly what I did in order to make sure a log write is done in ‘scalable log writer’ mode.

Now let’s first apply some logic. Above the idle cycle of a log writer worker process is shown. Based on the ‘log writer null write’ blog post, we know that the log writer does advance the LWN and On-disk SCN every 3 seconds. Clearly, the log writer worker process does not do that. So that must mean the master log writer is still performing that function. It would also make very much sense, because it doesn’t matter for scalability if the master log writer performs the function of advancing the LWN and On-disk SCN or a worker process, nothing is waiting on it. Plus, if the master log writer performs most of its functions just like in ‘single log writer’ mode, the change to scalable mode would mean no change for client processes, any committing process must semop() the log writer to start writing.

Let’s look at the relevant debugtrace output of the master log writer in scalable log writer mode:

 | > kcrfw_redo_write_driver(0, 0, ...)
 | | > kcrfw_handle_member_write_errors(0, 0, ...)
 | | < kcrfw_handle_member_write_errors+0x000000000020 returns: 0x600161a0
 | | > kcmgtsf(0, 0, ...)
 | | | > sltrgatime64(0, 0, ...)
 | | | | > sltrgftime64(0, 0, ...)
 | | | | | > clock_gettime@plt(0x1, 0x7fff1fe13010, ...)
 | | | | | | > clock_gettime(0x1, 0x7fff1fe13010, ...)
 | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | < sltrgftime64+0x00000000004c returns: 0x53747fe42
 | | | < sltrgatime64+0x00000000003e returns: 0x155d4fd
 | | < kcmgtsf+0x00000000032f returns: 0x3a182314
 | | > kcrfw_slave_adaptive_updatemode(0, 0x600161a0, ...)
 | | < kcrfw_slave_adaptive_updatemode+0x000000000080 returns: 0x7efe34d1f760
 | | > kcrfw_defer_write(0, 0x600161a0, ...)
 | | < kcrfw_defer_write+0x000000000038 returns: 0x7efe34d1f760
 | | > kcrfw_slave_queue_find(0, 0x600161a0, ...)
 | | < kcrfw_slave_queue_find+0x0000000000f1 returns: 0
 | | > kcrfw_slave_queue_setpreparing(0, 0x1, ...)
 | | < kcrfw_slave_queue_setpreparing+0x000000000021 returns: 0
 | | > kcrfw_slave_group_switchpic(0, 0x1, ...)
 | | < kcrfw_slave_group_switchpic+0x000000000050 returns: 0x699b4508
 | | > skgstmGetEpochTs(0, 0x1, ...)
 | | | > gettimeofday@plt(0x7fff1fe13070, 0, ...)
 | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | < skgstmGetEpochTs+0x000000000049 returns: 0x20debfd6192e5
 | | > kcsnew3(0x600113b8, 0x7fff1fe13228, ...)
 | | | > kcsnew8(0x600113b8, 0x7fff1fe13070, ...)
 | | | | > kslgetl(0x60049800, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > kslfre(0x60049800, 0x1, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | < kcsnew8+0x000000000117 returns: 0
 | | | > ub8_to_kscn_impl(0x66c3c7, 0x7fff1fe13228, ...)
 | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | < kcsnew3+0x00000000006f returns: 0x8000
 | | > ktfwtsm(0x3a182314, 0x7fff1fe13228, ...)
 | | | > kcmgtsf(0x2, 0x7fff1fe13228, ...)
 | | | | > sltrgatime64(0x2, 0x7fff1fe13228, ...)
 | | | | | > sltrgftime64(0x2, 0x7fff1fe13228, ...)
 | | | | | | > clock_gettime@plt(0x1, 0x7fff1fe12fe0, ...)
 | | | | | | | > clock_gettime(0x1, 0x7fff1fe12fe0, ...)
 | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | < sltrgftime64+0x00000000004c returns: 0x537484a6d
 | | | | < sltrgatime64+0x00000000003e returns: 0x155d511
 | | | < kcmgtsf+0x0000000001b2 returns: 0x3a182314
 | | | > kcmtdif(0x3a182314, 0x3a182314, ...)
 | | | < kcmtdif+0x00000000001b returns: 0
 | | | > ksl_get_shared_latch_int(0x60050340, 0x6ddb1408, ...)
 | | | < ksl_get_shared_latch_int+0x00000000016b returns: 0x1
 | | <> kslfre(0x60050340, 0x66c3c7, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kcn_stm_write(0x7fff1fe13228, 0x66c3c7, ...)
 | | | > kstmgetsectick(0x7fff1fe13228, 0x66c3c7, ...)
 | | | < kstmgetsectick+0x00000000003a returns: 0x5ae4c494
 | | | > ksl_get_shared_latch_int(0x6004ee40, 0x6ddb1408, ...)
 | | | < ksl_get_shared_latch_int+0x00000000016b returns: 0x1
 | | <> kslfre(0x6004ee40, 0x2244, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kcrfw_redo_write_initpic(0x699b4508, 0x7fff1fe13228, ...)
 | | | > kscn_to_ub8_impl(0x7fff1fe13228, 0x7fff1fe13228, ...)
 | | | < kscn_to_ub8_impl+0x00000000003e returns: 0x66c3c7
 | | < kcrfw_redo_write_initpic+0x0000000000dc returns: 0x3a182314
 | | > kscn_to_ub8_impl(0x7fff1fe13228, 0, ...)
 | | < kscn_to_ub8_impl+0x00000000003e returns: 0x66c3c7
 | | > kcrfw_gather_lwn(0x7fff1fe13268, 0x699b4508, ...)
 | | | > kslgetl(0x6abe4538, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7fff1fe13268, 0, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x6abe4538, 0x17d5f, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | | > kslgetl(0x6abe45d8, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7fff1fe13268, 0x1, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x6abe45d8, 0x137, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | < kcrfw_gather_lwn+0x00000000065c returns: 0xffffffff
 | | > krsh_trace(0x1000, 0x200, ...)
 | | < krsh_trace+0x00000000005d returns: 0
 | | > kspgip(0x71e, 0x1, ...)
 | | < kspgip+0x00000000023f returns: 0
 | | > kcrfw_slave_queue_setpreparing(0, 0, ...)
 | | < kcrfw_slave_queue_setpreparing+0x000000000021 returns: 0
 | | > kcrfw_slave_queue_flush_internal(0x1, 0, ...)
 | | < kcrfw_slave_queue_flush_internal+0x0000000000d7 returns: 0x1
 | | > kcrfw_do_null_write(0, 0, ...)
 | | | > kcrfw_slave_phase_batchdo(0, 0, ...)
 | | | | > kcrfw_slave_phase_enter(0, 0x9b, ...)
 | | | | < kcrfw_slave_phase_enter+0x000000000449 returns: 0
 | | | <> kcrfw_slave_phase_exit(0, 0x9b, ...)
 | | | < kcrfw_slave_phase_exit+0x00000000035a returns: 0
 | | | > kcrfw_post(0, 0, ...)
 | | | | > kcrfw_slave_single_getactivegroup(0, 0, ...)
 | | | | < kcrfw_slave_single_getactivegroup+0x000000000047 returns: 0x6a9a0718
 | | | | > kspGetInstType(0x1, 0x1, ...)
 | | | | | > vsnffe_internal(0x19, 0x1, ...)
 | | | | | | > vsnfprd(0x19, 0x1, ...)
 | | | | | | < vsnfprd+0x00000000000f returns: 0x8
 | | | | | | > kfIsASMOn(0x19, 0x1, ...)
 | | | | | | <> kfOsmInstanceSafe(0x19, 0x1, ...)
 | | | | | | < kfOsmInstanceSafe+0x000000000031 returns: 0
 | | | | | < vsnffe_internal+0x0000000000a7 returns: 0
 | | | | | > kspges(0x115, 0x1, ...)
 | | | | | < kspges+0x00000000010f returns: 0
 | | | | < kspGetInstType+0x0000000000b1 returns: 0x1
 | | | | > kcrfw_slave_phase_enter(0x1, 0x9b, ...)
 | | | | < kcrfw_slave_phase_enter+0x00000000006f returns: 0x9b
 | | | | > kcscu8(0x60016290, 0x7fff1fe12f98, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcsaj8(0x60016290, 0x7fff1fe12f38, ...)
 | | | | < kcsaj8+0x0000000000dc returns: 0x1
 | | | | > kcrfw_slave_phase_exit(0x1, 0x9b, ...)
 | | | | < kcrfw_slave_phase_exit+0x00000000008e returns: 0
 | | | | > kslpsemf(0x97, 0, ...)
 | | | | | > ksl_postm_init(0x7fff1fe0ac30, 0x7fff1fe12c50, ...)
 | | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | < kslpsemf+0x0000000006b5 returns: 0x1f
 | | | | > kcrfw_slave_barrier_nonmasterwait(0x6a9a0720, 0x4, ...)
 | | | | < kcrfw_slave_barrier_nonmasterwait+0x000000000035 returns: 0x600161a0
 | | | < kcrfw_post+0x000000000c1c returns: 0xd3
 | | < kcrfw_do_null_write+0x0000000000b2 returns: 0xd3
 | < kcrfw_redo_write_driver+0x000000000535 returns: 0xd3

The highlighted functions are extra functions executed when the instance is set to scalable log writer mode, or when adaptive mode has set the instance to scalable log writer mode. This means that the changes between the modes is minimal when there’s no writes, and outside of a few extra functions, the log writer does exactly the same.

The absence of any spectacular changes in the behaviour of the log writer when in scalable log writer mode when there are no writes does hint what the actual changes will be of the scalable mode, which is how writing is handled. In single log writer mode, the most time the log writer is process is likely to spend on is writing the change vectors into the online redologfiles, and maybe, if you have a bad application (!) semop()-ing foreground sessions will be second, if there are a large number of processes committing, because every process needs to be semop()-ed individually. These two functions, along with some other functionality are exactly what the log writer worker processes are doing.

This means that foreground processes do nothing different in scalable log writer mode, they signal (semop) the master log writer, which will investigate the public redo strands, and if the master log writer finds change vectors to write, it will assign log writer worker processes to perform the write, and the log writer worker process will semop() the foreground sessions to indicate the redo has been written when the instance is in post/wait mode, or do not semop() when the instance is in polling mode.

This is the entire function flow of a write when the instance is in scalable log writer mode:

 | > kcrfw_slave_queue_insert(0, 0xd3, ...)
 | | > kcrfw_slave_group_setcurrsize(0, 0, ...)
 | | < kcrfw_slave_group_setcurrsize+0x0000000001d1 returns: 0x1
 | | > _intel_fast_memcpy(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | <> _intel_fast_memcpy.P(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | <> __intel_ssse3_rep_memcpy(0x6a9a05f8, 0x7ffdae335fa0, ...)
 | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x6a9a05f8
 | | > kcrfw_slave_group_postall(0, 0xf0, ...)
 | | | > ksvgcls(0, 0xf0, ...)
 | | | < ksvgcls+0x000000000021 returns: 0
 | | | > ksl_post_proc(0x6ddb32f0, 0, ...)
 | | | <> kskpthr(0x6ddb32f0, 0, ...)
 | | | <> kslpsprns(0x6ddb32f0, 0, ...)
 | | | | > ksl_update_post_stats(0x6ddb32f0, 0, ...)
 | | | | | > dbgtTrcData_int(0x7f464c0676c0, 0x2050031, ...)
 | | | | | | > dbgtBucketRedirect(0x7f464c0676c0, 0x7ffdae335338, ...)
 | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
 | | | | | | > dbgtIncInMemTrcRedirect(0x7f464c0676c0, 0x6fa, ...)
 | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
 | | | | | | > skgstmGetEpochTs(0x7f464c0676c0, 0x6fa, ...)
 | | | | | | | > gettimeofday@plt(0x7ffdae334e40, 0, ...)
 | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x20e067375b55d
 | | | | | | > dbgtrRecAllocate(0x7f464c0676c0, 0x7ffdae3352e0, ...)
 | | | | | | | > dbgtrPrepareWrite(0x7f464c0676c0, 0x65accba0, ...)
 | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x4
 | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
 | | | | | | > _intel_fast_memcpy(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | <> _intel_fast_memcpy.P(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | <> __intel_ssse3_rep_memcpy(0x65acda30, 0x7ffdae3353d8, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x000000002030 returns: 0x65acda30
 | | | | | | > dbgtrRecEndSegment(0x7f464c0676c0, 0x7ffdae3352e0, ...)
 | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
 | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
 | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
 | | | | > skgpwpost(0x7ffdae335480, 0x7f464c0acca0, ...)
 | | | | <> sskgpwpost(0x7ffdae335480, 0x7f464c0acca0, ...)
 | | | | | > semop@plt(0xc0000, 0x7ffdae335410, ...)
 | | | | | < semop+0x00000000000f returns: 0
 | | | | < sskgpwpost+0x00000000009a returns: 0x1
 | | | < kslpsprns+0x0000000001c3 returns: 0
 | | < kcrfw_slave_group_postall+0x0000000000a8 returns: 0
 | < kcrfw_slave_queue_insert+0x0000000001b6 returns: 0x667bc540

After the instance has established there are change vectors in kcrfw_gather_lwn, in single log writer mode, the function kcrfw_redo_write is called, which will call kcrfw_do_write which handles the writing, and kslpslf to semop any waiting processes among other things. Now in scalable log writer mode, kcrfw_slave_queue_insert is called which assigns work to worker processes, and then kcrfw_slave_group_postall is called to semop one or more worker processes.

The worker processes are sleeping on a semaphore, and if a process gets signalled, it exits the kcrfw_slave_queue_remove function, ends the wait event, and calls kcrfw_redo_write, just like the master log writer process would call in single log writer mode, which includes doing the write (kcrfw_do_write) and posting the foregrounds (kslpslf), exactly all the functions.

The adaptive scalable log writer processes function has been silently introduced with Oracle 12, although a lot of the used functionality has been available more or less in earlier versions. It is a fully automatic feature which will turn itself on and off based on heuristics. The purpose of this article is to explain how it works and what it is doing. Essentially, all the functionality that surrounds a log writer write has been moved to a worker process, which means the work can be done in parallel with multiple processes, whilst all the work outside of the work around the write, which is not performance critical, is left with the master log writer.


I gotten some requests to provide an overview of the redo series of blogposts I am currently running. Here it is:

Private redo strands, In memory undo and throw away undo:

The redo series would not be complete without writing about changing the behaviour of commit. There are two ways to change commit behaviour:

1. Changing waiting for the logwriter to get notified that the generated redo is persisted. The default is ‘wait’. This can be set to ‘nowait’.
2. Changing the way the logwriter handles generated redo. The default is ‘immediate’. This can be set to ‘batch’.

There are actually three ways these changes can be made:
1. As argument of the commit statement: ‘commit’ can be written as ‘commit write wait immediate’ (statement level).
2. As a system level setting. By omitting an explicit commit mode when executing the commit command, the setting as set with the parameters commit_wait (default: wait) and commit_logging (default: immediate).
3. As a session level setting. By omitting an explicit commit mode, but by setting either commit_wait or commit_logging it overrides the settings at the system level.

At this point I should say that in my personal opinion, if you need to change this, there is something very wrong with how the database is used in the first place. This can enhance performance a bit (totally depending on what you are doing and how your hardware looks like), but it does nothing magic, as you will see.

a) commit wait/nowait
I ran a pin tools debugtrace on a session that commits explicitly with the write mode explicitly set to wait (the default), and a session that commits explicitly with the write mode set to nowait. If you took the time to read the other redo related articles you know that a commit generates changes vectors that are written in the public redo strand, changes the transaction table in the undo segment header and then signals the logwriter to write in kcrf_commit_force_int, releases all transactional control on the rows in the transaction that are committed, after which kcrf_commit_force_int is called again in order to wait for the logwriter to get notified that the change vectors have been persisted.

When commit is set to nowait, actually what happens is very simple: everything that is executed in ‘wait mode’ commit is executed in ‘nowait mode’ too, except for calling the kcrf_commit_force_int a second time, which is the functionality to wait for the notification from the logwriter.

commit wait:

 | | < kpoal8+0x000000000f8c returns: 0x2
 | | > ksupop(0x1, 0x7a87a9a0, ...)
 | | | > ksugit_i(0x11526940, 0x7a87a9a0, ...)
 | | | < ksugit_i+0x00000000002a returns: 0
 | | | > _setjmp@plt(0x7ffda5959c50, 0x7a87a9a0, ...)
 | | | <> __sigsetjmp(0x7ffda5959c50, 0, ...)
 | | | <> __sigjmp_save(0x7ffda5959c50, 0, ...)
 | | | < __sigjmp_save+0x000000000025 returns: 0
 | | | > kcbdsy(0x7ffda5959c50, 0x7f3011cbc028, ...)
 | | | <> kcrf_commit_force_int(0x7f3011d75e10, 0x1, ...)
 | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1
 | | | > kslws_check_waitstack(0x3, 0x7f3011d82f40, ...)
 | | | < kslws_check_waitstack+0x000000000065 returns: 0
 | | | > kssdel(0x7a87a9a0, 0x1, ...)
 | | | | > kpdbUidToId(0, 0x1, ...)
 | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | > kss_del_cb(0x7ffda5959b50, 0x7f3011d82f40, ...)
 | | | | | > kpdbUidToId(0, 0x7f3011d82f40, ...)
 | | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | | > ksudlc(0x7a87a9a0, 0x1, ...)

commit nowait:

 | | < kpoal8+0x000000000f8c returns: 0x2
 | | > ksupop(0x1, 0x63c82a38, ...)
 | | | > ksugit_i(0x11526940, 0x63c82a38, ...)
 | | | < ksugit_i+0x00000000002a returns: 0
 | | | > _setjmp@plt(0x7fff43332a50, 0x63c82a38, ...)
 | | | <> __sigsetjmp(0x7fff43332a50, 0, ...)
 | | | <> __sigjmp_save(0x7fff43332a50, 0, ...)
 | | | < __sigjmp_save+0x000000000025 returns: 0
 | | | > kslws_check_waitstack(0x3, 0x7fd1cea22028, ...)
 | | | < kslws_check_waitstack+0x000000000065 returns: 0
 | | | > kssdel(0x63c82a38, 0x1, ...)
 | | | | > kpdbUidToId(0, 0x1, ...)
 | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | > kss_del_cb(0x7fff43332950, 0x7fd1ceae8f40, ...)
 | | | | | > kpdbUidToId(0, 0x7fd1ceae8f40, ...)
 | | | | | < kpdbUidToId+0x00000000014e returns: 0
 | | | | | > ksudlc(0x63c82a38, 0x1, ...)

Yes, it’s that simple. In normal commit mode, commit wait, in ksupop (kernel service user pop (restore) user or recursive call) a call to kcbdsy is executed, which performs a tailcall to kcrf_commit_force_int. In nowait commit mode, kcbdsy is simply not called in ksupop, which actually exactly does what nowait means, the waiting for the logwriter notification is not done.

b) commit immediate/batch
I ran a pin tools debugtrace on a session that commits explicitly with the write mode explicitly set to immediate, and a session that commits explicitly with the write mode set to batch. If you read the other redo related articles you know that a commit generates changes vectors that are written in the public redo strand, changes the transaction table in the undo segment header and then signals the logwriter to write in kcrf_commit_force_int, then releases all transactional control on the rows in the transaction that are committed, after which kcrf_commit_force_int is called again in order to wait for the logwriter to get notified that the change vectors have been persisted.

When commit is set to batch, actually what happens is very simple: everything is done exactly the same in ‘immediate mode’ commit, except for calling the kcrf_commit_force_int the first time, which is the functionality that triggers the logwriter to write. So it looks like ‘batch mode’ is not explicitly batching writes for the logwriter, but rather the disablement of the signal to the logwriter to write right after the change vectors have been copied and the blocks are changed. But that is not all…

I noticed something weird when analysing the calls in the debugtrace of ‘commit write batch’: not only was the first invocation of kcrf_commit_force_int gone, the second invocation of kcrf_commit_force_int was also gone too! That is weird, because the Oracle documentation says:


Use these clauses to specify when control returns to the user.

The WAIT parameter ensures that the commit will return only after the corresponding redo is persistent in the online redo log. Whether in BATCH or IMMEDIATE mode, when the client receives a successful return from this COMMIT statement, the transaction has been committed to durable media. A crash occurring after a successful write to the log can prevent the success message from returning to the client. In this case the client cannot tell whether or not the transaction committed.

The NOWAIT parameter causes the commit to return to the client whether or not the write to the redo log has completed. This behavior can increase transaction throughput. With the WAIT parameter, if the commit message is received, then you can be sure that no data has been lost.

If you omit this clause, then the transaction commits with the WAIT behavior.

The important, and WRONG thing, is in the last line: ‘if you omit this clause, then the transaction commits with the WAIT behavior’. Actually, if the commit mode is set to batch, the commit wait mode flips to nowait with it. It does perform the ultimate batching, which is not sending a signal to the logwriter at all, so what happens is that change vectors in the public redo strands are written to disk by the logwriter only every 3 seconds, because that is the timeout for the logwriter sleeping on a semaphore, after which it obtains any potential redo to write via information in kcrfsg_ and KCRFA structures. This is important, because with NOWAIT behaviour, there is no guarantee changes have been persisted for the committing session.

I was surprised to find this, which for me it meant I was searching for ‘kcrf_commit_force_int’ in the debugtrace of a commit with the ‘write batch’ arguments, and did not find any of them. Actually, this has been reported by Marcin Przepiorowski in a comment on an article by Christian Antognini on this topic.

Can this commit batching be changed to include waiting for the logwriter? Yes, actually it can if you explicitly include ‘wait’ with the commit write batch. It is very interesting the kcrf_commit_force_int function then comes back at a totally different place:

 | | | | | | | | | | | | | < ktuulc+0x000000000119 returns: 0
 | | | | | | | | | | | | | > ktudnx(0x69fc8eb0, 0, ...)
 | | | | | | | | | | | | | | > ktuIMTabCacheCommittedTxn(0x69fc8eb0, 0x7ffe9eb79e74, ...)
 | | | | | | | | | | | | | | < ktuIMTabCacheCommittedTxn+0x000000000071 returns: 0
 | | | | | | | | | | | | | | > kslgetl(0x6ab9d6e8, 0x1, ...)
 | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | > kslfre(0x6ab9d6e8, 0x6ab9ce00, ...)
 | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | | | | | | | | | | < ktudnx+0x0000000005e4 returns: 0
 | | | | | | | | | | | | | > ktuTempdnx(0x69fc8eb0, 0, ...)
 | | | | | | | | | | | | | < ktuTempdnx+0x000000000083 returns: 0
 | | | | | | | | | | | | | > kcb_sync_last_change(0x69fc8eb0, 0x6df64df8, ...)
 | | | | | | | | | | | | | <> kcrf_commit_force_int(0x7f525ba19c00, 0x1, ...)
 | | | | | | | | | | | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1
 | | | | | | | | | | | | | > kghstack_free(0x7f525bb359a0, 0x7f525690ead8, ...)
 | | | | | | | | | | | | | < kghstack_free+0x00000000005a returns: 0
 | | | | | | | | | | | | < ktucmt+0x000000000e0c returns: 0

Instead of simply keeping the separate call after the transaction in the ksupop function, described above with commit wait/nowait, which is kcrf_commit_force_int with second argument set to 1, which means it notifies the logwriter as well as waits for the logwriter notification of the write, it is now is called after the function to clear the TX enqueue (ktuulc) and the undo transaction count has been lowered (ktudnx) at the end of the ktucmt function as a tailcall of kcb_sync_last_change, which wasn’t called before. Of course this limits the IO batching opportunities.

Do not change your database or even your session to make your commit faster. If you must, read this article carefully and understand the trade offs. One trade off which hasn’t been highlighted is: this might change in a different version, and it requires some effort to investigate. And again: if you still are considering this: probably you have a different problem that you should look at. Do not take this option in desperation to hope for a magical restoration of performance.

The commit_write option nowait does trigger the logwriter to write (the first invocation of the kcrf_commit_force_int function), but it does not wait for write confirmation. The commit_logging option batch does something different than the documentation says it does, it does not issue a signal to the logwriter, nor wait for it. This way the logwriter can wait the full three seconds before it times out on its semaphore and write what is in the public redo strands. But there is no way to tell if the redo for your change has been persisted yet, because that wait is gone too (that wait is the infamous ‘log file sync’ wait). If you want batching but still want a write notification, you must set commit_write to wait explicitly. By doing that you do not get the optimal batching because then waiting for the logwriter, including sending a signal to write is executed, which I suspect to be in the same ballpark as regular committing, but I haven’t checked that.

During the investigations of my previous blogpost about what happens during a commit and when the data becomes available, I used breaks in gdb (GNU debugger) at various places of the execution of an insert and a commit to see what is visible for other sessions during the various stages of execution of the commit.

However, I did find something else, which is very logical, but is easily overlooked: at certain moments access to the table is blocked/serialised in order to let a session make changes to blocks belonging to the table, or peripheral blocks like undo, for the sake of consistency. These are changes made at the physical layer of an Oracle segment, the logical model of Oracle says that writers don’t block readers.

The first question is if this is a big issue. Well, for that you should simply look at any normal usage of an Oracle database. I don’t believe these serialisation moments are an issue, because the majority of my client’s does not experience serialisation problem.

However, this can become an issue if database processes can not run at will. What that means is that if processes are randomly stopped from execution by the operating system or hypervisor, it could be that the database process can be at an earlier mentioned ‘serialisation point’, which then means that access for all other processes remains blocked until the process is made running again, which then gives the process the opportunity to free exclusive access to a resource.

What are issues when a database process can not run at will?
– CPU oversubscription. If more processes are running than CPU’s are available (think about core’s and threads, and what it means for your CPU architecture), it means the operating system needs to make a decision on what it wants to be running, and what it wants to wait. No operating system scheduler understands or can know when an Oracle database process is running in a critical part of code, and therefore should not stop execution of it.
– Memory oversubscription. Whenever your system starts actively swapping in and out, your system is swapping, which means it’s moving active pages onto the swap device, and reads back active pages from the swap device for a process that demands them. This will activate additional tasks, and make processes get stuck waiting for memory to become available at all kinds of points during execution. Often, swapping is the beginning of the end, which means the services on the server stop functioning altogether for multiple reasons, like the Out Of Memory killer, or simply timing out for client’s of a process.
– Oversubscription of CPU or memory using virtualisation. As a troubleshooter, I would say this is actually worse than the ‘simple oversubscription’ mentioned in the two earlier points. Because typically, the visible part is the virtual machine. The virtual machine in this case itself is not oversubscribed, the oversubscription takes place at a layer invisible to the virtual machine, which unexplainably performs unpredictable. It is simply not okay to oversubscribe any machine that needs to run something that is latency sensitive like the Oracle database, unless you fully understand all the trade-offs that are coming with it, and you have the ability to understand when it does start to occur.

Whenever database processes can not run at will, you will see waits you didn’t see when processes could run at will. Typical waits in that case are (not an exhaustive list):
– any latch waits
– buffer busy waits
– any cursor: pin waits
Please mind these waits are not unique for this issue, any of these waits can occur for other reasons too.

Let me show a couple of examples for which the wait is artificially created by stopping execution at a point where this serialisation takes place, thereby mimicking getting put off cpu due to load, using the earlier insert and commit execution:

a) latch: cache buffers chains
Whenever a process needs to read or modify a database buffer, it needs to ‘pin’ the buffer. Such a pin is also known as a buffer handle or a buffer state object. The pin, which is a memory area, must be obtained from a freelist and attached to the buffer header in order to pin the buffer. A pin is obtained using the kcbzgs (kernel cache buffers helper functions get state object) function, which calls the kssadf_numa_intl (kernel service state objects add from freelist numa internal) function which initialises the state object.

The function that performs the actual pin of a block is not a single one, pinning a block is done in multiple functions. Two of them are kcbgtcr (kernel cache buffers get consistent read) and kcbgcur (kernel cache buffers get current). Under normal circumstances, concurrency for a consistent read of a buffer (kcbgtcr) does not lead to any blocking, because the latch (cache buffers chains) can be taken in shared mode, and the pinning is done in ‘fast mode’ (my interpretation):

> kcbgtcr(0x7f88a8b79db0, 0, ...)
| > kscn_to_ub8_impl(0x7f88a8b88d4c, 0x6baefdc0, ...)
| < kscn_to_ub8_impl+0x00000000003e returns: 0x287562
| > ktrexf(0x7fff998123f8, 0x7f88ae5d7f30, ...)
| | > ktrEvalBlockForCR(0x7f88a8b88d4c, 0x7f88ae5d7f30, ...)
| | < ktrEvalBlockForCR+0x0000000000f4 returns: 0x1
| < ktrexf+0x0000000000a4 returns: 0x9
| > kcbzgsf(0x1, 0x7fff998120b8, ...)
| | > kssadf_numa_intl(0x36, 0x6ddc0b50, ...)
| | | > kpdbIdToUid(0, 0x6ddc0b50, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x6bf84908
| < kcbzgsf+0x0000000001a0 returns: 0x6bf84908
| > kcbz_fp_cr(0x8dff9478, 0x6bf84998, ...)
| < kcbz_fp_cr+0x000000000089 returns: 0
| > kscn_to_ub8_impl(0x7f88a8b88d70, 0x1, ...)
| < kscn_to_ub8_impl+0x00000000003e returns: 0
< kcbgtcr+0x000000000b38 returns: 0x8df94014

Inside the kcbgtcr function, kcbzgsf (kernel cache buffers helper functions test state object fast) is called, and after that, the state object is pinned to the block in kcbz_fp_cr (kernel cache buffers helper functions fast pin for consistent read). Do you see the absence of ksl_get_shared_latch function to serialise access to add this pin? Actually the cache buffers chains latch is gotten, but it’s done inside the kcbgtcr function. The latch is gotten before the ktrexf (kernel transaction redo examine block fast) in shared mode, and the latch is released after the kcbz_fp_cr function. Multiple processes can pin the block for consistent read without serialisation or blocking each other because of the cache buffers chains latch being a shared latch, and this latch is taken in shared mode.

Why do I show this? Well, in the “old” days, which is Oracle 8.1, Oracle did not use shared latches (at least on the platforms I worked on, but if my memory serves me right, shared latches were introduced starting from Oracle 9), which meant for any logical read the cache buffers chains latch had to be obtained. This could get painful for very frequently visited blocks, like index root blocks, and high concurrency, and additional to that, because there was no post/wait messaging for latch waits but only spinning, this meant lots of processes waiting for latch number 98, and CPU usage going through the roof.

Back to today. Below is shown how the function calls look like when DML is done. When a buffer change is done (data is changed), a buffer has to be obtained in current mode, which is done using the kcbgcur function:

> kcbgcur(0x7ffda5957c70, 0x2, ...)
| > kcbzgs(0x1, 0x2, ...)
| | > kssadf_numa_intl(0x36, 0x7a6086c8, ...)
| | | > kpdbIdToUid(0, 0x7a6086c8, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x7867c9c0
| < kcbzgs+0x000000000178 returns: 0x7867c9c0
| > kti_is_imu(0x7867ca50, 0xb0f71018, ...)
| | > ktcgtx(0x76f29d50, 0xb0f71018, ...)
| | < ktcgtx+0x00000000001e returns: 0x76f29d50
| < kti_is_imu+0x000000000039 returns: 0
< kcbgcur+0x0000000009fb returns: 0xb0302014

This looks reasonably the same as the previous call overview which shown kcbgtcr, however the kcbzgsf function changed to kcbzgs, so minus the ‘f’ for fast, and a few other functions are missing. Another interesting thing (not visible) is the cache buffers chains latch is obtained after the kcbzgs function that obtains the buffer state object, and the latch is obtained with special bit 0x1000000000000000 set to indicate the shared latch is obtained in non-shared mode. After the kti_is_imu function the pin is attached to the buffer and the latch is freed.

To make things a little more complicated, a buffer can be gotten in current mode, but still get the cache buffers chains in shared mode. This is how that looks like:

> kcbgcur(0x7ffda5955620, 0x1, ...)
| > kcbzgs(0x1, 0x1, ...)
| | > kssadf_numa_intl(0x36, 0x7a6086c8, ...)
| | | > kpdbIdToUid(0, 0x7a6086c8, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x7867c9c0
| < kcbzgs+0x000000000178 returns: 0x7867c9c0
| > kcbz_fp_shr(0xb0fa3cb8, 0x7867ca50, ...)
| < kcbz_fp_shr+0x000000000085 returns: 0x7f3011d82f40
< kcbgcur+0x0000000009fb returns: 0xb07a0014

The function kcbzgs without ‘f’ is called, but the cache buffers chains latch is gotten in shared mode (not visible; done in the kcbgcur function after kcbzgs), and there’s the function kcbz_fp_shr (kernel cache buffers helper function fast pin shared) to pin the buffer in shared mode.

In order to be allowed to change a buffer, the cache buffers chains latch must be taken in non-shared mode to change the buffer header to indicate the buffer is busy and guarantee only the process itself can access it. Obtaining the cache buffers chains latch in non-shared mode means that access to the hash bucket (multiple hash buckets actually) to which the buffer is linked is blocked until the latch is freed. Please mind the period that the latch is obtained here is very short.

However, this blocking scenario can be easily replayed:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of the session 1’s process and attach to it with gdb.
3. session 2: break on the kti_is_imu function (break kti_is_imu) and continue.
4. session 1: insert a row into a table (insert into test values (‘a’);). the breakpoint will fire in gdb, stopping execution.
5. session 3: startup another database foreground session.
6. session 3: select the table on which the insert just broke (select * from test;). This will hang.
7. session 4: startup a SYSDBA session, and obtain the wait state of the foreground sessions:

SQL> @who active

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -------------
59,53450,@1     28166    SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:98
102,53150,@1    28083    TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:66867.83ms,event:ON CPU:SQL,seq#:33
101,6637,@1     28186    TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:344013.82ms,event:latch: cache buffers chainsp1:1811473056p2:289p3:24599> Blocked by (inst:sid): 1:102,seq#:39

SID 101 is waiting on the cache buffers chains latch, because SID 102 is holding that. SID 102 is showing ON CPU because we broke execution outside of a wait event, so Oracle thinks it’s busy executing, while it is actually stopped by gdb.

Again, this is a wait (latch: cache buffers chains) that is unlikely be an issue, and if it is, you either ran into a bug, or this is only the way a much bigger problem (oversubscription, bad application design) is showing itself.

b) buffer busy waits
The name of the wait event ‘buffer busy wait’ is self explanatory. The technical implementation is way lesser known. What technically happens during a buffer busy wait, is that a session needs to read a certain buffer, and in order to read it, it performs the usual actions of selecting the correct cache buffers chains bucket, following the double linked list to the buffer headers to find the best best buffer and then obtain a buffer state object. The critical part for running into the buffer busy wait is when the session checks the state of the buffer via the buffer header and finds the change state (X$BH.CSTATE) to be higher than 0, indicating the block is currently being changed. At this point the session calls kcbzwb (kernel cache buffers helper function wait for buffer), and must wait for the buffer state to be changed and therefore accessible, for which the wait event is the buffer busy wait event. When the change state is reverted back to 0, the buffer is accessible again.

Once a buffer is pinned in current mode using kcbgcur en kcbzgs, the buffer is still fully accessible for all sessions. Only once the database truly starts to change the buffer, which is done in kcbchg1_main, the cache buffers chains latch is taken non-shared, and the CSTATE field in the buffer header is changed to 1 and then the latch is freed. My current investigations show that the CSTATE is changed to 2 after the kcopcv and kcrfw_copy_cv functions, so when the redo is written to the public redo strand, and set to 4 after the changes have been written to the buffers, which is done in kcbapl. At the end of the kcbchg1_main function, when all the changes are done, the cache buffers chains latch is taken non-shared again, the CSTATE field is set to 0 and the latch is freed to enable access to the buffer again.

68801:kcbchg1_main+2536:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):R:8:0/0()
68802:kcbchg1_main+2536:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):W:8:0x1000000000000022/1152921504606847010()
68897:kcbchg1_main+3318:0x00000000967eddda(Variable Size(pgsz:2048k)|+901701082 buffer header|9668A000+66 ):W:1:0x1/1()
68939:kcbchg1_main+3715:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):R:8:0x1000000000000022/1152921504606847010()
68940:kcbchg1_main+3715:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):W:8:0/0()

Above you see a cache buffers chains latch being obtained non shared (special bit 0x1000000000000000 is set), a little later the buffer header is updated which sets the value 1 at offset 66. Once this happens, this is visible in the CSTATE column of X$BH. A little further the latch is freed.

Why am I showing all of this? Well, there is quite some work that is done while the buffer is in a changed state (CSTATE>0) and therefore inaccessible. In normal situations, this is not a problem because despite all the work this is done very quickly, and therefore it’s hardly noticeable/measurable, so there’s no or very little time spend in buffer busy waits in most databases

However… if processes do get stuck randomly, it’s possible that a process gets stuck while making changes to a buffer, which then means any additional access to the buffer results in buffer busy waits. Also mind that a change to a table buffer requires (at least; in the most simple case) three buffers to be changed: the buffer containing the table data, the buffer containing the undo segment transaction table and the buffer containing the actual undo. A commit requires one buffer: the undo segment transaction table.

Let me show you an example:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of the session 1’s process and attach to it with gdb.
3. session 2: break on ktugur and kcopcv and continue.
4. session 3: startup a database foreground session.
5. session 1: insert a row into a table: insert into test values (‘a’);
6. session 2: gdb breaks execution of session 1 because it encountered ktugur.
7. session 3: select * from test; this runs unblocked.
8. session 4: startup a database foreground session as sysdba.
9. session 4: select * from v$lock; this will hang.

This one wasn’t obvious. I found this upon doing the tests. When a foreground session is in ktugur, it is creating the change vectors (kernel transaction undo generate undo and redo), and it holds two buffers in current mode:

SQL> l
  1  select tablespace,
  2  	dbablk,
  3  	class,
  4  	state,
  5  	decode(state,0,'free',1,'xcur',2,'scur',3,'cr', 4,'read',5,'mrec',6,'irec',7,'write',8,'pi', 9,'memory',10,'mwrite',11,'donated', 12,'protected',  13,'securefile', 14,'siop',15,'recckpt', 16, 'flashfree',  17, 'flashcur', 18, 'flashna') state_decoded,
  6  	mode_held,
  7  	changes,
  8  	cstate,
  9  	flag,
 10  	flag2,
 11  	rflag,
 12  	sflag,
 13  	lru_flag,
 14  	dirty_queue
 15  from 	  x$bh b
 16  	, v$tablespace ts
 17  where 	1=1
 18  and	not (us_nxt = us_prv and wa_nxt = wa_prv and to_number(wa_nxt,'xxxxxxxxxxxxxxxx') = to_number(us_nxt,'xxxxxxxxxxxxxxxx') + 16)
 19* and	b.ts#=ts.ts#
SQL> /

------------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -----------
UNDOTBS1                              128         17          1 xcur                2          1          0    2097152          0          0          0          0           0
TS                                  24597          1          1 xcur                2          1          0    2097153          0          0          0          0           0

The thing that was not obvious at first was that the blocks have CSTATE 0; so they can be accessed and read for a consistent read. Then I look at the wait and the p1/2/3 of the wait:

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -------------------------------------------------------
100,262,@1      6274     SYS             DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:2519.48ms,event:buffer busy waitsp1:3p2:128p3:17> Blocked by (inst:sid): 1:59,seq#:476
142,15103,@1    6695     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:47115.23ms,event:SQL*Net message from client,seq#:226
12,25165,@1     7036     SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:3165
59,31601,@1     5891     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:31292.51ms,event:ON CPU:SQL,seq#:111

Above SID 100 is session 4; it hangs on a buffer busy wait. The datafile is #2, which is the undo table space UNDOTBS1 and the buffer that it is waiting for is 128. Why does SID 100 needs to wait for the buffer, while another session (session 3 in the above runbook) can run without getting blocked for a buffer that is held exactly in the same state (xcur and CSTATE is zero)?

The answer can be gotten when executing pstack on the waiting process (or attach with gdb and obtain a backtrace):

(gdb) bt
#0  0x00007fd6a085bbda in semtimedop () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000010f9cca6 in sskgpwwait ()
#2  0x0000000010f9a2e8 in skgpwwait ()
#3  0x0000000010a66995 in ksliwat ()
#4  0x0000000010a65d25 in kslwaitctx ()
#5  0x00000000015c5e5e in kcbzwb ()
#6  0x0000000010b3748e in kcbgcur ()
#7  0x0000000010ab9b17 in ktuGetUsegHdr ()
#8  0x00000000014075d3 in ktcxbcProcessInPDB ()
#9  0x0000000001406de2 in ktcxbc ()
#10 0x0000000010ea8781 in qerfxFetch ()
#11 0x0000000010bb919c in rwsfcd ()
#12 0x00000000035ae618 in qeruaFetch ()
#13 0x00000000035ad837 in qervwFetch ()
#14 0x0000000010bb919c in rwsfcd ()
#15 0x0000000010ea1528 in qerhnFetch ()
#16 0x0000000010bb919c in rwsfcd ()
#17 0x0000000010ea1528 in qerhnFetch ()
#18 0x0000000010c9eeb3 in opifch2 ()
#19 0x000000000268c3de in kpoal8 ()
#20 0x0000000010ca5b3d in opiodr ()
#21 0x0000000010f37a29 in ttcpip ()
#22 0x0000000001eb4674 in opitsk ()
#23 0x0000000001eb914d in opiino ()
#24 0x0000000010ca5b3d in opiodr ()
#25 0x0000000001eb04ed in opidrv ()
#26 0x0000000002afedf1 in sou2o ()
#27 0x0000000000d05577 in opimai_real ()
#28 0x0000000002b09b31 in ssthrdmain ()
#29 0x0000000000d05386 in main ()

The important bit is at #6: kcbgcur (kernel cache buffers get buffer in current mode)! In other words: the block is gotten in current mode while another session is holding the block in current mode, which means it has to wait. The function kcbzwb means kernel cache buffer helper function wait for buffer.

I really wondered why the undo segment buffer is gotten in current mode; this is a read, there is no intention to change something. The current educated guess is this a way to be sure this buffer contains the current state of the transactions across all nodes in a RAC cluster. So this might be a clever optimisation to be absolutely sure the current state of the undo segment transaction table is gotten.

10. session 2: continue. This will break on kcopcv.
11. session 3: select * from test; this will hang.

Now we have two hanging sessions waiting for the wait event buffer busy wait:

SQL> @who active

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -----------------
100,262,@1      6274     SYS             DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:120790ms,event:buffer busy waitsp1:3p2:128p3:35> Blocked by (inst:sid): 1:59,seq#:501
142,15103,@1    6695     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:1175715.03ms,event:buffer busy waitsp1:4p2:24597p3:1> Blocked by (inst:sid): 1:59,seq#:250
12,25165,@1     7036     SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:3309
59,31601,@1     5891     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:9544.97ms,event:ON CPU:SQL,seq#:116

SID 100 still hangs in wait event buffer busy wait for file 3, block 128, but now the session doing the full table scan also hangs in wait event buffer busy wait, but for file 4, block 24597. Let’s look at the buffers held in current mode again:

SQL> @pinned_blocks

------------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -----------
UNDOTBS1                              128         35          1 xcur                2          0          1    2097156          0          0          0          0           0
UNDOTBS1                             3127         36          1 xcur                2          0          1    2097156          0          0          0          0           0
TS                                  24597          1          1 xcur                2          0          1    2097157          0          0          0          0           0

Because the insert progressed to kcopcv, the CSTATE of all three buffers involved in the insert, the undo segment transaction table, the undo segment undo entry and the buffer which has the inserted row are set to 1. This means that the two buffer busy waits are actually two different reasons: the select * from v$lock still waits to get a buffer in current mode, and the select * from test, which does a consistent read, now is waiting for the CSTATE to get set back to 0. In order to solve the hanging, go to session 2 and continue. This will break again on kcoapl, and again if you continue, because there are 3 blocks involved in this insert. Another way to make the insert finish is to disable all breakpoints (disable command) or to quit gdb.

Just for the fun of it, here is another example of how a stuck session can escalate: a process that gets stuck executing the kcbapl (kernel cache buffers apply a change to a buffer). This scenario can be replayed using the following steps:
1. session 1: startup a database foreground session.
2. session 2: obtain the PID of the session 1’s process and attach to it with gdb.
3. session 2: break on kcbapl and continue.
4. session 3: startup a database foreground session.
5. session 3: insert a row into a table: insert into test values (‘a’);
6. session 1: insert a row into a table: insert into test values (‘a’);
7. session 2: gdb broken execution of session 1 because it encountered kcbapl.
8. session 3: commit; this will hang

Can you reason why session 3 is hanging? Both the sessions are doing inserts to the same table…Give it a thought and replay it yourself.

Explanation: the breakpoint stopped execution before actually doing the changes. CSTATE for the buffer is set to 1, it will get set to 2 inside kcbapl to indicate the change vectors have been copied. But the important bit is that a redo copy latch was taken before executing kcbapl. The commit in session 3 sent a message to the logwriter asynchronously asking it to write (first invocation of kcrf_commit_force_int), and then freed the logical locking of the insert, and then invoked kcrf_commit_force again to wait for the logwriter to finish writing and get notified. However, the logwriter needs to obtain all the redo copy latches (it actually reads a metadata structure that indicates the redo copy latches are freed or not), at which it needs to wait because session 1 stuck holding a redo copy latch and therefore needs to wait. Not only does this show how dangerous it is when sessions can’t run, it also shows how dangerous it can be to run gdb in a production database, it can be quite hard to figure out what will be the implication of attaching with gdb to the process. So the actual outcome of above is that session 3 waits in the event ‘log file sync’, and the logwriter is waiting in ‘LGWR wait for redo copy’.

The Oracle database is carefully setup to be able to handle concurrency. In most normal situations you will see no to maybe very little time spend in wait events that indicate concurrency, like latches or buffer busy waits. If an Oracle database is on a machine that has more active processes than available CPU threads or cores (this depends on the hardware and CPU architecture; for Intel based architectures you should lean towards core’s, for SPARC you can lean more towards threads), the operating system needs to make choices, and will do that trying to give all processes an equal share. This means that processes will be put off CPU in order to give all processes a fair share of CPU run time. Also mind other oversubscription possibilities, like using more memory than physically available, and doing the same using virtualisation, which is handing out more VCPU’s than CPU threads, and handing out more memory than physically available. Because you can’t look outside your virtual machine, it’s often harder to troubleshoot virtualisation oversubscription issues.

Oracle really tries to minimise the amount of time spend while holding a non shared latch. Therefore with modern Oracle databases it’s unlikely that latch waits show up, even if a system is (not too much) oversubscribed. If you do have a latching problem, it’s most likely you have got another problem which causes this.

The most well documented reason for buffer busy waits is a buffer being written onto persistent media, and therefore the buffer is marked busy for the time of the write. Another reason that is documented, is getting a buffer in a mode that is not compatible with the mode it is held in. Having a buffer in current mode will block another request for the buffer in current mode. But, having a buffer pinned in current mode does not block consistent read access, it’s only when the buffer is being changed that requires the buffer to be marked busy. This state is externalised in the CSTATE column of the view X$BH, which shows fields of the buffer header struct that is used to manage the buffer cache buffers. Because there is quite some work to be done when a buffer is changed, and during making these changes the buffers are not usable (‘busy’) for other processes, a buffer busy wait is more likely to occur when oversubscription is taking place.

Thanks to Jonathan Lewis for explaining how to get the buffer header addresses.
Thanks to Abel Macias for brainstorming.

The previous blogpost talked about a simple insert, this blogpost investigates what happens when the DML is committed. Of course this is done with regular commit settings, which means means they are not touched, which means commit_logging is set to immediate and commit_wait is set to wait as far as I know. The documentation says there is no default value, and the settings are empty in all parameter views. In my humble opinion, if you must change the commit settings in order to make your application perform usable with the database, something is severely wrong somewhere.

This blogpost works best if you thoroughly gone through the previous post. I admit it’s a bit dry and theoretical, but you will appreciate the knowledge which you gained there, because it directly applies to a commit.

First let’s look at the flow of functions for the commit:

  - kcscu8
  - kcscu8
  - kcscu8

Please mind this is an overview of functions which is not complete, it provides enough information to show the flow of functions I want to highlight. There are much more functions involved during the execution of commit.

The first thing that is visible here is that after the kpoal8/opiexe/kksExecuteCommand (kernel compile shared objects Execute Command) function are xct (transaction control) functions. Of course this is logical, a commit ends a transaction and makes changes visible. The xct layer then moves into the k2 layer, which is the distributed execution layer. I am not doing anything distributed, it is my current understanding that this layer is invoked this way so that if anything distributed was pending, it would be handled appropriately. After the k2 layer the function ktdcmt (kernel transaction distributed commit) is executed.

After the distributed layers we enter the ktc layer (kernel transaction control). In ktcCommitTXN_new I see handling of features like the result cache (qesrc) and In-Memory (ktuIM), then the ktu layer (kernel transaction undo) is entered, which enters the kcb layer (kernel cache buffers) using functions we saw in the previous post: kcbchg1 and kcbchg1_main.

In fact, at this point it looks very similar to the insert, in kcbchg1_main, ktiimu_chg and kcopcv (prepare change vectors) are called, but only once (because a commit only involves one block, see a little further for the explanation) instead of three times as we saw with the insert. Then kcrfw_redo_gen_ext is called, which is doing almost the same as the insert: first kcrfw_copy_cv is executed to copy the change vector to the public redo strand, then kcbapl is called to apply the change to a buffer. The kco_issue_callback function calls kturcm (kernel transaction undo redo commit) indicating the type change to the buffer. This means that a commit changes a single buffer, which is the buffer that holds the transaction in the transaction table in the undo segment, and the change is marking the transaction as committed. So a ‘commit marker’ is not a special token that is written into the redo stream, but in fact it’s simply a block change, just like all other change vectors.

After kcbapl, kcrfw_partial_checksum is called to checksum the redo in the public redo strand, again just like we saw with the insert.

Unique to a commit is the invocation of the kcrf_commit_force_int function. This is the first ‘use’ of the kcrf_commit_force_int function (indicated by the second function argument set to zero, not visible in the overview), which is signalling the logwriter to write any unwritten change vectors in the public redo strands. kcrf_commit_force_int checks the on disk SCN and the LWN SCN using kcscu8 (kernel cache scn management read current SCN) in the kcrfsg_ struct to check logwriter progress:
– If the on disk SCN is beyond the process’ commit SCN, the change vectors are written, and no further action is necessary (this function is quit), which also means a second invocation of kcrf_commit_force_int is not necessary.
– If the on disk SCN isn’t progressed beyond the process’ commit SCN, but the LWN SCN is, it means the logwriter is currently writing the change vectors for this commit SCN. In that case there is no need to signal the logwriter, but it requires the process to validate the write later using the second invocation of kcrf_commit_force_int.
– If both the on disk SCN and LWN SCN did not progress beyond or are equal to the process’ commit SCN, this invocation of kcrf_commit_force_int needs to send the logwriter a message using ksbasend (kernel service background processes asynchronous send message) to start writing the public redo strands. ksbasend will only send a message if the messages flag in the kcrfsg_ struct is not set indicating it has already been signalled.
After which the kcrf_commit_force_int function is returned from, as well as the kcrfw_redo_gen_ext function, so we are back in kcbchg1_main.

Also different from an insert is the invocation of the ktuulc (kernel transaction undo unlock; this is a guess) function. Which calls ksqrcl (kernel service enqueue release an enqueue), which calls ksqrcli_int (my guess this (=the addition of _int) is an enhanced version of the enqueue release function), which performs the clearance of the TX enqueue set for the inserted row. This clearance is not a guess, ksqrcli_int does clear the TX enqueue for the inserted row. After clearing the row lock, some more functions returned from: kcbchg1_main and kcbchg1, so we are back in ktucmt.

Because the transaction is now committed, the active transaction count in the undo segment can be decreased, which is done in ktudnx (kernel transaction undo decrease active transaction count). Then the ktucmt function is returned from too, and we are back in ktcCommitTxn_new.

In ktcCommitTxn_new state objects are cleaned up using kssdct (kernel service state object delete children of specified type). A state object is a memory area that keeps the state of various things that are transient, so if they get lost, the state object reflects the last known state. The callback action of the function performs some more housekeeping, the ktldbl (kernel transaction list blocks changed delete list of block SO’s) function removes block SO’s/buffer handles, which calls kcbnlc (kernel cache buffers analyse cleanout), which calls ktbdbc (kernel transaction block fast block cleanout) to perform delayed block cleanout/commit cleanout. This cleans up the state in the data block, which means it cleans up the lock bytes, set Fsc/Scn to the commit SCN and set the commit flag to C— in the ITL in the block.

The next state object that is cleaned is the shared table lock (TM); by looking at the functions it’s quite easy to understand that this is happening, ksqrcl is the function to release and enqueue, and ktaidm is kernel transaction access internal deallocate dml lock function.

Past releasing the TM enqueue, there are other things done for which I didn’t put their function names in, but the execution is returning from a lot of the other functions shown as calling functions. Of course Oracle needs to update all kind of counters and usage statistics, and record audit information. But eventually, everything has been released. However, there is something more that is executed as part of a commit. This is the second invocation of the kcrf_commit_force_int function.

Actually, when kcrf_commit_force_int is executed for the second ‘use’, this is visible with the second argument of the calling arguments is set to ‘1’ (not visible in the function call overview above). The functions that are executed in kcrf_commit_force_int are actually exactly the same as the first invocation:
– kcscu8 is called to read the on disk SCN from kcrfsg_
– kcscu8 is called to read the LWN SCN from kcrfsg_
Also the same logic is applied to the values that are the result of calling kcscu8 to read the SCN values as stated previously. If these SCNs did not progress far enough, ksbasend is called.

The interesting thing of the second execution of kcrf_commit_force_int happens after ksbasend: the kcrf_commit_force_int function loops until the on disk SCN has progressed beyond the process’ commit SCN (which means the change vectors are written from the public redo strands into the online redologfiles). To indicate it’s waiting/looping for the on disk SCN to progress that far, the wait interface is called (kslwtbctx) for the wait ‘log file sync’, after which it loops, for which I put a hyphen before the start of the loop to indicate what to loop consists of.

I illustrated the post/wait mode of log file sync, which is visible with ‘kslawe’ (kernel service lock management add post-wait entry). The post-wait entry is removed inside kslwaitctx, and then setup again. Interestingly, when in post-wait mode, the process must be posted by the logwriter, even if it finds the on disk SCN to have progressed beyond the process’ commit SCN. The other mode for waiting for the on disk SCN is called ‘polling’, search my blog for articles about it if this sparked your interest.

The intention of this blogpost is not to bury you in Oracle internal functions, despite the look of the article and the amount of functions mentioned :-). The aim for spelling out the functions is to show what happens, and to learn about the layers in which they execute.

If you skip past the first couple of functions that are executed with a commit, the ktc (kernel transaction control) layer is crossed, then the ktu (kernel transaction undo) layer, after which the change is executed under supervision of the kcb (kernel cache buffer) layer.

In fact, the rough outline of the change is the same as described in the previous article about insert: kcbchg1, kcbchg1_main, kcopcv, kcrfw_redo_gen_ext, etc. Just like with the insert, the function called in the function kco_issue_callback sets the type of block change, which is kturcm with commit.

A commit is a change to the block that holds the undo segment’s transaction table, and flags the current transaction as committed. This is what is commonly referred to as a ‘commit marker’.

After the kturcm function, the transaction is changed to the status committed. However, if you look closely, there are several functions executed AFTER kturcm, like kco_blkchk, kcoadv_hdr and kcbhfix_tail that complete the change made in kturcm in order to make the block consistent.

After block changes and the change vector checksum in kcrfw_partial_checksum, a function unique to commit is executed: kcrf_commit_force_int. The first time invocation of this function signals the logwriter to write.

At the time of kcrf_commit_force_int and returning from it into the function kcrfw_redo_gen_ext back to kcbchg1_main, the newly inserted value is not available, but when the execution in the kcbchg1_main function reaches ktuulc to clean up the TX enqueue, the the NEW value becomes available!

This is something which I still do find counter intuitive, because this means at the above mentioned time, which is prior to reaching ktuulc the change becomes visible to all sessions but the committing session. The committing session at that point needs to clean up the block a little, and later on remove the shared TM enqueue, and after that, the committing session executes kcrf_commit_force_int again to wait for the ‘commit marker’ and obviously all successive change vectors to complete. WHILE ALL OTHER SESSIONS CAN SEE AND USE THE CHANGED DATA FOR WHICH THE COMMITTING SESSION IS WAITING!

This blogpost looks at the very start of oracle redo: the generation of it. In order to do that, I start off with a very simple table, and look at the redo generation part. I guess the regular readers of this blogpost series understand that redo generation is closely connected with making changes made to blocks. This post therefore is not only about redo generation, but also about how the technical implementation of block changes.

I created a simple table (create table test (f1 varchar2(10)) with no index to make the execution as simple as possible, and simply insert rows (insert into test values (‘a’)). It could be argued that not having an index makes it not the most real life scenario, and this might be right. However, the goal here is to make the execution as simple as possible.

I then looked at the execution of the SQL, and created an overview of the relevant functions that are executed in my session:


How to read this: I started looking at the functions that are executed when an insert statement is executed at the ‘insexe’ function, which is the main insert function. An indention means it is called from the previous less indented function.

The insertion code calls query execute rowsource load table conventional functions (qerltc). Conventional here means that it’s not a direct insertion. The qerltcStart function essentially initialises memory which is needed to perform the execution of the insert statement.

After the rowsource layer, the data layer is entered using the kdtSimpleInsRow. The first thing that needs to be done here is to find space in a block in which the row can be inserted. Finding a block is done in the kdtgrs function (kernel data table get space for rows). Once a block is found, the code inserts the row using the kdtwrp function (kernel data table write row piece). Via the kdtchg function (kernel data table change) function, the transaction layer is entered.

The transaction layer function that are then used are ktbchg2 (kernel transaction block (header) change) and ktuchg2 (kernel transaction undo called from write logs and perform changes), after which the transaction is initialised in ktcbgn (kernel transaction control begin) next is ktubnd (kernel transaction undo bind undo segment. Inside ktubnd not only the undo segment is selected (‘bound’), but the transaction enqueue lock is set too (ksqgtlctx), which locks the row logically. Further down below you will see Oracle also needs to provide a physical way of locking access when it makes changes to prevent from wrong data being read.

Then we enter the cache layer via the kcbchg1 function (kernel cache buffers change). The cache layer is the actual layer that executes functions to change the block. The actual main function which performs block layer functions is kcbchg1_main.

In order for Oracle’s recovery mechanism to work, the redo must be generated before the block change, so a block can be recovered using the redo. Therefore, kcrfw_redo_gen_ext (kernel cache redo file write/broadcast SCN main redo generation function (12c)) is called. Inside kcrfw_redo_gen_ext, the redo change vectors are copied into the public redo strand using the kcrfw_copy_cv function (kernel cache redo file write/broadcast SCN copy change vectors), after which the changes to the block in the cache are done using the kcbapl (kernel cache buffers apply).

There are three kcbapl executions; the first one applies the change to the undo header where the transaction is assigned a slot which points to the actual undo data inside the undo segment, visible inside the kco_issue_callback function kturdh (kernel transaction undo write undo header data to undo block), the second one applies the change to the undo block with the previous version of the data, so a change be undone and read consistency can be provided. This is also visible inside the kco_issue_callback function: kturdb (kernel transaction undo write undo data to undo block), and the second invocation of kcbapl has the function kdoirp (kernel data operations insert row piece), which performs the actual change of the block to insert the row.

If you look closely at the functions inside kcbapl, you see that a check is done to the buffer prior the the change (kcbhlchk, kernel cache buffers header logical check), and after the change the block metadata is changed to reflect the change (kcoadv_hdr, kernel cache operations advance SCN in buffer header and kcbhfix_tail, kernel cache buffers header update tail).

After the change, a checksum is applied to the generated redo using kcrfw_partial_checksum.

This concludes the insert. A row lock is obtained, the redo change vectors have been copied into the public redo strand, and the changes have been applied to the undo and table blocks.

The functions return, and execution is back at the query execute rowsource layer (qer), which performs some post processing using qerltcPostInsertProcessing, qerltcSetupOerdef and qerltcFreeMemory.

A question that I asked myself is where the actual redo change vectors are created. It’s known that kcrfw_copy_cv copies the change vectors into the public redo strand, but these need to created before you can copy them. Are these created on the spot in kcrfw_copy_cv? Probably not. Of course this dives so deep into the actual working of the Oracle engine that Oracle does not reveal anything about it (that I am aware of).

Luckily, using the Intel Pin tools trace we can see memory access, and therefore we can track the memory area’s that a process is using, and see where (in which function) memory is written to before it is written into the public redo strand. First let’s look the kcrfw_copy_cv function:

40630:__intel_ssse3_rep_memcpy+8727:0x00000000d80cbe54(Redo Buffers(pgsz:2048k)|+835156 redo|PUB_REDO_0+835156 ):W:16:0x5223030ffff10100542b760/-263813982931104()
40633:__intel_ssse3_rep_memcpy+8742:0x00000000d80cbe64(Redo Buffers(pgsz:2048k)|+835172 redo|PUB_REDO_0+835172 ):W:16:0x000010ffff00000000/4785070309113856()
40634:__intel_ssse3_rep_memcpy+8747:0x00000000d80cbe68(Redo Buffers(pgsz:2048k)|+835176 redo|PUB_REDO_0+835176 ):W:16:0x10ffff0000000040200/-4503599627107840()

What this shows, is the kcrfw_copy_cv function calling _intel_fast_memcpy. The _intel_fast_memcpy is using a specialised memcpy function, __intel_ssse3_rep_memcpy, which uses a processor extension called ‘ssse3’. At line numbers 40629, 40631 and 40632 16 bytes are read and at 40630, 40633 and 40634 the reads are written into the public redo strand.

In order to understand where the data in these 16 bytes memory addresses has been written, I use an extension to my pinatrace annotate oracle script that creates a sqlite database of the annotation results using the ‘-d’ switch.

In order to see where the contents of the memory address 0x00007ffcd2011678 have been written, fire up sqlite3 with the created database:

$ sqlite3 insert.txt.db
sqlite> select * from annotation where memory_address between printf('%d',0x00007ffcd2011678) and printf('%d',0x00007ffcd2011678)+15 and line < 40629;
17393|kglpin+13|140723831772792|??|W|8|1810933992|Variable Size(pgsz:2048k)|+183544040 shared pool|KGLH0^d0901451 DS:free memory,duration 1,cls free+96 shared pool|KGLH0^d0901451,duration 1,cls recr+2792
18697|kglpin+1993|140723831772792|??|R|8|1810933992|Variable Size(pgsz:2048k)|+183544040 shared pool|KGLH0^d0901451 DS:free memory,duration 1,cls free+96 shared pool|KGLH0^d0901451,duration 1,cls recr+2792

The query is simple: look up all records which were generated before line 40629, which did access the memory region that is read for copying into the public redo strand.

This is the complete output, please be aware that the high memory addresses are PGA memory addresses. PGA memory is highly fluent in nature. If you look in memory access traces you see that nearly every layer that is used during execution of SQL and PL/SQL requires allocation of memory area’s to store information. In fact, it’s so fluent that some memory area’s can be reused during repetitions in execution within the same statement.

The “__intel_memset’ functions are functions that set a memory area to a certain value, in this case zero. This might be an initialisation of memory. In order to be sure, lookup the lines (30665/31116), and see from what function they were called:

30621:kdtgrs+244:0x000000006c20fb98(Variable Size(pgsz:2048k)|+186710936 shared pool|PLMCD^6daf6103 DS:permanent memor,duration 4,cls perm+2472 shared pool|PLMCD^6daf6103,duration 4,cls freeabl+2512 ):R:1:0x8/8
30622:kdtgrs+250:0x000000006c20fba4(Variable Size(pgsz:2048k)|+186710948 shared pool|PLMCD^6daf6103 DS:permanent memor,duration 4,cls perm+2484 shared pool|PLMCD^6daf6103,duration 4,cls freeabl+2524 ):R:1:0x80/

It’s called from kdtgrs. And by looking at the long list of memset commands, this clearly is initialisation of memory. This means that any allocation before the memset is something completely different. This means these functions did provide the redo information:


The ktuchg2 function hints at undo, and the kcopcv function is a function that seems to examine the block and create redo information based on the block. The ub8_to_kscn_impl is a function for the conversion of SCN numbers.

Let’s take the next memory region, and see what functions manipulated region:


A little more of the same, ktuchg2/kcopcv/ub8_to_kscn_imp, although in another order.
And the third memory region:


More of the same, although another function was added ‘kcobrh’. I am not familiar with this function, although my suspicion is this is a function to generate a header. Of course these 3 regions were copied using the same call.

The next memcpy call copies two 16 byte regions:

40645:__intel_ssse3_rep_memcpy+9249:0x00000000d80cbe88(Redo Buffers(pgsz:2048k)|+835208 redo|PUB_REDO_0+835208 ):W:16:0x120880000000000000/612489549322387456()
40647:__intel_ssse3_rep_memcpy+9258:0x00000000d80cbe78(Redo Buffers(pgsz:2048k)|+835192 redo|PUB_REDO_0+835192 ):W:16:0x100069600e810008a1c0/7593084918960792000()

(lines 40644 and 40646)
Which memory is written to by different functions. In fact, these were the functions that I suspected to generate the redo information:


One write of ktuchg2 and then ktugur (kernel transaction undo generate undo and redo).

I gone through all of the memory area’s that are written to the public redo strand, and found the functions in which redo is generated:
– ktuchg2
– kcopcv
– kcobrh
– ktugur
Other functions that I have found in other regions of memory:
– kdtwrp
– kcosze
– ktugul

There is one other thing I found that I didn’t suspect. During the investigation I used a debugger to stop on certain functions (in fact, that is how I gotten the list of functions at the start of this article) during the insert. When doing that, I tried querying the table in order to see what happened visibly to the table at which I performed the insert while it was happening.

I found that the kcbchg1_main function performs cache buffers chains latch gets without calling kslgetl/ksl_get_shared_latch(_int). This has been documented on a few places, it is using a macro to perform the latch get. However, when stopping (by means of breaking execution using a debugger), I found that my other session simply executing a ‘select’ query hung on the latch.

Also, after the cache buffers chain latch get, the block is getting modified. That’s obvious of course, we are doing an insert. However, during the entire time between kcopcv up to the two kcbapl functions, the block is not accessible for any querying, and keeps any other query waiting in the wait event ‘buffer busy wait’. This is a direct violation of the mantra that writers don’t block readers.

Now, before you jump to conclusions, please bare with me a little more. Is this a problem? No, this is the direct consequence of concurrency control, which, in a shared system, always must happen. Certain things can only be done one at a time. In fact, that is the whole purpose of a latch.

However, I do want to make you think about it. These concurrency control waits simply must happen in order not to disturb changes being made to a block, or read an inconsistent version of a block. And this probably in much the same way has been for a long time this way in Oracle, and it has not a problem. Generally. Where this does become a problem, is when you have your processes randomly sleep for longer periods of time. Because of your inserting process is stalling during executing the kcrfw_redo_gen_ext function, other processes will get a buffer busy wait.

So when does this happen? Why am I saying this? Well, these random waits of processes is something that actually happens when your system is overloaded on CPU, or when your system is actively swapping. Now think about this in a broader sense. Whenever you run virtualised, and you do not have strict policies enforced on CPU and memory allocation, and especially if there is oversubscription of CPU and/or memory, these random waits can be applied by the hypervisor. So, never -ever- let your database server run short on CPU or memory, because the database will start showing a very peculiar waits, which could put you on your wrong foot very easily.

I actually tried putting in three distinct messages in this blogpost. The first one is to show how a change (an insert) is done internally. This shows that Oracle is build up in layers, and a lot of these layers are crossed even when you do the simplest of things, like inserting a single row in a table. Yes, this is not simple or even plain confusing. That’s not something I can change. However, I do think that this information is crucial in order to understand how the database works, and thus how to troubleshoot issues and understand what happens.

When a change is made to a block, first the redo is generated, then the undo is written, and then a row is inserted into the data block.

The second message is research on where the redo change vectors are generated. This is background information that will not help with anything concrete, it’s just background information for improving understanding how Oracle works. The research shows that redo is generated in a couple of functions:
– ktuchg2
– kcopcv
– kcobrh
– ktugur
– kdtwrp
– kcosze
– ktugul

At certain points, Oracle serialises access to blocks in order to make sure changes are not disturbed, and other processes do not read an inconsistent version of a block. This is technical locking, not the logical transaction locking that Oracle provides using enqueues. This is the third piece of information that I put in this article. This is directly usable and concrete information, and this is crucial, especially in today’s world of using virtualisation on premises or in the cloud.

If processes get stuck during these times of serialisation, all kinds of waits become visible, which might look like other problems; my first thought with buffer busy waits is a block being read into the buffer cache from disk or being written from the buffer cache onto disk, not being modified in the cache. In fact, if you start inserting into a freshly truncated table, I found I could get TX lock waits while doing a select. Again: this is not obvious.

So the last paragraph warns to make absolutely sure your database server is not oversubscribed on any resource, especially CPU and memory, because it can show itself in unpredictable ways, making it harder to diagnose, and thus to solve.

This is the second post in a series of blogposts on Oracle database redo internals. If you landed on this blogpost without having read the first blogpost, here is a link to the first blogpost: The first blogpost contains all the versions used and a synopsis on what the purpose of this series of blogposts is.

In the first part, I showed how the principal access to the public redo strands is controlled by redo allocation latches, and showed a snippet of trace information of memory accesses of a foreground session when using the first public redo strand:

kcrfw_redo_gen_ext+11277:0x0000000076fdf3bc(shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf3d8(shared pool|permanent memor,duration 1,cls perm+16114648 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf430(shared pool|permanent memor,duration 1,cls perm+16114736 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3d0(shared pool|permanent memor,duration 1,cls perm+16114640 ):R:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11312:0x0000000076fdf428(shared pool|permanent memor,duration 1,cls perm+16114728 ):W:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3c8(shared pool|permanent memor,duration 1,cls perm+16114632 ):R:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11322:0x0000000076fdf420(shared pool|permanent memor,duration 1,cls perm+16114720 ):W:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11327:0x0000000076fdf3c0(shared pool|permanent memor,duration 1,cls perm+16114624 ):R:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11332:0x0000000076fdf418(shared pool|permanent memor,duration 1,cls perm+16114712 ):W:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11337:0x0000000076fdf3b8(shared pool|permanent memor,duration 1,cls perm+16114616 ):R:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11342:0x0000000076fdf410(shared pool|permanent memor,duration 1,cls perm+16114704 ):W:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11347:0x0000000076fdf3b0(shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf408(shared pool|permanent memor,duration 1,cls perm+16114696 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:8:0x7270/29296()
kcrfw_redo_gen_ext+11312:0x0000000076fdf400(shared pool|permanent memor,duration 1,cls perm+16114688 ):W:8:0x7270/29296()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf398(shared pool|permanent memor,duration 1,cls perm+16114584 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf3f0(shared pool|permanent memor,duration 1,cls perm+16114672 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf390(shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11342:0x0000000076fdf3e8(shared pool|permanent memor,duration 1,cls perm+16114664 ):W:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11347:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11352:0x0000000076fdf3e0(shared pool|permanent memor,duration 1,cls perm+16114656 ):W:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x7270/29296()
kcrfw_redo_gen_ext+11386:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):W:4:0x7271/29297()
kcrfw_redo_gen_ext+11392:0x0000000076fdf468(shared pool|permanent memor,duration 1,cls perm+16114792 ):R:8:0xe0000000/3758096384(Redo Buffers(pgsz:2048k)|+134217728 )

And a snippet of trace information of memory accesses of a foreground session when using the second public redo strand:

kcrfw_redo_gen_ext+11277:0x0000000076fdf4e4(shared pool|permanent memor,duration 1,cls perm+16114916 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf500(shared pool|permanent memor,duration 1,cls perm+16114944 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf558(shared pool|permanent memor,duration 1,cls perm+16115032 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4f8(shared pool|permanent memor,duration 1,cls perm+16114936 ):R:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11312:0x0000000076fdf550(shared pool|permanent memor,duration 1,cls perm+16115024 ):W:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4f0(shared pool|permanent memor,duration 1,cls perm+16114928 ):R:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11322:0x0000000076fdf548(shared pool|permanent memor,duration 1,cls perm+16115016 ):W:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11327:0x0000000076fdf4e8(shared pool|permanent memor,duration 1,cls perm+16114920 ):R:8:0x800/2048()
kcrfw_redo_gen_ext+11332:0x0000000076fdf540(shared pool|permanent memor,duration 1,cls perm+16115008 ):W:8:0x800/2048()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4e0(shared pool|permanent memor,duration 1,cls perm+16114912 ):R:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11342:0x0000000076fdf538(shared pool|permanent memor,duration 1,cls perm+16115000 ):W:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4d8(shared pool|permanent memor,duration 1,cls perm+16114904 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf530(shared pool|permanent memor,duration 1,cls perm+16114992 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:8:0xe6/230()
kcrfw_redo_gen_ext+11312:0x0000000076fdf528(shared pool|permanent memor,duration 1,cls perm+16114984 ):W:8:0xe6/230()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4c8(shared pool|permanent memor,duration 1,cls perm+16114888 ):R:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf4c0(shared pool|permanent memor,duration 1,cls perm+16114880 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf518(shared pool|permanent memor,duration 1,cls perm+16114968 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4b8(shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11342:0x0000000076fdf510(shared pool|permanent memor,duration 1,cls perm+16114960 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4b0(shared pool|permanent memor,duration 1,cls perm+16114864 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11352:0x0000000076fdf508(shared pool|permanent memor,duration 1,cls perm+16114952 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11366:0x0000000076fdf560(shared pool|permanent memor,duration 1,cls perm+16115040 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0xe6/230()
kcrfw_redo_gen_ext+11386:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):W:4:0xe7/231()
kcrfw_redo_gen_ext+11392:0x0000000076fdf590(shared pool|permanent memor,duration 1,cls perm+16115088 ):R:8:0xe0200000/3760193536(Redo Buffers(pgsz:2048k)|+136314880 )

If you pay really close attention you see the memory location moved from 0x76fdf3?? / 0x76fdf4?? to 0x76fdf4?? / 0x76fdf5??, all in permanent memory in the shared pool. Would this mean information for each strand is kept in a struct per strand? Let’s take a memory allocation on a specific offset in the kcrfw_redo_gen_ext function, and take the allocation on exactly the same offset in the kcrfw_redo_gen_ext function, which means it’s probably doing the same, and calculate the difference of the memory allocation:

strand 0
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8
strand 1
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8

Now take the memory addresses, and calculate the difference: 0x76fdf520 – 0x76fdf3f8 = 0x128 = 296

Okay, so would the size of the probable struct per strand be 296 bytes? Another question: where would this potential struct start?

Often, the location of structs is fetched from another location, just like we saw earlier for the redo allocation latch, especially if they are more dynamically created than fixed SGA structures, which are not dynamic by nature. Upon close examination of similarities between both pina traces, I found usage of a pointer in the kcrfsg_ struct at offset +0:

kcrfw_redo_gen_ext+14362:0x0000000060016228(fixed sga|var:kcrfsg_+0 ):R:8:0x76fdf388/1996354440(shared pool|permanent memor,duration 1,cls perm+16114568 )

The difference between the memory address I took and the pointer address in kcrfsg_ at offset 0 for the first public redo strand is:
0x76fdf3f8 – 0x76fdf388 = 0x70 = 112
The difference between the memory address I took for the second public redo strand and the pointer in kcrfsg_ at offset 0 is:
0x76fdf520 – 0x76fdf388 = 0x198 = 408
Then subtract with the calculated difference between the memory locations of the first and second public redo strand:
408 – 296 = 112

It would be really helpful to have some kind of explanation about this structure… I searched through My Oracle Support (very seldomly the regular knowledge base provides some internals information, the best chance for internals is the bug database, the rest is…well, let me say: not that helpful. I also hate that other tabs are enabled if nothing is found), oraus.msg, oradebug dumplist, oradebug doc component for any tracing or dumps that could give more information, but nothing seemed to point to any explanation. I dumped the shared pool, but I found the same information (the memory location being permanent memory) as I obtained from x$ksmsp. Also, since the pinatrace annotation takes care of any x$ view that has it’s source address in the SGA, it seems there is no way other than puzzle all the individual locations in what I assume is a structure that holds per strand metadata.

I decided to go through the x$ views one more time, and found an x$ view that I already knew of, which points to both ‘strands’ and ‘kcrf’; x$kcrfstrand. However, the reason it is not annotated is because this x$ view has it’s addresses in the PGA:

SQL> select addr from x$kcrfstrand;


This means that when querying this x$ view, the results are gathered from wherever it comes from (the source can be file headers like redo (x$kccle) and controlfile (x$kcccf) for example, in this case we will the actual source of x$kcrfstrand is the SGA) and then stored temporarily in PGA, potentially manipulated, and then shown from the PGA, as we can see from the addresses (addr field in x$kcrfstrand).

However, what I can do to investigate where the x$kcfstrand takes its information from and run the pina (memory access) trace on the execution of ‘select * from x$kcrfstrand’, and try to see if I can find where it takes its data from. This is how the relevant part looks like:

kcrf_strand_ftcb+23:0x00000000600169b8(Fixed Size(pgsz:2048k)|+92600 fixed sga|var:kcrf_max_strands_+0 ):R:4:0x2/2()
kcrf_strand_ftcb+29:0x00000000600169c8(Fixed Size(pgsz:2048k)|+92616 fixed sga|var:kcrf_actv_pvt_strands_+0 ):R:4:0/0()
kcrf_strand_ftcb+52:0x0000000060016228(Fixed Size(pgsz:2048k)|+90664 fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|permanent memor,duration 1,cls perm+16114568 )
__intel_ssse3_rep_memcpy+184:0x0000000076fdf4a0(shared pool|permanent memor,duration 1,cls perm+16114848 ):R:16:0x00000000200040000/8590196736(??)
__intel_ssse3_rep_memcpy+9254:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:16:0xfe56e00000fe56e00000/-2305841916834480128()

Going from top to bottom through this snippet:
– qerfxFetch is the query execute rowsource function to query x$ tables, kcrf_strand_ftcb is an helper function (kernel cache recovery file strand fetch copy buffer is my guess) to fetch the kcrf strand data from the shared pool structure; the other functions are memory copy functions.
– Interestingly, the kcrf_strand_ftcb function reads offset 0 of kcrfsg_ just like we seen before. This is quite logical if you think about it, this struct is dynamically allocated in the shared pool, which means it doesn’t have a fixed address, so it needs to be looked up every time something must be read from it.
– The next functions are memcpy functions. I’ve shown the relevant ones. The first memcpy (_intel_fast_memcpy) probably is a helper function that switches to an intel optimised memcpy, __intel_ssse3_rep_memcpy, which takes advantage of intel’s SIMD optimisations (SSE3 means Supplemental Streaming SIMD Extensions 3).
– The memcpy functions at lines 11 and 12 show it’s reading 16 bytes from 0x76fdf4a0, and then writing 16 bytes to 0x7f7780448670.
– 0x76fdf4a0 is the struct for the second strand: kcrfsg_+0 = 0x76fdf388 + 296 = 0x76fdf4b0 + 16 (copy size) = 0x76fdf4a0.
– 0x7f7780448670 is quite probably and address in the process’ PGA, which is used for the intermediate representation of the struct’s data for displaying it for the x$ view.

Even more interesting, this confirms that x$kcrfstrand is the externalisation of the struct that is used together with kcrfsg_, and the calculations that were made previously are correct; kcrfsg_+0 = 0x76fdf388, and that address with 296 added is 0x76fdf4b0, which is the same as 0x76fdf4a0 + 16 (if you look into the above example, the memcpy function copies 16 bytes at a time).

I think it’s pretty safe to assume this kcrf related struct in the shared pool is commonly referred to as ‘KCRFA’ in x$kcrfstrand. Of course we can only guess that the ‘A’ means in kernel cache recovery file, I would say ‘auxiliary’, but is could also very well be the A in ‘strAnd’. It is a structure that clearly was needed when oracle changed the concept of a single log buffer into multiple public redo strands, because identical information of multiple strands need to be stored at the same time, and the information in kcrfsg_ is singular of nature.

Please mind that because of the intermediate representation of the x$ table in PGA, I could not use the trick I showed in my ‘drilldown into the logwriter inner working and communication’, which is query a single column in the x$ view, and see which memory addresses in this struct are touched; the entire struct is copied into PGA and processed before it is used to show the contents, even if only a single column of a single row is queried.

Also, also consider that since the struct data is copied into the process’ PGA in order to make this information human readable, not all offsets (in x$kqfco) of the PGA representation might be applicable to the SGA struct. In fact, I think I have proven that the KCRFA per strand metadata width is 296. Now look at the column offsets from x$kqfco of x$kcrfstrand:

SQL> select c.kqfconam, c.kqfcooff, c.kqfcosiz
from x$kqfco c, x$kqfta t
where t.indx = c.kqfcotab
and t.kqftanam = 'X$KCRFSTRAND'
order by c.kqfcooff;

KQFCONAM                         KQFCOOFF   KQFCOSIZ
------------------------------ ---------- ----------
ADDR                                    0          8
INDX                                    0          4
INST_ID                                 0          4
CON_ID                                  0          2
STRAND_HDR_SCN_KCRFA_CLN                0          8
HIGHEST_SCN_KCRFA_CLN                   0          8
PNEXT_BUF_KCRFA_CLN                    32          8
NEXT_BUF_NUM_KCRFA_CLN                 40          4
BYTES_IN_BUF_KCRFA_CLN                 48          8
AVAILABLE_BUFS_KCRFA_CLN               56          4
MEM_BUFS_AVAILABLE_KCRFA_CLN           60          4
STRAND_HEADER_BNO_KCRFA_CLN            72          4
FLAG_KCRFA_CLN                         76          1
PVT_STRAND_STATE_KCRFA_CLN             77          1
HIGHEST_SUBSCN_KCRFA_CLN               78          2
LWN_BUF_NUM_KCRFA_CLN                  80          4
STRAND_NUM_ORDINAL_KCRFA_CLN           84          4
STRAND_INFO_PTR_KCRFA_CLN              88          8
ACTION_KCRFA                          184          4
LOG_RESIDUE_KCRFA                     188          1
LAST_BUF_WRITTEN_KCRFA                216          4
LAST_BUF_GATHERED_KCRFA               220          4
CONS_LOW_BUFNO_KCRFA                  224          4
FIRST_BUF_KCRFA                       240          8
LAST_BUF_KCRFA                        248          8
TOTAL_BUFS_KCRFA                      256          4
STRAND_SIZE_KCRFA                     260          4
LOWBUF_KCRFA                          272          4
HIBUF_KCRFA                           276          4
WRAPHIBUF_KCRFA                       280          4
LOG_SEQ_KCRFA                         284          4
PTR_KCRF_PVT_STRAND                   304          8
INDEX_KCRF_PVT_STRAND                 312          4
SPACE_KCRF_PVT_STRAND                 316          4
TXN_KCRF_PVT_STRAND                   320          2

This shows that the representation of the KCRFA struct in PGA apparently has a width of 322 bytes, while we know the KCRFA struct itself is 296. I am not sure why this is, the most obvious reason would be some data is processed already in PGA to make easier to show. This might also very well be the reason there is a need for copying it to PGA first, and not display it directly from its shared pool location.

It’s general knowledge that the Oracle database is ACID compliant, and that SCNs or ‘system change numbers’ are at the heart of this mechanism. This blogpost dives into the details of how the Oracle engine uses these numbers.

Oracle database version
Operating system version: OL 7.2, kernel: 4.1.12-61.1.14.el7uek.x86_64 (UEK4)

Redo generation
Whenever DML is executed, redo is generated in the form of ‘change vectors’. These change vectors are copied into the redo buffer as part of the transaction, during the transaction. The function that performs this action is called ‘kcrfw_copy_cv()’. This can be derived by watching the foreground process perform memory copy into the memory area of the redo buffer.

In order to do this, you first need to find the memory area of the redo buffer. This can be done by executing ‘oradebug setmypid’ and ‘oradebug ipc’ as sysdba, and examine the resulting trace file:

 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008534000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1  4784130 0x00000096400000 0x00000096400000 0x00000096400000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008534000 0000000008600000 default       readwrite

This shows the redo buffers are allocated between 0x96400000 and 0x9E934000 (actual address and actual address + segment size).

The function Oracle uses to copy memory is called “_intel_fast_memcpy()”. This probably is an optimised version of the standard C memcpy() function. What is important, is that the arguments of memcpy function:

$ man memcpy
       memcpy - copy memory area
       #include <string.h>
       void *memcpy(void *dest, const void *src, size_t n);

So the first argument of memcpy is a pointer to the destination, the second argument is a pointer to the source, and the third argument is the size of the memory area to be copied.

In order to find the function that uses memcpy() to copy into the redo buffers area, we can use the following gdb macro:

break _intel_fast_memcpy if $rdi > 0x96400000 && $rdi < 0x9E934000
printf "_intel_fast_memcpy dest %lx, source %lx, size %d\n", $rdi, $rsi, $rdx
frame 1

The macro creates a conditional breakpoint on the function _intel_fast_memcpy(). The condition is that the CPU register RDI (first argument, destination) should be higher than 0x96400000 and lower than 0x9E934000, which is the memory area the redo log buffers are mapped to. If the breakpoint and its conditions are triggered, the function name is printed, together with the arguments to memcpy, and ‘frame 1’ to list the function that called _intel_fast_memcpy().

To execute this macro, startup an Oracle foreground session, and attach to the session with gdb, and execute the macro. Next, execute an insert statement. In my case, I created a very simple table called ‘T’ with one field, T, varchar2(1). Please mind that you need to watch out for deferred segment creation, so ideally you first need to create the table, and insert a few rows to take care of that.

This is how the gdb side looks like when executing an insert statement:

_intel_fast_memcpy dest 96817454, source 7ffc9fcc9598, size 36
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817478, source 7ffc9fcc9658, size 32
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817498, source 7faf6ad677f0, size 44
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174c4, source 7ffc9fcc9678, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968174d8, source 7ffc9fcc95e4, size 76
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817524, source 7faf6ad67888, size 8
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 9681752c, source 7faf6ad67760, size 20
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817540, source 7faf6ad69ff8, size 40
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817568, source 7faf6ad6a070, size 24
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 96817580, source 7faf6ad69f60, size 49
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()
_intel_fast_memcpy dest 968175b4, source 89753568, size 1
#1  0x000000000ccbaa42 in kcrfw_copy_cv ()

The destination of the memcpy() action is the logbuffers, the break conditions made sure of that. The source of all except for the last memcpy() is the PGA (visible by the high number), the last memcpy() copies from the SGA. The memory address points to an SQLA component (cursor) in the SGA.

Okay, at this point we established the redo of a DML operation is put in the redo buffers by a function called kcrfw_copy_cv(). Also please mind that modern Oracle versions do put redo into the redo buffers during doing DML, not after it!

To move further on the topic of transactions, let’s look at when the current SCN is fetched and when kcrfw_copy_cv() is called. To do this, I use the following gdb macro:

break kcrfw_copy_cv
printf "kcrfw_copy_cv\n"
awatch *0x6001fbb0

When running this gdb macro against an Oracle session running an insert statement, the following output is generated:

Hardware access (read/write) watchpoint 2: *0x6001fbb0

Old value = 347661
New value = 347666
0x000000000ccbdfb3 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbeb04 in kcsgssn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbdfb3 in kcscur3 ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe9ad in kcsgrsn ()
Hardware access (read/write) watchpoint 2: *0x6001fbb0

Value = 347666
0x000000000ccbe827 in kcsgcsn ()

Actually, I was surprised by this output. I suspected a call to generate an unique SCN for this change. However, the calls/functions which are visible are:
– kcscur3: a function to read a SCN (the global SCN, but is also used to read other SCNs, like the LWN SCN and the on-disk SCN).
– kcsgrsn: kernel cache service get recent SCN, for reading the global SCN.
– kcsgcsn: kernel cache service get current SCN, for reading the global SCN.
– kcsgssn: kernel cache service get snapshot SCN, for reading the global SCN.
All of these functions do READ a SCN value, but do not modify it. In order to update a SCN, the function kcsgbsn (kernel cache service get batched SCN) should be called. There is a memory write shown (row 1-5), but that really is a read, it is an anomaly of watchpoints with memory changed by other processes.

So, conclusion at this point is that DML (in this case an insert statement) just reads the global SCN, and does not seem to update it.

This got me thinking: if an insert statement just reads the global SCN, instead of generating an unique SCN, it should be possible that the changes (redo vectors actually) of DML executed simultaneously in different sessions on different segments share the same SCN? It sounds counter intuitive to what I have been taught about Oracle.

This can be tested in the following way:
1 – Have two foreground sessions in a different schema execute an insert statement at exactly the same time. This can be done in iTerm2 using the ‘broadcast input’ option, which allows input to be executed in multiple panes at the same time. The command executed in both pane’s is ‘insert into t values (‘a’);’.

2 – Have gdb attached to the two sessions, having an access watchpoint at the global SCN, to be able to identify the SCN that both the session used during executing the insert, and a breakpoint at kcrfw_copy_cv() to understand which SCN it uses. I used the following gdb macro:

break kcrfw_copy_cv
    printf "kcrfw_copy_cv\n"
awatch *0x6001fbb0
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0

3 – To validate what is visible in gdb, dump the current logfile and find the SCN as shown in gdb. The current logfile can be dumped to trace using the following PL/SQL procedure:

-- This script will dump the current redolog file.
-- If the log file is big, this might take some time.
-- Original Author : Riyaj Shamsudeen
-- Modifications by: Frits Hoogland
-- No implied or explicit warranty !
set serveroutput on size 1000000
  v_sqltext varchar2(255);
 select 'alter system dump logfile '||chr(39)||member||chr(39) into v_sqltext
  from v$log lg, v$logfile lgfile
  where = and
  lg.sequence# = ( select sequence# from v$log where status='CURRENT' )
  and rownum <2;
  dbms_output.put_line ('Executing :'||v_sqltext);
  execute immediate v_sqltext;

In step 2, I got the following gdb output on both the inserts:

(1st session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860

(2nd session)
kcscur3 - SCN: 5c860
kcscur3 - SCN: 5c860

This is as expected based on our observation, because the insert statement only reads the global SCN. Now let’s dump the current redologfile as shown in step 3 above, and see if we can find SCN 5c860.
Here are the redo records for both changes:

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.0010 LEN: 0x01a8 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

REDO RECORD - Thread:1 RBA: 0x000014.0000a50e.01b8 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0005c860 SUBSCN:  1 10/21/2016 20:12:11

Watch the SCN from the gdb session and at the line with SCN and the SUBSCN value being the same. Also watch the RBA (redo block address) being different.

Both redo records contain 3 changes:

(1st session)
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0005c801 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:32 AFN:3 DBA:0x000000f9 OBJ:4294967295 SCN:0x0000.0005c800 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600c OBJ:20455 SCN:0x0000.0005c857 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

(2nd session)
CHANGE #1 CON_ID:0 TYP:0 CLS:35 AFN:3 DBA:0x00000110 OBJ:4294967295 SCN:0x0000.0005c7b2 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:0 TYP:0 CLS:36 AFN:3 DBA:0x00000895 OBJ:4294967295 SCN:0x0000.0005c78c SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:0 TYP:2 CLS:1 AFN:4 DBA:0x0000600e OBJ:20455 SCN:0x0000.0005c858 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000

These 3 transactions are consistent with an insert of a single row, this is visible by the opcodes:
5.2=update rollback segment header
5.1=undo block
11.2=insert row piece

This means the DML of the two inserts have exactly the same SCN! A logic thought that you might come up with is: but how is Oracle then able to apply these in the correct order, which is one of the fundamental reasons for having SCNs? For that, it’s important to realise this is just the change to a block, but not a committed transaction yet. In order to get a full understanding of how the SCN mechanism works for at least a single row insert, we need to change the gdb macro from the test above:

awatch *0x6001fbb0
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0
    if $rip >= 0xccbe340 && $rip <= 0xccbe7b0
      printf "kcsgbsn - SCN: %x (@%x)\n", *0x6001fbb0, $rip
break kcrfw_copy_cv
    printf "kcrfw_copy_cv\n"

The function kcsgbsn means: kernel cache service get batched SCN. This function DOES update the global SCN(!). However, this function does not get called during the insert. It does get called when commit is called (or rollback).

So, in order take another step in understanding SCN handling by Oracle, let’s use the updated version of the gdb macro, and insert another row into the table in two sessions, and then also commit:

(1st session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)

(2nd session)
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcscur3 - SCN: 7e39b
kcsgbsn - SCN: 7e39b (@ccbe3a3)
kcsgbsn - SCN: 7e39c (@ccbe3bb)
kcsgbsn - SCN: 7e39c (@ccbe63a)

In both sessions we see the insert (lines 2-4 and 11-13), and then the commit (lines 5-8 and 14-18).
During the insert, we see both sessions using the same SCN (7e39b), which can be verified by dumping the current redologfile:

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.0010 LEN: 0x01c0 VLD: 0x05 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

REDO RECORD - Thread:1 RBA: 0x000016.00018f19.01d0 LEN: 0x0194 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39b SUBSCN:  1 10/22/2016 12:30:45

And we see kcsgbsn() being called. As has been documented on multiple places, the current value of the global SCN is read, and then advanced (get and advance). This can be seen in lines 6 and 7; 0x7e39b is read, and then advanced to 0x7e39c. However, to be honest to my surprise, the second session does the same, as seen in lines 15-17, it advances from 0x7e39b to 0x7e39c too. However it rereads the SCN one additional time. The surprise is the sessions seem to have the same SCN for the commit. Validation from the dump of the current redologfile:

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.0174 LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:31 AFN:3 DBA:0x000000f0 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

REDO RECORD - Thread:1 RBA: 0x000016.00018f1a.01dc LEN: 0x0068 VLD: 0x01 CON_UID: 0
SCN: 0x0000.0007e39c SUBSCN:  1 10/22/2016 12:30:45
CHANGE #1 CON_ID:0 TYP:0 CLS:19 AFN:3 DBA:0x00000090 OBJ:4294967295 SCN:0x0000.0007e39b SEQ:1 OP:5.4 ENC:0 RBL:0 FLG:0x0000

Yes, both commits (visible by the opcode (OP) 5.4) have the same SCN. At this point I think it’s fitting to theorise why this is happening. I think this is happening and makes sense because the table is protected for consistency at another level in the database, which is the locking mechanisms (TX locks for rows, TM locks for segments), which guarantees that two transactions at the same time do not compromise data consistency. And because of that, it’s not a problem to batch them with the same SCN.

Changes in the database are synchronised by system change numbers, also known as SCNs. SCNs are the way Oracle serialises changes to data. However, this post shows that the Oracle engine tries to maximise the number of changes inside a SCN, and as such a SCN is not an highly granular serialisation mechanism. This is not a problem, since Oracle’s transactional protection mechanisms (like TX and TM locks) do work alongside SCNs to make sure no conflicting changes can happen at the same time. The changes itself do just read and pick the current global SCN from the instance, a commit of these changes does increase the global SCN, but as little as possible. You might find multiple changes in the same SCN, and you might even find commits with the same SCNs.

Bonus topic: finding nemokcscur3.
In case you wondered what this little macro means, and how got the memory addresses:

awatch *0x6001fbb0
    if $rip >= 0xccbdfa0 && $rip <= 0xccbe010
      printf "kcscur3 - SCN: %x\n", *0x6001fbb0

awatch is a memory access watch function. The commands executed when this watch triggers are:
-silent (do not display output)
-if $rip >= 0xccbdfa0 && $rip <= 0xccbe010 (if the instruction pointer contains a value higher or equal to 0xccbdfa0 and the instruction pointer contains a value lower or equal to 0xccbe010)
-printf "kcscur3 – SCN: %x\n", *0x6001fbb0 (print some stuff and print (in hexadecimal format) where the value the pointer 0x6001fbb0 points to)
The text 'kcscur3' in the printf spoils that these addresses are actually the memory addresses where the function kcscur3 is located. How to find these? The starting address can be found in a variety of ways, for example in gdb: 'info function kcscur3', or in the shell using nm: 'nm oracle | grep kcscur3'. It is a bit harder to find the ending address of a function. However, it's still quite simple doable with nm: 'nm -S oracle | grep kcscur3':

[root@o12102fs bin]# nm -S oracle | grep kcscur3
000000000ccbdfa0 0000000000000070 T kcscur3

The first column is the starting memory address of the kcscur3 function, the second column is the length (in hexadecimal) of the function code. 0xccbdfa0+0x70=0xccbe010.

Update oktober 24, 2016:
Roger noted the following: Minor correction, Oracle doesn’t do DML directly, it first generates change vectors which are then applied to the local database which is effectively running in recovery. This way Oracle uniquely guarantees recovery on the primary or physical standby will always do exactly the same thing as the original DML.

Stefan noted the following: Hey Frits. Great work – as always. However i am not quite sure about your kcrfw_copy_cv() PGA interpretation for private redo strands. AFAIK private redo strands reside in permanent memory allocations (in SGA) and you should see memory copy operations between these memory chunks and the public redo buffer(s) on commit. Your GDB breakpoint should only kick in under these circumstances as anything else is handled in private redo (at least on your mentioned database version) and no memory operations on redo buffer. I also modeled a short test case with on OEL 6.7 to verify my theory/assumption and the result was as expected – not quite sure about your observed and mentioned PGA to redo log buffer copy operations then.

I should add that when I run snapper on the insert in my database, the statistics ‘IMU- failed to get a private strand’ and ‘IMU pool not allocated’ are increased, giving a hint no private strands are used.

%d bloggers like this: