This is the sixth part in a blog series about Oracle database redo. The previous posts provided information about the log writer writing; this post is about the process that, after issuing a commit, waits for the log writer to write its redo from the public redo strand. When the database is using post/wait for process commits, the committing process follows this (simplified) procedure:

kcrfw_redo_gen_ext: 
 kcsgbsn8: kcsnew8: commit SCN is fetched from kcsgscn_
 kslgetl: redo copy latch is taken
 kslgetl: redo allocation latch is taken
 : allocation annotations are made for the commit record in kcrfsg_ and KCRFA
 kslfre: redo allocation latch is freed
 kcrfw_copy_cv : commit record is copied into a public redo strand.
 kslfre: redo copy latch is freed
 kcrf_commit_force_int: second argument 0: post logwriter if not already posted

These are the actual function calls inside kcrf_commit_force_int with second argument 0:

| | | | | | | | | | | | | | | | > kcrf_commit_force_int(0x7fffbea0de50, 0, ...)
| | | | | | | | | | | | | | | | | > kcscu8(0x60016290, 0x7fffbea0dc78, ...)
| | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
| | | | | | | | | | | | | | | | | > kcscu8(0x60016260, 0x7fffbea0dbc0, ...)
| | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
| | | | | | | | | | | | | | | | | > ksbasend(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | > ksbpro(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | < ksbpro+0x000000000063 returns: 0x7a6032e8
| | | | | | | | | | | | | | | | | | > ksbpser(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | < ksbpser+0x000000000046 returns: 0x1
| | | | | | | | | | | | | | | | | | > ksberr(0x600161a0, 0, ...)
| | | | | | | | | | | | | | | | | | < ksberr+0x000000000045 returns: 0
| | | | | | | | | | | | | | | | | | > ksasnd(0x7a6032e8, 0x1, ...)
| | | | | | | | | | | | | | | | | | | > ksasndint(0x7a6032e8, 0x1, ...)
| | | | | | | | | | | | | | | | | | | | > ksaamb(0x7a63a0b8, 0x7a5d9fe0, ...)
| | | | | | | | | | | | | | | | | | | | | > kslgetl(0x6003ef40, 0x1, ...)
| | | | | | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
| | | | | | | | | | | | | | | | | | | | | > _intel_fast_memcpy(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | <> _intel_fast_memcpy.P(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | <> __intel_ssse3_rep_memcpy(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x000000002030 returns: 0x7af99780
| | | | | | | | | | | | | | | | | | | | | > kslfre(0x6003ef40, 0x60009578, ...)
| | | | | | | | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0
| | | | | | | | | | | | | | | | | | | | < ksaamb+0x0000000001ed returns: 0x1
| | | | | | | | | | | | | | | | | | | < ksasndint+0x000000000120 returns: 0x1
| | | | | | | | | | | | | | | | | | | > ksl_post_proc(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | <> kskpthr(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | <> kslpsprns(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | | > ksl_update_post_stats(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | | | > dbgtTrcData_int(0x7f5b0ca886c0, 0x2050031, ...)
| | | | | | | | | | | | | | | | | | | | | | > dbgtBucketRedirect(0x7f5b0ca886c0, 0x7fffbea0d618, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > dbgtIncInMemTrcRedirect(0x7f5b0ca886c0, 0x6fa, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > skgstmGetEpochTs(0x7f5b0ca886c0, 0x6fa, ...)
| | | | | | | | | | | | | | | | | | | | | | | > gettimeofday@plt(0x7fffbea0d120, 0, ...)
| | | | | | | | | | | | | | | | | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
| | | | | | | | | | | | | | | | | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x203895d2b55e6
| | | | | | | | | | | | | | | | | | | | | | > dbgtrRecAllocate(0x7f5b0ca886c0, 0x7fffbea0d5c0, ...)
| | | | | | | | | | | | | | | | | | | | | | | > dbgtrPrepareWrite(0x7f5b0ca886c0, 0x711f5048, ...)
| | | | | | | | | | | | | | | | | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x1c
| | | | | | | | | | | | | | | | | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > _intel_fast_memcpy(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | <> _intel_fast_memcpy.P(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | <> __intel_ssse3_rep_memcpy(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x711f6ac0
| | | | | | | | | | | | | | | | | | | | | | > dbgtrRecEndSegment(0x7f5b0ca886c0, 0x7fffbea0d5c0, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | > skgpwpost(0x7fffbea0d760, 0x7f5b0cacdca0, ...)
| | | | | | | | | | | | | | | | | | | | <> sskgpwpost(0x7fffbea0d760, 0x7f5b0cacdca0, ...)
| | | | | | | | | | | | | | | | | | | | | > semop@plt(0x38000, 0x7fffbea0d6f0, ...)
| | | | | | | | | | | | | | | | | | | | | < semop+0x00000000000f returns: 0
| | | | | | | | | | | | | | | | | | | | < sskgpwpost+0x00000000009a returns: 0x1
| | | | | | | | | | | | | | | | | | | < kslpsprns+0x0000000001c3 returns: 0
| | | | | | | | | | | | | | | | | | < ksasnd+0x00000000006d returns: 0x1
| | | | | | | | | | | | | | | | | < ksbasend+0x0000000000f6 returns: 0x1
| | | | | | | | | | | | | | | | < kcrf_commit_force_int+0x000000000500 returns: 0

Basically, what happens is that kcscu8 is called to read 0x60016290 (kcrfsg_ on disk SCN), and kcscu8 is called again to read 0x60016260 (kcrfsg_ LWN SCN). Then ksbasend (kernel service background processes asynchronous send message) is called, which essentially performs two functions in ksasnd: it writes a message into the messages struct (protected by the messages latch) in ksaamb (kernel service asynchronous messages allocated message buffer), and then posts the logwriter using post/wait (ksl_post_proc; the actual posting using semop is done in skgpwpost, system kernel generic post wait post).
After this, the process also returns from the kcrfw_redo_gen_ext function that called kcrf_commit_force_int.

At this point the commit was executed, and the logwriter is notified that it needs to write up to the commit record from the public redo strand.

The next thing the process does is release things like the enqueue(s) it's holding. After that, the process will wait for the logwriter to execute the write and to be posted by it via post/wait. This is also done in the kcrf_commit_force_int function, however now the second argument is set to '1':

 | | | <> kcrf_commit_force_int(0x7f3011d75e10, 0x1, ...)
 | | | | > skgstmGetEpochTs(0x7f3011d75e10, 0x7, ...)
 | | | | | > gettimeofday@plt(0x7ffda5959810, 0, ...)
 | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x206daa374e6a1
 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcscu8(0x60016260, 0x7ffda5959ad0, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > ksbasend(0x600161a0, 0x60016908, ...)
 | | | | | > ksbpro(0x600161a0, 0x60016908, ...)
 | | | | | < ksbpro+0x000000000063 returns: 0x7a6032e8
 | | | | | > ksbpser(0x600161a0, 0x60016908, ...)
 | | | | | < ksbpser+0x000000000046 returns: 0x1
 | | | | | > ksberr(0x600161a0, 0, ...)
 | | | | | < ksberr+0x000000000045 returns: 0
 | | | | | > ksasnd(0x7a6032e8, 0x1, ...)
 | | | | | | > ksasndint(0x7a6032e8, 0x1, ...)
 | | | | | | | > ksaamb(0x7a6085b8, 0x7a5d9fe0, ...)
 | | | | | | | < ksaamb+0x000000000504 returns: 0
 | | | | | | < ksasndint+0x000000000120 returns: 0x1
 | | | | | < ksasnd+0x00000000006d returns: 0x1

Here exactly the same is happening as we saw previously. The way the kcrf_commit_force_int function works is (summarised in the sketch after this list):
– It first reads the on disk SCN (0x60016290) via kcscu8. If the on disk SCN is equal to or past the process' commit SCN, the process' redo has been written to disk, and the kcrf_commit_force_int function is exited.
– If the on disk SCN has not progressed far enough, the LWN SCN (0x60016260) is read via kcscu8. If the LWN SCN has progressed beyond or is equal to the commit SCN, the logwriter is in the process of writing the process' redo, so it does not need posting via ksbasend. If the logwriter has not progressed the LWN SCN that far, the process tries posting via ksbasend again.
– In the above function call trace we see ksaamb returning 0, which means it did try to post the logwriter again, but found the logwriter already posted (by looking at kcrfsg_+516). As a result, it returned directly from the function without creating a message (and thus without needing to fetch the messages latch and post the logwriter via semop).
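
Summarised in code, this logic looks roughly like the sketch below. It is a minimal, compilable reconstruction from the traces; the variables and the helper function are stand-ins for the SGA locations and Oracle functions named in the comments, not actual Oracle code:

#include <stdio.h>

/* stand-ins for the SGA locations read via kcscu8 and ksaamb */
static unsigned long on_disk_scn = 620000;  /* kcrfsg_ on disk SCN (0x60016290) */
static unsigned long lwn_scn     = 620010;  /* kcrfsg_ LWN SCN (0x60016260) */
static int lgwr_posted           = 0;       /* kcrfsg_+516: logwriter posted flag */

static void post_logwriter(void) { puts("message under messages latch + semop"); }

/* sketch of kcrf_commit_force_int: returns 1 when no (further) waiting is needed */
static int commit_force(unsigned long commit_scn)
{
    if (on_disk_scn >= commit_scn) return 1; /* kcscu8: redo already on disk */
    if (lwn_scn >= commit_scn) return 0;     /* kcscu8: logwriter already writing it */
    if (!lgwr_posted) {                      /* ksaamb peeks at kcrfsg_+516 */
        lgwr_posted = 1;
        post_logwriter();                    /* ksbasend -> ksasnd -> semop */
    }
    return 0;                                /* 0: proceed to the wait phase */
}

int main(void) { printf("commit_force returned %d\n", commit_force(620036)); return 0; }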

At this point everything in kcrf_commit_force_int has been done to make the logwriter write the process' redo in the public redo strand, so the process needs to wait for the logwriter to finish; the instance is using post/wait as the notification mechanism for log file sync.

The next thing that happens is that a post-wait entry in the post-wait queue is set up, and the on disk SCN is checked once more:

 | | | | > kslawe(0x10380, 0x7a5d9fe0, ...)
 | | | | | > kslgetl(0x7a5d72c8, 0, ...)
 | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | > kslfre(0x7a5d72c8, 0x7a5d7368, ...)
 | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | < kslawe+0x00000000027d returns: 0
 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1

Now that the process has registered itself in the post-wait queue, a wait is registered via kslwtbctx:

 | | | | > kslwtbctx(0x7ffda5959950, 0x7ffda5959b88, ...)
 | | | | | > sltrgftime64(0x7a6d9580, 0x7ae18b68, ...)
 | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959700, ...)
 | | | | | | | > clock_gettime(0x1, 0x7ffda5959700, ...)
 | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | < sltrgftime64+0x00000000004c returns: 0x428fc8dac9
 | | | | | > kslwait_timeout_centi_to_micro(0x7fffffff, 0x443dfe7, ...)
 | | | | | < kslwait_timeout_centi_to_micro+0x000000000021 returns: 0x7fffffffffffffff
 | | | | | > kskthbwt(0, 0x428fc8dac9, ...)
 | | | | | < kskthbwt+0x0000000000a1 returns: 0x10
 | | | | | > kslwt_start_snapshot(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | < kslwt_start_snapshot+0x0000000000dc returns: 0x7a6d9db8
 | | | | < kslwtbctx+0x00000000050b returns: 0x7a6d9db8

Now that the process is in the wait event, the process needs to truly 'wait'. In post-wait mode this is done on a semaphore, just like the logwriter does when sleeping in the wait event 'rdbms ipc message'. This is how that looks:

 | | | | > kslwaitctx(0x7ffda5959950, 0xa, ...)
 | | | | | > ksliwat(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > sltrgftime64(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959280, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffda5959280, ...)
 | | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | < sltrgftime64+0x00000000004c returns: 0x428fc8e664
 | | | | | | > kslawe(0x10380, 0x459ffef, ...)
 | | | | | | < kslawe+0x000000000076 returns: 0x10380
 | | | | | | > kskiorm(0x7ae0ee70, 0, ...)
 | | | | | | < kskiorm+0x00000000001e returns: 0
 | | | | | | > kfias_iswtgon_ksfd(0x7ae0ee70, 0, ...)
 | | | | | | < kfias_iswtgon_ksfd+0x00000000002b returns: 0
 | | | | | | > kxdbio_has_work(0x7ffda59592a4, 0x6003d010, ...)
 | | | | | | < kxdbio_has_work+0x000000000027 returns: 0
 | | | | | | > skgpwwait(0x7ffda5959510, 0x7f3011d45ca0, ...)
 | | | | | | | > kslwait_conv_wait_time(0x186a0, 0x7f3011d45ca0, ...)
 | | | | | | | < kslwait_conv_wait_time+0x000000000027 returns: 0x186a0
 | | | | | | | > sskgpwwait(0x7ffda5959510, 0x7f3011d45ca0, ...)
 | | | | | | | | > semtimedop(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff
 | | | | | | | | > __errno_location(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | | > fthread_self(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | | | | | < __errno_location+0x000000000010 returns: 0x7f3011d406a0
 | | | | | | | < sskgpwwait+0x00000000014e returns: 0
 | | | | | | < skgpwwait+0x0000000000e0 returns: 0
 | | | | | | > ksuSdiInProgress(0x10003, 0x10003, ...)
 | | | | | | < ksuSdiInProgress+0x000000000035 returns: 0
 | | | | | | > sltrgftime64(0x10003, 0x10003, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959280, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffda5959280, ...)
 | | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | < sltrgftime64+0x00000000004c returns: 0x428fcad86e
 | | | | | | > kslwo_getcbk(0x9d, 0x80f6298, ...)
 | | | | | | < kslwo_getcbk+0x000000000017 returns: 0
 | | | | | | > kgslwait_last_waitctx_time_waited_usecs(0x7f3011d409a0, 0x7a6da7f0, ...)
 | | | | | | < kgslwait_last_waitctx_time_waited_usecs+0x000000000045 returns: 0x1f20a
 | | | | | | > kslwt_update_stats_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | | > kews_update_wait_time(0x5, 0x1fda5, ...)
 | | | | | | | < kews_update_wait_time+0x00000000002e returns: 0x6
 | | | | | | < kslwt_update_stats_int+0x00000000033e returns: 0x428fcad86e
 | | | | | | > ksliwat_cleanup(0x7ffda5959650, 0x7ffda5959628, ...)
 | | | | | | | > kslgetl(0x7a5d72c8, 0x1, ...)
 | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | > kslfre(0x7a5d72c8, 0x7a5d7378, ...)
 | | | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | | | < ksliwat_cleanup+0x0000000000e5 returns: 0
 | | | | | < ksliwat+0x000000003acf returns: 0
 | | | | < kslwaitctx+0x00000000012b returns: 0

The most prominent thing is skgpwwait (system kernel generic post-wait wait), which calls sskgpwwait (skgpwwait with an additional s in front; functions with this prefix usually are OS dependent calls, like in this case calling semtimedop). We see semtimedop returning 0xffffffffffffffff (-1), meaning the semaphore wait timed out: the semaphore was not posted using semop before the timeout set for the semaphore wait (100ms for log file sync) expired.
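
For reference, this is what such a timed semaphore wait boils down to at the system call level. A minimal standalone sketch using the Linux semtimedop call with a throwaway semaphore set; Oracle's sskgpwwait obviously wraps this differently:

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

int main(void)
{
    /* throwaway semaphore set with one semaphore (starts at 0 on Linux) */
    int semid = semget(IPC_PRIVATE, 1, 0600);
    struct sembuf sop = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct timespec timeout = { .tv_sec = 0, .tv_nsec = 100000000 };  /* 100ms */

    /* wait to be posted; nobody posts here, so this times out after 100ms */
    if (semtimedop(semid, &sop, 1, &timeout) == -1 && errno == EAGAIN)
        printf("semaphore wait timed out, not posted within 100ms\n");
    /* a poster would execute: sop.sem_op = 1; semop(semid, &sop, 1); */

    semctl(semid, 0, IPC_RMID);  /* remove the semaphore set again */
    return 0;
}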

Interestingly, a little further down we see ksliwat_cleanup. This function clears this process from the post-wait queue, indicated not only by the 'cleanup' part in ksliwat_cleanup, but also by the latch get of 0x7a5d72c8, which is the 'post/wait queue' latch. This means that when the kslwaitctx function is left, this process is no longer in the post/wait queue. However, the process is still waiting in the wait event 'log file sync'.

The next thing that happens is a call to ‘ksu_dispatch_tac’ (kernel service user dispatch timed-out action):

 | | | | > ksu_dispatch_tac(0x7a5d72c8, 0, ...)
 ...
 | | | | < ksu_dispatch_tac+0x0000000003b4 returns: 0

Essentially, this routine validates certain conditions to trigger actions because the wait in kslwaitctx has timed out.

At this point the process repeats the steps of:
– kslawe: registering itself in the post-wait queue
– kcscu8: check on disk SCN
– kslwaitctx, semtimedop: wait to be posted
– (if not posted) ksliwat_cleanup: remove post-wait entry
– (if not posted) kcscu8: check on disk SCN
(the double check of on disk SCN via kcscu8 is not a typo!)
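
Putting those repeated steps together, the wait phase looks roughly like the loop below. This is a minimal, compilable sketch; the helper functions are hypothetical stand-ins for the Oracle functions named in the comments, and the stub values only serve to let the loop terminate:

#include <stdio.h>

static unsigned long commit_scn  = 620036;
static unsigned long on_disk_scn = 620016;
static unsigned long read_on_disk_scn(void) { return on_disk_scn += 10; } /* kcscu8 */
static int  wait_posted_100ms(void) { return 0; } /* semtimedop: 0 = timed out */
static void register_postwait(void) { }           /* kslawe, post/wait queue latch */
static void cleanup_postwait(void)  { }           /* ksliwat_cleanup, under latch */
static void timeout_actions(void)   { }           /* ksu_dispatch_tac */

int main(void)
{
    for (;;) {
        register_postwait();
        if (read_on_disk_scn() >= commit_scn) break;
        if (wait_posted_100ms()) break;  /* posted: the on disk SCN is read again too */
        cleanup_postwait();
        if (read_on_disk_scn() >= commit_scn) break;  /* the double check */
        timeout_actions();
    }
    puts("log file sync finished");
    return 0;
}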

If the semaphore gets posted (visible by return code 0):

 | | | | | | | | > semtimedop(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | < semtimedop+0x000000000012 returns: 0

The post-wait cleanup function (ksliwat_cleanup) does not require a latch when the process has been posted, probably because now that the entry has been used, there is no chance it will be used again, so it's safe to clean up without serialising access via a latch. After ksliwat_cleanup returns, the process also returns from ksliwat and kslwaitctx.

Next, kcscu8 is called to read the on disk SCN again, the wait event is ended with kslwtectx, some statistics about redo writing are recorded (kcrfw_stats_syncoverhead_record), and the kcrf_commit_force_int function ends: once the semaphore has been posted, the wait for the logwriter to finish is over, meaning the commit is entirely finished:

 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kslwtectx(0x7ffda5959950, 0x7ffda5959b88, ...)
 | | | | | > kslwt_end_snapshot(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > kslwh_enter_waithist_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | < kslwh_enter_waithist_int+0x00000000013a returns: 0x7a8378a8
 | | | | | | > kslwtrk_enter_wait_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x7a8378a8
 | | | | | < kslwt_end_snapshot+0x0000000000f2 returns: 0x7a6d9580
 | | | | | > kslwt_update_stats_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > kews_update_wait_time(0x5, 0, ...)
 | | | | | | < kews_update_wait_time+0x00000000002e returns: 0x6
 | | | | | < kslwt_update_stats_int+0x00000000033e returns: 0
 | | | | | > kskthewt(0x4290257c02, 0x9d, ...)
 | | | | | < kskthewt+0x0000000000c3 returns: 0xa030
 | | | | < kslwtectx+0x0000000003f9 returns: 0xe923
 | | | | > skgstmGetEpochTs(0x4290257c02, 0x9d, ...)
 | | | | | > gettimeofday@plt(0x7ffda5959810, 0, ...)
 | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x206daa3d1ca1b
 | | | | > kcrfw_stats_syncoverhead_record(0x7f3011d75e10, 0x206daa3d1ca1b, ...)
 | | | | | > kcrfw_stats_writeinfo_find(0x7f3011d75e10, 0x206daa3d1ca1b, ...)
 | | | | | < kcrfw_stats_writeinfo_find+0x0000000000fb returns: 0x206daa3d1a057
 | | | | < kcrfw_stats_syncoverhead_record+0x0000000000a4 returns: 0x29c4
 | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1

What is clear is that the process checks the on disk SCN a lot of times. The obvious reason for that would be to stop waiting whenever the on disk SCN indicates the process' redo information has been written. However, in the past I've heard about issues with sessions waiting in post-wait mode that were 'forgotten' when the database switched to polling (this was Oracle 11.2.0.3; this issue is solved!). That however leads to two interesting questions:

1. What happens when a process is posted by the logwriter, but the on disk SCN is NOT progressed beyond the process’ commit SCN?
2. What happens when a process is NOT posted by the logwriter, but the on disk SCN has progressed beyond the process’ commit SCN?

Let’s test that! Let’s test the first situation: a foreground posted, but the on disk SCN did not progress beyond the commit SCN of that process.

First we need to know the commit SCN for the inserting session. We know that the SCN is obtained from kcsgscn_, let’s get the address of it:

SQL> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv where ksmfsnam = 'kcsgscn_';
KSMFSNAM                       KSMFSADR           KSMFSSIZ
------------------------------ ---------------- ----------
kcsgscn_                       00000000600113B8         48
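
The kcrfsg_ struct, which holds the on disk SCN (0x60016290) and the LWN SCN (0x60016260) used throughout this post, can be located the same way (shown as an illustration; output omitted):

SQL> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv where ksmfsnam = 'kcrfsg_';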

Now attach to a foreground session with gdb, put a read/write (access) watch on 0x600113b8, set pagination off and let gdb continue:

(gdb) awatch *0x600113b8
Hardware access (read/write) watchpoint 1: *0x600113b8
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) set pagi off
(gdb) c
Continuing.

Now insert something in the foreground session (anything is okay, as long as it’s a regular insert):

Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374652
New value = 374701
0x000000000320e5a8 in kcsAdjustUsingForeignSCN ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3ee68 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f03b in kcsgrsn8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f03b in kcsgrsn8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f2d5 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3e93b in kcscu8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3e93b in kcscu8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3ee68 in kcsgcsn ()

Pay attention to the output sequence of a watchpoint:
– 'Hardware access …' with the watchpoint number and address.
– the old and new value, or the read value.
– the instruction pointer address and function.
Also look at the different functions with which an SCN is queried.

Now execute the commit:

Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374701
New value = 374767
0x000000000320e5a8 in kcsAdjustUsingForeignSCN ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374767
0x0000000010e3ecc1 in kcsnew8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374767
New value = 374768
0x0000000010e3ecdb in kcsnew8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374768
New value = 374769
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374769
0x0000000010e3ee68 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374769
0x0000000010e3ee68 in kcsgcsn ()

If you take a close look, you see that a new SCN for the commit is obtained in the aptly named function kcsnew8, at instruction pointer address 0x10e3ecdb. Now that we know the instruction pointer address, we can filter the watch: disable the previous watch, and create a second access watchpoint:

(gdb) dis 1
(gdb) awatch *0x600113b8 if ( $rip == 0x10e3ecdb )
Hardware access (read/write) watchpoint 2: *0x600113b8
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now issue another insert and a commit:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 374908
New value = 374909
0x0000000010e3ecdb in kcsnew8 ()

Voila! The commit SCN is 374909 in this case!

Because we know kcscu8 is called to read the on disk SCN from kcrfsg_, we can cheat and modify the on disk SCN in kcrfsg_ ourselves to make this lower than the commit SCN.

=> WARNING! Never EVER do this, or any of these techniques, in a database that you care about! This can severely corrupt the database, leading to an irreparable situation! WARNING <=

First modify the watchpoint (2 in the above example) to stop when it is fired:

(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) c
Continuing.

Now perform the insert and commit again. This will hang because the watchpoint will break execution:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 620035
New value = 620036
0x0000000010e3ecdb in kcsnew8 ()

Okay, so the commit SCN is 620036. Now break on kcscu8 and simply set the SCN to 620000, which undoubtedly is lower than the commit SCN of 620036:

(gdb) break kcscu8
Breakpoint 3 at 0x10e3e910
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>set {int}0x60016290=620000
>c
>end

Also set an access watch on 0x60016290, which is the on disk SCN, to validate which SCN the process reads:

(gdb) awatch *0x60016290
Hardware watchpoint 4: *0x60016290
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb)

Now attach with another gdb session to the logwriter, and keep the logwriter stopped after attaching.
After the logwriter is stopped, continue in the foreground session:

(gdb) c
Continuing.

What I regularly encounter, especially with new(er) versions of the Oracle database, are the following messages:

Program received signal SIGUSR2, User defined signal 2.
0x0000000010e3ecdb in kcsnew8 ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000000010f973a5 in slaac_int ()
(gdb) c
Continuing.

It seems that a signal handler is triggered. In most cases just continuing will make the process run again.
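
If these signals keep interrupting the session, gdb can be told to pass them straight through to the process without stopping; these are standard gdb commands:

(gdb) handle SIGUSR2 nostop noprint pass
(gdb) handle SIGSEGV nostop noprint pass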

After the signal handler, the following is shown in gdb:

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
---Type <return> to continue, or q <return> to quit---

Quit using q and enter, and set pagination off: set pagi off, then continue.

Now let the logwriter continue in its gdb session (continue).

In the foreground debugger session, you'll see that the foreground did not do anything other than what it already did:

Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()

Stop execution of the committing process, and disable breakpoint 3 (dis 3), the breakpoint that sets the on disk SCN to a value lower than the process' commit SCN. With that breakpoint disabled, the process will pick up the true current on disk SCN, which is beyond the process' commit SCN:

(gdb) c
Continuing.
Hardware access (read/write) watchpoint 4: *0x60016290

Old value = 620000
New value = 620865
0x0000000010e3e93b in kcscu8 ()

And the commit finishes! This proves that despite the post/wait posting, the session actually checks the on disk SCN using kcscu8 in addition to the semaphore being posted, and only continues if the on disk SCN has progressed up to the commit SCN or further!

We can use these breakpoints to see if we can do the exact opposite: make sure the foreground is not posted, and then progress the on disk SCN beyond the commit SCN. At this time you should have two gdb sessions, one on a foreground process, and one on the logwriter. Both gdb sessions have the process they are attached to running.

Now insert and commit again. Because of the watchpoint on kcsgscn_ filtered on the instruction pointer in kcsnew8 (watchpoint 2), this will break/stop execution:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 621375
New value = 621376
0x0000000010e3ecdb in kcsnew8 ()

Break execution of the logwriter, so the foreground cannot be posted anymore (simply ctrl-C the gdb session attached to the log writer process).
Now change breakpoint 3 in the gdb session attached to the foreground to set the on disk SCN to 621400, which is higher than the commit SCN of 621376:

(gdb) commands 3
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>set {int}0x60016290=621400
>c
>end

And enable breakpoint 3 again:

(gdb) enable 3

And then continue the foreground in gdb.

...
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 621400
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
...

Repeatedly.

This shows watchpoint 4 being triggered because of access to 0x60016290 (the on disk SCN), and breakpoint 3 setting the on disk SCN to 621400.
Now continue the logwriter. This will finish the commit of the foreground session.

So the conclusion here is that despite the on disk SCN having progressed beyond the foreground session's commit SCN (the commit SCN was 621376, the on disk SCN was set to 621400), a commit will wait until it is explicitly posted via post-wait.


This is the fifth blog in a series of blogposts about Oracle database redo. The previous blog looked into the 'null write' (kcrfw_do_null_write, actually) function inside kcrfw_redo_write_driver, which does housekeeping like updating SCNs and posting processes if needed; this blog looks into what happens when the log writer is actually posted by a process, or when public redo strand buffers have been written into. In part 3 of this blog series (the log writer working cycle) it can be seen that when a session posts the log writer, it returns from the semaphore related functions and calls kcrfw_redo_write_driver directly, which otherwise is called inside ksbcti.

Inside the kcrfw_redo_write_driver function, the first thing of interest is executed only when the logwriter is posted, and the kcrfw_redo_write_driver function is called directly after returning from ksarcv and ksl_exit_main_loop_wait:

kcrfw_redo_write_driver+145:0x000000006001642c(fixed sga|var:kcrfsg_+516 shared pool ):W:4:0/0()

At offset 516 in kcrfsg_, a 4 byte field is used for indicating that the logwriter is posted. This field is filled out by a process that wants the logwriter to write. The way it works is that a process reads this field, changes it to 1 if it contains 0, and then performs a semctl or semop to post the logwriter and get it working. If the field already contains 1, no further action is taken by the committing process. The logwriter sets this value to 0 as soon as it has been posted and executes the kcrfw_redo_write_driver function. If the kcrfw_redo_write_driver function is called as part of the normal logwriter cycle (which means inside the ksbcti function), it will not read or write this field.
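
A compilable sketch of that flag protocol; whether Oracle uses an atomic instruction for the read-and-set is not visible in this trace, so the plain flag below is an assumption:

#include <stdio.h>

static volatile int lgwr_post_flag = 0;   /* stand-in for kcrfsg_+516 */

/* committing process: only the first one to find the flag at 0 posts */
static void request_logwriter_write(void)
{
    if (lgwr_post_flag == 0) {
        lgwr_post_flag = 1;
        puts("semop/semctl: post the logwriter");
    } else {
        puts("logwriter already posted: no action");
    }
}

/* logwriter: clears the flag when it runs kcrfw_redo_write_driver after a post */
static void logwriter_write_driver(void)
{
    lgwr_post_flag = 0;   /* the kcrfw_redo_write_driver+145 write seen above */
}

int main(void)
{
    request_logwriter_write();   /* posts */
    request_logwriter_write();   /* sees the flag set, does nothing */
    logwriter_write_driver();
    return 0;
}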

Then a lot of things are executed that have been shown in the previous post. The next thing kcrfw_redo_write_driver executes is kcsnew3, to obtain a LWN SCN and set the LWN number, protected by the 'lgwr LWN SCN' latch:

kcsnew8+107:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0x1/1()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):R:4:0xc6/198()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):W:4:0xc7/199()
kcsnew8+125:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b80/7744384()
kcsnew8+131:0x00007fffa10728f0():W:4:0/0()
kcsnew8+140:0x00007fffa1072922():W:1:0/0()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b80/7744384()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):W:8:0x762b81/7744385()
kcsnew8+165:0x00007fffa10728f0():R:4:0/0()
kcsnew8+184:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsnew8+193:0x00007f520ba8ae90():R:8:0x7a823620/2055353888(shared pool|permanent memor,duration 1,cls perm+7693856 )
kcsnew8+212:0x000000007a823e40(shared pool|permanent memor,duration 1,cls perm+7695936 ):R:8:0x719beed8/1906044632(shared pool|ksu:stats_freel,duration 1,cls freeabl+24 )
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):R:8:0xc7/199()
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):W:8:0xc8/200()
kcsnew8+223:0x00007fffa1072a00():W:8:0x762b81/7744385()
kcsnew8+232:0x0000000060016260(fixed sga|var:kcrfsg_+56 shared pool|x_kcrfws.lwn_scn+0 ):W:8:0x762b81/7744385()
kcsnew8+238:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0/0()
kcsnew8+246:0x0000000060016288(fixed sga|var:kcrfsg_+96 shared pool|pointer:lgwr lwn scn latch+0 ):R:8:0x60049800/1610913792(fixed sga|(parent)latch:lgwr LWN SCN+0 fixed sga|var:lwn_scn_lat_+0 )
kcsnew8+250:0x00007fffa10728d8():W:8:0x10e3ed3f/283372863()

The LWN SCN is set to 0x762b81 / 7744385 (kcsgscn_ is read and incremented from 0x762b80 at kcsnew8+149, and the result is stored in kcrfsg_+56 at kcsnew8+232), and the LWN number in kcrfsg_+72 is incremented from 198 to 199.

The next thing to look at is the kcrfw_gather_lwn function flow:

 | > kcrfw_gather_lwn(0x7ffc3dd11b78, 0x77eced90, ...)
 | | > kslgetl(0x76fe0c10, 0x1, ...)
 | | < kslgetl+0x00000000012f returns: 0x1
 | | > kcrfw_gather_strand(0x7ffc3dd11b78, 0, ...)
 | | < kcrfw_gather_strand+0x0000000000c2 returns: 0x2
 | | > kslfre(0x76fe0c10, 0x20002, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | < kslgetl+0x00000000012f returns: 0x1
 | | > kcrfw_gather_strand(0x7ffc3dd11b78, 0x1, ...)
 | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | > kslfre(0x76fe0cb0, 0x19, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | < kcrfw_gather_lwn+0x00000000065c returns: 0

Here we see exactly the same function sequence in kcrfw_gather_lwn as we saw in the 'null write' blogpost; however, pay close attention to the return codes of the kcrfw_gather_strand calls. The first call returns 2 instead of 0, indicating two (strand) buffers need writing.

If we zoom in to the first kcrfw_gather_strand call, we see:

kcrfw_gather_strand+2:0x00007fffa1072910():W:8:0x7fffa1072a10/140735894989328()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1072908():W:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+29:0x00007fffa1072900():W:8:0x76fdf130/1996353840(shared pool|permanent memor,duration 1,cls perm+16113968 )
kcrfw_gather_strand+36:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_strand+53:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1072bf8():R:8:0x762b81/7744385()
kcrfw_gather_strand+79:0x0000000076fdf390(shared pool|KCRFA+8 shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x762b7f/7744383()
kcrfw_gather_strand+86:0x0000000076fdf3a8(shared pool|KCRFA+32 shared pool|NEXT_BUF_NUM_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x660/1632()
kcrfw_gather_strand+121:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):R:4:0x65e/1630()
kcrfw_gather_strand+169:0x00007fffa1072a08():W:4:0x660/1632()
kcrfw_gather_strand+176:0x00007fffa1072a0c():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1072900():R:8:0x76fdf130/1996353840(shared pool|permanent memor,duration 1,cls perm+16113968 )
kcrfw_gather_strand+186:0x00007fffa1072908():R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+193:0x00007fffa1072910():R:8:0x7fffa1072a10/140735894989328()
kcrfw_gather_strand+194:0x00007fffa1072918():R:8:0x31f0f00/52367104()

What is visible here is roughly identical to the memory access of the kcrfw_gather_strand function we saw in the 'null write' blogpost, however there is one striking difference: at offset 86 in the kcrfw_gather_strand function, KCRFA at offset 32 (NEXT_BUF_NUM_KCRFA_CLN) is read, which shows a value of 1632, while LAST_BUF_WRITTEN (offset 208) and LAST_BUF_GATHERED_KCRFA (offset 212) show a value of 1630. This is an indication some public redo strand buffers have been written into (1632 – 1630 = 2 buffers, matching the kcrfw_gather_strand return code of 2). This means there is potentially something to write! Please remember that the redo allocation latch was taken before calling kcrfw_gather_strand.

Once the kcrfw_gather_strand function has returned to kcrfw_gather_lwn, the actual write is prepared by further manipulating the KCRFA struct for the first strand for the write:

kcrfw_gather_lwn+436:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):W:4:0x660/1632()
kcrfw_gather_lwn+449:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0x1b8/440()
kcrfw_gather_lwn+453:0x0000000060016a10(fixed sga|var:kcrf_kcrrh_sz_+0 ):R:8:0x18/24()
kcrfw_gather_lwn+462:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd80cc048/3624714312(Redo Buffers(pgsz:2048k)|+835656 redo|PUB_REDO_0+835656 )
kcrfw_gather_lwn+466:0x00000000d80cc048(redo|PUB_REDO_0+835656 ):W:4:0/0()
kcrfw_gather_lwn+472:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0x1b8/440()
kcrfw_gather_lwn+476:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd80cc048/3624714312(Redo Buffers(pgsz:2048k)|+835656 redo|PUB_REDO_0+835656 )
kcrfw_gather_lwn+476:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):W:8:0xd80cc200/3624714752(Redo Buffers(pgsz:2048k)|+836096 redo|PUB_REDO_0+836096 )
kcrfw_gather_lwn+480:0x0000000076fdf3cc(shared pool|KCRFA+68 shared pool|permanent memor,duration 1,cls perm+16114636 ):R:1:0/0()
kcrfw_gather_lwn+480:0x0000000076fdf3cc(shared pool|KCRFA+68 shared pool|permanent memor,duration 1,cls perm+16114636 ):W:1:0x1/1()
kcrfw_gather_lwn+485:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):W:8:0/0()

The LAST_BUF_GATHERED_KCRFA field at KCRFA offset 212 is set to the new buffer number 1632. Then the actual amount of data inside the buffer is queried from KCRFA at offset 40 (BYTES_IN_BUF_KCRFA_CLN), and the pointer to the actual location in the first public redo strand is obtained from KCRFA offset 24 (PNEXT_BUF_KCRFA_CLN). The pointer into the first public redo strand is then increased by the amount of redo written, which is 440 bytes in this case, and BYTES_IN_BUF_KCRFA_CLN is reset to 0.
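
To make the bookkeeping concrete, here is a hypothetical rendering of the KCRFA fields involved, using the offsets seen in the traces; the struct layout is an illustration, not Oracle's actual definition:

#include <stdio.h>

/* hypothetical per-strand KCRFA bookkeeping; offsets from the traces above */
struct kcrfa_sketch {
    char          *pnext_buf;          /* +24  PNEXT_BUF_KCRFA_CLN */
    unsigned long  bytes_in_buf;       /* +40  BYTES_IN_BUF_KCRFA_CLN */
    unsigned int   last_buf_written;   /* +208 LAST_BUF_WRITTEN */
    unsigned int   last_buf_gathered;  /* +212 LAST_BUF_GATHERED_KCRFA */
};

/* what kcrfw_gather_lwn does to the strand's KCRFA when gathering for a write */
static void gather_strand(struct kcrfa_sketch *k, unsigned int next_buf_num)
{
    k->last_buf_gathered = next_buf_num;   /* 1630 -> 1632 in the trace */
    k->pnext_buf += k->bytes_in_buf;       /* advance past the 440 gathered bytes */
    k->bytes_in_buf = 0;                   /* reset for subsequent redo copies */
}

int main(void)
{
    static char strand[4096];              /* stands in for a public redo strand */
    struct kcrfa_sketch k = { strand, 440, 1630, 1630 };
    gather_strand(&k, 1632);
    printf("gathered up to buffer %u, pointer advanced %ld bytes\n",
           k.last_buf_gathered, (long)(k.pnext_buf - strand));
    return 0;
}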

A little further in kcrfw_gather_lwn, LAST_BUF_WRITTEN in KCRFA at offset 208 is changed to public strand buffer 1632:

kcrfw_gather_lwn+977:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_lwn+987:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_lwn+996:0x0000000076fdf3c0(shared pool|KCRFA+56 shared pool|permanent memor,duration 1,cls perm+16114624 ):W:4:0x1000/4096()
kcrfw_gather_lwn+1006:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):W:4:0x660/1632()

At this point the redo allocation latch for this strand is released.
After the release of the latch, the kcrfw_gather_lwn function changes a value in the kcrfsg_ struct at offset 8, which is X$KCRFWS.NEXT_BLK, the administration of where to write in the online redolog file(s):

kcrfw_gather_lwn+1038:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):R:4:0x16a16/92694()
kcrfw_gather_lwn+1044:0x0000000077eceea8(shared pool|permanent memor,duration 1,cls perm+2451776 ):W:4:0x16a16/92694()
kcrfw_gather_lwn+1048:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):R:4:0x16a16/92694()
kcrfw_gather_lwn+1048:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):W:4:0x16a18/92696()

The second public strand has no buffers written, which we know because the kcrfw_gather_strand function returned 0, so that one is skipped.

At this point, the memory areas that take part in the writing have been set up.

In order to write, the function kcrfw_redo_write is called to start the actual redo writing. First, the logwriter needs to make sure that the buffers it has marked for writing (the difference between the original LAST_BUF_GATHERED and LAST_BUF_WRITTEN) are not still being written into. This is done in the function kcrfw_wait_for_fg_copy, which is called in kcrfw_redo_write. The way this works is that it looks up the address of the first redo copy latch child, which a process copying change vectors into the public strand must take, stored in kcrfsg_ at offset 472, and the address of the redo copy latch support memory area, stored in kcrfsg_ at offset 488. Then it looks up the slot assigned to the first latch in the redo copy latch support memory area and reads 8 bytes to see if it contains the value 0xffffffffffffffff, which is what a writer in a redo log strand must write into the support memory area when it's done writing to the redo strand. This means that as long as a process writes into a redo log strand, it will have set this memory area to another value, indicating it is being written into. Once the value becomes 0xffffffffffffffff, kcrfw_wait_for_fg_copy continues to see if any other redo copy latches need checking. It does this by reading the total number of redo copy latches in kcrfsg_ at offset 464 and comparing it with the number of the redo copy latch it just checked. If there are more, it moves to the next redo copy latch and redo copy latch support memory area, checking for 0xffffffffffffffff again, and so on. Please mind it does not check or obtain any of the redo copy latches itself, nor does it prevent any write to the redo strands that happens after the check, which is probably by design:

kcrfw_wait_for_fg_copy+132:0x0000000060016400(fixed sga|var:kcrfsg_+472 shared pool|pointer:redo copy latch+0 ):R:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+142:0x0000000060016410(fixed sga|var:kcrfsg_+488 ):R:8:0x76fdf0f0/1996353776(shared pool|permanent memor,duration 1,cls perm+16113904 )
kcrfw_wait_for_fg_copy+149:0x00007fffa1072968():W:1:0/0()
kcrfw_wait_for_fg_copy+153:0x00007fffa1072960():W:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+157:0x00007fffa1072958():W:4:0/0()
kcrfw_wait_for_fg_copy+161:0x0000000076fdf0f0(shared pool|permanent memor,duration 1,cls perm+16113904 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+169:0x0000000076fdf0f4(shared pool|permanent memor,duration 1,cls perm+16113908 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+176:0x00007fffa1072968():R:1:0/0()
kcrfw_wait_for_fg_copy+180:0x00007fffa1072958():R:4:0/0()
kcrfw_wait_for_fg_copy+195:0x00000000600163f8(fixed sga|var:kcrfsg_+464 ):R:4:0x8/8()
kcrfw_wait_for_fg_copy+132:0x0000000060016400(fixed sga|var:kcrfsg_+472 shared pool|pointer:redo copy latch+0 ):R:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+142:0x0000000060016410(fixed sga|var:kcrfsg_+488 ):R:8:0x76fdf0f0/1996353776(shared pool|permanent memor,duration 1,cls perm+16113904 )
kcrfw_wait_for_fg_copy+149:0x00007fffa1072968():W:1:0/0()
kcrfw_wait_for_fg_copy+153:0x00007fffa1072960():W:8:0x76fdeb08/1996352264(shared pool|(child)latch#2:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112392 )
kcrfw_wait_for_fg_copy+157:0x00007fffa1072958():W:4:0x1/1()
kcrfw_wait_for_fg_copy+161:0x0000000076fdf0f8(shared pool|permanent memor,duration 1,cls perm+16113912 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+169:0x0000000076fdf0fc(shared pool|permanent memor,duration 1,cls perm+16113916 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+176:0x00007fffa1072968():R:1:0/0()
kcrfw_wait_for_fg_copy+180:0x00007fffa1072958():R:4:0x1/1()
kcrfw_wait_for_fg_copy+195:0x00000000600163f8(fixed sga|var:kcrfsg_+464 ):R:4:0x8/8()

The above memory trace output shows the kcrfw_wait_for_fg_copy function looking up the redo copy latch memory address at kcrfw_wait_for_fg_copy offset 132, looking up the redo copy support memory area at kcrfw_wait_for_fg_copy offset 142, and then peeks into the redo copy helper memory structure in kcrfw_wait_for_fg_copy at offset 161 and 169. This way it does not need to obtain the redo copy latch. At offset 195 of the function it looks up the number of redo copy latches, and then executes the same procedure again, looking up the same pointers in kcrfsg_, but adding an offset for the next latch.

Please mind that there is no need for the log writer to take a redo copy latch; in fact, it doesn't seem to care whether a redo copy latch is taken or not. It waits for a redo strand writer to indicate it is done writing by waiting for the redo copy latch helper memory structure for each strand to become 0xffffffffffffffff, which then automatically means the change vectors in the public strand buffers it is intending to write (which means having an SCN lower than the LWN SCN) are not written to anymore, because a process that wants to write into a redo strand must change the redo copy latch helper memory structure to another value than 0xffffffffffffffff, and revert it to 0xffffffffffffffff once finished.
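
This scan can be sketched as follows; a minimal compilable example, assuming the helper area is one 8-byte slot per redo copy latch that a copying process sets to another value while copying and restores to 0xffffffffffffffff when done:

#include <stdint.h>
#include <stdio.h>

#define N_COPY_LATCHES 8   /* kcrfsg_+464 read 8 in the trace above */

/* one helper slot per redo copy latch; ~0 means "no copy in flight" */
static volatile uint64_t copy_helper[N_COPY_LATCHES];

/* scan every slot, waiting for it to read 0xffffffffffffffff,
   without taking any redo copy latch */
static void wait_for_fg_copy(void)
{
    for (int i = 0; i < N_COPY_LATCHES; i++)
        while (copy_helper[i] != UINT64_MAX)
            ;   /* a process is still copying into this strand */
}

int main(void)
{
    for (int i = 0; i < N_COPY_LATCHES; i++)
        copy_helper[i] = UINT64_MAX;   /* nobody is copying */
    wait_for_fg_copy();
    puts("all strands quiesced, safe to write");
    return 0;
}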

The next thing the logwriter does when it truly needs to write is update the real redo SCN. The real redo SCN is the SCN value that contains the highest SCN of the change vectors that are truly written to the online redolog files. Remember, the 'on disk SCN' is not the SCN of a change that is truly written. The real redo SCN is another SCN value that is stored in the kcrfsg_ struct, in my database at offset 288. This is done in the function kcrfw_update_and_bcast_real_redo_scn, which uses the kcsaj8 function to execute the change of the SCN. This does not require a latch get:

kcsaj8+2:0x00007fffa1072950():W:8:0x7fffa1072970/140735894989168()
kcsaj8+13:0x00007fffa1072960():R:8:0x762b80/7744384()
kcsaj8+16:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):R:8:0x762b7d/7744381()
kcsaj8+22:0x00007f520ba98f40():R:8:0x7f520ba98f40/139990359707456()
kcsaj8+40:0x00007fffa1072900():W:8:0x7fffa1072960/140735894989152()
kcsaj8+44:0x00007fffa1072908():W:8:0/0()
kcsaj8+48:0x00007fffa1072910():W:8:0/0()
kcsaj8+52:0x00007fffa1072918():W:4:0/0()
kcsaj8+56:0x00007fffa1072938():W:8:0x7f520ba98f40/139990359707456()
kcsaj8+63:0x00007fffa1072930():W:8:0/0()
kcsaj8+70:0x00007fffa1072928():W:8:0/0()
kcsaj8+74:0x00007fffa1072920():W:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+81:0x00007fffa1072940():W:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+88:0x00007fffa1072828():W:4:0/0()
kcsaj8+98:0x00007fffa107285a():W:1:0/0()
kcsaj8+105:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):R:8:0x762b7d/7744381()
kcsaj8+105:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):W:8:0x762b80/7744384()
kcsaj8+120:0x00007fffa1072828():R:4:0/0()
kcsaj8+138:0x00007fffa1072908():R:8:0/0()
kcsaj8+145:0x00007fffa1072900():R:8:0x7fffa1072960/140735894989152()
kcsaj8+152:0x00007fffa1072910():R:8:0/0()
kcsaj8+159:0x00007fffa1072918():R:4:0/0()
kcsaj8+166:0x00007fffa1072938():R:8:0x7f520ba98f40/139990359707456()
kcsaj8+170:0x00007fffa1072930():R:8:0/0()
kcsaj8+174:0x00007fffa1072928():R:8:0/0()
kcsaj8+178:0x00007fffa1072920():R:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+182:0x00007fffa1072940():R:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+202:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsaj8+219:0x00007fffa1072950():R:8:0x7fffa1072970/140735894989168()
kcsaj8+220:0x00007fffa1072958():R:8:0x3208da6/52465062()

The real redo SCN value appears to be the value that was gathered in the kcsnew3 function at the beginning of the function; however, I am not sure here. It might be the highest SCN in the public strands, although it would make sense to limit the current write to the LWN SCN that was obtained at the start, which means it's the highest SCN in the strand buffers it picked for writing, while additional buffers might have been written into with newer, higher SCNs.

What happens next, is the logwriter needs to prepare the public redo strand buffers to be written to disk:

 | | > kcrfw_update_redo_blk_hdrs(0x77eced90, 0, ...)
 | | | > _intel_fast_memset(0xda31669c, 0, ...)
 | | | <> _intel_fast_memset.J(0xda31669c, 0, ...)
 | | | <> __intel_new_memset(0xda31669c, 0, ...)
 | | | < __intel_new_memset+0x0000000005c0 returns: 0xda31669c
 | | < kcrfw_update_redo_blk_hdrs+0x000000000363 returns: 0x77eced90
 | | > kcrfw_update_strand_hdrs(0x7ffc3dd11b38, 0x77eced90, ...)
 | | < kcrfw_update_strand_hdrs+0x000000000222 returns: 0x77eced90
 | | > kcrfw_cal_target_rba(0x5, 0xda316650, ...)
 | | | > kslgetl(0x6004b780, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | <> kslfre(0x6004b780, 0x1, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > skgstmGetEpochTs(0x6004b780, 0x1, ...)
 | | | > gettimeofday@plt(0x7ffc3dd118e0, 0, ...)
 | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | < skgstmGetEpochTs+0x000000000049 returns: 0x2050c6a4e2e57
 | | > kcrfw_do_write(0x77eced90, 0x1, ...)

The last line, a call to the kcrfw_do_write function, is the logwriter function inside kcrfw_redo_write that actually calls the disk IO functions in ksfd. I'll quickly walk through these, because executing IO is not the focus of this blogpost.

At this point the memory areas are set up and the IO is prepared:

This is probably calculating the request size:

 | | | > ksfdgrsz(0x1, 0, ...)
 | | | | > skgfrgsz(0x3, 0, ...)
 | | | | < skgfrgsz+0x00000000002d returns: 0xa0
 | | | < ksfdgrsz+0x00000000001a returns: 0x288

This is probably setting up the asynchronous IO information and requests:

 | | | > ksfd_aio_setobjinfo(0x71717da0, 0x76b423e8, ...)
 | | | < ksfd_aio_setobjinfo+0x00000000003c returns: 0
 | | | > ksfdrqfill(0x71717da0, 0x76b423e8, ...)
 | | | | > kskiorm(0x71717da0, 0, ...)
 | | | | < kskiorm+0x00000000001e returns: 0
 | | | < ksfdrqfill+0x00000000021b returns: 0x102

The main function for performing IO is ksfdbio (kernel system file disk block IO):

 | | | > ksfdbio(0x7f2c1e502000, 0x1, ...)
 ...
 | | | | > ksfd_update_iostatsbytes(0, 0x76b76930, ...)
 ...
 | | | | > kslwtbctx(0x7ffc3dd11700, 0x42, ...)
 ...
 | | | | < kslwtbctx+0x00000000050b returns: 0x7a823e58
 | | | | > ksfd_sbio(0x7f2c1e502028, 0x1, ...)
 ...
 | | | | < ksfd_sbio+0x00000000035b returns: 0x1
 | | | | > kslwtectx(0x7ffc3dd11700, 0x200, ...)
 ...
 | | | | < kslwtectx+0x0000000003f9 returns: 0xe923
 | | | | > ksfdbio_updatestats(0xb6b1, 0, ...)
 ...
 | | | < ksfdbio+0x000000000507 returns: 0x7f2c23b04040

This is ksfd_update_iostatsbytes for statistics, kslwtbctx for registering a wait, kslwtectx for ending the wait, ksfdbio_updatestats for statistics about the IO just performed, and in between kslwtbctx and kslwtectx ksfd_sbio performs the IO.

If you look in ksfd_sbio, you’ll see the true IO steps:

 | | | | > ksfd_sbio(0x7f2c1e502028, 0x1, ...)
 | | | | | > ksfd_osdrqfil(0x71717da0, 0x76b423e8, ...)
 | | | | | < ksfd_osdrqfil+0x00000000014c returns: 0x1
 | | | | | > ksfd_addtoissueq(0x71717da0, 0x200, ...)
 | | | | | < ksfd_addtoissueq+0x0000000001ac returns: 0x7a6032e8
 | | | | | > ksfdgo(0x500, 0x81e, ...)
 ...
 | | | | | < ksfdgo+0x000000000135 returns: 0
 | | | | | > ksfdwtio(0x1, 0x500, ...)
 ...
 | | | | | < ksfdwtio+0x000000000176 returns: 0x1
 | | | | | > ksfd_chkio(0x71717da0, 0x9, ...)
 | | | | | | > ksfd_update_iostatstime(0x76b76930, 0x200, ...)
 | | | | | | < ksfd_update_iostatstime+0x000000000103 returns: 0x76c70388
 | | | | | < ksfd_chkio+0x000000000126 returns: 0x1
 | | | | < ksfd_sbio+0x00000000035b returns: 0x1

The ksfd_osdrqfil function probably creates the IO requests, and ksfd_addtoissueq looks like a function that puts these requests in a queue. ksfdgo executes the IO requests, which essentially calls io_submit to submit the write requests. ksfdwtio essentially calls io_getevents to reap the IO results.
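
For reference, the io_submit/io_getevents pair is the Linux kernel native asynchronous IO interface (libaio). This is a minimal standalone sketch of that submit/reap pattern, unrelated to Oracle's own wrapper code; compile with -laio:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    /* O_DIRECT wants sector-aligned buffers and sizes, and a filesystem
       that supports it */
    static char buf[512] __attribute__((aligned(512)));
    int fd = open("aio_testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    io_setup(1, &ctx);                             /* create the AIO context */
    io_prep_pwrite(&cb, fd, buf, sizeof(buf), 0);  /* one 512-byte write at offset 0 */
    io_submit(ctx, 1, cbs);                        /* what ksfdgo boils down to */
    io_getevents(ctx, 1, 1, &ev, NULL);            /* what ksfdwtio boils down to */
    printf("aio write returned %ld\n", (long)ev.res);

    io_destroy(ctx);
    close(fd);
    unlink("aio_testfile");
    return 0;
}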

After kcrfw_do_write is done, kcrfw_post is executed, just like we saw with the kcrfw_do_null_write function.

Also identical to the null write, the current on disk SCN is read from kcrfsg_ using kcscu8.

Then a series of latch gets is executed to protect the changes in the SGA that reflect the write that was executed.
First the redo writing latch is taken. After the redo writing latch is obtained, kcrfsg_+12, which is X$KCRFWS.LAST_BLK, is read and updated with the write:

kcrfw_post+715:0x0000000060016234(fixed sga|var:kcrfsg_+12 shared pool|X$KCRFWS.LAST_BLK+0 shared pool|x_kcrfws.last_blk+0 ):R:4:0x16a15/92693()
kcrfw_post+725:0x00000000600163c0(fixed sga|var:kcrfsg_+408 ):R:4:0xf5/245()
kcrfw_post+736:0x00007fffa1072958():R:8:0/0()
kcrfw_post+743:0x00007fffa1072880():W:4:0xf5/245()
kcrfw_post+753:0x00007fffa1072940():R:4:0xdf7/3575()
kcrfw_post+767:0x0000000060016234(fixed sga|var:kcrfsg_+12 shared pool|X$KCRFWS.LAST_BLK+0 shared pool|x_kcrfws.last_blk+0 ):W:4:0x16a17/92695()

While still holding the redo writing latch, the redo allocation latch for the correct public redo strand is taken, and MEM_BUFS_AVAILABLE_KCRFA_CLN and AVAILABLE_BUFS_KCRFA_CLN are updated:

kcrfw_post+803:0x0000000076fdf3bc(shared pool|KCRFA+52 shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0xde2f/56879()
kcrfw_post+803:0x0000000076fdf3bc(shared pool|KCRFA+52 shared pool|permanent memor,duration 1,cls perm+16114620 ):W:4:0xde31/56881()
MEM_BUFS_AVAILABLE_KCRFA_CLN
kcrfw_post+3839:0x0000000076fdf3b8(shared pool|KCRFA+48 shared pool|permanent memor,duration 1,cls perm+16114616 ):R:4:0xd7d0/55248()
kcrfw_post+3839:0x0000000076fdf3b8(shared pool|KCRFA+48 shared pool|permanent memor,duration 1,cls perm+16114616 ):W:4:0xd7d0/55248()
AVAILABLE_BUFS_KCRFA_CLN

Then the redo allocation latch is freed, and the redo writing latch is freed.

Then the redo writing latch is taken again, in order to update the last write scn in kcrfsg_:

ub8_to_kscn_impl+21:0x00000000600162c0(fixed sga|var:kcrfsg_+152 shared pool|X$KCRFWS.LAST_WRITE_SCN+0 shared pool|x_kcrfws.last_write_scn+0 ):W:4:0x762b81/7744385()
ub8_to_kscn_impl+37:0x00000000600162c6(fixed sga|var:kcrfsg_+158 shared pool|X$KCRFWS.LAST_WRITE_SCN+6 ):W:2:0/0()
ub8_to_kscn_impl+41:0x00000000600162c4(fixed sga|var:kcrfsg_+156 shared pool|X$KCRFWS.LAST_WRITE_SCN+4 ):W:2:0/0()

And while holding the redo writing latch, the kcsnew_rba function is called to set the redo byte address data in kcrfsg_. In order to do this, the ‘consistent RBA’ latch is obtained first:

kslgetl+138:0x0000000060049c80(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 ):R:8:0/0()
kslgetl+138:0x0000000060049c80(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 ):W:8:0x13/19()

And then fields in the kcrfsg_ struct are read and updated.

kcsnew_rba+47:0x00000000600162d8(fixed sga|var:kcrfsg_+176 ):W:4:0x1/1()  // rba details being modified flag
kcsnew_rba+58:0x00000000600162d4(fixed sga|var:kcrfsg_+172 ):R:4:0x7e/126()
kcsnew_rba+58:0x00000000600162d4(fixed sga|var:kcrfsg_+172 ):W:4:0x7f/127()
kcsnew_rba+65:0x00007fffa1072880():R:8:0x16a18000000f5/398126288470261()
kcsnew_rba+69:0x00000000600162c8(fixed sga|var:kcrfsg_+160 ):W:8:0x16a18000000f5/398126288470261() // 164: rba current log block
kcsnew_rba+72:0x00007fffa1072888():R:4:0x10010/65552()
kcsnew_rba+76:0x00000000600162d0(fixed sga|var:kcrfsg_+168 ):W:4:0x10010/65552()
kcsnew_rba+83:0x00000000600162d8(fixed sga|var:kcrfsg_+176 ):W:4:0/0()  // rba details being modified flag
kcsnew_rba+91:0x00000000600162e0(fixed sga|var:kcrfsg_+184 shared pool|pointer:consistent rba latch+0 ):R:8:0x60049c80/1610914944(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 )
kcsnew_rba+95:0x00007fffa1072658():R:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsnew_rba+99:0x00007fffa1072650():R:8:0x7f520ba98f40/139990359707456()
kcsnew_rba+106:0x00007fffa1072660():R:8:0x7fffa1072970/140735894989168()

After the kcrfsg_ RBA update, the consistent RBA latch is freed. While still holding the redo writing latch, the last write scn time is updated in kcrfsg_:

kcrfw_post+1355:0x0000000077eceda0(Variable Size(pgsz:2048k)|+384626080 shared pool|permanent memor,duration 1,cls perm+2451512 ):R:4:0x397df7b0/964556720()
kcrfw_post+1361:0x000000006001623c(Fixed Size(pgsz:2048k)|+90684 fixed sga|var:kcrfsg_+20 shared pool|X$KCRFWS.LAST_WRITE_SCN_TIME+0 ):W:4:0x397df7b0/964556720()

And the redo writing latch is released.
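
As a side note: both fields are externalised in X$KCRFWS, so you can watch the log writer update them from a second session. A minimal sketch, assuming the annotation names above (X$KCRFWS.LAST_WRITE_SCN, X$KCRFWS.LAST_WRITE_SCN_TIME) map one-to-one to column names; run as SYSDBA, because X$ views are only visible to SYS:

SQL> select last_write_scn, last_write_scn_time from x$kcrfws;

The output is instance specific, so I left it out; repeat the query after a commit in another session and both values should move.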

A little further in the function kcrfw_post, the kcsaj8 function is called to change the on disk SCN to the LWN SCN, exactly like in the null write function:

 | | | > kcsaj8(0x60016290, 0x7ffc3dd118b8, ...)
 | | | < kcsaj8+0x0000000000dc returns: 0x1

After the adjustment of the on disk SCN, the foregrounds that have registered a wait via the post/wait interface need to be posted. This is done in the kslpslf (kernel service lock management post list of foregrounds) function, which is called from the kcrfw_post function:

 | | | > kslpslf(0x97, 0x769a5730, ...)
 | | | | > ksl_postm_init(0x7ffc3dd094f0, 0x7ffc3dd11510, ...)
 | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | > kslgetl(0x7a5d7388, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > ksl_update_post_stats(0x7a619f48, 0x7ffc3dd11510, ...)
 | | | | | > dbgtTrcData_int(0x7f2c23b5e6c0, 0x2050031, ...)
 | | | | | | > dbgtBucketRedirect(0x7f2c23b5e6c0, 0x7ffc3dd093c8, ...)
 | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
 | | | | | | > dbgtIncInMemTrcRedirect(0x7f2c23b5e6c0, 0x6fa, ...)
 | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
 | | | | | | > skgstmGetEpochTs(0x7f2c23b5e6c0, 0x6fa, ...)
 | | | | | | | > gettimeofday@plt(0x7ffc3dd08ed0, 0, ...)
 | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x2050c7178c0c1
 | | | | | | > dbgtrRecAllocate(0x7f2c23b5e6c0, 0x7ffc3dd09370, ...)
 | | | | | | | > dbgtrPrepareWrite(0x7f2c23b5e6c0, 0x71121048, ...)
 | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x11
 | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
 | | | | | | > _intel_fast_memcpy(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | <> _intel_fast_memcpy.P(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | <> __intel_ssse3_rep_memcpy(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x71124368
 | | | | | | > dbgtrRecEndSegment(0x7f2c23b5e6c0, 0x7ffc3dd09370, ...)
 | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
 | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
 | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
 | | | | > kslfre(0x7a5d7388, 0, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | > ksl_postm_do_posts(0x7ffc3dd094f0, 0, ...)
 | | | | | > kslpwrp_int(0x7ffc3dd09450, 0x7f2c23b9e9a0, ...)
 | | | | | | > skgpwpost(0x7ffc3dd09350, 0x7f2c23ba3ca0, ...)
 | | | | | | <> sskgpwpost(0x7ffc3dd09350, 0x7f2c23ba3ca0, ...)
 | | | | | | | > semop@plt(0x38000, 0x7ffc3dd07310, ...)
 | | | | | | | < semop+0x00000000000f returns: 0
 | | | | | | < sskgpwpost+0x00000000009a returns: 0x1
 | | | | | < kslpwrp_int+0x000000000069 returns: 0x1
 | | | | | > kgecss(0x7f2c23b9e9a0, 0x7f2c1e54b048, ...)
 | | | | | < kgecss+0x000000000035 returns: 0x7f2c23be0f40
 | | | | < ksl_postm_do_posts+0x000000000144 returns: 0x1
 | | | < kslpslf+0x000000000822 returns: 0x7a5e26c8

Essentially, what happens here is that the posting of a message is initialised (ksl_postm_init; kernel service locking post multi init), although most of the initialisation from a memory perspective seems to happen in the kslpslf function. Then the post/wait queue latch is obtained, and the kslpslf function manipulates the post/wait queue (which seems to be multiple memory areas linked together with pointers) to reflect the process it is going to semctl/semop.
Then in the function ksl_update_post_stats, some housekeeping is done:

ksl_update_post_stats+117:0x000000007a603758(shared pool|permanent memor,duration 1,cls perm+5465944 ):W:8:0x7a612b58/2053188440(Variable Size(pgsz:2048k)|+425798488 shared pool|(indx:34)X$KSUPR+0 shared pool|permanent memor,duration 1,cls perm+5528408 )
ksl_update_post_stats+124:0x000000007a603748(shared pool|(indx:19)X$KSUPR.KSLLAPSC+0 shared pool|permanent memor,duration 1,cls perm+5465928 ):R:4:0x1e/30()
ksl_update_post_stats+134:0x000000007a603748(shared pool|(indx:19)X$KSUPR.KSLLAPSC+0 shared pool|permanent memor,duration 1,cls perm+5465928 ):W:4:0x1f/31()
ksl_update_post_stats+141:0x000000007a603744(shared pool|(indx:19)X$KSUPR.KSLLAPSN+0 shared pool|permanent memor,duration 1,cls perm+5465924 ):W:4:0x3/3()
ksl_update_post_stats+2705:0x00007fffa1072590():R:8:0x97/151()
ksl_update_post_stats+2708:0x000000007a612f80(shared pool|(indx:34)X$KSUPR.KSLLID1R+0 shared pool|permanent memor,duration 1,cls perm+5529472 ):W:8:0x97/151()
ksl_update_post_stats+2716:0x00007fffa1072598():R:8:0/0()
ksl_update_post_stats+2720:0x000000007a612f88(shared pool|(indx:34)X$KSUPR.KSLLID2R+0 shared pool|permanent memor,duration 1,cls perm+5529480 ):W:8:0/0()
ksl_update_post_stats+2728:0x00007fffa10725a0():R:8:0x7fff00005645/140733193410117()
ksl_update_post_stats+2732:0x000000007a612f90(shared pool|(indx:34)X$KSUPR.KSLLRTYP+0 shared pool|permanent memor,duration 1,cls perm+5529488 ):W:8:0x7fff00005645/140733193410117()
ksl_update_post_stats+202:0x000000007a612fc0(shared pool|permanent memor,duration 1,cls perm+5529536 ):W:8:0x7a6032e8/2053124840(Variable Size(pgsz:2048k)|+425734888 shared pool|(indx:19)X$KSUPR+0 shared pool|permanent memor,duration 1,cls perm+5464808 )
ksl_update_post_stats+209:0x000000007a612fbc(shared pool|(indx:34)X$KSUPR.KSLLAPRC+0 shared pool|permanent memor,duration 1,cls perm+5529532 ):R:4:0x2/2()
ksl_update_post_stats+209:0x000000007a612fbc(shared pool|(indx:34)X$KSUPR.KSLLAPRC+0 shared pool|permanent memor,duration 1,cls perm+5529532 ):W:4:0x3/3()
ksl_update_post_stats+216:0x000000007a612fb0(shared pool|(indx:34)X$KSUPR.KSLLAPRV+0 shared pool|permanent memor,duration 1,cls perm+5529520 ):W:4:0x3/3()

Of course x$ksupr is the kernel service user process, which is the base table for v$process. However, there are a lot of columns in x$ksupr that are not externalised in v$process. The indx number of the logwriter is 19, the indx number of the process that is going to be posted is 34. One of the fields that is not externalised is ksllapsc, the Posts Send Count. Because the logwriter is going to post the foreground session, its posts send count is increased by one. ksllapsn is the last Post SeNt location number, which is set to 3.

This number corresponds with x$kslpo:

SQL> select indx, ksllwnam, kslposts from x$kslpo where indx = 3;
      INDX KSLLWNAM               KSLPOSTS
---------- -------------------- ----------
         3 kslpsr                     4356

(kernel service locking post receive (?))

The statistics of the foreground process in x$ksupr are changed too: ksllaprc, the Posts Received Count, is increased by one as well, because the logwriter is going to send a message to the foreground process. ksllaprv is the last Post ReceiVed location number, which is also set to 3.
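
Because these columns are not externalised in v$process, the only way to look at them is to query x$ksupr directly. A minimal sketch, using the column names from the annotations above and the indx numbers from this trace (19 for the logwriter, 34 for the posted foreground); the indx numbers are specific to my instance, you can map processes to indx numbers via the addr column, which is identical in v$process:

SQL> select indx, ksllapsc, ksllapsn, ksllaprc, ksllaprv
from x$ksupr
where indx in (19, 34);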

After that, some tracing and other housekeeping takes place in dbgtTrcData_int, after which the post/wait queue latch is freed. The next function does the actual semop call: ksl_postm_do_posts, kernel service locking post multi do posts. In ksl_postm_do_posts, the function kslpwrp_int is called, which looks like it essentially is a function that can call the function skgpwpost (system kernel generic post wait post); if there are multiple sessions to be posted, it calls the function skgpwvectorpost instead, which iterates through the list of processes to be posted.

At this point the kcrfw_redo_write_driver function is done.

This is the fourth blogpost in a series of blogposts about how the Oracle database handles redo. The previous blogpost talked about the work cycle of the log writer: https://fritshoogland.wordpress.com/2018/02/12/a-look-into-oracle-redo-part-3-the-log-writer-work-cycle-overview/. This post looks into the execution of the kcrfw_redo_write_driver function, which is executed in ksbcti.

Now that we are familiar with how the logwriter works in general, we need to take a closer look at the kcrfw_redo_write_driver function. First let me be clear: the kcrfw_redo_write_driver function inside ksbcti is called every time the logwriter process times out on its semaphore, which is every 3 seconds, so it is also called when nothing has been written into the public redo strands. In fact, when the log writer times out on semtimedop (which means it has not (yet) been semctl'ed or semop'ed), it doesn't know if there are entries in the public redo strands at that point in time.
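
This work cycle is visible from the wait interface too: while sleeping on its semaphore, the logwriter is waiting on the event 'rdbms ipc message', bounded by the 3 second timeout. A minimal sketch to observe it (the program filter is an assumption about how your platform formats the program column):

SQL> select program, event, state, seconds_in_wait
from v$session
where program like '%(LGWR)';

If you run this repeatedly on an idle instance, seconds_in_wait should never get far past 3 before a new wait cycle starts.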

There are a lot of things going on in the kcrfw_redo_write_driver function. The following traces are small snippets of execution from the function, not every detail in it.

The first thing that is of interest is the logwriter obtaining an LWN (log write number) and an SCN.

 | | > kcsnew3(0x600113b8, 0x7ffc3dd10d28, ...)
 | | | > kcsnew8(0x600113b8, 0x7ffc3dd10b70, ...)
 | | | | > kslgetl(0x60049800, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > kslfre(0x60049800, 0x1, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | < kcsnew8+0x000000000117 returns: 0

Please mind that at this time it's unknown if there is anything to write in the public redo strands.

So the kcsnew3 function is executed, with 0x600113b8 as first argument. That is the address of kcsgscn_, the instance global SCN. The kcsnew3 function calls the kcsnew8 function, which calls kslgetl for latch address 0x60049800. That address belongs to the 'lgwr LWN SCN' latch in my instance. In the function call overview there is nothing more to see. If we look at the pina memory trace, we see a lot more; this is what is executed in between kslgetl and kslfre:

kcsnew8+107:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0x1/1()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):R:4:0xc5/197()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):W:4:0xc6/198()
kcsnew8+125:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b7e/7744382()
kcsnew8+131:0x00007fffa1071ae0():W:4:0/0()
kcsnew8+140:0x00007fffa1071b12():W:1:0/0()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b7e/7744382()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):W:8:0x762b7f/7744383()
kcsnew8+165:0x00007fffa1071ae0():R:4:0/0()
kcsnew8+184:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsnew8+193:0x00007f520ba8ae90():R:8:0x7a823620/2055353888(shared pool|permanent memor,duration 1,cls perm+769 )
kcsnew8+212:0x000000007a823e40(shared pool|permanent memor,duration 1,cls perm+7695936 ):R:8:0x719beed8/190604(shared pool|ksu:stats_freel,duration 1,cls freeabl+24 )
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):R:8:0xc6/198()
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):W:8:0xc7/199()
kcsnew8+223:0x00007fffa1071bf0():W:8:0x762b7f/7744383()
kcsnew8+232:0x0000000060016260(fixed sga|var:kcrfsg_+56 shared pool|x_kcrfws.lwn_scn+0 ):W:8:0x762b7f/7744383()
kcsnew8+238:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0/0()
kcsnew8+246:0x0000000060016288(fixed sga|var:kcrfsg_+96 shared pool|pointer:lgwr lwn scn latch+0 ):R:8:0x60049800/1610913792(fixed sga|(parent)latch:lgwr LWN SCN+0 fixed sga|var:lwn_scn_lat_+0 )
kcsnew8+250:0x00007fffa1071ac8():W:8:0x10e3ed3f/283372863()

What is visible here, is that at kcrfsg_ offset 76 '1' is written. This is quite probably a flag indicating the LWN related entries are being modified (this is a guess; of course the LWN SCN latch essentially performs the same function), especially since this is set back to '0' at the end of the function (kcsnew8+238). At kcrfsg_ offset 72 a number is read and increased by 1. This is most likely the LWN number. If you take a look at the number of gets for the 'lgwr LWN SCN' latch, you will see the number of gets is very close (albeit not equal) to the LWN number.

Then the LWN SCN is fetched from kcsgscn_; it is first read, then increased by one to 7744383 and written back into kcsgscn_ (get and advance). The SCN from kcsgscn_ is then also written into the kcrfsg_ struct at offset 56 a little further on, after which the earlier set '1' is reset to '0' at offset 76. This SCN is visible in x$kcrfws as LWN SCN.
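
Both observations can be cross-checked from SQL. A minimal sketch, assuming the x_kcrfws.lwn_scn annotation above maps directly to a column name in x$kcrfws:

SQL> select name, gets from v$latch where name = 'lgwr LWN SCN';
SQL> select lwn_scn from x$kcrfws;

Repeat both queries on a running instance and the latch gets and the LWN SCN should both advance with every write cycle.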

The next thing to look at is the function kcrfw_gather_lwn a little further:

 | | > kcrfw_gather_lwn(0x7ffc3dd10d68, 0x77eced90, ...)
 | | | > kslgetl(0x76fe0c10, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7ffc3dd10d68, 0, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x76fe0c10, 0x118b2, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7ffc3dd10d68, 0x1, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x76fe0cb0, 0x19, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | < kcrfw_gather_lwn+0x00000000065c returns: 0xffffffff

In the kcrfw_gather_lwn function, first latch address 0x76fe0c10 is taken in immediate mode, which is the redo allocation child #1, and then the function kcrfw_gather_strand is called, which returns 0, after which the latch is freed. This pattern repeats itself with a different latch address, 0x76fe0cb0, which is redo allocation child #2. This gives a hint that two public strands are active. By looking at the memory access trace again, a fixed sga variable is used to find out how many strands are active:

7038:kcrfw_gather_lwn+76:0x00000000600169b8(Fixed Size(pgsz:2048k)|+92600 fixed sga|var:kcrf_max_strands_+0 ):R:4:0x2/2()
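
So the fixed sga variable kcrf_max_strands_ holds the number of public redo strands, 2 in this case. A quick cross-check from SQL, using the same filter as the x$kcrfstrand query in the first post of this series (public strands have a non-zero last_buf_kcrfa), which returns 2 on this instance:

SQL> select count(*) from x$kcrfstrand where last_buf_kcrfa != '00';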

A little further, information is obtained from kcrfsg_ about the redo allocation latch and the KCRFA structure address:

7052:kcrfw_gather_lwn+223:0x0000000060016570(fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
7053:kcrfw_gather_lwn+230:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )

Then the redo allocation latch child #1 is obtained via kslgetl, after which kcrfw_gather_lwn calls kcrfw_gather_strand:

kcrfw_gather_strand+2:0x00007fffa1071b00():W:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1071af8():W:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+29:0x00007fffa1071af0():W:8:0x76fdf2b0/1996354224(shared pool|permanent memor,duration 1,cls perm+16114352 )
kcrfw_gather_strand+36:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_strand+53:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1071de8():R:8:0x762b7f/7744383()
kcrfw_gather_strand+79:0x0000000076fdf390(shared pool|KCRFA+8 shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x762b7c/7744380()
kcrfw_gather_strand+86:0x0000000076fdf3a8(shared pool|KCRFA+32 shared pool|NEXT_BUF_NUM_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x65e/1630()
kcrfw_gather_strand+121:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):R:4:0x65e/1630()
kcrfw_gather_strand+169:0x00007fffa1071bf8():W:4:0x65e/1630()
kcrfw_gather_strand+176:0x00007fffa1071bfc():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1071af0():R:8:0x76fdf2b0/1996354224(shared pool|permanent memor,duration 1,cls perm+16114352 )
kcrfw_gather_strand+186:0x00007fffa1071af8():R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+193:0x00007fffa1071b00():R:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+194:0x00007fffa1071b08():R:8:0x31f0f00/52367104()

In kcrfw_gather_strand, we see the sequence we have seen before: at offset 0 of kcrfsg_, the KCRFA struct address is looked up, after which the function investigates addresses in the KCRFA struct for the first public redo strand only. I've annotated these in the pinatrace snippet above. At offset 36 in the function, the last buffer written to the online redo log file is obtained from offset 208 in KCRFA, which is public strand buffer 1630. A little further, at offsets 86 and 121 in the function kcrfw_gather_strand, the most recently used public strand buffer that has been written to is obtained at KCRFA offset 32 (NEXT_BUF_NUM_KCRFA_CLN), and the highest buffer number that has been investigated (gathered) by the logwriter process at KCRFA offset 212 (LAST_BUF_GATHERED_KCRFA). In all these cases, the public redo strand buffer number is 1630. This means no redo has been written into the first public redo strand.
After reading the KCRFA information for the first public redo strand, the latch for this struct is freed.
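
Because these KCRFA fields are externalised in x$kcrfstrand, the 'nothing to write' state is also visible from SQL. A minimal sketch, with the column names taken from the annotations above and the x$kqfco listing further on in this article:

SQL> select indx, next_buf_num_kcrfa_cln, last_buf_written_kcrfa, last_buf_gathered_kcrfa
from x$kcrfstrand
where last_buf_kcrfa != '00';

When, as in the trace above, all three buffer numbers are equal for a strand (1630 here for strand 0), there is nothing outstanding in that strand.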

A little further, the latch is obtained that protects the second public strand, and kcrfw_gather_strand is executed again to read the KCRFA information for this strand:

kcrfw_gather_strand+2:0x00007fffa1071b00():W:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1071af8():W:8:0x76fdf4b0/1996354736(shared pool|KCRFA+296 shared pool|permanent memor,duration 1,cls perm+16114864 )
kcrfw_gather_strand+29:0x00007fffa1071af0():W:8:0x76fdf1f0/1996354032(shared pool|permanent memor,duration 1,cls perm+16114160 )
kcrfw_gather_strand+36:0x0000000076fdf580(shared pool|KCRFA+504 shared pool|permanent memor,duration 1,cls perm+16115072 ):R:4:0x1d/29()
kcrfw_gather_strand+53:0x0000000076fdf5a8(shared pool|KCRFA+544 shared pool|permanent memor,duration 1,cls perm+16115112 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1071de8():R:8:0x762b7f/7744383()
kcrfw_gather_strand+79:0x0000000076fdf4b8(shared pool|KCRFA+304 shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x762ae7/7744231()
kcrfw_gather_strand+86:0x0000000076fdf4d0(shared pool|KCRFA+328 shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0x1d/29()
kcrfw_gather_strand+121:0x0000000076fdf584(shared pool|KCRFA+508 shared pool|permanent memor,duration 1,cls perm+16115076 ):R:4:0x1d/29()
kcrfw_gather_strand+169:0x00007fffa1071bf8():W:4:0x1d/29()
kcrfw_gather_strand+176:0x00007fffa1071bfc():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1071af0():R:8:0x76fdf1f0/1996354032(shared pool|permanent memor,duration 1,cls perm+16114160 )
kcrfw_gather_strand+186:0x00007fffa1071af8():R:8:0x76fdf4b0/1996354736(shared pool|KCRFA+296 shared pool|permanent memor,duration 1,cls perm+16114864 )
kcrfw_gather_strand+193:0x00007fffa1071b00():R:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+194:0x00007fffa1071b08():R:8:0x31f0f00/52367104()

This is exactly the same function, just reading the same information for a different strand, which means at a different offset in KCRFA. If you subtract the size of the information for a single strand, which we calculated to be 296 (see blogpost number 2), from the KCRFA offsets, you get the same numbers as above (504-296=208=LAST_BUF_WRITTEN). Once you understand what is read, it should be fairly easy to see that nothing has been written into the second strand either. So at this point, the logwriter knows there is nothing to write.

What does the logwriter do when there is nothing to write? It executes a ‘null write’:

 | | > kcrfw_do_null_write(0, 0, ...)
 | | | > kcrfw_slave_phase_batchdo(0, 0, ...)
 | | | | > kcrfw_slave_phase_enter(0, 0x40, ...)
 | | | | < kcrfw_slave_phase_enter+0x000000000449 returns: 0
 | | | <> kcrfw_slave_phase_exit(0, 0x40, ...)
 | | | < kcrfw_slave_phase_exit+0x00000000035a returns: 0
 | | | > kcrfw_post(0, 0, ...)
 | | | | > kcrfw_slave_single_getactivegroup(0, 0, ...)
 | | | | < kcrfw_slave_single_getactivegroup+0x000000000047 returns: 0x769a5620
 | | | | > kspGetInstType(0x1, 0x1, ...)
 | | | | | > vsnffe_internal(0x19, 0x1, ...)
 | | | | | | > vsnfprd(0x19, 0x1, ...)
 | | | | | | < vsnfprd+0x00000000000f returns: 0x8
 | | | | | | > kfIsASMOn(0x19, 0x1, ...)
 | | | | | | <> kfOsmInstanceSafe(0x19, 0x1, ...)
 | | | | | | < kfOsmInstanceSafe+0x000000000031 returns: 0
 | | | | | < vsnffe_internal+0x0000000000a7 returns: 0
 | | | | | > kspges(0x115, 0x1, ...)
 | | | | | < kspges+0x00000000010f returns: 0
 | | | | < kspGetInstType+0x0000000000b1 returns: 0x1
 | | | | > kcrfw_slave_phase_enter(0x1, 0x40, ...)
 | | | | < kcrfw_slave_phase_enter+0x00000000006f returns: 0x40
 | | | | > kcscu8(0x60016290, 0x7ffc3dd10a98, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcsaj8(0x60016290, 0x7ffc3dd10a38, ...)
 | | | | < kcsaj8+0x0000000000dc returns: 0x1
 | | | | > kcrfw_slave_phase_exit(0x1, 0x40, ...)
 | | | | < kcrfw_slave_phase_exit+0x00000000008e returns: 0
 | | | | > kslpsemf(0x97, 0, ...)
 | | | | | > ksl_postm_init(0x7ffc3dd08730, 0x7ffc3dd10750, ...)
 | | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | < kslpsemf+0x0000000006b5 returns: 0x1f
 | | | | > kcrfw_slave_barrier_nonmasterwait(0x769a5628, 0x4, ...)
 | | | | < kcrfw_slave_barrier_nonmasterwait+0x000000000035 returns: 0x600161a0
 | | | < kcrfw_post+0x000000000c1c returns: 0xd3
 | | < kcrfw_do_null_write+0x0000000000b2 returns: 0xd3

This means there is housekeeping to do, despite no public redo strand buffers needing to be written. The most crucial things it does, as far as I can detect, are updating the on disk SCN in the kcrfsg_ struct with the already set LWN SCN, and posting any outstanding processes that are sleeping on a semaphore via post/wait. These are done in the kcrfw_post function, which probably means (kernel cache redo file write) 'post write'; however, there was nothing to write (null write), since there was nothing in the public redo strands. If you look at the above function trace, you see that inside kcrfw_do_null_write, kcrfw_post is called.

Inside kcrfw_post, kcscu8 (kernel cache service get current SCN) is called with 0x60016290 as first argument. That is the memory address of offset 104 in the kcrfsg_ struct in the fixed SGA, which is externalised as ON_DISK_SCN in X$KCRFWS; in other words, the current value of the on disk SCN is read:

kcscu8+2:0x00007fffa10718a0():W:8:0x7fffa1071bb0/140735894985648()
kcscu8+13:0x00007fffa1071898():W:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcscu8+20:0x00007fffa1071890():W:8:0x7f520ba98f40/139990359707456()
kcscu8+27:0x00000000600162a0(Fixed Size(pgsz:2048k)|+90784 fixed sga|var:kcrfsg_+120 ):R:4:0/0()
kcscu8+31:0x00000000600162a4(Fixed Size(pgsz:2048k)|+90788 fixed sga|var:kcrfsg_+124 ):R:4:0/0()
kcscu8+39:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):R:8:0x762b7e/7744382()
kcscu8+43:0x00000000600162a0(Fixed Size(pgsz:2048k)|+90784 fixed sga|var:kcrfsg_+120 ):R:4:0/0()
kcscu8+56:0x00007fffa1071b18():W:8:0x762b7e/7744382()
kcscu8+59:0x00007fffa1071898():R:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcscu8+63:0x00007fffa1071890():R:8:0x7f520ba98f40/139990359707456()
kcscu8+70:0x00007fffa10718a0():R:8:0x7fffa1071bb0/140735894985648()
kcscu8+71:0x00007fffa10718a8():R:8:0x3202150/52437328()

The next function that is called is kcsaj8 (kernel cache service adjust SCN), with 0x60016290 as first argument, indicating it is going to change the on disk SCN:

kcsaj8+105:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):R:8:0x762b7e/7744382()
kcsaj8+105:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):W:8:0x762b7f/7744383()
(these are only the relevant lines from the memory trace for the function kcsaj8)

What this means is that the LWN SCN that was obtained at the beginning of the write cycle using the kcsnew3 function is now written into the on disk SCN too, despite nothing actually being written to disk. The obvious question at this point is: if the on disk SCN is increased without anything actually written, would there be an SCN value that actually tracks the SCN of truly written ("on disk") redo? To spoil the next blogpost about actual redo writing: this is administered in an SCN value called the 'real redo SCN'.
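
This is simple to verify. A minimal sketch, assuming the x_kcrfws annotation names map directly to column names in x$kcrfws:

SQL> select on_disk_scn, lwn_scn from x$kcrfws;

On an idle instance, right after a null write, both values should be equal, even though nothing was actually written.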

After kcrfw_post returns, the execution returns from kcrfw_do_null_write and kcrfw_redo_write_driver, ending the ‘null write’.

This is the second post in a series of blogposts on Oracle database redo internals. If you landed on this blogpost without having read the first blogpost, here is a link to the first blogpost: https://fritshoogland.wordpress.com/2018/01/29/a-look-into-oracle-redo-part-1-redo-allocation-latches/ The first blogpost contains all the versions used and a synopsis on what the purpose of this series of blogposts is.

In the first part, I showed how the principal access to the public redo strands is controlled by redo allocation latches, and showed a snippet of trace information of memory accesses of a foreground session when using the first public redo strand:

kcrfw_redo_gen_ext+11277:0x0000000076fdf3bc(shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf3d8(shared pool|permanent memor,duration 1,cls perm+16114648 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf430(shared pool|permanent memor,duration 1,cls perm+16114736 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3d0(shared pool|permanent memor,duration 1,cls perm+16114640 ):R:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11312:0x0000000076fdf428(shared pool|permanent memor,duration 1,cls perm+16114728 ):W:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3c8(shared pool|permanent memor,duration 1,cls perm+16114632 ):R:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11322:0x0000000076fdf420(shared pool|permanent memor,duration 1,cls perm+16114720 ):W:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11327:0x0000000076fdf3c0(shared pool|permanent memor,duration 1,cls perm+16114624 ):R:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11332:0x0000000076fdf418(shared pool|permanent memor,duration 1,cls perm+16114712 ):W:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11337:0x0000000076fdf3b8(shared pool|permanent memor,duration 1,cls perm+16114616 ):R:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11342:0x0000000076fdf410(shared pool|permanent memor,duration 1,cls perm+16114704 ):W:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11347:0x0000000076fdf3b0(shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf408(shared pool|permanent memor,duration 1,cls perm+16114696 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:8:0x7270/29296()
kcrfw_redo_gen_ext+11312:0x0000000076fdf400(shared pool|permanent memor,duration 1,cls perm+16114688 ):W:8:0x7270/29296()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf398(shared pool|permanent memor,duration 1,cls perm+16114584 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf3f0(shared pool|permanent memor,duration 1,cls perm+16114672 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf390(shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11342:0x0000000076fdf3e8(shared pool|permanent memor,duration 1,cls perm+16114664 ):W:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11347:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11352:0x0000000076fdf3e0(shared pool|permanent memor,duration 1,cls perm+16114656 ):W:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x7270/29296()
kcrfw_redo_gen_ext+11386:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):W:4:0x7271/29297()
kcrfw_redo_gen_ext+11392:0x0000000076fdf468(shared pool|permanent memor,duration 1,cls perm+16114792 ):R:8:0xe0000000/3758096384(Redo Buffers(pgsz:2048k)|+134217728 )

And a snippet of trace information of memory accesses of a foreground session when using the second public redo strand:

kcrfw_redo_gen_ext+11277:0x0000000076fdf4e4(shared pool|permanent memor,duration 1,cls perm+16114916 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf500(shared pool|permanent memor,duration 1,cls perm+16114944 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf558(shared pool|permanent memor,duration 1,cls perm+16115032 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4f8(shared pool|permanent memor,duration 1,cls perm+16114936 ):R:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11312:0x0000000076fdf550(shared pool|permanent memor,duration 1,cls perm+16115024 ):W:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4f0(shared pool|permanent memor,duration 1,cls perm+16114928 ):R:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11322:0x0000000076fdf548(shared pool|permanent memor,duration 1,cls perm+16115016 ):W:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11327:0x0000000076fdf4e8(shared pool|permanent memor,duration 1,cls perm+16114920 ):R:8:0x800/2048()
kcrfw_redo_gen_ext+11332:0x0000000076fdf540(shared pool|permanent memor,duration 1,cls perm+16115008 ):W:8:0x800/2048()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4e0(shared pool|permanent memor,duration 1,cls perm+16114912 ):R:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11342:0x0000000076fdf538(shared pool|permanent memor,duration 1,cls perm+16115000 ):W:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4d8(shared pool|permanent memor,duration 1,cls perm+16114904 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf530(shared pool|permanent memor,duration 1,cls perm+16114992 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:8:0xe6/230()
kcrfw_redo_gen_ext+11312:0x0000000076fdf528(shared pool|permanent memor,duration 1,cls perm+16114984 ):W:8:0xe6/230()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4c8(shared pool|permanent memor,duration 1,cls perm+16114888 ):R:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf4c0(shared pool|permanent memor,duration 1,cls perm+16114880 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf518(shared pool|permanent memor,duration 1,cls perm+16114968 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4b8(shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11342:0x0000000076fdf510(shared pool|permanent memor,duration 1,cls perm+16114960 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4b0(shared pool|permanent memor,duration 1,cls perm+16114864 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11352:0x0000000076fdf508(shared pool|permanent memor,duration 1,cls perm+16114952 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11366:0x0000000076fdf560(shared pool|permanent memor,duration 1,cls perm+16115040 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0xe6/230()
kcrfw_redo_gen_ext+11386:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):W:4:0xe7/231()
kcrfw_redo_gen_ext+11392:0x0000000076fdf590(shared pool|permanent memor,duration 1,cls perm+16115088 ):R:8:0xe0200000/3760193536(Redo Buffers(pgsz:2048k)|+136314880 )

If you pay really close attention, you see the memory locations moved from 0x76fdf3?? / 0x76fdf4?? to 0x76fdf4?? / 0x76fdf5??, all in permanent memory in the shared pool. Would this mean information for each strand is kept in a struct per strand? Let's take a memory access at a specific offset in the kcrfw_redo_gen_ext function for the first strand, take the access at exactly the same offset in the function for the second strand (which means it's probably doing the same thing), and calculate the difference between the memory addresses:

strand 0
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8
strand 1
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8

Now take the memory addresses, and calculate the difference: 0x76fdf520 – 0x76fdf3f8 = 0x128 = 296

Okay, so would the size of the probable struct per strand be 296 bytes? Another question: where would this potential struct start?

Often, the location of structs is fetched from another location, just like we saw earlier for the redo allocation latch, especially if they are created dynamically, unlike fixed SGA structures, which are not dynamic by nature. Upon close examination of similarities between both pina traces, I found usage of a pointer in the kcrfsg_ struct at offset +0:

kcrfw_redo_gen_ext+14362:0x0000000060016228(fixed sga|var:kcrfsg_+0 ):R:8:0x76fdf388/1996354440(shared pool|permanent memor,duration 1,cls perm+16114568 )

The difference between the memory address I took and the pointer address in kcrfsg_ at offset 0 for the first public redo strand is:
0x76fdf3f8 – 0x76fdf388 = 0x70 = 112
The difference between the memory address I took for the second public redo strand and the pointer in kcrfsg_ at offset 0 is:
0x76fdf520 – 0x76fdf388 = 0x198 = 408
Then subtract the difference calculated earlier between the memory locations of the first and second public redo strand:
408 – 296 = 112
Bingo!
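
If you'd rather let the database do the hex arithmetic, SQL*Plus can reproduce both differences in one go:

SQL> select to_number('76fdf3f8','xxxxxxxx') - to_number('76fdf388','xxxxxxxx') strand_1,
to_number('76fdf520','xxxxxxxx') - to_number('76fdf388','xxxxxxxx') strand_2
from dual;

  STRAND_1   STRAND_2
---------- ----------
       112        408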

It would be really helpful to have some kind of explanation about this structure… I searched through My Oracle Support (very seldom does the regular knowledge base provide internals information; the best chance for internals is the bug database, the rest is… well, let me say: not that helpful. I also hate that other tabs are enabled if nothing is found), oraus.msg, oradebug dumplist and oradebug doc component for any tracing or dumps that could give more information, but nothing seemed to point to any explanation. I dumped the shared pool, but I found the same information (the memory location being permanent memory) as I obtained from x$ksmsp. Also, since the pinatrace annotation takes care of any x$ view that has its source address in the SGA, it seems there is no way other than to puzzle out all the individual locations in what I assume is a structure that holds per strand metadata.

I decided to go through the x$ views one more time, and found an x$ view that I already knew of, which points to both 'strands' and 'kcrf': x$kcrfstrand. However, the reason it is not annotated is that this x$ view has its addresses in the PGA:

SQL> select addr from x$kcrfstrand;

ADDR
----------------
00007F14CE31AE00
00007F14CE31AE00

This means that when querying this x$ view, the results are gathered from wherever they come from (the source can be file headers like redo (x$kccle) and controlfile (x$kcccf) for example; in this case we will see the actual source of x$kcrfstrand is the SGA) and then stored temporarily in PGA, potentially manipulated, and then shown from the PGA, as we can see from the addresses (addr field in x$kcrfstrand).

However, what I can do to investigate where x$kcrfstrand takes its information from, is run the pina (memory access) trace on the execution of 'select * from x$kcrfstrand', and try to see where it takes its data from. This is what the relevant part looks like:

qerfxFetch+1563:0x00007ffee590a638():W:8:0x10ea8781/283805569()
kcrf_strand_ftcb+2:0x00007ffee590a630():W:8:0x7ffee590b130/140732749885744()
kcrf_strand_ftcb+10:0x00007ffee590a620():W:8:0x7f7780569f40/140151230996288()
kcrf_strand_ftcb+23:0x00000000600169b8(Fixed Size(pgsz:2048k)|+92600 fixed sga|var:kcrf_max_strands_+0 ):R:4:0x2/2()
kcrf_strand_ftcb+29:0x00000000600169c8(Fixed Size(pgsz:2048k)|+92616 fixed sga|var:kcrf_actv_pvt_strands_+0 ):R:4:0/0()
kcrf_strand_ftcb+35:0x00007f7780448558():R:4:0/0()
kcrf_strand_ftcb+52:0x0000000060016228(Fixed Size(pgsz:2048k)|+90664 fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrf_strand_ftcb+64:0x00007ffee590a618():W:8:0xcf7cb25/217565989()
_intel_fast_memcpy+7:0x00000000157f62a0():R:8:0x137fff/1277951()
__intel_ssse3_rep_memcpy+160:0x00000000157b9488():R:4:0x8000/32768()
__intel_ssse3_rep_memcpy+184:0x0000000076fdf4a0(shared pool|permanent memor,duration 1,cls perm+16114848 ):R:16:0x00000000200040000/8590196736(??)
__intel_ssse3_rep_memcpy+3438:0x00007f7780448670():W:16:0x0000ec00000000000/4151755906482176()
...
__intel_ssse3_rep_memcpy+9254:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:16:0xfe56e00000fe56e00000/-2305841916834480128()
__intel_ssse3_rep_memcpy+9258:0x00007f7780448560():W:16:0xfe56e00000fe56e00000/-2305841916834480128()

Going from top to bottom through this snippet:
– qerfxFetch is the query execute rowsource function to query x$ tables, kcrf_strand_ftcb is a helper function (kernel cache recovery file strand fetch copy buffer is my guess) to fetch the kcrf strand data from the shared pool structure; the other functions are memory copy functions.
– Interestingly, the kcrf_strand_ftcb function reads offset 0 of kcrfsg_, just like we have seen before. This is quite logical if you think about it: this struct is dynamically allocated in the shared pool, which means it doesn't have a fixed address, so it needs to be looked up every time something must be read from it.
– The next functions are memcpy functions; I've shown the relevant ones. The first memcpy (_intel_fast_memcpy) probably is a helper function that switches to an intel optimised memcpy, __intel_ssse3_rep_memcpy, which takes advantage of intel's SIMD optimisations (SSSE3 means Supplemental Streaming SIMD Extensions 3).
– The memcpy functions at lines 11 and 12 show it’s reading 16 bytes from 0x76fdf4a0, and then writing 16 bytes to 0x7f7780448670.
– 0x76fdf4a0 relates to the struct for the second strand: kcrfsg_+0 = 0x76fdf388, 0x76fdf388 + 296 = 0x76fdf4b0 (the start of the second strand struct), and 0x76fdf4a0 + 16 (the copy size) = 0x76fdf4b0; the copy apparently runs from high to low addresses, since the read at 0x76fdf4a0 happens before the read at 0x76fdf388.
– 0x7f7780448670 is quite probably an address in the process' PGA, which is used for the intermediate representation of the struct's data for displaying it in the x$ view.

Even more interestingly, this confirms that x$kcrfstrand is the externalisation of the struct that is used together with kcrfsg_, and that the calculations that were made previously are correct: kcrfsg_+0 = 0x76fdf388, and that address with 296 added is 0x76fdf4b0, which is the same as 0x76fdf4a0 + 16 (if you look at the above example, the memcpy function copies 16 bytes at a time).

I think it’s pretty safe to assume this kcrf related struct in the shared pool is commonly referred to as ‘KCRFA’ in x$kcrfstrand. Of course we can only guess that the ‘A’ means in kernel cache recovery file, I would say ‘auxiliary’, but is could also very well be the A in ‘strAnd’. It is a structure that clearly was needed when oracle changed the concept of a single log buffer into multiple public redo strands, because identical information of multiple strands need to be stored at the same time, and the information in kcrfsg_ is singular of nature.

Please mind that because of the intermediate representation of the x$ table in PGA, I could not use the trick I showed in my 'drilldown into the logwriter inner working and communication', which is to query a single column in the x$ view and see which memory addresses in this struct are touched; the entire struct is copied into the PGA and processed before it is used to show the contents, even if only a single column of a single row is queried.

Also consider that since the struct data is copied into the process' PGA in order to make this information human readable, not all offsets (in x$kqfco) of the PGA representation might be applicable to the SGA struct. In fact, I think I have proven that the KCRFA per strand metadata width is 296. Now look at the column offsets from x$kqfco for x$kcrfstrand:

SQL> select c.kqfconam, c.kqfcooff, c.kqfcosiz
from x$kqfco c, x$kqfta t
where t.indx = c.kqfcotab
and t.kqftanam = 'X$KCRFSTRAND'
order by c.kqfcooff;

KQFCONAM                         KQFCOOFF   KQFCOSIZ
------------------------------ ---------- ----------
ADDR                                    0          8
INDX                                    0          4
INST_ID                                 0          4
CON_ID                                  0          2
STRAND_HDR_SCN_KCRFA_CLN                0          8
HIGHEST_SCN_KCRFA_CLN                   0          8
PNEXT_BUF_KCRFA_CLN                    32          8
NEXT_BUF_NUM_KCRFA_CLN                 40          4
BYTES_IN_BUF_KCRFA_CLN                 48          8
AVAILABLE_BUFS_KCRFA_CLN               56          4
MEM_BUFS_AVAILABLE_KCRFA_CLN           60          4
STRAND_HEADER_BNO_KCRFA_CLN            72          4
FLAG_KCRFA_CLN                         76          1
PVT_STRAND_STATE_KCRFA_CLN             77          1
HIGHEST_SUBSCN_KCRFA_CLN               78          2
LWN_BUF_NUM_KCRFA_CLN                  80          4
STRAND_NUM_ORDINAL_KCRFA_CLN           84          4
STRAND_INFO_PTR_KCRFA_CLN              88          8
ACTION_KCRFA                          184          4
LOG_RESIDUE_KCRFA                     188          1
LAST_BUF_WRITTEN_KCRFA                216          4
LAST_BUF_GATHERED_KCRFA               220          4
CONS_LOW_BUFNO_KCRFA                  224          4
FIRST_BUF_KCRFA                       240          8
LAST_BUF_KCRFA                        248          8
TOTAL_BUFS_KCRFA                      256          4
STRAND_SIZE_KCRFA                     260          4
LOWBUF_KCRFA                          272          4
HIBUF_KCRFA                           276          4
WRAPHIBUF_KCRFA                       280          4
LOG_SEQ_KCRFA                         284          4
PTR_KCRF_PVT_STRAND                   304          8
INDEX_KCRF_PVT_STRAND                 312          4
SPACE_KCRF_PVT_STRAND                 316          4
TXN_KCRF_PVT_STRAND                   320          2

This shows that the representation of the KCRFA struct in PGA apparently has a width of 322 bytes, while we know the KCRFA struct itself is 296. I am not sure why this is; the most obvious reason would be that some data is already processed in the PGA to make it easier to show. This might also very well be the reason there is a need for copying it to the PGA first, instead of displaying it directly from its shared pool location.

This will be a series of posts about Oracle database redo handling. The database in use is Oracle version 12.2.0.1, with PSU 170814 applied. The operating system version is Oracle Linux Server release 7.4. In order to look into the internals of the Oracle database, I use multiple tools; very simple ones like the X$ views and oradebug, but also advanced ones, quite specifically the intel PIN tools (https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool). One of these tools is 'debugtrace', which produces pretty usable output on its own (an indented list of function calls and returns), for which I essentially filter out some data; another one is 'pinatrace', which does not produce directly usable output, because it provides instruction pointers and memory addresses. However, the information it provides is extremely useful (and detailed!) once you translate the addresses to Oracle function names and memory region names. I have written a tool to do that: https://gitlab.com/FritsHoogland/pinatrace_annotate.git

The blog posts are meant for investigating Oracle database internals. This is a microscopic view into how the database works. I do believe that with a fundamental understanding of the mechanics, you can assess and resolve issues very quickly and easily.

First of all, to immediately show the value of looking into the internals, a clear documentation bug: this entry from the online documentation of 12.2: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/cncpt/memory-architecture.html#GUID-C2AD1BF6-A5AE-42E9-9677-0AA08126864B (not sure how long this link will be valid, Oracle is notorious for changing it) states that the log buffer is a single buffer called the log buffer. In fact, the log buffer is not a single buffer, but a buffer that contains at least two public redo strands, which has been the case since at least Oracle 10g. These strands are externalised in x$kcrfstrand:

SQL> select indx,strand_size_kcrfa, first_buf_kcrfa, last_buf_kcrfa from x$kcrfstrand where last_buf_kcrfa != '00';

      INDX STRAND_SIZE_KCRFA FIRST_BUF_KCRFA  LAST_BUF_KCRFA
---------- ----------------- ---------------- ----------------
         0          67108864 00000000D8000000 00000000DBFFFE00
         1          67108864 00000000DC000000 00000000DFFFFE00

The number of public redo buffers is dependent on the undocumented parameter "_log_parallelism_max". A quick question out on twitter revealed that systems with a high processor count could have a higher public redo strand count.
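
The current value of the parameter can be looked up with the usual query for undocumented parameters (as SYSDBA); please mind this is a maximum, as the name suggests, not necessarily the number of strands that are actually in use:

SQL> select ksppinm, ksppstvl
from x$ksppi i, x$ksppcv v
where i.indx = v.indx
and ksppinm = '_log_parallelism_max';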

The public strands are allocated in the ‘Redo Buffers’ memory allocation:

SQL> select "AREA NAME", "START ADDR", to_char((to_number(rawtohex("START ADDR"),'xxxxxxxxxxxxxxxx')+"SEGMENT SIZE"),'XXXXXXXXXXXXXXXX'),pagesize from x$ksmssinfo; 

AREA NAME                        START ADDR       TO_CHAR((TO_NUMBE   PAGESIZE
-------------------------------- ---------------- ----------------- ----------
Variable Size                    0000000061000000          D8000000    2097152
Redo Buffers                     00000000D8000000          E07C6000    2097152
Fixed Size                       0000000060000000          60A00000    2097152
skgm overhead                    00000000E1000000          E1003000       4096

Each public strand is protected by a redo allocation latch:

SQL> select addr, child#, name, gets, misses, immediate_gets, immediate_misses from v$latch_children where name = 'redo allocation';

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE17F0         20 redo allocation               5          0              0                0
0000000076FE1750         19 redo allocation               5          0              0                0
0000000076FE16B0         18 redo allocation               5          0              0                0
0000000076FE1610         17 redo allocation               5          0              0                0
0000000076FE1570         16 redo allocation               5          0              0                0
0000000076FE14D0         15 redo allocation               5          0              0                0
0000000076FE1430         14 redo allocation               5          0              0                0
0000000076FE1390         13 redo allocation               5          0              0                0
0000000076FE12F0         12 redo allocation               5          0              0                0
0000000076FE1250         11 redo allocation               5          0              0                0
0000000076FE11B0         10 redo allocation               5          0              0                0

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE1110          9 redo allocation               5          0              0                0
0000000076FE1070          8 redo allocation               5          0              0                0
0000000076FE0FD0          7 redo allocation               5          0              0                0
0000000076FE0F30          6 redo allocation               5          0              0                0
0000000076FE0E90          5 redo allocation               5          0              0                0
0000000076FE0DF0          4 redo allocation               5          0              0                0
0000000076FE0D50          3 redo allocation               5          0              0                0
0000000076FE0CB0          2 redo allocation            6168          1           1708                2
0000000076FE0C10          1 redo allocation            7267         20          34804               15

Redo allocation latch child# 1 is the latch protecting allocations in public redo buffer 0, child# 2 is the latch protecting allocations in public redo buffer 1. Two questions that you would probably ask yourself:
1: why are there 20 redo allocation latches whilst there are only two public redo buffers?
2: how can you tell child# 1 is bound to public redo buffer 0, and child# 2 is bound to public redo buffer 1?

Answers to these questions:
1: the 18 other redo allocation latches are because of a mechanism called ‘private redo strands’, see https://blog.tanelpoder.com/files/Oracle_Performance_Scalability_10g_11g.pdf, https://fritshoogland.wordpress.com/2016/11/15/redo-a-blogpost/
2: in order to be sure about what that redo allocation latch protects, you need to look at the actual function calls.

For this, I am using a tool from intel, called ‘pin tools’, which is a suite of tools, of which one is called ‘debugtrace’: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

First I am going to look into how a latch is chosen.

Attach a debugtrace to a normal foregrond session that executes an insert and a commit, and filter the output to make it more readable using a custom filter: https://gitlab.com/snippets/1686145

Now search for the latch address (in lowercase) 0x76fe0c10 (the address of the redo allocation latch, see above query from v$latch_children):

 | | | | | | | | | | | | | | | > kcrfw_redo_gen_ext(0x3, 0x7ffd52f513e0, ...)
 | | | | | | | | | | | | | | | | > kcobrh(0x7ffd52f513e0, 0x3, ...)
 | | | | | | | | | | | | | | | | < kcobrh+0x00000000014b returns: 0x1b4
 | | | | | | | | | | | | | | | | > kcb_get_redo_version(0x3, 0x7ffd52f513a0, ...)
 | | | | | | | | | | | | | | | | < kcb_get_redo_version+0x0000000000c4 returns: 0x1
 | | | | | | | | | | | | | | | | > kcmglscn8(0x7ffd52f508d0, 0, ...)
 | | | | | | | | | | | | | | | | | > kcscu8(0x600113b8, 0x7ffd52f508d0, ...)
 | | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | | | | | | | | | | | | | < kcmglscn8+0x000000000054 returns: 0x756239
 | | | | | | | | | | | | | | | | > kscn_to_ub8_impl(0x7ffd52f50838, 0x7ffd52f508d0, ...)
 | | | | | | | | | | | | | | | | < kscn_to_ub8_impl+0x00000000003e returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fdebe0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)                             <<<<<<<<
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > ksvgcls(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < ksvgcls+0x000000000021 returns: 0
 | | | | | | | | | | | | | | | | > kcscu8(0x60016260, 0x7ffd52f508d8, ...)
 | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0xd8eab838, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0xd8eab848, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0x7ffd52f50848, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > kscnbur3_int_impl(0x756239, 0xd8eab824, ...)
 | | | | | | | | | | | | | | | | < kscnbur3_int_impl+0x000000000030 returns: 0x756239
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756231, 0xd8eab840, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > kslfre(0x76fe0c10, 0xd8eab840, ...)
 | | | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0

If we look at the work of Andrey Nikolaev: https://andreynikolaev.files.wordpress.com/2016/03/latches_and_mutexes_in_oracle_12c.pdf
We know that the first argument of kslgetl is the latch address, and the second argument is 0 for no wait or 1 for willing to wait mode. What becomes clear is that a foreground process gets the redo allocation latch in nowait mode. Of course this is widely known and documented. The return code of 0x1 means the latch is successfully taken. The latch that is fetched before the redo allocation latch of course is the redo copy latch:

SQL> @flba 76fdebe0
ADDR             NAME
---------------- ------------------------
0000000076FDEBE0 redo copy

If we look again at the figures of the two redo allocation latches protecting the public redo strands:

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE0CB0          2 redo allocation            6168          1           1708                2
0000000076FE0C10          1 redo allocation            7267         20          34804               15

It becomes apparent that child# 1 is used way more than child# 2, which is visible in the column ‘immediate_gets’. How does that work? Let’s take the first redo allocation child latch, and issue an insert again with debugtrace on it.

session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x76FE0C10 0 0 2442
Function returned 1

session 2; debugtrace output:

 | | | | | | | | | | | | | | | | > kslgetl(0x76fdebe0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1

First child# 1 is tried, but because the latch is already taken, indicated by return code 0, the session turns to the second latch.

What will happen when both latches are taken?

session 1:

SQL> oradebug call kslgetl 0x76FE0C10 0 0 2442
Function returned 1
SQL> oradebug call kslgetl 0x76FE0CB0 0 0 2442
Function returned 1

This is what it looks like in the debugtrace output:

 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | | | | | | | | | | | | | | | | > kslges(0x76fe0cb0, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | > ksuSdiInProgress(0x2, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | < ksuSdiInProgress+0x000000000035 returns: 0
 | | | | | | | | | | | | | | | | | | > kskthbwt(0, 0, ...)
 | | | | | | | | | | | | | | | | | | < kskthbwt+0x0000000000a1 returns: 0x10
 | | | | | | | | | | | | | | | | | | > kslwlmod(0x7ffc45d6ab00, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | | > kstmgetsectick(0x7ffc45d6ab00, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | | < kstmgetsectick+0x00000000003a returns: 0x5a4bd4b9
 | | | | | | | | | | | | | | | | | | < kslwlmod+0x0000000000cb returns: 0xffffffff
 | | | | | | | | | | | | | | | | | | > kslwtbctx(0x7ffc45d6a988, 0x7a63c588, ...)

What this shows is the foreground session first trying to get the latch of the second redo allocation child (0x76fe0cb0) in immediate mode, which failed, then the latch of the first redo allocation child (0x76fe0c10) in immediate mode, which failed too, after which it had no other option than to get a redo allocation latch in willing to wait mode. For that it performed a redo allocation latch get for the second child latch again, this time with the second argument set to 1, indicating willing to wait mode. Because the latch was still taken, kslgetl called kslges in order to spin for it (a series of retries of getting the latch); on Linux the spinning happens inside kslges, so it is not visible in a debugtrace. After it has spun 20,000 times, the process registers a wait (kslwtbctx), and then goes to sleep on a semaphore in the wait:

 | | | | | | | | | | | | | | | | | | < kslwtbctx+0x00000000050b returns: 0x77c000a4
 | | | | | | | | | | | | | | | | | | > skgpwwait(0x7ffc45d6abf8, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | > kslwait_conv_wait_time(0xffffffff, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | < kslwait_conv_wait_time+0x000000000027 returns: 0xffffffff
 | | | | | | | | | | | | | | | | | | | > sskgpwwait(0x7ffc45d6abf8, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | | > semop@plt(0x98000, 0x7ffc45d6a888, ...)

This is the modern latch implementation, where a session goes to sleep on a semaphore after spinning for a latch in willing to wait mode. In order to learn more about that, look at the work of Andrey Nikolaev. The semaphore implementation means that as soon as the latch becomes available, the session is posted and can do its allocation in the public redo strand.

However, this does not tie a child latch to a public strand; it only shows how a session bunny-hops across the public redo allocation latches in nowait mode until it hits a blind wall (when all of them have been taken), after which it starts waiting in willing to wait mode, waiting to be posted.
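
To make the observed cascade concrete, here is a minimal C sketch of it under my assumptions; the structure (bunny-hop in nowait mode, then spin, then sleep) follows the traces above, but the function names and bodies are my own simplification, not Oracle’s actual code, and which child is tried first appears to vary:

/* sketch of the redo allocation latch cascade as seen in the traces; not Oracle code */
#include <stdbool.h>
#include <stdio.h>

#define NR_CHILD_LATCHES 2
#define SPIN_COUNT 20000

/* a latch word holds 0 when free, or the holder's oracle pid when taken */
static volatile unsigned long child_latch[NR_CHILD_LATCHES];

/* nowait get, like kslgetl(latch, 0): a single atomic compare-and-swap */
static bool latch_get_nowait(volatile unsigned long *latch, unsigned long mypid) {
    return __sync_bool_compare_and_swap(latch, 0UL, mypid);
}

static void sleep_until_posted(void) { /* stand-in for the semop() sleep */ }

static void get_redo_allocation_latch(unsigned long mypid, int start) {
    /* phase 1: bunny-hop the child latches in nowait mode */
    for (int i = 0; i < NR_CHILD_LATCHES; i++)
        if (latch_get_nowait(&child_latch[(start + i) % NR_CHILD_LATCHES], mypid))
            return;
    /* phase 2: all children taken, willing to wait on one child (kslges) */
    volatile unsigned long *latch = &child_latch[start];
    for (;;) {
        for (int spin = 0; spin < SPIN_COUNT; spin++)
            if (latch_get_nowait(latch, mypid))
                return;
        sleep_until_posted(); /* register the wait (kslwtbctx), sleep on a semaphore */
    }
}

int main(void) {
    get_redo_allocation_latch(26, 0); /* 26: the ora pid seen in the traces */
    printf("child latch 0 now held by ora pid %lu\n", child_latch[0]);
    return 0;
}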

In order to see the connection between the redo allocation latches and the actual public strand memory areas, we need to look one level deeper, which means looking at the (main) memory accesses of a process. This can be done using the Intel Pin tools ‘pinatrace’ tool, and adding annotations to the pinatrace output in order to understand what is going on. For this, I created a tool: https://gitlab.com/FritsHoogland/pinatrace_annotate

First, create a pinatrace of a normal insert on a test database. A word of warning: if you think a function call trace generates a lot of detailed data, the pinatrace, alias memory access trace, generates *WAY* more. Even modest things like ‘select count(*) from dual’ generate heaps of data.

Back to the pinatrace of the normal insert: generate the trace, and then annotate it using the pinatrace_annotate tool. This should show that the insert takes redo allocation latch child #1. To validate, search for the latch address again:

kcrfw_redo_gen_ext+1479:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1486:0x00007fff5d88efe8():W:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1493:0x00007fff5d88efc8():R:4:0/0()
kcrfw_redo_gen_ext+1516:0x00007fff5d88efe8():R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )

What this shows is the kcrfw_redo_gen_ext function reading address 0x60016570, which is in the fixed SGA and belongs to a ‘fixed sga variable’ called kcrfsg_, at offset 840; the value read is 0x76fe0c10, which we already know is the address of redo allocation latch child #1. The latch address is then written into address 0x7fff5d88efe8, which is in the process’ PGA, for which there is no annotation (nothing between the round brackets). Later, the latch address in the PGA is read again.

We know the kcrfsg_ variable/struct from my earlier investigation into how the logwriter works. This is a struct in fixed sga that holds data important to kcrf (kernel cache redo file). My current guess is sg means ‘SGA’.

Only a little further in the trace, we see that the kslgetl function is invoked, and indeed it is trying to get this latch:

kslgetl+69:0x0000000076fe0c1f(Variable Size(pgsz:2048k)|+368970783 shared pool|(child)latch#1:redo allocation+15 shared pool|permanent memor,duration 1,cls perm+16120863 ):R:1:0x37/55()

This is what a successful latch get looks like in the trace:

kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):R:8:0/0()
kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):W:8:0x1a/26()

Andrey Nikolaev’s work describes that when a latch is taken, the ora pid is written at offset 0. Here we see the kslgetl function read a 0 at offset 0, indicating the latch is free, after which the value 26, the ora pid of the process, is written into it, which means this process has now taken the latch.
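
Andrey Nikolaev’s work also shows the get itself is done with an atomic compare-and-swap instruction, which would match a read and a write of the same 8 bytes at the same instruction offset (kslgetl+138): a compare-and-swap reads and conditionally writes the location in one operation. A minimal sketch using the GCC builtin; Oracle’s actual instruction sequence may differ:

/* sketch: taking a latch by atomically putting the ora pid at latch offset 0 */
#include <stdint.h>
#include <stdio.h>

static int latch_try_get(volatile uint64_t *latch, uint64_t ora_pid) {
    /* compare the latch word with 0 (free) and, if it is free, write our pid
       into it; this one operation produces the R and the W seen at kslgetl+138 */
    return __sync_bool_compare_and_swap(latch, (uint64_t)0, ora_pid);
}

int main(void) {
    volatile uint64_t latch = 0;
    printf("got latch: %d, holder: %lu\n",
           latch_try_get(&latch, 26), (unsigned long)latch); /* 26 as in the trace */
    return 0;
}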

The next step is to see if there is any indication of it pointing to the public redo strand. It should be close to the latch get, because a latch should be held for as little time as possible. Essentially, what I did was search for ‘Redo Buffers’. I found this a little further on in the kcrfw_redo_gen_ext function:

kcrfw_redo_gen_ext+11277:0x0000000076fdf3bc(shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf3d8(shared pool|permanent memor,duration 1,cls perm+16114648 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf430(shared pool|permanent memor,duration 1,cls perm+16114736 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3d0(shared pool|permanent memor,duration 1,cls perm+16114640 ):R:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11312:0x0000000076fdf428(shared pool|permanent memor,duration 1,cls perm+16114728 ):W:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3c8(shared pool|permanent memor,duration 1,cls perm+16114632 ):R:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11322:0x0000000076fdf420(shared pool|permanent memor,duration 1,cls perm+16114720 ):W:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11327:0x0000000076fdf3c0(shared pool|permanent memor,duration 1,cls perm+16114624 ):R:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11332:0x0000000076fdf418(shared pool|permanent memor,duration 1,cls perm+16114712 ):W:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11337:0x0000000076fdf3b8(shared pool|permanent memor,duration 1,cls perm+16114616 ):R:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11342:0x0000000076fdf410(shared pool|permanent memor,duration 1,cls perm+16114704 ):W:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11347:0x0000000076fdf3b0(shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf408(shared pool|permanent memor,duration 1,cls perm+16114696 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:8:0x7270/29296()
kcrfw_redo_gen_ext+11312:0x0000000076fdf400(shared pool|permanent memor,duration 1,cls perm+16114688 ):W:8:0x7270/29296()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf398(shared pool|permanent memor,duration 1,cls perm+16114584 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf3f0(shared pool|permanent memor,duration 1,cls perm+16114672 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf390(shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11342:0x0000000076fdf3e8(shared pool|permanent memor,duration 1,cls perm+16114664 ):W:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11347:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11352:0x0000000076fdf3e0(shared pool|permanent memor,duration 1,cls perm+16114656 ):W:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x7270/29296()
kcrfw_redo_gen_ext+11386:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):W:4:0x7271/29297()
kcrfw_redo_gen_ext+11392:0x0000000076fdf468(shared pool|permanent memor,duration 1,cls perm+16114792 ):R:8:0xe0000000/3758096384(Redo Buffers(pgsz:2048k)|+134217728 )

There’s a lot to say about this. The first thing to point out is that these are all memory locations in a memory segment in the shared pool, in permanent memory, somewhere around 0x76fdf300 / 0x76fdf400. The first public strand is explicitly mentioned (by my annotation of the memory structures):

kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )

As well as the second public strand:

kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )

I am confident that the reference to the second public strand here is just a coincidence: look at the hexadecimal value that is written, 0xdeadbeef, a well-known ‘not used’ filler value, which simply happens to fall inside the address range of PUB_REDO_1.

The second thing to notice is that a lot of information is written really close together (0x76fdf300-0x76fdf400). The only location I know that holds this kind of redo related information is the previously mentioned kcrfsg_ struct, but this seems to be another location that holds critical information about redo, of which all we know is that it sits in the shared pool (not in the fixed SGA), and that it is allocated as permanent memory.

I tried various things to get more information about this memory location, like dumping the SGA heap, inspecting this memory location in a heap dump to see if there is a sub heap that I could dump for more information (there isn’t), and asking on twitter (which sometimes helps), but I got no information. I spent a considerable amount of time searching the bug database, and still got no answers.

I decided to prove usage of the second public strand by manually obtaining the latch for the first strand, so that an insert has to take the second latch, and then verifying whether that indeed means the second public redo strand is used. This means taking the first latch using oradebug in one session, then starting another database session, attaching a pinatrace to it, and executing an insert.

Once the pinatrace is annotated, search for 0x76fe0c10 (the first redo allocation latch) again. The start is exactly the same: the latch address is fetched from kcrfsg_ at offset 840.

kcrfw_redo_gen_ext+1479:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1486:0x00007ffcca22e6e8():W:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1493:0x00007ffcca22e6c8():R:4:0/0()
kcrfw_redo_gen_ext+1516:0x00007ffcca22e6e8():R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )

Now look at the same offsets in the function kslgetl (+138) again:

kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):R:8:0x27/39()
kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):W:8:0x27/39()

Now we see the latch already has a value at offset zero, 39: the ora pid of the holder, indicating the latch is taken. Now that the session executing the insert has found the first redo allocation latch busy, first this happens:

kcrfw_redo_gen_ext+14267:0x00000000600169c4(Fixed Size(pgsz:2048k)|+92612 fixed sga|var:kcrf_actv_strands_+0 ):R:4:0x1/1()
kcrfw_redo_gen_ext+14273:0x00007ffcca22e6d8():R:4:0x2/2()
kcrfw_redo_gen_ext+14283:0x00000000600169c4(Fixed Size(pgsz:2048k)|+92612 fixed sga|var:kcrf_actv_strands_+0 ):W:4:0x2/2()

Because the latch for the first public redo strand was taken, an additional strand is enabled by setting the fixed sga variable kcrf_actv_strands_ to 2, indicating the second strand is now active too.

Then kcrfsg_ at offset 840 is read again, but a higher value is saved in PGA memory. The latch size is known (160 bytes), so in order to get to the next latch, the session picks up the latch address of child #1 and adds 160 to arrive at child #2 (a small C sketch of this arithmetic follows the traces below).

kcrfw_redo_gen_ext+14335:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+14342:0x00007ffcca22e6e8():W:8:0x76fe0cb0/1996360880(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 )

This latch is free, and is taken by this session:

kslgetl+138:0x0000000076fe0cb0(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 ):R:8:0/0()
kslgetl+138:0x0000000076fe0cb0(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 ):W:8:0x1a/26()
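
Here is a tiny C sketch of that pointer arithmetic; the addresses are the ones from the traces, and 160 is the latch size mentioned above:

/* sketch: deriving the child #2 latch address from child #1, stride 160 bytes */
#include <stdio.h>
#include <stdint.h>

#define LATCH_SIZE 160 /* child latch stride in this version, as noted above */

int main(void) {
    uintptr_t child1 = 0x76fe0c10;          /* read from kcrfsg_ + 840 */
    uintptr_t child2 = child1 + LATCH_SIZE; /* the next child latch */
    printf("child #2 latch address: 0x%lx\n", (unsigned long)child2); /* 0x76fe0cb0 */
    return 0;
}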

Now let’s look at the same function locations with the first redo allocation latch taken, and the second redo allocation latch acquired:

kcrfw_redo_gen_ext+11277:0x0000000076fdf4e4(shared pool|permanent memor,duration 1,cls perm+16114916 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf500(shared pool|permanent memor,duration 1,cls perm+16114944 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf558(shared pool|permanent memor,duration 1,cls perm+16115032 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4f8(shared pool|permanent memor,duration 1,cls perm+16114936 ):R:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11312:0x0000000076fdf550(shared pool|permanent memor,duration 1,cls perm+16115024 ):W:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4f0(shared pool|permanent memor,duration 1,cls perm+16114928 ):R:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11322:0x0000000076fdf548(shared pool|permanent memor,duration 1,cls perm+16115016 ):W:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11327:0x0000000076fdf4e8(shared pool|permanent memor,duration 1,cls perm+16114920 ):R:8:0x800/2048()
kcrfw_redo_gen_ext+11332:0x0000000076fdf540(shared pool|permanent memor,duration 1,cls perm+16115008 ):W:8:0x800/2048()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4e0(shared pool|permanent memor,duration 1,cls perm+16114912 ):R:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11342:0x0000000076fdf538(shared pool|permanent memor,duration 1,cls perm+16115000 ):W:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4d8(shared pool|permanent memor,duration 1,cls perm+16114904 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf530(shared pool|permanent memor,duration 1,cls perm+16114992 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:8:0xe6/230()
kcrfw_redo_gen_ext+11312:0x0000000076fdf528(shared pool|permanent memor,duration 1,cls perm+16114984 ):W:8:0xe6/230()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4c8(shared pool|permanent memor,duration 1,cls perm+16114888 ):R:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf4c0(shared pool|permanent memor,duration 1,cls perm+16114880 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf518(shared pool|permanent memor,duration 1,cls perm+16114968 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4b8(shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11342:0x0000000076fdf510(shared pool|permanent memor,duration 1,cls perm+16114960 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4b0(shared pool|permanent memor,duration 1,cls perm+16114864 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11352:0x0000000076fdf508(shared pool|permanent memor,duration 1,cls perm+16114952 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11366:0x0000000076fdf560(shared pool|permanent memor,duration 1,cls perm+16115040 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0xe6/230()
kcrfw_redo_gen_ext+11386:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):W:4:0xe7/231()
kcrfw_redo_gen_ext+11392:0x0000000076fdf590(shared pool|permanent memor,duration 1,cls perm+16115088 ):R:8:0xe0200000/3760193536(Redo Buffers(pgsz:2048k)|+136314880 )

Here we see reads and writes in the same fashion, including the 0xdeadbeef write, but now we see PUB_REDO_1, alias the second public redo strand. This proves that the second redo allocation child latch really protects the second public redo strand!

This post is an introduction to pinatrace annotate version 2, which is a tool to annotate the output of the Intel Pin tools ‘pinatrace’ tool.

The pinatrace tool generates a file with every single memory access of a process. Please realise what this means: this is every single read or write of main memory performed by the CPU on behalf of the process. This allows you to get an understanding of what happens within a C function: you can determine what information or data is accessed in which function. Needless to say, this is a tool for internals investigations and research, not something for normal daily database maintenance and support. Also, the performance of the process that you attach to is severely impacted, and tracing can only be turned off by stopping the process. Do not use this on a production database; use it at your own risk, for research and investigational purposes only.

The general way to use the pin tool is to run something together with pin, for example ‘ls’:

$ ./pin -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -- ls

This generates an output file that looks like this:

#
# Memory Access Trace Generated By Pin
#
0x00007f49de60f173: W 0x00007ffd160c46f8  8     0x7f49de60f178
0x00007f49de60f840: W 0x00007ffd160c46f0  8                  0
0x00007f49de60f844: W 0x00007ffd160c46e8  8                  0
0x00007f49de60f846: W 0x00007ffd160c46e0  8                  0
0x00007f49de60f848: W 0x00007ffd160c46d8  8                  0
0x00007f49de60f84a: W 0x00007ffd160c46d0  8                  0
0x00007f49de60f84f: W 0x00007ffd160c46c8  8                  0
0x00007f49de60f85f: R 0x00007f49de82fe00  8                0xe
0x00007f49de60f866: W 0x00007f49de82fbd8  8      0x5f25f53b3dc
0x00007f49de60f877: R 0x00007f49de82ff90  8           0x221e00
0x00007f49de60f881: W 0x00007f49de8309a8  8     0x7f49de82fe00
0x00007f49de60f888: W 0x00007f49de830998  8     0x7f49de60e000

This in itself is not very helpful. Let me explain what you are seeing: the first column is the instruction pointer address, in other words the place in the executable at which the memory transfer was initiated; ‘W’ and ‘R’ are self-explanatory; the third column is the memory address the read or write is addressed to; the fourth column is the size of the memory IO; and the fifth is the value.
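
As an illustration of these five fields, here is a minimal C sketch that parses one such line; this is just my own illustration of the format described above, not part of the pin tools:

/* sketch: parse one pinatrace output line into its five fields */
#include <stdio.h>

int main(void) {
    const char *line = "0x00007f49de60f85f: R 0x00007f49de82fe00  8                0xe";
    unsigned long ip, addr, value;
    char rw;
    int size;
    /* instruction pointer, R/W, memory address, IO size, value */
    if (sscanf(line, "%lx: %c %lx %d %lx", &ip, &rw, &addr, &size, &value) == 5)
        printf("ip=0x%lx %c addr=0x%lx size=%d value=0x%lx\n",
               ip, rw, addr, size, value);
    return 0;
}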

Luckily, there is another way to run pin: attaching to an already running process. This is what makes the memory access trace usable with an Oracle process. The way this is done is quite simple: first get the Linux process id, then run pin specifying that pid:

$ ps -ef | grep [L]OC
oracle    7250  7249  0 09:42 ?        00:00:00 oracletest (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ ./pin -pid 7250 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

There is an important caveat here: because pin inserts itself into the running process, the pin library (source/tools/SimpleExamples/obj-intel64/pinatrace.so in this case) must be available to that process. So if you downloaded, extracted and linked the pin tools as root, and then as root try to attach to an oracle process, it will not work, because the oracle process cannot reach the library. So: extract and link the pin tools as oracle.

The output of the pinatrace (a file called pinatrace.out by default) is in $ORACLE_HOME/dbs when you attach to an Oracle database process.

Back to the pinatrace output: what if we could see the C function name for the instruction pointer address, the name of the Oracle memory area for the memory address, and, for the fifth column, the name of the Oracle memory area again in case the value is actually a pointer? That is exactly what my pinatrace annotate oracle tool does!

How to use it: first clone the pinatrace_annotate repository:

git clone https://gitlab.com/FritsHoogland/pinatrace_annotate.git

Then set the O_BINARY variable to your oracle binary at line 6 of pinatrace_annotate_oracle.sh.

The next thing is to generate the Oracle memory area CSV files. I added a script to generate the memory ranges for the PGA too; however, you should be aware that PGA contents are highly volatile, and that v$process_memory_detail only gets populated after the oracle pid you point it to has executed something. Sometimes I skip the PGA detail step, because it doesn’t add too much value.
To run the scripts for generating the memory area CSV files:

$ sqlplus / as sysdba
...
SQL> @get_memory_ranges
...
	76 14,20,@1	   8645 		    SYS 			   sqlplus@memory-presentation.local (TNS V1-V3)
...
enter pid number to get pga details: 76
old   1: alter session set events 'immediate trace name pga_detail_get level &pid'
new   1: alter session set events 'immediate trace name pga_detail_get level 76'

Session altered.

execute something (eg. select * from dual) in the target process, then press enter

Now head over to the session you want the PGA memory details from, execute something like ‘select * from dual’, and press enter in the sqlplus session in which you ran get_memory_ranges.sql. This will then run for a while; it generates pipe-delimited files that sqlite can parse.

After the CSV files have been generated, create the sqlite database for use by the tool:

$ ./pinatrace_annotate_oracle.sh -c
generate symbols from /u01/app/oracle/product/12.2.0.1/dbhome_1/bin/oracle.
create and populate table symbol_ranges with symbols.
create index.
create table memory_ranges.
import memory_ranges.csv
import memory_ranges_xtables.csv
import memory_ranges_pga.csv
create index.

Now the pinatrace_annotate_oracle.sh tool is all set for usage!

Example usage
Let’s use an actual example to see where the memory trace provides benefit: latches. Nowadays, latches are not the issue they were in the Oracle 8 days. In those days, the latches protecting the database buffer doubly linked lists, the ‘cache buffers chains’ latches, could not be taken in shared mode. That meant that especially index root blocks could suffer latch contention under high concurrency.

Some great work has been done on the internals of latches by Andrey Nikolaev; see his latch internals presentation. His work was done on Solaris with DTrace; I did some of that on Linux (part 1, part 2 and part 3).

Let’s replay a willing to wait latch get and see what that looks like in the annotated pinatrace output! Set up the host with the pin tools (link them to get the pinatrace.so library), and clone pinatrace_annotate and configure it to point to the oracle binary you are using.

1. session 1:

$ sqlplus / as sysdba
SQL> select distinct sid from v$mystat;

2. session 2:

$ sqlplus / as sysdba
SQL> @get_memory_ranges

Identify session 1, and fill out its pid.
Take note of the O/S pid.
It will ask you to execute something in the session for which the pid was entered.

3. session 1:

SQL> select * from dual;

4. session 2:
press enter, and wait for @get_memory_ranges to finish.

SQL> exit
$ ./pinatrace_annotate_oracle.sh -c
$ sqlplus / as sysdba
SQL> select addr from v$latch where name = 'cache table scan latch';

note the latch address, it’s 0x6004A400 in my case.

SQL> oradebug setmypid
SQL> oradebug call kslgetl 0x6004A400 0 0 2442

result: ‘Function returned 1’ indicating a successful latch get.

5. session 1:

SQL> oradebug setmypid

6. session 3:
Go to the directory where pin was extracted and linked, which is where the pin executable is located. Execute the following, changing the pid to the O/S pid obtained in step 2.

$ ./pin -pid 8645 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -o pina_latch.txt

Now the memory trace is enabled for pid 8645.

7. session 1:

SQL> oradebug call kslgetl 0x6004A400 1 0 2442

This will hang; it is waiting for the latch, which is taken by session 2.

8. session 2:

SQL> oradebug call kslfre 0x6004A400

this will return ‘Function returned 0’ indicating it freed the latch.

9. session 1:
the oradebug call will now return ‘Function returned 1’ indicating it has gotten the latch.

10. session 3:

$ cp /u01/app/oracle/product/12.2.0.1/dbhome_1/dbs/pina_latch.txt ~/pinatrace_annotate/pina_latch.txt

As indicated, the pinatrace output file will be in $ORACLE_HOME/dbs; with this I copy the current state of the trace to the pinatrace_annotate directory, where my pinatrace annotate tool is cloned.

$ cd ~/pinatrace_annotate
$ ./pinatrace_annotate_oracle.sh pina_latch.txt > pina_latch_annotated.txt

This will run for a while…

Once pinatrace_annotate_oracle.sh is done, open the pina_latch_annotated.txt file. A short guide on what this looks like; the first two lines in my file are:

function+offset:memory address(annotation+offset):Read/Write:size:value hex/dec(pointer annotation+offset)
??:0x00007fff224e6498():R:8:0x11060785/285607813()

The first line is the header, which is displayed once, and essentially explains it all. The second line starts with ‘??’. As indicated by the header, the first field is the function. Any call to a function (‘symbol’) that is not in the oracle executable, like functions in shared libraries, does not have an address in the executable. In fact, the place at which the shared libraries are loaded, and thus where their functions are located, is randomised by default, which explains why the executable does not contain the address.

The second column is the address memory is read from or written into. In the second line, there is nothing in between the parentheses, meaning this address is not in a known memory range.

The ‘R’ or ‘W’ is obvious, and the number is the amount of memory read or written, in bytes.

The fifth field shows the content of the read or write in hexadecimal and in decimal. Here again the parentheses are empty. I look up the value in the memory areas, and if the value falls inside a known memory area, I add that area, including the offset, as annotation. What you should realise is that I cannot know whether a value really is a pointer, or whether it just coincidentally matches a known memory area. I do think knowing that a value falls in a known memory area is handy and useful, since pointers are heavily used in C.
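
The pointer annotation can be sketched as a simple range lookup: given the known memory areas (as collected by the get_memory_ranges script), check whether a value falls inside one and compute the offset. The range bounds below are illustrative placeholders I made up, not the real SGA layout:

/* sketch: annotate a value if it falls inside a known memory range */
#include <stdio.h>
#include <stdint.h>

struct mem_range { uintptr_t start, end; const char *name; };

static const struct mem_range ranges[] = {
    { 0x60000000, 0x60400000, "Fixed Size (fixed sga)" },     /* illustrative bounds */
    { 0x76000000, 0x7b000000, "Variable Size (shared pool)" },
    { 0xd8000000, 0xe0000000, "Redo Buffers" },
};

static void annotate(uintptr_t value) {
    for (unsigned i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
        if (value >= ranges[i].start && value < ranges[i].end) {
            printf("0x%lx(%s+%lu)\n", (unsigned long)value, ranges[i].name,
                   (unsigned long)(value - ranges[i].start));
            return;
        }
    printf("0x%lx()\n", (unsigned long)value); /* empty parentheses: unknown */
}

int main(void) {
    annotate(0x76fe0c10);    /* falls in the example shared pool range */
    annotate(0x7fff224e4ec0); /* unknown: prints empty parentheses */
    return 0;
}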

Another example:

nsbasic_brc+548:0x00007f33d3026690(pga|Other, Fixed Uga, Fixed UGA heap+88712 ):R:8:0x7f33d3019600/139860560156160(pga|Other, Fixed Uga, Fixed UGA heap+35320 )

Here we see an example where the function is a function in the oracle executable, nsbasic_brc, which at offset 548 read 8 bytes from memory address 0x00007f33d3026690. That address was in a known memory area at the time of running 0_get_pga_detail.sql (when ‘select * from dual’ was run): the PGA, in the Fixed UGA heap at offset 88712. The value it read was 0x7f33d3019600, which is a memory address in the same chunk of memory at offset 35320, and therefore likely to be a pointer.

Now let’s look further into the annotated pinatrace file. If you search for ‘kslgetl’, you will find the actual latch get executed with oradebug. This is what that looks like on my system:

kslgetl+2:0x00007fff224e4ec0():W:8:0x7fff224e4ff0/140733768945648()
kslgetl+13:0x00007fff224e4ea0():W:8:0x7fff224e5870/140733768947824()
kslgetl+17:0x00007f33d8095f40(pga|Other, free memory, top call heap+243776 ):R:8:0x7f33d8095f40/139860644552512(pga|Other, free memory, top call heap+243776 )
kslgetl+26:0x00007fff224e4e98():W:8:0x1/1()
kslgetl+33:0x00007fff224e4e90():W:8:0/0()
kslgetl+37:0x00007fff224e4ea8():W:8:0x4/4()
kslgetl+44:0x00007f33d8087e78(pga|Other, free memory, top call heap+186232 ):R:8:0x7a63d268/2053362280(shared pool|permanent memor,duration 1,cls perm+5702248 )
kslgetl+58:0x00007fff224e4eb0():W:8:0x7fff224e57d0/140733768947664()
kslgetl+65:0x00007fff224e4e60():W:8:0/0()
kslgetl+69:0x000000006004a40f(fixed sga|(parent)latch:cache table scan latch+15 fixed sga|var:kcb_table_scan_latch_+15 ):R:1:0x37/55()
kslgetl+80:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):R:8:0/0()
kslgetl+96:0x000000006004a410(fixed sga|(parent)latch:cache table scan latch+16 fixed sga|var:kcb_table_scan_latch_+16 ):R:4:0x1/1()
kslgetl+103:0x000000007a63d5f0(shared pool|X$KSUPR.KSLLALOW+0 shared pool|permanent memor,duration 1,cls perm+5703152 ):R:4:0/0()
kslgetl+109:0x0000000011141524():R:4:0x1cffe/118782()
kslgetl+122:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0x6004a400/1610916864(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 )
kslgetl+131:0x00007f33d8087ea8(pga|Other, free memory, top call heap+186280 ):R:8:0x4b/75()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):W:8:0x22/34()
kslgetl+1898:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0/0()
kslgetl+1909:0x00000000600119b4(fixed sga|var:kcbccc_+0 ):R:1:0/0()
kslgetl+1938:0x00007fff224e4e60():R:8:0/0()
kslgetl+1942:0x00007fff224e4e38():W:8:0x10a6084b/279316555()

The two kslgetl+138 lines show that the function read the beginning of the latch memory and found a number: the oracle process id of the process that has already taken the latch. In other words, this process found a latch that had already been taken by another process, oracle pid 34.

If you scroll down further you see that kslgetl called kslges, which is what happens for a willing to wait latch get when the latch is already taken, and then, further down, you will see this pattern:

kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()

This is what spinning for the latch looks like from a memory access perspective. It means that spinning on a latch does not mean the entire latching function (kslgetl or kslges) is called again to try to take the latch; rather, when kslgetl finds the latch taken, it calls kslges, which does some housekeeping and then, in a really tight loop, reads the latch’s memory at offset 0 to detect when it is freed. We can see it is a loop by looking at the offset numbers within the function: these are 930, 937 and 944, the offsets we see over and over.
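
A sketch of what such a tight loop could look like in C; the three repeating offsets (930, 937, 944) suggest two bookkeeping writes and one read of the latch word per iteration, and the variable names here are my own guesses, not Oracle’s:

/* sketch: kslges-style spin, repeatedly reading the latch word at offset 0 */
#include <stdint.h>

static int spin_for_latch(volatile uint64_t *latch, int max_spins) {
    uint32_t scratch; /* stand-ins for the two stack writes per iteration */
    uint8_t  flag;
    for (int i = 0; i < max_spins; i++) {
        scratch = 0;     /* kslges+930: 4-byte stack write */
        flag    = 0;     /* kslges+937: 1-byte stack write */
        if (*latch == 0) /* kslges+944: read the latch word at offset 0 */
            return 1;    /* latch looks free: go (re)try the atomic get */
    }
    (void)scratch; (void)flag;
    return 0;            /* still taken after max_spins: go sleep */
}

int main(void) {
    volatile uint64_t latch = 0; /* free, so the first read already succeeds */
    return spin_for_latch(&latch, 20000) ? 0 : 1;
}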

In fact, if I grep for that line and count the number of lines, we can see this version of Oracle (12.2.0.1) still spins 20,000 times for a latch in willing to wait mode:

grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | wc -l
20001

Yes, this shows 20,001 reads; remember we freed the latch in the other session?

$ grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | tail -3
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0/0()

The last line shows kslges finding the latch in a free state, which is after it has been woken from sleeping on a semaphore.

A modern version of Oracle doesn’t sleep for an arbitrary time and then spin again; it goes to sleep on a semaphore after it has spun 20,000 times, and then waits until it gets posted. That can be seen in the pinatrace output too, but if you want to look at the function flow, it might be better to generate a debugtrace: the pinatrace shows what information is used within a function, and might be too verbose. In fact, the function flow towards the semaphore call is kslges > skgpwwait > sskgpwwait.
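
The final sleep, visible as semop@plt in the debugtrace shown earlier, can be sketched with a plain System V semaphore; this shows the generic sleep/post mechanism, not Oracle’s exact semaphore setup:

/* sketch: sleeping on and posting a System V semaphore, as in skgpwwait/sskgpwwait */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

static void sem_sleep(int semid) {
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    semop(semid, &op, 1); /* blocks until another process posts */
}

static void sem_post_waiter(int semid) {
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
    semop(semid, &op, 1); /* wakes up the sleeping process */
}

int main(void) {
    int semid = semget(IPC_PRIVATE, 1, 0600); /* one-semaphore set for the sketch */
    sem_post_waiter(semid);                   /* post first, so the sleep returns */
    sem_sleep(semid);
    semctl(semid, 0, IPC_RMID);               /* clean up the semaphore set */
    return 0;
}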

Conclusion
pinatrace_annotate_oracle.sh is a tool to annotate a tracefile generated by the pin tool ‘pinatrace’ when running the oracle binary. It shows information about every memory access done by a process, which allows you to understand what information is obtained in which function of Oracle. This is a tool for research and debug purposes only.

Recently my interest in the ‘automatic big table caching’ feature, introduced in Oracle version 12.1.0.2, was triggered by Roger Macnicol’s blogposts about Oracle database IO and caching or not caching (https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-1 and https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-2). If you want to read something about the feature in general, search for the feature name; you’ll find several blogposts about it.

If you are not familiar with automatic big table caching, it’s a feature which allows you to dedicate a percentage of the buffer cache to scanning big tables. Big tables are tables that are not small :-), which technically means tables bigger (a higher number of blocks) than the “_small_table_threshold” parameter, which defaults to 2% of the buffer cache or 20 blocks, whichever is bigger (a small sketch of that arithmetic follows below). The choice to store blocks in the big table cache area depends on the segment “temperature”, which essentially is a kind of touch count. There is one other caveat of using this feature, which I currently find weird: once a database is operating in clustered mode (RAC), serial scans cannot use automatic big table caching anymore; it is then used exclusively for caching blocks scanned by parallel query workers.
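
This is my reading of the documented default, expressed as a tiny C calculation; the variable names and example buffer cache size are mine:

/* sketch: _small_table_threshold default: 2% of the buffer cache, minimum 20 blocks */
#include <stdio.h>

int main(void) {
    long buffer_cache_blocks = 100000; /* example: buffer cache size in blocks */
    long threshold = buffer_cache_blocks * 2 / 100;
    if (threshold < 20)
        threshold = 20;
    printf("_small_table_threshold: %ld blocks\n", threshold); /* 2000 here */
    return 0;
}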

The purpose of this blogpost is to investigate the IO implementation of using the big table caching feature. The available general read code paths that I am aware of in Oracle are:
– Single block reads, visible via the wait event ‘db file sequential read’, which always go to the buffer cache.
– Multiple single block reads, visible via the wait ‘db file parallel read’, which always go to the buffer cache.
– Multi block reads, visible via the wait event ‘db file scattered read’, which always go to the buffer cache.
– Multi block reads, visible via the wait event ‘direct path read’, which go to the process’ PGA.
– Multi block reads, visible via the wait event ‘direct path read’, which go to the in-memory area.

For full scans of a segment, essentially the choices are to go buffered, or go direct. The question that I asked myself is: what code path is used when the automatic big table caching feature is used?

The reason for asking myself this is that the difference between a cached multi block read and a direct path read is significant. The significance of this decision lies in whether multiple IO requests can be submitted for multiple extents, leveraging asynchronous IO, which is possible in the direct path read code path, or whether IO requests can only be submitted for a single extent, which is true for the db file scattered read code path. To summarise the essence: can I only execute a single IO request (scattered read) or multiple IO requests (direct path read)? A libaio sketch of that difference follows.
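
To illustrate the difference, here is a minimal libaio sketch that submits two read requests in one io_submit call, which is the kind of concurrency the direct path read code path can exploit; a buffered scattered read is limited to one request (one extent) at a time. This is generic libaio usage under my assumptions (the file name is a placeholder), not Oracle’s code; compile with -laio:

/* sketch: two concurrent reads via libaio, as direct path reads can do */
#define _GNU_SOURCE /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>

#define IO_SIZE (1024 * 1024)

int main(void) {
    io_context_t ctx = 0;
    io_setup(2, &ctx); /* room for two in-flight requests */

    int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT); /* placeholder file */
    void *buf1, *buf2;
    posix_memalign(&buf1, 4096, IO_SIZE); /* O_DIRECT requires aligned buffers */
    posix_memalign(&buf2, 4096, IO_SIZE);

    struct iocb cb1, cb2, *cbs[2] = { &cb1, &cb2 };
    io_prep_pread(&cb1, fd, buf1, IO_SIZE, 0);       /* first extent */
    io_prep_pread(&cb2, fd, buf2, IO_SIZE, IO_SIZE); /* second extent, in flight too */
    io_submit(ctx, 2, cbs); /* both requests submitted at once */

    struct io_event events[2];
    io_getevents(ctx, 2, 2, events, NULL); /* reap both completions */
    io_destroy(ctx);
    return 0;
}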

The answer is simple to find: enable the big table caching feature, and read a table that is considered big while the big table cache is empty. A table that is read into the big table cache area shows the ‘db file scattered read’ wait event. It gets interesting when a table is scanned that does not fit in the big table cache area: when a big table is not read into the big table cache area, the wait event switches to ‘direct path read’. Or, more precisely: the general rules for choosing between buffered and direct path reads are applied, potentially resulting in a direct path read.

I want to emphasise again the importance of the IO code path decision. If you have an Oracle database server that uses modern high performance hardware for its disk subsystem, not choosing the correct multi block read code path can severely impact throughput.

For example, if your IO subsystem has an average IO time of 1ms, the maximal theoretical throughput for buffered IO is 1/0.001*1MB = 1GBPS (in reality this will be way lower). Direct path reads start off with 2 concurrent IO requests, which obviously means double the throughput of buffered IO, and this can, depending on an Oracle internal resource calculation, grow up to 32 IO requests at the same time.
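
The arithmetic spelled out, under the assumptions stated above (1ms per IO, 1MB per request, up to 32 concurrent direct path requests); these are theoretical ceilings, not measured numbers:

/* sketch: theoretical throughput of serial buffered IO vs. concurrent direct path IO */
#include <stdio.h>

int main(void) {
    double io_time_s   = 0.001; /* 1ms average IO time */
    double io_size_mb  = 1.0;   /* 1MB per multiblock read */
    double serial_mbps = io_size_mb / io_time_s; /* one IO outstanding at a time */
    printf("buffered (1 outstanding IO):  %.0f MBPS\n", serial_mbps); /* 1000 */
    for (int outstanding = 2; outstanding <= 32; outstanding *= 2)
        printf("direct path (%2d outstanding): %.0f MBPS\n",
               outstanding, serial_mbps * outstanding); /* 2000 .. 32000 */
    return 0;
}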
