Archive

Oracle

While investigating how Oracle works with regard to waiting, I discovered an oddity. I was researching for my redo blog posts, and found that in a certain case, Oracle uses the ‘nanosleep’ system call. As the name suggests, this call allows you to let a process sleep with nanosecond precision.

The oddity I found was the following:

$ strace -Tp 42
...
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
...

I executed strace with the ‘-T’ argument (show time spent in system calls) and ‘-p’ to attach it to the process I want to measure. The nanosleep function was called with a struct timespec set to {0, 1000}, which means 0 seconds and 1000 nanoseconds. However, the actual time spent was between 0.000191 and 0.000253 seconds, which is 191000 and 253000 nanoseconds. That’s approximately 200 times longer than it was set to wait!

Since then I have spent a fair amount of time trying to understand why this is happening. So far I have found a number of facts that I think explain it, but if anyone else has something to throw in here, please leave a comment.

The first thing I did was create a C program that just runs nanosleep in a loop, so I can run the nanosleep system call in isolation. Here’s the nanosleep PoC code (measure_time.c):

#include <stdio.h>
#include <time.h>
int main(int argc, char **argv)
{
  int       loop_counter, loop_total;
  struct    timespec sleep;

  /* request a sleep of 0 seconds and 1000 nanoseconds */
  sleep.tv_sec=0;
  sleep.tv_nsec=1000;

  loop_total=10;
  loop_counter=0;

  while ( loop_counter < loop_total ) {
    nanosleep(&sleep,NULL);
    loop_counter++;
  }
  return 0;
}
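
This program can be compiled using ‘gcc measure_time.c -o measure_time’.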

Then, while researching, I found that there is a system call that shows the actual clock resolution! I created another very small C program to run just that (measure_resolution.c):

#include <stdio.h>
#include <time.h>
int main(int argc, char **argv)
{
  struct    timespec resolution;
  clockid_t clk_id;
  clk_id = CLOCK_MONOTONIC;

  /* ask the kernel for the resolution of the monotonic clock */
  clock_getres(clk_id, &resolution);

  printf("Resolution: %ld s, %ld ns\n", resolution.tv_sec, resolution.tv_nsec);
  return 0;
}

This C program can be compiled using ‘gcc measure_resolution.c -o measure_resolution’. This is what it showed:

$ ./measure_resolution
Resolution: 0 s, 1 ns

So, my system has a 1 ns clock resolution, despite my nanosleep system calls taking way longer. A little later I found out this information can also be obtained directly from /proc/timer_list:

$ grep resolution /proc/timer_list | uniq
  .resolution: 1 nsecs

The first thing I found while researching is that when I change the priority of the process (not the requested 1000ns), I can get lower sleeping times:

# strace -T chrt --rr 99 ./measure_time
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

The lowest time here is 0.036ms (36000ns). In order to look further, I found that newer kernel versions have an addition to perf that can measure wait time, scheduler delay (‘run queue’) and actual run time: perf sched timehist. So at this point it seemed this was a way to understand more about the timing of nanosleep. In order to look at that I created a virtual machine with Fedora 27 (kernel version 4.14.13), and compiled my measure_time.c program on it.

The next thing to do is run the measure_time executable with perf sched record:

# perf sched record chrt --rr 99 ./measure_time
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.183 MB perf.data (54 samples) ]

Then run perf sched timehist to look at the data and find the pid of the measure_time executable:

# perf sched timehist
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.760817 [0000]  <idle>                              0.000      0.000      0.000
   37745.760823 [0000]  rcu_sched[8]                        0.000      0.000      0.005
   37745.762367 [0002]  perf[2777]                          0.000      0.000      0.000
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.762431 [0002]  migration/2[21]                     0.000      0.001      0.011
   37745.762479 [0003]  <idle>                              0.000      0.000      0.000
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764108 [0003]  <idle>                              1.583      0.000      0.045
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005

So it’s pid 2778, now add that to perf sched timehist:

# perf sched timehist -p 2778
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005
   37745.764153 [0003]  measure_time[2778]                  0.034      0.002      0.004
   37745.764195 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764236 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764291 [0003]  measure_time[2778]                  0.050      0.002      0.004
   37745.764347 [0003]  measure_time[2778]                  0.051      0.002      0.004
   37745.764405 [0003]  measure_time[2778]                  0.052      0.002      0.004
   37745.764478 [0003]  measure_time[2778]                  0.067      0.006      0.005
   37745.764506 [0003]  measure_time[2778]                  0.022      0.002      0.005
   37745.764581 [0003]  measure_time[2778]                  0.022      0.002      0.052

The way to look at this: first perf is run. Because it just started running, there is no wait time; the task gets the state running, then it needs to be scheduled to run. That time is in ‘sch delay’, obviously scheduling delay. Then it truly is running and runs for some time, which is visible in the ‘run time’ column. Next, the measure_time executable is run.

The first time measure_time runs, it needs to do some initialisation, like the dynamic loader loading libc, allocating memory (mmap calls), etcetera. This is visible in the run time of 1.583 ms. However, then it runs into the nanosleep call. The nanosleep call puts the process into an interruptible sleep. The sleep is visible in the wait time column: here we see it’s waiting/sleeping for between 0.022 and 0.067 ms. After waiting, the process gets the status running, after which the scheduler must schedule the task and make it run; that time is in the ‘sch delay’ column again. Then it runs, but all it needs to do is end the nanosleep call, add one to the loop_counter variable, jump back and enter the nanosleep call again; as we can see, that only takes 0.004 ms.

Okay, so far it was quite a geeking-out session, but no real answer has been found to why a 1000ns sleep turns out to be way longer. I sought out help via social media, and got some sane advice from Kevin Closson. Essentially, the advice was to run it on physical hardware instead of my test virtual machine, and make the load as low as possible so the scheduler can schedule my process as quickly as it can. Yes, very obvious actually.

So, the next target is a physical machine in our Enkitec lab. This is an X3 generation machine (Sandy Bridge EP). I compiled measure_time.c, and ran ‘strace -T chrt --rr 99 ./measure_time’ again:

nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

That’s way better! But the nanosleep timing still is approximately 13000ns, instead of the 1000ns the call is set to sleep.

At this point I had no clue, and obviously needed to rethink it again; the time the call is supposed to sleep and the wallclock time I measured are still quite far apart. On one hand you could argue that the time it is set to wait is so small that the difference could simply be caused by the code in the nanosleep function itself; on the other hand, it’s not even close.

Then I got a eureka moment: strace uses the ptrace facilities to attach to the process, and ptrace is known to slow processes down quite severely. Hat tip to Martin Bach, who contacted me with this at the very same moment I realised it. So then the question becomes: can I measure nanosleep in another way?
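
One thing worth noting is that the sleep duration can also be measured from user space without any tracer attached. The following is a minimal sketch (my own illustration, not one of the test programs used in this investigation) that brackets each nanosleep call with clock_gettime(CLOCK_MONOTONIC), so the only extra cost is two clock reads per iteration:

#include <stdio.h>
#include <time.h>

int main(void)
{
  struct timespec sleep_time = { 0, 1000 };   /* 0 s, 1000 ns */
  struct timespec start, end;
  int i;

  for (i = 0; i < 10; i++) {
    clock_gettime(CLOCK_MONOTONIC, &start);
    nanosleep(&sleep_time, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    /* print the measured wall clock duration of this nanosleep call */
    printf("nanosleep took %ld ns\n",
           (end.tv_sec - start.tv_sec) * 1000000000L +
           (end.tv_nsec - start.tv_nsec));
  }
  return 0;
}

Compile it the same way as the other programs; on older glibc versions you may need to add ‘-lrt’ for clock_gettime.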

Then I thought about the Linux kernel ftrace facilities. In fact, the physical machines at my disposal in the Enkitec lab all run Exadata images, and the Exadata images do not contain bleeding edge kernels like Fedora 27 has, so I couldn’t repeat the perf sched timehist tracing. I wrote about ftrace in the past.

So, let’s redo the timing investigation again using ftrace:
1. Make sure the debug filesystem is mounted:

# mount /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug

2. Enter the trace directory in /sys/kernel/debug:

# cd /sys/kernel/debug/tracing

3. Create another session on the same machine and obtain the process id:

# echo $$
62733

4. In the debug/tracing session, setup nanosleep system call tracing:
Enter the process id as the pid to trace:

# echo 62733 > set_ftrace_pid

Set the type of trace. The default ‘tracer’ is nop (no tracing). Set it to ‘function_graph’:

# echo function_graph > current_tracer

If there is an option ‘(no)function-fork’ in trace_options, set it to ‘function-fork’ to make ftrace trace child processes too (function-fork is an addition in recent kernels):

# egrep ^nofunction-fork trace_options && echo function-fork > trace_options

Set the function to trace. For the nanosleep system call, this is SyS_nanosleep:

# echo SyS_nanosleep > set_ftrace_filter

Now enable tracing:

# echo 1 > tracing_on

5. In the session to trace, execute ‘chrt --rr 99 ./measure_time’.
6. In the debug/tracing session, stop the trace and look at the results:

# echo 0 > tracing_on
# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 20)   8.958 us    |  SyS_nanosleep();
 20)   5.077 us    |  SyS_nanosleep();
 20)   4.260 us    |  SyS_nanosleep();
 20)   9.384 us    |  SyS_nanosleep();
 20)   5.495 us    |  SyS_nanosleep();
 20)   5.526 us    |  SyS_nanosleep();
 20)   5.039 us    |  SyS_nanosleep();
 20)   4.936 us    |  SyS_nanosleep();
 20)   4.898 us    |  SyS_nanosleep();
 20)   4.889 us    |  SyS_nanosleep();

This shows the waiting time still fluctuating, but going down to as low as 4.260 us, or 4260 nanoseconds. This still is roughly four times the sleeping time set (1000ns), but it got way lower than earlier! Probably the tracing increased the latency a bit, so my guess would be that the true sleeping time with no trace applied is around 2000ns. Please mind this is with a changed (increased) process priority (chrt --rr 99); when measure_time is run without any priorities set, it looks like this:

# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 21) + 60.583 us   |  SyS_nanosleep();
 21) + 56.504 us   |  SyS_nanosleep();
 21) + 55.940 us   |  SyS_nanosleep();
 21) + 56.118 us   |  SyS_nanosleep();
 21) + 56.076 us   |  SyS_nanosleep();
 21) + 56.078 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 56.100 us   |  SyS_nanosleep();
 21) + 56.868 us   |  SyS_nanosleep();

But: there is Oracle grid infrastructure running, and multiple databases are running.

What would be interesting to see is how nanosleep is actually executed. Ftrace provides a way to do exactly that! The important thing to keep in mind is that ftrace is kernel only. In order to see how SyS_nanosleep is executed, do the following:

# > set_ftrace_filter
# echo SyS_nanosleep > set_graph_function
# > trace

Then execute ‘chrt --rr 99 ./measure_time’ again, and look in trace again
(please mind I picked the second occurrence of SyS_nanosleep; of course the trace shows them all):

 14)               |  SyS_nanosleep() {
 14)               |    hrtimer_nanosleep() {
 14)   0.045 us    |      hrtimer_init();
 14)               |      do_nanosleep() {
 14)               |        hrtimer_start_range_ns() {
 14)               |          __hrtimer_start_range_ns() {
 14)               |            lock_hrtimer_base() {
 14)   0.039 us    |              _raw_spin_lock_irqsave();
 14)   0.527 us    |            } /* lock_hrtimer_base */
 14)   0.048 us    |            ktime_get();
 14)   0.043 us    |            get_nohz_timer_target();
 14)   0.079 us    |            enqueue_hrtimer();
 14)               |            tick_program_event() {
 14)               |              clockevents_program_event() {
 14)   0.048 us    |                ktime_get();
 14)   0.446 us    |              } /* clockevents_program_event */
 14)   0.876 us    |            } /* tick_program_event */
 14)   0.040 us    |            _raw_spin_unlock();
 14)   0.040 us    |            __raise_softirq_irqoff();
 14)   4.318 us    |          } /* __hrtimer_start_range_ns */
 14)   4.792 us    |        } /* hrtimer_start_range_ns */
 14)               |        schedule() {
 14)               |          __schedule() {
 14)   0.041 us    |            rcu_note_context_switch();
 14)   0.039 us    |            _raw_spin_lock_irq();
 14)               |            deactivate_task() {
 14)               |              dequeue_task() {
 14)   0.056 us    |                update_rq_clock();
 14)               |                dequeue_task_rt() {
 14)               |                  update_curr_rt() {
 14)   0.057 us    |                    cpuacct_charge();
 14)   0.053 us    |                    sched_avg_update();
 14)   0.046 us    |                    _raw_spin_lock();
 14)               |                    sched_rt_runtime_exceeded() {
 14)   0.041 us    |                      balance_runtime();
 14)   0.402 us    |                    } /* sched_rt_runtime_exceeded */
 14)   0.046 us    |                    _raw_spin_unlock();
 14)   2.701 us    |                  } /* update_curr_rt */
 14)               |                  dequeue_rt_entity() {
 14)               |                    dequeue_rt_stack() {
 14)               |                      __dequeue_rt_entity() {
 14)   0.062 us    |                        cpupri_set();
 14)   0.057 us    |                        update_rt_migration();
 14)   0.861 us    |                      } /* __dequeue_rt_entity */
 14)   1.258 us    |                    } /* dequeue_rt_stack */
 14)   0.047 us    |                    enqueue_top_rt_rq();
 14)   1.908 us    |                  } /* dequeue_rt_entity */
 14)   0.048 us    |                  dequeue_pushable_task();
 14)   5.564 us    |                } /* dequeue_task_rt */
 14)   6.445 us    |              } /* dequeue_task */
 14)   6.789 us    |            } /* deactivate_task */
 14)   0.044 us    |            pick_next_task_stop();
 14)   0.041 us    |            pick_next_task_dl();
 14)               |            pick_next_task_rt() {
 14)   0.042 us    |              pull_rt_task();
 14)   0.050 us    |              update_curr_rt();
 14)   0.823 us    |            } /* pick_next_task_rt */
 14)               |            pick_next_task_fair() {
 14)               |              put_prev_task_rt() {
 14)   0.042 us    |                update_curr_rt();
 14)   0.377 us    |              } /* put_prev_task_rt */
 14)               |              pick_next_entity() {
 14)   0.039 us    |                clear_buddies();
 14)   0.340 us    |              } /* pick_next_entity */
 14)               |              set_next_entity() {
 14)   0.065 us    |                __dequeue_entity();
 14)   0.426 us    |              } /* set_next_entity */
 14)   2.159 us    |            } /* pick_next_task_fair */
 14)   0.052 us    |            finish_task_switch();
 14) + 15.656 us   |          } /* __schedule */
 14) + 16.003 us   |        } /* schedule */
 14)   0.039 us    |        _cond_resched();
 14)               |        hrtimer_cancel() {
 14)               |          hrtimer_try_to_cancel() {
 14)               |            lock_hrtimer_base() {
 14)   0.040 us    |              _raw_spin_lock_irqsave();
 14)   0.365 us    |            } /* lock_hrtimer_base */
 14)   0.046 us    |            _raw_spin_unlock_irqrestore();
 14)   1.006 us    |          } /* hrtimer_try_to_cancel */
 14)   1.403 us    |        } /* hrtimer_cancel */
 14) + 23.559 us   |      } /* do_nanosleep */
 14) + 24.444 us   |    } /* hrtimer_nanosleep */
 14) + 24.842 us   |  } /* SyS_nanosleep */

This gives a full overview of the functions that SyS_nanosleep is executing in the kernel, including timing.
Essentially, everything happens in ‘do_nanosleep’. Inside do_nanosleep, the following functions are directly called:
– hrtimer_start_range_ns -> 4.792 us
– schedule -> 16.003 us
– _cond_resched -> 0.039 us
– hrtimer_cancel -> 1.403 us
This is more than the 4260 ns we saw earlier; it’s clear the tracing influences the latency of the execution.

In order to get a breakdown on where the time goes for the 4260 ns, remove SyS_nanosleep from set_graph_function, empty the trace and add the four functions mentioned above to set_ftrace_filter:

# > set_graph_function
# > trace
# echo 'hrtimer_start_range_ns schedule _cond_resched hrtimer_cancel' > set_ftrace_filter

Now execute ‘chrt --rr 99 ./measure_time’ again, and look in trace; this is a snippet that shows a single nanosleep invocation:

...
  2)   0.938 us    |  hrtimer_start_range_ns();
  2)   3.406 us    |  schedule();
  2)   0.077 us    |  _cond_resched();
  2)   0.123 us    |  hrtimer_cancel();
...

What this shows is that the time it takes for the schedule function to finish makes up the majority of the time. There will be some influence of ftrace being active, but taking that into account, on this system the time to get rescheduled after waiting for a short time is around 3 to 4 us (microseconds).

Conclusion
There is a certain point at which timing is so fine grained that anything you do, even running a (kernel) function itself, distorts time. It looks like this is around 1us (microsecond) granularity for the systems I’ve looked at.

Obviously, when you run on a virtualised platform, especially a ‘desktop’ virtualisation product like Virtualbox as I did (but it applies to any virtualisation), scheduling times increase because the virtualisation needs to abstract running on a CPU. With modern virtualisation, user-mode code runs with minimal or no overhead. However, kernel mode access needs to be abstracted in order to keep virtual machines from invading each other’s resources. This can become significant if resources are oversubscribed (we got a glimpse of that in this article with the first nanosleep measurements at roughly 220 us).

The nanosleep times got lower when the process priority was set as high as possible. Then, when moving onto physical hardware, the latency got even lower. The priority of the process strongly influences the scheduling time, which is kind of logical, because that’s the entire purpose of the priority.

The Linux strace utility influences process execution latency quite significantly when looking at microsecond granularity. Ftrace provides a way to trace in a very flexible way with far less overhead than strace, although it still imposes an overhead on the process it is tracing.

Using ftrace, the minimal time that I could measure for a process to get scheduled again after sleeping was between 3 and 4 microseconds. This latency is influenced by the tracing, so it probably comes down to between 2 and 3 microseconds without it. Also, this is strongly influenced by the priority: I needed to set it to the highest priority in order to get the low latency. Another big influence will be the hardware I am running on, and there are a fair number of processes running, some at high priority (clusterware ocssd, Oracle database lms processes, Linux daemons).


This is the sixth part in a blog series about Oracle database redo. The previous posts provided information about the log writer writing; this post is about the process that is waiting, after issuing commit, for the log writer to write its redo from the public redo strand. When the database is using post/wait for commits, the committing process follows the following (simplified) procedure:

kcrfw_redo_gen_ext: 
 kcsgbsn8: kcsnew8: commit SCN is fetched from kcsgscn_
 kslgetl: redo copy latch is taken
 kslgetl: redo allocation latch is taken
 : allocation annotations are made for the commit record in kcrfsg_ and KCRFA
 kslfre: redo allocation latch is freed
 kcrfw_copy_cv : commit record is copied into a public redo strand.
 kslfre: redo copy latch is freed
 kcrf_commit_force_int: second argument 0: post logwriter if not already posted

These are the actual function calls inside kcrf_commit_force_int with second argument 0:

| | | | | | | | | | | | | | | | > kcrf_commit_force_int(0x7fffbea0de50, 0, ...)
| | | | | | | | | | | | | | | | | > kcscu8(0x60016290, 0x7fffbea0dc78, ...)
| | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
| | | | | | | | | | | | | | | | | > kcscu8(0x60016260, 0x7fffbea0dbc0, ...)
| | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
| | | | | | | | | | | | | | | | | > ksbasend(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | > ksbpro(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | < ksbpro+0x000000000063 returns: 0x7a6032e8
| | | | | | | | | | | | | | | | | | > ksbpser(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | < ksbpser+0x000000000046 returns: 0x1
| | | | | | | | | | | | | | | | | | > ksberr(0x600161a0, 0, ...)
| | | | | | | | | | | | | | | | | | < ksberr+0x000000000045 returns: 0
| | | | | | | | | | | | | | | | | | > ksasnd(0x7a6032e8, 0x1, ...)
| | | | | | | | | | | | | | | | | | | > ksasndint(0x7a6032e8, 0x1, ...)
| | | | | | | | | | | | | | | | | | | | > ksaamb(0x7a63a0b8, 0x7a5d9fe0, ...)
| | | | | | | | | | | | | | | | | | | | | > kslgetl(0x6003ef40, 0x1, ...)
| | | | | | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
| | | | | | | | | | | | | | | | | | | | | > _intel_fast_memcpy(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | <> _intel_fast_memcpy.P(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | <> __intel_ssse3_rep_memcpy(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x000000002030 returns: 0x7af99780
| | | | | | | | | | | | | | | | | | | | | > kslfre(0x6003ef40, 0x60009578, ...)
| | | | | | | | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0
| | | | | | | | | | | | | | | | | | | | < ksaamb+0x0000000001ed returns: 0x1
| | | | | | | | | | | | | | | | | | | < ksasndint+0x000000000120 returns: 0x1
| | | | | | | | | | | | | | | | | | | > ksl_post_proc(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | <> kskpthr(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | <> kslpsprns(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | | > ksl_update_post_stats(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | | | > dbgtTrcData_int(0x7f5b0ca886c0, 0x2050031, ...)
| | | | | | | | | | | | | | | | | | | | | | > dbgtBucketRedirect(0x7f5b0ca886c0, 0x7fffbea0d618, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > dbgtIncInMemTrcRedirect(0x7f5b0ca886c0, 0x6fa, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > skgstmGetEpochTs(0x7f5b0ca886c0, 0x6fa, ...)
| | | | | | | | | | | | | | | | | | | | | | | > gettimeofday@plt(0x7fffbea0d120, 0, ...)
| | | | | | | | | | | | | | | | | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
| | | | | | | | | | | | | | | | | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x203895d2b55e6
| | | | | | | | | | | | | | | | | | | | | | > dbgtrRecAllocate(0x7f5b0ca886c0, 0x7fffbea0d5c0, ...)
| | | | | | | | | | | | | | | | | | | | | | | > dbgtrPrepareWrite(0x7f5b0ca886c0, 0x711f5048, ...)
| | | | | | | | | | | | | | | | | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x1c
| | | | | | | | | | | | | | | | | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > _intel_fast_memcpy(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | <> _intel_fast_memcpy.P(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | <> __intel_ssse3_rep_memcpy(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x711f6ac0
| | | | | | | | | | | | | | | | | | | | | | > dbgtrRecEndSegment(0x7f5b0ca886c0, 0x7fffbea0d5c0, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | > skgpwpost(0x7fffbea0d760, 0x7f5b0cacdca0, ...)
| | | | | | | | | | | | | | | | | | | | <> sskgpwpost(0x7fffbea0d760, 0x7f5b0cacdca0, ...)
| | | | | | | | | | | | | | | | | | | | | > semop@plt(0x38000, 0x7fffbea0d6f0, ...)
| | | | | | | | | | | | | | | | | | | | | < semop+0x00000000000f returns: 0
| | | | | | | | | | | | | | | | | | | | < sskgpwpost+0x00000000009a returns: 0x1
| | | | | | | | | | | | | | | | | | | < kslpsprns+0x0000000001c3 returns: 0
| | | | | | | | | | | | | | | | | | < ksasnd+0x00000000006d returns: 0x1
| | | | | | | | | | | | | | | | | < ksbasend+0x0000000000f6 returns: 0x1
| | | | | | | | | | | | | | | | < kcrf_commit_force_int+0x000000000500 returns: 0

Basically, what happens is that kcscu8 is called to read 0x60016290 (kcrfsg_.on disk SCN) and kcscu8 is called to read 0x60016260 (kcrfsg_.LWN SCN). Then ksbasend (kernel service background processes asynchronous send message) is called, which essentially performs two functions in ksasnd: write a message into the messages struct (protected by the messages latch) in ksaamb (kernel service asynchronous messages allocated message buffer), and then post the logwriter using post/wait (ksl_post_proc; the actual posting using semop is done in skgpwpost, system kernel generic post wait post).
This also directly ends/returns from the kcrfw_redo_gen_ext function that called kcrf_commit_force_int.

At this point the commit was executed, and the logwriter is notified that it needs to write up to the commit record from the public redo strand.
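
To illustrate what the actual post via semop looks like at the operating system level, here is a minimal sketch of posting a System V semaphore, comparable to what skgpwpost/sskgpwpost end up doing. The helper name and the single-operation setup are my own illustration, not Oracle code:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* post (increment) semaphore number sem_num in the semaphore set semid,
   waking up a process blocked in semop/semtimedop on that semaphore */
int post_process(int semid, unsigned short sem_num)
{
  struct sembuf sb;

  sb.sem_num = sem_num;  /* which semaphore in the set */
  sb.sem_op  = 1;        /* +1 = post                  */
  sb.sem_flg = 0;
  return semop(semid, &sb, 1);
}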

The next thing the process does is release things like the enqueue(s) it is holding. After that, the process will wait for the logwriter to execute the write and get posted via post/wait. This is also done in the kcrf_commit_force_int function, however now the second argument is set to ‘1’:

 | | | <> kcrf_commit_force_int(0x7f3011d75e10, 0x1, ...)
 | | | | > skgstmGetEpochTs(0x7f3011d75e10, 0x7, ...)
 | | | | | > gettimeofday@plt(0x7ffda5959810, 0, ...)
 | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x206daa374e6a1
 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcscu8(0x60016260, 0x7ffda5959ad0, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > ksbasend(0x600161a0, 0x60016908, ...)
 | | | | | > ksbpro(0x600161a0, 0x60016908, ...)
 | | | | | < ksbpro+0x000000000063 returns: 0x7a6032e8
 | | | | | > ksbpser(0x600161a0, 0x60016908, ...)
 | | | | | < ksbpser+0x000000000046 returns: 0x1
 | | | | | > ksberr(0x600161a0, 0, ...)
 | | | | | < ksberr+0x000000000045 returns: 0
 | | | | | > ksasnd(0x7a6032e8, 0x1, ...)
 | | | | | | > ksasndint(0x7a6032e8, 0x1, ...)
 | | | | | | | > ksaamb(0x7a6085b8, 0x7a5d9fe0, ...)
 | | | | | | | < ksaamb+0x000000000504 returns: 0
 | | | | | | < ksasndint+0x000000000120 returns: 0x1
 | | | | | < ksasnd+0x00000000006d returns: 0x1

Actually, exactly the same is happening here as we saw previously. The way the kcrf_commit_force_int function works (summarised in the short sketch after this list) is:
– It first reads the on disk SCN (0x60016290) via kcscu8. If the on disk SCN is equal to or past the process’ commit SCN, it means the process’ redo is written to disk, and the kcrf_commit_force_int function is exited.
– If the on disk SCN has not progressed far enough, the LWN SCN (0x60016260) is read via kcscu8. If the LWN SCN has progressed beyond or is equal to the commit SCN, it means the logwriter is in the process of writing the process’ redo, so it does not need posting via ksbasend. If the logwriter has not progressed the LWN SCN that far, the process tries posting via ksbasend again.
– In the above function call trace we see ksaamb returning 0, which means it did try to post the logwriter again, but found the logwriter already posted (by looking at kcrfsg_+516). As a result, it returned directly from the function without creating a message (and thus without needing to fetch the messages latch and post the logwriter via semop).
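
Put into a compact form, the decision could be sketched like this. The helper names are just descriptive stand-ins for the kcscu8 reads and the ksbasend call described above, and the stub bodies only exist so the sketch compiles; this is not Oracle’s code:

/* stand-ins for the reads/calls described above */
static unsigned long read_on_disk_scn(void)   { return 0; }  /* kcscu8 on kcrfsg_ on disk SCN */
static unsigned long read_lwn_scn(void)       { return 0; }  /* kcscu8 on kcrfsg_ LWN SCN     */
static void          try_post_logwriter(void) { }            /* ksbasend                      */

/* returns 1 when the redo up to commit_scn is already on disk,
   0 when the caller still has to wait for the logwriter */
static int commit_force(unsigned long commit_scn)
{
  if (read_on_disk_scn() >= commit_scn)
    return 1;                          /* redo already written: done          */
  if (read_lwn_scn() < commit_scn)     /* logwriter not yet writing our redo  */
    try_post_logwriter();              /* no-op if it was already posted      */
  return 0;                            /* caller must wait for the write      */
}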

At this point in kcrf_commit_force_int, everything is done to make the logwriter write the process’ redo in the public redo strand, so it needs to wait for the logwriter to finish, and the instance is using post/wait as the notification mechanism for log file sync.

The next thing that happens is a post-wait entry in the post wait queue is setup, and the on disk SCN is checked once more:

 | | | | > kslawe(0x10380, 0x7a5d9fe0, ...)
 | | | | | > kslgetl(0x7a5d72c8, 0, ...)
 | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | > kslfre(0x7a5d72c8, 0x7a5d7368, ...)
 | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | < kslawe+0x00000000027d returns: 0
 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1

Now that the process registered itself in the post-wait queue, a wait is registered via kslwtbctx:

 | | | | > kslwtbctx(0x7ffda5959950, 0x7ffda5959b88, ...)
 | | | | | > sltrgftime64(0x7a6d9580, 0x7ae18b68, ...)
 | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959700, ...)
 | | | | | | | > clock_gettime(0x1, 0x7ffda5959700, ...)
 | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | < sltrgftime64+0x00000000004c returns: 0x428fc8dac9
 | | | | | > kslwait_timeout_centi_to_micro(0x7fffffff, 0x443dfe7, ...)
 | | | | | < kslwait_timeout_centi_to_micro+0x000000000021 returns: 0x7fffffffffffffff
 | | | | | > kskthbwt(0, 0x428fc8dac9, ...)
 | | | | | < kskthbwt+0x0000000000a1 returns: 0x10
 | | | | | > kslwt_start_snapshot(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | < kslwt_start_snapshot+0x0000000000dc returns: 0x7a6d9db8
 | | | | < kslwtbctx+0x00000000050b returns: 0x7a6d9db8

Now that the process is in the wait event, the process needs to truly ‘wait’. In post-wait mode this is done on a semaphore, just like the logwriter does when sleeping in the wait event ‘rdbms ipc message’. This is what that looks like:

 | | | | > kslwaitctx(0x7ffda5959950, 0xa, ...)
 | | | | | > ksliwat(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > sltrgftime64(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959280, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffda5959280, ...)
 | | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | < sltrgftime64+0x00000000004c returns: 0x428fc8e664
 | | | | | | > kslawe(0x10380, 0x459ffef, ...)
 | | | | | | < kslawe+0x000000000076 returns: 0x10380
 | | | | | | > kskiorm(0x7ae0ee70, 0, ...)
 | | | | | | < kskiorm+0x00000000001e returns: 0
 | | | | | | > kfias_iswtgon_ksfd(0x7ae0ee70, 0, ...)
 | | | | | | < kfias_iswtgon_ksfd+0x00000000002b returns: 0
 | | | | | | > kxdbio_has_work(0x7ffda59592a4, 0x6003d010, ...)
 | | | | | | < kxdbio_has_work+0x000000000027 returns: 0
 | | | | | | > skgpwwait(0x7ffda5959510, 0x7f3011d45ca0, ...)
 | | | | | | | > kslwait_conv_wait_time(0x186a0, 0x7f3011d45ca0, ...)
 | | | | | | | < kslwait_conv_wait_time+0x000000000027 returns: 0x186a0
 | | | | | | | > sskgpwwait(0x7ffda5959510, 0x7f3011d45ca0, ...)
 | | | | | | | | > semtimedop(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff
 | | | | | | | | > __errno_location(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | | > fthread_self(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | | | | | < __errno_location+0x000000000010 returns: 0x7f3011d406a0
 | | | | | | | < sskgpwwait+0x00000000014e returns: 0
 | | | | | | < skgpwwait+0x0000000000e0 returns: 0
 | | | | | | > ksuSdiInProgress(0x10003, 0x10003, ...)
 | | | | | | < ksuSdiInProgress+0x000000000035 returns: 0
 | | | | | | > sltrgftime64(0x10003, 0x10003, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959280, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffda5959280, ...)
 | | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | < sltrgftime64+0x00000000004c returns: 0x428fcad86e
 | | | | | | > kslwo_getcbk(0x9d, 0x80f6298, ...)
 | | | | | | < kslwo_getcbk+0x000000000017 returns: 0
 | | | | | | > kgslwait_last_waitctx_time_waited_usecs(0x7f3011d409a0, 0x7a6da7f0, ...)
 | | | | | | < kgslwait_last_waitctx_time_waited_usecs+0x000000000045 returns: 0x1f20a
 | | | | | | > kslwt_update_stats_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | | > kews_update_wait_time(0x5, 0x1fda5, ...)
 | | | | | | | < kews_update_wait_time+0x00000000002e returns: 0x6
 | | | | | | < kslwt_update_stats_int+0x00000000033e returns: 0x428fcad86e
 | | | | | | > ksliwat_cleanup(0x7ffda5959650, 0x7ffda5959628, ...)
 | | | | | | | > kslgetl(0x7a5d72c8, 0x1, ...)
 | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | > kslfre(0x7a5d72c8, 0x7a5d7378, ...)
 | | | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | | | < ksliwat_cleanup+0x0000000000e5 returns: 0
 | | | | | < ksliwat+0x000000003acf returns: 0
 | | | | < kslwaitctx+0x00000000012b returns: 0

The most prominent thing is skgpwwait (system kernel generic post-wait wait), which calls sskgpwwait (skgpwwait with an additional s in front, which often indicates OS-dependent calls, like in this case calling semtimedop). We see semtimedop returning 0xffffffffffffffff (-1), which means the semaphore wait timed out: the semaphore was not posted using semop before the timeout set for the semaphore wait (100ms for log file sync).
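
As a reference for what such a timed semaphore wait looks like outside Oracle, here is a minimal sketch of a 100ms semtimedop wait; the function name and return convention are made up for the example:

#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

/* wait at most 100ms to be posted on semaphore sem_num in set semid;
   returns 1 when posted, 0 on timeout, -1 on any other error */
int timed_wait(int semid, unsigned short sem_num)
{
  struct sembuf sb = { sem_num, -1, 0 };               /* -1 = wait for a post */
  struct timespec timeout = { 0, 100 * 1000 * 1000 };  /* 100 ms               */

  if (semtimedop(semid, &sb, 1, &timeout) == -1)
    return (errno == EAGAIN) ? 0 : -1;                 /* EAGAIN = timed out   */
  return 1;                                            /* posted via semop     */
}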

Interestingly, a little further down we see ksliwat_cleanup. This function clears this process from the post-wait queue, indicated not only by the ‘cleanup’ part in ksliwat_cleanup, but also by the latch get of 0x7a5d72c8, which is the ‘post/wait queue’ latch. This means that when the kslwaitctx function is left, this process is no longer in the post/wait queue. However, the process is still waiting in the wait event ‘log file sync’.

The next thing that happens is a call to ‘ksu_dispatch_tac’ (kernel service user dispatch timed-out action):

 | | | | > ksu_dispatch_tac(0x7a5d72c8, 0, ...)
 ...
 | | | | < ksu_dispatch_tac+0x0000000003b4 returns: 0

Essentially, this routine validates certain conditions to trigger actions because the wait in kslwaitctx has timed out.

At this point the process repeats the steps of:
– kslawe: registering itself in the post-wait queue
– kcscu8: check on disk SCN
– kslwaitctx, semtimedop: wait to be posted
– (if not posted) ksliwat_cleanup: remove post-wait entry
– (if not posted) kcscu8: check on disk SCN
(the double check of on disk SCN via kcscu8 is not a typo!)

If the semaphore gets posted (visible by return code 0):

 | | | | | | | | > semtimedop(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | < semtimedop+0x000000000012 returns: 0

The post-wait cleanup function (ksliwat_cleanup) does not require a latch when the process has been posted, probably because now that the entry has been used, there is no chance it will be used again, so it’s safe to clean it up without serialising access via a latch. After ksliwat_cleanup returns, the process also returns from ksliwat and kslwaitctx.

Next, kcscu8 is called to read the on disk SCN again, the wait event is ended with kslwtectx, some statistics about redo writing are recorded (kcrfw_stats_syncoverhead_record), and the kcrf_commit_force_int function ends, because when the semaphore is posted, the wait for the logwriter to finish is over, meaning the commit is entirely finished:

 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kslwtectx(0x7ffda5959950, 0x7ffda5959b88, ...)
 | | | | | > kslwt_end_snapshot(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > kslwh_enter_waithist_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | < kslwh_enter_waithist_int+0x00000000013a returns: 0x7a8378a8
 | | | | | | > kslwtrk_enter_wait_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x7a8378a8
 | | | | | < kslwt_end_snapshot+0x0000000000f2 returns: 0x7a6d9580
 | | | | | > kslwt_update_stats_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > kews_update_wait_time(0x5, 0, ...)
 | | | | | | < kews_update_wait_time+0x00000000002e returns: 0x6
 | | | | | < kslwt_update_stats_int+0x00000000033e returns: 0
 | | | | | > kskthewt(0x4290257c02, 0x9d, ...)
 | | | | | < kskthewt+0x0000000000c3 returns: 0xa030
 | | | | < kslwtectx+0x0000000003f9 returns: 0xe923
 | | | | > skgstmGetEpochTs(0x4290257c02, 0x9d, ...)
 | | | | | > gettimeofday@plt(0x7ffda5959810, 0, ...)
 | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x206daa3d1ca1b
 | | | | > kcrfw_stats_syncoverhead_record(0x7f3011d75e10, 0x206daa3d1ca1b, ...)
 | | | | | > kcrfw_stats_writeinfo_find(0x7f3011d75e10, 0x206daa3d1ca1b, ...)
 | | | | | < kcrfw_stats_writeinfo_find+0x0000000000fb returns: 0x206daa3d1a057
 | | | | < kcrfw_stats_syncoverhead_record+0x0000000000a4 returns: 0x29c4
 | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1

What is clear is that the process checks the on disk SCN a lot of times. The obvious reason for that would be to stop waiting whenever the on disk SCN indicates the process’ redo information is written. However, in the past I’ve heard about issues with sessions that were waiting in post-wait mode, which were ‘forgotten’ when the database switched to polling (this was Oracle 11.2.0.3; this issue is solved!). That however leads to interesting questions:

1. What happens when a process is posted by the logwriter, but the on disk SCN is NOT progressed beyond the process’ commit SCN?
2. What happens when a process is NOT posted by the logwriter, but the on disk SCN has progressed beyond the process’ commit SCN?

Let’s test that! Let’s test the first situation: a foreground posted, but the on disk SCN did not progress beyond the commit SCN of that process.

First we need to know the commit SCN for the inserting session. We know that the SCN is obtained from kcsgscn_; let’s get the address of it:

SQL> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv where ksmfsnam = 'kcsgscn_';
KSMFSNAM                       KSMFSADR           KSMFSSIZ
------------------------------ ---------------- ----------
kcsgscn_                       00000000600113B8         48

Now attach to a foreground session with gdb, put a read/write (access) watch on 0x600113b8, set pagination off and let gdb continue:

(gdb) awatch *0x600113b8
Hardware access (read/write) watchpoint 1: *0x600113b8
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) set pagi off
(gdb) c
Continuing.

Now insert something in the foreground session (anything is okay, as long as it’s a regular insert):

Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374652
New value = 374701
0x000000000320e5a8 in kcsAdjustUsingForeignSCN ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3ee68 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f03b in kcsgrsn8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f03b in kcsgrsn8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f2d5 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3e93b in kcscu8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3e93b in kcscu8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3ee68 in kcsgcsn ()

Pay attention to the output sequence of a watchpoint:
– ‘Hardware access …’ and watchpoint number and address.
– old and new value or read value.
– instruction pointer address and function.
Also look at the different functions with which an SCN is queried.

Now execute the commit:

Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374701
New value = 374767
0x000000000320e5a8 in kcsAdjustUsingForeignSCN ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374767
0x0000000010e3ecc1 in kcsnew8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374767
New value = 374768
0x0000000010e3ecdb in kcsnew8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374768
New value = 374769
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374769
0x0000000010e3ee68 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374769
0x0000000010e3ee68 in kcsgcsn ()

If you take a close look, you see a new SCN for the commit is obtained in the aptly named function kcsnew8 at instruction pointer address 0x10e3ecdb. Now that we know the instruction pointer address, we can filter the watch:
Disable the previous watchpoint, and create a second access watchpoint:

(gdb) dis 1
(gdb) awatch *0x600113b8 if ( $rip == 0x10e3ecdb )
Hardware access (read/write) watchpoint 2: *0x600113b8
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now issue another insert and a commit:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 374908
New value = 374909
0x0000000010e3ecdb in kcsnew8 ()

Voilà! The commit SCN is 374909 in this case!

Because we know kcscu8 is called to read the on disk SCN from kcrfsg_, we can cheat and modify the on disk SCN in kcrfsg_ ourselves to make this lower than the commit SCN.

=> WARNING! Never EVER do this, or any of these techniques, in a database that you care about! This can severely corrupt the database, leading to an irreparable situation! WARNING <=

First modify the watchpoint (2 in the above example) to stop when it is fired:

(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) c
Continuing.

Now perform the insert and commit again. This will hang because the watchpoint will break execution:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 620035
New value = 620036
0x0000000010e3ecdb in kcsnew8 ()

Okay, so the commit SCN is 620036. Now break on kcscu8 and simply set the on disk SCN to 620000, which undoubtedly is lower than the commit SCN of 620036:

(gdb) break kcscu8
Breakpoint 3 at 0x10e3e910
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>set {int}0x60016290=620000
>c
>end

Also set an access watch on 0x60016290, i.e. the on disk SCN, to validate what SCN the process reads:

(gdb) awatch *0x60016290
Hardware watchpoint 4: *0x60016290
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb)

Now attach with another gdb session to the logwriter, and keep the logwriter stopped after attaching.
After the logwriter is stopped, continue in the foreground session:

(gdb) c
Continuing.

What I regularly encounter, especially with new(er) versions of the Oracle database, are the following messages:

Program received signal SIGUSR2, User defined signal 2.
0x0000000010e3ecdb in kcsnew8 ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000000010f973a5 in slaac_int ()
(gdb) c
Continuing.

It seems that a signal handler is triggered. In most cases, just continuing will get the process running again.

After the signal handler, the following is shown in gdb:

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
---Type <return> to continue, or q <return> to quit---

Quit using q and enter, and set pagination off: set pagi off, then continue.

Now let the logwriter continue in its gdb session (continue).

In the foreground debugger session, you'll see that the foreground did not do anything other than what it already did:

Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()

Stop execution of the committing process, and disable breakpoint 3 (dis 3), the breakpoint that sets the on disk SCN to a value lower than the process' commit SCN. This way the process will pick up the correct current on disk SCN, which is beyond the process' commit SCN:

(gdb) c
Continuing.
Hardware access (read/write) watchpoint 4: *0x60016290

Old value = 620000
New value = 620865
0x0000000010e3e93b in kcscu8 ()

And the commit finishes! This proves that despite the post/wait posting, the session actually checks the on disk SCN using kcscu8 in addition to the semaphore being posted, and only continues if the on disk SCN has progressed up to the commit SCN or further!

We can use these breakpoints to see if we can do the exact opposite: make sure the foreground is not posted, and then progress the on disk SCN beyond the commit SCN. At this point you should have two gdb sessions, one on a foreground process and one on the logwriter. Both gdb sessions have the process they are attached to running.

Now insert and commit again. Because of the watchpoint on kcsgscn_ with the instruction pointer set to kcsnew8 at breakpoint 2, this will break/stop execution:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 621375
New value = 621376
0x0000000010e3ecdb in kcsnew8 ()

Break execution of the logwriter, so the foreground cannot be posted anymore (simply ctrl-C the gdb session attached to the log writer process).
Now change breakpoint 3 in the gdb session attached to the foreground to set the on disk SCN to 621400, which is higher than the commit SCN of 621376:

(gdb) commands 3
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>set {int}0x60016290=621400
>c
>end

And enable breakpoint 3 again:

(gdb) enable 3

And then continue the foreground in gdb.

...
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 621400
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
...

Repeatedly.

This shows watchpoint 4 being triggered because of access to 0x60016290 (on disk SCN), and breakpoint 3 setting the on disk SCN to 621400.
Now continue the logwriter. This will finish the commit of the foreground session.

So the conclusion here is that despite the on disk SCN having progressed beyond the foreground session's commit SCN value (the commit SCN was 621376, the on disk SCN was set to 621400), a commit will wait until it is explicitly posted via post/wait.
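
Summarising the observed behaviour of the foreground in a hedged sketch: all names below are descriptive stand-ins for the functions discussed (kslawe, kcscu8, semtimedop, ksliwat_cleanup and ksu_dispatch_tac), and the stub bodies only exist to make the sketch compile; it is not Oracle's code:

#include <stdio.h>

static unsigned long commit_scn = 621376;
static unsigned long read_on_disk_scn(void)        { return 621400; }  /* kcscu8           */
static void register_in_postwait_queue(void)       { }                 /* kslawe           */
static void remove_postwait_entry(void)            { }                 /* ksliwat_cleanup  */
static void timeout_action(void)                   { }                 /* ksu_dispatch_tac */
static int  wait_on_semaphore_100ms(void)          { return 1; }       /* semtimedop       */

int main(void)
{
  for (;;) {
    register_in_postwait_queue();
    (void) read_on_disk_scn();
    if (!wait_on_semaphore_100ms()) {      /* timed out, not posted                */
      remove_postwait_entry();             /* requires the post/wait queue latch   */
      (void) read_on_disk_scn();           /* checked again, but the process       */
      timeout_action();                    /* keeps waiting regardless             */
      continue;
    }
    /* posted by the logwriter: only now is the on disk SCN check decisive */
    if (read_on_disk_scn() >= commit_scn)
      break;
  }
  printf("log file sync finished\n");
  return 0;
}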

This is the fifth blog in a series of blogposts about Oracle database redo. The previous blog looked into the ‘null write’ (kcrfw_do_null_write actually) function inside kcrfw_redo_write_driver, which does housekeeping like updating SCNs and posting processes if needed; this blog looks into what happens when the log writer is actually posted by a process, or if public redo strand buffers have been written into. In part 3 of this blog series (the log writer working cycle) it can be seen that when a session posts the log writer, it returns from the semaphore related functions and calls ‘kcrfw_redo_write_driver’ directly, which otherwise is called inside ksbcti.

Inside the kcrfw_redo_write_driver function, the first thing of interest is executed only when the logwriter is posted, and the kcrfw_redo_write_driver function is called directly after returning from ksarcv and ksl_exit_main_loop_wait:

kcrfw_redo_write_driver+145:0x000000006001642c(fixed sga|var:kcrfsg_+516 shared pool ):W:4:0/0()

At offset 516 in kcrfsg_, a 4 byte field is used to indicate that the logwriter has been posted. This field is filled out by a process that wants the logwriter to write. The way it works is that a process reads this field and changes it to 1 if it contains 0, and then performs a semctl or semop to post the logwriter and get it working. If the field already contains 1, no further action is taken by a committing process. The logwriter sets this value back to 0 as soon as it has been posted and executes the kcrfw_redo_write_driver function. If the kcrfw_redo_write_driver function is called as part of the normal logwriter cycle (which means inside the ksbcti function), it will not read or write this field.
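
The described protocol boils down to a compare-and-set on the flag plus a semaphore post. A conceptual sketch of that pattern using a C11 atomic follows; this illustrates the mechanism only, the names are made up and it is not Oracle's actual code:

#include <stdatomic.h>

/* stand-in for the 4 byte "logwriter posted" field at kcrfsg_+516 */
static atomic_int lgwr_posted;

/* committing process: only the process that flips the flag from 0 to 1
   actually posts the logwriter semaphore (semctl/semop) */
void request_logwriter_write(void)
{
  int expected = 0;
  if (atomic_compare_exchange_strong(&lgwr_posted, &expected, 1)) {
    /* post the logwriter here (see the semop sketch earlier) */
  }
  /* otherwise somebody already posted it; nothing to do */
}

/* logwriter: reset the flag when it starts kcrfw_redo_write_driver
   after having been posted */
void acknowledge_post(void)
{
  atomic_store(&lgwr_posted, 0);
}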

Then a lot of things are executed that have been shown in the previous post. The next thing kcrfw_redo_write_driver executes is kcsnew3, to obtain an LWN SCN and set the LWN number, protected by the ‘lgwr LWN SCN’ latch:

kcsnew8+107:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0x1/1()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):R:4:0xc6/198()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):W:4:0xc7/199()
kcsnew8+125:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b80/7744384()
kcsnew8+131:0x00007fffa10728f0():W:4:0/0()
kcsnew8+140:0x00007fffa1072922():W:1:0/0()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b80/7744384()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):W:8:0x762b81/7744385()
kcsnew8+165:0x00007fffa10728f0():R:4:0/0()
kcsnew8+184:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsnew8+193:0x00007f520ba8ae90():R:8:0x7a823620/2055353888(shared pool|permanent memor,duration 1,cls perm+7693856 )
kcsnew8+212:0x000000007a823e40(shared pool|permanent memor,duration 1,cls perm+7695936 ):R:8:0x719beed8/1906044632(shared pool|ksu:stats_freel,duration 1,cls freeabl+24 )
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):R:8:0xc7/199()
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):W:8:0xc8/200()
kcsnew8+223:0x00007fffa1072a00():W:8:0x762b81/7744385()
kcsnew8+232:0x0000000060016260(fixed sga|var:kcrfsg_+56 shared pool|x_kcrfws.lwn_scn+0 ):W:8:0x762b81/7744385()
kcsnew8+238:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0/0()
kcsnew8+246:0x0000000060016288(fixed sga|var:kcrfsg_+96 shared pool|pointer:lgwr lwn scn latch+0 ):R:8:0x60049800/1610913792(fixed sga|(parent)latch:lgwr LWN SCN+0 fixed sga|var:lwn_scn_lat_+0 )
kcsnew8+250:0x00007fffa10728d8():W:8:0x10e3ed3f/283372863()

The LWN SCN is set to 0x762b81/7744385, and the LWN number is 199.

The next thing to look at is the kcrfw_gather_lwn function flow:

 | > kcrfw_gather_lwn(0x7ffc3dd11b78, 0x77eced90, ...)
 | | > kslgetl(0x76fe0c10, 0x1, ...)
 | | < kslgetl+0x00000000012f returns: 0x1
 | | > kcrfw_gather_strand(0x7ffc3dd11b78, 0, ...)
 | | < kcrfw_gather_strand+0x0000000000c2 returns: 0x2
 | | > kslfre(0x76fe0c10, 0x20002, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | < kslgetl+0x00000000012f returns: 0x1
 | | > kcrfw_gather_strand(0x7ffc3dd11b78, 0x1, ...)
 | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | > kslfre(0x76fe0cb0, 0x19, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | < kcrfw_gather_lwn+0x00000000065c returns: 0

Here we see exactly the same function sequence in kcrfw_gather_lwn as we saw in the ‘null write’ blogpost; however, pay close attention to the return codes of the kcrfw_gather_strand calls. The first call returns 2 instead of 0, indicating two (strand) buffers need writing.

If we zoom in to the first kcrfw_gather_strand call, we see:

kcrfw_gather_strand+2:0x00007fffa1072910():W:8:0x7fffa1072a10/140735894989328()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1072908():W:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+29:0x00007fffa1072900():W:8:0x76fdf130/1996353840(shared pool|permanent memor,duration 1,cls perm+16113968 )
kcrfw_gather_strand+36:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_strand+53:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1072bf8():R:8:0x762b81/7744385()
kcrfw_gather_strand+79:0x0000000076fdf390(shared pool|KCRFA+8 shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x762b7f/7744383()
kcrfw_gather_strand+86:0x0000000076fdf3a8(shared pool|KCRFA+32 shared pool|NEXT_BUF_NUM_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x660/1632()
kcrfw_gather_strand+121:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):R:4:0x65e/1630()
kcrfw_gather_strand+169:0x00007fffa1072a08():W:4:0x660/1632()
kcrfw_gather_strand+176:0x00007fffa1072a0c():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1072900():R:8:0x76fdf130/1996353840(shared pool|permanent memor,duration 1,cls perm+16113968 )
kcrfw_gather_strand+186:0x00007fffa1072908():R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+193:0x00007fffa1072910():R:8:0x7fffa1072a10/140735894989328()
kcrfw_gather_strand+194:0x00007fffa1072918():R:8:0x31f0f00/52367104()

What is visible here is roughly identical to the memory access pattern of the kcrfw_gather_strand function we saw in the ‘null write’ blogpost, with one striking difference: at offset 86 in the kcrfw_gather_strand function, KCRFA offset 32 (NEXT_BUF_NUM_KCRFA_CLN) is read and shows a value of 1632, while the other locations (LAST_BUF_WRITTEN and LAST_BUF_GATHERED_KCRFA) still show 1630. This is an indication some public redo strand buffers have been written to, which means there is potentially something to write! Please remember that the redo allocation latch was taken before calling kcrfw_gather_strand.

Once the kcrfw_gather_strand function has returned to kcrfw_gather_lwn, the actual write is prepared by further manipulating the KCRFA struct for the first strand for the write:

kcrfw_gather_lwn+436:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):W:4:0x660/1632()
kcrfw_gather_lwn+449:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0x1b8/440()
kcrfw_gather_lwn+453:0x0000000060016a10(fixed sga|var:kcrf_kcrrh_sz_+0 ):R:8:0x18/24()
kcrfw_gather_lwn+462:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd80cc048/3624714312(Redo Buffers(pgsz:2048k)|+835656 redo|PUB_REDO_0+835656 )
kcrfw_gather_lwn+466:0x00000000d80cc048(redo|PUB_REDO_0+835656 ):W:4:0/0()
kcrfw_gather_lwn+472:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0x1b8/440()
kcrfw_gather_lwn+476:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd80cc048/3624714312(Redo Buffers(pgsz:2048k)|+835656 redo|PUB_REDO_0+835656 )
kcrfw_gather_lwn+476:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):W:8:0xd80cc200/3624714752(Redo Buffers(pgsz:2048k)|+836096 redo|PUB_REDO_0+836096 )
kcrfw_gather_lwn+480:0x0000000076fdf3cc(shared pool|KCRFA+68 shared pool|permanent memor,duration 1,cls perm+16114636 ):R:1:0/0()
kcrfw_gather_lwn+480:0x0000000076fdf3cc(shared pool|KCRFA+68 shared pool|permanent memor,duration 1,cls perm+16114636 ):W:1:0x1/1()
kcrfw_gather_lwn+485:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):W:8:0/0()

The LAST_BUF_GATHERED_KCRFA field at KCRFA offset 212 is set to the new buffer number 1632. Then the actual amount of data inside the buffer is queried from KCRFA at offset 40 (BYTES_IN_BUF_KCRFA_CLN), and the pointer to the actual location in the first public redo strand is obtained from KCRFA offset 24 (PNEXT_BUF_KCRFA_CLN). The pointer into the first public redo strand is then increased by the amount of redo in the buffer, which is 440 bytes in this case, and BYTES_IN_BUF_KCRFA_CLN is reset to 0.
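
To make this bookkeeping easier to follow, here is a minimal C sketch of what kcrfw_gather_lwn appears to do with these KCRFA fields. The struct and field names are my own shorthand for the offsets described above, not actual Oracle definitions, and the values are taken from the trace.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical shorthand for the KCRFA fields discussed above (offsets 24, 32, 40, 208, 212). */
struct kcrfa_sketch {
  char     *pnext_buf;          /* KCRFA+24:  pointer into the public strand   */
  uint32_t  next_buf_num;       /* KCRFA+32:  next buffer number to be used    */
  uint64_t  bytes_in_buf;       /* KCRFA+40:  bytes in the current buffer      */
  uint32_t  last_buf_written;   /* KCRFA+208: last buffer written to disk      */
  uint32_t  last_buf_gathered;  /* KCRFA+212: last buffer gathered by lgwr     */
};

/* Sketch of the gather step: mark buffers up to next_buf_num as gathered,
   advance the strand pointer past the redo already generated, and reset the
   per-buffer byte count, as observed in the memory trace. */
static uint32_t gather_strand_sketch(struct kcrfa_sketch *s)
{
  /* consistent with the return value of 2 seen in the function trace */
  uint32_t to_write = s->next_buf_num - s->last_buf_written;
  if (s->next_buf_num > s->last_buf_gathered) {
    s->last_buf_gathered = s->next_buf_num;  /* 1630 -> 1632 in the trace    */
    s->pnext_buf        += s->bytes_in_buf;  /* advance by 440 bytes          */
    s->bytes_in_buf      = 0;                /* reset for the next gather     */
  }
  return to_write;
}

int main(void)
{
  static char strand[8192];
  struct kcrfa_sketch s = { strand, 1632, 440, 1630, 1630 };
  printf("buffers to write: %u\n", (unsigned)gather_strand_sketch(&s));
  return 0;
}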

A little further in kcrfw_gather_lwn, LAST_BUF_WRITTEN in KCRFA at offset 208 is changed to public strand buffer 1632:

kcrfw_gather_lwn+977:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_lwn+987:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_lwn+996:0x0000000076fdf3c0(shared pool|KCRFA+56 shared pool|permanent memor,duration 1,cls perm+16114624 ):W:4:0x1000/4096()
kcrfw_gather_lwn+1006:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):W:4:0x660/1632()

At this point the redo allocation latch for this strand is released.
After the release of the latch, the kcrfw_gather_lwn function changes a value in the struct kcrfsg_ at offset 8, which is X$KCRFWS.NEXT_BLK, alias the administration of where to write in the online redolog file(s):

kcrfw_gather_lwn+1038:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):R:4:0x16a16/92694()
kcrfw_gather_lwn+1044:0x0000000077eceea8(shared pool|permanent memor,duration 1,cls perm+2451776 ):W:4:0x16a16/92694()
kcrfw_gather_lwn+1048:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):R:4:0x16a16/92694()
kcrfw_gather_lwn+1048:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):W:4:0x16a18/92696()

The second public strand has no buffers written, which we know because the kcrfw_gather_strand function returned 0, so that one is skipped.

At this point, the memory areas that take part in the writing have been set up.

In order to write, the function kcrfw_redo_write is called to start the actual redo writing. First, the logwriter needs to make sure that the buffers it has marked for writing (the difference between the original LAST_BUF_GATHERED and LAST_BUF_WRITTEN) are not still being written into. This is done in the function kcrfw_wait_for_fg_copy, which is called in kcrfw_redo_write. The way this works is as follows. It looks up the address of the first redo copy latch child, a latch that a process copying change vectors into a public strand must take, which is stored in kcrfsg_ at offset 472, and the address of the redo copy latch support memory area, which is stored in kcrfsg_ at offset 488. Then it looks up the buffer assigned to the first latch in the redo copy latch support memory area and reads 8 bytes to see if they contain the value 0xffffffffffffffff, which is what a writer into a redo strand must set in the support memory area when it is done writing to the strand. This means that as long as a process is writing into a redo strand, it will have set this memory area to another value, indicating the strand is being written into. Once the value becomes 0xffffffffffffffff, kcrfw_wait_for_fg_copy checks whether any other redo copy latches need checking. It does this by reading the total number of redo copy latches in kcrfsg_ at offset 464 and comparing it with the number of the redo copy latch it just checked. If there are more, it moves to the next redo copy latch and its redo copy latch support memory area, checking for 0xffffffffffffffff again, and so on. Please mind it does not check or obtain any of the redo copy latches itself, nor does it prevent any write to the redo strands that happens after the check, which is probably by design:

kcrfw_wait_for_fg_copy+132:0x0000000060016400(fixed sga|var:kcrfsg_+472 shared pool|pointer:redo copy latch+0 ):R:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+142:0x0000000060016410(fixed sga|var:kcrfsg_+488 ):R:8:0x76fdf0f0/1996353776(shared pool|permanent memor,duration 1,cls perm+16113904 )
kcrfw_wait_for_fg_copy+149:0x00007fffa1072968():W:1:0/0()
kcrfw_wait_for_fg_copy+153:0x00007fffa1072960():W:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+157:0x00007fffa1072958():W:4:0/0()
kcrfw_wait_for_fg_copy+161:0x0000000076fdf0f0(shared pool|permanent memor,duration 1,cls perm+16113904 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+169:0x0000000076fdf0f4(shared pool|permanent memor,duration 1,cls perm+16113908 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+176:0x00007fffa1072968():R:1:0/0()
kcrfw_wait_for_fg_copy+180:0x00007fffa1072958():R:4:0/0()
kcrfw_wait_for_fg_copy+195:0x00000000600163f8(fixed sga|var:kcrfsg_+464 ):R:4:0x8/8()
kcrfw_wait_for_fg_copy+132:0x0000000060016400(fixed sga|var:kcrfsg_+472 shared pool|pointer:redo copy latch+0 ):R:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+142:0x0000000060016410(fixed sga|var:kcrfsg_+488 ):R:8:0x76fdf0f0/1996353776(shared pool|permanent memor,duration 1,cls perm+16113904 )
kcrfw_wait_for_fg_copy+149:0x00007fffa1072968():W:1:0/0()
kcrfw_wait_for_fg_copy+153:0x00007fffa1072960():W:8:0x76fdeb08/1996352264(shared pool|(child)latch#2:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112392 )
kcrfw_wait_for_fg_copy+157:0x00007fffa1072958():W:4:0x1/1()
kcrfw_wait_for_fg_copy+161:0x0000000076fdf0f8(shared pool|permanent memor,duration 1,cls perm+16113912 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+169:0x0000000076fdf0fc(shared pool|permanent memor,duration 1,cls perm+16113916 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+176:0x00007fffa1072968():R:1:0/0()
kcrfw_wait_for_fg_copy+180:0x00007fffa1072958():R:4:0x1/1()
kcrfw_wait_for_fg_copy+195:0x00000000600163f8(fixed sga|var:kcrfsg_+464 ):R:4:0x8/8()

The above memory trace output shows the kcrfw_wait_for_fg_copy function looking up the redo copy latch memory address at kcrfw_wait_for_fg_copy offset 132, looking up the redo copy support memory area at offset 142, and then peeking into the redo copy helper memory structure at function offsets 161 and 169. This way it does not need to obtain the redo copy latch. At offset 195 of the function it looks up the number of redo copy latches, and then executes the same procedure again, looking up the same pointers in kcrfsg_, but adding an offset for the next latch.

Please mind that there is no need for the log writer to take a redo copy latch; in fact, it does not seem to care whether a redo copy latch is taken or not. It simply waits for each redo strand writer to indicate it is done writing, by waiting for the redo copy latch helper memory structure for each strand to become 0xffffffffffffffff. Once that value is there, the change vectors in the public strand buffers it intends to write (meaning those with an SCN lower than the LWN SCN) are no longer being written to, because a process that wants to write into a redo strand must change the redo copy latch helper memory structure to another value, and revert it to 0xffffffffffffffff once finished.
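
As a thought experiment, the checking logic described above can be summarised in a small C sketch. This is purely my interpretation of the memory trace, with invented names; it is not Oracle's implementation.

#include <stdint.h>
#include <stddef.h>

#define STRAND_IDLE 0xffffffffffffffffULL  /* value a strand writer restores when it is done */

/* Sketch: poll the redo copy latch helper slots until every writer that was
   busy copying into a public strand has flagged itself as finished.
   nr_latches corresponds to the count at kcrfsg_+464, helper[] to the
   support memory area pointed to by kcrfsg_+488. */
static void wait_for_fg_copy_sketch(volatile uint64_t *helper, size_t nr_latches)
{
  for (size_t i = 0; i < nr_latches; i++) {
    /* no latch get: just watch the helper slot belonging to redo copy latch i */
    while (helper[i] != STRAND_IDLE)
      ;  /* a real implementation would obviously back off or yield here */
  }
}

int main(void)
{
  uint64_t helper[8];                                   /* 8 redo copy latches, as in the trace */
  for (int i = 0; i < 8; i++) helper[i] = STRAND_IDLE;  /* all writers idle */
  wait_for_fg_copy_sketch(helper, 8);
  return 0;
}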

The next thing the logwriter does when it truly needs to write, is update the real redo SCN. The real redo SCN is the SCN value that contains the highest SCN of the change vectors that are truly written to the online redolog files. Remember, the ‘on disk SCN’ is not the SCN of a change that is truly written. The real redo SCN is another SCN value that is stored in the kcrfsg_ struct, in my database at offset 288. This is done in the function kcrfw_update_and_bcast_real_redo_scn, which uses the kcsaj8 function to execute the change of the SCN. This does not require a latch get:

kcsaj8+2:0x00007fffa1072950():W:8:0x7fffa1072970/140735894989168()
kcsaj8+13:0x00007fffa1072960():R:8:0x762b80/7744384()
kcsaj8+16:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):R:8:0x762b7d/7744381()
kcsaj8+22:0x00007f520ba98f40():R:8:0x7f520ba98f40/139990359707456()
kcsaj8+40:0x00007fffa1072900():W:8:0x7fffa1072960/140735894989152()
kcsaj8+44:0x00007fffa1072908():W:8:0/0()
kcsaj8+48:0x00007fffa1072910():W:8:0/0()
kcsaj8+52:0x00007fffa1072918():W:4:0/0()
kcsaj8+56:0x00007fffa1072938():W:8:0x7f520ba98f40/139990359707456()
kcsaj8+63:0x00007fffa1072930():W:8:0/0()
kcsaj8+70:0x00007fffa1072928():W:8:0/0()
kcsaj8+74:0x00007fffa1072920():W:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+81:0x00007fffa1072940():W:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+88:0x00007fffa1072828():W:4:0/0()
kcsaj8+98:0x00007fffa107285a():W:1:0/0()
kcsaj8+105:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):R:8:0x762b7d/7744381()
kcsaj8+105:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):W:8:0x762b80/7744384()
kcsaj8+120:0x00007fffa1072828():R:4:0/0()
kcsaj8+138:0x00007fffa1072908():R:8:0/0()
kcsaj8+145:0x00007fffa1072900():R:8:0x7fffa1072960/140735894989152()
kcsaj8+152:0x00007fffa1072910():R:8:0/0()
kcsaj8+159:0x00007fffa1072918():R:4:0/0()
kcsaj8+166:0x00007fffa1072938():R:8:0x7f520ba98f40/139990359707456()
kcsaj8+170:0x00007fffa1072930():R:8:0/0()
kcsaj8+174:0x00007fffa1072928():R:8:0/0()
kcsaj8+178:0x00007fffa1072920():R:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+182:0x00007fffa1072940():R:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+202:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsaj8+219:0x00007fffa1072950():R:8:0x7fffa1072970/140735894989168()
kcsaj8+220:0x00007fffa1072958():R:8:0x3208da6/52465062()

The real redo SCN value appears to be the value that was gathered by the kcsnew3 function at the beginning of the write cycle; however, I am not sure here. It might be the highest SCN in the public strands, although it would make sense to limit the current write to the LWN SCN that was obtained at the start, which means it would be the highest SCN in the strand buffers it picked for writing, while additional buffers might have been written into with newer, higher SCNs.

What happens next, is the logwriter needs to prepare the public redo strand buffers to be written to disk:

 | | > kcrfw_update_redo_blk_hdrs(0x77eced90, 0, ...)
 | | | > _intel_fast_memset(0xda31669c, 0, ...)
 | | | <> _intel_fast_memset.J(0xda31669c, 0, ...)
 | | | <> __intel_new_memset(0xda31669c, 0, ...)
 | | | < __intel_new_memset+0x0000000005c0 returns: 0xda31669c
 | | < kcrfw_update_redo_blk_hdrs+0x000000000363 returns: 0x77eced90
 | | > kcrfw_update_strand_hdrs(0x7ffc3dd11b38, 0x77eced90, ...)
 | | < kcrfw_update_strand_hdrs+0x000000000222 returns: 0x77eced90
 | | > kcrfw_cal_target_rba(0x5, 0xda316650, ...)
 | | | > kslgetl(0x6004b780, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | <> kslfre(0x6004b780, 0x1, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > skgstmGetEpochTs(0x6004b780, 0x1, ...)
 | | | > gettimeofday@plt(0x7ffc3dd118e0, 0, ...)
 | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | < skgstmGetEpochTs+0x000000000049 returns: 0x2050c6a4e2e57
 | | > kcrfw_do_write(0x77eced90, 0x1, ...)

The last line, which is a call to the kcrfw_do_write function, is the logwriter function inside kcrfw_redo_write that actually calls the disk IO functions in ksfd. I’ll walk quickly through these because executing IO is not the focus of this blogpost.

At this point the memory areas are set up and the IO is prepared.

This is probably calculating the request size:

 | | | > ksfdgrsz(0x1, 0, ...)
 | | | | > skgfrgsz(0x3, 0, ...)
 | | | | < skgfrgsz+0x00000000002d returns: 0xa0
 | | | < ksfdgrsz+0x00000000001a returns: 0x288

This is probably setting up the asynchronous IO information and requests:

 | | | > ksfd_aio_setobjinfo(0x71717da0, 0x76b423e8, ...)
 | | | < ksfd_aio_setobjinfo+0x00000000003c returns: 0
 | | | > ksfdrqfill(0x71717da0, 0x76b423e8, ...)
 | | | | > kskiorm(0x71717da0, 0, ...)
 | | | | < kskiorm+0x00000000001e returns: 0
 | | | < ksfdrqfill+0x00000000021b returns: 0x102

The main function for performing IO is ksfdbio (kernel system file disk block IO):

 | | | > ksfdbio(0x7f2c1e502000, 0x1, ...)
 ...
 | | | | > ksfd_update_iostatsbytes(0, 0x76b76930, ...)
 ...
 | | | | > kslwtbctx(0x7ffc3dd11700, 0x42, ...)
 ...
 | | | | < kslwtbctx+0x00000000050b returns: 0x7a823e58
 | | | | > ksfd_sbio(0x7f2c1e502028, 0x1, ...)
 ...
 | | | | < ksfd_sbio+0x00000000035b returns: 0x1
 | | | | > kslwtectx(0x7ffc3dd11700, 0x200, ...)
 ...
 | | | | < kslwtectx+0x0000000003f9 returns: 0xe923
 | | | | > ksfdbio_updatestats(0xb6b1, 0, ...)
 ...
 | | | < ksfdbio+0x000000000507 returns: 0x7f2c23b04040

This is ksfd_update_iostatsbytes for statistics, kslwtbctx for registering a wait, kslwtectx for ending the wait, ksfdbio_updatestats for statistics about the IO just performed, and in between kslwtbctx and kslwtectx ksfd_sbio performs the IO.
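
To make the kslwtbctx/kslwtectx pair more tangible: it is essentially a timing bracket around the IO call. As an illustration only (this is ordinary Linux code, not Oracle's), timing a single write that way could look like this:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
  char buf[512];
  struct timespec t0, t1;
  memset(buf, 0, sizeof(buf));

  int fd = open("/dev/null", O_WRONLY);
  if (fd < 0) { perror("open"); return 1; }

  clock_gettime(CLOCK_MONOTONIC, &t0);      /* 'begin wait', like kslwtbctx  */
  ssize_t n = write(fd, buf, sizeof(buf));  /* the actual IO, like ksfd_sbio */
  clock_gettime(CLOCK_MONOTONIC, &t1);      /* 'end wait', like kslwtectx    */

  long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_nsec - t0.tv_nsec) / 1000L;
  printf("wrote %zd bytes, elapsed %ld us\n", n, us);
  close(fd);
  return 0;
}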

If you look in ksfd_sbio, you’ll see the true IO steps:

 | | | | > ksfd_sbio(0x7f2c1e502028, 0x1, ...)
 | | | | | > ksfd_osdrqfil(0x71717da0, 0x76b423e8, ...)
 | | | | | < ksfd_osdrqfil+0x00000000014c returns: 0x1
 | | | | | > ksfd_addtoissueq(0x71717da0, 0x200, ...)
 | | | | | < ksfd_addtoissueq+0x0000000001ac returns: 0x7a6032e8
 | | | | | > ksfdgo(0x500, 0x81e, ...)
 ...
 | | | | | < ksfdgo+0x000000000135 returns: 0
 | | | | | > ksfdwtio(0x1, 0x500, ...)
 ...
 | | | | | < ksfdwtio+0x000000000176 returns: 0x1
 | | | | | > ksfd_chkio(0x71717da0, 0x9, ...)
 | | | | | | > ksfd_update_iostatstime(0x76b76930, 0x200, ...)
 | | | | | | < ksfd_update_iostatstime+0x000000000103 returns: 0x76c70388
 | | | | | < ksfd_chkio+0x000000000126 returns: 0x1
 | | | | < ksfd_sbio+0x00000000035b returns: 0x1

The ksfd_osdrqfil function probably creates the IO requests, and ksfd_addtoissueq looks like a function that puts these requests in a queue. ksfdgo executes the IO requests, essentially calling io_submit to submit the write requests, and ksfdwtio essentially calls io_getevents to look for the IO results.
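
ksfdgo and ksfdwtio map onto the Linux native asynchronous IO interface here. Below is a minimal, standalone libaio example that submits one write and reaps its result; it is generic Linux code to illustrate io_submit and io_getevents (compile with gcc and -laio), not Oracle code, and the filename is made up:

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
  io_context_t ctx = 0;
  struct iocb cb, *cbs[1] = { &cb };
  struct io_event ev;
  char buf[512];
  memset(buf, 0x2a, sizeof(buf));

  int fd = open("aio_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) { perror("open"); return 1; }

  if (io_setup(32, &ctx) < 0) {                  /* create an AIO context */
    fprintf(stderr, "io_setup failed\n"); return 1;
  }
  io_prep_pwrite(&cb, fd, buf, sizeof(buf), 0);  /* prepare one 512 byte write at offset 0 */

  if (io_submit(ctx, 1, cbs) != 1) {             /* hand the request to the kernel, roughly what ksfdgo does */
    fprintf(stderr, "io_submit failed\n"); return 1;
  }
  if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { /* reap the completion, roughly what ksfdwtio does */
    fprintf(stderr, "io_getevents failed\n"); return 1;
  }
  printf("asynchronous write completed, result: %ld\n", (long)ev.res);

  io_destroy(ctx);
  close(fd);
  return 0;
}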

After kcrfw_do_write is done, kcrfw_post is executed, just like we saw with the kcrfw_do_null_write function.

Also identical to the null write, the current on disk SCN is read from kcrfsg_ using kcscu8.

Then a series of latch gets is executed to protect changes in the SGA to reflect the write that was executed.
First the redo writing latch is taken. After the redo writing latch is obtained, kcrfsg_+12, alias X$KCRFWS.LAST_BLK, is read and updated to reflect the write:

kcrfw_post+715:0x0000000060016234(fixed sga|var:kcrfsg_+12 shared pool|X$KCRFWS.LAST_BLK+0 shared pool|x_kcrfws.last_blk+0 ):R:4:0x16a15/92693()
kcrfw_post+725:0x00000000600163c0(fixed sga|var:kcrfsg_+408 ):R:4:0xf5/245()
kcrfw_post+736:0x00007fffa1072958():R:8:0/0()
kcrfw_post+743:0x00007fffa1072880():W:4:0xf5/245()
kcrfw_post+753:0x00007fffa1072940():R:4:0xdf7/3575()
kcrfw_post+767:0x0000000060016234(fixed sga|var:kcrfsg_+12 shared pool|X$KCRFWS.LAST_BLK+0 shared pool|x_kcrfws.last_blk+0 ):W:4:0x16a17/92695()

While still holding the redo writing latch, the redo allocation latch for the correct public redo strand is taken, and MEM_BUFS_AVAILABLE_KCRFA_CLN and AVAILABLE_BUFS_KCRFA_CLN are updated:

kcrfw_post+803:0x0000000076fdf3bc(shared pool|KCRFA+52 shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0xde2f/56879()
kcrfw_post+803:0x0000000076fdf3bc(shared pool|KCRFA+52 shared pool|permanent memor,duration 1,cls perm+16114620 ):W:4:0xde31/56881()
MEM_BUFS_AVAILABLE_KCRFA_CLN
kcrfw_post+3839:0x0000000076fdf3b8(shared pool|KCRFA+48 shared pool|permanent memor,duration 1,cls perm+16114616 ):R:4:0xd7d0/55248()
kcrfw_post+3839:0x0000000076fdf3b8(shared pool|KCRFA+48 shared pool|permanent memor,duration 1,cls perm+16114616 ):W:4:0xd7d0/55248()
AVAILABLE_BUFS_KCRFA_CLN

Then the redo allocation latch is freed, and the redo writing latch is freed.

Then the redo writing latch is taken again, in order to update the last write SCN in kcrfsg_:

ub8_to_kscn_impl+21:0x00000000600162c0(fixed sga|var:kcrfsg_+152 shared pool|X$KCRFWS.LAST_WRITE_SCN+0 shared pool|x_kcrfws.last_write_scn+0 ):W:4:0x762b81/7744385()
ub8_to_kscn_impl+37:0x00000000600162c6(fixed sga|var:kcrfsg_+158 shared pool|X$KCRFWS.LAST_WRITE_SCN+6 ):W:2:0/0()
ub8_to_kscn_impl+41:0x00000000600162c4(fixed sga|var:kcrfsg_+156 shared pool|X$KCRFWS.LAST_WRITE_SCN+4 ):W:2:0/0()

And while holding the redo writing latch, the kcsnew_rba function is called to set the redo byte address data in kcrfsg_. In order to do this, the ‘consistent RBA’ latch is obtained first:

kslgetl+138:0x0000000060049c80(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 ):R:8:0/0()
kslgetl+138:0x0000000060049c80(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 ):W:8:0x13/19()

And then fields in the kcrfsg_ struct are read and updated:

kcsnew_rba+47:0x00000000600162d8(fixed sga|var:kcrfsg_+176 ):W:4:0x1/1()  // rba details being modified flag
kcsnew_rba+58:0x00000000600162d4(fixed sga|var:kcrfsg_+172 ):R:4:0x7e/126()
kcsnew_rba+58:0x00000000600162d4(fixed sga|var:kcrfsg_+172 ):W:4:0x7f/127()
kcsnew_rba+65:0x00007fffa1072880():R:8:0x16a18000000f5/398126288470261()
kcsnew_rba+69:0x00000000600162c8(fixed sga|var:kcrfsg_+160 ):W:8:0x16a18000000f5/398126288470261() // 164: rba current log block
kcsnew_rba+72:0x00007fffa1072888():R:4:0x10010/65552()
kcsnew_rba+76:0x00000000600162d0(fixed sga|var:kcrfsg_+168 ):W:4:0x10010/65552()
kcsnew_rba+83:0x00000000600162d8(fixed sga|var:kcrfsg_+176 ):W:4:0/0()  // rba details being modified flag
kcsnew_rba+91:0x00000000600162e0(fixed sga|var:kcrfsg_+184 shared pool|pointer:consistent rba latch+0 ):R:8:0x60049c80/1610914944(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 )
kcsnew_rba+95:0x00007fffa1072658():R:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsnew_rba+99:0x00007fffa1072650():R:8:0x7f520ba98f40/139990359707456()
kcsnew_rba+106:0x00007fffa1072660():R:8:0x7fffa1072970/140735894989168()

After the kcrfsg_ RBA update, the consistent RBA latch is freed. While still holding the redo writing latch, the last write SCN time is updated in kcrfsg_:

kcrfw_post+1355:0x0000000077eceda0(Variable Size(pgsz:2048k)|+384626080 shared pool|permanent memor,duration 1,cls perm+2451512 ):R:4:0x397df7b0/964556720()
kcrfw_post+1361:0x000000006001623c(Fixed Size(pgsz:2048k)|+90684 fixed sga|var:kcrfsg_+20 shared pool|X$KCRFWS.LAST_WRITE_SCN_TIME+0 ):W:4:0x397df7b0/964556720()

And the redo writing latch is released.

A little further in the function kcrfw_post, the kcsaj8 function for changing the on disk SCN to the LWN SCN is called, exactly like in the null write function:

 | | | > kcsaj8(0x60016290, 0x7ffc3dd118b8, ...)
 | | | < kcsaj8+0x0000000000dc returns: 0x1

After the adjustment of the on disk SCN, the foregrounds that have registered a wait via the post wait interface need to be posted. This is done in the kslpslf (kernel service lock management post list of foregrounds) function in the kcrfw_post function:

 | | | > kslpslf(0x97, 0x769a5730, ...)
 | | | | > ksl_postm_init(0x7ffc3dd094f0, 0x7ffc3dd11510, ...)
 | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | > kslgetl(0x7a5d7388, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > ksl_update_post_stats(0x7a619f48, 0x7ffc3dd11510, ...)
 | | | | | > dbgtTrcData_int(0x7f2c23b5e6c0, 0x2050031, ...)
 | | | | | | > dbgtBucketRedirect(0x7f2c23b5e6c0, 0x7ffc3dd093c8, ...)
 | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
 | | | | | | > dbgtIncInMemTrcRedirect(0x7f2c23b5e6c0, 0x6fa, ...)
 | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
 | | | | | | > skgstmGetEpochTs(0x7f2c23b5e6c0, 0x6fa, ...)
 | | | | | | | > gettimeofday@plt(0x7ffc3dd08ed0, 0, ...)
 | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x2050c7178c0c1
 | | | | | | > dbgtrRecAllocate(0x7f2c23b5e6c0, 0x7ffc3dd09370, ...)
 | | | | | | | > dbgtrPrepareWrite(0x7f2c23b5e6c0, 0x71121048, ...)
 | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x11
 | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
 | | | | | | > _intel_fast_memcpy(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | <> _intel_fast_memcpy.P(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | <> __intel_ssse3_rep_memcpy(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x71124368
 | | | | | | > dbgtrRecEndSegment(0x7f2c23b5e6c0, 0x7ffc3dd09370, ...)
 | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
 | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
 | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
 | | | | > kslfre(0x7a5d7388, 0, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | > ksl_postm_do_posts(0x7ffc3dd094f0, 0, ...)
 | | | | | > kslpwrp_int(0x7ffc3dd09450, 0x7f2c23b9e9a0, ...)
 | | | | | | > skgpwpost(0x7ffc3dd09350, 0x7f2c23ba3ca0, ...)
 | | | | | | <> sskgpwpost(0x7ffc3dd09350, 0x7f2c23ba3ca0, ...)
 | | | | | | | > semop@plt(0x38000, 0x7ffc3dd07310, ...)
 | | | | | | | < semop+0x00000000000f returns: 0
 | | | | | | < sskgpwpost+0x00000000009a returns: 0x1
 | | | | | < kslpwrp_int+0x000000000069 returns: 0x1
 | | | | | > kgecss(0x7f2c23b9e9a0, 0x7f2c1e54b048, ...)
 | | | | | < kgecss+0x000000000035 returns: 0x7f2c23be0f40
 | | | | < ksl_postm_do_posts+0x000000000144 returns: 0x1
 | | | < kslpslf+0x000000000822 returns: 0x7a5e26c8

Essentially, what happens here is that the posting of a message is initialised (ksl_postm_init; kernel service locking post multi init), although most of the initialisation from a memory perspective seems to happen in the kslpslf function. Then the post/wait queue latch is obtained, and the kslpslf function manipulates the post/wait queue (which seems to be multiple memory areas linked together with pointers) to reflect the process it is going to semctl/semop.
Then in the function ksl_update_post_stats, some housekeeping is done:

ksl_update_post_stats+117:0x000000007a603758(shared pool|permanent memor,duration 1,cls perm+5465944 ):W:8:0x7a612b58/2053188440(Variable Size(pgsz:2048k)|+425798488 shared pool|(indx:34)X$KSUPR+0 shared pool|permanent memor,duration 1,cls perm+5528408 )
ksl_update_post_stats+124:0x000000007a603748(shared pool|(indx:19)X$KSUPR.KSLLAPSC+0 shared pool|permanent memor,duration 1,cls perm+5465928 ):R:4:0x1e/30()
ksl_update_post_stats+134:0x000000007a603748(shared pool|(indx:19)X$KSUPR.KSLLAPSC+0 shared pool|permanent memor,duration 1,cls perm+5465928 ):W:4:0x1f/31()
ksl_update_post_stats+141:0x000000007a603744(shared pool|(indx:19)X$KSUPR.KSLLAPSN+0 shared pool|permanent memor,duration 1,cls perm+5465924 ):W:4:0x3/3()
ksl_update_post_stats+2705:0x00007fffa1072590():R:8:0x97/151()
ksl_update_post_stats+2708:0x000000007a612f80(shared pool|(indx:34)X$KSUPR.KSLLID1R+0 shared pool|permanent memor,duration 1,cls perm+5529472 ):W:8:0x97/151()
ksl_update_post_stats+2716:0x00007fffa1072598():R:8:0/0()
ksl_update_post_stats+2720:0x000000007a612f88(shared pool|(indx:34)X$KSUPR.KSLLID2R+0 shared pool|permanent memor,duration 1,cls perm+5529480 ):W:8:0/0()
ksl_update_post_stats+2728:0x00007fffa10725a0():R:8:0x7fff00005645/140733193410117()
ksl_update_post_stats+2732:0x000000007a612f90(shared pool|(indx:34)X$KSUPR.KSLLRTYP+0 shared pool|permanent memor,duration 1,cls perm+5529488 ):W:8:0x7fff00005645/140733193410117()
ksl_update_post_stats+202:0x000000007a612fc0(shared pool|permanent memor,duration 1,cls perm+5529536 ):W:8:0x7a6032e8/2053124840(Variable Size(pgsz:2048k)|+425734888 shared pool|(indx:19)X$KSUPR+0 shared pool|permanent memor,duration 1,cls perm+5464808 )
ksl_update_post_stats+209:0x000000007a612fbc(shared pool|(indx:34)X$KSUPR.KSLLAPRC+0 shared pool|permanent memor,duration 1,cls perm+5529532 ):R:4:0x2/2()
ksl_update_post_stats+209:0x000000007a612fbc(shared pool|(indx:34)X$KSUPR.KSLLAPRC+0 shared pool|permanent memor,duration 1,cls perm+5529532 ):W:4:0x3/3()
ksl_update_post_stats+216:0x000000007a612fb0(shared pool|(indx:34)X$KSUPR.KSLLAPRV+0 shared pool|permanent memor,duration 1,cls perm+5529520 ):W:4:0x3/3()

Of course x$ksupr is the kernel service user process structure, which is the base table for v$process. However, there are a lot of columns in x$ksupr that are not externalised in v$process. The indx number of the logwriter is 19, and the indx number of the process that is going to be posted is 34. One of the fields that is not externalised is ksllapsc, which is the Posts Send Count. Because the logwriter is going to post the foreground session, its posts send count is increased by one. ksllapsn is the last Post SeNt location number, which is set to 3.

This number corresponds with x$kslpo:

SQL> select indx, ksllwnam, kslposts from x$kslpo where indx = 3;
      INDX KSLLWNAM               KSLPOSTS
---------- -------------------- ----------
         3 kslpsr                     4356

(kernel service locking post receive (?))

The foreground process statistics in x$ksupr are changed too: ksllaprc, the Posts Received Count, is increased by one, because the logwriter is going to send a message to the foreground process. ksllaprv is the last Post ReceiVed location number, which is also set to 3.

After that, some tracing and other housekeeping takes place in dbgtTrcData_int, after which the post/wait queue latch is freed. The next function performs the actual semop call: ksl_postm_do_posts (kernel service locking, post multi do posts). In ksl_postm_do_posts, the function kslpwrp_int is called, which looks like a function that essentially calls skgpwpost (system kernel generic post wait post); if there are multiple sessions to be posted, it calls the function skgpwvectorpost instead, which iterates through the list of processes to be posted.
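
The actual post thus boils down to a semop against the semaphore the foreground process is sleeping on. A bare-bones illustration of such a post, using generic System V semaphore code (not Oracle's), is:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
  /* create a private semaphore set with one semaphore (on Linux a new set starts at 0) */
  int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
  if (semid < 0) { perror("semget"); return 1; }

  /* this is the 'post': increment semaphore 0 by one, which would wake a sleeper */
  struct sembuf post = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
  if (semop(semid, &post, 1) < 0) { perror("semop"); return 1; }

  printf("posted semaphore set %d\n", semid);
  semctl(semid, 0, IPC_RMID);  /* clean up the set again */
  return 0;
}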

At this point the kcrfw_redo_write_driver function is done.

This is the fourth blogpost in a series of blogposts about how the Oracle database handles redo. The previous blogpost talked about the work cycle of the log writer: https://fritshoogland.wordpress.com/2018/02/12/a-look-into-oracle-redo-part-3-the-log-writer-work-cycle-overview/. This post looks into the execution of the kcrfw_redo_write_driver function, which is executed in the ksbcti function.

Now that we are familiar with how the logwriter works in general, we need to take a closer look at the kcrfw_redo_write_driver function. First let me be clear: the kcrfw_redo_write_driver function inside ksbcti is called every time the logwriter process times out on its semaphore, which is every 3 seconds, so it is also called when nothing has been written into the public redo strands. In fact, when the log writer times out on semtimedop (which means it was not (yet) semctl’ed or semop’ed), it doesn’t know whether there are entries in the public redo strands at that point in time.

There are a lot of things going on in the kcrfw_redo_write_driver function. The following traces are small snippets of execution from the function, not every detail of it.

The first thing of interest is the logwriter obtaining an LWN (log write number) and an SCN.

 | | > kcsnew3(0x600113b8, 0x7ffc3dd10d28, ...)
 | | | > kcsnew8(0x600113b8, 0x7ffc3dd10b70, ...)
 | | | | > kslgetl(0x60049800, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > kslfre(0x60049800, 0x1, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | < kcsnew8+0x000000000117 returns: 0

Please mind at this time it’s unknown if there is anything to write in the public redo strands.

So the kcsnew3 function is executed, with 0x600113b8 as first argument. That is the address of kcsgscn_, the instance global SCN. The kcsnew3 function calls the kcsnew8 function, which calls kslgetl for latch address 0x60049800. That address belongs to the ‘lgwr LWN SCN’ latch in my instance. In the function call overview there is nothing more to see. If we look at the pinatrace memory trace, we see a lot more; this is what is executed in between kslgetl and kslfre:

kcsnew8+107:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0x1/1()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):R:4:0xc5/197()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):W:4:0xc6/198()
kcsnew8+125:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b7e/7744382()
kcsnew8+131:0x00007fffa1071ae0():W:4:0/0()
kcsnew8+140:0x00007fffa1071b12():W:1:0/0()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b7e/7744382()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):W:8:0x762b7f/7744383()
kcsnew8+165:0x00007fffa1071ae0():R:4:0/0()
kcsnew8+184:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsnew8+193:0x00007f520ba8ae90():R:8:0x7a823620/2055353888(shared pool|permanent memor,duration 1,cls perm+769 )
kcsnew8+212:0x000000007a823e40(shared pool|permanent memor,duration 1,cls perm+7695936 ):R:8:0x719beed8/190604(shared pool|ksu:stats_freel,duration 1,cls freeabl+24 )
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):R:8:0xc6/198()
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):W:8:0xc7/199()
kcsnew8+223:0x00007fffa1071bf0():W:8:0x762b7f/7744383()
kcsnew8+232:0x0000000060016260(fixed sga|var:kcrfsg_+56 shared pool|x_kcrfws.lwn_scn+0 ):W:8:0x762b7f/7744383()
kcsnew8+238:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0/0()
kcsnew8+246:0x0000000060016288(fixed sga|var:kcrfsg_+96 shared pool|pointer:lgwr lwn scn latch+0 ):R:8:0x60049800/1610913792(fixed sga|(parent)latch:lgwr LWN SCN+0 fixed sga|var:lwn_scn_lat_+0 )
kcsnew8+250:0x00007fffa1071ac8():W:8:0x10e3ed3f/283372863()

What is visible here is that at kcrfsg_ offset 76, ‘1’ is written. This is quite probably a flag indicating the LWN related entries are being modified (this is a guess; of course the LWN SCN latch essentially performs the same function), especially since it is set back to ‘0’ at the end of the function (kcsnew8+238). At kcrfsg_ offset 72 a number is read and increased by 1. This is most likely the LWN number. If you take a look at the number of gets for the ‘lgwr LWN SCN’ latch, you will see the number of gets is very close (albeit not equal) to the LWN number.

Then the LWN SCN is fetched from kcsgscn_: it is first read, increased by one to 7744383, and written back into kcsgscn_ (get and advance). A little further the SCN from kcsgscn_ is written into the kcrfsg_ struct at offset 56 as well, after which the earlier set ‘1’ is reset to ‘0’ at offset 76. This SCN is visible in x$kcrfws as LWN SCN.
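
In other words, under protection of the lgwr LWN SCN latch the global SCN is read, advanced, and the result is stored as the LWN SCN. A small sketch of that 'get and advance' step, with invented variable names and the values from the trace, purely as an illustration:

#include <stdio.h>
#include <stdint.h>

/* Illustrative stand-ins for kcsgscn_ (global SCN) and kcrfsg_ fields. */
static uint64_t kcsgscn    = 7744382;  /* 0x762b7e, as read in the trace      */
static uint64_t lwn_scn;               /* kcrfsg_+56, x$kcrfws.lwn_scn        */
static uint32_t lwn_number = 198;      /* kcrfsg_+72, incremented per LWN     */

/* Sketch of the work done between kslgetl and kslfre on the lgwr LWN SCN latch. */
static void new_lwn_sketch(void)
{
  lwn_number += 1;        /* 198 -> 199                                        */
  kcsgscn    += 1;        /* get and advance: 7744382 -> 7744383               */
  lwn_scn     = kcsgscn;  /* the advanced SCN becomes the LWN SCN              */
}

int main(void)
{
  new_lwn_sketch();
  printf("LWN %u, LWN SCN %llu\n", (unsigned)lwn_number, (unsigned long long)lwn_scn);
  return 0;
}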

The next thing to look at is the function kcrfw_gather_lwn a little further:

 | | > kcrfw_gather_lwn(0x7ffc3dd10d68, 0x77eced90, ...)
 | | | > kslgetl(0x76fe0c10, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7ffc3dd10d68, 0, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x76fe0c10, 0x118b2, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | | > kcrfw_gather_strand(0x7ffc3dd10d68, 0x1, ...)
 | | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | | > kslfre(0x76fe0cb0, 0x19, ...)
 | | | < kslfre+0x0000000001e2 returns: 0
 | | < kcrfw_gather_lwn+0x00000000065c returns: 0xffffffff

In the kcrfw_gather_lwn function, first the latch at address 0x76fe0c10 is taken in immediate mode, which is redo allocation child #1, then the function kcrfw_gather_strand is called, which returns 0, after which the latch is freed. This pattern repeats itself with a different latch address, 0x76fe0cb0, which is redo allocation child #2. This is a hint that two public strands are active. Looking at the memory access trace again, a fixed SGA variable is used to find out how many strands are active:

7038:kcrfw_gather_lwn+76:0x00000000600169b8(Fixed Size(pgsz:2048k)|+92600 fixed sga|var:kcrf_max_strands_+0 ):R:4:0x2/2()

A little further, information is obtained from kcrfsg_ about the redo allocation latch and the KCRFA structure address:

7052:kcrfw_gather_lwn+223:0x0000000060016570(fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
7053:kcrfw_gather_lwn+230:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )

Then the redo allocation latch child #1 is obtained via kslgetl, after which kcrfw_gather_lwn calls kcrfw_gather_strand:

kcrfw_gather_strand+2:0x00007fffa1071b00():W:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1071af8():W:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+29:0x00007fffa1071af0():W:8:0x76fdf2b0/1996354224(shared pool|permanent memor,duration 1,cls perm+16114352 )
kcrfw_gather_strand+36:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_strand+53:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1071de8():R:8:0x762b7f/7744383()
kcrfw_gather_strand+79:0x0000000076fdf390(shared pool|KCRFA+8 shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x762b7c/7744380()
kcrfw_gather_strand+86:0x0000000076fdf3a8(shared pool|KCRFA+32 shared pool|NEXT_BUF_NUM_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x65e/1630()
kcrfw_gather_strand+121:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):R:4:0x65e/1630()
kcrfw_gather_strand+169:0x00007fffa1071bf8():W:4:0x65e/1630()
kcrfw_gather_strand+176:0x00007fffa1071bfc():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1071af0():R:8:0x76fdf2b0/1996354224(shared pool|permanent memor,duration 1,cls perm+16114352 )
kcrfw_gather_strand+186:0x00007fffa1071af8():R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+193:0x00007fffa1071b00():R:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+194:0x00007fffa1071b08():R:8:0x31f0f00/52367104()

In kcrfw_gather_strand, we see the sequence we have seen before: at offset 0 of kcrfsg_, the KCRFA struct address is looked up, after which the function investigates addresses in the KCRFA struct for the first public redo strand only. I’ve annotated these in the pinatrace snippet above. At offset 36 in the function, the last buffer written to the online redolog file is obtained from offset 208 in KCRFA, which is public strand buffer 1630. A little further, at offsets 86 and 121 in the function kcrfw_gather_strand, the most recently used public strand buffer that has been written to is obtained at KCRFA offset 32 (NEXT_BUF_NUM_KCRFA_CLN), and the highest buffer number that has been investigated (gathered) by the logwriter process at KCRFA offset 212 (LAST_BUF_GATHERED_KCRFA). In all these cases, the public redo strand buffer number is 1630. This means no redo has been written into the first public redo strand.
After reading the KCRFA information for the first public redo strand, the latch for this struct is freed.

A little further, the latch is obtained that protects the second public strand, and kcrfw_gather_strand is executed again to read the KCRFA information for this strand:

kcrfw_gather_strand+2:0x00007fffa1071b00():W:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1071af8():W:8:0x76fdf4b0/1996354736(shared pool|KCRFA+296 shared pool|permanent memor,duration 1,cls perm+16114864 )
kcrfw_gather_strand+29:0x00007fffa1071af0():W:8:0x76fdf1f0/1996354032(shared pool|permanent memor,duration 1,cls perm+16114160 )
kcrfw_gather_strand+36:0x0000000076fdf580(shared pool|KCRFA+504 shared pool|permanent memor,duration 1,cls perm+16115072 ):R:4:0x1d/29()
kcrfw_gather_strand+53:0x0000000076fdf5a8(shared pool|KCRFA+544 shared pool|permanent memor,duration 1,cls perm+16115112 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1071de8():R:8:0x762b7f/7744383()
kcrfw_gather_strand+79:0x0000000076fdf4b8(shared pool|KCRFA+304 shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x762ae7/7744231()
kcrfw_gather_strand+86:0x0000000076fdf4d0(shared pool|KCRFA+328 shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0x1d/29()
kcrfw_gather_strand+121:0x0000000076fdf584(shared pool|KCRFA+508 shared pool|permanent memor,duration 1,cls perm+16115076 ):R:4:0x1d/29()
kcrfw_gather_strand+169:0x00007fffa1071bf8():W:4:0x1d/29()
kcrfw_gather_strand+176:0x00007fffa1071bfc():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1071af0():R:8:0x76fdf1f0/1996354032(shared pool|permanent memor,duration 1,cls perm+16114160 )
kcrfw_gather_strand+186:0x00007fffa1071af8():R:8:0x76fdf4b0/1996354736(shared pool|KCRFA+296 shared pool|permanent memor,duration 1,cls perm+16114864 )
kcrfw_gather_strand+193:0x00007fffa1071b00():R:8:0x7fffa1071c00/140735894985728()
kcrfw_gather_strand+194:0x00007fffa1071b08():R:8:0x31f0f00/52367104()

This is exactly the same function, just reading the same information for a different strand, which means at a different offset in KCRFA. If you subtract the size of the information for a single strand, which we calculated to be 296 in blogpost number 2, from the KCRFA offsets, you get the same numbers as above (504-296=208=LAST_BUF_WRITTEN). Once you understand what is read, it should be fairly easy to see that nothing has been written into the second strand either. So at this point, the logwriter knows there is nothing to write.
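
The 504-296=208 arithmetic becomes more intuitive if you think of KCRFA as an array of fixed-size per-strand entries of 296 bytes each. A small sketch of that addressing, using the addresses from the trace (the function and constant names are mine, not Oracle's):

#include <stdio.h>
#include <stdint.h>

#define KCRFA_STRAND_SIZE 296  /* per-strand entry size, as derived in part 2 */

/* Sketch: compute the address of a field for a given strand the same way the
   trace offsets work out: base + strand * 296 + field offset. */
static uintptr_t kcrfa_field_addr(uintptr_t kcrfa_base, int strand, int field_offset)
{
  return kcrfa_base + (uintptr_t)strand * KCRFA_STRAND_SIZE + field_offset;
}

int main(void)
{
  uintptr_t base = 0x76fdf388;  /* KCRFA base address from the trace */
  /* LAST_BUF_WRITTEN sits at per-strand offset 208:
     strand 0 -> KCRFA+208, strand 1 -> KCRFA+504 (208 + 296) */
  printf("strand 0: 0x%lx (KCRFA+208)\n", (unsigned long)kcrfa_field_addr(base, 0, 208));
  printf("strand 1: 0x%lx (KCRFA+504)\n", (unsigned long)kcrfa_field_addr(base, 1, 208));
  return 0;
}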

What does the logwriter do when there is nothing to write? It executes a ‘null write’:

 | | > kcrfw_do_null_write(0, 0, ...)
 | | | > kcrfw_slave_phase_batchdo(0, 0, ...)
 | | | | > kcrfw_slave_phase_enter(0, 0x40, ...)
 | | | | < kcrfw_slave_phase_enter+0x000000000449 returns: 0
 | | | <> kcrfw_slave_phase_exit(0, 0x40, ...)
 | | | < kcrfw_slave_phase_exit+0x00000000035a returns: 0
 | | | > kcrfw_post(0, 0, ...)
 | | | | > kcrfw_slave_single_getactivegroup(0, 0, ...)
 | | | | < kcrfw_slave_single_getactivegroup+0x000000000047 returns: 0x769a5620
 | | | | > kspGetInstType(0x1, 0x1, ...)
 | | | | | > vsnffe_internal(0x19, 0x1, ...)
 | | | | | | > vsnfprd(0x19, 0x1, ...)
 | | | | | | < vsnfprd+0x00000000000f returns: 0x8
 | | | | | | > kfIsASMOn(0x19, 0x1, ...)
 | | | | | | <> kfOsmInstanceSafe(0x19, 0x1, ...)
 | | | | | | < kfOsmInstanceSafe+0x000000000031 returns: 0
 | | | | | < vsnffe_internal+0x0000000000a7 returns: 0
 | | | | | > kspges(0x115, 0x1, ...)
 | | | | | < kspges+0x00000000010f returns: 0
 | | | | < kspGetInstType+0x0000000000b1 returns: 0x1
 | | | | > kcrfw_slave_phase_enter(0x1, 0x40, ...)
 | | | | < kcrfw_slave_phase_enter+0x00000000006f returns: 0x40
 | | | | > kcscu8(0x60016290, 0x7ffc3dd10a98, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcsaj8(0x60016290, 0x7ffc3dd10a38, ...)
 | | | | < kcsaj8+0x0000000000dc returns: 0x1
 | | | | > kcrfw_slave_phase_exit(0x1, 0x40, ...)
 | | | | < kcrfw_slave_phase_exit+0x00000000008e returns: 0
 | | | | > kslpsemf(0x97, 0, ...)
 | | | | | > ksl_postm_init(0x7ffc3dd08730, 0x7ffc3dd10750, ...)
 | | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | < kslpsemf+0x0000000006b5 returns: 0x1f
 | | | | > kcrfw_slave_barrier_nonmasterwait(0x769a5628, 0x4, ...)
 | | | | < kcrfw_slave_barrier_nonmasterwait+0x000000000035 returns: 0x600161a0
 | | | < kcrfw_post+0x000000000c1c returns: 0xd3
 | | < kcrfw_do_null_write+0x0000000000b2 returns: 0xd3

This means there is housekeeping to do, despite no public redo strand buffers needing writing. The most crucial things it does that I can detect are updating the on disk SCN in the kcrfsg_ struct with the already set LWN SCN, and posting any outstanding processes that are sleeping on a semaphore via post/wait. These are done in the kcrfw_post function, which probably means (kernel cache redo file write) ‘post write’; however, there was nothing to write (a null write), since there was nothing in the public redo strands. If you look at the above function trace, you see that inside kcrfw_do_null_write, kcrfw_post is called.

Inside kcrfw_post, kcscu8 (kernel cache service get current SCN) is called with 0x60016290 as first argument. That is the memory address of offset 104 in the kcrfsg_ struct in the fixed SGA, which is externalised as ON_DISK_SCN in X$KCRFWS; the call reads the current value of the on disk SCN:

kcscu8+2:0x00007fffa10718a0():W:8:0x7fffa1071bb0/140735894985648()
kcscu8+13:0x00007fffa1071898():W:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcscu8+20:0x00007fffa1071890():W:8:0x7f520ba98f40/139990359707456()
kcscu8+27:0x00000000600162a0(Fixed Size(pgsz:2048k)|+90784 fixed sga|var:kcrfsg_+120 ):R:4:0/0()
kcscu8+31:0x00000000600162a4(Fixed Size(pgsz:2048k)|+90788 fixed sga|var:kcrfsg_+124 ):R:4:0/0()
kcscu8+39:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):R:8:0x762b7e/7744382()
kcscu8+43:0x00000000600162a0(Fixed Size(pgsz:2048k)|+90784 fixed sga|var:kcrfsg_+120 ):R:4:0/0()
kcscu8+56:0x00007fffa1071b18():W:8:0x762b7e/7744382()
kcscu8+59:0x00007fffa1071898():R:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcscu8+63:0x00007fffa1071890():R:8:0x7f520ba98f40/139990359707456()
kcscu8+70:0x00007fffa10718a0():R:8:0x7fffa1071bb0/140735894985648()
kcscu8+71:0x00007fffa10718a8():R:8:0x3202150/52437328()

The next function that is called is kcsaj8 (kernel cache service adjust SCN), with 0x60016290 as first argument, indicating it is going to change the on disk SCN:

kcsaj8+105:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):R:8:0x762b7e/7744382()
kcsaj8+105:0x0000000060016290(Fixed Size(pgsz:2048k)|+90768 fixed sga|var:kcrfsg_+104 shared pool|x_kcrfws.on_disk_scn+0 ):W:8:0x762b7f/7744383()
(these are only the relevant lines from the memory trace for the function kcsaj8)

What this means is that the LWN SCN that was obtained at the beginning of the write cycle using the kcsnew3 function is now written into the on disk SCN too, despite nothing actually being written to disk. The obvious question at this point is: if the on disk SCN is increased without anything actually written, is there an SCN value that actually tracks the SCN of truly written (“on disk”) redo? To spoil the next blogpost about actual redo writing: this is administered in an SCN value called the ‘real redo SCN’.
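
So the effect of the null write on the SGA can be summarised as: read the on disk SCN (kcscu8), and move it up to the LWN SCN (kcsaj8). A tiny sketch of that effect, with invented names and without any claim about how kcsaj8 is really implemented:

#include <stdio.h>
#include <stdint.h>

static uint64_t on_disk_scn = 7744382;  /* kcrfsg_+104, x$kcrfws.on_disk_scn */
static uint64_t lwn_scn     = 7744383;  /* obtained earlier via kcsnew3      */

/* Sketch of the adjust step: only ever move the on disk SCN forward (my assumption). */
static void adjust_on_disk_scn_sketch(uint64_t new_scn)
{
  if (new_scn > on_disk_scn)
    on_disk_scn = new_scn;  /* 7744382 -> 7744383 in the trace */
}

int main(void)
{
  adjust_on_disk_scn_sketch(lwn_scn);
  printf("on disk SCN is now %llu\n", (unsigned long long)on_disk_scn);
  return 0;
}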

After kcrfw_post returns, the execution returns from kcrfw_do_null_write and kcrfw_redo_write_driver, ending the ‘null write’.

This is the third part of a series of blogposts on how the Oracle database handles redo. The previous part talked about the memory area that stores redo strand information: https://fritshoogland.wordpress.com/2018/02/05/a-look-into-oracle-redo-part-2-the-discovery-of-the-kcrfa-structure/.

The single most important process in the Oracle database for handling redo is the log writer, whose primary task is flushing the redo information that other Oracle database processes put in the public redo strands to disk. Now that we have investigated the public redo strands and concurrency (first part) and kcrfsg_ and the KCRFA structure (second part), it seems logical to me to look at the log writer.

Simply because we have seen the foreground process take the redo allocation latch, ‘fiddle around’ in the kcrfsg_ and KCRFA structures and release the latch, it is a pretty safe guess that the logwriter uses these structures to determine whether it needs to write. It looks like the kcrfsg_ structure contains global redo information, like redo log file information, and the KCRFA structure contains strand specific information.

First we need to get an understanding of what the logwriter process is actually doing. The investigation below is done to see how the idle and write work cycles work; I explicitly excluded additional tasks like redo log file switches.

First of all, quite surprisingly actually, there is little information about what the log writer is actually doing. The only well known features of the logwriter are that it writes the log buffer, instrumented by the ‘log file parallel write’ wait event, and that it sleeps on a semaphore in the wait event ‘rdbms ipc message’. In order to learn more, I’ve run a debugtrace on the logwriter process running idle, stripped excess information, and then captured only the calls one level from the stack depth the logwriter process is on in between the work cycles:

$ awk -n '$0 ~ /^\ \x3e\ /' lgwr_full_cycle_stripped.txt
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<
 > ksl_exit_main_loop_wait(0x9d64fb00c, 0x7, ...)
 > ksbcti(0x12b84c18, 0x7f2c1eb804b0, ...)          (nr 1)
 > dbktFlush(0, 0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)          (nr 2)
 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)          (nr 3)
 > sltrgatime64(0, 0, ...)
 > ksl_enter_main_loop_wait(0x1, 0x4ccef9975, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<
 > ksl_exit_main_loop_wait(0x9d67d8b01, 0x7, ...)
 > ksbcti(0x12b84c18, 0x7f2c1eb804b0, ...)          (nr 1)
 > dbktFlush(0, 0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)          (nr 2)
 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)          (nr 3)
 > sltrgatime64(0, 0, ...)
 > ksl_enter_main_loop_wait(0x1, 0x4cd05fb07, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)               <<<<

ksarcv means kernel service asynchronous receive message; this is the function that sets up the semaphore and puts the logwriter process to sleep. This also explains the ksl_exit_main_loop_wait and ksl_enter_main_loop_wait functions, which are called exactly before (enter) and after (exit) the ksarcv function. The dbktFlush function manages trace files; this function is used to write messages to the logwriter tracefile. The sltrgatime64 function is a function that calls clock_gettime. That leaves one function that can be seen in my example: ksbcti. ksbcti means kernel service background processes call timeout/interrupts. This is a function that performs several different functions based on the first argument.
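
To make the idle cycle more tangible: sleeping on a semaphore with a 3 second timeout, and telling the difference between being posted and timing out, boils down to semtimedop usage like the following generic Linux example (not Oracle code; run standalone it will simply time out after 3 seconds):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
  int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
  if (semid < 0) { perror("semget"); return 1; }

  struct sembuf wait_op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
  struct timespec timeout = { .tv_sec = 3, .tv_nsec = 0 };  /* the 3 second timeout */

  /* sleep until posted (semop/semctl by another process) or until 3 seconds pass */
  if (semtimedop(semid, &wait_op, 1, &timeout) == 0) {
    printf("posted: another process asked for work\n");
  } else if (errno == EAGAIN) {
    printf("timed out after 3 seconds: run the timeout work anyway\n");
  } else {
    perror("semtimedop");
  }

  semctl(semid, 0, IPC_RMID);  /* clean up the semaphore set */
  return 0;
}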

The first ksbcti call following ksl_exit_main_loop_wait (with first argument 0x12b84c18) seems to be the one that, among other things, administers the log writer process resource usage details in the SGA:

 | | > kews_timeout(0x7f2c1eb93ec0, 0, ...)
 | | | > sltrgftime64(0x7f2c1eb93ec0, 0, ...)
 | | | | > clock_gettime@plt(0x1, 0x7ffc3dd10ad0, ...)
 | | | | | > clock_gettime(0x1, 0x7ffc3dd10ad0, ...)
 | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | < clock_gettime+0x00000000003a returns: 0
 | | | < sltrgftime64+0x00000000004c returns: 0x9cb6163fa
 | | | > slcpums(0x7f2c23ba3c58, 0x191ae1d5, ...)
 | | | | > getrusage(0x1, 0x7ffc3dd10a30, ...)
 | | | | | > fthread_self(0x1, 0x7ffc3dd10a30, ...)
 | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | <> getrusage_ext(0x1, 0x7ffc3dd10a30, ...)
 | | | | < getrusage+0x00000000000f returns: 0
 | | | < slcpums+0x00000000008f returns: 0x31d592
 | | | > kews_sqlst_flush(0xc, 0x72fd7d40, ...)
 | | | | > sltrgftime64(0xc, 0x72fd7d40, ...)
 | | | | | > clock_gettime@plt(0x1, 0x7ffc3dd10a60, ...)
 | | | | | | > clock_gettime(0x1, 0x7ffc3dd10a60, ...)
 | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | < sltrgftime64+0x00000000004c returns: 0x9cb61723f
 | | | | > slcpums(0x7f2c23ba3c58, 0x7ae0ee70, ...)
 | | | | | > getrusage(0x1, 0x7ffc3dd109c0, ...)
 | | | | | | > fthread_self(0x1, 0x7ffc3dd109c0, ...)
 | | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | | <> getrusage_ext(0x1, 0x7ffc3dd109c0, ...)
 | | | | | < getrusage+0x00000000000f returns: 0
 | | | | < slcpums+0x00000000008f returns: 0x31e417
 | | | < kews_sqlst_flush+0x00000000016a returns: 0x7f2c23b9e900
 | | < kews_timeout+0x00000000052c returns: 0x7f2c23b9e900

kews_timeout means kernel event wait statistics, so statistics updated because of the process timing out. sltrgftime64 is an o/s specific call wrapper to read time from the operating system; it calls clock_gettime, which reads the system clock (argument 1 is CLOCK_MONOTONIC, a clock that is not affected by ‘discontinuous’ changes to the clock time; think of daylight saving time changes for example). Similarly, slcpums is an o/s specific call wrapper to read cpu usage; it calls getrusage to obtain CPU accounting data from the operating system. kews_sqlst_flush seems to be related to automatic tuning features; this *probably* (I haven’t investigated further) flushes status data like time and cpu to a buffer so features like sql monitor and adaptive features can use it.
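
Both wrappers map onto plain Linux calls. A small standalone example (ordinary Linux code, not Oracle's) that reads the monotonic clock and the process CPU usage the same way:

#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
  struct timespec ts;
  struct rusage ru;

  /* what sltrgftime64 effectively does: read a monotonic clock */
  clock_gettime(CLOCK_MONOTONIC, &ts);
  printf("monotonic time: %ld.%09ld s\n", (long)ts.tv_sec, (long)ts.tv_nsec);

  /* what slcpums effectively does: read CPU accounting for this process */
  getrusage(RUSAGE_SELF, &ru);
  printf("user cpu: %ld.%06ld s, system cpu: %ld.%06ld s\n",
         (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
         (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
  return 0;
}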

The second ksbcti function call (with first argument 0x12b84be0) is the main function that performs the actual redo write, which is done by the kcrfw_redo_write_driver function:

 > ksbcti(0x12b84be0, 0x7f2c1eb80208, ...)
 | > ksbckbast(0, 0, ...)
 | < ksbckbast+0x000000000061 returns: 0x1
 | > ksumcl(0, 0, ...)
 | < ksumcl+0x000000000045 returns: 0x1540
 | > kcrfw_redo_write_driver(0, 0, ...) 
...

The next blogpost will look into what the kcrfw_redo_write_driver function actually performs. To spoil it a bit: the kcrfw_redo_write_driver function is called every 3 seconds, even if there’s nothing to write!

The third ksbcti function call (with first argument 0x12b84bfc) performs a function or series of functions that are not clear to me, and not a lot is done inside the functions that are called in ksbcti to give me a hint:

 > ksbcti(0x12b84bfc, 0x7f2c1eb804b0, ...)
 | > ksbmsg(0, 0, ...)
 | < ksbmsg+0x000000000040 returns: 0x60065d90
 | > ksbxiaf(0, 0, ...)
 | < ksbxiaf+0x000000000058 returns: 0x795b5828
 | > kjci_action(0, 0, ...)
 | < kjci_action+0x000000000035 returns: 0x60065d90
 | > kcfsmpoll(0, 0, ...)
 | < kcfsmpoll+0x000000000030 returns: 0x60065d90
 | > ksuwaitsysevent(0, 0, ...)
 | < ksuwaitsysevent+0x0000000001e7 returns: 0x795b4aa8
 | > krdrsb_lgwr_chintr(0, 0, ...)
 | | > krsh_trace(0x2000, 0, ...)
 | | < krsh_trace+0x00000000005d returns: 0
 | < krdrsb_lgwr_chintr+0x00000000008b returns: 0
 | > ksbckbast(0, 0, ...)
 | < ksbckbast+0x000000000061 returns: 0x1
 < ksbcti+0x00000000018f returns: 0x1

Of course there are several things that point to features that are not used in my database; krdrsb points to dataguard and kjci points to RAC.

If you look again at the sequence, you might wonder why the logwriter is not executing the redo write function immediately after waking up from the semaphore in ksarcv, in order to be able to write as soon as a process semctl’s/semop’s the semaphore asking the logwriter to write. Well, actually there is another sequence of function calls that can happen:

 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > ksl_exit_main_loop_wait(0x6003ef40, 0x7af972b8, ...)
 > kcrfw_redo_write_driver(0x7ffc3dd122a0, 0x18, ...)         <<<<<
 > ksbcti(0x12b84c74, 0x7f2c1eb804b0, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > dbktFlush(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)
 > sltrgatime64(0x7f2c23b9e9a0, 0x7f2c23b9ebe8, ...)
 > ksl_enter_main_loop_wait(0x1, 0x7a6032e8, ...)
 > ksarcv(0x7ffc3dd12338, 0x7ffc3dd122a0, ...)

Here you see that kcrfw_redo_write_driver is called immediately after ksarcv and ksl_exit_main_loop_wait, not inside ksbcti. What happened in ksarcv is that the semtimedop call returned 0, indicating the semaphore was semctl’ed or semop’ed, instead of timing out after 3 seconds. If you paid close attention, you saw that the first argument of the ksbcti call following kcrfw_redo_write_driver is different. I am not sure why, because the same functions are called as with 0x12b84bfc as first argument (the ones hinting at dataguard and RAC). The ksarcv call that follows does not put the process to sleep on a semaphore; it reads the asynchronous messages structure in the fixed SGA (ksasga_), after which it returns immediately. What follows are dbktFlush and sltrgatime64 for administrative purposes (writing to the tracefile if necessary, and updating resource usage information), after which the process prepares the semaphore sleep using ksl_enter_main_loop_wait and then goes to sleep on a semaphore in ksarcv.

Please note this is the general cycle; there are other tasks that the log writer performs. One example of such a task is the log writer checking the process monitor (PMON) process to see if it is alive, which is done every minute.

This is the second post in a series of blogposts on Oracle database redo internals. If you landed on this blogpost without having read the first blogpost, here is a link to the first blogpost: https://fritshoogland.wordpress.com/2018/01/29/a-look-into-oracle-redo-part-1-redo-allocation-latches/ The first blogpost contains all the versions used and a synopsis on what the purpose of this series of blogposts is.

In the first part, I showed how the principal access to the public redo strands is controlled by redo allocation latches, and showed a snippet of trace information of memory accesses of a foreground session when using the first public redo strand:

kcrfw_redo_gen_ext+11277:0x0000000076fdf3bc(shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf3d8(shared pool|permanent memor,duration 1,cls perm+16114648 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf430(shared pool|permanent memor,duration 1,cls perm+16114736 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3d0(shared pool|permanent memor,duration 1,cls perm+16114640 ):R:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11312:0x0000000076fdf428(shared pool|permanent memor,duration 1,cls perm+16114728 ):W:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3c8(shared pool|permanent memor,duration 1,cls perm+16114632 ):R:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11322:0x0000000076fdf420(shared pool|permanent memor,duration 1,cls perm+16114720 ):W:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11327:0x0000000076fdf3c0(shared pool|permanent memor,duration 1,cls perm+16114624 ):R:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11332:0x0000000076fdf418(shared pool|permanent memor,duration 1,cls perm+16114712 ):W:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11337:0x0000000076fdf3b8(shared pool|permanent memor,duration 1,cls perm+16114616 ):R:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11342:0x0000000076fdf410(shared pool|permanent memor,duration 1,cls perm+16114704 ):W:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11347:0x0000000076fdf3b0(shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf408(shared pool|permanent memor,duration 1,cls perm+16114696 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:8:0x7270/29296()
kcrfw_redo_gen_ext+11312:0x0000000076fdf400(shared pool|permanent memor,duration 1,cls perm+16114688 ):W:8:0x7270/29296()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf398(shared pool|permanent memor,duration 1,cls perm+16114584 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf3f0(shared pool|permanent memor,duration 1,cls perm+16114672 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf390(shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11342:0x0000000076fdf3e8(shared pool|permanent memor,duration 1,cls perm+16114664 ):W:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11347:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11352:0x0000000076fdf3e0(shared pool|permanent memor,duration 1,cls perm+16114656 ):W:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x7270/29296()
kcrfw_redo_gen_ext+11386:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):W:4:0x7271/29297()
kcrfw_redo_gen_ext+11392:0x0000000076fdf468(shared pool|permanent memor,duration 1,cls perm+16114792 ):R:8:0xe0000000/3758096384(Redo Buffers(pgsz:2048k)|+134217728 )

And a snippet of trace information of memory accesses of a foreground session when using the second public redo strand:

kcrfw_redo_gen_ext+11277:0x0000000076fdf4e4(shared pool|permanent memor,duration 1,cls perm+16114916 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf500(shared pool|permanent memor,duration 1,cls perm+16114944 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf558(shared pool|permanent memor,duration 1,cls perm+16115032 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4f8(shared pool|permanent memor,duration 1,cls perm+16114936 ):R:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11312:0x0000000076fdf550(shared pool|permanent memor,duration 1,cls perm+16115024 ):W:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4f0(shared pool|permanent memor,duration 1,cls perm+16114928 ):R:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11322:0x0000000076fdf548(shared pool|permanent memor,duration 1,cls perm+16115016 ):W:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11327:0x0000000076fdf4e8(shared pool|permanent memor,duration 1,cls perm+16114920 ):R:8:0x800/2048()
kcrfw_redo_gen_ext+11332:0x0000000076fdf540(shared pool|permanent memor,duration 1,cls perm+16115008 ):W:8:0x800/2048()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4e0(shared pool|permanent memor,duration 1,cls perm+16114912 ):R:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11342:0x0000000076fdf538(shared pool|permanent memor,duration 1,cls perm+16115000 ):W:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4d8(shared pool|permanent memor,duration 1,cls perm+16114904 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf530(shared pool|permanent memor,duration 1,cls perm+16114992 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:8:0xe6/230()
kcrfw_redo_gen_ext+11312:0x0000000076fdf528(shared pool|permanent memor,duration 1,cls perm+16114984 ):W:8:0xe6/230()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4c8(shared pool|permanent memor,duration 1,cls perm+16114888 ):R:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf4c0(shared pool|permanent memor,duration 1,cls perm+16114880 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf518(shared pool|permanent memor,duration 1,cls perm+16114968 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4b8(shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11342:0x0000000076fdf510(shared pool|permanent memor,duration 1,cls perm+16114960 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4b0(shared pool|permanent memor,duration 1,cls perm+16114864 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11352:0x0000000076fdf508(shared pool|permanent memor,duration 1,cls perm+16114952 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11366:0x0000000076fdf560(shared pool|permanent memor,duration 1,cls perm+16115040 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0xe6/230()
kcrfw_redo_gen_ext+11386:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):W:4:0xe7/231()
kcrfw_redo_gen_ext+11392:0x0000000076fdf590(shared pool|permanent memor,duration 1,cls perm+16115088 ):R:8:0xe0200000/3760193536(Redo Buffers(pgsz:2048k)|+136314880 )

If you pay really close attention, you see the memory locations moved from 0x76fdf3?? / 0x76fdf4?? to 0x76fdf4?? / 0x76fdf5??, all in permanent memory in the shared pool. Would this mean the information for each strand is kept in a separate struct per strand? Let’s take the memory access at a specific offset in the kcrfw_redo_gen_ext function for the first strand, take the access at exactly the same offset in the function for the second strand (which means it’s probably doing the same thing), and calculate the difference between the memory addresses:

strand 0
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8
strand 1
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8

Now take the memory addresses, and calculate the difference: 0x76fdf520 – 0x76fdf3f8 = 0x128 = 296

Okay, so would the size of the probable struct per strand be 296 bytes? Another question: where would this potential struct start?

Often, the location of structs is fetched from another location, just like we saw earlier for the redo allocation latch, especially if they are more dynamically created than fixed SGA structures, which are not dynamic by nature. Upon close examination of similarities between both pina traces, I found usage of a pointer in the kcrfsg_ struct at offset +0:

kcrfw_redo_gen_ext+14362:0x0000000060016228(fixed sga|var:kcrfsg_+0 ):R:8:0x76fdf388/1996354440(shared pool|permanent memor,duration 1,cls perm+16114568 )

The difference between the memory address I took and the pointer address in kcrfsg_ at offset 0 for the first public redo strand is:
0x76fdf3f8 – 0x76fdf388 = 0x70 = 112
The difference between the memory address I took for the second public redo strand and the pointer in kcrfsg_ at offset 0 is:
0x76fdf520 – 0x76fdf388 = 0x198 = 408
Then subtract with the calculated difference between the memory locations of the first and second public redo strand:
408 – 296 = 112
Bingo!
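A trivial check of this arithmetic, assuming a per-strand struct of 296 bytes that starts at the pointer stored at kcrfsg_+0 (all values taken from the traces above, the offset 112 being the location we happened to pick):

#include <stdio.h>

int main(void)
{
  unsigned long base   = 0x76fdf388UL;  /* pointer read from kcrfsg_+0 */
  unsigned long stride = 296;           /* assumed per-strand struct size */
  unsigned long offset = 112;           /* offset of the location we picked */

  printf("strand 0: 0x%lx\n", base + 0 * stride + offset);  /* prints 0x76fdf3f8 */
  printf("strand 1: 0x%lx\n", base + 1 * stride + offset);  /* prints 0x76fdf520 */
  return 0;
}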

It would be really helpful to have some kind of explanation about this structure… I searched through My Oracle Support (very seldom does the regular knowledge base provide internals information; the best chance for internals is the bug database, the rest is… well, let me say: not that helpful. I also hate that other tabs are enabled if nothing is found), oraus.msg, oradebug dumplist and oradebug doc component for any tracing or dumps that could give more information, but nothing seemed to point to any explanation. I dumped the shared pool, but I found the same information (the memory location being permanent memory) as I obtained from x$ksmsp. Also, since the pinatrace annotation takes care of any x$ view that has its source address in the SGA, it seems there is no way other than to puzzle together all the individual locations in what I assume is a structure that holds per-strand metadata.

I decided to go through the x$ views one more time, and found an x$ view that I already knew of, which points to both ‘strands’ and ‘kcrf’: x$kcrfstrand. However, the reason it is not annotated is that this x$ view has its addresses in the PGA:

SQL> select addr from x$kcrfstrand;

ADDR
----------------
00007F14CE31AE00
00007F14CE31AE00

This means that when querying this x$ view, the results are gathered from wherever the source data lives (the source can be file headers, for example redo (x$kccle) and controlfile (x$kcccf); in this case we will see the actual source of x$kcrfstrand is the SGA), then stored temporarily in PGA, potentially manipulated, and then shown from the PGA, as we can see from the addresses (the addr field in x$kcrfstrand).
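As an illustration of that flow, here is a minimal sketch in C. Everything in it is invented for illustration (struct name, fields, the fetch loop), except the first_buf/last_buf values of the two public strands of this instance; it only shows why every row of x$kcrfstrand can report the same addr: the SGA-resident struct is copied into one local (“PGA”) buffer per fetch, and the row is presented from that copy.

#include <stdio.h>
#include <string.h>

/* Imaginary, heavily simplified stand-in for the per-strand struct. */
struct kcrfa_sketch {
  unsigned long first_buf;
  unsigned long last_buf;
};

/* Stand-in for the shared pool ("SGA") copy of the two strand structs. */
static struct kcrfa_sketch sga_strands[2] = {
  { 0xD8000000UL, 0xDBFFFE00UL },   /* first public redo strand  */
  { 0xDC000000UL, 0xDFFFFE00UL },   /* second public redo strand */
};

int main(void)
{
  struct kcrfa_sketch pga_row;      /* per-fetch buffer, lives in the "PGA" */

  for (int row = 0; row < 2; row++) {
    /* the fetch callback copies the SGA struct into the PGA buffer... */
    memcpy(&pga_row, &sga_strands[row], sizeof pga_row);
    /* ...and the row is presented from the PGA copy: addr is identical for every row */
    printf("addr=%p first_buf=%08lX last_buf=%08lX\n",
           (void *)&pga_row, pga_row.first_buf, pga_row.last_buf);
  }
  return 0;
}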

However, what I can do to investigate where x$kcrfstrand takes its information from, is run the pinatrace (memory access trace) on the execution of ‘select * from x$kcrfstrand’, and try to see where it takes its data from. This is how the relevant part looks:

qerfxFetch+1563:0x00007ffee590a638():W:8:0x10ea8781/283805569()
kcrf_strand_ftcb+2:0x00007ffee590a630():W:8:0x7ffee590b130/140732749885744()
kcrf_strand_ftcb+10:0x00007ffee590a620():W:8:0x7f7780569f40/140151230996288()
kcrf_strand_ftcb+23:0x00000000600169b8(Fixed Size(pgsz:2048k)|+92600 fixed sga|var:kcrf_max_strands_+0 ):R:4:0x2/2()
kcrf_strand_ftcb+29:0x00000000600169c8(Fixed Size(pgsz:2048k)|+92616 fixed sga|var:kcrf_actv_pvt_strands_+0 ):R:4:0/0()
kcrf_strand_ftcb+35:0x00007f7780448558():R:4:0/0()
kcrf_strand_ftcb+52:0x0000000060016228(Fixed Size(pgsz:2048k)|+90664 fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrf_strand_ftcb+64:0x00007ffee590a618():W:8:0xcf7cb25/217565989()
_intel_fast_memcpy+7:0x00000000157f62a0():R:8:0x137fff/1277951()
__intel_ssse3_rep_memcpy+160:0x00000000157b9488():R:4:0x8000/32768()
__intel_ssse3_rep_memcpy+184:0x0000000076fdf4a0(shared pool|permanent memor,duration 1,cls perm+16114848 ):R:16:0x00000000200040000/8590196736(??)
__intel_ssse3_rep_memcpy+3438:0x00007f7780448670():W:16:0x0000ec00000000000/4151755906482176()
...
__intel_ssse3_rep_memcpy+9254:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:16:0xfe56e00000fe56e00000/-2305841916834480128()
__intel_ssse3_rep_memcpy+9258:0x00007f7780448560():W:16:0xfe56e00000fe56e00000/-2305841916834480128()

Going from top to bottom through this snippet:
– qerfxFetch is the query execute rowsource function for querying x$ tables; kcrf_strand_ftcb is a helper function (kernel cache recovery file strand fetch copy buffer is my guess) to fetch the kcrf strand data from the shared pool structure; the other functions are memory copy functions.
– Interestingly, the kcrf_strand_ftcb function reads offset 0 of kcrfsg_, just like we have seen before. This is quite logical if you think about it: this struct is dynamically allocated in the shared pool, which means it doesn’t have a fixed address, so it needs to be looked up every time something must be read from it.
– The next functions are memcpy functions; I’ve shown the relevant ones. The first memcpy (_intel_fast_memcpy) probably is a helper function that switches to an intel optimised memcpy, __intel_ssse3_rep_memcpy, which takes advantage of intel’s SIMD optimisations (SSSE3 stands for Supplemental Streaming SIMD Extensions 3).
– The memcpy functions at lines 11 and 12 show it reading 16 bytes from 0x76fdf4a0, and then writing 16 bytes to 0x7f7780448670.
– 0x76fdf4a0 sits right at the boundary of the struct for the second strand: kcrfsg_+0 = 0x76fdf388, plus 296 = 0x76fdf4b0, and 0x76fdf4b0 minus 16 (the copy size) = 0x76fdf4a0.
– 0x7f7780448670 is quite probably an address in the process’ PGA, which is used for the intermediate representation of the struct’s data for displaying it in the x$ view.

Even more interesting, this confirms that x$kcrfstrand is the externalisation of the struct that is used together with kcrfsg_, and the calculations that were made previously are correct; kcrfsg_+0 = 0x76fdf388, and that address with 296 added is 0x76fdf4b0, which is the same as 0x76fdf4a0 + 16 (if you look into the above example, the memcpy function copies 16 bytes at a time).

I think it’s pretty safe to assume this kcrf related struct in the shared pool is commonly referred to as ‘KCRFA’ in x$kcrfstrand. Of course we can only guess that the ‘A’ means in kernel cache recovery file, I would say ‘auxiliary’, but is could also very well be the A in ‘strAnd’. It is a structure that clearly was needed when oracle changed the concept of a single log buffer into multiple public redo strands, because identical information of multiple strands need to be stored at the same time, and the information in kcrfsg_ is singular of nature.

Please mind that because of the intermediate representation of the x$ table in PGA, I could not use the trick I showed in my ‘drilldown into the logwriter inner working and communication’, which is to query a single column in the x$ view and see which memory addresses in the struct are touched; here the entire struct is copied into PGA and processed before it is used to show the contents, even if only a single column of a single row is queried.

Also consider that since the struct data is copied into the process’ PGA in order to make this information human readable, not all offsets (in x$kqfco) of the PGA representation might be applicable to the SGA struct. In fact, I think I have proven that the KCRFA per-strand metadata width is 296 bytes. Now look at the column offsets from x$kqfco for x$kcrfstrand:

SQL> select c.kqfconam, c.kqfcooff, c.kqfcosiz
from x$kqfco c, x$kqfta t
where t.indx = c.kqfcotab
and t.kqftanam = 'X$KCRFSTRAND'
order by c.kqfcooff;

KQFCONAM                         KQFCOOFF   KQFCOSIZ
------------------------------ ---------- ----------
ADDR                                    0          8
INDX                                    0          4
INST_ID                                 0          4
CON_ID                                  0          2
STRAND_HDR_SCN_KCRFA_CLN                0          8
HIGHEST_SCN_KCRFA_CLN                   0          8
PNEXT_BUF_KCRFA_CLN                    32          8
NEXT_BUF_NUM_KCRFA_CLN                 40          4
BYTES_IN_BUF_KCRFA_CLN                 48          8
AVAILABLE_BUFS_KCRFA_CLN               56          4
MEM_BUFS_AVAILABLE_KCRFA_CLN           60          4
STRAND_HEADER_BNO_KCRFA_CLN            72          4
FLAG_KCRFA_CLN                         76          1
PVT_STRAND_STATE_KCRFA_CLN             77          1
HIGHEST_SUBSCN_KCRFA_CLN               78          2
LWN_BUF_NUM_KCRFA_CLN                  80          4
STRAND_NUM_ORDINAL_KCRFA_CLN           84          4
STRAND_INFO_PTR_KCRFA_CLN              88          8
ACTION_KCRFA                          184          4
LOG_RESIDUE_KCRFA                     188          1
LAST_BUF_WRITTEN_KCRFA                216          4
LAST_BUF_GATHERED_KCRFA               220          4
CONS_LOW_BUFNO_KCRFA                  224          4
FIRST_BUF_KCRFA                       240          8
LAST_BUF_KCRFA                        248          8
TOTAL_BUFS_KCRFA                      256          4
STRAND_SIZE_KCRFA                     260          4
LOWBUF_KCRFA                          272          4
HIBUF_KCRFA                           276          4
WRAPHIBUF_KCRFA                       280          4
LOG_SEQ_KCRFA                         284          4
PTR_KCRF_PVT_STRAND                   304          8
INDEX_KCRF_PVT_STRAND                 312          4
SPACE_KCRF_PVT_STRAND                 316          4
TXN_KCRF_PVT_STRAND                   320          2

This shows that the representation of the KCRFA struct in PGA apparently has a width of 322 bytes (the last column, TXN_KCRF_PVT_STRAND, starts at offset 320 and is 2 bytes in size), while we know the KCRFA struct itself is 296 bytes. I am not sure why this is; the most obvious reason would be that some data is already processed in PGA to make it easier to show. This might also very well be the reason the data needs to be copied to PGA first, instead of being displayed directly from its shared pool location.

This will be a series of posts about Oracle database redo handling. The database in use is Oracle version 12.2.0.1, with PSU 170814 applied. The operating system version is Oracle Linux Server release 7.4. In order to look into the internals of the Oracle database, I use multiple tools; very simple ones like the X$ views and oradebug, but also advanced ones, quite specifically the intel PIN tools (https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool). One of these tools is ‘debugtrace’, which on its own produces pretty usable output (an indented list of function calls and returns), for which I essentially filter out some data; another one is ‘pinatrace’, which does not produce directly usable output, because it provides instruction pointers and memory addresses. However, the information it provides is extremely useful (and detailed!) once you translate the addresses to Oracle function names and memory region names. I have written a tool to do that: https://gitlab.com/FritsHoogland/pinatrace_annotate.git

The blog posts are meant for investigating Oracle database internals. This is a microscopic view into how the database works. I do believe that by understanding the mechanics, you can assess and resolve issues very easily and quickly, because of a fundamental understanding of how the database works.

First of all, to immediately show the value of looking into the internals, a clear documentation bug: this entry from the online documentation of 12.2: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/cncpt/memory-architecture.html#GUID-C2AD1BF6-A5AE-42E9-9677-0AA08126864B (not sure how long this link will be valid, Oracle is notorious for changing it) states that the log buffer is a single buffer called the log buffer. In fact, the log buffer is not a single buffer, but a buffer that contains at least two public redo strands, which has been the case since at least Oracle 10g. These strands are externalised in x$kcrfstrand:

SQL> select indx,strand_size_kcrfa, first_buf_kcrfa, last_buf_kcrfa from x$kcrfstrand where last_buf_kcrfa != '00';

      INDX STRAND_SIZE_KCRFA FIRST_BUF_KCRFA  LAST_BUF_KCRFA
---------- ----------------- ---------------- ----------------
         0          67108864 00000000D8000000 00000000DBFFFE00
         1          67108864 00000000DC000000 00000000DFFFFE00

The number of public redo buffers is dependent on the undocumented parameter “_log_parallelism_max”. A quick question out on twitter revealed that systems with a high count of processors could have a higher public redo strand count.

The public strands are allocated in the ‘Redo Buffers’ memory allocation:

SQL> select "AREA NAME", "START ADDR", to_char((to_number(rawtohex("START ADDR"),'xxxxxxxxxxxxxxxx')+"SEGMENT SIZE"),'XXXXXXXXXXXXXXXX'),pagesize from x$ksmssinfo; 

AREA NAME                        START ADDR       TO_CHAR((TO_NUMBE   PAGESIZE
-------------------------------- ---------------- ----------------- ----------
Variable Size                    0000000061000000          D8000000    2097152
Redo Buffers                     00000000D8000000          E07C6000    2097152
Fixed Size                       0000000060000000          60A00000    2097152
skgm overhead                    00000000E1000000          E1003000       4096

Each public strand is protected by a redo allocation latch:

SQL> select addr, child#, name, gets, misses, immediate_gets, immediate_misses from v$latch_children where name = 'redo allocation';

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE17F0         20 redo allocation               5          0              0                0
0000000076FE1750         19 redo allocation               5          0              0                0
0000000076FE16B0         18 redo allocation               5          0              0                0
0000000076FE1610         17 redo allocation               5          0              0                0
0000000076FE1570         16 redo allocation               5          0              0                0
0000000076FE14D0         15 redo allocation               5          0              0                0
0000000076FE1430         14 redo allocation               5          0              0                0
0000000076FE1390         13 redo allocation               5          0              0                0
0000000076FE12F0         12 redo allocation               5          0              0                0
0000000076FE1250         11 redo allocation               5          0              0                0
0000000076FE11B0         10 redo allocation               5          0              0                0

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE1110          9 redo allocation               5          0              0                0
0000000076FE1070          8 redo allocation               5          0              0                0
0000000076FE0FD0          7 redo allocation               5          0              0                0
0000000076FE0F30          6 redo allocation               5          0              0                0
0000000076FE0E90          5 redo allocation               5          0              0                0
0000000076FE0DF0          4 redo allocation               5          0              0                0
0000000076FE0D50          3 redo allocation               5          0              0                0
0000000076FE0CB0          2 redo allocation            6168          1           1708                2
0000000076FE0C10          1 redo allocation            7267         20          34804               15

Redo allocation latch child# 1 is the latch protecting allocations in public redo buffer 0, child# 2 is the latch protecting allocations in public redo buffer 1. Two questions that you would probably ask yourself:
1: why are there 20 redo allocation latches whilst there are only two public redo buffers?
2: how can you tell child# 1 is bound to public redo buffer 0, and child# 2 is bound to public redo buffer 1?

Answers to these questions:
1: the 18 other redo allocation latches are because of a mechanism called ‘private redo strands’, see https://blog.tanelpoder.com/files/Oracle_Performance_Scalability_10g_11g.pdf, https://fritshoogland.wordpress.com/2016/11/15/redo-a-blogpost/
2: in order to be sure about what that redo allocation latch protects, you need to look at the actual function calls.

For this, I am using a tool from intel, called ‘pin tools’, which is a suite of tools, of which one is called ‘debugtrace’: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

First I am going to look into how a latch is chosen.

Attach a debugtrace to a normal foreground session that executes an insert and a commit, and filter the output to make it more readable using a custom filter: https://gitlab.com/snippets/1686145

Now search for the latch address (in lowercase) 0x76fe0c10 (the address of the redo allocation latch, see above query from v$latch_children):

 | | | | | | | | | | | | | | | > kcrfw_redo_gen_ext(0x3, 0x7ffd52f513e0, ...)
 | | | | | | | | | | | | | | | | > kcobrh(0x7ffd52f513e0, 0x3, ...)
 | | | | | | | | | | | | | | | | < kcobrh+0x00000000014b returns: 0x1b4
 | | | | | | | | | | | | | | | | > kcb_get_redo_version(0x3, 0x7ffd52f513a0, ...)
 | | | | | | | | | | | | | | | | < kcb_get_redo_version+0x0000000000c4 returns: 0x1
 | | | | | | | | | | | | | | | | > kcmglscn8(0x7ffd52f508d0, 0, ...)
 | | | | | | | | | | | | | | | | | > kcscu8(0x600113b8, 0x7ffd52f508d0, ...)
 | | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | | | | | | | | | | | | | < kcmglscn8+0x000000000054 returns: 0x756239
 | | | | | | | | | | | | | | | | > kscn_to_ub8_impl(0x7ffd52f50838, 0x7ffd52f508d0, ...)
 | | | | | | | | | | | | | | | | < kscn_to_ub8_impl+0x00000000003e returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fdebe0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)                             <<<<<<<<
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > ksvgcls(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < ksvgcls+0x000000000021 returns: 0
 | | | | | | | | | | | | | | | | > kcscu8(0x60016260, 0x7ffd52f508d8, ...)
 | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0xd8eab838, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0xd8eab848, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756239, 0x7ffd52f50848, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > kscnbur3_int_impl(0x756239, 0xd8eab824, ...)
 | | | | | | | | | | | | | | | | < kscnbur3_int_impl+0x000000000030 returns: 0x756239
 | | | | | | | | | | | | | | | | > ub8_to_kscn_impl(0x756231, 0xd8eab840, ...)
 | | | | | | | | | | | | | | | | < ub8_to_kscn_impl+0x000000000031 returns: 0
 | | | | | | | | | | | | | | | | > kslfre(0x76fe0c10, 0xd8eab840, ...)
 | | | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0

If we look at the work of Andrey Nikolaev (https://andreynikolaev.files.wordpress.com/2016/03/latches_and_mutexes_in_oracle_12c.pdf),
we know that the first argument of kslgetl is the latch address, and the second argument is 0 for no-wait mode or 1 for willing-to-wait mode. What becomes clear is that a foreground process gets the redo allocation latch in nowait mode. Of course this is widely known and documented. The return code of 0x1 means the latch was successfully taken. The latch that is fetched before the redo allocation latch of course is the redo copy latch:

SQL> @flba 76fdebe0
ADDR             NAME
---------------- ------------------------
0000000076FDEBE0 redo copy

If we look again at the figures of the two redo allocation latches protecting the public redo strands:

ADDR                 CHILD# NAME                       GETS     MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------------- ---------- -------------------- ---------- ---------- -------------- ----------------
0000000076FE0CB0          2 redo allocation            6168          1           1708                2
0000000076FE0C10          1 redo allocation            7267         20          34804               15

It becomes apparent that child# 1 is used way more than child# 2, which is visible in the column ‘immediate_gets’. How does that work? Let’s manually take the first redo allocation child latch in one session, and then issue an insert in another session with debugtrace attached to it.

session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x76FE0C10 0 0 2442
Function returned 1

session 2; debugtrace output:

 | | | | | | | | | | | | | | | | > kslgetl(0x76fdebe0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1

First child# 1 is tried, but because the latch is already taken, indicated by return code 0, it moves on to the second latch.

What will happen when both latches are taken?

session 1:

SQL> oradebug call kslgetl 0x76FE0C10 0 0 2442
Function returned 1
SQL> oradebug call kslgetl 0x76FE0CB0 0 0 2442
Function returned 1

This is what it looks like in the debugtrace output:

 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0c10, 0, ...)
 | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0
 | | | | | | | | | | | | | | | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | | | | | | | | | | | | | | | | > kslges(0x76fe0cb0, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | > ksuSdiInProgress(0x2, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | < ksuSdiInProgress+0x000000000035 returns: 0
 | | | | | | | | | | | | | | | | | | > kskthbwt(0, 0, ...)
 | | | | | | | | | | | | | | | | | | < kskthbwt+0x0000000000a1 returns: 0x10
 | | | | | | | | | | | | | | | | | | > kslwlmod(0x7ffc45d6ab00, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | | > kstmgetsectick(0x7ffc45d6ab00, 0x7a63c410, ...)
 | | | | | | | | | | | | | | | | | | | < kstmgetsectick+0x00000000003a returns: 0x5a4bd4b9
 | | | | | | | | | | | | | | | | | | < kslwlmod+0x0000000000cb returns: 0xffffffff
 | | | | | | | | | | | | | | | | | | > kslwtbctx(0x7ffc45d6a988, 0x7a63c588, ...)

What this shows is the foreground session trying both redo allocation child latches in immediate mode (in this trace the second child first, then the first child), which both failed, after which it had no other option than to get a redo allocation latch in willing to wait mode. For that it performed a latch get for the second child latch again, this time with the second argument set to 1, indicating willing to wait mode. Because the latch was still taken, kslgetl called kslges in order to spin for it (which means a series of retries of getting the latch); on Linux the spinning happens inside kslges, so it is not visible in a debugtrace. After spinning 20000 times, the process registers a wait (kslwtbctx), and then goes to sleep on a semaphore inside the wait:

 | | | | | | | | | | | | | | | | | | < kslwtbctx+0x00000000050b returns: 0x77c000a4
 | | | | | | | | | | | | | | | | | | > skgpwwait(0x7ffc45d6abf8, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | > kslwait_conv_wait_time(0xffffffff, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | < kslwait_conv_wait_time+0x000000000027 returns: 0xffffffff
 | | | | | | | | | | | | | | | | | | | > sskgpwwait(0x7ffc45d6abf8, 0x7f910bf92ca0, ...)
 | | | | | | | | | | | | | | | | | | | | > semop@plt(0x98000, 0x7ffc45d6a888, ...)

This is the modern latch implementation where a session goes to sleep on a semaphore after spinning for a latch in willing to wait mode. In order to learn more about that, look in the work of Andrey Nikolaev. The semaphore implementation means that as soon as the latch becomes available, the session is posted and can do its allocation in the public redo strand.

However, this does not tie the child latch to a public strand, it only shows how a session bunny-hops public redo allocation latches in nowait mode until it hits a blind wall (if all have been taken) after which it will start waiting in willing to wait mode, waiting to be posted.
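To make the mechanism concrete, here is a toy model in C of that bunny-hop pattern. This is emphatically not Oracle’s code; it only models the behaviour described above and seen in the traces: the latch is represented as an atomic word holding the owner pid at offset 0 (matching what kslgetl+138 reads and writes), every child latch is tried in no-wait mode first, and only when all children are taken does the process fall back to a willing-to-wait get that spins before it would sleep on a semaphore.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define N_CHILDREN 2
#define SPIN_COUNT 20000

static atomic_long latch[N_CHILDREN];          /* offset 0: owner pid, 0 = free */

static bool latch_try_get(int child, long pid) /* models kslgetl(latch, 0): no-wait mode */
{
  long expected = 0;
  return atomic_compare_exchange_strong(&latch[child], &expected, pid);
}

static int latch_get_any(long pid)
{
  /* 1. bunny-hop all children in no-wait mode */
  for (int c = 0; c < N_CHILDREN; c++)
    if (latch_try_get(c, pid))
      return c;

  /* 2. all taken: willing-to-wait get (models kslgetl(latch, 1) calling kslges) */
  for (;;) {
    for (int spin = 0; spin < SPIN_COUNT; spin++)
      if (latch_try_get(0, pid))
        return 0;
    /* 3. register a wait and sleep on a semaphore until posted
          (kslwtbctx -> skgpwwait -> semop); omitted in this toy model */
  }
}

int main(void)
{
  atomic_store(&latch[0], 39);                 /* child #1 already taken, as in the trace */
  int c = latch_get_any(26);
  printf("got child latch #%d\n", c + 1);      /* prints: got child latch #2 */
  return 0;
}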

In order to see the connection between the redo allocation latches and the actual public strand memory areas, we need to look one level deeper, which means looking at (main) memory access of a process. This can be done using the pin tools pinatrace tool, and adding annotations to the pinatrace in order to understand what is going on. For this, I created a tool: https://gitlab.com/FritsHoogland/pinatrace_annotate

First, create a pinatrace of a normal insert on a test database. A word of warning: if you think a function call trace generates a lot of detailed data, the pinatrace, alias memory access trace, generates *WAY* more. Even modest things like ‘select count(*) from dual’ generate heaps of data.

Back to the pinatrace of the normal insert; generate the trace, and then annotate it using pinatrace_annotate tool. This should show the insert takes redo allocation latch child #1. To validate, search for the latch address again:

kcrfw_redo_gen_ext+1479:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1486:0x00007fff5d88efe8():W:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1493:0x00007fff5d88efc8():R:4:0/0()
kcrfw_redo_gen_ext+1516:0x00007fff5d88efe8():R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )

What this shows is the kcrfw_redo_gen_ext function reading address 0x60016570, which is in the fixed SGA, which belongs to a ‘fixed sga variable’ called kcrfsg_ at offset 840, reading the value 0x76fe0c10, which we already know is the address of redo allocation latch child #1. The latch address is written into address 0x7fff5d88efe8, which is in the process’ PGA, but we got no annotation of it, because there is nothing between the round brackets. Later, the latch address in PGA is read again.

We know the kcrfsg_ variable/struct from my earlier investigation into how the logwriter works. This is a struct in fixed sga that holds data important to kcrf (kernel cache redo file). My current guess is sg means ‘SGA’.

Only a little further in the trace, we see that the kslgetl function is invoked, and indeed it is trying to get this latch:

kslgetl+69:0x0000000076fe0c1f(Variable Size(pgsz:2048k)|+368970783 shared pool|(child)latch#1:redo allocation+15 shared pool|permanent memor,duration 1,cls perm+16120863 ):R:1:0x37/55()

This is what a successful latch get looks like in the trace:

kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):R:8:0/0()
kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):W:8:0x1a/26()

The work of Andrey Nikolaev describes that when a latch is taken, the ora pid is written at offset 0. Here we see that at offset 0 the kslgetl function reads a 0, indicating the latch is free, after which the value 26, the ora pid of the process, is written into it, which means this process has now taken the latch.

The next step is to see if there is any indication of it pointing to the public redo strand. It should be close to the latch get, because a latch should be held for as little time as possible. Essentially, what I did next was search for ‘Redo Buffers’. I found this a little further on, still in the kcrfw_redo_gen_ext function:

kcrfw_redo_gen_ext+11277:0x0000000076fdf3bc(shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf3d8(shared pool|permanent memor,duration 1,cls perm+16114648 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf430(shared pool|permanent memor,duration 1,cls perm+16114736 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3d0(shared pool|permanent memor,duration 1,cls perm+16114640 ):R:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11312:0x0000000076fdf428(shared pool|permanent memor,duration 1,cls perm+16114728 ):W:8:0xdeadbeef0000726e/-2401053092612115858()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3c8(shared pool|permanent memor,duration 1,cls perm+16114632 ):R:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11322:0x0000000076fdf420(shared pool|permanent memor,duration 1,cls perm+16114720 ):W:8:0x100010000726e/281479271707246()
kcrfw_redo_gen_ext+11327:0x0000000076fdf3c0(shared pool|permanent memor,duration 1,cls perm+16114624 ):R:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11332:0x0000000076fdf418(shared pool|permanent memor,duration 1,cls perm+16114712 ):W:8:0x1b800001000/1889785614336(??)
kcrfw_redo_gen_ext+11337:0x0000000076fdf3b8(shared pool|permanent memor,duration 1,cls perm+16114616 ):R:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11342:0x0000000076fdf410(shared pool|permanent memor,duration 1,cls perm+16114704 ):W:8:0x1256f0000b2fe/322633648354046()
kcrfw_redo_gen_ext+11347:0x0000000076fdf3b0(shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf408(shared pool|permanent memor,duration 1,cls perm+16114696 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:8:0x7270/29296()
kcrfw_redo_gen_ext+11312:0x0000000076fdf400(shared pool|permanent memor,duration 1,cls perm+16114688 ):W:8:0x7270/29296()
kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf398(shared pool|permanent memor,duration 1,cls perm+16114584 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf3f0(shared pool|permanent memor,duration 1,cls perm+16114672 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf390(shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11342:0x0000000076fdf3e8(shared pool|permanent memor,duration 1,cls perm+16114664 ):W:8:0x758e9f/7704223()
kcrfw_redo_gen_ext+11347:0x0000000076fdf388(shared pool|permanent memor,duration 1,cls perm+16114568 ):R:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11352:0x0000000076fdf3e0(shared pool|permanent memor,duration 1,cls perm+16114656 ):W:8:0x758ea0/7704224()
kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x7270/29296()
kcrfw_redo_gen_ext+11386:0x0000000076fdf3a8(shared pool|permanent memor,duration 1,cls perm+16114600 ):W:4:0x7271/29297()
kcrfw_redo_gen_ext+11392:0x0000000076fdf468(shared pool|permanent memor,duration 1,cls perm+16114792 ):R:8:0xe0000000/3758096384(Redo Buffers(pgsz:2048k)|+134217728 )

There’s a lot to say about this. The first thing to point out, is that these are all memory locations of a memory segment in the shared pool in permanent memory somewhere around 0x76fdf300 / 0x76fdf400. The first public strand is explicitly mentioned (by my annotation of the memory structures):

kcrfw_redo_gen_ext+11317:0x0000000076fdf3a0(shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf3f8(shared pool|permanent memor,duration 1,cls perm+16114680 ):W:8:0xd8e4e200/3638878720(Redo Buffers(pgsz:2048k)|+15000064 redo|PUB_REDO_0+15000064 )

As well as the second public strand:

kcrfw_redo_gen_ext+11366:0x0000000076fdf438(shared pool|permanent memor,duration 1,cls perm+16114744 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )

I am confident that the reference to the second public strand here is just a coincidence: the value written is 0xdeadbeef, a well known filler value marking something as ‘not in use’, which simply happens to fall inside the address range of the second public strand.
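A quick check of that claim, using the start addresses shown earlier by x$ksmssinfo (Redo Buffers) and x$kcrfstrand (first_buf of strand 1):

#include <stdio.h>

int main(void)
{
  unsigned long redo_buffers = 0xD8000000UL;   /* start of the 'Redo Buffers' area */
  unsigned long pub_redo_1   = 0xDC000000UL;   /* first_buf_kcrfa of the second strand */
  unsigned long value        = 0xdeadbeefUL;   /* the filler value written */

  printf("Redo Buffers + %lu, PUB_REDO_1 + %lu\n",
         value - redo_buffers, value - pub_redo_1);
  /* prints: Redo Buffers + 112049903, PUB_REDO_1 + 44941039,
     exactly the offsets shown by the pinatrace annotation */
  return 0;
}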

The second thing to notice is there is a lot of information written really close to each other (0x76fdf300-0x76fdf400). The only location I know that holds this kind of redo related information is the previously mentioned kcrfsg_ struct, but this seems to be another location that holds critical information about redo, of which all we know is it sits in the shared pool (not in fixed sga), and it is allocated as permanent memory.

I tried various things to get more information about this memory location, like dumping the sga heap, inspecting this memory location in a heap dump to see if there is a sub heap that I can dump for more information (there isn’t), and asking on twitter (which sometimes helps), but I got no information. I spent a considerable amount of time searching the bug database, and still got no answers…

I decided to prove usage of the second public strand by manually obtaining the latch for the first strand, so that an insert will have to take the second latch, and then verifying whether that indeed means the second public redo strand is used. This means fetching the first latch using oradebug in one session, then starting another database session, attaching a pinatrace to it, and executing an insert.

Once the pinatrace is annotated, search for 0x76fe0c10 (the first redo allocation latch) again. The start is exactly the same: the latch address is fetched from kcrfsg_ at offset 840.

kcrfw_redo_gen_ext+1479:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1486:0x00007ffcca22e6e8():W:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+1493:0x00007ffcca22e6c8():R:4:0/0()
kcrfw_redo_gen_ext+1516:0x00007ffcca22e6e8():R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )

Now look at the same offsets in the function kslgetl (+138) again:

kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):R:8:0x27/39()
kslgetl+138:0x0000000076fe0c10(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 ):W:8:0x27/39()

Now we see the latch already has a value at offset zero, which is 39, indicating the latch is taken. Now that the session executing the insert found the first redo allocation latch busy, first this happens:

kcrfw_redo_gen_ext+14267:0x00000000600169c4(Fixed Size(pgsz:2048k)|+92612 fixed sga|var:kcrf_actv_strands_+0 ):R:4:0x1/1()
kcrfw_redo_gen_ext+14273:0x00007ffcca22e6d8():R:4:0x2/2()
kcrfw_redo_gen_ext+14283:0x00000000600169c4(Fixed Size(pgsz:2048k)|+92612 fixed sga|var:kcrf_actv_strands_+0 ):W:4:0x2/2()

Because the latch for the first public redo strand was taken, an additional strand is enabled by setting the fixed sga variable kcrf_actv_strands_ to 2, indicating the second strand is now active too.

Then kcrfsg_ at offset 840 is read again, but a higher value is saved in PGA memory. In fact, the latch size is known (160 bytes), so in order to get to the next latch, the session picks up the latch address of child #1 and adds 160 to arrive at child #2.

kcrfw_redo_gen_ext+14335:0x0000000060016570(Fixed Size(pgsz:2048k)|+91504 fixed sga|var:kcrfsg_+840 ):R:8:0x76fe0c10/1996360720(Variable Size(pgsz:2048k)|+368970768 shared pool|(child)latch#1:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16120848 )
kcrfw_redo_gen_ext+14342:0x00007ffcca22e6e8():W:8:0x76fe0cb0/1996360880(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 )
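A trivial sanity check of the two observations above, using the values from the trace and v$latch_children (the variable names are just placeholders):

#include <stdio.h>

int main(void)
{
  unsigned long child1     = 0x76fe0c10UL;   /* redo allocation child #1, from v$latch_children */
  int           latch_size = 160;            /* size of one child latch */
  int           kcrf_actv_strands = 1;

  kcrf_actv_strands++;                       /* second strand activated on contention */
  printf("active strands: %d, child #2 latch at 0x%lx\n",
         kcrf_actv_strands, child1 + latch_size);
  /* prints: active strands: 2, child #2 latch at 0x76fe0cb0 */
  return 0;
}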

This latch is free, and is taken by this session:

kslgetl+138:0x0000000076fe0cb0(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 ):R:8:0/0()
kslgetl+138:0x0000000076fe0cb0(Variable Size(pgsz:2048k)|+368970928 shared pool|(child)latch#2:redo allocation+0 shared pool|permanent memor,duration 1,cls perm+16121008 ):W:8:0x1a/26()

Now let’s look at the same function locations with the first redo allocation latch taken, and the second redo allocation latch acquired:

kcrfw_redo_gen_ext+11277:0x0000000076fdf4e4(shared pool|permanent memor,duration 1,cls perm+16114916 ):R:4:0x1256f/75119()
kcrfw_redo_gen_ext+11290:0x0000000076fdf500(shared pool|permanent memor,duration 1,cls perm+16114944 ):R:8:0/0()
kcrfw_redo_gen_ext+11300:0x0000000076fdf558(shared pool|permanent memor,duration 1,cls perm+16115032 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4f8(shared pool|permanent memor,duration 1,cls perm+16114936 ):R:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11312:0x0000000076fdf550(shared pool|permanent memor,duration 1,cls perm+16115024 ):W:8:0xdeadbeef000000e6/-2401053092612144922()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4f0(shared pool|permanent memor,duration 1,cls perm+16114928 ):R:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11322:0x0000000076fdf548(shared pool|permanent memor,duration 1,cls perm+16115016 ):W:8:0x10001000000e6/281479271678182()
kcrfw_redo_gen_ext+11327:0x0000000076fdf4e8(shared pool|permanent memor,duration 1,cls perm+16114920 ):R:8:0x800/2048()
kcrfw_redo_gen_ext+11332:0x0000000076fdf540(shared pool|permanent memor,duration 1,cls perm+16115008 ):W:8:0x800/2048()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4e0(shared pool|permanent memor,duration 1,cls perm+16114912 ):R:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11342:0x0000000076fdf538(shared pool|permanent memor,duration 1,cls perm+16115000 ):W:8:0x1256f00012488/322633648383112()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4d8(shared pool|permanent memor,duration 1,cls perm+16114904 ):R:8:0/0()
kcrfw_redo_gen_ext+11352:0x0000000076fdf530(shared pool|permanent memor,duration 1,cls perm+16114992 ):W:8:0/0()
kcrfw_redo_gen_ext+11307:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:8:0xe6/230()
kcrfw_redo_gen_ext+11312:0x0000000076fdf528(shared pool|permanent memor,duration 1,cls perm+16114984 ):W:8:0xe6/230()
kcrfw_redo_gen_ext+11317:0x0000000076fdf4c8(shared pool|permanent memor,duration 1,cls perm+16114888 ):R:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11322:0x0000000076fdf520(shared pool|permanent memor,duration 1,cls perm+16114976 ):W:8:0xdc01ce00/3691105792(Redo Buffers(pgsz:2048k)|+67227136 redo|PUB_REDO_1+118272 )
kcrfw_redo_gen_ext+11327:0x0000000076fdf4c0(shared pool|permanent memor,duration 1,cls perm+16114880 ):R:8:0/0()
kcrfw_redo_gen_ext+11332:0x0000000076fdf518(shared pool|permanent memor,duration 1,cls perm+16114968 ):W:8:0/0()
kcrfw_redo_gen_ext+11337:0x0000000076fdf4b8(shared pool|permanent memor,duration 1,cls perm+16114872 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11342:0x0000000076fdf510(shared pool|permanent memor,duration 1,cls perm+16114960 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11347:0x0000000076fdf4b0(shared pool|permanent memor,duration 1,cls perm+16114864 ):R:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11352:0x0000000076fdf508(shared pool|permanent memor,duration 1,cls perm+16114952 ):W:8:0x75a1d2/7709138()
kcrfw_redo_gen_ext+11366:0x0000000076fdf560(shared pool|permanent memor,duration 1,cls perm+16115040 ):W:4:0xdeadbeef/3735928559(Redo Buffers(pgsz:2048k)|+112049903 redo|PUB_REDO_1+44941039 )
kcrfw_redo_gen_ext+11380:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):R:4:0xe6/230()
kcrfw_redo_gen_ext+11386:0x0000000076fdf4d0(shared pool|permanent memor,duration 1,cls perm+16114896 ):W:4:0xe7/231()
kcrfw_redo_gen_ext+11392:0x0000000076fdf590(shared pool|permanent memor,duration 1,cls perm+16115088 ):R:8:0xe0200000/3760193536(Redo Buffers(pgsz:2048k)|+136314880 )

Here we see reads and writes in the same fashion, including the 0xdeadbeef write, but now we see PUB_REDO_1, alias the second public redo strand. This proves the second redo allocation child latch is really protecting the second public redo strand!
