During the investigations for my previous blogpost about what happens during a commit and when the data becomes available, I used breakpoints in gdb (the GNU debugger) at various places in the execution of an insert and a commit, to see what is visible to other sessions during the various stages of execution.

However, I found something else, which is very logical but easily overlooked: at certain moments, access to the table is blocked/serialised in order to let a session make changes to blocks belonging to the table, or to peripheral blocks like undo, for the sake of consistency. These changes happen at the physical layer of an Oracle segment; Oracle's logical model still holds that writers don't block readers.

The first question is whether this is a big issue. Well, for that you should simply look at any normal usage of an Oracle database. I don't believe these serialisation moments are an issue, because the majority of my clients do not experience serialisation problems.

However, this can become an issue if database processes cannot run at will. By that I mean: if processes are randomly stopped from executing by the operating system or hypervisor, a database process can be stopped at one of these 'serialisation points', which means access for all other processes remains blocked until the process is running again and gets the opportunity to free its exclusive access to the resource.

What can prevent a database process from running at will?
– CPU oversubscription. If more processes are runnable than CPUs are available (think about cores and threads, and what that means for your CPU architecture), the operating system needs to decide which processes run and which must wait. No operating system scheduler understands or can know when an Oracle database process is running in a critical part of code and therefore should not be taken off CPU.
– Memory oversubscription. Whenever your system starts actively swapping, it is moving active pages onto the swap device and reading active pages back from the swap device for the processes that demand them. This activates additional kernel tasks, and makes processes get stuck waiting for memory to become available at all kinds of points during execution. Often, swapping is the beginning of the end, meaning the services on the server stop functioning altogether for multiple reasons, like the Out Of Memory killer, or clients of a process simply timing out.
– Oversubscription of CPU or memory using virtualisation. As a troubleshooter, I would say this is actually worse than the 'simple oversubscription' of the two earlier points, because typically only the virtual machine is visible. The virtual machine itself is not oversubscribed; the oversubscription takes place at a layer invisible to the virtual machine, which then performs unpredictably for no visible reason. It is simply not okay to oversubscribe any machine that needs to run something latency sensitive like the Oracle database, unless you fully understand all the trade-offs that come with it, and you have the ability to detect when it starts to occur.
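To make the CPU oversubscription point a little more concrete, here is a minimal Python sketch (nothing Oracle-specific, and only a coarse heuristic, as the comments state) that compares the 1-minute load average against the number of visible CPUs:

```python
import os

def oversubscription_hint():
    """Coarse heuristic: 1-minute load average versus visible CPUs.

    Only a rough indicator: on Linux the load average also counts
    processes in uninterruptible sleep, and inside a virtual machine the
    real contention happens at the hypervisor layer, invisible here.
    """
    cpus = os.cpu_count()
    load1, load5, load15 = os.getloadavg()   # Unix only
    return cpus, load1, load1 / cpus

cpus, load1, ratio = oversubscription_hint()
print(f"cpus={cpus} load1={load1:.2f} ratio={ratio:.2f}")
```

A ratio persistently above 1 means runnable processes are queueing for CPU, which is exactly the situation in which a process can be taken off CPU while holding a latch or pin.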

Whenever database processes cannot run at will, you will see waits you didn't see when processes could run at will. Typical waits in that case are (not an exhaustive list):
– any latch waits
– buffer busy waits
– any cursor: pin waits
Please mind these waits are not unique to this issue; any of these waits can occur for other reasons too.

Let me show a couple of examples in which the wait is artificially created by stopping execution at a point where this serialisation takes place, thereby mimicking a process being put off CPU due to load, using the earlier insert and commit execution:

a) latch: cache buffers chains
Whenever a process needs to read or modify a database buffer, it needs to ‘pin’ the buffer. Such a pin is also known as a buffer handle or a buffer state object. The pin, which is a memory area, must be obtained from a freelist and attached to the buffer header in order to pin the buffer. A pin is obtained using the kcbzgs (kernel cache buffers helper functions get state object) function, which calls the kssadf_numa_intl (kernel service state objects add from freelist numa internal) function which initialises the state object.

There is not a single function that performs the actual pin of a block; pinning is done in multiple functions. Two of them are kcbgtcr (kernel cache buffers get consistent read) and kcbgcur (kernel cache buffers get current). Under normal circumstances, concurrency for a consistent read of a buffer (kcbgtcr) does not lead to any blocking, because the latch (cache buffers chains) can be taken in shared mode, and the pinning is done in 'fast mode' (my interpretation):

> kcbgtcr(0x7f88a8b79db0, 0, ...)
| > kscn_to_ub8_impl(0x7f88a8b88d4c, 0x6baefdc0, ...)
| < kscn_to_ub8_impl+0x00000000003e returns: 0x287562
| > ktrexf(0x7fff998123f8, 0x7f88ae5d7f30, ...)
| | > ktrEvalBlockForCR(0x7f88a8b88d4c, 0x7f88ae5d7f30, ...)
| | < ktrEvalBlockForCR+0x0000000000f4 returns: 0x1
| < ktrexf+0x0000000000a4 returns: 0x9
| > kcbzgsf(0x1, 0x7fff998120b8, ...)
| | > kssadf_numa_intl(0x36, 0x6ddc0b50, ...)
| | | > kpdbIdToUid(0, 0x6ddc0b50, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x6bf84908
| < kcbzgsf+0x0000000001a0 returns: 0x6bf84908
| > kcbz_fp_cr(0x8dff9478, 0x6bf84998, ...)
| < kcbz_fp_cr+0x000000000089 returns: 0
| > kscn_to_ub8_impl(0x7f88a8b88d70, 0x1, ...)
| < kscn_to_ub8_impl+0x00000000003e returns: 0
< kcbgtcr+0x000000000b38 returns: 0x8df94014

Inside the kcbgtcr function, kcbzgsf (kernel cache buffers helper functions get state object fast) is called, and after that, the state object is pinned to the block in kcbz_fp_cr (kernel cache buffers helper functions fast pin for consistent read). Do you see the absence of a ksl_get_shared_latch call to serialise access for adding this pin? The cache buffers chains latch is in fact taken, but inside the kcbgtcr function itself: the latch is taken in shared mode before the ktrexf (kernel transaction redo examine block fast) call, and released after the kcbz_fp_cr function. Multiple processes can pin the block for consistent read without serialising or blocking each other, because cache buffers chains is a shared latch and it is taken in shared mode.

Why do I show this? Well, in the "old" days, meaning Oracle 8.1, Oracle did not use shared latches (at least on the platforms I worked on; if my memory serves me right, shared latches were introduced in Oracle 9), which meant that for any logical read the cache buffers chains latch had to be obtained exclusively. This could get painful with very frequently visited blocks, like index root blocks, and high concurrency; in addition, because there was no post/wait messaging for latch waits but only spinning, this meant lots of processes waiting for latch number 98, and CPU usage going through the roof.
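To illustrate the difference a shared latch makes, here is a toy Python model of one (a simple readers-writer construct using a counter and a condition variable; Oracle's actual latch implementation is of course entirely different): shared getters never block each other, only an exclusive getter has to wait for them.

```python
import threading

# Toy model of a shared latch: many sessions can hold it in shared mode at
# the same time, while a non-shared (exclusive) get has to wait until all
# shared holders are gone. Python's stdlib has no readers-writer lock, so
# this uses a counter guarded by a mutex. Sketch only, not Oracle code.
class SharedLatch:
    def __init__(self):
        self._mutex = threading.Lock()
        self._no_readers = threading.Condition(self._mutex)
        self._readers = 0

    def get_shared(self):
        with self._mutex:                 # mutex held only briefly
            self._readers += 1            # shared getters never block each other

    def free_shared(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._no_readers.notify_all()

    def get_exclusive(self):
        self._mutex.acquire()
        while self._readers > 0:          # wait until all shared holders freed
            self._no_readers.wait()
        # the mutex stays held: the exclusive holder is alone

    def free_exclusive(self):
        self._mutex.release()

latch = SharedLatch()
latch.get_shared()
latch.get_shared()            # two concurrent shared holders: no waiting
latch.free_shared()
latch.free_shared()
latch.get_exclusive()         # only possible once no shared holders remain
latch.free_exclusive()
```

In an Oracle 8.1-style world every logical read would need the equivalent of get_exclusive(), serialising all readers of a hash chain; with a shared latch, readers only serialise against writers.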

Back to today. Below is shown what the function calls look like when DML is done. When a buffer change is made (data is changed), the buffer has to be obtained in current mode, which is done using the kcbgcur function:

> kcbgcur(0x7ffda5957c70, 0x2, ...)
| > kcbzgs(0x1, 0x2, ...)
| | > kssadf_numa_intl(0x36, 0x7a6086c8, ...)
| | | > kpdbIdToUid(0, 0x7a6086c8, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x7867c9c0
| < kcbzgs+0x000000000178 returns: 0x7867c9c0
| > kti_is_imu(0x7867ca50, 0xb0f71018, ...)
| | > ktcgtx(0x76f29d50, 0xb0f71018, ...)
| | < ktcgtx+0x00000000001e returns: 0x76f29d50
| < kti_is_imu+0x000000000039 returns: 0
< kcbgcur+0x0000000009fb returns: 0xb0302014

This looks much the same as the previous call overview, which showed kcbgtcr; however, the kcbzgsf function changed to kcbzgs, so minus the 'f' for fast, and a few other functions are missing. Another interesting thing (not visible) is that the cache buffers chains latch is obtained after the kcbzgs function that obtains the buffer state object, and the latch is obtained with the special bit 0x1000000000000000 set, to indicate the shared latch is taken in non-shared mode. After the kti_is_imu function the pin is attached to the buffer and the latch is freed.
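As an illustration of how such a latch word could behave, here is a toy Python model in which the bit 0x1000000000000000 marks a non-shared get. This is my interpretation of the observed values, not Oracle's actual implementation; the real latch word also encodes holder identity (for example the 0x22 in the value 0x1000000000000022 seen later in the memory trace), which this sketch ignores.

```python
# Toy model of a 64-bit shared latch word: bit 0x1000000000000000 marks a
# non-shared (exclusive) get; without that bit, the low bits simply count
# the shared holders. Interpretation/sketch only, not Oracle code.
EXCLUSIVE_BIT = 0x1000000000000000

def try_get_shared(word):
    """Return the new latch word, or None when the getter must wait."""
    if word & EXCLUSIVE_BIT:
        return None              # held non-shared: wait
    return word + 1              # one more shared holder, no serialisation

def try_get_exclusive(word):
    """Return the new latch word, or None when the getter must wait."""
    if word != 0:
        return None              # any holder at all: wait
    return EXCLUSIVE_BIT

word = 0
word = try_get_shared(word)              # first shared holder
word = try_get_shared(word)              # second shared holder, concurrently
assert try_get_exclusive(word) is None   # non-shared get must wait
word = 0                                 # both shared holders freed
word = try_get_exclusive(word)           # now the non-shared get succeeds
assert try_get_shared(word) is None      # and shared getters must wait
```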

To make things a little more complicated, a buffer can be gotten in current mode while the cache buffers chains latch is still taken in shared mode. This is how that looks:

> kcbgcur(0x7ffda5955620, 0x1, ...)
| > kcbzgs(0x1, 0x1, ...)
| | > kssadf_numa_intl(0x36, 0x7a6086c8, ...)
| | | > kpdbIdToUid(0, 0x7a6086c8, ...)
| | | < kpdbIdToUid+0x0000000000e7 returns: 0
| | < kssadf_numa_intl+0x000000000235 returns: 0x7867c9c0
| < kcbzgs+0x000000000178 returns: 0x7867c9c0
| > kcbz_fp_shr(0xb0fa3cb8, 0x7867ca50, ...)
| < kcbz_fp_shr+0x000000000085 returns: 0x7f3011d82f40
< kcbgcur+0x0000000009fb returns: 0xb07a0014

The function kcbzgs without the 'f' is called, but the cache buffers chains latch is taken in shared mode (not visible; done in the kcbgcur function after kcbzgs), and there's the function kcbz_fp_shr (kernel cache buffers helper function fast pin shared) to pin the buffer in shared mode.

In order to be allowed to change a buffer, the cache buffers chains latch must be taken in non-shared mode, so that the buffer header can be changed to indicate the buffer is busy, guaranteeing only the process itself can access it. Obtaining the cache buffers chains latch in non-shared mode means that access to the hash bucket (multiple hash buckets, actually) to which the buffer is linked is blocked until the latch is freed. Please mind the period the latch is held here is very short.

However, this blocking scenario can easily be replayed:
1. session 1: start up a database foreground session.
2. session 2: obtain the PID of session 1's process and attach to it with gdb.
3. session 2: break on the kti_is_imu function (break kti_is_imu) and continue.
4. session 1: insert a row into a table (insert into test values ('a');). The breakpoint will fire in gdb, stopping execution.
5. session 3: start up another database foreground session.
6. session 3: select from the table on which the insert just broke (select * from test;). This will hang.
7. session 4: start up a SYSDBA session, and obtain the wait state of the foreground sessions:

SQL> @who active

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -------------
59,53450,@1     28166    SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:98
102,53150,@1    28083    TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:66867.83ms,event:ON CPU:SQL,seq#:33
101,6637,@1     28186    TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:344013.82ms,event:latch: cache buffers chainsp1:1811473056p2:289p3:24599> Blocked by (inst:sid): 1:102,seq#:39

SID 101 is waiting on the cache buffers chains latch, because SID 102 is holding it. SID 102 shows ON CPU because we broke execution outside of a wait event, so Oracle thinks it's busy executing, while it is actually stopped by gdb.

Again, this is a wait (latch: cache buffers chains) that is unlikely to be an issue, and if it is, you either ran into a bug, or it is only the way a much bigger problem (oversubscription, bad application design) is showing itself.

b) buffer busy waits
The name of the wait event 'buffer busy wait' is self-explanatory. The technical implementation is far less known. What technically happens during a buffer busy wait is that a session needs to read a certain buffer, and in order to read it, it performs the usual actions of selecting the correct cache buffers chains bucket, following the doubly linked list of buffer headers to find the best buffer, and then obtaining a buffer state object. The critical part for running into a buffer busy wait is when the session checks the state of the buffer via the buffer header and finds the change state (X$BH.CSTATE) to be higher than 0, indicating the block is currently being changed. At this point the session calls kcbzwb (kernel cache buffers helper function wait for buffer), and must wait for the buffer state to be changed back so the buffer becomes accessible again; the wait event for this is the buffer busy wait. When the change state is reverted to 0, the buffer is accessible again.

Once a buffer is pinned in current mode using kcbgcur and kcbzgs, the buffer is still fully accessible to all sessions. Only once the database truly starts to change the buffer, which is done in kcbchg1_main, is the cache buffers chains latch taken non-shared, the CSTATE field in the buffer header changed to 1, and the latch freed. My current investigations show that CSTATE is changed to 2 after the kcopcv and kcrfw_copy_cv functions, so when the redo has been written to the public redo strand, and set to 4 after the changes have been applied to the buffers, which is done in kcbapl. At the end of the kcbchg1_main function, when all the changes are done, the cache buffers chains latch is taken non-shared again, the CSTATE field is set to 0 and the latch is freed, enabling access to the buffer again.
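The CSTATE lifecycle just described can be summarised in a small Python sketch; the state values and the function names in the comments are taken from the observations above, but the class itself is obviously a toy model, not Oracle code.

```python
# Toy model of the buffer header change state (X$BH.CSTATE) transitions
# observed in kcbchg1_main. Sketch only; the comments name the Oracle
# functions after which each transition was observed.
class BufferHeader:
    def __init__(self):
        self.cstate = 0              # 0: buffer accessible

    def readable_consistent(self):
        # CSTATE > 0 means the block is being changed: buffer busy wait
        return self.cstate == 0

buf = BufferHeader()
assert buf.readable_consistent()

buf.cstate = 1   # kcbchg1_main: latch taken non-shared, CSTATE set, latch freed
assert not buf.readable_consistent()   # readers now sit in 'buffer busy wait'

buf.cstate = 2   # after kcopcv/kcrfw_copy_cv: redo copied to the public strand
buf.cstate = 4   # after kcbapl: change applied to the buffer
buf.cstate = 0   # end of kcbchg1_main: latch taken again, CSTATE reset, freed
assert buf.readable_consistent()       # buffer accessible again
```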

68801:kcbchg1_main+2536:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):R:8:0/0()
68802:kcbchg1_main+2536:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):W:8:0x1000000000000022/1152921504606847010()
..
68897:kcbchg1_main+3318:0x00000000967eddda(Variable Size(pgsz:2048k)|+901701082 buffer header|9668A000+66 ):W:1:0x1/1()
..
68939:kcbchg1_main+3715:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):R:8:0x1000000000000022/1152921504606847010()
68940:kcbchg1_main+3715:0x000000006ba4e4d8(Variable Size(pgsz:2048k)|+182772952 shared pool|(child)latch#5383:cache buffers chains+0 shared pool|permanent memor,duration 1,cls perm+2204888 ):W:8:0/0()

Above you see a cache buffers chains latch being obtained non-shared (the special bit 0x1000000000000000 is set), and a little later the buffer header being updated, setting the value 1 at offset 66. Once this happens, it is visible in the CSTATE column of X$BH. A little further on, the latch is freed.

Why am I showing all of this? Well, there is quite some work done while the buffer is in a changed state (CSTATE>0) and therefore inaccessible. In normal situations this is not a problem, because despite all the work this is done very quickly and is therefore hardly noticeable or measurable, so there is no or very little time spent in buffer busy waits in most databases.

However… if processes do get stuck randomly, it's possible that a process gets stuck while making changes to a buffer, which means any additional access to the buffer results in buffer busy waits. Also mind that a change to a table buffer requires (at least, in the most simple case) three buffers to be changed: the buffer containing the table data, the buffer containing the undo segment transaction table and the buffer containing the actual undo. A commit requires one buffer change: the undo segment transaction table.

Let me show you an example:
1. session 1: start up a database foreground session.
2. session 2: obtain the PID of session 1's process and attach to it with gdb.
3. session 2: break on ktugur and kcopcv and continue.
4. session 3: start up a database foreground session.
5. session 1: insert a row into a table: insert into test values ('a');
6. session 2: gdb breaks execution of session 1 because it encountered ktugur.
7. session 3: select * from test; this runs unblocked.
8. session 4: start up a database foreground session as sysdba.
9. session 4: select * from v$lock; this will hang.

This one wasn't obvious; I found it while doing the tests. When a foreground session is in ktugur (kernel transaction undo generate undo and redo), it is creating the change vectors, and it holds two buffers in current mode:

SQL> l
  1  select 	ts.name tablespace,
  2  	dbablk,
  3  	class,
  4  	state,
  5  	decode(state,0,'free',1,'xcur',2,'scur',3,'cr', 4,'read',5,'mrec',6,'irec',7,'write',8,'pi', 9,'memory',10,'mwrite',11,'donated', 12,'protected',  13,'securefile', 14,'siop',15,'recckpt', 16, 'flashfree',  17, 'flashcur', 18, 'flashna') state_decoded,
  6  	mode_held,
  7  	changes,
  8  	cstate,
  9  	flag,
 10  	flag2,
 11  	rflag,
 12  	sflag,
 13  	lru_flag,
 14  	dirty_queue
 15  from 	  x$bh b
 16  	, v$tablespace ts
 17  where 	1=1
 18  and	not (us_nxt = us_prv and wa_nxt = wa_prv and to_number(wa_nxt,'xxxxxxxxxxxxxxxx') = to_number(us_nxt,'xxxxxxxxxxxxxxxx') + 16)
 19* and	b.ts#=ts.ts#
SQL> /

TABLESPACE                         DBABLK      CLASS      STATE STATE_DECO  MODE_HELD    CHANGES     CSTATE       FLAG      FLAG2      RFLAG      SFLAG   LRU_FLAG DIRTY_QUEUE
------------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -----------
UNDOTBS1                              128         17          1 xcur                2          1          0    2097152          0          0          0          0           0
TS                                  24597          1          1 xcur                2          1          0    2097153          0          0          0          0           0

The thing that was not obvious at first was that the blocks have CSTATE 0, so they can be accessed and read for a consistent read. Then I looked at the wait and the p1/p2/p3 of the wait:

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -------------------------------------------------------
100,262,@1      6274     SYS             DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:2519.48ms,event:buffer busy waitsp1:3p2:128p3:17> Blocked by (inst:sid): 1:59,seq#:476
142,15103,@1    6695     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:47115.23ms,event:SQL*Net message from client,seq#:226
12,25165,@1     7036     SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:3165
59,31601,@1     5891     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:31292.51ms,event:ON CPU:SQL,seq#:111

Above, SID 100 is session 4; it hangs in a buffer busy wait. The datafile is #3, which is the undo tablespace UNDOTBS1, and the buffer it is waiting for is block 128. Why does SID 100 need to wait for the buffer, while another session (session 3 in the above runbook) can run without getting blocked on a buffer that is held in exactly the same state (xcur and CSTATE zero)?

The answer can be obtained by executing pstack on the waiting process (or attaching with gdb and obtaining a backtrace):

(gdb) bt
#0  0x00007fd6a085bbda in semtimedop () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000010f9cca6 in sskgpwwait ()
#2  0x0000000010f9a2e8 in skgpwwait ()
#3  0x0000000010a66995 in ksliwat ()
#4  0x0000000010a65d25 in kslwaitctx ()
#5  0x00000000015c5e5e in kcbzwb ()
#6  0x0000000010b3748e in kcbgcur ()
#7  0x0000000010ab9b17 in ktuGetUsegHdr ()
#8  0x00000000014075d3 in ktcxbcProcessInPDB ()
#9  0x0000000001406de2 in ktcxbc ()
#10 0x0000000010ea8781 in qerfxFetch ()
#11 0x0000000010bb919c in rwsfcd ()
#12 0x00000000035ae618 in qeruaFetch ()
#13 0x00000000035ad837 in qervwFetch ()
#14 0x0000000010bb919c in rwsfcd ()
#15 0x0000000010ea1528 in qerhnFetch ()
#16 0x0000000010bb919c in rwsfcd ()
#17 0x0000000010ea1528 in qerhnFetch ()
#18 0x0000000010c9eeb3 in opifch2 ()
#19 0x000000000268c3de in kpoal8 ()
#20 0x0000000010ca5b3d in opiodr ()
#21 0x0000000010f37a29 in ttcpip ()
#22 0x0000000001eb4674 in opitsk ()
#23 0x0000000001eb914d in opiino ()
#24 0x0000000010ca5b3d in opiodr ()
#25 0x0000000001eb04ed in opidrv ()
#26 0x0000000002afedf1 in sou2o ()
#27 0x0000000000d05577 in opimai_real ()
#28 0x0000000002b09b31 in ssthrdmain ()
#29 0x0000000000d05386 in main ()

The important bit is at #6: kcbgcur (kernel cache buffers get buffer in current mode)! In other words: the block is requested in current mode while another session is holding it in current mode, which means it has to wait. The function kcbzwb means kernel cache buffers helper function wait for buffer.

I really wondered why the undo segment buffer is gotten in current mode; this is a read, there is no intention to change anything. My current educated guess is that this is a way to be sure this buffer contains the current state of the transactions across all nodes in a RAC cluster. So this might be a clever optimisation to be absolutely sure the current state of the undo segment transaction table is read.

10. session 2: continue. This will break on kcopcv.
11. session 3: select * from test; this will hang.

Now we have two hanging sessions waiting for the wait event buffer busy wait:

SQL> @who active

SID_SERIAL_INST OSPID    USERNAME      C SERVER    PROGRAM                                          EXTRA
--------------- -------- ------------- - --------- ------------------------------------------------ -----------------
100,262,@1      6274     SYS             DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:120790ms,event:buffer busy waitsp1:3p2:128p3:35> Blocked by (inst:sid): 1:59,seq#:501
142,15103,@1    6695     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:1175715.03ms,event:buffer busy waitsp1:4p2:24597p3:1> Blocked by (inst:sid): 1:59,seq#:250
12,25165,@1     7036     SYS           * DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:0ms,event:ON CPU:SQL,seq#:3309
59,31601,@1     5891     TS              DEDICATED sqlplus@memory-presentation.local (TNS V1-V3)    time:9544.97ms,event:ON CPU:SQL,seq#:116

SID 100 still hangs in wait event buffer busy wait for file 3, block 128, but now the session doing the full table scan also hangs in wait event buffer busy wait, but for file 4, block 24597. Let’s look at the buffers held in current mode again:

SQL> @pinned_blocks

TABLESPACE                         DBABLK      CLASS      STATE STATE_DECO  MODE_HELD    CHANGES     CSTATE       FLAG      FLAG2      RFLAG      SFLAG   LRU_FLAG DIRTY_QUEUE
------------------------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -----------
UNDOTBS1                              128         35          1 xcur                2          0          1    2097156          0          0          0          0           0
UNDOTBS1                             3127         36          1 xcur                2          0          1    2097156          0          0          0          0           0
TS                                  24597          1          1 xcur                2          0          1    2097157          0          0          0          0           0

Because the insert progressed to kcopcv, the CSTATE of all three buffers involved in the insert (the undo segment transaction table, the undo segment undo entry, and the buffer holding the inserted row) is set to 1. This means the two buffer busy waits actually have two different reasons: the select * from v$lock still waits to get a buffer in current mode, and the select * from test, which does a consistent read, is now waiting for the CSTATE to be set back to 0. To resolve the hang, go to session 2 and continue. This will break again on kcopcv, and again if you continue, because there are 3 blocks involved in this insert. Another way to make the insert finish is to disable all breakpoints (disable command) or to quit gdb.

Just for the fun of it, here is another example of how a stuck session can escalate: a process getting stuck executing kcbapl (kernel cache buffers apply a change to a buffer). This scenario can be replayed using the following steps:
1. session 1: start up a database foreground session.
2. session 2: obtain the PID of session 1's process and attach to it with gdb.
3. session 2: break on kcbapl and continue.
4. session 3: start up a database foreground session.
5. session 3: insert a row into a table: insert into test values ('a');
6. session 1: insert a row into a table: insert into test values ('a');
7. session 2: gdb breaks execution of session 1 because it encountered kcbapl.
8. session 3: commit; this will hang.

Can you reason why session 3 is hanging? Both sessions are doing inserts into the same table… Give it a thought and replay it yourself.

Explanation: the breakpoint stopped execution before the changes were actually made. CSTATE for the buffer is set to 1; it will get set to 2 inside kcbapl to indicate the change vectors have been copied. But the important bit is that a redo copy latch was taken before executing kcbapl. The commit in session 3 sent a message to the logwriter asynchronously asking it to write (the first invocation of kcrf_commit_force_int), then freed the logical locking of the insert, and then invoked kcrf_commit_force_int again to wait for the logwriter to finish writing and be notified. However, the logwriter needs all the redo copy latches to be free (it actually reads a metadata structure that indicates whether the redo copy latches are free or not), and it has to wait because session 1 is stuck holding a redo copy latch. Not only does this show how dangerous it is when sessions can't run, it also shows how dangerous it can be to run gdb against a production database; it can be quite hard to figure out what the implications of attaching to a process with gdb will be. So the actual outcome of the above is that session 3 waits in the event 'log file sync', and the logwriter waits in 'LGWR wait for redo copy'.
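The essence of this escalation can be sketched as a small Python toy model, only to show the dependency chain; the real logwriter inspects latch metadata structures, it obviously does not iterate a dictionary.

```python
# Toy model of the dependency chain described above: the logwriter will not
# write (and thus not post committers waiting in 'log file sync') while any
# redo copy latch is held, because a change vector might still be on its
# way into the strand it is about to write. Sketch only, not Oracle code.
def lgwr_can_write(redo_copy_latches):
    # 'LGWR wait for redo copy' for as long as any copy latch is held
    return not any(redo_copy_latches.values())

latches = {0: False, 1: False}           # one entry per redo copy latch

latches[1] = True                        # session 1 stuck inside kcbapl
assert not lgwr_can_write(latches)       # lgwr waits, and every committer
                                         # waits in 'log file sync' behind it

latches[1] = False                       # session 1 continues and frees it
assert lgwr_can_write(latches)           # lgwr writes, committers get posted
```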

Summary
The Oracle database is carefully set up to handle concurrency. In most normal situations you will see no, or maybe very little, time spent in wait events that indicate concurrency, like latch waits or buffer busy waits. If an Oracle database runs on a machine that has more active processes than available CPU threads or cores (which of the two depends on the hardware and CPU architecture; for Intel based architectures you should lean towards cores, for SPARC you can lean more towards threads), the operating system needs to make choices, and will do so trying to give all processes an equal share. This means processes will be put off CPU in order to give all processes a fair share of CPU run time. Also mind other oversubscription possibilities, like using more memory than physically available, and doing the same using virtualisation: handing out more VCPUs than CPU threads, and handing out more memory than physically available. Because you can't look outside your virtual machine, it's often harder to troubleshoot virtualisation oversubscription issues.

Oracle really tries to minimise the time spent holding a non-shared latch. Therefore, with modern Oracle databases it's unlikely that latch waits show up, even if a system is (not too heavily) oversubscribed. If you do have a latching problem, most likely you have another problem that is causing it.

The most well documented reason for buffer busy waits is a buffer being written to persistent media, during which the buffer is marked busy for the duration of the write. Another documented reason is requesting a buffer in a mode that is not compatible with the mode it is held in. Holding a buffer in current mode will block another request for the buffer in current mode. But having a buffer pinned in current mode does not block consistent read access; it is only when the buffer is actually being changed that it must be marked busy. This state is externalised in the CSTATE column of the view X$BH, which shows fields of the buffer header struct used to manage the buffers in the buffer cache. Because there is quite some work to be done when a buffer is changed, and during these changes the buffer is not usable ('busy') for other processes, a buffer busy wait is more likely to occur when oversubscription is taking place.
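The compatibility rules summarised here can be condensed into a few lines of Python; this is a sketch of my understanding of the cases shown in this post, not an exhaustive or authoritative rule set.

```python
# Sketch of the access rules observed above: a current mode get blocks on
# another current mode holder, while a consistent read get is only blocked
# while the buffer is actually being changed (CSTATE > 0). Toy model only.
def must_wait(request, held_current, cstate):
    if request == "current":
        return held_current      # current vs current: incompatible
    if request == "cr":
        return cstate > 0        # consistent read: only blocked while busy
    raise ValueError(request)

# the v$lock query: a current mode get against a current mode holder waits
assert must_wait("current", held_current=True, cstate=0)
# select * from test while CSTATE was 0: consistent read ran unblocked
assert not must_wait("cr", held_current=True, cstate=0)
# the same select once CSTATE was set to 1: buffer busy wait
assert must_wait("cr", held_current=True, cstate=1)
```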

Thanks to Jonathan Lewis for explaining how to get the buffer header addresses.
Thanks to Abel Macias for brainstorming.


The previous blogpost talked about a simple insert; this blogpost investigates what happens when the DML is committed. Of course this is done with regular commit settings, which means they are not touched, which means commit_logging is set to immediate and commit_wait is set to wait, as far as I know. The documentation says there is no default value, and the settings are empty in all parameter views. In my humble opinion, if you must change the commit settings in order to make your application perform acceptably with the database, something is severely wrong somewhere.

This blogpost works best if you have thoroughly gone through the previous post. I admit it's a bit dry and theoretical, but you will appreciate the knowledge you gained there, because it directly applies to a commit.

First let’s look at the flow of functions for the commit:

kpoal8
  opiexe
    kksExecuteCommand
      xctCommitTxn
        xctctl
          k2send
            k2lcom
              ktdcmt
                ktcCommitTxn
                  ktcCommitTxn_new
                    qesrcCM_Enabled
                    qesrcCM_PreCmt_
                      qesrcTol_New
                    ktuIMTabPresetTxn
                    ktucmt
                      kcbchg1                 
                        kcbchg1_main
                          ktiimu_chg
                          kcopcv
                          kcrfw_redo_gen_ext
                            kcrfw_copy_cv
                            kcscur_rba
                            kcbapl
                              kcbhlchk
                              kcoapl
                              kco_issue_callback
                                kturcm
                              kco_blkchk
                              kcoadv_hdr
                              kcbhfix_tail
                            kcrfw_partial_checksum
                            kcrf_commit_force_int
                              kcscu8
                              kcscu8
                              ksbasend
                          ktuulc
                            ksqrcl
                              ksqrcli_int
                    ktudnx   
                  kssdct
                    kssdch_int
                      kssdel
                        kss_del_cb
                          ktldbl
                            ktldbl_noredo
                              kcbnlc
                                ktbdbc
                    kssdch_int
                      kssdel
                        kss_del_cb
                          ktaidm
                            ksqrcl
ksupop
  kcrf_commit_force_int
    kcscu8
    kcscu8
    ksbasend
    kslawe
    kcscu8
    kslwtbctx
    kslwaitctx
  - kcscu8
    kslawe
    kcscu8
    kslwaitctx
  - kcscu8
    kslawe
    kcscu8
    kslwaitctx
  - kcscu8
    kslwtectx

Please mind this overview of functions is not complete; it provides enough information to show the flow of functions I want to highlight. Many more functions are involved during the execution of a commit.

The first thing that is visible here is that after the kpoal8/opiexe/kksExecuteCommand (kernel compile shared objects execute command) function, xct (transaction control) functions are executed. Of course this is logical: a commit ends a transaction and makes its changes visible. The xct layer then moves into the k2 layer, which is the distributed execution layer. I am not doing anything distributed; my current understanding is that this layer is invoked so that if anything distributed was pending, it would be handled appropriately. After the k2 layer the function ktdcmt (kernel transaction distributed commit) is executed.

After the distributed layers we enter the ktc layer (kernel transaction control). In ktcCommitTxn_new I see handling of features like the result cache (qesrc) and In-Memory (ktuIM), then the ktu layer (kernel transaction undo) is entered, which enters the kcb layer (kernel cache buffers) using functions we saw in the previous post: kcbchg1 and kcbchg1_main.

In fact, at this point it looks very similar to the insert: in kcbchg1_main, ktiimu_chg and kcopcv (prepare change vectors) are called, but only once instead of three times as we saw with the insert, because a commit only involves one block (see a little further for the explanation). Then kcrfw_redo_gen_ext is called, which does almost the same as with the insert: first kcrfw_copy_cv is executed to copy the change vector into the public redo strand, then kcbapl is called to apply the change to a buffer. The kco_issue_callback function calls kturcm (kernel transaction undo redo commit), indicating the type of change to the buffer. This means that a commit changes a single buffer, namely the buffer that holds the transaction’s slot in the transaction table in the undo segment, and the change is marking the transaction as committed. So a ‘commit marker’ is not a special token that is written into the redo stream; in fact it’s simply a block change, just like all other change vectors.

After kcbapl, kcrfw_partial_checksum is called to checksum the redo in the public redo strand, again just like we saw with the insert.

Unique to a commit is the invocation of the kcrf_commit_force_int function. This is the first ‘use’ of the kcrf_commit_force_int function (indicated by the second function argument set to zero, not visible in the overview), which is signalling the logwriter to write any unwritten change vectors in the public redo strands. kcrf_commit_force_int checks the on disk SCN and the LWN SCN using kcscu8 (kernel cache scn management read current SCN) in the kcrfsg_ struct to check logwriter progress:
– If the on disk SCN is beyond the process’ commit SCN, the change vectors are written, and no further action is necessary (the function is exited), which also means a second invocation of kcrf_commit_force_int is not necessary.
– If the on disk SCN isn’t progressed beyond the process’ commit SCN, but the LWN SCN is, it means the logwriter is currently writing the change vectors for this commit SCN. In that case there is no need to signal the logwriter, but it requires the process to validate the write later using the second invocation of kcrf_commit_force_int.
– If both the on disk SCN and LWN SCN did not progress beyond or are equal to the process’ commit SCN, this invocation of kcrf_commit_force_int needs to send the logwriter a message using ksbasend (kernel service background processes asynchronous send message) to start writing the public redo strands. ksbasend will only send a message if the messages flag in the kcrfsg_ struct is not set indicating it has already been signalled.
After this, the kcrf_commit_force_int function is returned from, as well as the kcrfw_redo_gen_ext function, so we are back in kcbchg1_main.
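The three cases above can be sketched as a small decision function. This is a paraphrase of the described logic with invented names, not Oracle code, and whether the comparisons are strict or inclusive is a detail I am not certain of:

```python
# Sketch of the first-invocation logic of kcrf_commit_force_int described
# above. SCNs are plain integers here; the real function reads the
# kcrfsg_ struct with kcscu8 and signals the logwriter with ksbasend.

def kcrf_commit_force_int_first_pass(commit_scn, on_disk_scn, lwn_scn):
    """Return the action the first invocation takes for a given commit SCN."""
    if on_disk_scn >= commit_scn:
        # Change vectors already on disk: nothing to do, no second pass.
        return "done"
    if lwn_scn >= commit_scn:
        # Logwriter is already writing our SCN: only verify the write later.
        return "verify-later"
    # Logwriter has not picked up our redo yet: signal it (ksbasend).
    return "ksbasend"

assert kcrf_commit_force_int_first_pass(100, on_disk_scn=120, lwn_scn=120) == "done"
assert kcrf_commit_force_int_first_pass(100, on_disk_scn=90, lwn_scn=110) == "verify-later"
assert kcrf_commit_force_int_first_pass(100, on_disk_scn=90, lwn_scn=95) == "ksbasend"
```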

Also different from an insert is the invocation of the ktuulc (kernel transaction undo unlock; this is a guess) function. It calls ksqrcl (kernel service enqueue release an enqueue), which calls ksqrcli_int (my guess is that the _int suffix denotes an internal version of the enqueue release function), which performs the clearance of the TX enqueue set for the inserted row. This clearance is not a guess: ksqrcli_int does clear the TX enqueue for the inserted row. After clearing the row lock, some more functions are returned from, kcbchg1_main and kcbchg1, so we are back in ktucmt.

Because the transaction is now committed, the active transaction count in the undo segment can be decreased, which is done in ktudnx (kernel transaction undo decrease active transaction count). Then the ktucmt function is returned from too, and we are back in ktcCommitTxn_new.

In ktcCommitTxn_new, state objects are cleaned up using kssdct (kernel service state object delete children of specified type). A state object is a memory area that keeps the state of various things that are transient, so if they get lost, the state object reflects the last known state. The callback action of the function performs some more housekeeping: the ktldbl (kernel transaction list blocks changed delete list of block SO’s) function removes block SO’s/buffer handles and calls kcbnlc (kernel cache buffers analyse cleanout), which calls ktbdbc (kernel transaction block fast block cleanout) to perform delayed block cleanout/commit cleanout. This cleans up the state in the data block, which means it clears the lock bytes, sets the Fsc/Scn to the commit SCN and sets the commit flag to C--- in the ITL in the block.

The next state object that is cleaned is the shared table lock (TM); by looking at the functions it’s quite easy to understand that this is happening: ksqrcl is the function to release an enqueue, and ktaidm is the kernel transaction access internal deallocate dml lock function.

Past releasing the TM enqueue, other things are done whose function names I did not include, but the execution returns from a lot of the other functions shown as calling functions. Of course Oracle needs to update all kinds of counters and usage statistics, and record audit information. But eventually, everything has been released. However, there is something more that is executed as part of a commit: the second invocation of the kcrf_commit_force_int function.

When kcrf_commit_force_int is executed for the second ‘use’, this is visible by the second calling argument being set to ‘1’ (not visible in the function call overview above). The functions that are executed in kcrf_commit_force_int are exactly the same as in the first invocation:
– kcscu8 is called to read the on disk SCN from kcrfsg_
– kcscu8 is called to read the LWN SCN from kcrfsg_
The same logic as stated previously is applied to the SCN values that result from these kcscu8 calls. If the SCNs did not progress far enough, ksbasend is called.

The interesting thing of the second execution of kcrf_commit_force_int happens after ksbasend: the kcrf_commit_force_int function loops until the on disk SCN has progressed beyond the process’ commit SCN (which means the change vectors have been written from the public redo strands into the online redologfiles). To indicate it’s waiting/looping for the on disk SCN to progress that far, the wait interface is called (kslwtbctx) for the wait ‘log file sync’, after which it loops; I put a hyphen before the start of each loop iteration to indicate what the loop consists of.

I illustrated the post/wait mode of log file sync, which is visible with ‘kslawe’ (kernel service lock management add post-wait entry). The post-wait entry is removed inside kslwaitctx, and then setup again. Interestingly, when in post-wait mode, the process must be posted by the logwriter, even if it finds the on disk SCN to have progressed beyond the process’ commit SCN. The other mode for waiting for the on disk SCN is called ‘polling’, search my blog for articles about it if this sparked your interest.
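As an illustration only, the post/wait loop can be modelled with a helper thread standing in for the logwriter and a threading.Event standing in for the post-wait entry that kslawe registers. The names in the comments map to the functions above; the code itself is a toy, not Oracle's:

```python
import threading

# Toy model of the 'log file sync' post/wait loop described above.
# A helper thread stands in for the logwriter; an Event stands in for
# the post-wait entry (kslawe) that the logwriter posts via semop.

on_disk_scn = 0
post = threading.Event()

def logwriter(target_scn):
    global on_disk_scn
    on_disk_scn = target_scn    # pretend the redo write completed
    post.set()                  # post the waiting session

def log_file_sync(commit_scn):
    waits = 0
    while True:
        post.clear()                   # kslawe: (re)register post-wait entry
        if on_disk_scn >= commit_scn:  # kcscu8: read the on disk SCN
            return waits
        post.wait(timeout=1.0)         # kslwaitctx: sleep until posted
        waits += 1

threading.Thread(target=logwriter, args=(42,)).start()
log_file_sync(42)
assert on_disk_scn >= 42
```

Note that in this toy the waiter also exits when it sees the SCN has progressed, whereas the text above observes that a real session in post/wait mode must actually be posted by the logwriter.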

Summary
The intention of this blogpost is not to bury you in Oracle internal functions, despite the look of the article and the number of functions mentioned :-). The aim of spelling out the functions is to show what happens, and to learn about the layers in which they execute.

If you skip past the first couple of functions that are executed with a commit, the ktc (kernel transaction control) layer is crossed, then the ktu (kernel transaction undo) layer, after which the change is executed under supervision of the kcb (kernel cache buffer) layer.

In fact, the rough outline of the change is the same as described in the previous article about the insert: kcbchg1, kcbchg1_main, kcopcv, kcrfw_redo_gen_ext, etc. Just like with the insert, the function called in kco_issue_callback sets the type of block change, which for a commit is kturcm.

A commit is a change to the block that holds the undo segment’s transaction table, and flags the current transaction as committed. This is what is commonly referred to as a ‘commit marker’.

After the kturcm function, the transaction is changed to the status committed. However, if you look closely, there are several functions executed AFTER kturcm, like kco_blkchk, kcoadv_hdr and kcbhfix_tail that complete the change made in kturcm in order to make the block consistent.

After the block changes and the change vector checksum in kcrfw_partial_checksum, a function unique to commit is executed: kcrf_commit_force_int. The first invocation of this function signals the logwriter to write.

At the time of kcrf_commit_force_int and returning from it into the function kcrfw_redo_gen_ext back to kcbchg1_main, the newly inserted value is not yet available, but when the execution in the kcbchg1_main function reaches ktuulc to clean up the TX enqueue, the NEW value becomes available!

This is something which I still do find counter-intuitive, because this means that at the above mentioned time, once ktuulc has been reached, the change becomes visible to all sessions but the committing session. The committing session at that point needs to clean up the block a little, and later on remove the shared TM enqueue, and after that, the committing session executes kcrf_commit_force_int again to wait for the ‘commit marker’ and obviously all preceding change vectors to complete. WHILE ALL OTHER SESSIONS CAN SEE AND USE THE CHANGED DATA FOR WHICH THE COMMITTING SESSION IS WAITING!

This blogpost looks at the very start of oracle redo: the generation of it. In order to do that, I start off with a very simple table, and look at the redo generation part. I guess the regular readers of this blogpost series understand that redo generation is closely connected with making changes to blocks. This post therefore is not only about redo generation, but also about the technical implementation of block changes.

I created a simple table (create table test (f1 varchar2(10))) with no index to make the execution as simple as possible, and simply insert rows (insert into test values (‘a’)). It could be argued that not having an index makes it not the most real-life scenario, and this might be right. However, the goal here is to make the execution as simple as possible.

I then looked at the execution of the SQL, and created an overview of the relevant functions that are executed in my session:

insexe
  qerltcStart
  qerltcFetch
    qerltcSingleRowLoad
      qerltcSimpleInsRowCBK
        kdtSimpleInsRow
          kdtgrs
          kdtwrp
            kdtchg
              ktbchg2
                ktuchg2
                  ktcbgn
                  ktubnd
                    ktuGetTxForXcb
                      ksqgtlctx
                  ktugur
                  kcbchg1
                    kcbchg1_main
                      ktiimu_chg
                      kcb_block_status
                      ktiimu_chg					
                      ktiimu_chg
                      kcopcv
                      kcopcv
                      kcopcv
                      kcrfw_redo_gen_ext
                        kcrfw_copy_cv
                        kcscur_rba
                        kcbapl
                          kcbhlchk
                          kcoapl
                            kco_issue_callback
                              kturdh
                            kco_blkchk
                            kcoadv_hdr
                            kcbhfix_tail
                        kcbapl
                          kcbhlchk
                          kcoapl
                            kco_issue_callback
                              kturdb
                            kco_blkchk
                            kcoadv_hdr
                            kcbhfix_tail
                        kcbapl
                          kcbhlchk
                          kcoapl
                            kco_issue_callback
                              kdoirp
                            kcoadv_hdr
                            kcbhfix_tail	
                        kcrfw_partial_checksum
    qerltcPostInsertProcessing
    qerltcSetupOerdef
    qerltcFreeMemory

How to read this: I started looking at the functions that are executed when an insert statement is executed at the ‘insexe’ function, which is the main insert function. An indentation means the function is called from the previous, less indented function.
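Such an indented overview can be reproduced mechanically from a stream of function entry/exit events; the event list below is invented for illustration, it is not actual tracer output:

```python
# Render a call/return event stream as the indented overview used above.
# The events are invented for illustration; a real stream would come from
# a debugger or tracer logging function entry and exit.

def render(events, step=2):
    depth, out = 0, []
    for kind, func in events:
        if kind == "call":
            out.append(" " * depth * step + func)
            depth += 1
        else:  # "return"
            depth -= 1
    return "\n".join(out)

events = [
    ("call", "insexe"),
    ("call", "qerltcStart"), ("return", "qerltcStart"),
    ("call", "qerltcFetch"),
    ("call", "qerltcSingleRowLoad"), ("return", "qerltcSingleRowLoad"),
    ("return", "qerltcFetch"),
    ("return", "insexe"),
]
print(render(events))
```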

The insertion code calls query execute rowsource load table conventional functions (qerltc). Conventional here means that it’s not a direct insertion. The qerltcStart function essentially initialises memory which is needed to perform the execution of the insert statement.

After the rowsource layer, the data layer is entered using the kdtSimpleInsRow. The first thing that needs to be done here is to find space in a block in which the row can be inserted. Finding a block is done in the kdtgrs function (kernel data table get space for rows). Once a block is found, the code inserts the row using the kdtwrp function (kernel data table write row piece). Via the kdtchg function (kernel data table change) function, the transaction layer is entered.

The transaction layer functions that are then used are ktbchg2 (kernel transaction block (header) change) and ktuchg2 (kernel transaction undo called from write logs and perform changes), after which the transaction is initialised in ktcbgn (kernel transaction control begin); next is ktubnd (kernel transaction undo bind undo segment). Inside ktubnd, not only is the undo segment selected (‘bound’), but the transaction enqueue (TX) lock is set too (ksqgtlctx), which locks the row logically. Further down below you will see that Oracle also needs a physical way of locking access when it makes changes, to prevent wrong data from being read.

Then we enter the cache layer via the kcbchg1 function (kernel cache buffers change). The cache layer is the actual layer that executes functions to change the block. The actual main function which performs block layer functions is kcbchg1_main.

In order for Oracle’s recovery mechanism to work, the redo must be generated before the block change, so a block can be recovered using the redo. Therefore, kcrfw_redo_gen_ext (kernel cache redo file write/broadcast SCN main redo generation function (12c)) is called. Inside kcrfw_redo_gen_ext, the redo change vectors are copied into the public redo strand using the kcrfw_copy_cv function (kernel cache redo file write/broadcast SCN copy change vectors), after which the changes to the block in the cache are done using the kcbapl (kernel cache buffers apply).
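The rule that redo must exist before the buffer is changed is the classic write-ahead logging principle. A minimal sketch, with invented data structures (not Oracle's), of the ordering kcrfw_redo_gen_ext enforces:

```python
# Minimal write-ahead logging sketch: the change vector must reach the
# (public) redo strand before the buffer is changed, so the buffer can
# always be rebuilt from redo after a crash.

redo_strand = []                        # stands in for the public redo strand
buffer_cache = {"block 5": "old row"}   # stands in for the buffer cache

def change_block(block, new_value):
    # 1. kcrfw_copy_cv: copy the change vector into the redo strand first
    redo_strand.append((block, new_value))
    # 2. kcbapl: only then apply the change to the cached buffer
    buffer_cache[block] = new_value

change_block("block 5", "new row")
assert redo_strand == [("block 5", "new row")]
assert buffer_cache["block 5"] == "new row"
```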

There are three kcbapl executions. The first one applies the change to the undo header, where the transaction is assigned a slot which points to the actual undo data inside the undo segment, visible inside the kco_issue_callback function kturdh (kernel transaction undo write undo header data to undo block). The second one applies the change to the undo block with the previous version of the data, so a change can be undone and read consistency can be provided; this is also visible inside the kco_issue_callback function: kturdb (kernel transaction undo write undo data to undo block). The third invocation of kcbapl has the function kdoirp (kernel data operations insert row piece), which performs the actual change of the block to insert the row.

If you look closely at the functions inside kcbapl, you see that a check is done on the buffer prior to the change (kcbhlchk, kernel cache buffers header logical check), and after the change the block metadata is changed to reflect the change (kcoadv_hdr, kernel cache operations advance SCN in buffer header, and kcbhfix_tail, kernel cache buffers header update tail).

After the change, a checksum is applied to the generated redo using kcrfw_partial_checksum.

This concludes the insert. A row lock is obtained, the redo change vectors have been copied into the public redo strand, and the changes have been applied to the undo and table blocks.

The functions return, and execution is back at the query execute rowsource layer (qer), which performs some post processing using qerltcPostInsertProcessing, qerltcSetupOerdef and qerltcFreeMemory.

A question that I asked myself is where the actual redo change vectors are created. It’s known that kcrfw_copy_cv copies the change vectors into the public redo strand, but these need to be created before you can copy them. Are these created on the spot in kcrfw_copy_cv? Probably not. Of course this dives so deep into the actual working of the Oracle engine that Oracle does not reveal anything about it (that I am aware of).

Luckily, using the Intel Pin tools trace we can see memory access, and therefore we can track the memory areas that a process is using, and see where (in which function) memory is written to before it is written into the public redo strand. First let’s look at the kcrfw_copy_cv function:

40626:kcrfw_copy_cv+544:0x00007ffcd200f778():W:8:0x10e3d3e5/283366373()
40627:_intel_fast_memcpy+7:0x00000000157f62a0():R:8:0x137fff/1277951()
40628:__intel_ssse3_rep_memcpy+30:0x00000000137d0c90():R:4:0xf2afbf82/4071604098(??)
40629:__intel_ssse3_rep_memcpy+8722:0x00007ffcd2011678():R:16:0x5223030ffff10100542b760/-263813982931104()
40630:__intel_ssse3_rep_memcpy+8727:0x00000000d80cbe54(Redo Buffers(pgsz:2048k)|+835156 redo|PUB_REDO_0+835156 ):W:16:0x5223030ffff10100542b760/-263813982931104()
40631:__intel_ssse3_rep_memcpy+8732:0x00007ffcd2011688():R:16:0x000010ffff00000000/4785070309113856()
40632:__intel_ssse3_rep_memcpy+8737:0x00007ffcd201168c():R:16:0x10ffff0000000040200/-4503599627107840()
40633:__intel_ssse3_rep_memcpy+8742:0x00000000d80cbe64(Redo Buffers(pgsz:2048k)|+835172 redo|PUB_REDO_0+835172 ):W:16:0x000010ffff00000000/4785070309113856()
40634:__intel_ssse3_rep_memcpy+8747:0x00000000d80cbe68(Redo Buffers(pgsz:2048k)|+835176 redo|PUB_REDO_0+835176 ):W:16:0x10ffff0000000040200/-4503599627107840()
40635:__intel_ssse3_rep_memcpy+8752:0x00007ffcd200f778():R:8:0x10e3d3e5/283366373()
40636:kcrfw_copy_cv+569:0x00007ffcd200f890():R:4:0x2/2()

What this shows is the kcrfw_copy_cv function calling _intel_fast_memcpy. The _intel_fast_memcpy function dispatches to a specialised memcpy function, __intel_ssse3_rep_memcpy, which uses a processor extension called ‘ssse3’. At line numbers 40629, 40631 and 40632, 16 bytes are read, and at 40630, 40633 and 40634 the reads are written into the public redo strand.
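For readers who want to slice such traces themselves: the lines have the shape sequence:function+offset:address(annotation):R|W:size:value, which a few string splits can take apart. This parser is inferred from the sample lines in this post and is not part of the pinatrace tooling:

```python
# Split a pinatrace-annotate style line into fields. Parsing from the
# right first keeps colons inside the annotation (like 'pgsz:2048k')
# out of the way of the field separators.

def parse(line):
    head, rw, size, value = line.rsplit(":", 3)
    seq, func, addr_annot = head.split(":", 2)
    addr, _, annot = addr_annot.partition("(")
    return {"seq": int(seq), "func": func, "addr": addr,
            "annot": annot.rstrip(") "), "rw": rw,
            "size": int(size), "value": value}

rec = parse("40630:__intel_ssse3_rep_memcpy+8727:0x00000000d80cbe54"
            "(Redo Buffers(pgsz:2048k)|+835156 redo|PUB_REDO_0+835156 )"
            ":W:16:0x5223030ffff10100542b760/-263813982931104()")
assert rec["func"] == "__intel_ssse3_rep_memcpy+8727"
assert rec["rw"] == "W" and rec["size"] == 16
assert "PUB_REDO_0" in rec["annot"]
```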

In order to understand where the data in these 16 bytes memory addresses has been written, I use an extension to my pinatrace annotate oracle script that creates a sqlite database of the annotation results using the ‘-d’ switch.

In order to see where the contents of the memory address 0x00007ffcd2011678 have been written, fire up sqlite3 with the created database:

$ sqlite3 insert.txt.db
sqlite> select * from annotation where memory_address between printf('%d',0x00007ffcd2011678) and printf('%d',0x00007ffcd2011678)+15 and line < 40629;
14439|kssadd+322|140723831772792|??|W|8|279514983|??
14444|kpdbIdToUid+231|140723831772792|??|R|8|279514983|??
16562|kglget+33|140723831772792|??|W|8|0|??
16563|kglget+37|140723831772800|??|W|8|0|??
17380|kglget+290|140723831772792|??|R|8|0|??
17381|kglget+294|140723831772800|??|R|8|0|??
17393|kglpin+13|140723831772792|??|W|8|1810933992|Variable Size(pgsz:2048k)|+183544040 shared pool|KGLH0^d0901451 DS:free memory,duration 1,cls free+96 shared pool|KGLH0^d0901451,duration 1,cls recr+2792
17406|kglpin+77|140723831772804|??|W|4|1|??
17621|kglpin+857|140723831772800|??|W|4|1|??
17720|kglpin+1115|140723831772800|??|R|4|1|??
17726|kglpin+1164|140723831772804|??|R|4|1|??
18697|kglpin+1993|140723831772792|??|R|8|1810933992|Variable Size(pgsz:2048k)|+183544040 shared pool|KGLH0^d0901451 DS:free memory,duration 1,cls free+96 shared pool|KGLH0^d0901451,duration 1,cls recr+2792
18755|kzpchkbu_internal+133|140723831772792|??|W|4|12|??
18763|kzpchkbu_internal+187|140723831772800|??|W|1|2|??
18772|kzpchkbu_internal+230|140723831772801|??|W|1|0|??
18778|kzpchkbu_internal+285|140723831772798|??|W|2|0|??
18818|kzpchkbu_internal+567|140723831772792|??|R|4|12|??
18867|kzpchkbu_internal+1241|140723831772796|??|W|1|0|??
19084|kzpchkbu_internal+1496|140723831772796|??|R|1|0|??
19088|kzpchkbu_internal+1524|140723831772792|??|R|4|12|??
19090|kzpchkbu_internal+1528|140723831772801|??|R|1|0|??
19218|kzpchkbu_internal+1747|140723831772801|??|R|1|0|??
20936|kglati+52|140723831772792|??|W|8|139724814752480|??
21062|kglati+190|140723831772800|??|W|8|80|??
22201|kglati+407|140723831772792|??|R|8|139724814752480|??
22203|kglati+415|140723831772800|??|R|8|80|??
30665|__intel_memset+2691|140723831772800|??|W|16|0|??
31116|__intel_memset+2691|140723831772800|??|W|16|0|??
38371|ktuchg2+9341|140723831772792|??|W|1|5|??
38372|ktuchg2+9345|140723831772793|??|W|1|2|??
39504|kcopcv+207|140723831772794|??|W|2|35|??
39507|kcopcv+221|140723831772800|??|W|4|272|??
39511|kcopcv+249|140723831772796|??|W|4|4294901763|??
39517|ub8_to_kscn_impl+21|140723831772804|??|W|4|7744340|??
40116|kcrfw_redo_gen_ext+1260|140723831772796|??|R|2|3|??
sqlite>

The query is simple: look up all records which were generated before line 40629 and which accessed the memory region that is read for copying into the public redo strand.
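The same lookup is easy to script. The snippet below rebuilds a reduced annotation table in an in-memory sqlite3 database (toy rows and a reduced schema, not the full one the ‘-d’ switch creates) and runs the query from above:

```python
import sqlite3

# Reproduce the lookup: find all earlier accesses to the 16-byte region
# that kcrfw_copy_cv later reads. Toy rows, reduced schema.

con = sqlite3.connect(":memory:")
con.execute("create table annotation "
            "(line int, function text, memory_address int, rw text)")
base = 0x00007ffcd2011678
rows = [
    (38371, "ktuchg2+9341", base + 0, "W"),
    (39504, "kcopcv+207",   base + 2, "W"),
    (40629, "__intel_ssse3_rep_memcpy+8722", base, "R"),  # the copy itself
    (50000, "later_func",   base + 4, "W"),               # after the copy
]
con.executemany("insert into annotation values (?,?,?,?)", rows)

hits = con.execute(
    "select function from annotation "
    "where memory_address between ? and ? + 15 and line < 40629 "
    "order by line", (base, base)).fetchall()
print([f for (f,) in hits])   # the functions that filled the region
```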

This is the complete output. Please be aware that the high memory addresses are PGA memory addresses. PGA memory is highly volatile in nature: if you look at memory access traces, you see that nearly every layer that is used during execution of SQL and PL/SQL requires allocation of memory areas to store information. In fact, it’s so volatile that some memory areas are reused during repetitions in execution within the same statement.

The ‘__intel_memset’ functions are functions that set a memory area to a certain value, in this case zero. This might be an initialisation of memory. In order to be sure, look up the lines (30665/31116), and see from what function they were called:

30620:kdtgrs+233:0x00007ffcd2012188():W:8:0/0()
30621:kdtgrs+244:0x000000006c20fb98(Variable Size(pgsz:2048k)|+186710936 shared pool|PLMCD^6daf6103 DS:permanent memor,duration 4,cls perm+2472 shared pool|PLMCD^6daf6103,duration 4,cls freeabl+2512 ):R:1:0x8/8()
30622:kdtgrs+250:0x000000006c20fba4(Variable Size(pgsz:2048k)|+186710948 shared pool|PLMCD^6daf6103 DS:permanent memor,duration 4,cls perm+2484 shared pool|PLMCD^6daf6103,duration 4,cls freeabl+2524 ):R:1:0x80/128()
30623:kdtgrs+273:0x00007ffcd20110e8():W:8:0x10e0cc66/283167846()
30624:_intel_fast_memset+7:0x00000000157f62a0():R:8:0x137fff/1277951()
30625:__intel_memset+1072:0x00000000157f6724():R:4:0x2/2()
30626:__intel_memset+1128:0x00000000062c7240():R:8:0x92/146()
30627:__intel_memset+2633:0x00000000157b9480():R:4:0x600000/6291456()
30628:__intel_memset+2660:0x00007ffcd2011430():W:16:0x0000000000000000/0()
30629:__intel_memset+2664:0x00007ffcd2011440():W:16:0x0000000000000000/0()
30630:__intel_memset+2669:0x00007ffcd2011450():W:16:0x0000000000000000/0()
30631:__intel_memset+2674:0x00007ffcd2011460():W:16:0x0000000000000000/0()
30632:__intel_memset+2686:0x00007ffcd2011470():W:16:0x0000000000000000/0()
30633:__intel_memset+2691:0x00007ffcd2011480():W:16:0x0000000000000000/0()

It’s called from kdtgrs. And by looking at the long list of memset commands, this clearly is initialisation of memory. This means that any write to this memory before the memset belongs to something completely different, and can be disregarded. This means these functions did provide the redo information:

38371|ktuchg2+9341|140723831772792|??|W|1|5|??
38372|ktuchg2+9345|140723831772793|??|W|1|2|??
39504|kcopcv+207|140723831772794|??|W|2|35|??
39507|kcopcv+221|140723831772800|??|W|4|272|??
39511|kcopcv+249|140723831772796|??|W|4|4294901763|??
39517|ub8_to_kscn_impl+21|140723831772804|??|W|4|7744340|??

The ktuchg2 function hints at undo, and the kcopcv function seems to examine the block and create redo information based on the block. The ub8_to_kscn_impl function is a function for the conversion of SCN numbers.
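The reasoning above can be stated as a simple filter: find the last initialising memset over the region, and keep only the writes after it (toy rows mirroring the listing above):

```python
# Keep only the writes that happened after the last initialising memset,
# since the memset zeroes the region and erases any earlier contents.

rows = [  # (line, function) pairs, mirroring the listing above
    (17393, "kglpin+13"),
    (30665, "__intel_memset+2691"),
    (31116, "__intel_memset+2691"),
    (38371, "ktuchg2+9341"),
    (39504, "kcopcv+207"),
    (39517, "ub8_to_kscn_impl+21"),
]

last_memset = max(line for line, func in rows if "memset" in func)
contributors = [func for line, func in rows if line > last_memset]
print(contributors)
```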

Let’s take the next memory region, and see what functions manipulated that region:

38375|ktuchg2+9375|140723831772816|??|W|2|0|??
38376|ktuchg2+9380|140723831772818|??|W|2|0|??
38377|ktuchg2+9385|140723831772820|??|W|2|0|??
38378|ktuchg2+9390|140723831772822|??|W|2|0|??
39506|kcopcv+216|140723831772816|??|W|2|0|??
39510|kcopcv+237|140723831772814|??|W|2|65535|??
39518|ub8_to_kscn_impl+37|140723831772810|??|W|2|0|??
39519|ub8_to_kscn_impl+41|140723831772808|??|W|2|0|??
39523|kcopcv+326|140723831772812|??|W|1|1|??
39526|kcopcv+403|140723831772813|??|W|1|0|??

A little more of the same, ktuchg2/kcopcv/ub8_to_kscn_impl, although in another order.
And the third memory region:

38375|ktuchg2+9375|140723831772816|7ffcd2011690|??|W|2|0|0|??
38376|ktuchg2+9380|140723831772818|7ffcd2011692|??|W|2|0|0|??
38377|ktuchg2+9385|140723831772820|7ffcd2011694|??|W|2|0|0|??
38378|ktuchg2+9390|140723831772822|7ffcd2011696|??|W|2|0|0|??
38383|ktuchg2+9426|140723831772826|7ffcd201169a|??|W|2|32|20|??
39506|kcopcv+216|140723831772816|7ffcd2011690|??|W|2|0|0|??
39510|kcopcv+237|140723831772814|7ffcd201168e|??|W|2|65535|ffff|??
39523|kcopcv+326|140723831772812|7ffcd201168c|??|W|1|1|1|??
39526|kcopcv+403|140723831772813|7ffcd201168d|??|W|1|0|0|??
39895|kcobrh+157|140723831772824|7ffcd2011698|??|W|2|4|4|??

More of the same, although another function appears: ‘kcobrh’. I am not familiar with this function, although my suspicion is that it generates a header. Of course these 3 regions were copied using the same call.

The next memcpy call copies two 16 byte regions:

40641:kcrfw_copy_cv+544:0x00007ffcd200f778():W:8:0x10e3d3e5/283366373()
40642:_intel_fast_memcpy+7:0x00000000157f62a0():R:8:0x137fff/1277951()
40643:__intel_ssse3_rep_memcpy+49:0x00000000137d0a40():R:4:0xf2afc3cc/4071605196(??)
40644:__intel_ssse3_rep_memcpy+9244:0x00007ffcd201174c():R:16:0x120880000000000000/612489549322387456()
40645:__intel_ssse3_rep_memcpy+9249:0x00000000d80cbe88(Redo Buffers(pgsz:2048k)|+835208 redo|PUB_REDO_0+835208 ):W:16:0x120880000000000000/612489549322387456()
40646:__intel_ssse3_rep_memcpy+9254:0x00007ffcd201173c():R:16:0x100069600e810008a1c0/7593084918960792000()
40647:__intel_ssse3_rep_memcpy+9258:0x00000000d80cbe78(Redo Buffers(pgsz:2048k)|+835192 redo|PUB_REDO_0+835192 ):W:16:0x100069600e810008a1c0/7593084918960792000()
40648:__intel_ssse3_rep_memcpy+9262:0x00007ffcd200f778():R:8:0x10e3d3e5/283366373()
40649:kcrfw_copy_cv+569:0x00007ffcd200f890():R:4:0x2/2()

(lines 40644 and 40646)
This memory was written to by different functions. In fact, these are the functions that I suspected to generate the redo information:

37790|ktuchg2+808|140723831773004|??|W|2|2|??
38266|ktugur+1575|140723831773006|??|W|2|136|??
38270|ktugur+1618|140723831773004|??|R|2|2|??
38271|ktugur+1625|140723831773004|??|W|2|18|??
38272|ktugur+1629|140723831773008|??|W|1|0|??
38279|ktugur+1663|140723831773012|??|W|2|0|??
38281|ktugur+1677|140723831773014|??|W|2|0|??
38283|ktugur+1690|140723831773016|??|W|4|0|??
38291|ktugur+1756|140723831773004|??|R|2|18|??
38292|ktugur+1763|140723831773004|??|W|2|18|??
38297|ktugur+1811|140723831773004|??|W|2|18|??

One write of ktuchg2 and then ktugur (kernel transaction undo generate undo and redo).

I went through all of the memory areas that are written to the public redo strand, and found the functions in which redo is generated:
– ktuchg2
– kcopcv
– kcobrh
– ktugur
Other functions that I have found in other regions of memory:
– kdtwrp
– kcosze
– ktugul

There is one other thing I found that I didn’t suspect. During the investigation I used a debugger to stop on certain functions during the insert (in fact, that is how I got the list of functions at the start of this article). When doing that, I tried querying the table I was inserting into, in order to see what happened visibly to the table while the insert was happening.

I found that the kcbchg1_main function performs cache buffers chains latch gets without calling kslgetl/ksl_get_shared_latch(_int). This has been documented in a few places: it uses a macro to perform the latch get. However, when stopping execution (by means of breaking in a debugger), I found that my other session, simply executing a ‘select’ query, hung on the latch.

Also, after the cache buffers chains latch get, the block is getting modified. That’s obvious of course, we are doing an insert. However, during the entire time from kcopcv up to the kcbapl functions, the block is not accessible for any querying, and keeps any other query waiting in the wait event ‘buffer busy wait’. This is a direct violation of the mantra that writers don’t block readers.

Now, before you jump to conclusions, please bear with me a little more. Is this a problem? No, this is the direct consequence of concurrency control, which, in a shared system, always must happen. Certain things can only be done one at a time. In fact, that is the whole purpose of a latch.
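For illustration: a latch in this sense is little more than one word in shared memory taken with an atomic test-and-set. This toy model (not Oracle's implementation) shows why a holder that is taken off CPU blocks every other getter:

```python
import threading

# Toy latch: a single flag taken with an atomic test-and-set (a CAS on a
# word in shared memory in the real thing). While one process holds it,
# every other getter fails and must spin or sleep, so a holder that is
# taken off CPU while holding the latch blocks all of them.

class Latch:
    def __init__(self):
        self._word = threading.Lock()   # stands in for the latch word
    def try_get(self):
        return self._word.acquire(blocking=False)  # test-and-set
    def release(self):
        self._word.release()

cbc_latch = Latch()
assert cbc_latch.try_get()        # the changing session takes the latch
assert not cbc_latch.try_get()    # a reader's get fails while it is held
cbc_latch.release()               # the holder runs again and frees it
assert cbc_latch.try_get()        # only now can the reader proceed
```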

However, I do want to make you think about it. These concurrency control waits simply must happen in order not to disturb changes being made to a block, or to avoid reading an inconsistent version of a block. This has probably worked much the same way in Oracle for a long time, and it has not been a problem. Generally. Where this does become a problem is when your processes randomly sleep for longer periods of time: if your inserting process stalls during execution of the kcrfw_redo_gen_ext function, other processes will get a buffer busy wait.

So when does this happen? Why am I saying this? Well, these random waits of processes are something that actually happens when your system is overloaded on CPU, or when your system is actively swapping. Now think about this in a broader sense. Whenever you run virtualised, and you do not have strict policies enforced on CPU and memory allocation, and especially if there is oversubscription of CPU and/or memory, these random waits can be imposed by the hypervisor. So, never -ever- let your database server run short on CPU or memory, because the database will start showing very peculiar waits, which could put you on the wrong foot very easily.

Conclusion
I actually tried putting three distinct messages in this blogpost. The first one is to show how a change (an insert) is done internally. This shows that Oracle is built up in layers, and a lot of these layers are crossed even when you do the simplest of things, like inserting a single row into a table. Yes, this is not simple, and maybe even plain confusing. That’s not something I can change. However, I do think that this information is crucial in order to understand how the database works, and thus how to troubleshoot issues and understand what happens.

When a change is made to a block, first the redo is generated, then the undo is written, and then a row is inserted into the data block.

The second message is research on where the redo change vectors are generated. This is background information that will not help with anything concrete; it’s just background information for improving the understanding of how Oracle works. The research shows that redo is generated in a couple of functions:
– ktuchg2
– kcopcv
– kcobrh
– ktugur
– kdtwrp
– kcosze
– ktugul

At certain points, Oracle serialises access to blocks in order to make sure changes are not disturbed, and other processes do not read an inconsistent version of a block. This is technical locking, not the logical transaction locking that Oracle provides using enqueues. This is the third piece of information that I put in this article. This is directly usable and concrete information, and this is crucial, especially in today’s world of using virtualisation on premises or in the cloud.

If processes get stuck during these times of serialisation, all kinds of waits become visible, which might look like other problems; my first thought with buffer busy waits is a block being read into the buffer cache from disk or being written from the buffer cache onto disk, not being modified in the cache. In fact, if you start inserting into a freshly truncated table, I found I could get TX lock waits while doing a select. Again: this is not obvious.

So the last paragraph warns to make absolutely sure your database server is not oversubscribed on any resource, especially CPU and memory, because it can show itself in unpredictable ways, making it harder to diagnose, and thus to solve.

This is the seventh part of a blog series about oracle redo.

Adaptive log file sync is a feature that probably came with Oracle version 11.2. 'Probably' means that I looked at the undocumented parameters of Oracle version 11.1 and did not see any of the ‘_adaptive_log_file_sync*’ parameters. It was actually turned off by default in versions 11.2.0.1 and 11.2.0.2, and has been turned on by default since version 11.2.0.3.

The adaptive log file sync feature means the logwriter can make the committing sessions switch from the traditional post/wait mechanism for redo write notification to a mechanism where the committing session picks a time to sleep, after which it checks the redo writing progress itself. The reasons I can see for doing so are to reduce the amount of work the logwriter needs to do, because semop’ing a semaphore can only be done serially, and to prevent the logwriter process from getting scheduled off CPU because the processes it is activating via semop get higher priorities than the logwriter.

An important thing to realise is that the adaptive log file sync mechanism can potentially change only a single communication event in the database: when a session needs to wait for its redo to be written to disk by the logwriter after a commit. Nothing else is changed, and all other post/wait inter-process communication is done exactly the same way.

Is this feature a good idea? In part 6 of my Oracle redo series I have shown that a process checks the on-disk SCN even if the log file sync method is post/wait, but it needs to be posted before it will continue. I have seen cases where huge numbers of foreground processes were doing lots of commits per second; especially in these cases I can see how the logwriter can greatly benefit from not having to do a lot of semop calls after writing, and thus spend its time on what the logwriter actually should be doing, which is writing. I have not seen any great benefit in databases with a low redo writing and commit rate.

But how does this feature work? In order to investigate that I first executed a pinatrace on a sysdba session and executed:

SQL> alter system set "_use_adaptive_log_file_sync" = false scope = memory;

Then used my pinatrace annotate script to annotate the pinatrace output, and simply grepped for ‘kcrf’; I highly suspect the adaptive log file sync feature to be a ‘kernel cache redo file’ (kcrf) based feature:

$ grep kcrf pinatrace_17029_ann.txt
11452:kcrf_sync_adaptive_set+2:0x00007ffef43aecd0():W:8:0x7ffef43af1b0/140732995924400()
11453:kcrf_sync_adaptive_set+10:0x00007ffef43aecb0():W:8:0x7f703fa2ef40/140120080707392()
11454:kcrf_sync_adaptive_set+17:0x00007ffef43aecb8():W:8:0x115b3da0/291192224()
11455:kcrf_sync_adaptive_set+37:0x00007ffef43aecc0():W:8:0x7ffef43aefe8/140732995923944()
11456:kcrf_sync_adaptive_set+44:0x00007ffef43aecc8():W:8:0x7f703f9ec9a0/140120080435616()
11457:kcrf_sync_adaptive_set+48:0x00007ffef43aeca8():W:8:0xcf7c7f5/217565173()
11506:kcrf_sync_adaptive_set+81:0x00007ffef43aeca8():W:8:0xcf7c816/217565206()
11555:kcrf_sync_adaptive_set+100:0x0000000060016a34(Fixed Size(pgsz:2048k)|+92724 fixed sga|var:kcrf_sync_sleep_usecs_+0 shared pool|(indx:0)X$KSXRSG+49636 ):R:4:0/0()
11556:kcrf_sync_adaptive_set+161:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):W:4:0/0()
11557:kcrf_sync_adaptive_set+202:0x0000000060016a40(Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 ):W:4:0/0()
11558:kcrf_sync_adaptive_set+352:0x00007ffef43aecc8():R:8:0x7f703f9ec9a0/140120080435616()
11559:kcrf_sync_adaptive_set+356:0x00007ffef43aecc0():R:8:0x7ffef43aefe8/140732995923944()
11560:kcrf_sync_adaptive_set+360:0x00007ffef43aecb8():R:8:0x115b3da0/291192224()
11561:kcrf_sync_adaptive_set+364:0x00007ffef43aecb0():R:8:0x7f703fa2ef40/140120080707392()
11562:kcrf_sync_adaptive_set+371:0x00007ffef43aecd0():R:8:0x7ffef43af1b0/140732995924400()
11563:kcrf_sync_adaptive_set+372:0x00007ffef43aecd8():R:8:0x1034e7e/16993918()
77244:kzam_check_limit+342:0x0000000060016398(Fixed Size(pgsz:2048k)|+91032 fixed sga|var:kcrfsg_+368 shared pool|(indx:0)X$KSXRSG+47944 ):R:4:0xd3/211()

If the feature is set/changed (the database was in adaptive log file sync mode, because “_use_adaptive_log_file_sync” was not set), it must write to a shared memory location for the other processes to be able to see the setting. That means two things: 1) any non-shared memory location can not be relevant to the setting, so we can exclude those; 2) in order to change something, there needs to be a WRITE to set it.

To make the investigation easier to do, I added an option to the annotate script to write the pinatrace_annotate output to a sqlite3 database. In order to do this, use the ‘-d’ (database) switch to the pinatrace_annotate.sh script.

First we need to know the max SGA address of the instance, this can be found in the memory_ranges.db sqlite database, which is created when the memory details are prepared using ‘./pinatrace_annotate.sh -c’:

$ sqlite3 memory_ranges.db
SQLite version 3.21.0 2017-10-24 18:55:49
Enter ".help" for usage hints.
sqlite> select printf('%x',max(end_address)) from memory_ranges;
e0000000

(please mind you need an sqlite3 version higher than 3.7 (which comes with Oracle Linux) to use printf. I downloaded the latest version from https://www.sqlite.org/download.html)

Okay so anything higher than 0xe0000000 is not in the SGA. Now use the database that was created using the ‘-d’ switch when I annotated the pinatrace output of ‘alter system set “_use_adaptive_log_file_sync” = false scope = memory’:

$ sqlite3 pinatrace_17029.txt.db
SQLite version 3.21.0 2017-10-24 18:55:49
Enter ".help" for usage hints.
sqlite> select line, function_offset, printf('%x',memory_address), memory_annotation, read_write, size, value, value_annotation from annotation where memory_annotation like '%kcrf%' and memory_address <= printf('%d',0xe0000000) and read_write = 'W';
11556|kcrf_sync_adaptive_set+161|60016a38|Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 |W|4|0|??
11557|kcrf_sync_adaptive_set+202|60016a40|Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 |W|4|0|??

This means that when I execute the alter system command, something is written into the fixed SGA, into a variable called ‘kcrf_alfs_info_’. I am pretty sure ‘alfs’ means adaptive log file sync here (and not many occasions of Gordon Shumway). There is a zero written in a 4 byte memory location at kcrf_alfs_info_+0, and there’s a zero written at kcrf_alfs_info_+8.

Now let’s repeat this, but with the pinatrace output when executing ‘alter system set “_use_adaptive_log_file_sync” = true scope = memory’:

$ sqlite3 pinatrace_8204.txt.db
SQLite version 3.21.0 2017-10-24 18:55:49
Enter ".help" for usage hints.
sqlite> select line, function_offset, printf('%x',memory_address), memory_annotation, read_write, size, value, value_annotation from annotation where memory_annotation like '%kcrf%' and memory_address <= printf('%d',0xe0000000) and read_write = 'W';
82476|kcrf_sync_adaptive_set+161|60016a38|Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 |W|4|1|??
82477|kcrf_sync_adaptive_set+202|60016a40|Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 |W|4|0|??

Here the same thing is happening as previously, when it was set to false, with the difference that at kcrf_alfs_info_+0, there is a one written.

Now let’s repeat this again, but with the pinatrace output when executing ‘alter system set “_use_adaptive_log_file_sync” = polling_only scope = memory’. This is the third value that is allowed for “_use_adaptive_log_file_sync”. In case you are wondering how I knew this is an allowed parameter value: there’s a script in Tanel Poder’s script bundle that can list them, called pvalid.sql:

SQL> @pvalid adaptive_log_file_sync
Display valid values for multioption parameters matching "adaptive_log_file_sync"...
old  11:   LOWER(NAME_KSPVLD_VALUES) LIKE LOWER('%&1%')
new  11:   LOWER(NAME_KSPVLD_VALUES) LIKE LOWER('%adaptive_log_file_sync%')

  PAR# PARAMETER                                                 ORD VALUE                          DEFAULT
------ -------------------------------------------------- ---------- ------------------------------ -------
  1840 _use_adaptive_log_file_sync                                 1 TRUE                           DEFAULT
       _use_adaptive_log_file_sync                                 2 FALSE
       _use_adaptive_log_file_sync                                 3 POLLING_ONLY

Here is the memory trace of setting “_use_adaptive_log_file_sync” to polling only:

$ sqlite3 pinatrace_25469.txt.db
SQLite version 3.21.0 2017-10-24 18:55:49
Enter ".help" for usage hints.
sqlite> select line, function_offset, printf('%x',memory_address), memory_annotation, read_write, size, value, value_annotation from annotation where memory_annotation like '%kcrf%' and memory_address <= printf('%d',0xe0000000) and read_write = 'W';
94687|kcrf_sync_adaptive_set+161|60016a38|Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 |W|4|1|??
94688|kcrf_sync_adaptive_set+175|60016a40|Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 |W|4|1|??
94689|kcrf_sync_adaptive_set+185|60016a78|Fixed Size(pgsz:2048k)|+92792 fixed sga|var:kcrf_alfs_info_+64 shared pool|(indx:0)X$KSXRSG+49704 |W|4|0|??
94690|kcrf_sync_adaptive_set+191|60016a7c|Fixed Size(pgsz:2048k)|+92796 fixed sga|var:kcrf_alfs_info_+68 shared pool|(indx:0)X$KSXRSG+49708 |W|4|0|??

That’s interesting! Now the values set at offsets 0 and 8 are both one, and there are two additional offsets written to, 64 and 68, to which a zero is written.

It is my interpretation that zero at kcrf_alfs_info_+0 means the adaptive log file sync feature is turned off. In a database starting from version 11.2.0.3 or higher, this can only be the case if the parameter “_use_adaptive_log_file_sync” has been explicitly set to false.

This is a foreground session in a database where I have set “_use_adaptive_log_file_sync” to false, and performed an insert and commit:

$ grep kcrf_alfs_info_\+[08] insert_ann.txt
72891:kcrf_commit_force_int+268:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0/0()
81847:kcrf_commit_force_int+268:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0/0()
81929:kcrf_commit_force_int+602:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0/0()
81998:kcrf_commit_force_int+1503:0x0000000060016a90(Fixed Size(pgsz:2048k)|+92816 fixed sga|var:kcrf_alfs_info_+88 shared pool|(indx:0)X$KSXRSG+49728 ):R:4:0x3e8/1000()

And this is a foreground session in a database where I removed the parameter “_use_adaptive_log_file_sync”, which means the parameter defaults to true, and executed an insert and commit:

$ grep kcrf_alfs_info_\+[08] pinatrace_2444_ann.txt
71189:kcrf_commit_force_int+268:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0x1/1()
80148:kcrf_commit_force_int+268:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0x1/1()
80230:kcrf_commit_force_int+602:0x0000000060016a38(Fixed Size(pgsz:2048k)|+92728 fixed sga|var:kcrf_alfs_info_+0 shared pool|(indx:0)X$KSXRSG+49640 ):R:4:0x1/1()
80231:kcrf_commit_force_int+615:0x0000000060016a40(Fixed Size(pgsz:2048k)|+92736 fixed sga|var:kcrf_alfs_info_+8 shared pool|(indx:0)X$KSXRSG+49648 ):R:4:0/0()
80254:kcrf_commit_force_int+844:0x0000000060016a90(Fixed Size(pgsz:2048k)|+92816 fixed sga|var:kcrf_alfs_info_+88 shared pool|(indx:0)X$KSXRSG+49728 ):R:4:0x550/1360()

As you can see at row 80230 of the memory trace, the kcrf_commit_force_int function reads kcrf_alfs_info_+0 and finds a one, indicating the adaptive log file sync feature is turned on. As a result it needs to find out whether it’s set to post/wait or polling, which is done by looking at kcrf_alfs_info_+8 (row 80231), where it finds a zero. The kcrf_commit_force_int execution here is the second stage of the invocation of kcrf_commit_force_int, which is with 1 as second argument, which is when the function performs a wait for the logwriter to write its redo. One way of detecting this usage of kcrf_commit_force_int is by looking at the functions that lead to this invocation of kcrf_commit_force_int (ksupop, kcbdsy), because in the pinatrace the function arguments are not visible.

The zero at kcrf_alfs_info_+8 probably means that the adaptive feature still uses post/wait to wait for the logwriter. This is reasonably easy to verify by checking whether we see the functions that are associated with the post/wait interface, of which a few obvious ones are kslawe, for setting up the post/wait entry, and a kslgetl/kslfre combination that obtains and frees the ‘post/wait queue’ latch:

79629:kslawe+381:0x0000000060006a60(fixed sga|var:kslpwqs_+0 ):R:8:0x6dd76db0/1842834864(Variable Size(pgsz:2048k)|+219639216 shared pool|(child)latch#1:post/wait queue+0 shared pool|permanent memor,duration 1,cls perm+1322416 )
79630:kslawe+388:0x00007ffcb79aecf0():W:8:0x6dd851b0/1842893232(shared pool|(child)latch#305:post/wait queue+0 shared pool|permanent memor,duration 1,cls perm+1380784 )
79631:kslawe+392:0x00007ffcb79aecd8():W:8:0x10a69edd/279355101()
79632:kslgetl+2:0x00007ffcb79aecd0():W:8:0x7ffcb79aed20/140723388869920()
79633:kslgetl+13:0x00007ffcb79aecb0():W:8:0x132/306()
79634:kslgetl+17:0x00007f303824ff40(??):R:8:0x7f303824ff40/139845077106496(??)
79635:kslgetl+26:0x00007ffcb79aeca8():W:8:0x21/33()
79636:kslgetl+33:0x00007ffcb79aeca0():W:8:0x130/304()
79637:kslgetl+37:0x00007ffcb79aecb8():W:8:0x130/304()
79638:kslgetl+44:0x00007f3038241e78(??):R:8:0x6dde6f10/1843293968(shared pool|(indx:71)X$KSUPR+0 shared pool|permanent memor,duration 1,cls perm+1781520 )
79639:kslgetl+58:0x00007ffcb79aecc0():W:8:0x131/305()
79640:kslgetl+65:0x00007ffcb79aec70():W:8:0x130/304()
79641:kslgetl+69:0x000000006dd851bf(shared pool|(child)latch#305:post/wait queue+15 shared pool|permanent memor,duration 1,cls perm+1380799 ):R:1:0x37/55()
79642:kslgetl+80:0x000000006dde7268(shared pool|(indx:71)X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+1782376 ):R:8:0/0()

Above we see a snippet of the kslawe function reading the start address of the child latches from the fixed SGA variable kslpwqs_, after which kslgetl is called for post/wait queue child latch number 305.

From the previous article we know that a committing session checks the on disk SCN in the kcrfsg_ struct in order to understand if the redo that it put into the public redo strand has been written or not. If the logwriter did not increase the on disk SCN beyond or equal to the commit SCN yet, the committing session loops executing the following functions:
– kslawe: registering itself in the post-wait queue
– kcscu8: check on disk SCN
– kslwaitctx: main waiting routine, semtimedop: wait to be posted for 100ms
– (if not posted) ksliwat_cleanup: remove post-wait entry
– (if not posted) kcscu8: check on disk SCN

Of course when the instance is in adaptive log file sync mode (which any normal database should be nowadays!), it chooses between post-wait and polling. When post-wait is chosen, the above function sequence is valid; this is independent of whether the instance is dedicated to post-wait mode or is in post-wait mode because the adaptive log file sync mechanism chose it.

What does this look like when the database is in polling mode?
– kcscu8: check on disk SCN
– sltrusleep, nanosleep: wait for a certain amount of time

Yes, it is that simple: all it does is check the on disk SCN, and if it has not progressed to or beyond the session’s commit SCN, it sleeps in nanosleep.

The time sleeping in nanosleep, as far as I have seen, has changed: prior to Oracle version 12.2, the time was calculated dynamically, probably directly or indirectly from the IO time of the logwriter. As far as I can see, in version 12.2 the sleeping time changed to a fixed 1000ns. This sleeping time was the topic of my previous article, because a 1000ns/1us sleep seems to be less than Linux can actually sleep.

While investigating how Oracle works with regard to waiting, I discovered an oddity. I was researching for my redo blog posts, and found that in a certain case Oracle uses the ‘nanosleep’ system call. As the name of this system call suggests, it allows you to let a process sleep with nanosecond precision.

The oddity that I found, was the following:

$ strace -Tp 42
...
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
...

I executed strace with the ‘-T’ argument (show time spent in system calls), and ‘-p’ to attach it to the process I want to measure. The nanosleep function was called with struct timespec set to {0, 1000}, which means 0 seconds and 1000 nanoseconds. However, the actual time spent was between 0.000191 and 0.000253 seconds, which is 191000 and 253000 nanoseconds. That’s approximately 200 times longer than it’s set to wait!

Since then I have spent a fair amount of time trying to understand why this is happening. So far I have found a number of facts that I think explain it, but if anyone else has something to throw in here, please leave a comment.

The first thing I did was create a C program that just runs nanosleep in a loop, so I can run the nanosleep systemcall in isolation. Here’s the nanosleep PoC code (measure_time.c):

#include <stdio.h>
#include <time.h>
int main(int argc, char **argv)
{
  int       loop_counter, loop_total;
  struct    timespec sleep;

  sleep.tv_sec=0;
  sleep.tv_nsec=1000;

  loop_total=10;
  loop_counter=0;

  while ( loop_counter < loop_total ) {
    nanosleep(&sleep,NULL);
    loop_counter++;
  }
}

Then, while researching, I found that there is a system call that shows the actual clock resolution! I created another very small C program to run just that (measure_resolution.c):

#include <stdio.h>
#include <time.h>
int main(int argc, char **argv)
{
  int       result;
  struct    timespec resolution;
  clockid_t clk_id;
  clk_id = CLOCK_MONOTONIC;

  result = clock_getres(clk_id, &resolution);

  printf("Resolution: %ld s, %ld ns\n", resolution.tv_sec, resolution.tv_nsec);
}

This C program can be compiled using ‘gcc measure_resolution.c -o measure_resolution’. This is what it showed:

$ ./measure_resolution
Resolution: 0 s, 1 ns

So, my system has a 1 ns resolution, despite my nanosleep system calls taking way longer. A little later I found out this information can also be obtained directly from /proc/timer_list:

$ grep resolution /proc/timer_list | uniq
  .resolution: 1 nsecs

The first thing I found while researching is that when I change the priority of the process (not the 1000ns), I can get lower sleeping times:

# strace -T chrt --rr 99 ./measure_time
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

The lowest time here is 0.036ms (36000ns). In order to look further, I found that newer kernel versions have an addition to perf that can measure wait time, scheduler delay (‘run queue’) and actual run time: perf sched timehist. So at this point it seemed there was a way to understand more about the timing of nanosleep. In order to look at that I created a virtual machine with fedora 27 (kernel version 4.14.13), and compiled my measure_time.c program on it.

The next thing to do is run the measure_time executable with perf sched record:

# perf sched record chrt --rr 99 ./measure_time
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.183 MB perf.data (54 samples) ]

Then run perf sched timehist to look at the data and find the pid of the measure_time executable:

# perf sched timehist
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.760817 [0000]  <idle>                              0.000      0.000      0.000
   37745.760823 [0000]  rcu_sched[8]                        0.000      0.000      0.005
   37745.762367 [0002]  perf[2777]                          0.000      0.000      0.000
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.762431 [0002]  migration/2[21]                     0.000      0.001      0.011
   37745.762479 [0003]  <idle>                              0.000      0.000      0.000
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764108 [0003]  <idle>                              1.583      0.000      0.045
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005

So it’s pid 2778, now add that to perf sched timehist:

# perf sched timehist -p 2778
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
   37745.762420 [0002]  perf[2778]                          0.000      0.044      0.052
   37745.764063 [0003]  measure_time[2778]                  0.059      0.000      1.583
   37745.764114 [0003]  measure_time[2778]                  0.045      0.003      0.005
   37745.764153 [0003]  measure_time[2778]                  0.034      0.002      0.004
   37745.764195 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764236 [0003]  measure_time[2778]                  0.036      0.002      0.004
   37745.764291 [0003]  measure_time[2778]                  0.050      0.002      0.004
   37745.764347 [0003]  measure_time[2778]                  0.051      0.002      0.004
   37745.764405 [0003]  measure_time[2778]                  0.052      0.002      0.004
   37745.764478 [0003]  measure_time[2778]                  0.067      0.006      0.005
   37745.764506 [0003]  measure_time[2778]                  0.022      0.002      0.005
   37745.764581 [0003]  measure_time[2778]                  0.022      0.002      0.052

The way to look at this: first perf is run. Because it just started running, there is no wait time; the task gets the state running, then it needs to be scheduled to run. That time is in ‘sch delay’, obviously scheduling delay. Then it truly is running, and runs for some time, which is visible in the ‘run time’ column. Next, the measure_time executable is run.

The first time measure_time runs, it needs to do some initialisation, like the dynamic loader loading libc, allocating memory (mmap calls), etcetera. This is visible in the run time of 1.583 ms. However, then it runs into the nanosleep call. The nanosleep call puts the process into an interruptible sleep. The sleep is visible in the wait time column. Here we see it’s waiting/sleeping for between 0.022 and 0.067 ms. After waiting, the process gets the status running, after which the scheduler must schedule the task and make it running. That time is in scheduling delay again. Then it runs, but all it needs to do is end the nanosleep call, add one to the loop_counter variable, jump back and enter the nanosleep call again, which, as we can see, only takes 0.004 ms.

Okay, so far it was quite a geeking-out session, but no real answer has been found to why a 1000ns sleep turns out way higher. I sought help via social media, and got some sane advice from Kevin Closson. Essentially, the advice was to run it on physical hardware instead of my test virtual machine, and make the load as low as possible so the scheduler can schedule my process as quickly as it can. Yes, very obvious actually.

So, the next target is a physical machine in our Enkitec lab. This is an X3 generation machine (Sandy Bridge EP). I compiled measure_time.c, and ran ‘strace -T chrt --rr 99 ./measure_time’ again:

nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 
nanosleep({0, 1000}, NULL)              = 0 

That’s way better! But still the nanosleep timing is approximately 13000ns, instead of the 1000ns the call is supposed to take.

At this point I had no clue, and obviously needed to rethink it again; the time the call is supposed to sleep and the wallclock time I measured are still quite far apart. On one hand you could argue that the requested wait time is so small that the difference could simply be caused by the code in the nanosleep function itself; on the other hand, it’s not even close.

Then I got a eureka moment: strace uses ptrace facilities to attach to the process, and ptrace is known to slow processes down quite severely. Hat tip to Martin Bach, who contacted me about this at the very same moment I realised it. So then the question becomes: can I measure nanosleep in another way?

Then I thought about the linux kernel ftrace facilities. In fact, the physical machines at my disposal in the Enkitec lab all run Exadata images, and the Exadata images do not contain bleeding edge kernels like fedora 27 has, so I couldn’t repeat the perf sched timehist tracing. I wrote about ftrace in the past.

So, let’s redo the timing investigation again using ftrace:
1. Make sure the debug filesystem is mounted:

# mount /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug

2. Enter the trace directory in /sys/kernel/debug:

# cd /sys/kernel/debug/tracing

3. Create another session on the same machine and obtain the process id:

# echo $$
62733

4. In the debug/tracing session, setup nanosleep system call tracing:
Enter the process id as the pid to trace:

# echo 62733 > set_ftrace_pid

Set the type of trace. The default ‘tracer’ is nop (no tracing). Set it to ‘function_graph’:

# echo function_graph > current_tracer

If there is an option ‘(no)function-fork’ in trace_options, set it to ‘function-fork’ to make ftrace trace child processes too (function-fork is an addition in recent kernels):

# egrep ^nofunction-fork trace_options && echo function-fork > trace_options

Set the function to trace. For the nanosleep system call, this is SyS_nanosleep:

# echo SyS_nanosleep > set_ftrace_filter

Now enable tracing:

# echo 1 > tracing_on

5. In the session to trace, execute chrt --rr 99 ./measure_time.
6. In the debug/tracing session, stop the trace and look at the results:

# echo 0 > tracing_on
# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 20)   8.958 us    |  SyS_nanosleep();
 20)   5.077 us    |  SyS_nanosleep();
 20)   4.260 us    |  SyS_nanosleep();
 20)   9.384 us    |  SyS_nanosleep();
 20)   5.495 us    |  SyS_nanosleep();
 20)   5.526 us    |  SyS_nanosleep();
 20)   5.039 us    |  SyS_nanosleep();
 20)   4.936 us    |  SyS_nanosleep();
 20)   4.898 us    |  SyS_nanosleep();
 20)   4.889 us    |  SyS_nanosleep();

This shows the waiting time still fluctuating, but going down to as low as 4.260 us, or 4260 nanoseconds. This still is roughly four times the sleeping time set (1000ns), but it got way lower than earlier! Probably the tracing increased the latency a bit, so my guess would be that the true sleeping time when no trace is applied is around 2000ns. Please mind this is with changed (increased) process priorities (chrt --rr 99); when measure_time is run without any priorities set, it looks like this:

# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 21) + 60.583 us   |  SyS_nanosleep();
 21) + 56.504 us   |  SyS_nanosleep();
 21) + 55.940 us   |  SyS_nanosleep();
 21) + 56.118 us   |  SyS_nanosleep();
 21) + 56.076 us   |  SyS_nanosleep();
 21) + 56.078 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 55.745 us   |  SyS_nanosleep();
 21) + 56.100 us   |  SyS_nanosleep();
 21) + 56.868 us   |  SyS_nanosleep();

But: there is Oracle grid infrastructure running, and multiple databases are running.

What would be interesting to see is how nanosleep is actually executed. Ftrace provides a way to do exactly that! The important thing to keep in mind is that ftrace is kernel only. In order to see how SyS_nanosleep is executed, do the following:

# > set_ftrace_filter
# echo SyS_nanosleep > set_graph_function
# > trace

Then execute ‘chrt --rr 99 ./measure_time’ again, and look in trace again:
(please mind I picked the second occurrence of SyS_nanosleep, of course the trace shows them all)

 14)               |  SyS_nanosleep() {
 14)               |    hrtimer_nanosleep() {
 14)   0.045 us    |      hrtimer_init();
 14)               |      do_nanosleep() {
 14)               |        hrtimer_start_range_ns() {
 14)               |          __hrtimer_start_range_ns() {
 14)               |            lock_hrtimer_base() {
 14)   0.039 us    |              _raw_spin_lock_irqsave();
 14)   0.527 us    |            } /* lock_hrtimer_base */
 14)   0.048 us    |            ktime_get();
 14)   0.043 us    |            get_nohz_timer_target();
 14)   0.079 us    |            enqueue_hrtimer();
 14)               |            tick_program_event() {
 14)               |              clockevents_program_event() {
 14)   0.048 us    |                ktime_get();
 14)   0.446 us    |              } /* clockevents_program_event */
 14)   0.876 us    |            } /* tick_program_event */
 14)   0.040 us    |            _raw_spin_unlock();
 14)   0.040 us    |            __raise_softirq_irqoff();
 14)   4.318 us    |          } /* __hrtimer_start_range_ns */
 14)   4.792 us    |        } /* hrtimer_start_range_ns */
 14)               |        schedule() {
 14)               |          __schedule() {
 14)   0.041 us    |            rcu_note_context_switch();
 14)   0.039 us    |            _raw_spin_lock_irq();
 14)               |            deactivate_task() {
 14)               |              dequeue_task() {
 14)   0.056 us    |                update_rq_clock();
 14)               |                dequeue_task_rt() {
 14)               |                  update_curr_rt() {
 14)   0.057 us    |                    cpuacct_charge();
 14)   0.053 us    |                    sched_avg_update();
 14)   0.046 us    |                    _raw_spin_lock();
 14)               |                    sched_rt_runtime_exceeded() {
 14)   0.041 us    |                      balance_runtime();
 14)   0.402 us    |                    } /* sched_rt_runtime_exceeded */
 14)   0.046 us    |                    _raw_spin_unlock();
 14)   2.701 us    |                  } /* update_curr_rt */
 14)               |                  dequeue_rt_entity() {
 14)               |                    dequeue_rt_stack() {
 14)               |                      __dequeue_rt_entity() {
 14)   0.062 us    |                        cpupri_set();
 14)   0.057 us    |                        update_rt_migration();
 14)   0.861 us    |                      } /* __dequeue_rt_entity */
 14)   1.258 us    |                    } /* dequeue_rt_stack */
 14)   0.047 us    |                    enqueue_top_rt_rq();
 14)   1.908 us    |                  } /* dequeue_rt_entity */
 14)   0.048 us    |                  dequeue_pushable_task();
 14)   5.564 us    |                } /* dequeue_task_rt */
 14)   6.445 us    |              } /* dequeue_task */
 14)   6.789 us    |            } /* deactivate_task */
 14)   0.044 us    |            pick_next_task_stop();
 14)   0.041 us    |            pick_next_task_dl();
 14)               |            pick_next_task_rt() {
 14)   0.042 us    |              pull_rt_task();
 14)   0.050 us    |              update_curr_rt();
 14)   0.823 us    |            } /* pick_next_task_rt */
 14)               |            pick_next_task_fair() {
 14)               |              put_prev_task_rt() {
 14)   0.042 us    |                update_curr_rt();
 14)   0.377 us    |              } /* put_prev_task_rt */
 14)               |              pick_next_entity() {
 14)   0.039 us    |                clear_buddies();
 14)   0.340 us    |              } /* pick_next_entity */
 14)               |              set_next_entity() {
 14)   0.065 us    |                __dequeue_entity();
 14)   0.426 us    |              } /* set_next_entity */
 14)   2.159 us    |            } /* pick_next_task_fair */
 14)   0.052 us    |            finish_task_switch();
 14) + 15.656 us   |          } /* __schedule */
 14) + 16.003 us   |        } /* schedule */
 14)   0.039 us    |        _cond_resched();
 14)               |        hrtimer_cancel() {
 14)               |          hrtimer_try_to_cancel() {
 14)               |            lock_hrtimer_base() {
 14)   0.040 us    |              _raw_spin_lock_irqsave();
 14)   0.365 us    |            } /* lock_hrtimer_base */
 14)   0.046 us    |            _raw_spin_unlock_irqrestore();
 14)   1.006 us    |          } /* hrtimer_try_to_cancel */
 14)   1.403 us    |        } /* hrtimer_cancel */
 14) + 23.559 us   |      } /* do_nanosleep */
 14) + 24.444 us   |    } /* hrtimer_nanosleep */
 14) + 24.842 us   |  } /* SyS_nanosleep */

This gives a full overview of the functions that SyS_nanosleep is executing in the kernel, including timing.
Essentially, everything happens in ‘do_nanosleep’. Inside do_nanosleep, the following functions are directly called:
– hrtimer_start_range_ns -> 4.792 us
– schedule -> 16.003 us
– _cond_resched -> 0.039 us
– hrtimer_cancel -> 1.403 us
This adds up to more than the 4260 ns we saw earlier; it's clear the tracing influences the latency of the execution.

In order to get a breakdown on where the time goes for the 4260 ns, remove SyS_nanosleep from set_graph_function, empty the trace and add the four functions mentioned above to set_ftrace_filter:

# > set_graph_function
# > trace
# echo 'hrtimer_start_range_ns schedule _cond_resched hrtimer_cancel' > set_ftrace_filter

Now execute ‘chrt --rr 99 ./measure_time’ again, and look in trace; this is a snippet that shows a single nanosleep invocation:

...
  2)   0.938 us    |  hrtimer_start_range_ns();
  2)   3.406 us    |  schedule();
  2)   0.077 us    |  _cond_resched();
  2)   0.123 us    |  hrtimer_cancel();
...

What this shows is that the schedule function takes the majority of the time. There will be some influence from ftrace being active, but taking that into account, on this system the time to get rescheduled after waiting for a short time is around 3 to 4 us (microseconds).

Conclusion
There is a certain point at which timing is so fine grained that anything you do, even running a (kernel) function itself, distorts time. It looks like this point lies at 1us (microsecond) granularity for the systems I've looked at.

Obviously, when you run on a virtualised platform, as I did with a ‘desktop’ virtualisation product like Virtualbox (but this applies to any virtualisation), scheduling times increase, because the virtualisation needs to abstract running on a CPU. With modern virtualisation, user-mode code runs with minimal or no overhead. However, kernel mode access needs to be abstracted in order to keep virtual machines from invading each other's resources. This can become significant if resources are oversubscribed (we got a glimpse of that in this article with the first nanosleep measurements at roughly 220 us).

The nanosleep times got lower when the process priority was set as high as possible. Then, when moving onto physical hardware, the latency times got even lower. The priority of the process strongly influences the scheduling time, which is logical, because that's the entire purpose of priority.

The linux strace utility influences process execution latency quite significantly when looking at microsecond granularity. Ftrace provides a very flexible way to trace with far less overhead than strace, although it still imposes some overhead on the process it is tracing.

Using ftrace, the minimal time that I could measure for a process to get scheduled again after sleeping was between 3 and 4 microseconds. This latency is influenced by the tracing, so it probably comes down to between 2 and 3 microseconds without it. It is also strongly influenced by the priority: I needed to set the highest priority in order to get the low latency. Another big influencer will be the hardware I am running on, and there is a fair number of processes running, some at high priority (clusterware ocssd, Oracle database lms processes, linux daemons).

This is the sixth part in a blog series about Oracle database redo. The previous posts provided information about the log writer writing; this post is about the process that, after issuing a commit, waits for the log writer to write its redo from the public redo strand. When the database is using post/wait for process commits, the committing process follows this (simplified) procedure:

kcrfw_redo_gen_ext: 
 kcsgbsn8: kcsnew8: commit SCN is fetched from kcsgscn_
 kslgetl: redo copy latch is taken
 kslgetl: redo allocation latch is taken
 : allocation annotations are made for the commit record in kcrfsg_ and KCRFA
 kslfre: redo allocation latch is freed
 kcrfw_copy_cv : commit record is copied into a public redo strand.
 kslfre: redo copy latch is freed
 kcrf_commit_force_int: second argument 0: post logwriter if not already posted

These are the actual function calls inside kcrf_commit_force_int with second argument 0:

| | | | | | | | | | | | | | | | > kcrf_commit_force_int(0x7fffbea0de50, 0, ...)
| | | | | | | | | | | | | | | | | > kcscu8(0x60016290, 0x7fffbea0dc78, ...)
| | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
| | | | | | | | | | | | | | | | | > kcscu8(0x60016260, 0x7fffbea0dbc0, ...)
| | | | | | | | | | | | | | | | | < kcscu8+0x000000000047 returns: 0x1
| | | | | | | | | | | | | | | | | > ksbasend(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | > ksbpro(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | < ksbpro+0x000000000063 returns: 0x7a6032e8
| | | | | | | | | | | | | | | | | | > ksbpser(0x600161a0, 0x60016908, ...)
| | | | | | | | | | | | | | | | | | < ksbpser+0x000000000046 returns: 0x1
| | | | | | | | | | | | | | | | | | > ksberr(0x600161a0, 0, ...)
| | | | | | | | | | | | | | | | | | < ksberr+0x000000000045 returns: 0
| | | | | | | | | | | | | | | | | | > ksasnd(0x7a6032e8, 0x1, ...)
| | | | | | | | | | | | | | | | | | | > ksasndint(0x7a6032e8, 0x1, ...)
| | | | | | | | | | | | | | | | | | | | > ksaamb(0x7a63a0b8, 0x7a5d9fe0, ...)
| | | | | | | | | | | | | | | | | | | | | > kslgetl(0x6003ef40, 0x1, ...)
| | | | | | | | | | | | | | | | | | | | | < kslgetl+0x00000000012f returns: 0x1
| | | | | | | | | | | | | | | | | | | | | > _intel_fast_memcpy(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | <> _intel_fast_memcpy.P(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | <> __intel_ssse3_rep_memcpy(0x7af99780, 0x7fffbea0dba8, ...)
| | | | | | | | | | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x000000002030 returns: 0x7af99780
| | | | | | | | | | | | | | | | | | | | | > kslfre(0x6003ef40, 0x60009578, ...)
| | | | | | | | | | | | | | | | | | | | | < kslfre+0x0000000001e2 returns: 0
| | | | | | | | | | | | | | | | | | | | < ksaamb+0x0000000001ed returns: 0x1
| | | | | | | | | | | | | | | | | | | < ksasndint+0x000000000120 returns: 0x1
| | | | | | | | | | | | | | | | | | | > ksl_post_proc(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | <> kskpthr(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | <> kslpsprns(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | | > ksl_update_post_stats(0x7a6032e8, 0, ...)
| | | | | | | | | | | | | | | | | | | | | > dbgtTrcData_int(0x7f5b0ca886c0, 0x2050031, ...)
| | | | | | | | | | | | | | | | | | | | | | > dbgtBucketRedirect(0x7f5b0ca886c0, 0x7fffbea0d618, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > dbgtIncInMemTrcRedirect(0x7f5b0ca886c0, 0x6fa, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > skgstmGetEpochTs(0x7f5b0ca886c0, 0x6fa, ...)
| | | | | | | | | | | | | | | | | | | | | | | > gettimeofday@plt(0x7fffbea0d120, 0, ...)
| | | | | | | | | | | | | | | | | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
| | | | | | | | | | | | | | | | | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x203895d2b55e6
| | | | | | | | | | | | | | | | | | | | | | > dbgtrRecAllocate(0x7f5b0ca886c0, 0x7fffbea0d5c0, ...)
| | | | | | | | | | | | | | | | | | | | | | | > dbgtrPrepareWrite(0x7f5b0ca886c0, 0x711f5048, ...)
| | | | | | | | | | | | | | | | | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x1c
| | | | | | | | | | | | | | | | | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
| | | | | | | | | | | | | | | | | | | | | | > _intel_fast_memcpy(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | <> _intel_fast_memcpy.P(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | <> __intel_ssse3_rep_memcpy(0x711f6ac0, 0x7fffbea0d6b8, ...)
| | | | | | | | | | | | | | | | | | | | | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x711f6ac0
| | | | | | | | | | | | | | | | | | | | | | > dbgtrRecEndSegment(0x7f5b0ca886c0, 0x7fffbea0d5c0, ...)
| | | | | | | | | | | | | | | | | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
| | | | | | | | | | | | | | | | | | | | > skgpwpost(0x7fffbea0d760, 0x7f5b0cacdca0, ...)
| | | | | | | | | | | | | | | | | | | | <> sskgpwpost(0x7fffbea0d760, 0x7f5b0cacdca0, ...)
| | | | | | | | | | | | | | | | | | | | | > semop@plt(0x38000, 0x7fffbea0d6f0, ...)
| | | | | | | | | | | | | | | | | | | | | < semop+0x00000000000f returns: 0
| | | | | | | | | | | | | | | | | | | | < sskgpwpost+0x00000000009a returns: 0x1
| | | | | | | | | | | | | | | | | | | < kslpsprns+0x0000000001c3 returns: 0
| | | | | | | | | | | | | | | | | | < ksasnd+0x00000000006d returns: 0x1
| | | | | | | | | | | | | | | | | < ksbasend+0x0000000000f6 returns: 0x1
| | | | | | | | | | | | | | | | < kcrf_commit_force_int+0x000000000500 returns: 0

Basically, what happens is that kcscu8 is called to read 0x60016290 (kcrfsg_.on disk SCN), and kcscu8 is called again to read 0x60016260 (kcrfsg_.LWN SCN). Then ksbasend (kernel service background processes asynchronous send message) is called, which essentially performs two functions in ksasnd: write a message in the messages struct (protected by the messages latch) in ksaamb (kernel service asynchronous messages allocate message buffer), and then post the logwriter using post/wait (ksl_post_proc; the actual posting using semop is done in skgpwpost, system kernel generic post wait post).
Returning from kcrf_commit_force_int also directly ends the kcrfw_redo_gen_ext function that called it.

At this point the commit was executed, and the logwriter is notified that it needs to write up to the commit record from the public redo strand.

The next thing the process does is release things like the enqueue(s) it's holding. After that, the process will wait for the logwriter to execute the write, and to get posted via post/wait. This is also done in the kcrf_commit_force_int function, however now with the second argument set to ‘1’:

 | | | <> kcrf_commit_force_int(0x7f3011d75e10, 0x1, ...)
 | | | | > skgstmGetEpochTs(0x7f3011d75e10, 0x7, ...)
 | | | | | > gettimeofday@plt(0x7ffda5959810, 0, ...)
 | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x206daa374e6a1
 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kcscu8(0x60016260, 0x7ffda5959ad0, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > ksbasend(0x600161a0, 0x60016908, ...)
 | | | | | > ksbpro(0x600161a0, 0x60016908, ...)
 | | | | | < ksbpro+0x000000000063 returns: 0x7a6032e8
 | | | | | > ksbpser(0x600161a0, 0x60016908, ...)
 | | | | | < ksbpser+0x000000000046 returns: 0x1
 | | | | | > ksberr(0x600161a0, 0, ...)
 | | | | | < ksberr+0x000000000045 returns: 0
 | | | | | > ksasnd(0x7a6032e8, 0x1, ...)
 | | | | | | > ksasndint(0x7a6032e8, 0x1, ...)
 | | | | | | | > ksaamb(0x7a6085b8, 0x7a5d9fe0, ...)
 | | | | | | | < ksaamb+0x000000000504 returns: 0
 | | | | | | < ksasndint+0x000000000120 returns: 0x1
 | | | | | < ksasnd+0x00000000006d returns: 0x1

Essentially the same happens here as we saw previously. The way the kcrf_commit_force_int function works is:
– It first reads the on disk SCN (0x60016290) via kcscu8. If the on disk SCN is equal to or past the process' commit SCN, the process' redo is written to disk, and the kcrf_commit_force_int function is exited.
– If the on disk SCN has not progressed far enough, the LWN SCN (0x60016260) is read via kcscu8. If the LWN SCN has progressed beyond or is equal to the commit SCN, the logwriter is in the process of writing the process' redo, so it does not need posting via ksbasend. If the logwriter has not progressed the LWN SCN that far, the process tries posting it via ksbasend again.
– In the above function call trace we see ksaamb returning 0, which means it did try to post the logwriter again, but found the logwriter already posted (by looking at kcrfsg_+516). As a result, it returned directly from the function without creating a message (and thus without needing to take the messages latch and post the logwriter via semop).

At this point in kcrf_commit_force_int everything is done to make the logwriter write the process' redo in the public redo strand, so the process needs to wait for the logwriter to finish; the instance is using post/wait as the notification mechanism for log file sync.

The next thing that happens is that a post-wait entry in the post-wait queue is set up, and the on disk SCN is checked once more:

 | | | | > kslawe(0x10380, 0x7a5d9fe0, ...)
 | | | | | > kslgetl(0x7a5d72c8, 0, ...)
 | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | > kslfre(0x7a5d72c8, 0x7a5d7368, ...)
 | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | < kslawe+0x00000000027d returns: 0
 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1

Now that the process has registered itself in the post-wait queue, a wait is registered via kslwtbctx:

 | | | | > kslwtbctx(0x7ffda5959950, 0x7ffda5959b88, ...)
 | | | | | > sltrgftime64(0x7a6d9580, 0x7ae18b68, ...)
 | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959700, ...)
 | | | | | | | > clock_gettime(0x1, 0x7ffda5959700, ...)
 | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | < sltrgftime64+0x00000000004c returns: 0x428fc8dac9
 | | | | | > kslwait_timeout_centi_to_micro(0x7fffffff, 0x443dfe7, ...)
 | | | | | < kslwait_timeout_centi_to_micro+0x000000000021 returns: 0x7fffffffffffffff
 | | | | | > kskthbwt(0, 0x428fc8dac9, ...)
 | | | | | < kskthbwt+0x0000000000a1 returns: 0x10
 | | | | | > kslwt_start_snapshot(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | < kslwt_start_snapshot+0x0000000000dc returns: 0x7a6d9db8
 | | | | < kslwtbctx+0x00000000050b returns: 0x7a6d9db8

Now that the process is in the wait event, the process needs to truly ‘wait’. In post-wait mode this is done on a semaphore, just like the logwriter does when sleeping in the wait event ‘rdbms ipc message’. This is what that looks like:

 | | | | > kslwaitctx(0x7ffda5959950, 0xa, ...)
 | | | | | > ksliwat(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > sltrgftime64(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959280, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffda5959280, ...)
 | | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | < sltrgftime64+0x00000000004c returns: 0x428fc8e664
 | | | | | | > kslawe(0x10380, 0x459ffef, ...)
 | | | | | | < kslawe+0x000000000076 returns: 0x10380
 | | | | | | > kskiorm(0x7ae0ee70, 0, ...)
 | | | | | | < kskiorm+0x00000000001e returns: 0
 | | | | | | > kfias_iswtgon_ksfd(0x7ae0ee70, 0, ...)
 | | | | | | < kfias_iswtgon_ksfd+0x00000000002b returns: 0
 | | | | | | > kxdbio_has_work(0x7ffda59592a4, 0x6003d010, ...)
 | | | | | | < kxdbio_has_work+0x000000000027 returns: 0
 | | | | | | > skgpwwait(0x7ffda5959510, 0x7f3011d45ca0, ...)
 | | | | | | | > kslwait_conv_wait_time(0x186a0, 0x7f3011d45ca0, ...)
 | | | | | | | < kslwait_conv_wait_time+0x000000000027 returns: 0x186a0
 | | | | | | | > sskgpwwait(0x7ffda5959510, 0x7f3011d45ca0, ...)
 | | | | | | | | > semtimedop(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | < semtimedop+0x000000000023 returns: 0xffffffffffffffff
 | | | | | | | | > __errno_location(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | | > fthread_self(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | | < fthread_self+0x000000000024 returns: 0
 | | | | | | | | < __errno_location+0x000000000010 returns: 0x7f3011d406a0
 | | | | | | | < sskgpwwait+0x00000000014e returns: 0
 | | | | | | < skgpwwait+0x0000000000e0 returns: 0
 | | | | | | > ksuSdiInProgress(0x10003, 0x10003, ...)
 | | | | | | < ksuSdiInProgress+0x000000000035 returns: 0
 | | | | | | > sltrgftime64(0x10003, 0x10003, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffda5959280, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffda5959280, ...)
 | | | | | | | | < clock_gettime+0x000000000069 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | < sltrgftime64+0x00000000004c returns: 0x428fcad86e
 | | | | | | > kslwo_getcbk(0x9d, 0x80f6298, ...)
 | | | | | | < kslwo_getcbk+0x000000000017 returns: 0
 | | | | | | > kgslwait_last_waitctx_time_waited_usecs(0x7f3011d409a0, 0x7a6da7f0, ...)
 | | | | | | < kgslwait_last_waitctx_time_waited_usecs+0x000000000045 returns: 0x1f20a
 | | | | | | > kslwt_update_stats_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | | > kews_update_wait_time(0x5, 0x1fda5, ...)
 | | | | | | | < kews_update_wait_time+0x00000000002e returns: 0x6
 | | | | | | < kslwt_update_stats_int+0x00000000033e returns: 0x428fcad86e
 | | | | | | > ksliwat_cleanup(0x7ffda5959650, 0x7ffda5959628, ...)
 | | | | | | | > kslgetl(0x7a5d72c8, 0x1, ...)
 | | | | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | | | | > kslfre(0x7a5d72c8, 0x7a5d7378, ...)
 | | | | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | | | < ksliwat_cleanup+0x0000000000e5 returns: 0
 | | | | | < ksliwat+0x000000003acf returns: 0
 | | | | < kslwaitctx+0x00000000012b returns: 0

The most prominent thing is skgpwwait (system kernel generic post-wait wait), which calls sskgpwwait (skgpwwait with an additional ‘s’ in front; functions with this prefix usually are OS dependent implementations, in this case calling semtimedop). We see semtimedop returning 0xffffffffffffffff (-1), which means the semaphore wait timed out: the semaphore was not posted using semop before the timeout set for the semaphore wait (100ms for log file sync) expired.

Interestingly, a little further down we see ksliwat_cleanup. This function removes the process from the post-wait queue, indicated not only by the ‘cleanup’ part of the name ksliwat_cleanup, but also by the get of latch 0x7a5d72c8, which is the ‘post/wait queue’ latch. This means that when the kslwaitctx function is left, this process is no longer in the post/wait queue. However, the process is still waiting in the wait event ‘log file sync’.

The next thing that happens is a call to ‘ksu_dispatch_tac’ (kernel service user dispatch timed-out action):

 | | | | > ksu_dispatch_tac(0x7a5d72c8, 0, ...)
 ...
 | | | | < ksu_dispatch_tac+0x0000000003b4 returns: 0

Essentially, this routine validates certain conditions to trigger actions because the wait in kslwaitctx has timed out.

At this point the process repeats the steps of:
– kslawe: registering itself in the post-wait queue
– kcscu8: check on disk SCN
– kslwaitctx, semtimedop: wait to be posted
– (if not posted) ksliwat_cleanup: remove post-wait entry
– (if not posted) kcscu8: check on disk SCN
(the double check of on disk SCN via kcscu8 is not a typo!)
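The steps above can be summarised in pseudocode (reconstructed from the traces in this post; this is not actual Oracle source, and function arguments are simplified):

```
kcrf_commit_force_int(second argument = 1):
  loop:
    kslawe()                     -- register in the post/wait queue
    kcscu8(on disk SCN)          -- done if it reached the commit SCN
    posted = kslwaitctx()        -- semtimedop() with 100ms timeout
    if posted:
        ksliwat_cleanup()        -- no latch needed after a post
        kcscu8(on disk SCN)      -- final check
        kslwtectx()              -- end the 'log file sync' wait
        return
    else:
        ksliwat_cleanup()        -- remove entry under post/wait queue latch
        kcscu8(on disk SCN)      -- done if it reached the commit SCN
        ksu_dispatch_tac()       -- timed-out actions
        goto loop
```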

If the semaphore gets posted (visible by return code 0):

 | | | | | | | | > semtimedop(0x58000, 0x7ffda5959208, ...)
 | | | | | | | | < semtimedop+0x000000000012 returns: 0

The post-wait cleanup function (ksliwat_cleanup) does not require a latch when the process has been posted, probably because now that the entry has been used, there is no chance it will be used again, so it's safe to clean up without serialising access via a latch. After ksliwat_cleanup returns, the process also returns from ksliwat and kslwaitctx.

Next, kcscu8 is called to read the on disk SCN again, the wait event is ended with kslwtectx, some statistics about redo writing are recorded (kcrfw_stats_syncoverhead_record), and the kcrf_commit_force_int function ends: once the semaphore is posted, the wait for the logwriter to finish is over, meaning the commit is entirely finished:

 | | | | > kcscu8(0x60016290, 0x7ffda5959b88, ...)
 | | | | < kcscu8+0x000000000047 returns: 0x1
 | | | | > kslwtectx(0x7ffda5959950, 0x7ffda5959b88, ...)
 | | | | | > kslwt_end_snapshot(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > kslwh_enter_waithist_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | < kslwh_enter_waithist_int+0x00000000013a returns: 0x7a8378a8
 | | | | | | > kslwtrk_enter_wait_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x7a8378a8
 | | | | | < kslwt_end_snapshot+0x0000000000f2 returns: 0x7a6d9580
 | | | | | > kslwt_update_stats_int(0x7a6da7f0, 0x7a6da7f0, ...)
 | | | | | | > kews_update_wait_time(0x5, 0, ...)
 | | | | | | < kews_update_wait_time+0x00000000002e returns: 0x6
 | | | | | < kslwt_update_stats_int+0x00000000033e returns: 0
 | | | | | > kskthewt(0x4290257c02, 0x9d, ...)
 | | | | | < kskthewt+0x0000000000c3 returns: 0xa030
 | | | | < kslwtectx+0x0000000003f9 returns: 0xe923
 | | | | > skgstmGetEpochTs(0x4290257c02, 0x9d, ...)
 | | | | | > gettimeofday@plt(0x7ffda5959810, 0, ...)
 | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x206daa3d1ca1b
 | | | | > kcrfw_stats_syncoverhead_record(0x7f3011d75e10, 0x206daa3d1ca1b, ...)
 | | | | | > kcrfw_stats_writeinfo_find(0x7f3011d75e10, 0x206daa3d1ca1b, ...)
 | | | | | < kcrfw_stats_writeinfo_find+0x0000000000fb returns: 0x206daa3d1a057
 | | | | < kcrfw_stats_syncoverhead_record+0x0000000000a4 returns: 0x29c4
 | | | < kcrf_commit_force_int+0x000000000b9c returns: 0x1

What is clear is that the process checks the on disk SCN a lot of times. The obvious reason for that would be to stop waiting whenever the on disk SCN indicates the process' redo information is written. However, in the past I've heard about issues with sessions waiting in post-wait mode that were ‘forgotten’ when the database switched to polling (this was Oracle 11.2.0.3; this issue is solved!). That however leads to two interesting questions:

1. What happens when a process is posted by the logwriter, but the on disk SCN is NOT progressed beyond the process’ commit SCN?
2. What happens when a process is NOT posted by the logwriter, but the on disk SCN has progressed beyond the process’ commit SCN?

Let’s test that, starting with the first situation: the foreground is posted, but the on disk SCN did not progress beyond the commit SCN of that process.

First we need to know the commit SCN for the inserting session. We know that the SCN is obtained from kcsgscn_, let’s get the address of it:

SQL> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv where ksmfsnam = 'kcsgscn_';
KSMFSNAM                       KSMFSADR           KSMFSSIZ
------------------------------ ---------------- ----------
kcsgscn_                       00000000600113B8         48

Now attach to a foreground session with gdb, put a read/write (access) watchpoint on 0x600113b8, set pagination off and let gdb continue:

(gdb) awatch *0x600113b8
Hardware access (read/write) watchpoint 1: *0x600113b8
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end
(gdb) set pagi off
(gdb) c
Continuing.

Now insert something in the foreground session (anything is okay, as long as it’s a regular insert):

Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374652
New value = 374701
0x000000000320e5a8 in kcsAdjustUsingForeignSCN ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3ee68 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f03b in kcsgrsn8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f03b in kcsgrsn8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f2d5 in kcsgssn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3e93b in kcscu8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3e93b in kcscu8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374701
0x0000000010e3ee68 in kcsgcsn ()

Pay attention to the output sequence of a watchpoint:
– ‘Hardware access …’ and watchpoint number and address.
– old and new value or read value.
– instruction pointer address and function.
Also look at the different functions with which an SCN is queried.

Now execute the commit:

Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374701
New value = 374767
0x000000000320e5a8 in kcsAdjustUsingForeignSCN ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374767
0x0000000010e3ecc1 in kcsnew8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374767
New value = 374768
0x0000000010e3ecdb in kcsnew8 ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Old value = 374768
New value = 374769
0x0000000010e3f12a in kcsgrsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374769
0x0000000010e3ee68 in kcsgcsn ()
Hardware access (read/write) watchpoint 1: *0x600113b8

Value = 374769
0x0000000010e3ee68 in kcsgcsn ()

If you take a close look, you see that a new SCN for the commit is obtained in the aptly named function kcsnew8, at instruction pointer address 0x10e3ecdb. Now that we know the instruction pointer address, we can filter the watchpoint: disable the previous watchpoint, and create a second access watchpoint:

(gdb) dis 1
(gdb) awatch *0x600113b8 if ( $rip == 0x10e3ecdb )
Hardware access (read/write) watchpoint 2: *0x600113b8
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

Now issue another insert into and a commit:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 374908
New value = 374909
0x0000000010e3ecdb in kcsnew8 ()

Voilà! The commit SCN is 374909 in this case!

Because we know kcscu8 is called to read the on disk SCN from kcrfsg_, we can cheat and modify the on disk SCN in kcrfsg_ ourselves to make it lower than the commit SCN.

=> WARNING! Never EVER do this, or use any of these techniques, in a database that you care about! This can severely corrupt the database, leading to an irreparable situation! WARNING <=

First modify the watchpoint (2 in the above example) to stop when it is fired:

(gdb) commands 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>end
(gdb) c
Continuing.

Now perform the insert and commit again. This will hang because the watchpoint will break execution:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 620035
New value = 620036
0x0000000010e3ecdb in kcsnew8 ()

Okay, so the commit SCN is 620036. Now break on kcscu8 and simply set the on disk SCN to 620000, which undoubtedly is lower than the commit SCN of 620036:

(gdb) break kcscu8
Breakpoint 3 at 0x10e3e910
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>set {int}0x60016290=620000
>c
>end

Also set an access watchpoint on 0x60016290, i.e. the on disk SCN, to validate what SCN the process reads:

(gdb) awatch *0x60016290
Hardware watchpoint 4: *0x60016290
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb)

Now attach with another gdb session to the logwriter, and keep the logwriter stopped after attaching.
After the logwriter is stopped, continue in the foreground session:

(gdb) c
Continuing.

What I regularly encounter, especially with new(er) versions of the Oracle database, are the following messages:

Program received signal SIGUSR2, User defined signal 2.
0x0000000010e3ecdb in kcsnew8 ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000000010f973a5 in slaac_int ()
(gdb) c
Continuing.

It seems that a signal handler is triggered. In most cases just continuing will make the process run again.

After the signal handler, the following is shown in gdb:

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
---Type <return> to continue, or q <return> to quit---

Quit using q and enter, and set pagination off: set pagi off, then continue.

Now let the logwriter continue in its gdb session (continue).

In the foreground debugger session, you'll see that the foreground does nothing different from what it already did:

Hardware access (read/write) watchpoint 4: *0x60016290

Value = 620000
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()

Stop execution of the committing process, and disable breakpoint 3 (dis 3). Breakpoint 3 is what keeps setting the on disk SCN to a value lower than the process' commit SCN; with it disabled, the process will pick up the correct current on disk SCN, which is beyond its commit SCN:

(gdb) c
Continuing.
Hardware access (read/write) watchpoint 4: *0x60016290

Old value = 620000
New value = 620865
0x0000000010e3e93b in kcscu8 ()

And the commit finishes! This proves that despite the post/wait posting, the session actually checks the on disk SCN using kcscu8 in addition to the semaphore being posted, and only continues if the on disk SCN has progressed to the commit SCN or beyond!

We can use these breakpoints to see if we can do the exact opposite: make sure the foreground is not posted, and then progress the on disk SCN beyond the commit SCN. At this time you should have two gdb sessions, one on a foreground process, and one on the logwriter. Both gdb sessions have the process they are attached to running.

Now insert and commit again. Because of watchpoint 2 on kcsgscn_, which is conditional on the instruction pointer being in kcsnew8, this will break/stop execution:

Hardware access (read/write) watchpoint 2: *0x600113b8

Old value = 621375
New value = 621376
0x0000000010e3ecdb in kcsnew8 ()

Break execution of the logwriter, so the foreground cannot be posted anymore (simply ctrl-C the gdb session attached to the log writer process).
Now change breakpoint 3 in the gdb session attached to the foreground to set the on disk SCN to 621400, which is higher than the commit SCN of 621376:

(gdb) commands 3
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>set {int}0x60016290=621400
>c
>end

And enable breakpoint 3 again:

(gdb) enable 3

And then continue the foreground in gdb.

...
Hardware access (read/write) watchpoint 4: *0x60016290

Value = 621400
0x0000000010e3e93b in kcscu8 ()

Breakpoint 3, 0x0000000010e3e910 in kcscu8 ()
...

Repeatedly.

This shows watchpoint 4 being triggered because of access to 0x60016290 (the on disk SCN), and breakpoint 3 setting the on disk SCN to 621400.
Now continue the logwriter. This will finish the commit of the foreground session.

So the conclusion here is that even though the on disk SCN has progressed beyond the foreground session's commit SCN (commit SCN 621376, on disk SCN set to 621400), a commit will wait until it is explicitly posted via post/wait.
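Combining this with the earlier experiment, the commit wait behaves like a loop that needs both conditions: the session must be posted, and the on disk SCN must have reached its commit SCN. A toy model of that protocol in Python (the class and method names are mine, not Oracle's; Oracle uses semaphores rather than a condition variable):

```python
import threading
import time

class ToyCommitWait:
    """Toy model of the observed behaviour: a committing session only
    returns from its commit wait when it has been posted AND the on
    disk SCN has reached its commit SCN; neither alone is enough."""

    def __init__(self):
        self.on_disk_scn = 0
        self.posted = False
        self.cond = threading.Condition()

    def advance_on_disk_scn(self, scn):
        # on disk SCN progresses without a post: the waiter must not finish
        with self.cond:
            self.on_disk_scn = scn

    def post(self):
        # the semaphore post from the logwriter
        with self.cond:
            self.posted = True
            self.cond.notify_all()

    def commit_wait(self, commit_scn):
        with self.cond:
            while True:
                while not self.posted:
                    self.cond.wait()        # blocked, like semop()
                self.posted = False
                if self.on_disk_scn >= commit_scn:
                    return                  # the kcscu8-style re-check passed

lgwr = ToyCommitWait()
done = threading.Event()
fg = threading.Thread(target=lambda: (lgwr.commit_wait(621376), done.set()))
fg.start()

lgwr.advance_on_disk_scn(621400)   # SCN beyond the commit SCN, but no post
time.sleep(0.2)
assert not done.is_set()           # still waiting: the post is required

lgwr.post()                        # now post; the SCN already suffices
fg.join(timeout=2)
assert done.is_set()
```

This also reproduces the first experiment: a post arriving while the on disk SCN is still below the commit SCN just loops the waiter back into the wait.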

This is the fifth blog in a series of blogposts about Oracle database redo. The previous blog looked into the ‘null write’ (kcrfw_do_null_write actually) function inside kcrfw_redo_write_driver, which does housekeeping like updating SCNs and posting processes if needed. This blog looks into what happens when the log writer is actually posted by a process, or when public redo strand buffers have been written into. In part 3 of this blog series (the log writer working cycle) it can be seen that when a session posts the log writer, it returns from the semaphore related functions, and calls ‘kcrfw_redo_write_driver’ directly, which otherwise is called inside ksbcti.

Inside the kcrfw_redo_write_driver function, the first thing of interest is executed only when the logwriter is posted, which means kcrfw_redo_write_driver was called directly after returning from ksarcv and ksl_exit_main_loop_wait:

kcrfw_redo_write_driver+145:0x000000006001642c(fixed sga|var:kcrfsg_+516 shared pool ):W:4:0/0()

At offset 516 in kcrfsg_, a 4 byte field is used to indicate that the logwriter has been posted. This field is filled out by a process that wants the logwriter to write. The way it works is that a process reads this field, changes it to 1 if the field contains 0, and then performs a semctl or semop to post the logwriter and get it working. If the field already contains 1, no further action is taken by the committing process. The logwriter sets this value back to 0 as soon as it has been posted and executes the kcrfw_redo_write_driver function. If the kcrfw_redo_write_driver function is called as part of the normal logwriter cycle (which means inside the ksbcti function), it will not read or write this field.
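The read-then-set protocol on this field means a burst of committing processes results in a single semaphore post. A toy model in Python (all names are mine; the lock stands in for whatever atomicity the real implementation relies on):

```python
import threading

class ToyPostFlag:
    """Toy model of the kcrfsg_+516 post flag as described above: many
    committing processes may want the logwriter to write, but only the
    one that flips the flag from 0 to 1 issues the semaphore post."""

    def __init__(self):
        self.flag = 0
        self.lock = threading.Lock()   # stands in for atomic access
        self.posts_sent = 0

    def request_write(self):
        # committing process: only post if the flag was still 0
        with self.lock:
            if self.flag == 0:
                self.flag = 1
                self.posts_sent += 1   # the semctl()/semop() would go here

    def logwriter_posted(self):
        # logwriter: clear the flag, then run kcrfw_redo_write_driver
        with self.lock:
            self.flag = 0

flag = ToyPostFlag()
for _ in range(10):
    flag.request_write()           # ten committers in a burst
assert flag.posts_sent == 1        # only the first one posted

flag.logwriter_posted()            # logwriter handled the post
flag.request_write()
assert flag.posts_sent == 2        # posting works again
```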

Then a lot of things are executed that have been shown in the previous post. The next thing kcrfw_redo_write_driver executes is kcsnew3, to obtain a LWN SCN and set the LWN number, protected by the ‘lgwr LWN SCN’ latch:

kcsnew8+107:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0x1/1()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):R:4:0xc6/198()
kcsnew8+118:0x0000000060016270(fixed sga|var:kcrfsg_+72 ):W:4:0xc7/199()
kcsnew8+125:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b80/7744384()
kcsnew8+131:0x00007fffa10728f0():W:4:0/0()
kcsnew8+140:0x00007fffa1072922():W:1:0/0()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):R:8:0x762b80/7744384()
kcsnew8+149:0x00000000600113b8(fixed sga|var:kcsgscn_+0 ):W:8:0x762b81/7744385()
kcsnew8+165:0x00007fffa10728f0():R:4:0/0()
kcsnew8+184:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsnew8+193:0x00007f520ba8ae90():R:8:0x7a823620/2055353888(shared pool|permanent memor,duration 1,cls perm+7693856 )
kcsnew8+212:0x000000007a823e40(shared pool|permanent memor,duration 1,cls perm+7695936 ):R:8:0x719beed8/1906044632(shared pool|ksu:stats_freel,duration 1,cls freeabl+24 )
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):R:8:0xc7/199()
kcsnew8+219:0x00000000719bf790(shared pool|ksu:stats_freel,duration 1,cls freeabl+2256 ):W:8:0xc8/200()
kcsnew8+223:0x00007fffa1072a00():W:8:0x762b81/7744385()
kcsnew8+232:0x0000000060016260(fixed sga|var:kcrfsg_+56 shared pool|x_kcrfws.lwn_scn+0 ):W:8:0x762b81/7744385()
kcsnew8+238:0x0000000060016274(fixed sga|var:kcrfsg_+76 ):W:4:0/0()
kcsnew8+246:0x0000000060016288(fixed sga|var:kcrfsg_+96 shared pool|pointer:lgwr lwn scn latch+0 ):R:8:0x60049800/1610913792(fixed sga|(parent)latch:lgwr LWN SCN+0 fixed sga|var:lwn_scn_lat_+0 )
kcsnew8+250:0x00007fffa10728d8():W:8:0x10e3ed3f/283372863()

The LWN SCN is set to 0x762b81/7744385, and the LWN number is 199.

The next thing to look at is the kcrfw_gather_lwn function flow:

 | > kcrfw_gather_lwn(0x7ffc3dd11b78, 0x77eced90, ...)
 | | > kslgetl(0x76fe0c10, 0x1, ...)
 | | < kslgetl+0x00000000012f returns: 0x1
 | | > kcrfw_gather_strand(0x7ffc3dd11b78, 0, ...)
 | | < kcrfw_gather_strand+0x0000000000c2 returns: 0x2
 | | > kslfre(0x76fe0c10, 0x20002, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > kslgetl(0x76fe0cb0, 0x1, ...)
 | | < kslgetl+0x00000000012f returns: 0x1
 | | > kcrfw_gather_strand(0x7ffc3dd11b78, 0x1, ...)
 | | < kcrfw_gather_strand+0x0000000000c2 returns: 0
 | | > kslfre(0x76fe0cb0, 0x19, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | < kcrfw_gather_lwn+0x00000000065c returns: 0

Here we see exactly the same function sequence in kcrfw_gather_lwn as we saw in the ‘null write’ blogpost; however, pay close attention to the return codes of the kcrfw_gather_strand calls. The first call returns 2 instead of 0, indicating two (strand) buffers need writing.

If we zoom in to the first kcrfw_gather_strand call, we see:

kcrfw_gather_strand+2:0x00007fffa1072910():W:8:0x7fffa1072a10/140735894989328()
kcrfw_gather_strand+15:0x0000000060016228(fixed sga|var:kcrfsg_+0 shared pool|pointer:shared pool redo struct+0 ):R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+22:0x00007fffa1072908():W:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+29:0x00007fffa1072900():W:8:0x76fdf130/1996353840(shared pool|permanent memor,duration 1,cls perm+16113968 )
kcrfw_gather_strand+36:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_strand+53:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_strand+70:0x00007fffa1072bf8():R:8:0x762b81/7744385()
kcrfw_gather_strand+79:0x0000000076fdf390(shared pool|KCRFA+8 shared pool|permanent memor,duration 1,cls perm+16114576 ):R:8:0x762b7f/7744383()
kcrfw_gather_strand+86:0x0000000076fdf3a8(shared pool|KCRFA+32 shared pool|NEXT_BUF_NUM_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114600 ):R:4:0x660/1632()
kcrfw_gather_strand+121:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):R:4:0x65e/1630()
kcrfw_gather_strand+169:0x00007fffa1072a08():W:4:0x660/1632()
kcrfw_gather_strand+176:0x00007fffa1072a0c():W:4:0/0()
kcrfw_gather_strand+182:0x00007fffa1072900():R:8:0x76fdf130/1996353840(shared pool|permanent memor,duration 1,cls perm+16113968 )
kcrfw_gather_strand+186:0x00007fffa1072908():R:8:0x76fdf388/1996354440(shared pool|KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114568 )
kcrfw_gather_strand+193:0x00007fffa1072910():R:8:0x7fffa1072a10/140735894989328()
kcrfw_gather_strand+194:0x00007fffa1072918():R:8:0x31f0f00/52367104()

What is visible here is roughly identical to the memory access of the kcrfw_gather_strand function we saw in the ‘null write’ blogpost, however there is one striking difference: at offset 86 in the kcrfw_gather_strand function, KCRFA at offset 32 is read, which shows a value of 1632. The other locations show a value of 1630, which is an indication that some public redo strand buffers have been written into. This means there is potentially something to write! Please remember that the redo allocation latch was taken before calling kcrfw_gather_strand.

Once the kcrfw_gather_strand function has returned to kcrfw_gather_lwn, the actual write is prepared by further manipulating the KCRFA struct for the first strand for the write:

kcrfw_gather_lwn+436:0x0000000076fdf45c(shared pool|KCRFA+212 shared pool|LAST_BUF_GATHERED_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114780 ):W:4:0x660/1632()
kcrfw_gather_lwn+449:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0x1b8/440()
kcrfw_gather_lwn+453:0x0000000060016a10(fixed sga|var:kcrf_kcrrh_sz_+0 ):R:8:0x18/24()
kcrfw_gather_lwn+462:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd80cc048/3624714312(Redo Buffers(pgsz:2048k)|+835656 redo|PUB_REDO_0+835656 )
kcrfw_gather_lwn+466:0x00000000d80cc048(redo|PUB_REDO_0+835656 ):W:4:0/0()
kcrfw_gather_lwn+472:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):R:8:0x1b8/440()
kcrfw_gather_lwn+476:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):R:8:0xd80cc048/3624714312(Redo Buffers(pgsz:2048k)|+835656 redo|PUB_REDO_0+835656 )
kcrfw_gather_lwn+476:0x0000000076fdf3a0(shared pool|KCRFA+24 shared pool|PNEXT_BUF_KCRFA_CLN+0 shared pool|permanent memor,duration 1,cls perm+16114592 ):W:8:0xd80cc200/3624714752(Redo Buffers(pgsz:2048k)|+836096 redo|PUB_REDO_0+836096 )
kcrfw_gather_lwn+480:0x0000000076fdf3cc(shared pool|KCRFA+68 shared pool|permanent memor,duration 1,cls perm+16114636 ):R:1:0/0()
kcrfw_gather_lwn+480:0x0000000076fdf3cc(shared pool|KCRFA+68 shared pool|permanent memor,duration 1,cls perm+16114636 ):W:1:0x1/1()
kcrfw_gather_lwn+485:0x0000000076fdf3b0(shared pool|BYTES_IN_BUF_KCRFA_CLN+0 shared pool|KCRFA+40 shared pool|permanent memor,duration 1,cls perm+16114608 ):W:8:0/0()

The LAST_BUF_GATHERED_KCRFA field at KCRFA offset 212 is set to the new buffer number 1632. Then the actual amount of data inside the buffer is queried from KCRFA at offset 40 (BYTES_IN_BUF_KCRFA_CLN), and the pointer to the actual location in the first public redo strand is obtained from KCRFA offset 24 (PNEXT_BUF_KCRFA_CLN). The pointer into the first public redo strand is then increased by the amount of redo in the buffer, 440 bytes (0x1b8) in this case, and BYTES_IN_BUF_KCRFA_CLN is reset to 0.
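The arithmetic is visible in the addresses themselves: 0xd80cc048 plus 0x1b8 (440) is 0xd80cc200, exactly the new value of PNEXT_BUF_KCRFA_CLN. A small sketch of this bookkeeping (the struct layout is my reconstruction; the field names follow the trace):

```python
class ToyKCRFA:
    """Sketch of the per-strand bookkeeping shown in the trace above;
    the struct layout is my reconstruction from the trace."""

    def __init__(self, pnext_buf, bytes_in_buf, last_gathered):
        self.pnext_buf_kcrfa_cln = pnext_buf          # KCRFA+24
        self.bytes_in_buf_kcrfa_cln = bytes_in_buf    # KCRFA+40
        self.last_buf_gathered_kcrfa = last_gathered  # KCRFA+212

    def gather(self, new_buf_num):
        # mark the buffers as gathered, advance the strand pointer over
        # the bytes about to be written, and reset the byte counter
        self.last_buf_gathered_kcrfa = new_buf_num
        self.pnext_buf_kcrfa_cln += self.bytes_in_buf_kcrfa_cln
        self.bytes_in_buf_kcrfa_cln = 0

# values from the trace: pointer 0xd80cc048, 0x1b8 bytes, buffer 1630 -> 1632
strand = ToyKCRFA(0xd80cc048, 0x1b8, 1630)
strand.gather(1632)
assert strand.pnext_buf_kcrfa_cln == 0xd80cc200
assert strand.bytes_in_buf_kcrfa_cln == 0
```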

A little further in kcrfw_gather_lwn, LAST_BUF_WRITTEN in KCRFA at offset 208 is changed to public strand buffer 1632:

kcrfw_gather_lwn+977:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):R:4:0x65e/1630()
kcrfw_gather_lwn+987:0x0000000076fdf480(shared pool|KCRFA+248 shared pool|TOTAL_BUFS_KCRFA+0 shared pool|permanent memor,duration 1,cls perm+16114816 ):R:4:0x20000/131072()
kcrfw_gather_lwn+996:0x0000000076fdf3c0(shared pool|KCRFA+56 shared pool|permanent memor,duration 1,cls perm+16114624 ):W:4:0x1000/4096()
kcrfw_gather_lwn+1006:0x0000000076fdf458(shared pool|KCRFA+208 shared pool|LAST_BUF_WRITTEN+0 shared pool|permanent memor,duration 1,cls perm+16114776 ):W:4:0x660/1632()

At this point the redo allocation latch for this strand is released.
After the release of the latch, the kcrfw_gather_lwn function changes a value in the struct kcrfsg_ at offset 8, which is X$KCRFWS.NEXT_BLK, alias the administration of where to write in the online redolog file(s):

kcrfw_gather_lwn+1038:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):R:4:0x16a16/92694()
kcrfw_gather_lwn+1044:0x0000000077eceea8(shared pool|permanent memor,duration 1,cls perm+2451776 ):W:4:0x16a16/92694()
kcrfw_gather_lwn+1048:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):R:4:0x16a16/92694()
kcrfw_gather_lwn+1048:0x0000000060016230(fixed sga|var:kcrfsg_+8 shared pool|x_kcrfws.next_blk+0 ):W:4:0x16a18/92696()

The second public strand has no buffers written, which we know because the kcrfw_gather_strand function returned 0, so that one is skipped.

At this point, the memory areas that take part in the writing have been set up.

In order to write, the function kcrfw_redo_write is called to start the actual redo writing. First, the logwriter needs to make sure that the buffers it has marked for writing, which is the difference between the original LAST_BUF_GATHERED and LAST_BUF_WRITTEN, are not still being written into. This is done in the function kcrfw_wait_for_fg_copy, which is called from kcrfw_redo_write. The way this works is as follows. It looks up the address of the first redo copy latch child, stored in kcrfsg_ at offset 472; this is a latch that a process copying change vectors into a public strand must take. It also looks up the address of the redo copy latch support memory area, stored in kcrfsg_ at offset 488. Then it looks up the slot assigned to the first latch in the support memory area and reads 8 bytes to see if they contain the value 0xffffffffffffffff, which is what a writer into a redo strand must write in the support memory area when it is done writing. This means that as long as a process is writing into a redo strand, it will have set this memory area to another value, indicating the strand is being written into. Once the value becomes 0xffffffffffffffff, kcrfw_wait_for_fg_copy continues to see whether any other redo copy latches need checking. It does this by reading the total number of redo copy latches in kcrfsg_ at offset 464 and comparing it with the number of the redo copy latch it has just checked. If there are more, it moves to the next redo copy latch and its support memory area, checking for 0xffffffffffffffff again, and so on. Please mind it does not check or obtain any of the redo copy latches itself, nor does it prevent any write to the redo strands that happens after the check, which is probably by design:

kcrfw_wait_for_fg_copy+132:0x0000000060016400(fixed sga|var:kcrfsg_+472 shared pool|pointer:redo copy latch+0 ):R:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+142:0x0000000060016410(fixed sga|var:kcrfsg_+488 ):R:8:0x76fdf0f0/1996353776(shared pool|permanent memor,duration 1,cls perm+16113904 )
kcrfw_wait_for_fg_copy+149:0x00007fffa1072968():W:1:0/0()
kcrfw_wait_for_fg_copy+153:0x00007fffa1072960():W:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+157:0x00007fffa1072958():W:4:0/0()
kcrfw_wait_for_fg_copy+161:0x0000000076fdf0f0(shared pool|permanent memor,duration 1,cls perm+16113904 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+169:0x0000000076fdf0f4(shared pool|permanent memor,duration 1,cls perm+16113908 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+176:0x00007fffa1072968():R:1:0/0()
kcrfw_wait_for_fg_copy+180:0x00007fffa1072958():R:4:0/0()
kcrfw_wait_for_fg_copy+195:0x00000000600163f8(fixed sga|var:kcrfsg_+464 ):R:4:0x8/8()
kcrfw_wait_for_fg_copy+132:0x0000000060016400(fixed sga|var:kcrfsg_+472 shared pool|pointer:redo copy latch+0 ):R:8:0x76fdea30/1996352048(shared pool|(child)latch#1:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112176 )
kcrfw_wait_for_fg_copy+142:0x0000000060016410(fixed sga|var:kcrfsg_+488 ):R:8:0x76fdf0f0/1996353776(shared pool|permanent memor,duration 1,cls perm+16113904 )
kcrfw_wait_for_fg_copy+149:0x00007fffa1072968():W:1:0/0()
kcrfw_wait_for_fg_copy+153:0x00007fffa1072960():W:8:0x76fdeb08/1996352264(shared pool|(child)latch#2:redo copy+0 shared pool|permanent memor,duration 1,cls perm+16112392 )
kcrfw_wait_for_fg_copy+157:0x00007fffa1072958():W:4:0x1/1()
kcrfw_wait_for_fg_copy+161:0x0000000076fdf0f8(shared pool|permanent memor,duration 1,cls perm+16113912 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+169:0x0000000076fdf0fc(shared pool|permanent memor,duration 1,cls perm+16113916 ):R:4:0xffffffff/4294967295(??)
kcrfw_wait_for_fg_copy+176:0x00007fffa1072968():R:1:0/0()
kcrfw_wait_for_fg_copy+180:0x00007fffa1072958():R:4:0x1/1()
kcrfw_wait_for_fg_copy+195:0x00000000600163f8(fixed sga|var:kcrfsg_+464 ):R:4:0x8/8()

The above memory trace output shows the kcrfw_wait_for_fg_copy function looking up the redo copy latch memory address at function offset 132, looking up the redo copy support memory area at function offset 142, and then peeking into the redo copy helper memory structure at function offsets 161 and 169. This way it does not need to obtain the redo copy latch. At offset 195 of the function it looks up the number of redo copy latches, and then executes the same procedure again, looking up the same pointers in kcrfsg_, but adding an offset for the next latch.

Please mind that there is no need for the log writer to take the redo copy latch; in fact, it doesn’t seem to care whether a redo copy latch is taken or not. It waits for each redo strand writer to indicate it is done writing, by waiting for the redo copy latch helper memory structure for each strand to become 0xffffffffffffffff. That automatically means the change vectors in the public strand buffers it intends to write (those with a SCN lower than the LWN SCN) are not being written to anymore, because a process that wants to write into a redo strand must change the redo copy latch helper memory structure to another value than 0xffffffffffffffff, and revert it to 0xffffffffffffffff once finished.
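The scan described above can be modelled as a walk over the helper slots, one per redo copy latch, waiting for each slot to show the ‘done’ sentinel. A toy sketch in Python (the slot layout and sentinel follow the description above; everything else is mine):

```python
DONE = 0xffffffffffffffff   # sentinel a strand writer restores when finished

def toy_wait_for_fg_copy(helper_slots, spin=lambda: None):
    """Walk the redo copy latch helper slots in order and, for each one,
    wait until the writer has restored the 'done' sentinel. No latch is
    taken: this is purely reading the helper memory area."""
    for i in range(len(helper_slots)):
        while helper_slots[i] != DONE:
            spin()   # in reality: re-read until the copy has finished

# eight redo copy latches (kcrfsg_+464 showed 8), all writers done:
slots = [DONE] * 8
toy_wait_for_fg_copy(slots)          # returns immediately

# one writer still copying; simulate it finishing while we spin
slots[2] = 0x12345
def writer_finishes(): slots[2] = DONE
toy_wait_for_fg_copy(slots, spin=writer_finishes)
assert all(s == DONE for s in slots)
```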

The next thing the logwriter does when it truly needs to write, is update the real redo SCN. The real redo SCN is the SCN value that contains the highest SCN of the change vectors that are truly written to the online redolog files. Remember, the ‘on disk SCN’ is not the SCN of a change that is truly written. The real redo SCN is another SCN value that is stored in the kcrfsg_ struct, in my database at offset 288. This is done in the function kcrfw_update_and_bcast_real_redo_scn, which uses the kcsaj8 function to execute the change of the SCN. This does not require a latch get:

kcsaj8+2:0x00007fffa1072950():W:8:0x7fffa1072970/140735894989168()
kcsaj8+13:0x00007fffa1072960():R:8:0x762b80/7744384()
kcsaj8+16:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):R:8:0x762b7d/7744381()
kcsaj8+22:0x00007f520ba98f40():R:8:0x7f520ba98f40/139990359707456()
kcsaj8+40:0x00007fffa1072900():W:8:0x7fffa1072960/140735894989152()
kcsaj8+44:0x00007fffa1072908():W:8:0/0()
kcsaj8+48:0x00007fffa1072910():W:8:0/0()
kcsaj8+52:0x00007fffa1072918():W:4:0/0()
kcsaj8+56:0x00007fffa1072938():W:8:0x7f520ba98f40/139990359707456()
kcsaj8+63:0x00007fffa1072930():W:8:0/0()
kcsaj8+70:0x00007fffa1072928():W:8:0/0()
kcsaj8+74:0x00007fffa1072920():W:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+81:0x00007fffa1072940():W:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+88:0x00007fffa1072828():W:4:0/0()
kcsaj8+98:0x00007fffa107285a():W:1:0/0()
kcsaj8+105:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):R:8:0x762b7d/7744381()
kcsaj8+105:0x0000000060016348(fixed sga|var:kcrfsg_+288 shared pool|x_kcrfws.real_redo_scn+0 ):W:8:0x762b80/7744384()
kcsaj8+120:0x00007fffa1072828():R:4:0/0()
kcsaj8+138:0x00007fffa1072908():R:8:0/0()
kcsaj8+145:0x00007fffa1072900():R:8:0x7fffa1072960/140735894989152()
kcsaj8+152:0x00007fffa1072910():R:8:0/0()
kcsaj8+159:0x00007fffa1072918():R:4:0/0()
kcsaj8+166:0x00007fffa1072938():R:8:0x7f520ba98f40/139990359707456()
kcsaj8+170:0x00007fffa1072930():R:8:0/0()
kcsaj8+174:0x00007fffa1072928():R:8:0/0()
kcsaj8+178:0x00007fffa1072920():R:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+182:0x00007fffa1072940():R:8:0x77eced90/2012016016(shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsaj8+202:0x00007f520ba8be08():R:8:0x600113b8/1610683320(fixed sga|var:kcsgscn_+0 )
kcsaj8+219:0x00007fffa1072950():R:8:0x7fffa1072970/140735894989168()
kcsaj8+220:0x00007fffa1072958():R:8:0x3208da6/52465062()

The real redo SCN value seems to be the value that was gathered in the kcsnew3 function at the beginning of the function, however I am not sure here. It might be the highest SCN in the public strands, although it would make sense to limit the current write to the LWN SCN that was obtained at the start, which means it’s the highest SCN in the strand buffers that were picked for writing, while additional buffers might have been written into with newer, higher SCNs.

What happens next, is the logwriter needs to prepare the public redo strand buffers to be written to disk:

 | | > kcrfw_update_redo_blk_hdrs(0x77eced90, 0, ...)
 | | | > _intel_fast_memset(0xda31669c, 0, ...)
 | | | <> _intel_fast_memset.J(0xda31669c, 0, ...)
 | | | <> __intel_new_memset(0xda31669c, 0, ...)
 | | | < __intel_new_memset+0x0000000005c0 returns: 0xda31669c
 | | < kcrfw_update_redo_blk_hdrs+0x000000000363 returns: 0x77eced90
 | | > kcrfw_update_strand_hdrs(0x7ffc3dd11b38, 0x77eced90, ...)
 | | < kcrfw_update_strand_hdrs+0x000000000222 returns: 0x77eced90
 | | > kcrfw_cal_target_rba(0x5, 0xda316650, ...)
 | | | > kslgetl(0x6004b780, 0x1, ...)
 | | | < kslgetl+0x00000000012f returns: 0x1
 | | <> kslfre(0x6004b780, 0x1, ...)
 | | < kslfre+0x0000000001e2 returns: 0
 | | > skgstmGetEpochTs(0x6004b780, 0x1, ...)
 | | | > gettimeofday@plt(0x7ffc3dd118e0, 0, ...)
 | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | < skgstmGetEpochTs+0x000000000049 returns: 0x2050c6a4e2e57
 | | > kcrfw_do_write(0x77eced90, 0x1, ...)

The last line, which is a call to the kcrfw_do_write function, is the logwriter function inside kcrfw_redo_write that actually calls the disk IO functions in ksfd. I’ll walk quickly through these because executing IO is not the focus of this blogpost.

At this point the memory areas are set up and the IO is prepared.

This is probably calculating the request size:

 | | | > ksfdgrsz(0x1, 0, ...)
 | | | | > skgfrgsz(0x3, 0, ...)
 | | | | < skgfrgsz+0x00000000002d returns: 0xa0
 | | | < ksfdgrsz+0x00000000001a returns: 0x288

This is probably setting up the asynchronous IO information and requests:

 | | | > ksfd_aio_setobjinfo(0x71717da0, 0x76b423e8, ...)
 | | | < ksfd_aio_setobjinfo+0x00000000003c returns: 0
 | | | > ksfdrqfill(0x71717da0, 0x76b423e8, ...)
 | | | | > kskiorm(0x71717da0, 0, ...)
 | | | | < kskiorm+0x00000000001e returns: 0
 | | | < ksfdrqfill+0x00000000021b returns: 0x102

The main function for performing IO is ksfdbio (kernel system file disk block IO):

 | | | > ksfdbio(0x7f2c1e502000, 0x1, ...)
 ...
 | | | | > ksfd_update_iostatsbytes(0, 0x76b76930, ...)
 ...
 | | | | > kslwtbctx(0x7ffc3dd11700, 0x42, ...)
 ...
 | | | | < kslwtbctx+0x00000000050b returns: 0x7a823e58
 | | | | > ksfd_sbio(0x7f2c1e502028, 0x1, ...)
 ...
 | | | | < ksfd_sbio+0x00000000035b returns: 0x1
 | | | | > kslwtectx(0x7ffc3dd11700, 0x200, ...)
 ...
 | | | | < kslwtectx+0x0000000003f9 returns: 0xe923
 | | | | > ksfdbio_updatestats(0xb6b1, 0, ...)
 ...
 | | | < ksfdbio+0x000000000507 returns: 0x7f2c23b04040

This is ksfd_update_iostatsbytes for statistics, kslwtbctx for registering a wait, kslwtectx for ending the wait, ksfdbio_updatestats for statistics about the IO just performed, and in between kslwtbctx and kslwtectx ksfd_sbio performs the IO.
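The bracketing of the IO call by kslwtbctx and kslwtectx is what turns the elapsed time into a wait event. Schematically (a toy sketch; the function and event names are mine, not Oracle's):

```python
import time

def timed_wait(event_name, blocking_call, wait_stats):
    """Sketch of the kslwtbctx/kslwtectx bracketing shown above: register
    the wait, run the blocking IO call, then account the elapsed time to
    the wait event."""
    begin = time.monotonic()                   # kslwtbctx: wait begins
    result = blocking_call()                   # ksfd_sbio: the actual IO
    elapsed = time.monotonic() - begin         # kslwtectx: wait ends
    wait_stats[event_name] = wait_stats.get(event_name, 0.0) + elapsed
    return result

wait_stats = {}
timed_wait("log file parallel write", lambda: time.sleep(0.01), wait_stats)
assert wait_stats["log file parallel write"] > 0
```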

If you look in ksfd_sbio, you’ll see the true IO steps:

 | | | | > ksfd_sbio(0x7f2c1e502028, 0x1, ...)
 | | | | | > ksfd_osdrqfil(0x71717da0, 0x76b423e8, ...)
 | | | | | < ksfd_osdrqfil+0x00000000014c returns: 0x1
 | | | | | > ksfd_addtoissueq(0x71717da0, 0x200, ...)
 | | | | | < ksfd_addtoissueq+0x0000000001ac returns: 0x7a6032e8
 | | | | | > ksfdgo(0x500, 0x81e, ...)
 ...
 | | | | | < ksfdgo+0x000000000135 returns: 0
 | | | | | > ksfdwtio(0x1, 0x500, ...)
 ...
 | | | | | < ksfdwtio+0x000000000176 returns: 0x1
 | | | | | > ksfd_chkio(0x71717da0, 0x9, ...)
 | | | | | | > ksfd_update_iostatstime(0x76b76930, 0x200, ...)
 | | | | | | < ksfd_update_iostatstime+0x000000000103 returns: 0x76c70388
 | | | | | < ksfd_chkio+0x000000000126 returns: 0x1
 | | | | < ksfd_sbio+0x00000000035b returns: 0x1

The ksfd_osdrqfil function is probably a function that creates IO requests, and ksfd_addtoissueq looks like a function that puts these requests in a queue. ksfdgo executes the IO requests, essentially calling io_submit to submit the write requests. ksfdwtio essentially calls io_getevents to reap the IO results.
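The submit-then-reap shape of ksfdgo and ksfdwtio (io_submit followed by io_getevents) can be sketched like this (a toy model using a thread pool instead of kernel AIO; all names are mine):

```python
from concurrent.futures import ThreadPoolExecutor, wait

def submit_and_reap(write_fn, requests):
    """Sketch of the asynchronous submit/reap pattern described above:
    issue all write requests at once (io_submit), then block until every
    completion has been collected (io_getevents). A thread pool stands
    in for kernel AIO here."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(write_fn, r) for r in requests]  # io_submit
        finished, _ = wait(futures)                             # io_getevents
        return [f.result() for f in finished]

# pretend each request writes n blocks of 512 bytes and returns the byte count
written = submit_and_reap(lambda n: n * 512, [1, 2, 3])
assert sorted(written) == [512, 1024, 1536]
```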

After kcrfw_do_write is done, kcrfw_post is executed, just like we saw with the kcrfw_do_null_write function.

Also identical to the null write, the current on disk SCN is read from kcrfsg_ using kcscu8.

Then a series of latch gets is executed to protect changes in the SGA to reflect the write that was executed.
First the redo writing latch is taken. After the redo writing latch is obtained, kcrfsg_+12, alias X$KCRFWS.LAST_BLK, is read and updated with the write:

kcrfw_post+715:0x0000000060016234(fixed sga|var:kcrfsg_+12 shared pool|X$KCRFWS.LAST_BLK+0 shared pool|x_kcrfws.last_blk+0 ):R:4:0x16a15/92693()
kcrfw_post+725:0x00000000600163c0(fixed sga|var:kcrfsg_+408 ):R:4:0xf5/245()
kcrfw_post+736:0x00007fffa1072958():R:8:0/0()
kcrfw_post+743:0x00007fffa1072880():W:4:0xf5/245()
kcrfw_post+753:0x00007fffa1072940():R:4:0xdf7/3575()
kcrfw_post+767:0x0000000060016234(fixed sga|var:kcrfsg_+12 shared pool|X$KCRFWS.LAST_BLK+0 shared pool|x_kcrfws.last_blk+0 ):W:4:0x16a17/92695()

While still holding the redo writing latch, the redo allocation latch for the correct public redo strand is taken, and MEM_BUFS_AVAILABLE_KCRFA_CLN and AVAILABLE_BUFS_KCRFA_CLN are updated:

kcrfw_post+803:0x0000000076fdf3bc(shared pool|KCRFA+52 shared pool|permanent memor,duration 1,cls perm+16114620 ):R:4:0xde2f/56879()
kcrfw_post+803:0x0000000076fdf3bc(shared pool|KCRFA+52 shared pool|permanent memor,duration 1,cls perm+16114620 ):W:4:0xde31/56881()
MEM_BUFS_AVAILABLE_KCRFA_CLN
kcrfw_post+3839:0x0000000076fdf3b8(shared pool|KCRFA+48 shared pool|permanent memor,duration 1,cls perm+16114616 ):R:4:0xd7d0/55248()
kcrfw_post+3839:0x0000000076fdf3b8(shared pool|KCRFA+48 shared pool|permanent memor,duration 1,cls perm+16114616 ):W:4:0xd7d0/55248()
AVAILABLE_BUFS_KCRFA_CLN

Then the redo allocation latch is freed, and the redo writing latch is freed.

Then the redo writing latch is taken again, in order to update the last write scn in kcrfsg_:

ub8_to_kscn_impl+21:0x00000000600162c0(fixed sga|var:kcrfsg_+152 shared pool|X$KCRFWS.LAST_WRITE_SCN+0 shared pool|x_kcrfws.last_write_scn+0 ):W:4:0x762b81/7744385()
ub8_to_kscn_impl+37:0x00000000600162c6(fixed sga|var:kcrfsg_+158 shared pool|X$KCRFWS.LAST_WRITE_SCN+6 ):W:2:0/0()
ub8_to_kscn_impl+41:0x00000000600162c4(fixed sga|var:kcrfsg_+156 shared pool|X$KCRFWS.LAST_WRITE_SCN+4 ):W:2:0/0()

And while holding the redo writing latch, the kcsnew_rba function is called to set the redo byte address data in kcrfsg_. In order to do this, the ‘consistent RBA’ latch is obtained first:

kslgetl+138:0x0000000060049c80(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 ):R:8:0/0()
kslgetl+138:0x0000000060049c80(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 ):W:8:0x13/19()

And then fields in the kcrfsg_ struct are read and updated:

kcsnew_rba+47:0x00000000600162d8(fixed sga|var:kcrfsg_+176 ):W:4:0x1/1()  // rba details being modified flag
kcsnew_rba+58:0x00000000600162d4(fixed sga|var:kcrfsg_+172 ):R:4:0x7e/126()
kcsnew_rba+58:0x00000000600162d4(fixed sga|var:kcrfsg_+172 ):W:4:0x7f/127()
kcsnew_rba+65:0x00007fffa1072880():R:8:0x16a18000000f5/398126288470261()
kcsnew_rba+69:0x00000000600162c8(fixed sga|var:kcrfsg_+160 ):W:8:0x16a18000000f5/398126288470261() // 164: rba current log block
kcsnew_rba+72:0x00007fffa1072888():R:4:0x10010/65552()
kcsnew_rba+76:0x00000000600162d0(fixed sga|var:kcrfsg_+168 ):W:4:0x10010/65552()
kcsnew_rba+83:0x00000000600162d8(fixed sga|var:kcrfsg_+176 ):W:4:0/0()  // rba details being modified flag
kcsnew_rba+91:0x00000000600162e0(fixed sga|var:kcrfsg_+184 shared pool|pointer:consistent rba latch+0 ):R:8:0x60049c80/1610914944(fixed sga|(parent)latch:Consistent RBA+0 fixed sga|var:consist_rba_lat_+0 )
kcsnew_rba+95:0x00007fffa1072658():R:8:0x77eced90/2012016016(Variable Size(pgsz:2048k)|+384626064 shared pool|permanent memor,duration 1,cls perm+2451496 )
kcsnew_rba+99:0x00007fffa1072650():R:8:0x7f520ba98f40/139990359707456()
kcsnew_rba+106:0x00007fffa1072660():R:8:0x7fffa1072970/140735894989168()

After the kcrfsg_ RBA update, the consistent RBA latch is freed. While still holding the redo writing latch, the last write scn time is updated in kcrfsg_:

kcrfw_post+1355:0x0000000077eceda0(Variable Size(pgsz:2048k)|+384626080 shared pool|permanent memor,duration 1,cls perm+2451512 ):R:4:0x397df7b0/964556720()
kcrfw_post+1361:0x000000006001623c(Fixed Size(pgsz:2048k)|+90684 fixed sga|var:kcrfsg_+20 shared pool|X$KCRFWS.LAST_WRITE_SCN_TIME+0 ):W:4:0x397df7b0/964556720()

And the redo writing latch is released.

A little further on in the kcrfw_post function, the kcsaj8 function is called to change the on disk SCN to the LWN SCN, exactly like in the null write function:

 | | | > kcsaj8(0x60016290, 0x7ffc3dd118b8, ...)
 | | | < kcsaj8+0x0000000000dc returns: 0x1

After the adjustment of the on disk SCN, the foregrounds that have registered a wait via the post/wait interface need to be posted. This is done in the kslpslf (kernel service lock management post list of foregrounds) function, called from kcrfw_post:

 | | | > kslpslf(0x97, 0x769a5730, ...)
 | | | | > ksl_postm_init(0x7ffc3dd094f0, 0x7ffc3dd11510, ...)
 | | | | < ksl_postm_init+0x00000000002b returns: 0
 | | | | > kslgetl(0x7a5d7388, 0x1, ...)
 | | | | < kslgetl+0x00000000012f returns: 0x1
 | | | | > ksl_update_post_stats(0x7a619f48, 0x7ffc3dd11510, ...)
 | | | | | > dbgtTrcData_int(0x7f2c23b5e6c0, 0x2050031, ...)
 | | | | | | > dbgtBucketRedirect(0x7f2c23b5e6c0, 0x7ffc3dd093c8, ...)
 | | | | | | < dbgtBucketRedirect+0x000000000050 returns: 0x1
 | | | | | | > dbgtIncInMemTrcRedirect(0x7f2c23b5e6c0, 0x6fa, ...)
 | | | | | | < dbgtIncInMemTrcRedirect+0x000000000035 returns: 0x1
 | | | | | | > skgstmGetEpochTs(0x7f2c23b5e6c0, 0x6fa, ...)
 | | | | | | | > gettimeofday@plt(0x7ffc3dd08ed0, 0, ...)
 | | | | | | | < __vdso_gettimeofday+0x0000000000fe returns: 0
 | | | | | | < skgstmGetEpochTs+0x000000000049 returns: 0x2050c7178c0c1
 | | | | | | > dbgtrRecAllocate(0x7f2c23b5e6c0, 0x7ffc3dd09370, ...)
 | | | | | | | > dbgtrPrepareWrite(0x7f2c23b5e6c0, 0x71121048, ...)
 | | | | | | | < dbgtrPrepareWrite+0x00000000011c returns: 0x11
 | | | | | | < dbgtrRecAllocate+0x000000000144 returns: 0x1
 | | | | | | > _intel_fast_memcpy(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | <> _intel_fast_memcpy.P(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | <> __intel_ssse3_rep_memcpy(0x71124368, 0x7ffc3dd09468, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x000000002798 returns: 0x71124368
 | | | | | | > dbgtrRecEndSegment(0x7f2c23b5e6c0, 0x7ffc3dd09370, ...)
 | | | | | | < dbgtrRecEndSegment+0x00000000011c returns: 0x77c000a4
 | | | | | < dbgtTrcData_int+0x000000000323 returns: 0x77c000a4
 | | | | < ksl_update_post_stats+0x00000000024f returns: 0x77c000a4
 | | | | > kslfre(0x7a5d7388, 0, ...)
 | | | | < kslfre+0x0000000001e2 returns: 0
 | | | | > ksl_postm_do_posts(0x7ffc3dd094f0, 0, ...)
 | | | | | > kslpwrp_int(0x7ffc3dd09450, 0x7f2c23b9e9a0, ...)
 | | | | | | > skgpwpost(0x7ffc3dd09350, 0x7f2c23ba3ca0, ...)
 | | | | | | <> sskgpwpost(0x7ffc3dd09350, 0x7f2c23ba3ca0, ...)
 | | | | | | | > semop@plt(0x38000, 0x7ffc3dd07310, ...)
 | | | | | | | < semop+0x00000000000f returns: 0
 | | | | | | < sskgpwpost+0x00000000009a returns: 0x1
 | | | | | < kslpwrp_int+0x000000000069 returns: 0x1
 | | | | | > kgecss(0x7f2c23b9e9a0, 0x7f2c1e54b048, ...)
 | | | | | < kgecss+0x000000000035 returns: 0x7f2c23be0f40
 | | | | < ksl_postm_do_posts+0x000000000144 returns: 0x1
 | | | < kslpslf+0x000000000822 returns: 0x7a5e26c8

Essentially, what happens here is that the posting of a message is initialised (ksl_postm_init; kernel service locking post multi init), although most of the initialisation from a memory perspective seems to happen in the kslpslf function. Then the post/wait queue latch is obtained, and the kslpslf function manipulates the post/wait queue (which seems to be multiple memory areas linked together with pointers) to reflect the process it is going to post via semop.
Then in the function ksl_update_post_stats, some housekeeping is done:

ksl_update_post_stats+117:0x000000007a603758(shared pool|permanent memor,duration 1,cls perm+5465944 ):W:8:0x7a612b58/2053188440(Variable Size(pgsz:2048k)|+425798488 shared pool|(indx:34)X$KSUPR+0 shared pool|permanent memor,duration 1,cls perm+5528408 )
ksl_update_post_stats+124:0x000000007a603748(shared pool|(indx:19)X$KSUPR.KSLLAPSC+0 shared pool|permanent memor,duration 1,cls perm+5465928 ):R:4:0x1e/30()
ksl_update_post_stats+134:0x000000007a603748(shared pool|(indx:19)X$KSUPR.KSLLAPSC+0 shared pool|permanent memor,duration 1,cls perm+5465928 ):W:4:0x1f/31()
ksl_update_post_stats+141:0x000000007a603744(shared pool|(indx:19)X$KSUPR.KSLLAPSN+0 shared pool|permanent memor,duration 1,cls perm+5465924 ):W:4:0x3/3()
ksl_update_post_stats+2705:0x00007fffa1072590():R:8:0x97/151()
ksl_update_post_stats+2708:0x000000007a612f80(shared pool|(indx:34)X$KSUPR.KSLLID1R+0 shared pool|permanent memor,duration 1,cls perm+5529472 ):W:8:0x97/151()
ksl_update_post_stats+2716:0x00007fffa1072598():R:8:0/0()
ksl_update_post_stats+2720:0x000000007a612f88(shared pool|(indx:34)X$KSUPR.KSLLID2R+0 shared pool|permanent memor,duration 1,cls perm+5529480 ):W:8:0/0()
ksl_update_post_stats+2728:0x00007fffa10725a0():R:8:0x7fff00005645/140733193410117()
ksl_update_post_stats+2732:0x000000007a612f90(shared pool|(indx:34)X$KSUPR.KSLLRTYP+0 shared pool|permanent memor,duration 1,cls perm+5529488 ):W:8:0x7fff00005645/140733193410117()
ksl_update_post_stats+202:0x000000007a612fc0(shared pool|permanent memor,duration 1,cls perm+5529536 ):W:8:0x7a6032e8/2053124840(Variable Size(pgsz:2048k)|+425734888 shared pool|(indx:19)X$KSUPR+0 shared pool|permanent memor,duration 1,cls perm+5464808 )
ksl_update_post_stats+209:0x000000007a612fbc(shared pool|(indx:34)X$KSUPR.KSLLAPRC+0 shared pool|permanent memor,duration 1,cls perm+5529532 ):R:4:0x2/2()
ksl_update_post_stats+209:0x000000007a612fbc(shared pool|(indx:34)X$KSUPR.KSLLAPRC+0 shared pool|permanent memor,duration 1,cls perm+5529532 ):W:4:0x3/3()
ksl_update_post_stats+216:0x000000007a612fb0(shared pool|(indx:34)X$KSUPR.KSLLAPRV+0 shared pool|permanent memor,duration 1,cls perm+5529520 ):W:4:0x3/3()

Of course x$ksupr is the kernel service user process table, which is the base table for v$process. However, there are a lot of columns in x$ksupr that are not externalised in v$process. The indx number of the logwriter is 19, the indx number of the process that is going to be posted is 34. One of the fields that are not externalised is ksllapsc, the Posts Send Count. Because the logwriter is going to post the foreground session, its posts send count is increased by one. ksllapsn is the last Post SeNt location number, which is set to 3.

This number corresponds with x$kslpo:

SQL> select indx, ksllwnam, kslposts from x$kslpo where indx = 3;
      INDX KSLLWNAM               KSLPOSTS
---------- -------------------- ----------
         3 kslpsr                     4356

(kernel service locking post receive (?))

The foreground process statistics in x$ksupr are changed too: ksllaprc (Posts Received Count) is increased by one, because the logwriter is going to send a message to the foreground process, and ksllaprv (last Post ReceiVed location number) is set to 3 too.

After that, some tracing and other housekeeping takes place in dbgtTrcData_int, after which the post/wait queue latch is freed. The next function, ksl_postm_do_posts (kernel service locking, post multi do posts), does the actual semop call. In ksl_postm_do_posts, the function kslpwrp_int is called, which essentially calls skgpwpost (system kernel generic post wait post) for a single process; if there are multiple sessions to be posted, it calls the function skgpwvectorpost instead, which iterates through the list of processes to be posted.

At this point the kcrfw_redo_write_driver function is done.
